demoFL ====== This document outlines the step-by-step workflow for the analysis of follicular lymphoma using the `dds_analysis` package. Each step is described along with the corresponding bash code. Set variable paths: ------------------- The following code sets variable paths. User can adjust this code according to their files and folders. .. code-block:: bash #path of main input data folder IN_FOLDER='../../data/fl_12samples/in_data/' #a) Path of mutation blocks and block summary files exported by bpb3 package IN_BLOCKS=${IN_FOLDER}'/bpb3_demo3_fl_cohort_small/out/mussd_blocks/' IN_BLOCK_SUMMARY='blocks_summary.tsv' #Path of mutation blocks with patient ID will be exported later by dds_analysis find_block_patientID #convert bpb3 predicted mutation block summary file to bed format file with positions. This is done by bpb3summary_format dds_analysis bpb3summary2bed_format --in_block_summary_file ${IN_BLOCKS}/$IN_BLOCK_SUMMARY echo ${IN_BLOCKS}'/'${IN_BLOCK_SUMMARY} echo "preprocess_data - Done" #make file name of the bed format of block summary file that will be used in the rest of analysis replace='_block_position.bed' finds='.tsv' IN_BLOCK_FILE=${IN_BLOCK_SUMMARY//$finds/$replace} #b) Path of predicted DMRs/MRs from "dmr_analysis" IN_GENOME_REGIONS=${IN_FOLDER}'/final_demo_data/fl_12samples/out_data/DMR_CpG_context/' #a list of predefined genomic regions that will be used to map mutation blocks. #This file is generated by dmr_analysis dmr_gene_annotation but user has to double check the location of files in this list if you need to add enhancer or other genomic region files IN_LIST_REGIONS='list_region_files.txt' #c) Path to refFlat gene annotation file that generated by hmst-seq-analyzer IN_GENOME_refFlat=${IN_GENOME_REGIONS}'/data/hg19.refFlat_clean_sorted.bed' #d) Path of other genomic information files such as TAD, chromatin segment, and genome files et al IN_GENOME_LIB='../../data/' #path of chromatin segment files that will be used to intersect with mutation blocks IN_CHROMSEG=${IN_GENOME_LIB}'/chromSegment/hg19/' #e) Path of TAD information that will be used to map mutation blocks IN_TAD=${IN_GENOME_LIB}'/in_tad/' #f) Path to gene position file in sorted bed format which is generated by dmr_analysis dmr_gene_annotation IN_GENE_POSITION=${IN_TAD}'/out/data/gene_Up1_Down1removedShort.bed' #g) Path to common TAD positions across 5 human cell lines IN_TAD_INFO=${IN_TAD}'/Table4_TAD_annotations_sorted_chr.bed' #path to common TAD boundary position among 5 human cell lines IN_TAD_BOUNDARY=${IN_TAD}'/Table1_common_boundaries_merged_sorted_chr.bed' #Path of differentially methylated regions, differentially expressed genes, and output folder of mutation blocks mapped to predefined genomic regions. #h) Path to DMRs predicted and ranked by dmr_analysis IN_DMR=${IN_GENOME_REGIONS}'/2_chroms_all_mr_data_range_dmrRanking.bed' #i) path to DEG genes exported from bpb3 package IN_EXPRESS=${IN_FOLDER}'/bpb3_demo3_fl_cohort_small/out/differentially_expressed_genes.txt' #Try to add N bp flank region on the two sides of mutation blocks before intersecting them with DMRs FLANK_BP=0 ###---OUTPUT PATH # path of main output data folder OUT_FOLDER='../../data/fl_12samples/out_data/' #path for exporting mutation blocks mapped to predefined genomic regions OUT_GENOME=${OUT_FOLDER}'/out_genome/' #path for exporting mutation blocks mapped to chromatin segmentations OUT_CHROMSEG=${OUT_FOLDER}'/out_chromSegment/' #path for exported mutation blocks associated with DEGs OUT_EXPRESS=${OUT_FOLDER}'/out_expression/' #path for final results that mutation blocks associated with DMR, DEG and TAD OUT_FINAL=${OUT_FOLDER}/'out_DmrDeg2block' Here we will look at input files. bpb3 block summary looks like following: .. code-block:: bash block_id chrom start_pos end_pos number_of_mutations number_of_patients mutation_distribution region_names block_1_18_60987833_60988302 18 60987808 60988326 102 9 6,4,11,7,7,5,18,13,31 BCL2;-;up1000;down1000 block_0_18_60986372_60987028 18 60986347 60987052 59 8 4,3,10,9,6,9,5,13 BCL2;-;up1000;down1000 IN_LIST_REGIONS file looks like the following: .. code-block:: bash ../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/TSS_Up5000_Down1000_removedShort.bed ../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/gene_Up5000_Down1000removedShort.bed ../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/TES_Up5000_Down1000removedShort.bed ../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/5dist_Up1000000_Up5000removedShort.bed ../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/intergenic_uniqueSorted_betweenTSS_TES_genes_minLen10.bed ../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/5dist_Down1000000_Down5000removedShort.bed Head of the hg19.refFlat_clean_sorted.bed looks like the following .. code-block:: bash chr1 11868 14362 LOC102725121:NR_148357 . + chr1 11873 14409 DDX11L1:NR_046018 . + chr1 14361 29370 WASH7P:NR_024540 . - chr1 34610 36081 FAM138A:NR_026818 . - chr1 69090 70008 OR4F5:NM_001005484 . + chr1 134772 140566 LOC729737:NR_039983 . - chr1 323891 328581 LOC100132287:NR_028322 . + chr1 367658 368597 OR4F29:NM_001005221 . + Predicted DMRs by dmr_analysis file is also used as input to this demo which looks like the following: .. code-block:: bash mr_chrs mr_start_pos mr_end_pos mr_info mr_logReg_proba genome_info chromSegment_info chr18 57022124 57027729 chr18:mr5:hypo:D 0.999871 chr18:57025497:57031497:NM_005570||TSS:5000:1000||LMAN1:-:56995055:57026497~chr18:56996055:57025497:NM_005570||gene:5000:1000||LMAN1:-:56995055:5702649~E~R~TSS~PF~T chr18 57028952 57030829 chr18:mr6:hypo:D 0.999902 chr18:57025497:57031497:NM_005570||TSS:5000:1000||LMAN1:-:56995055:57026497~chr18:56343696:57338696:NM_006785||5distD:5000:1000000||MALT1:+:56338696:56221709~chr18:56486111:57481111:NR_146904||5distD:5000:1000000||LINC01926:+:56481111:56501596~chr18:56535155:57530155:NM_018181||5distD:5000:1000000||ZNF332:+:56530155:56653712~chr18:56535714:57530714:NM_001353526||5distD:5000:1000000||ZNF532:+:56530714:56653712~chr18:56536322:57531322:NM_001318728||5distD:5000:1000000||ZNF532:+:56531322:56653712~chr18:56536590:57531590:NM_001353531||5distD:5000:1000000||ZNF532:+:56531590:56653712~chr18:56537108:57532108:NM_001353527||5distD:5000:1000000||ZNF532:+:56532108:56653712~chr18:56707910:57702910:NR_024021||5distD:5000:1000000||OACYLP:+:56702910:56720446~chr18:56812115:57807115:NM_001307941||5distD:5000:1000000||SEC11C:+:56807115:56826063~chr18:56892389:57887389:NM_001012513||5distD:5000:1000000||GRP:+:56887389:56898002~chr18:56364655:57359655:NM_133459||5distD:5000:1000000||CCBE1:-:57098170:57364655~chr18:56301323:57296323:NM_052947||5dist:5000:1000000||ALPK2:-:56148481:56296323~chr18:56945686:57940686:NM_013435||5dist:5000:1000000||RAX:-:56934269:56940686~chr18:56990881:57985881:NM_181654||5dist:5000:1000000||CPLX4:-:56962633:56985881~chr18:56567227:57562227:NM_021127||5dist:5000:1000000||PMAIP1:+:57567227:57571537 R~T Head of differentially_expressed_genes.txt is following: .. code-block:: bash A1BG 0.006841601076234209 1.03562166970378 1.0044718601447777 0.7682775946401803 0.9032574892066986 0.9586228797372297 0.8600161280044952 1.1129622825040923 0.727123126962006 0.7943658809104911 0.9030097652361022 0.7743942653504368 0.6272536039977595 0.7647032561463141 0.7540873544744898 A2M 7.466186411474e-08 1.7068617121006318 1.5722363221730744 1.7241477215681602 2.512746238429057 1.192179193233052 1.9616053111713239 2.0831787155792267 1.3698921594123463 1.8944650163311454 2.049522066949797 0.026853332696881425 0.11955544285228693 0.1306156621631196 0.02935698777439702 A2ML1 0.013243211051909093 0.0038341538116092736 0.005223240799061258 0.0035200158291957538 0.006688701271386327 0.01231806318166027 0.006651830866222021 0.01240860696633966 0.006867506126330923 0.0028447826724574037 0.004404865042883453 0.09306972144679593 0.09312915276803117 0.07871043255363304 0.03729749320155993 A3GALT2 0.023045652491540343 0.017919176318647874 0.009595308902594371 0.019196980444105404 0.030547395118981288 0.012650172586582973 0.023783244992637877 0.01579925924075979 0.015948641690104712 0.005135873671568252 0.009253322651483684 0.060041819656531735 0.12801386599049638 0.06660916252501874 0.07245621656048234 Step 1: Map mutation blocks to genomic regions ---------------------------------------------- The `map_block2genome` module from the `dds_analysis` package is used to map mutation blocks to predefined genomic regions. .. code-block:: bash dds_analysis map_block2genome --in_sortedBlock_file $IN_BLOCK_FILE --in_genomicRegion_file $GENOMIC_REGION_FILE --in_reference_genome $REFERENCE_GENOME_FILE --out_folder $OUTPUT_FOLDER Step 2: Map mutation blocks to chromatin segments ------------------------------------------------- The `map_block2chromSegment` module from the `dds_analysis` package is used to map mutation blocks to chromatin segments. .. code-block:: bash dds_analysis map_block2chromSegment --in_sortedBlock_file $IN_BLOCK_FILE --in_chromatinSegment_folder $CHROMATIN_SEGMENT_FOLDER --out_folder $OUTPUT_FOLDER Step 3: Map mutation blocks to differentially methylated regions (DMRs) ----------------------------------------------------------------------- The `map_block2dmr` module from the `dds_analysis` package is used to map mutation blocks to DMRs after adding flank regions. .. code-block:: bash dds_analysis map_block2dmr --in_sortedBlock_file $IN_BLOCK_FILE --in_dmr_file $DMR_FILE --flank_region_size $FLANK_REGION_SIZE --out_folder $OUTPUT_FOLDER Step 4: Combine genomic regions with mutation block information and find differentially expressed genes ------------------------------------------------------------------------------------------------------- The `find_geneExp4block` module from the `dds_analysis` package is used to combine genomic regions with mutation block information and identify differentially expressed genes. .. code-block:: bash dds_analysis find_geneExp4block --in_blocks_genome_folder $BLOCKS_GENOME_FOLDER --in_sortedBlock_file $IN_BLOCK_FILE --in_de_genes_file $DE_GENES_FILE --in_feature_list $FEATURE_LIST --out_folder $OUTPUT_FOLDER Step 5: Find patient IDs for each mutation block ------------------------------------------------ The `find_block_patientID` module from the `dds_analysis` package is used to find patient IDs associated with each mutation block. .. code-block:: bash dds_analysis find_block_patientID --in_block_summary_file $BLOCK_SUMMARY_FILE --in_block_folder $BLOCK_FOLDER Step 6: Combine DMRs, differentially expressed genes (DEGs), and mutation block information ------------------------------------------------------------------------------------------- The `combine_dmr_deg2block` module from the `dds_analysis` package is used to combine DMRs, DEGs, and mutation block information. .. code-block:: bash dds_analysis combine_dmr_deg2block --in_sortedBlock_patient_file $SORTED_BLOCK_PATIENT_FILE --in_dmr_file $DMR_FILE --in_deg_folder $DEG_FOLDER --deg_file_suffix $DEG_FILE_SUFFIX --out_folder $OUTPUT_FOLDER Step 7: Filter blocks based on DMR or DEG information ----------------------------------------------------- The `filter_blocks` module from the `dds_analysis` package is used to filter blocks based on DMR or DEG information. .. code-block:: bash dds_analysis filter_blocks --in_combined_dmr_deg_block_file $COMBINED_DMR_DEG_BLOCK_FILE Step 8: Collect unique gene names from predicted blocks after filtering ----------------------------------------------------------------------- The `collect_gene_names4blocks` module from the `dds_analysis` package is used to collect unique gene names from predicted blocks. .. code-block:: bash dds_analysis collect_gene_names4blocks --in_filtered_block_file $FILTERED_BLOCK_FILE --out_gene_file $GENE_FILE Step 9: Filter Mutation block and genes: ---------------------------------------- The `check_block_gene_inTAD` module from the `dds_analysis` package is used to perform filter mutation block and genes that they are not in the same TAD or boundary .. code-block:: bash dds_analysis check_block_gene_inTAD --in_filtered_blockUqGene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_and_DEG_uniqGene.tsv \ --in_gene_position_file ${IN_GENE_POSITION} \ --in_TAD_position_file ${IN_TAD_INFO} \ --in_TAD_boundary_file ${IN_TAD_BOUNDARY} echo 9a. check_block_gene_inTAD - Done for selecting blocks with both DMR and DEG dds_analysis check_block_gene_inTAD --in_filtered_blockUqGene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_or_DEG_uniqGene.tsv \ --in_gene_position_file ${IN_GENE_POSITION} \ --in_TAD_position_file ${IN_TAD_INFO} \ --in_TAD_boundary_file ${IN_TAD_BOUNDARY} echo 9b. check_block_gene_inTAD - Done for selecting blocks with either DMR or DEG Here is list of genes that exist in TAD: .. code-block:: bash gene_name patients gene_type block_id new_mr_sites patient_id enhancers TAD2gene Boundary2gene TAD2block Boundary2block isTAD BCL2 9 TSS block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:60675000:61075000:Low-active na 18:60675000:61075000:Low-active na 1.0 BCL2 8 TSS block_0_18_60986372_60987028 chr18:mr621 patient_0,patient_1,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 enhancer 18:60675000:61075000:Low-active na 18:60675000:61075000:Low-active na 1.0 SERPINB8 9 5dist block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:61650000:66375000:Low 18:61575000:61650000:1 18:60675000:61075000:Low-active na 0.0 SERPINB2 9 5dist block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:61150000:61575000:Low na 18:60675000:61075000:Low-active na 0.0 SERPINB5 9 5dist block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:61150000:61575000:Low 18:61075000:61150000:1 18:60675000:61075000:Low-active na 0.0 SERPINB12 9 5dist block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:61150000:61575000:Low na 18:60675000:61075000:Low-active na 0.0 Step 10: Rank gene list: ------------------------ The `dds_geneRanking` module from the `dds_analysis` package is used to rank gene list based on predefined weights such as default setting is tss=4, enhancer=3, tes=gene=2, 5dist=1 .. code-block:: bash dds_analysis dds_geneRanking --in_unique_gene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_or_DEG_uniqGene_commonTAD_Boundary_list2UqGene.tsv \ --in_DEG_file ${IN_EXPRESS} \ --in_DMR_file ${IN_DMR} -inCutoff 0.5 echo 10. geneRanking - Done Here is the output of gene ranking of unique gene: .. code-block:: bash gene_name gene_type block_id new_mr_sites patients isTAD enhancers patient_id BCL2 TSS~TSS block_1_18_60987833_60988302~block_0_18_60986372_60987028 chr18:mr621~chr18:mr621 9~8 1.0~1.0 nan~enhancer patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10~patient_0,patient_1,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 Step 11: Find enhancer target genes ----------------------------------- The `find_enhancer_target_genes` module from the `dds_analysis` package is used to find enhancer target genes by overlapping predicted enhancers with selected mutation blocks and a predicted target gene. .. code-block:: bash search_dir='find ${OUT_FINAL} -name *_gt_* -type f ' echo "Find results file from below list: " for entry in ${search_dir} do echo "$entry" done SEARCH_GENE='BCL2' echo "and input one of file name and path to step 11" read -p "To select a file from the above list before running step 12. for finding enhancers that overlapping to selected mutation blocks with a predicted target gene $SEARCH_GENE : " IN_FINAL_RESULT echo "IN_FINAL_RESULT: $IN_FINAL_RESULT " dds_analysis find_enhancer_target_genes --in_enhancer_folder $ENHANCER_FOLDER --in_dds_file $DDS_FILE --in_gene_file $GENE_FILE --out_folder $OUTPUT_FOLDER Following is the output for gene "BCL2" with its enhancer region which is predicted for follicular lymphoma. .. code-block:: bash chr18 60986372 60987028 block_0_18_60986372_60987028 chr18:mr621 TSS 8 enhancer