demoFL

This document outlines the step-by-step workflow for the analysis of follicular lymphoma using the dds_analysis package. Each step is described along with the corresponding bash code.

Set variable paths:

The following code sets variable paths. User can adjust this code according to their files and folders.

#path of main input data folder
IN_FOLDER='../../data/fl_12samples/in_data/'

#a) Path of mutation blocks and block summary files exported by bpb3 package
IN_BLOCKS=${IN_FOLDER}'/bpb3_demo3_fl_cohort_small/out/mussd_blocks/'
IN_BLOCK_SUMMARY='blocks_summary.tsv'
#Path of mutation blocks with patient ID will be exported later by  dds_analysis find_block_patientID

#convert bpb3 predicted mutation block summary file to bed format file with positions. This is done by bpb3summary_format
dds_analysis bpb3summary2bed_format --in_block_summary_file ${IN_BLOCKS}/$IN_BLOCK_SUMMARY
echo  ${IN_BLOCKS}'/'${IN_BLOCK_SUMMARY}
echo "preprocess_data - Done"

#make file name of the bed format of block summary file that will be used in the rest of analysis
replace='_block_position.bed'
finds='.tsv'
IN_BLOCK_FILE=${IN_BLOCK_SUMMARY//$finds/$replace}

#b) Path of predicted DMRs/MRs from "dmr_analysis"
IN_GENOME_REGIONS=${IN_FOLDER}'/final_demo_data/fl_12samples/out_data/DMR_CpG_context/'
#a list of predefined genomic regions that will be used to map mutation blocks.
#This file is generated by dmr_analysis dmr_gene_annotation but user has to double check the location of files in this list if you need to add enhancer or other genomic region files
IN_LIST_REGIONS='list_region_files.txt'
#c) Path to refFlat gene annotation file that generated by hmst-seq-analyzer
IN_GENOME_refFlat=${IN_GENOME_REGIONS}'/data/hg19.refFlat_clean_sorted.bed'

#d) Path of other genomic information files such as TAD,  chromatin segment, and genome files et al
IN_GENOME_LIB='../../data/'
#path of chromatin segment files that will be used to intersect with mutation blocks
IN_CHROMSEG=${IN_GENOME_LIB}'/chromSegment/hg19/'

#e) Path of TAD information that will be used to map mutation blocks
IN_TAD=${IN_GENOME_LIB}'/in_tad/'

#f) Path to gene position file in sorted bed format which is generated by dmr_analysis dmr_gene_annotation
IN_GENE_POSITION=${IN_TAD}'/out/data/gene_Up1_Down1removedShort.bed'

#g) Path to common TAD positions across 5 human cell lines
IN_TAD_INFO=${IN_TAD}'/Table4_TAD_annotations_sorted_chr.bed'
#path to common TAD boundary position among 5 human cell lines
IN_TAD_BOUNDARY=${IN_TAD}'/Table1_common_boundaries_merged_sorted_chr.bed'

#Path of differentially methylated regions, differentially expressed genes, and output folder of mutation blocks mapped to predefined genomic regions.
#h) Path to DMRs predicted and ranked by dmr_analysis
IN_DMR=${IN_GENOME_REGIONS}'/2_chroms_all_mr_data_range_dmrRanking.bed'

#i) path to DEG genes exported from bpb3 package
IN_EXPRESS=${IN_FOLDER}'/bpb3_demo3_fl_cohort_small/out/differentially_expressed_genes.txt'

#Try to add N bp flank region on the two sides of mutation blocks before intersecting them with DMRs
FLANK_BP=0

###---OUTPUT PATH
# path of main output data folder
OUT_FOLDER='../../data/fl_12samples/out_data/'

#path for exporting mutation blocks mapped to predefined genomic regions
OUT_GENOME=${OUT_FOLDER}'/out_genome/'
#path for exporting mutation blocks mapped to chromatin segmentations
OUT_CHROMSEG=${OUT_FOLDER}'/out_chromSegment/'
#path for exported mutation blocks associated with DEGs
OUT_EXPRESS=${OUT_FOLDER}'/out_expression/'
#path for final results that mutation blocks associated with DMR, DEG and TAD
OUT_FINAL=${OUT_FOLDER}/'out_DmrDeg2block'

Here we will look at input files.

bpb3 block summary looks like following:

block_id    chrom   start_pos       end_pos number_of_mutations     number_of_patients      mutation_distribution   region_names
block_1_18_60987833_60988302        18      60987808        60988326        102     9       6,4,11,7,7,5,18,13,31   BCL2;-;up1000;down1000
block_0_18_60986372_60987028        18      60986347        60987052        59      8       4,3,10,9,6,9,5,13       BCL2;-;up1000;down1000

IN_LIST_REGIONS file looks like the following:

../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/TSS_Up5000_Down1000_removedShort.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/gene_Up5000_Down1000removedShort.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/TES_Up5000_Down1000removedShort.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/5dist_Up1000000_Up5000removedShort.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/intergenic_uniqueSorted_betweenTSS_TES_genes_minLen10.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/5dist_Down1000000_Down5000removedShort.bed

Head of the hg19.refFlat_clean_sorted.bed looks like the following

chr1        11868   14362   LOC102725121:NR_148357  .       +
chr1        11873   14409   DDX11L1:NR_046018       .       +
chr1        14361   29370   WASH7P:NR_024540        .       -
chr1        34610   36081   FAM138A:NR_026818       .       -
chr1        69090   70008   OR4F5:NM_001005484      .       +
chr1        134772  140566  LOC729737:NR_039983     .       -
chr1        323891  328581  LOC100132287:NR_028322  .       +
chr1        367658  368597  OR4F29:NM_001005221     .       +

Predicted DMRs by dmr_analysis file is also used as input to this demo which looks like the following:

mr_chrs mr_start_pos    mr_end_pos      mr_info mr_logReg_proba genome_info     chromSegment_info
chr18   57022124        57027729        chr18:mr5:hypo:D        0.999871        chr18:57025497:57031497:NM_005570||TSS:5000:1000||LMAN1:-:56995055:57026497~chr18:56996055:57025497:NM_005570||gene:5000:1000||LMAN1:-:56995055:5702649~E~R~TSS~PF~T
chr18   57028952        57030829        chr18:mr6:hypo:D        0.999902        chr18:57025497:57031497:NM_005570||TSS:5000:1000||LMAN1:-:56995055:57026497~chr18:56343696:57338696:NM_006785||5distD:5000:1000000||MALT1:+:56338696:56221709~chr18:56486111:57481111:NR_146904||5distD:5000:1000000||LINC01926:+:56481111:56501596~chr18:56535155:57530155:NM_018181||5distD:5000:1000000||ZNF332:+:56530155:56653712~chr18:56535714:57530714:NM_001353526||5distD:5000:1000000||ZNF532:+:56530714:56653712~chr18:56536322:57531322:NM_001318728||5distD:5000:1000000||ZNF532:+:56531322:56653712~chr18:56536590:57531590:NM_001353531||5distD:5000:1000000||ZNF532:+:56531590:56653712~chr18:56537108:57532108:NM_001353527||5distD:5000:1000000||ZNF532:+:56532108:56653712~chr18:56707910:57702910:NR_024021||5distD:5000:1000000||OACYLP:+:56702910:56720446~chr18:56812115:57807115:NM_001307941||5distD:5000:1000000||SEC11C:+:56807115:56826063~chr18:56892389:57887389:NM_001012513||5distD:5000:1000000||GRP:+:56887389:56898002~chr18:56364655:57359655:NM_133459||5distD:5000:1000000||CCBE1:-:57098170:57364655~chr18:56301323:57296323:NM_052947||5dist:5000:1000000||ALPK2:-:56148481:56296323~chr18:56945686:57940686:NM_013435||5dist:5000:1000000||RAX:-:56934269:56940686~chr18:56990881:57985881:NM_181654||5dist:5000:1000000||CPLX4:-:56962633:56985881~chr18:56567227:57562227:NM_021127||5dist:5000:1000000||PMAIP1:+:57567227:57571537     R~T

Head of differentially_expressed_genes.txt is following:

A1BG        0.006841601076234209    1.03562166970378        1.0044718601447777      0.7682775946401803      0.9032574892066986      0.9586228797372297      0.8600161280044952      1.1129622825040923      0.727123126962006       0.7943658809104911      0.9030097652361022      0.7743942653504368      0.6272536039977595      0.7647032561463141      0.7540873544744898
A2M 7.466186411474e-08      1.7068617121006318      1.5722363221730744      1.7241477215681602      2.512746238429057       1.192179193233052       1.9616053111713239      2.0831787155792267      1.3698921594123463      1.8944650163311454      2.049522066949797       0.026853332696881425    0.11955544285228693     0.1306156621631196      0.02935698777439702
A2ML1       0.013243211051909093    0.0038341538116092736   0.005223240799061258    0.0035200158291957538   0.006688701271386327    0.01231806318166027     0.006651830866222021    0.01240860696633966     0.006867506126330923    0.0028447826724574037   0.004404865042883453    0.09306972144679593     0.09312915276803117     0.07871043255363304     0.03729749320155993
A3GALT2     0.023045652491540343    0.017919176318647874    0.009595308902594371    0.019196980444105404    0.030547395118981288    0.012650172586582973    0.023783244992637877    0.01579925924075979     0.015948641690104712    0.005135873671568252    0.009253322651483684    0.060041819656531735    0.12801386599049638     0.06660916252501874     0.07245621656048234

Step 1: Map mutation blocks to genomic regions

The map_block2genome module from the dds_analysis package is used to map mutation blocks to predefined genomic regions.

dds_analysis map_block2genome --in_sortedBlock_file $IN_BLOCK_FILE --in_genomicRegion_file $GENOMIC_REGION_FILE --in_reference_genome $REFERENCE_GENOME_FILE --out_folder $OUTPUT_FOLDER

Step 2: Map mutation blocks to chromatin segments

The map_block2chromSegment module from the dds_analysis package is used to map mutation blocks to chromatin segments.

dds_analysis map_block2chromSegment --in_sortedBlock_file $IN_BLOCK_FILE --in_chromatinSegment_folder $CHROMATIN_SEGMENT_FOLDER --out_folder $OUTPUT_FOLDER

Step 3: Map mutation blocks to differentially methylated regions (DMRs)

The map_block2dmr module from the dds_analysis package is used to map mutation blocks to DMRs after adding flank regions.

dds_analysis map_block2dmr --in_sortedBlock_file $IN_BLOCK_FILE --in_dmr_file $DMR_FILE --flank_region_size $FLANK_REGION_SIZE --out_folder $OUTPUT_FOLDER

Step 4: Combine genomic regions with mutation block information and find differentially expressed genes

The find_geneExp4block module from the dds_analysis package is used to combine genomic regions with mutation block information and identify differentially expressed genes.

dds_analysis find_geneExp4block --in_blocks_genome_folder $BLOCKS_GENOME_FOLDER --in_sortedBlock_file $IN_BLOCK_FILE --in_de_genes_file $DE_GENES_FILE --in_feature_list $FEATURE_LIST --out_folder $OUTPUT_FOLDER

Step 5: Find patient IDs for each mutation block

The find_block_patientID module from the dds_analysis package is used to find patient IDs associated with each mutation block.

dds_analysis find_block_patientID --in_block_summary_file $BLOCK_SUMMARY_FILE --in_block_folder $BLOCK_FOLDER

Step 6: Combine DMRs, differentially expressed genes (DEGs), and mutation block information

The combine_dmr_deg2block module from the dds_analysis package is used to combine DMRs, DEGs, and mutation block information.

dds_analysis combine_dmr_deg2block --in_sortedBlock_patient_file $SORTED_BLOCK_PATIENT_FILE --in_dmr_file $DMR_FILE --in_deg_folder $DEG_FOLDER --deg_file_suffix $DEG_FILE_SUFFIX --out_folder $OUTPUT_FOLDER

Step 7: Filter blocks based on DMR or DEG information

The filter_blocks module from the dds_analysis package is used to filter blocks based on DMR or DEG information.

dds_analysis filter_blocks --in_combined_dmr_deg_block_file $COMBINED_DMR_DEG_BLOCK_FILE

Step 8: Collect unique gene names from predicted blocks after filtering

The collect_gene_names4blocks module from the dds_analysis package is used to collect unique gene names from predicted blocks.

dds_analysis collect_gene_names4blocks --in_filtered_block_file $FILTERED_BLOCK_FILE --out_gene_file $GENE_FILE

Step 9: Filter Mutation block and genes:

The check_block_gene_inTAD module from the dds_analysis package is used to perform filter mutation block and genes that they are not in the same TAD or boundary

dds_analysis check_block_gene_inTAD --in_filtered_blockUqGene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_and_DEG_uniqGene.tsv \
     --in_gene_position_file ${IN_GENE_POSITION} \
     --in_TAD_position_file ${IN_TAD_INFO}  \
     --in_TAD_boundary_file ${IN_TAD_BOUNDARY}
 echo 9a. check_block_gene_inTAD - Done for selecting blocks with both DMR and DEG



 dds_analysis check_block_gene_inTAD --in_filtered_blockUqGene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_or_DEG_uniqGene.tsv \
         --in_gene_position_file ${IN_GENE_POSITION} \
         --in_TAD_position_file ${IN_TAD_INFO} \
         --in_TAD_boundary_file ${IN_TAD_BOUNDARY}
 echo 9b. check_block_gene_inTAD - Done for selecting blocks with either DMR or DEG

Here is list of genes that exist in TAD:

gene_name   patients        gene_type       block_id        new_mr_sites    patient_id      enhancers       TAD2gene        Boundary2gene   TAD2block       Boundary2block  isTAD
BCL2        9       TSS     block_1_18_60987833_60988302    chr18:mr621     patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10      nan     18:60675000:61075000:Low-active na      18:60675000:61075000:Low-active na      1.0
BCL2        8       TSS     block_0_18_60986372_60987028    chr18:mr621     patient_0,patient_1,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10        enhancer        18:60675000:61075000:Low-active na      18:60675000:61075000:Low-active na      1.0
SERPINB8    9       5dist   block_1_18_60987833_60988302    chr18:mr621     patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10      nan     18:61650000:66375000:Low        18:61575000:61650000:1  18:60675000:61075000:Low-active na      0.0
SERPINB2    9       5dist   block_1_18_60987833_60988302    chr18:mr621     patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10      nan     18:61150000:61575000:Low        na      18:60675000:61075000:Low-active na      0.0
SERPINB5    9       5dist   block_1_18_60987833_60988302    chr18:mr621     patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10      nan     18:61150000:61575000:Low        18:61075000:61150000:1  18:60675000:61075000:Low-active na      0.0
SERPINB12   9       5dist   block_1_18_60987833_60988302    chr18:mr621     patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10      nan     18:61150000:61575000:Low        na      18:60675000:61075000:Low-active na      0.0

Step 10: Rank gene list:

The dds_geneRanking module from the dds_analysis package is used to rank gene list based on predefined weights such as default setting is tss=4, enhancer=3, tes=gene=2, 5dist=1

dds_analysis dds_geneRanking --in_unique_gene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_or_DEG_uniqGene_commonTAD_Boundary_list2UqGene.tsv \
    --in_DEG_file ${IN_EXPRESS} \
    --in_DMR_file ${IN_DMR} -inCutoff 0.5
 echo 10. geneRanking - Done

Here is the output of gene ranking of unique gene:

gene_name   gene_type       block_id        new_mr_sites    patients        isTAD   enhancers       patient_id
BCL2        TSS~TSS block_1_18_60987833_60988302~block_0_18_60986372_60987028       chr18:mr621~chr18:mr621 9~8     1.0~1.0 nan~enhancer    patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10~patient_0,patient_1,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10

Step 11: Find enhancer target genes

The find_enhancer_target_genes module from the dds_analysis package is used to find enhancer target genes by overlapping predicted enhancers with selected mutation blocks and a predicted target gene.

 search_dir='find ${OUT_FINAL} -name *_gt_* -type f '

 echo "Find results file from below list: "
 for entry in ${search_dir}
 do
   echo "$entry"
 done

 SEARCH_GENE='BCL2'
 echo "and input one of file name and path to step 11"

 read -p "To select a file from the above list before running step 12.
      for finding enhancers that overlapping to selected mutation blocks with a predicted target gene $SEARCH_GENE
      : "  IN_FINAL_RESULT
 echo "IN_FINAL_RESULT: $IN_FINAL_RESULT "

dds_analysis find_enhancer_target_genes --in_enhancer_folder $ENHANCER_FOLDER --in_dds_file $DDS_FILE --in_gene_file $GENE_FILE --out_folder $OUTPUT_FOLDER

Following is the output for gene “BCL2” with its enhancer region which is predicted for follicular lymphoma.

chr18       60986372        60987028        block_0_18_60986372_60987028    chr18:mr621     TSS     8       enhancer