demoFL
This document outlines the step-by-step workflow for the analysis of follicular lymphoma using the dds_analysis package. Each step is described along with the corresponding bash code.
Set variable paths:
The following code sets variable paths. User can adjust this code according to their files and folders.
#path of main input data folder
IN_FOLDER='../../data/fl_12samples/in_data/'
#a) Path of mutation blocks and block summary files exported by bpb3 package
IN_BLOCKS=${IN_FOLDER}'/bpb3_demo3_fl_cohort_small/out/mussd_blocks/'
IN_BLOCK_SUMMARY='blocks_summary.tsv'
#Path of mutation blocks with patient ID will be exported later by dds_analysis find_block_patientID
#convert bpb3 predicted mutation block summary file to bed format file with positions. This is done by bpb3summary_format
dds_analysis bpb3summary2bed_format --in_block_summary_file ${IN_BLOCKS}/$IN_BLOCK_SUMMARY
echo ${IN_BLOCKS}'/'${IN_BLOCK_SUMMARY}
echo "preprocess_data - Done"
#make file name of the bed format of block summary file that will be used in the rest of analysis
replace='_block_position.bed'
finds='.tsv'
IN_BLOCK_FILE=${IN_BLOCK_SUMMARY//$finds/$replace}
#b) Path of predicted DMRs/MRs from "dmr_analysis"
IN_GENOME_REGIONS=${IN_FOLDER}'/final_demo_data/fl_12samples/out_data/DMR_CpG_context/'
#a list of predefined genomic regions that will be used to map mutation blocks.
#This file is generated by dmr_analysis dmr_gene_annotation but user has to double check the location of files in this list if you need to add enhancer or other genomic region files
IN_LIST_REGIONS='list_region_files.txt'
#c) Path to refFlat gene annotation file that generated by hmst-seq-analyzer
IN_GENOME_refFlat=${IN_GENOME_REGIONS}'/data/hg19.refFlat_clean_sorted.bed'
#d) Path of other genomic information files such as TAD, chromatin segment, and genome files et al
IN_GENOME_LIB='../../data/'
#path of chromatin segment files that will be used to intersect with mutation blocks
IN_CHROMSEG=${IN_GENOME_LIB}'/chromSegment/hg19/'
#e) Path of TAD information that will be used to map mutation blocks
IN_TAD=${IN_GENOME_LIB}'/in_tad/'
#f) Path to gene position file in sorted bed format which is generated by dmr_analysis dmr_gene_annotation
IN_GENE_POSITION=${IN_TAD}'/out/data/gene_Up1_Down1removedShort.bed'
#g) Path to common TAD positions across 5 human cell lines
IN_TAD_INFO=${IN_TAD}'/Table4_TAD_annotations_sorted_chr.bed'
#path to common TAD boundary position among 5 human cell lines
IN_TAD_BOUNDARY=${IN_TAD}'/Table1_common_boundaries_merged_sorted_chr.bed'
#Path of differentially methylated regions, differentially expressed genes, and output folder of mutation blocks mapped to predefined genomic regions.
#h) Path to DMRs predicted and ranked by dmr_analysis
IN_DMR=${IN_GENOME_REGIONS}'/2_chroms_all_mr_data_range_dmrRanking.bed'
#i) path to DEG genes exported from bpb3 package
IN_EXPRESS=${IN_FOLDER}'/bpb3_demo3_fl_cohort_small/out/differentially_expressed_genes.txt'
#Try to add N bp flank region on the two sides of mutation blocks before intersecting them with DMRs
FLANK_BP=0
###---OUTPUT PATH
# path of main output data folder
OUT_FOLDER='../../data/fl_12samples/out_data/'
#path for exporting mutation blocks mapped to predefined genomic regions
OUT_GENOME=${OUT_FOLDER}'/out_genome/'
#path for exporting mutation blocks mapped to chromatin segmentations
OUT_CHROMSEG=${OUT_FOLDER}'/out_chromSegment/'
#path for exported mutation blocks associated with DEGs
OUT_EXPRESS=${OUT_FOLDER}'/out_expression/'
#path for final results that mutation blocks associated with DMR, DEG and TAD
OUT_FINAL=${OUT_FOLDER}/'out_DmrDeg2block'
Here we will look at input files.
bpb3 block summary looks like following:
block_id chrom start_pos end_pos number_of_mutations number_of_patients mutation_distribution region_names
block_1_18_60987833_60988302 18 60987808 60988326 102 9 6,4,11,7,7,5,18,13,31 BCL2;-;up1000;down1000
block_0_18_60986372_60987028 18 60986347 60987052 59 8 4,3,10,9,6,9,5,13 BCL2;-;up1000;down1000
IN_LIST_REGIONS file looks like the following:
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/TSS_Up5000_Down1000_removedShort.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/gene_Up5000_Down1000removedShort.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/TES_Up5000_Down1000removedShort.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/5dist_Up1000000_Up5000removedShort.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/intergenic_uniqueSorted_betweenTSS_TES_genes_minLen10.bed
../../final_demo_data/fl_12samples/out_data/DMR_CpG_context/data/5dist_Down1000000_Down5000removedShort.bed
Head of the hg19.refFlat_clean_sorted.bed looks like the following
chr1 11868 14362 LOC102725121:NR_148357 . +
chr1 11873 14409 DDX11L1:NR_046018 . +
chr1 14361 29370 WASH7P:NR_024540 . -
chr1 34610 36081 FAM138A:NR_026818 . -
chr1 69090 70008 OR4F5:NM_001005484 . +
chr1 134772 140566 LOC729737:NR_039983 . -
chr1 323891 328581 LOC100132287:NR_028322 . +
chr1 367658 368597 OR4F29:NM_001005221 . +
Predicted DMRs by dmr_analysis file is also used as input to this demo which looks like the following:
mr_chrs mr_start_pos mr_end_pos mr_info mr_logReg_proba genome_info chromSegment_info
chr18 57022124 57027729 chr18:mr5:hypo:D 0.999871 chr18:57025497:57031497:NM_005570||TSS:5000:1000||LMAN1:-:56995055:57026497~chr18:56996055:57025497:NM_005570||gene:5000:1000||LMAN1:-:56995055:5702649~E~R~TSS~PF~T
chr18 57028952 57030829 chr18:mr6:hypo:D 0.999902 chr18:57025497:57031497:NM_005570||TSS:5000:1000||LMAN1:-:56995055:57026497~chr18:56343696:57338696:NM_006785||5distD:5000:1000000||MALT1:+:56338696:56221709~chr18:56486111:57481111:NR_146904||5distD:5000:1000000||LINC01926:+:56481111:56501596~chr18:56535155:57530155:NM_018181||5distD:5000:1000000||ZNF332:+:56530155:56653712~chr18:56535714:57530714:NM_001353526||5distD:5000:1000000||ZNF532:+:56530714:56653712~chr18:56536322:57531322:NM_001318728||5distD:5000:1000000||ZNF532:+:56531322:56653712~chr18:56536590:57531590:NM_001353531||5distD:5000:1000000||ZNF532:+:56531590:56653712~chr18:56537108:57532108:NM_001353527||5distD:5000:1000000||ZNF532:+:56532108:56653712~chr18:56707910:57702910:NR_024021||5distD:5000:1000000||OACYLP:+:56702910:56720446~chr18:56812115:57807115:NM_001307941||5distD:5000:1000000||SEC11C:+:56807115:56826063~chr18:56892389:57887389:NM_001012513||5distD:5000:1000000||GRP:+:56887389:56898002~chr18:56364655:57359655:NM_133459||5distD:5000:1000000||CCBE1:-:57098170:57364655~chr18:56301323:57296323:NM_052947||5dist:5000:1000000||ALPK2:-:56148481:56296323~chr18:56945686:57940686:NM_013435||5dist:5000:1000000||RAX:-:56934269:56940686~chr18:56990881:57985881:NM_181654||5dist:5000:1000000||CPLX4:-:56962633:56985881~chr18:56567227:57562227:NM_021127||5dist:5000:1000000||PMAIP1:+:57567227:57571537 R~T
Head of differentially_expressed_genes.txt is following:
A1BG 0.006841601076234209 1.03562166970378 1.0044718601447777 0.7682775946401803 0.9032574892066986 0.9586228797372297 0.8600161280044952 1.1129622825040923 0.727123126962006 0.7943658809104911 0.9030097652361022 0.7743942653504368 0.6272536039977595 0.7647032561463141 0.7540873544744898
A2M 7.466186411474e-08 1.7068617121006318 1.5722363221730744 1.7241477215681602 2.512746238429057 1.192179193233052 1.9616053111713239 2.0831787155792267 1.3698921594123463 1.8944650163311454 2.049522066949797 0.026853332696881425 0.11955544285228693 0.1306156621631196 0.02935698777439702
A2ML1 0.013243211051909093 0.0038341538116092736 0.005223240799061258 0.0035200158291957538 0.006688701271386327 0.01231806318166027 0.006651830866222021 0.01240860696633966 0.006867506126330923 0.0028447826724574037 0.004404865042883453 0.09306972144679593 0.09312915276803117 0.07871043255363304 0.03729749320155993
A3GALT2 0.023045652491540343 0.017919176318647874 0.009595308902594371 0.019196980444105404 0.030547395118981288 0.012650172586582973 0.023783244992637877 0.01579925924075979 0.015948641690104712 0.005135873671568252 0.009253322651483684 0.060041819656531735 0.12801386599049638 0.06660916252501874 0.07245621656048234
Step 1: Map mutation blocks to genomic regions
The map_block2genome module from the dds_analysis package is used to map mutation blocks to predefined genomic regions.
dds_analysis map_block2genome --in_sortedBlock_file $IN_BLOCK_FILE --in_genomicRegion_file $GENOMIC_REGION_FILE --in_reference_genome $REFERENCE_GENOME_FILE --out_folder $OUTPUT_FOLDER
Step 2: Map mutation blocks to chromatin segments
The map_block2chromSegment module from the dds_analysis package is used to map mutation blocks to chromatin segments.
dds_analysis map_block2chromSegment --in_sortedBlock_file $IN_BLOCK_FILE --in_chromatinSegment_folder $CHROMATIN_SEGMENT_FOLDER --out_folder $OUTPUT_FOLDER
Step 3: Map mutation blocks to differentially methylated regions (DMRs)
The map_block2dmr module from the dds_analysis package is used to map mutation blocks to DMRs after adding flank regions.
dds_analysis map_block2dmr --in_sortedBlock_file $IN_BLOCK_FILE --in_dmr_file $DMR_FILE --flank_region_size $FLANK_REGION_SIZE --out_folder $OUTPUT_FOLDER
Step 4: Combine genomic regions with mutation block information and find differentially expressed genes
The find_geneExp4block module from the dds_analysis package is used to combine genomic regions with mutation block information and identify differentially expressed genes.
dds_analysis find_geneExp4block --in_blocks_genome_folder $BLOCKS_GENOME_FOLDER --in_sortedBlock_file $IN_BLOCK_FILE --in_de_genes_file $DE_GENES_FILE --in_feature_list $FEATURE_LIST --out_folder $OUTPUT_FOLDER
Step 5: Find patient IDs for each mutation block
The find_block_patientID module from the dds_analysis package is used to find patient IDs associated with each mutation block.
dds_analysis find_block_patientID --in_block_summary_file $BLOCK_SUMMARY_FILE --in_block_folder $BLOCK_FOLDER
Step 6: Combine DMRs, differentially expressed genes (DEGs), and mutation block information
The combine_dmr_deg2block module from the dds_analysis package is used to combine DMRs, DEGs, and mutation block information.
dds_analysis combine_dmr_deg2block --in_sortedBlock_patient_file $SORTED_BLOCK_PATIENT_FILE --in_dmr_file $DMR_FILE --in_deg_folder $DEG_FOLDER --deg_file_suffix $DEG_FILE_SUFFIX --out_folder $OUTPUT_FOLDER
Step 7: Filter blocks based on DMR or DEG information
The filter_blocks module from the dds_analysis package is used to filter blocks based on DMR or DEG information.
dds_analysis filter_blocks --in_combined_dmr_deg_block_file $COMBINED_DMR_DEG_BLOCK_FILE
Step 8: Collect unique gene names from predicted blocks after filtering
The collect_gene_names4blocks module from the dds_analysis package is used to collect unique gene names from predicted blocks.
dds_analysis collect_gene_names4blocks --in_filtered_block_file $FILTERED_BLOCK_FILE --out_gene_file $GENE_FILE
Step 9: Filter Mutation block and genes:
The check_block_gene_inTAD module from the dds_analysis package is used to perform filter mutation block and genes that they are not in the same TAD or boundary
dds_analysis check_block_gene_inTAD --in_filtered_blockUqGene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_and_DEG_uniqGene.tsv \
--in_gene_position_file ${IN_GENE_POSITION} \
--in_TAD_position_file ${IN_TAD_INFO} \
--in_TAD_boundary_file ${IN_TAD_BOUNDARY}
echo 9a. check_block_gene_inTAD - Done for selecting blocks with both DMR and DEG
dds_analysis check_block_gene_inTAD --in_filtered_blockUqGene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_or_DEG_uniqGene.tsv \
--in_gene_position_file ${IN_GENE_POSITION} \
--in_TAD_position_file ${IN_TAD_INFO} \
--in_TAD_boundary_file ${IN_TAD_BOUNDARY}
echo 9b. check_block_gene_inTAD - Done for selecting blocks with either DMR or DEG
Here is list of genes that exist in TAD:
gene_name patients gene_type block_id new_mr_sites patient_id enhancers TAD2gene Boundary2gene TAD2block Boundary2block isTAD
BCL2 9 TSS block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:60675000:61075000:Low-active na 18:60675000:61075000:Low-active na 1.0
BCL2 8 TSS block_0_18_60986372_60987028 chr18:mr621 patient_0,patient_1,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 enhancer 18:60675000:61075000:Low-active na 18:60675000:61075000:Low-active na 1.0
SERPINB8 9 5dist block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:61650000:66375000:Low 18:61575000:61650000:1 18:60675000:61075000:Low-active na 0.0
SERPINB2 9 5dist block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:61150000:61575000:Low na 18:60675000:61075000:Low-active na 0.0
SERPINB5 9 5dist block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:61150000:61575000:Low 18:61075000:61150000:1 18:60675000:61075000:Low-active na 0.0
SERPINB12 9 5dist block_1_18_60987833_60988302 chr18:mr621 patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10 nan 18:61150000:61575000:Low na 18:60675000:61075000:Low-active na 0.0
Step 10: Rank gene list:
The dds_geneRanking module from the dds_analysis package is used to rank gene list based on predefined weights such as default setting is tss=4, enhancer=3, tes=gene=2, 5dist=1
dds_analysis dds_geneRanking --in_unique_gene_file $OUT_FINAL/${IN_BLOCK_DMR_NAME}_deg_info_filtered_DMR_or_DEG_uniqGene_commonTAD_Boundary_list2UqGene.tsv \
--in_DEG_file ${IN_EXPRESS} \
--in_DMR_file ${IN_DMR} -inCutoff 0.5
echo 10. geneRanking - Done
Here is the output of gene ranking of unique gene:
gene_name gene_type block_id new_mr_sites patients isTAD enhancers patient_id
BCL2 TSS~TSS block_1_18_60987833_60988302~block_0_18_60986372_60987028 chr18:mr621~chr18:mr621 9~8 1.0~1.0 nan~enhancer patient_0,patient_1,patient_2,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10~patient_0,patient_1,patient_4,patient_5,patient_6,patient_7,patient_8,patient_10
Step 11: Find enhancer target genes
The find_enhancer_target_genes module from the dds_analysis package is used to find enhancer target genes by overlapping predicted enhancers with selected mutation blocks and a predicted target gene.
search_dir='find ${OUT_FINAL} -name *_gt_* -type f '
echo "Find results file from below list: "
for entry in ${search_dir}
do
echo "$entry"
done
SEARCH_GENE='BCL2'
echo "and input one of file name and path to step 11"
read -p "To select a file from the above list before running step 12.
for finding enhancers that overlapping to selected mutation blocks with a predicted target gene $SEARCH_GENE
: " IN_FINAL_RESULT
echo "IN_FINAL_RESULT: $IN_FINAL_RESULT "
dds_analysis find_enhancer_target_genes --in_enhancer_folder $ENHANCER_FOLDER --in_dds_file $DDS_FILE --in_gene_file $GENE_FILE --out_folder $OUTPUT_FOLDER
Following is the output for gene “BCL2” with its enhancer region which is predicted for follicular lymphoma.
chr18 60986372 60987028 block_0_18_60986372_60987028 chr18:mr621 TSS 8 enhancer