Preprocessing for MethylBERT fine-tuning training data
MethylBERT fine-tuning requires DNA methylation data from tumour (T) and normal (N) samples as training data. The sample files and their annotations are listed in a tab-delimited file, one sample per line.
[1]:
cat ../test/data/bam_list.txt
../test/data/T_sample.bam T
../test/data/N_sample.bam N
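If the sample list is built programmatically, a minimal sketch along the following lines may help. It assumes the layout shown above (BAM path, a tab, then the T or N label, no header row). The helper name write_bam_list is hypothetical and not part of MethylBERT, and the file is written to a temporary directory purely for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_bam_list(samples, out_path):
    """Write (bam_path, label) pairs as tab-separated lines without a header.

    Hypothetical helper for illustration; not part of MethylBERT.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(samples)


# Paths and labels taken from the example list above.
samples = [
    ("../test/data/T_sample.bam", "T"),
    ("../test/data/N_sample.bam", "N"),
]

out_file = Path(tempfile.mkdtemp()) / "bam_list.txt"
write_bam_list(samples, out_file)
print(out_file.read_text())
```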
As described in the data preparation tutorial, DMRs and the reference genome should be prepared in the required format.
MethylBERT provides the finetune_data_generate function to preprocess the given tumour and normal data.
[2]:
from methylbert.data import finetune_data_generate as fdg
f_bam_file_list = "../test/data/bam_list.txt"
f_dmr = "../test/data/dmrs.csv"
f_ref = "../../../genome/hg19.fa"
out_dir = "tmp/"
fdg.finetune_data_generate(
    sc_dataset=f_bam_file_list,
    f_dmr=f_dmr,
    f_ref=f_ref,
    output_dir=out_dir,
    split_ratio=0.8,  # Split ratio to make training and validation data
    n_mers=3,  # 3-mer DNA sequences
    n_cores=20,
)
DMRs sorted by areaStat
chr start end length nCG meanMethy1 meanMethy2 \
1 chr10 134597480 134602875.0 5396 670 0.861029 0.140400
0 chr7 1268957 1277884.0 8928 753 0.793278 0.129747
2 chr4 1395812 1402597.0 6786 663 0.831162 0.185272
5 chr16 54962053 54967980.0 5928 546 0.783631 0.096095
9 chr18 76736906 76741580.0 4675 510 0.829475 0.104403
diff.Methy areaStat abs_areaStat abs_diff.Methy ctype dmr_id
1 0.720629 6144.089331 6144.089331 0.720629 T 0
0 0.663531 5722.091790 5722.091790 0.663531 T 1
2 0.645891 4941.410089 4941.410089 0.645891 T 2
5 0.687536 4714.551799 4714.551799 0.687536 T 3
9 0.725072 4684.608381 4684.608381 0.725072 T 4
Number of DMRs to extract sequence reads: 20
../test/data/T_sample.bam processing (T)...
../test/data/N_sample.bam processing (N)...
Fine-tuning data generated: name flag ref_name ref_pos \
0 SRR10165994.69237235_69237235_length=151 163 chr7 156797584
1 SRR10165464.24148712_24148712_length=151 99 chr10 131770809
2 SRR10165994.26664131_26664131_length=151 163 chr10 131766813
3 SRR10165464.61126854_61126854_length=150 147 chr10 131769430
4 SRR10165994.14375046_14375046_length=150 83 chr5 1884762
map_quality cigar next_ref_name next_ref_pos length \
0 24 151M = 156797848 344
1 42 151M = 131770846 188
2 42 149M = 131767027 365
3 24 151M = 131769291 -290
4 40 149M = 1884665 -246
seq ... PG XG \
0 GGGGAAGAAAAAAAACTAAATAATAATTTAACATACATACGTAAAC... ... MarkDuplicates GA
1 GGTTTGTCGGGAAGGTTGTGAGTAGAGGCCAACGGAGGTCTCCCAG... ... MarkDuplicates CT
2 GGGGGCCTCTAAAAACGCTCCAAATTCGTCTTACGCCACGAAATCA... ... MarkDuplicates GA
3 GTTGGGTGGTAAGGTGGTTTAGGGTATAGTTAGGGGTTATGTAGAA... ... MarkDuplicates CT
4 AATAATTATTTCTAAATTCTATATTAATTTCGCGACAAACCGCGTT... ... MarkDuplicates GA
NM XM XR \
0 23 HHH.z..hhh.h..h...............h...h.....Z..h..... GA
1 11 ..hxz.xZ.......xz.z...x.....HH..Z.....hH.HHX..... CT
2 38 .Z.ZX.....x.h.h.Z......h...Z....h.Z....Zxhh...... GA
3 31 .x....z..h....z..hhx....h....h......hh........... GA
4 18 .......h.....x......x....h.....Z.Zx..xh..Z.Z..... CT
dna_seq \
0 GGG GGC GCG CGA GAT ATG TGG GGG GGA GAG AGA GA...
1 GGC GCC CCC CCG CGC GCC CCG CGG GGG GGA GAA AA...
2 CGC GCG CGG GGC GCC CCT CTC TCT CTG TGA GAG AG...
3 GCT CTG TGG GGG GGC GCG CGG GGC GCA CAA AAG AG...
4 AAT ATA TAA AAT ATT TTG TGT GTT TTT TTC TCT CT...
methyl_seq dmr_ctype dmr_label ctype
0 2202222222222222222222222222222222222212222212... T 17 T
1 2220221222222220202222222222222122222222222222... T 5 N
2 2122222222222212222222222122222212222122222221... T 5 T
3 2222202222222022222222222222222222222222222222... T 5 N
4 2222222222222222222222222222212122222221212222... T 8 T
[5 rows x 22 columns]
Size - train 3051 seqs , valid 763 seqs
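The dna_seq column in the log above holds space-separated 3-mers, matching n_mers=3. The sketch below illustrates one way such tokens can be produced, assuming overlapping k-mers with a stride of one base. That scheme is inferred from the printed dna_seq values and is not taken from the MethylBERT source, and the function name to_kmers is hypothetical.

```python
def to_kmers(seq: str, k: int = 3) -> str:
    """Split a DNA sequence into overlapping k-mers (stride 1), joined by spaces.

    Illustrative only; the stride-1 scheme is an assumption based on the
    dna_seq column shown in the preprocessing log.
    """
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Short example whose tokens match the start of the first dna_seq row in the log.
tokens = to_kmers("GGGCGATG", k=3)
print(tokens)
```

Under this assumption a read of length L yields L - k + 1 tokens, so the token sequence is slightly shorter than the read itself.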
After the preprocessing, you get three different files:
dmrs.csv: Selected DMRs (when the number of DMRs is given) with a dmr_label column
train_seq.csv: Preprocessed training data
test_seq.csv: Preprocessed evaluation data (20% of the given data, because split_ratio=0.8)
[22]:
ls tmp/
dmrs.csv test_seq.csv train_seq.csv
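As a quick sanity check, the split sizes reported in the preprocessing log (train 3051 seqs, valid 763 seqs) can be compared with split_ratio=0.8. The snippet below is only arithmetic on those two printed numbers; it does not read the generated files.

```python
# Sizes printed in the preprocessing log above.
n_train, n_valid = 3051, 763
split_ratio = 0.8

total = n_train + n_valid
train_fraction = n_train / total

print(f"total reads: {total}")
print(f"training fraction: {train_fraction:.4f}")

# The observed fraction should sit very close to the requested ratio.
assert abs(train_fraction - split_ratio) < 1e-3
```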
Each preprocessed file is a tab-delimited .csv file in which each column holds one field of the input BAM/SAM records. Three additional columns, dmr_ctype, dmr_label and ctype, are also provided:
dmr_ctype: The specific cell type for each DMR
dmr_label: DMR label. This is used for the read classifier fully-connected network in MethylBERT
ctype: Cell type of the read (indicated in the input file)
[11]:
import pandas as pd
pd.read_csv("tmp/test_seq.csv", sep='\t').head()
[11]:
| | name | flag | ref_name | ref_pos | map_quality | cigar | next_ref_name | next_ref_pos | length | seq | ... | PG | XG | NM | XM | XR | dna_seq | methyl_seq | dmr_ctype | dmr_label | ctype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SRR10165464.6790597_6790597_length=151 | 83 | chr2 | 176943541 | 40 | 151M | = | 176943475 | -217 | AATTAACAATTTTCATCATAATCTACACATTATTAACATCAAACTT... | ... | MarkDuplicates | GA | 37 | h...hh........z.........x..........h............. | CT | GAT ATT TTG TGG GGC GCA CAA AAT ATT TTT TTT TT... | 2222222222220222222222222222222222222222222222... | T | 12 | N |
| 1 | SRR10165994.18752987_18752987_length=149 | 163 | chr7 | 157486616 | 40 | 149M | = | 157486650 | 183 | AGGCACGCGACCACCCTAAACCTCGAACAAAACTAAAAAAACGCAA... | ... | MarkDuplicates | GA | 51 | ..Z...Z.Zx.......xhh....Zx...xhh...hhhhh..Z..x... | GA | CCG CGC GCA CAC ACG CGC GCG CGG GGC GCC CCA CA... | 1222121222222222222222122222222222222222122222... | T | 11 | T |
| 2 | SRR10165994.2935274_2935274_length=150 | 83 | chr7 | 1270222 | 42 | 150M | = | 1269981 | -391 | ACGAACATTAAAACGCACGGAACCGCCGCGACGCGGACTCGCTCTT... | ... | MarkDuplicates | GA | 27 | h.Z.h....hhh..Z...ZX.h..Z..Z.Zx.Z.ZX....Z........ | CT | GCG CGA GAG AGC GCA CAT ATT TTG TGG GGG GGA GA... | 1222222222221222122222122121221212222212222222... | T | 1 | T |
| 3 | SRR10165464.56090327_56090327_length=151 | 163 | chr2 | 176949511 | 42 | 149M | = | 176949602 | 242 | AGGATTTCTTACTACATAACCACAAAAATACATTAAACCCACACCT... | ... | MarkDuplicates | GA | 36 | h.Z.......h....z.hh..z.zx.hh.h....hhh...z.z...... | GA | GCG CGC GCT CTT TTT TTC TCT CTT TTG TGC GCT CT... | 1222222222222022222020222222222222222202022222... | T | 12 | N |
| 4 | SRR10165464.47924911_47924911_length=150 | 147 | chr7 | 1272480 | 42 | 151M | = | 1272378 | -253 | AATTATTGGGAGTTTGATGTTGATAAGTAAAGTGTTGGAGTGTGGG... | ... | MarkDuplicates | CT | 31 | ......z.....h...................z.xz......z...... | GA | AAT ATT TTA TAT ATC TCG CGG GGG GGA GAG AGC GC... | 2222202222222222222222222222222022022222202220... | T | 1 | N |
5 rows × 22 columns
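Once a preprocessed file is loaded, it can be useful to see how reads are distributed across DMRs and cell types. The sketch below uses pandas.crosstab on a small inline table instead of tmp/test_seq.csv so that it runs on its own. The dmr_ctype, dmr_label and ctype values mirror the five rows shown above, while the read names (read_0 to read_4) are placeholders.

```python
import io

import pandas as pd

# Tiny tab-separated stand-in for tmp/test_seq.csv; read names are placeholders.
sample_tsv = (
    "name\tdmr_ctype\tdmr_label\tctype\n"
    "read_0\tT\t12\tN\n"
    "read_1\tT\t11\tT\n"
    "read_2\tT\t1\tT\n"
    "read_3\tT\t12\tN\n"
    "read_4\tT\t1\tN\n"
)

df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Rows: DMR label; columns: cell type of the read; values: number of reads.
counts = pd.crosstab(df["dmr_label"], df["ctype"])
print(counts)
```

To run the same summary on the real output, replace the io.StringIO(...) argument with the path to the generated file, for example "tmp/test_seq.csv".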