Preprocessing for bulk data

The bulk sample you want to deconvolute with MethylBERT must also be preprocessed, using the finetune_data_generate function.

[1]:
from methylbert.data import finetune_data_generate as fdg

f_bam = "../test/data/bulk.bam"
f_dmr = "../test/data/dmrs.csv"
f_ref = "../../../genome/hg19.fa"
out_dir = "tmp/"

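fdg.finetune_data_generate(
    input_file=f_bam,    # bulk sample (BAM file)
    f_dmr=f_dmr,         # DMR table (CSV file)
    f_ref=f_ref,         # reference genome (FASTA file)
    output_dir=out_dir,  # directory where the generated files are written
    n_mers=3,            # 3-mer DNA sequences
    n_cores=20
)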
DMRs sorted by areaStat
     chr      start          end  length  nCG  meanMethy1  meanMethy2  \
1  chr10  134597480  134602875.0    5396  670    0.861029    0.140400
0   chr7    1268957    1277884.0    8928  753    0.793278    0.129747
2   chr4    1395812    1402597.0    6786  663    0.831162    0.185272
5  chr16   54962053   54967980.0    5928  546    0.783631    0.096095
9  chr18   76736906   76741580.0    4675  510    0.829475    0.104403

   diff.Methy     areaStat  abs_areaStat  abs_diff.Methy ctype  dmr_id
1    0.720629  6144.089331   6144.089331        0.720629     T       0
0    0.663531  5722.091790   5722.091790        0.663531     T       1
2    0.645891  4941.410089   4941.410089        0.645891     T       2
5    0.687536  4714.551799   4714.551799        0.687536     T       3
9    0.725072  4684.608381   4684.608381        0.725072     T       4
Number of DMRs to extract sequence reads: 20
Fine-tuning data generated:                                        name flag ref_name    ref_pos  \
0    SRR10166000.9089788_9089788_length=151  147    chr10  131767360
1  SRR10165998.65829390_65829390_length=150  163     chr4   20254248
2  SRR10165467.85837758_85837758_length=151   99     chr4    1401206
3  SRR10165995.16747267_16747267_length=149   83     chr2  176945656
4  SRR10165995.46034072_46034072_length=151   99     chr4   20253524

  map_quality cigar next_ref_name next_ref_pos length  \
0          42  151M             =    131767187   -324
1          23  151M             =     20254343    244
2          40  151M             =      1401285    227
3          40  149M             =    176945572   -233
4          40  151M             =     20253771    398

                                                 seq  ...  NM  \
0  GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG...  ...  49
1  GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT...  ...  35
2  AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG...  ...  40
3  AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC...  ...  35
4  TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT...  ...   2

                                                  XM  XR  \
0  ........xZ.x..Z.x..xZ.....xZ.....x....x..hx......  GA
1  H..............h......xh.h...x..Z.Zx.h..x.Zx.....  GA
2  ...........x..h....hhh.h....hxz.hhhhh............  CT
3  x...hh...hh.............Z.....h.........z.h......  CT
4  .Z...h......................Z.hXZ...Z..Z....H....  CT

                        PG                                    RG  \
0  MarkDuplicates-287B47C6  diffuse_large_B_cell_lymphoma_test_8
1  MarkDuplicates-3DAAB091  diffuse_large_B_cell_lymphoma_test_8
2  MarkDuplicates-36E4BA78                Bcell_noncancer_test_8
3  MarkDuplicates-74536757  diffuse_large_B_cell_lymphoma_test_8
4  MarkDuplicates-74536757  diffuse_large_B_cell_lymphoma_test_8

                                             dna_seq  \
0  GTG TGG GGA GAG AGT GTG TGC GCC CCG CGC GCT CT...
1  GTT TTT TTC TCT CTT TTC TCT CTA TAC ACC CCT CT...
2  AAA AAA AAT ATG TGA GAG AGA GAG AGA GAC ACT CT...
3  GAA AAT ATG TGG GGC GCT CTT TTG TGG GGT GTC TC...
4  TCG CGG GGA GAC ACT CTT TTG TGG GGT GTG TGT GT...

                                          methyl_seq dmr_ctype dmr_label ctype
0  2222222212222122222122222212222222222222222222...         T         5    NA
1  2222222222222222222222222222221212222222122222...         T        19    NA
2  2222222222222222222222222222202222222222222222...         T         2    NA
3  2222222222222222222222122222222222222202222222...         T        12    NA
4  1222222222222222222222222221222122212212222222...         T        19    NA

[5 rows x 23 columns]
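
The n_mers=3 argument asks for 3-mer DNA sequences, and the dna_seq column in the output above shows each read as space-separated, overlapping 3-mers (for example GTG TGG GGA ...). As a minimal sketch of that kind of tokenization, assuming a sliding window with a stride of one base, the hypothetical helper to_kmers below reproduces the pattern. It is an illustration only, not MethylBERT's own implementation.

```python
def to_kmers(sequence: str, k: int = 3) -> str:
    """Split a DNA sequence into overlapping k-mers joined by spaces.

    Hypothetical helper for illustration; not part of the methylbert package.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Slide a window of width k along the sequence, one base at a time.
    # A sequence shorter than k yields no k-mers (empty string).
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# First eight bases implied by the dna_seq value of row 0 in the output above
tokens = to_kmers("GTGGAGTG", k=3)
print(tokens)  # GTG TGG GGA GAG AGT GTG
```

A sequence of length L yields L - k + 1 tokens under this scheme, so the eight bases above give six 3-mers.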

Running finetune_data_generate writes a new file, data.csv, to the output directory. This file contains the preprocessed bulk data.

[4]:
ls tmp/
data.csv  dmrs.csv  test_seq.csv  train_seq.csv

Because no cell-type information is available for a bulk sample, the ctype column contains only NaN values.

[3]:
import pandas as pd
pd.read_csv("tmp/data.csv", sep="\t").head()
[3]:
                                       name flag ref_name    ref_pos  \
0    SRR10166000.9089788_9089788_length=151  147    chr10  131767360
1  SRR10165998.65829390_65829390_length=150  163     chr4   20254248
2  SRR10165467.85837758_85837758_length=151   99     chr4    1401206
3  SRR10165995.16747267_16747267_length=149   83     chr2  176945656
4  SRR10165995.46034072_46034072_length=151   99     chr4   20253524

  map_quality cigar next_ref_name next_ref_pos length  \
0          42  151M             =    131767187   -324
1          23  151M             =     20254343    244
2          40  151M             =      1401285    227
3          40  149M             =    176945572   -233
4          40  151M             =     20253771    398

                                                 seq  ...  NM  \
0  GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG...  ...  49
1  GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT...  ...  35
2  AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG...  ...  40
3  AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC...  ...  35
4  TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT...  ...   2

                                                  XM  XR  \
0  ........xZ.x..Z.x..xZ.....xZ.....x....x..hx......  GA
1  H..............h......xh.h...x..Z.Zx.h..x.Zx.....  GA
2  ...........x..h....hhh.h....hxz.hhhhh............  CT
3  x...hh...hh.............Z.....h.........z.h......  CT
4  .Z...h......................Z.hXZ...Z..Z....H....  CT

                        PG                                    RG  \
0  MarkDuplicates-287B47C6  diffuse_large_B_cell_lymphoma_test_8
1  MarkDuplicates-3DAAB091  diffuse_large_B_cell_lymphoma_test_8
2  MarkDuplicates-36E4BA78                Bcell_noncancer_test_8
3  MarkDuplicates-74536757  diffuse_large_B_cell_lymphoma_test_8
4  MarkDuplicates-74536757  diffuse_large_B_cell_lymphoma_test_8

                                             dna_seq  \
0  GTG TGG GGA GAG AGT GTG TGC GCC CCG CGC GCT CT...
1  GTT TTT TTC TCT CTT TTC TCT CTA TAC ACC CCT CT...
2  AAA AAA AAT ATG TGA GAG AGA GAG AGA GAC ACT CT...
3  GAA AAT ATG TGG GGC GCT CTT TTG TGG GGT GTC TC...
4  TCG CGG GGA GAC ACT CTT TTG TGG GGT GTG TGT GT...

                                          methyl_seq dmr_ctype dmr_label ctype
0  2222222212222122222122222212222222222222222222...         T         5   NaN
1  2222222222222222222222222222221212222222122222...         T        19   NaN
2  2222222222222222222222222222202222222222222222...         T         2   NaN
3  2222222222222222222222122222222222222202222222...         T        12   NaN
4  1222222222222222222222222221222122212212222222...         T        19   NaN

5 rows × 23 columns
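
To confirm that ctype is empty for every read rather than only for the first five rows, the whole column can be checked programmatically. The sketch below uses a small, made-up tab-separated excerpt (hypothetical read names, only four of the 23 columns) in place of tmp/data.csv so that it runs on its own. It assumes the missing values are stored as the string NA, as the printed output of finetune_data_generate suggests; pandas reads that string as NaN by default.

```python
import io

import pandas as pd

# Made-up stand-in for tmp/data.csv: same tab-separated layout,
# but only four columns and two hypothetical reads.
tsv = (
    "name\tdmr_ctype\tdmr_label\tctype\n"
    "read_1\tT\t5\tNA\n"
    "read_2\tT\t19\tNA\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# True only if every entry in the ctype column is missing
all_missing = bool(df["ctype"].isna().all())

# Number of reads assigned to each DMR label
reads_per_dmr = df["dmr_label"].value_counts().to_dict()

print(all_missing)
print(reads_per_dmr)
```

For the real file, replacing io.StringIO(tsv) with "tmp/data.csv" should apply the same checks to the full table.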