Tumour deconvolution with the fine-tuned MethylBERT model

Load the bulk data and the fine-tuned model

Please load your preprocessed bulk data following the tutorial into the MethylBertFinetuneDataset and DataLoader.

[1]:
from methylbert.utils import set_seed

set_seed(42)
seq_len=150
n_mers=3
batch_size=5
num_workers=20
output_path="tmp/deconvolution/"
[2]:
from methylbert.data.vocab import MethylVocab
from methylbert.data.dataset import MethylBertFinetuneDataset
from torch.utils.data import DataLoader

tokenizer = MethylVocab(k=n_mers)
dataset = MethylBertFinetuneDataset("tmp/data.csv",
                                    tokenizer,
                                    seq_len=seq_len)
data_loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)

Building Vocab
Total number of sequences :  3070
# of reads in each label:  [3070.]

Load the fine-tuned MethylBERT model

load function in MethylBertFinetuneTrainer automatically detects config.json, pytorch_model.bin, dmr_encoder.pickle and read_classification_model.pickle files in the given directory and load the fine-tuned model.

[3]:
from methylbert.trainer import MethylBertFinetuneTrainer

restore_dir = "tmp/fine_tune/"
trainer = MethylBertFinetuneTrainer(len(tokenizer),
                                    train_dataloader=data_loader,
                                    test_dataloader=data_loader,
                                    )
trainer.load(restore_dir) # Load the fine-tuned MethylBERT model
No detected GPU device. Load the model on CPU
The model is loaded on CPU
Restore the pretrained model tmp/fine_tune/
Restore DMR encoder from tmp/fine_tune/dmr_encoder.pickle
Restore read classification FCN model from tmp/fine_tune/read_classification_model.pickle
Total Parameters: 32754130

Deconvolution

For the deconvolution, the training data as a pandas.DataFrame object is required for the maginal probability of cell types.

[4]:
import pandas as pd
from methylbert.deconvolute import deconvolute

deconvolute(trainer = trainer,
            tokenizer = tokenizer,
            data_loader = data_loader,
            output_path = output_path,
            df_train = pd.read_csv("tmp/train_seq.csv", sep="\t"))
100%|██████████| 614/614 [01:03<00:00,  9.73it/s]
Margins :  [0.5114678899082569, 0.48853211009174313]
                                       name flag ref_name    ref_pos  \
0    SRR10166000.9089788_9089788_length=151  147    chr10  131767360
1  SRR10165998.65829390_65829390_length=150  163     chr4   20254248
2  SRR10165467.85837758_85837758_length=151   99     chr4    1401206
3  SRR10165995.16747267_16747267_length=149   83     chr2  176945656
4  SRR10165995.46034072_46034072_length=151   99     chr4   20253524

  map_quality cigar next_ref_name next_ref_pos length  \
0          42  151M             =    131767187   -324
1          23  151M             =     20254343    244
2          40  151M             =      1401285    227
3          40  149M             =    176945572   -233
4          40  151M             =     20253771    398

                                                 seq  ...  \
0  GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG...  ...
1  GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT...  ...
2  AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG...  ...
3  AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC...  ...
4  TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT...  ...

                                     RG  \
0  diffuse_large_B_cell_lymphoma_test_8
1  diffuse_large_B_cell_lymphoma_test_8
2                Bcell_noncancer_test_8
3  diffuse_large_B_cell_lymphoma_test_8
4  diffuse_large_B_cell_lymphoma_test_8

                                             dna_seq  \
0  GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG...
1  GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT...
2  AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG...
3  GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC...
4  TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT...

                                          methyl_seq dmr_ctype dmr_label  \
0  2222222221222212222212222221222222222222222222...         T         5
1  2222222222222222222222222222222121222222212222...         T        19
2  2222222222222222222222222222220222222222222222...         T         2
3  2222222222222222222222212222222222222220222222...         T        12
4  2122222222222222222222222222122212221221222222...         T        19

  ctype pred n_cpg       P_N       P_T
0    NA    0    14  0.502180  0.497820
1    NA    0    10  0.515485  0.484515
2    NA    0     5  0.512614  0.487386
3    NA    0    10  0.505069  0.494931
4    NA    0    19  0.513832  0.486168

[5 rows x 27 columns]
100%|██████████| 10000/10000 [00:02<00:00, 3849.83it/s]

deconvolute function creates three files in the given directiory:

  1. decconvolution.csv : tumour deconvolution result

  2. FI.csv : the Fisher information value

  3. res.csv : read classification result (the classification result for each read is given in pred column)

[5]:
import os
pd.read_csv(os.path.join(output_path, "deconvolution.csv"), sep="\t").head()
[5]:
cell_type pred
0 T 0.0
1 N 1.0
[6]:
pd.read_csv(os.path.join(output_path, "FI.csv"), sep="\t").head()
[6]:
fi
0 4.243251
[7]:
pd.read_csv(os.path.join(output_path, "res.csv"), sep="\t").head()
[7]:
name flag ref_name ref_pos map_quality cigar next_ref_name next_ref_pos length seq ... XM XR PG RG dna_seq methyl_seq dmr_ctype dmr_label ctype pred
0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360 42 151M = 131767187 -324 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ... ........xZ.x..Z.x..xZ.....xZ.....x....x..hx...... GA MarkDuplicates-287B47C6 diffuse_large_B_cell_lymphoma_test_8 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... 2222222221222212222212222221222222222222222222... T 5 NaN 0
1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248 23 151M = 20254343 244 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ... H..............h......xh.h...x..Z.Zx.h..x.Zx..... GA MarkDuplicates-3DAAB091 diffuse_large_B_cell_lymphoma_test_8 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... 2222222222222222222222222222222121222222212222... T 19 NaN 0
2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206 40 151M = 1401285 227 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ... ...........x..h....hhh.h....hxz.hhhhh............ CT MarkDuplicates-36E4BA78 Bcell_noncancer_test_8 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... 2222222222222222222222222222220222222222222222... T 2 NaN 0
3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656 40 149M = 176945572 -233 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ... x...hh...hh.............Z.....h.........z.h...... CT MarkDuplicates-74536757 diffuse_large_B_cell_lymphoma_test_8 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... 2222222222222222222222212222222222222220222222... T 12 NaN 0
4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524 40 151M = 20253771 398 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ... .Z...h......................Z.hXZ...Z..Z....H.... CT MarkDuplicates-74536757 diffuse_large_B_cell_lymphoma_test_8 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... 2122222222222222222222222222122212221221222222... T 19 NaN 0

5 rows × 24 columns

Deconvolution with the estimate adjustment

MethylBERT supports the tumour purity estimation adjustment considering the different distribution of tumour-derived reads in DMRs.

You can turn adjustment option on for this.

[8]:
from methylbert.deconvolute import deconvolute
import pandas as pd

# Read classification
deconvolute(trainer = trainer,
            tokenizer = tokenizer,
            data_loader = data_loader,
            output_path = output_path,
            df_train = pd.read_csv("tmp/train_seq.csv", sep="\t"),
           adjustment = True) # tumour purity estimation adjustment
100%|██████████| 614/614 [00:56<00:00, 10.86it/s]
Margins :  [0.5114678899082569, 0.48853211009174313]
                                       name flag ref_name    ref_pos  \
0    SRR10166000.9089788_9089788_length=151  147    chr10  131767360
1  SRR10165998.65829390_65829390_length=150  163     chr4   20254248
2  SRR10165467.85837758_85837758_length=151   99     chr4    1401206
3  SRR10165995.16747267_16747267_length=149   83     chr2  176945656
4  SRR10165995.46034072_46034072_length=151   99     chr4   20253524

  map_quality cigar next_ref_name next_ref_pos length  \
0          42  151M             =    131767187   -324
1          23  151M             =     20254343    244
2          40  151M             =      1401285    227
3          40  149M             =    176945572   -233
4          40  151M             =     20253771    398

                                                 seq  ...  \
0  GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG...  ...
1  GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT...  ...
2  AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG...  ...
3  AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC...  ...
4  TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT...  ...

                                     RG  \
0  diffuse_large_B_cell_lymphoma_test_8
1  diffuse_large_B_cell_lymphoma_test_8
2                Bcell_noncancer_test_8
3  diffuse_large_B_cell_lymphoma_test_8
4  diffuse_large_B_cell_lymphoma_test_8

                                             dna_seq  \
0  GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG...
1  GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT...
2  AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG...
3  GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC...
4  TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT...

                                          methyl_seq dmr_ctype dmr_label  \
0  2222222221222212222212222221222222222222222222...         T         5
1  2222222222222222222222222222222121222222212222...         T        19
2  2222222222222222222222222222220222222222222222...         T         2
3  2222222222222222222222212222222222222220222222...         T        12
4  2122222222222222222222222222122212221221222222...         T        19

  ctype pred n_cpg       P_N       P_T
0    NA    0    14  0.502180  0.497820
1    NA    0    10  0.515485  0.484515
2    NA    0     5  0.512614  0.487386
3    NA    0    10  0.505069  0.494931
4    NA    0    19  0.513832  0.486168

[5 rows x 27 columns]
 10%|█         | 1/10 [00:00<00:00, 11.06it/s]

When the adjustment is applied, the Fisher information is calculated in each DMR.

[9]:
pd.read_csv(os.path.join(output_path, "FI.csv"), sep="\t").head()
[9]:
dmr_label fi
0 0 1.0
1 1 1.0
2 2 1.0
3 3 1.0
4 4 1.0