Tumour deconvolution with the fine-tuned MethylBERT model
Load the bulk data and the fine-tuned model
Please load your preprocessed bulk data following the tutorial into the MethylBertFinetuneDataset and DataLoader.
[1]:
from methylbert.utils import set_seed
set_seed(42)
seq_len=150
n_mers=3
batch_size=5
num_workers=20
output_path="tmp/deconvolution/"
[2]:
from methylbert.data.vocab import MethylVocab
from methylbert.data.dataset import MethylBertFinetuneDataset
from torch.utils.data import DataLoader
tokenizer = MethylVocab(k=n_mers)
dataset = MethylBertFinetuneDataset("tmp/data.csv",
tokenizer,
seq_len=seq_len)
data_loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
Building Vocab
Total number of sequences : 3070
# of reads in each label: [3070.]
Load the fine-tuned MethylBERT model
load function in MethylBertFinetuneTrainer automatically detects config.json, pytorch_model.bin, dmr_encoder.pickle and read_classification_model.pickle files in the given directory and load the fine-tuned model.
[3]:
from methylbert.trainer import MethylBertFinetuneTrainer
restore_dir = "tmp/fine_tune/"
trainer = MethylBertFinetuneTrainer(len(tokenizer),
train_dataloader=data_loader,
test_dataloader=data_loader,
)
trainer.load(restore_dir) # Load the fine-tuned MethylBERT model
No detected GPU device. Load the model on CPU
The model is loaded on CPU
Restore the pretrained model tmp/fine_tune/
Restore DMR encoder from tmp/fine_tune/dmr_encoder.pickle
Restore read classification FCN model from tmp/fine_tune/read_classification_model.pickle
Total Parameters: 32754130
Deconvolution
For the deconvolution, the training data as a pandas.DataFrame object is required for the maginal probability of cell types.
[4]:
import pandas as pd
from methylbert.deconvolute import deconvolute
deconvolute(trainer = trainer,
tokenizer = tokenizer,
data_loader = data_loader,
output_path = output_path,
df_train = pd.read_csv("tmp/train_seq.csv", sep="\t"))
100%|██████████| 614/614 [01:03<00:00, 9.73it/s]
Margins : [0.5114678899082569, 0.48853211009174313]
name flag ref_name ref_pos \
0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360
1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248
2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206
3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656
4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524
map_quality cigar next_ref_name next_ref_pos length \
0 42 151M = 131767187 -324
1 23 151M = 20254343 244
2 40 151M = 1401285 227
3 40 149M = 176945572 -233
4 40 151M = 20253771 398
seq ... \
0 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ...
1 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ...
2 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ...
3 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ...
4 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ...
RG \
0 diffuse_large_B_cell_lymphoma_test_8
1 diffuse_large_B_cell_lymphoma_test_8
2 Bcell_noncancer_test_8
3 diffuse_large_B_cell_lymphoma_test_8
4 diffuse_large_B_cell_lymphoma_test_8
dna_seq \
0 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG...
1 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT...
2 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG...
3 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC...
4 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT...
methyl_seq dmr_ctype dmr_label \
0 2222222221222212222212222221222222222222222222... T 5
1 2222222222222222222222222222222121222222212222... T 19
2 2222222222222222222222222222220222222222222222... T 2
3 2222222222222222222222212222222222222220222222... T 12
4 2122222222222222222222222222122212221221222222... T 19
ctype pred n_cpg P_N P_T
0 NA 0 14 0.502180 0.497820
1 NA 0 10 0.515485 0.484515
2 NA 0 5 0.512614 0.487386
3 NA 0 10 0.505069 0.494931
4 NA 0 19 0.513832 0.486168
[5 rows x 27 columns]
100%|██████████| 10000/10000 [00:02<00:00, 3849.83it/s]
deconvolute function creates three files in the given directiory:
decconvolution.csv: tumour deconvolution resultFI.csv: the Fisher information valueres.csv: read classification result (the classification result for each read is given inpredcolumn)
[5]:
import os
pd.read_csv(os.path.join(output_path, "deconvolution.csv"), sep="\t").head()
[5]:
| cell_type | pred | |
|---|---|---|
| 0 | T | 0.0 |
| 1 | N | 1.0 |
[6]:
pd.read_csv(os.path.join(output_path, "FI.csv"), sep="\t").head()
[6]:
| fi | |
|---|---|
| 0 | 4.243251 |
[7]:
pd.read_csv(os.path.join(output_path, "res.csv"), sep="\t").head()
[7]:
| name | flag | ref_name | ref_pos | map_quality | cigar | next_ref_name | next_ref_pos | length | seq | ... | XM | XR | PG | RG | dna_seq | methyl_seq | dmr_ctype | dmr_label | ctype | pred | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SRR10166000.9089788_9089788_length=151 | 147 | chr10 | 131767360 | 42 | 151M | = | 131767187 | -324 | GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... | ... | ........xZ.x..Z.x..xZ.....xZ.....x....x..hx...... | GA | MarkDuplicates-287B47C6 | diffuse_large_B_cell_lymphoma_test_8 | GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... | 2222222221222212222212222221222222222222222222... | T | 5 | NaN | 0 |
| 1 | SRR10165998.65829390_65829390_length=150 | 163 | chr4 | 20254248 | 23 | 151M | = | 20254343 | 244 | GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... | ... | H..............h......xh.h...x..Z.Zx.h..x.Zx..... | GA | MarkDuplicates-3DAAB091 | diffuse_large_B_cell_lymphoma_test_8 | GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... | 2222222222222222222222222222222121222222212222... | T | 19 | NaN | 0 |
| 2 | SRR10165467.85837758_85837758_length=151 | 99 | chr4 | 1401206 | 40 | 151M | = | 1401285 | 227 | AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... | ... | ...........x..h....hhh.h....hxz.hhhhh............ | CT | MarkDuplicates-36E4BA78 | Bcell_noncancer_test_8 | AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... | 2222222222222222222222222222220222222222222222... | T | 2 | NaN | 0 |
| 3 | SRR10165995.16747267_16747267_length=149 | 83 | chr2 | 176945656 | 40 | 149M | = | 176945572 | -233 | AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... | ... | x...hh...hh.............Z.....h.........z.h...... | CT | MarkDuplicates-74536757 | diffuse_large_B_cell_lymphoma_test_8 | GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... | 2222222222222222222222212222222222222220222222... | T | 12 | NaN | 0 |
| 4 | SRR10165995.46034072_46034072_length=151 | 99 | chr4 | 20253524 | 40 | 151M | = | 20253771 | 398 | TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... | ... | .Z...h......................Z.hXZ...Z..Z....H.... | CT | MarkDuplicates-74536757 | diffuse_large_B_cell_lymphoma_test_8 | TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... | 2122222222222222222222222222122212221221222222... | T | 19 | NaN | 0 |
5 rows × 24 columns
Deconvolution with the estimate adjustment
MethylBERT supports the tumour purity estimation adjustment considering the different distribution of tumour-derived reads in DMRs.
You can turn adjustment option on for this.
[8]:
from methylbert.deconvolute import deconvolute
import pandas as pd
# Read classification
deconvolute(trainer = trainer,
tokenizer = tokenizer,
data_loader = data_loader,
output_path = output_path,
df_train = pd.read_csv("tmp/train_seq.csv", sep="\t"),
adjustment = True) # tumour purity estimation adjustment
100%|██████████| 614/614 [00:56<00:00, 10.86it/s]
Margins : [0.5114678899082569, 0.48853211009174313]
name flag ref_name ref_pos \
0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360
1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248
2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206
3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656
4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524
map_quality cigar next_ref_name next_ref_pos length \
0 42 151M = 131767187 -324
1 23 151M = 20254343 244
2 40 151M = 1401285 227
3 40 149M = 176945572 -233
4 40 151M = 20253771 398
seq ... \
0 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ...
1 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ...
2 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ...
3 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ...
4 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ...
RG \
0 diffuse_large_B_cell_lymphoma_test_8
1 diffuse_large_B_cell_lymphoma_test_8
2 Bcell_noncancer_test_8
3 diffuse_large_B_cell_lymphoma_test_8
4 diffuse_large_B_cell_lymphoma_test_8
dna_seq \
0 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG...
1 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT...
2 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG...
3 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC...
4 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT...
methyl_seq dmr_ctype dmr_label \
0 2222222221222212222212222221222222222222222222... T 5
1 2222222222222222222222222222222121222222212222... T 19
2 2222222222222222222222222222220222222222222222... T 2
3 2222222222222222222222212222222222222220222222... T 12
4 2122222222222222222222222222122212221221222222... T 19
ctype pred n_cpg P_N P_T
0 NA 0 14 0.502180 0.497820
1 NA 0 10 0.515485 0.484515
2 NA 0 5 0.512614 0.487386
3 NA 0 10 0.505069 0.494931
4 NA 0 19 0.513832 0.486168
[5 rows x 27 columns]
10%|█ | 1/10 [00:00<00:00, 11.06it/s]
When the adjustment is applied, the Fisher information is calculated in each DMR.
[9]:
pd.read_csv(os.path.join(output_path, "FI.csv"), sep="\t").head()
[9]:
| dmr_label | fi | |
|---|---|---|
| 0 | 0 | 1.0 |
| 1 | 1 | 1.0 |
| 2 | 2 | 1.0 |
| 3 | 3 | 1.0 |
| 4 | 4 | 1.0 |