Tumour deconvolution with the fine-tuned `MethylBERT` model

Load the bulk data and the fine-tuned model

Please load your preprocessed bulk data following the tutorial into the MethylBertFinetuneDataset and DataLoader.

[1]:

from methylbert.utils import set_seed

set_seed(42)
seq_len=150
n_mers=3
batch_size=5
num_workers=20
output_path="tmp/deconvolution/"

[2]:

from methylbert.data.vocab import MethylVocab
from methylbert.data.dataset import MethylBertFinetuneDataset
from torch.utils.data import DataLoader

tokenizer = MethylVocab(k=n_mers)
dataset = MethylBertFinetuneDataset("tmp/data.csv",
                                    tokenizer,
                                    seq_len=seq_len)
data_loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)

Building Vocab
Total number of sequences :  3070
# of reads in each label:  [3070.]

Load the fine-tuned `MethylBERT` model

load function in MethylBertFinetuneTrainer automatically detects config.json, pytorch_model.bin, dmr_encoder.pickle and read_classification_model.pickle files in the given directory and load the fine-tuned model.

[3]:

from methylbert.trainer import MethylBertFinetuneTrainer

restore_dir = "tmp/fine_tune/"
trainer = MethylBertFinetuneTrainer(len(tokenizer),
                                    train_dataloader=data_loader,
                                    test_dataloader=data_loader,
                                    )
trainer.load(restore_dir) # Load the fine-tuned MethylBERT model

No detected GPU device. Load the model on CPU
The model is loaded on CPU
Restore the pretrained model tmp/fine_tune/
Restore DMR encoder from tmp/fine_tune/dmr_encoder.pickle
Restore read classification FCN model from tmp/fine_tune/read_classification_model.pickle
Total Parameters: 32754130

Deconvolution

For the deconvolution, the training data as a pandas.DataFrame object is required for the maginal probability of cell types.

[4]:

import pandas as pd
from methylbert.deconvolute import deconvolute

deconvolute(trainer = trainer,
            tokenizer = tokenizer,
            data_loader = data_loader,
            output_path = output_path,
            df_train = pd.read_csv("tmp/train_seq.csv", sep="\t"))

100%|██████████| 614/614 [01:03<00:00,  9.73it/s]

Margins :  [0.5114678899082569, 0.48853211009174313]
                                       name flag ref_name    ref_pos  \
0    SRR10166000.9089788_9089788_length=151  147    chr10  131767360
1  SRR10165998.65829390_65829390_length=150  163     chr4   20254248
2  SRR10165467.85837758_85837758_length=151   99     chr4    1401206
3  SRR10165995.16747267_16747267_length=149   83     chr2  176945656
4  SRR10165995.46034072_46034072_length=151   99     chr4   20253524

  map_quality cigar next_ref_name next_ref_pos length  \
0          42  151M             =    131767187   -324
1          23  151M             =     20254343    244
2          40  151M             =      1401285    227
3          40  149M             =    176945572   -233
4          40  151M             =     20253771    398

                                                 seq  ...  \
0  GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG...  ...
1  GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT...  ...
2  AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG...  ...
3  AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC...  ...
4  TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT...  ...

                                     RG  \
0  diffuse_large_B_cell_lymphoma_test_8
1  diffuse_large_B_cell_lymphoma_test_8
2                Bcell_noncancer_test_8
3  diffuse_large_B_cell_lymphoma_test_8
4  diffuse_large_B_cell_lymphoma_test_8

                                             dna_seq  \
0  GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG...
1  GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT...
2  AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG...
3  GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC...
4  TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT...

                                          methyl_seq dmr_ctype dmr_label  \
0  2222222221222212222212222221222222222222222222...         T         5
1  2222222222222222222222222222222121222222212222...         T        19
2  2222222222222222222222222222220222222222222222...         T         2
3  2222222222222222222222212222222222222220222222...         T        12
4  2122222222222222222222222222122212221221222222...         T        19

  ctype pred n_cpg       P_N       P_T
0    NA    0    14  0.502180  0.497820
1    NA    0    10  0.515485  0.484515
2    NA    0     5  0.512614  0.487386
3    NA    0    10  0.505069  0.494931
4    NA    0    19  0.513832  0.486168

[5 rows x 27 columns]

100%|██████████| 10000/10000 [00:02<00:00, 3849.83it/s]

deconvolute function creates three files in the given directiory:

decconvolution.csv : tumour deconvolution result
FI.csv : the Fisher information value
res.csv : read classification result (the classification result for each read is given in pred column)

[5]:

import os
pd.read_csv(os.path.join(output_path, "deconvolution.csv"), sep="\t").head()

[5]:

	cell_type	pred
0	T	0.0
1	N	1.0

[6]:

pd.read_csv(os.path.join(output_path, "FI.csv"), sep="\t").head()

[6]:

	fi
0	4.243251

[7]:

pd.read_csv(os.path.join(output_path, "res.csv"), sep="\t").head()

[7]:

	name	flag	ref_name	ref_pos	map_quality	cigar	next_ref_name	next_ref_pos	length	seq	...	XM	XR	PG	RG	dna_seq	methyl_seq	dmr_ctype	dmr_label	ctype
0	SRR10166000.9089788_9089788_length=151	147	chr10	131767360	42	151M	=	131767187	-324	GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG...	...	........xZ.x..Z.x..xZ.....xZ.....x....x..hx......	GA	MarkDuplicates-287B47C6	diffuse_large_B_cell_lymphoma_test_8	GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG...	2222222221222212222212222221222222222222222222...	T	5	NaN
1	SRR10165998.65829390_65829390_length=150	163	chr4	20254248	23	151M	=	20254343	244	GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT...	...	H..............h......xh.h...x..Z.Zx.h..x.Zx.....	GA	MarkDuplicates-3DAAB091	diffuse_large_B_cell_lymphoma_test_8	GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT...	2222222222222222222222222222222121222222212222...	T	19	NaN
2	SRR10165467.85837758_85837758_length=151	99	chr4	1401206	40	151M	=	1401285	227	AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG...	...	...........x..h....hhh.h....hxz.hhhhh............	CT	MarkDuplicates-36E4BA78	Bcell_noncancer_test_8	AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG...	2222222222222222222222222222220222222222222222...	T	2	NaN
3	SRR10165995.16747267_16747267_length=149	83	chr2	176945656	40	149M	=	176945572	-233	AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC...	...	x...hh...hh.............Z.....h.........z.h......	CT	MarkDuplicates-74536757	diffuse_large_B_cell_lymphoma_test_8	GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC...	2222222222222222222222212222222222222220222222...	T	12	NaN
4	SRR10165995.46034072_46034072_length=151	99	chr4	20253524	40	151M	=	20253771	398	TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT...	...	.Z...h......................Z.hXZ...Z..Z....H....	CT	MarkDuplicates-74536757	diffuse_large_B_cell_lymphoma_test_8	TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT...	2122222222222222222222222222122212221221222222...	T	19	NaN

5 rows × 24 columns

Deconvolution with the estimate adjustment

MethylBERT supports the tumour purity estimation adjustment considering the different distribution of tumour-derived reads in DMRs.

You can turn adjustment option on for this.

[8]:

from methylbert.deconvolute import deconvolute
import pandas as pd

# Read classification
deconvolute(trainer = trainer,
            tokenizer = tokenizer,
            data_loader = data_loader,
            output_path = output_path,
            df_train = pd.read_csv("tmp/train_seq.csv", sep="\t"),
           adjustment = True) # tumour purity estimation adjustment

100%|██████████| 614/614 [00:56<00:00, 10.86it/s]

Margins :  [0.5114678899082569, 0.48853211009174313]
                                       name flag ref_name    ref_pos  \
0    SRR10166000.9089788_9089788_length=151  147    chr10  131767360
1  SRR10165998.65829390_65829390_length=150  163     chr4   20254248
2  SRR10165467.85837758_85837758_length=151   99     chr4    1401206
3  SRR10165995.16747267_16747267_length=149   83     chr2  176945656
4  SRR10165995.46034072_46034072_length=151   99     chr4   20253524

  map_quality cigar next_ref_name next_ref_pos length  \
0          42  151M             =    131767187   -324
1          23  151M             =     20254343    244
2          40  151M             =      1401285    227
3          40  149M             =    176945572   -233
4          40  151M             =     20253771    398

                                                 seq  ...  \
0  GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG...  ...
1  GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT...  ...
2  AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG...  ...
3  AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC...  ...
4  TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT...  ...

                                     RG  \
0  diffuse_large_B_cell_lymphoma_test_8
1  diffuse_large_B_cell_lymphoma_test_8
2                Bcell_noncancer_test_8
3  diffuse_large_B_cell_lymphoma_test_8
4  diffuse_large_B_cell_lymphoma_test_8

                                             dna_seq  \
0  GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG...
1  GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT...
2  AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG...
3  GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC...
4  TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT...

                                          methyl_seq dmr_ctype dmr_label  \
0  2222222221222212222212222221222222222222222222...         T         5
1  2222222222222222222222222222222121222222212222...         T        19
2  2222222222222222222222222222220222222222222222...         T         2
3  2222222222222222222222212222222222222220222222...         T        12
4  2122222222222222222222222222122212221221222222...         T        19

  ctype pred n_cpg       P_N       P_T
0    NA    0    14  0.502180  0.497820
1    NA    0    10  0.515485  0.484515
2    NA    0     5  0.512614  0.487386
3    NA    0    10  0.505069  0.494931
4    NA    0    19  0.513832  0.486168

[5 rows x 27 columns]

 10%|█         | 1/10 [00:00<00:00, 11.06it/s]

When the adjustment is applied, the Fisher information is calculated in each DMR.

[9]:

pd.read_csv(os.path.join(output_path, "FI.csv"), sep="\t").head()

[9]:

	dmr_label	fi
0	0	1.0
1	1	1.0
2	2	1.0
3	3	1.0
4	4	1.0

Tumour deconvolution with the fine-tuned MethylBERT model

Load the bulk data and the fine-tuned model

Load the fine-tuned MethylBERT model

Deconvolution

Deconvolution with the estimate adjustment

Tumour deconvolution with the fine-tuned `MethylBERT` model

Load the fine-tuned `MethylBERT` model