{ "cells": [ { "cell_type": "markdown", "id": "f48c7917-8d6c-47fd-9378-ace38e238cdc", "metadata": {}, "source": [ "# Tumour deconvolution with the fine-tuned `MethylBERT` model \n", "\n", "### Load the bulk data and the fine-tuned model\n", "\n", "Please load your preprocessed bulk data following the [tutorial](https://github.com/hanyangii/methylbert/blob/main/tutorials/03_Preprocessing_bulk_data.ipynb) into the `MethylBertFinetuneDataset` and `DataLoader`." ] }, { "cell_type": "code", "execution_count": 1, "id": "57ed608b-9a4a-42a7-b38a-7f1831f51d2c", "metadata": {}, "outputs": [], "source": [ "from methylbert.utils import set_seed\n", "\n", "set_seed(42)\n", "seq_len=150\n", "n_mers=3\n", "batch_size=5\n", "num_workers=20\n", "output_path=\"tmp/deconvolution/\"" ] }, { "cell_type": "code", "execution_count": 2, "id": "7543b0f3-943f-4989-9ca5-67a1c2a27ac1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Building Vocab\n", "Total number of sequences : 3070\n", "# of reads in each label: [3070.]\n" ] } ], "source": [ "from methylbert.data.vocab import MethylVocab\n", "from methylbert.data.dataset import MethylBertFinetuneDataset\n", "from torch.utils.data import DataLoader\n", "\n", "tokenizer = MethylVocab(k=n_mers)\n", "dataset = MethylBertFinetuneDataset(\"tmp/data.csv\", \n", " tokenizer, \n", " seq_len=seq_len)\n", "data_loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)\n" ] }, { "cell_type": "markdown", "id": "8179433f-55d6-4f04-8d23-8a816ebdad48", "metadata": {}, "source": [ "### Load the fine-tuned `MethylBERT` model\n", "\n", "`load` function in `MethylBertFinetuneTrainer` automatically detects `config.json`, `pytorch_model.bin`, `dmr_encoder.pickle` and `read_classification_model.pickle` files in the given directory and load the fine-tuned model. " ] }, { "cell_type": "code", "execution_count": 3, "id": "d732913f-01be-4bae-8064-aecdcd7b63d4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No detected GPU device. Load the model on CPU\n", "The model is loaded on CPU\n", "Restore the pretrained model tmp/fine_tune/\n", "Restore DMR encoder from tmp/fine_tune/dmr_encoder.pickle\n", "Restore read classification FCN model from tmp/fine_tune/read_classification_model.pickle\n", "Total Parameters: 32754130\n" ] } ], "source": [ "from methylbert.trainer import MethylBertFinetuneTrainer\n", "\n", "restore_dir = \"tmp/fine_tune/\"\n", "trainer = MethylBertFinetuneTrainer(len(tokenizer), \n", " train_dataloader=data_loader, \n", " test_dataloader=data_loader,\n", " )\n", "trainer.load(restore_dir) # Load the fine-tuned MethylBERT model" ] }, { "cell_type": "markdown", "id": "7aef54f0-8565-4e89-85bf-b42aa2b64206", "metadata": {}, "source": [ "### Deconvolution\n", "For the deconvolution, the training data as a `pandas.DataFrame` object is required for the maginal probability of cell types. " ] }, { "cell_type": "code", "execution_count": 4, "id": "270cb5c1-da7a-4cc3-812f-20c17643c230", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 614/614 [01:03<00:00, 9.73it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Margins : [0.5114678899082569, 0.48853211009174313]\n", " name flag ref_name ref_pos \\\n", "0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360 \n", "1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248 \n", "2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206 \n", "3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656 \n", "4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524 \n", "\n", " map_quality cigar next_ref_name next_ref_pos length \\\n", "0 42 151M = 131767187 -324 \n", "1 23 151M = 20254343 244 \n", "2 40 151M = 1401285 227 \n", "3 40 149M = 176945572 -233 \n", "4 40 151M = 20253771 398 \n", "\n", " seq ... \\\n", "0 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ... \n", "1 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ... \n", "2 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ... \n", "3 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ... \n", "4 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ... \n", "\n", " RG \\\n", "0 diffuse_large_B_cell_lymphoma_test_8 \n", "1 diffuse_large_B_cell_lymphoma_test_8 \n", "2 Bcell_noncancer_test_8 \n", "3 diffuse_large_B_cell_lymphoma_test_8 \n", "4 diffuse_large_B_cell_lymphoma_test_8 \n", "\n", " dna_seq \\\n", "0 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... \n", "1 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... \n", "2 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... \n", "3 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... \n", "4 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... \n", "\n", " methyl_seq dmr_ctype dmr_label \\\n", "0 2222222221222212222212222221222222222222222222... T 5 \n", "1 2222222222222222222222222222222121222222212222... T 19 \n", "2 2222222222222222222222222222220222222222222222... T 2 \n", "3 2222222222222222222222212222222222222220222222... T 12 \n", "4 2122222222222222222222222222122212221221222222... T 19 \n", "\n", " ctype pred n_cpg P_N P_T \n", "0 NA 0 14 0.502180 0.497820 \n", "1 NA 0 10 0.515485 0.484515 \n", "2 NA 0 5 0.512614 0.487386 \n", "3 NA 0 10 0.505069 0.494931 \n", "4 NA 0 19 0.513832 0.486168 \n", "\n", "[5 rows x 27 columns]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 10000/10000 [00:02<00:00, 3849.83it/s]\n" ] } ], "source": [ "import pandas as pd\n", "from methylbert.deconvolute import deconvolute\n", "\n", "deconvolute(trainer = trainer,\n", " tokenizer = tokenizer, \n", " data_loader = data_loader,\n", " output_path = output_path,\n", " df_train = pd.read_csv(\"tmp/train_seq.csv\", sep=\"\\t\"))" ] }, { "cell_type": "markdown", "id": "b43e4135-40d0-4b44-9518-e4c909cf62fe", "metadata": {}, "source": [ "`deconvolute` function creates three files in the given directiory:\n", "1. `decconvolution.csv` : tumour deconvolution result\n", "2. `FI.csv` : the Fisher information value\n", "3. `res.csv` : read classification result (the classification result for each read is given in `pred` column)" ] }, { "cell_type": "code", "execution_count": 5, "id": "6225f768-020e-4819-9fac-2c2099ba4a91", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cell_typepred
0T0.0
1N1.0
\n", "
" ], "text/plain": [ " cell_type pred\n", "0 T 0.0\n", "1 N 1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "pd.read_csv(os.path.join(output_path, \"deconvolution.csv\"), sep=\"\\t\").head()" ] }, { "cell_type": "code", "execution_count": 6, "id": "6647712f-1d4a-4ba3-b770-83ec04e12fad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fi
04.243251
\n", "
" ], "text/plain": [ " fi\n", "0 4.243251" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(os.path.join(output_path, \"FI.csv\"), sep=\"\\t\").head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "dbf770b9-8cbc-476e-9603-0770ecbed7b5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameflagref_nameref_posmap_qualitycigarnext_ref_namenext_ref_poslengthseq...XMXRPGRGdna_seqmethyl_seqdmr_ctypedmr_labelctypepred
0SRR10166000.9089788_9089788_length=151147chr1013176736042151M=131767187-324GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG..............xZ.x..Z.x..xZ.....xZ.....x....x..hx......GAMarkDuplicates-287B47C6diffuse_large_B_cell_lymphoma_test_8GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG...2222222221222212222212222221222222222222222222...T5NaN0
1SRR10165998.65829390_65829390_length=150163chr42025424823151M=20254343244GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT......H..............h......xh.h...x..Z.Zx.h..x.Zx.....GAMarkDuplicates-3DAAB091diffuse_large_B_cell_lymphoma_test_8GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT...2222222222222222222222222222222121222222212222...T19NaN0
2SRR10165467.85837758_85837758_length=15199chr4140120640151M=1401285227AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG.................x..h....hhh.h....hxz.hhhhh............CTMarkDuplicates-36E4BA78Bcell_noncancer_test_8AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG...2222222222222222222222222222220222222222222222...T2NaN0
3SRR10165995.16747267_16747267_length=14983chr217694565640149M=176945572-233AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC......x...hh...hh.............Z.....h.........z.h......CTMarkDuplicates-74536757diffuse_large_B_cell_lymphoma_test_8GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC...2222222222222222222222212222222222222220222222...T12NaN0
4SRR10165995.46034072_46034072_length=15199chr42025352440151M=20253771398TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT.......Z...h......................Z.hXZ...Z..Z....H....CTMarkDuplicates-74536757diffuse_large_B_cell_lymphoma_test_8TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT...2122222222222222222222222222122212221221222222...T19NaN0
\n", "

5 rows × 24 columns

\n", "
" ], "text/plain": [ " name flag ref_name ref_pos \\\n", "0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360 \n", "1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248 \n", "2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206 \n", "3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656 \n", "4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524 \n", "\n", " map_quality cigar next_ref_name next_ref_pos length \\\n", "0 42 151M = 131767187 -324 \n", "1 23 151M = 20254343 244 \n", "2 40 151M = 1401285 227 \n", "3 40 149M = 176945572 -233 \n", "4 40 151M = 20253771 398 \n", "\n", " seq ... \\\n", "0 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ... \n", "1 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ... \n", "2 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ... \n", "3 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ... \n", "4 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ... \n", "\n", " XM XR \\\n", "0 ........xZ.x..Z.x..xZ.....xZ.....x....x..hx...... GA \n", "1 H..............h......xh.h...x..Z.Zx.h..x.Zx..... GA \n", "2 ...........x..h....hhh.h....hxz.hhhhh............ CT \n", "3 x...hh...hh.............Z.....h.........z.h...... CT \n", "4 .Z...h......................Z.hXZ...Z..Z....H.... CT \n", "\n", " PG RG \\\n", "0 MarkDuplicates-287B47C6 diffuse_large_B_cell_lymphoma_test_8 \n", "1 MarkDuplicates-3DAAB091 diffuse_large_B_cell_lymphoma_test_8 \n", "2 MarkDuplicates-36E4BA78 Bcell_noncancer_test_8 \n", "3 MarkDuplicates-74536757 diffuse_large_B_cell_lymphoma_test_8 \n", "4 MarkDuplicates-74536757 diffuse_large_B_cell_lymphoma_test_8 \n", "\n", " dna_seq \\\n", "0 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... \n", "1 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... \n", "2 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... \n", "3 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... \n", "4 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... \n", "\n", " methyl_seq dmr_ctype dmr_label \\\n", "0 2222222221222212222212222221222222222222222222... T 5 \n", "1 2222222222222222222222222222222121222222212222... T 19 \n", "2 2222222222222222222222222222220222222222222222... T 2 \n", "3 2222222222222222222222212222222222222220222222... T 12 \n", "4 2122222222222222222222222222122212221221222222... T 19 \n", "\n", " ctype pred \n", "0 NaN 0 \n", "1 NaN 0 \n", "2 NaN 0 \n", "3 NaN 0 \n", "4 NaN 0 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(os.path.join(output_path, \"res.csv\"), sep=\"\\t\").head()" ] }, { "cell_type": "markdown", "id": "9ba25384-4d34-421f-9086-ac5b98088cc4", "metadata": {}, "source": [ "### Deconvolution with the estimate adjustment\n", "\n", "_MethylBERT_ supports the tumour purity estimation adjustment considering the different distribution of tumour-derived reads in DMRs. \n", "\n", "You can turn `adjustment` option on for this. " ] }, { "cell_type": "code", "execution_count": 8, "id": "9345dd97-63c5-4e79-a52d-1786ae1faaa3", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 614/614 [00:56<00:00, 10.86it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Margins : [0.5114678899082569, 0.48853211009174313]\n", " name flag ref_name ref_pos \\\n", "0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360 \n", "1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248 \n", "2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206 \n", "3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656 \n", "4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524 \n", "\n", " map_quality cigar next_ref_name next_ref_pos length \\\n", "0 42 151M = 131767187 -324 \n", "1 23 151M = 20254343 244 \n", "2 40 151M = 1401285 227 \n", "3 40 149M = 176945572 -233 \n", "4 40 151M = 20253771 398 \n", "\n", " seq ... \\\n", "0 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ... \n", "1 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ... \n", "2 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ... \n", "3 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ... \n", "4 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ... \n", "\n", " RG \\\n", "0 diffuse_large_B_cell_lymphoma_test_8 \n", "1 diffuse_large_B_cell_lymphoma_test_8 \n", "2 Bcell_noncancer_test_8 \n", "3 diffuse_large_B_cell_lymphoma_test_8 \n", "4 diffuse_large_B_cell_lymphoma_test_8 \n", "\n", " dna_seq \\\n", "0 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... \n", "1 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... \n", "2 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... \n", "3 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... \n", "4 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... \n", "\n", " methyl_seq dmr_ctype dmr_label \\\n", "0 2222222221222212222212222221222222222222222222... T 5 \n", "1 2222222222222222222222222222222121222222212222... T 19 \n", "2 2222222222222222222222222222220222222222222222... T 2 \n", "3 2222222222222222222222212222222222222220222222... T 12 \n", "4 2122222222222222222222222222122212221221222222... T 19 \n", "\n", " ctype pred n_cpg P_N P_T \n", "0 NA 0 14 0.502180 0.497820 \n", "1 NA 0 10 0.515485 0.484515 \n", "2 NA 0 5 0.512614 0.487386 \n", "3 NA 0 10 0.505069 0.494931 \n", "4 NA 0 19 0.513832 0.486168 \n", "\n", "[5 rows x 27 columns]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 10%|█ | 1/10 [00:00<00:00, 11.06it/s]\n" ] } ], "source": [ "from methylbert.deconvolute import deconvolute\n", "import pandas as pd\n", "\n", "# Read classification\n", "deconvolute(trainer = trainer,\n", " tokenizer = tokenizer, \n", " data_loader = data_loader,\n", " output_path = output_path,\n", " df_train = pd.read_csv(\"tmp/train_seq.csv\", sep=\"\\t\"),\n", " adjustment = True) # tumour purity estimation adjustment" ] }, { "cell_type": "markdown", "id": "66ce345d-1acf-488f-a728-66b773b22d0f", "metadata": {}, "source": [ "When the adjustment is applied, the Fisher information is calculated in each DMR. " ] }, { "cell_type": "code", "execution_count": 9, "id": "15d5c16c-64c7-4949-a4b9-9febb8ce6bb8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dmr_labelfi
001.0
111.0
221.0
331.0
441.0
\n", "
" ], "text/plain": [ " dmr_label fi\n", "0 0 1.0\n", "1 1 1.0\n", "2 2 1.0\n", "3 3 1.0\n", "4 4 1.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(os.path.join(output_path, \"FI.csv\"), sep=\"\\t\").head()" ] } ], "metadata": { "kernelspec": { "display_name": "dnabert", "language": "python", "name": "dnabert" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }