{
"cells": [
{
"cell_type": "markdown",
"id": "f48c7917-8d6c-47fd-9378-ace38e238cdc",
"metadata": {},
"source": [
"# Tumour deconvolution with the fine-tuned `MethylBERT` model \n",
"\n",
"### Load the bulk data and the fine-tuned model\n",
"\n",
"Please load your preprocessed bulk data following the [tutorial](https://github.com/hanyangii/methylbert/blob/main/tutorials/03_Preprocessing_bulk_data.ipynb) into the `MethylBertFinetuneDataset` and `DataLoader`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "57ed608b-9a4a-42a7-b38a-7f1831f51d2c",
"metadata": {},
"outputs": [],
"source": [
"from methylbert.utils import set_seed\n",
"\n",
"set_seed(42)\n",
"seq_len=150\n",
"n_mers=3\n",
"batch_size=5\n",
"num_workers=20\n",
"output_path=\"tmp/deconvolution/\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7543b0f3-943f-4989-9ca5-67a1c2a27ac1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Building Vocab\n",
"Total number of sequences : 3070\n",
"# of reads in each label: [3070.]\n"
]
}
],
"source": [
"from methylbert.data.vocab import MethylVocab\n",
"from methylbert.data.dataset import MethylBertFinetuneDataset\n",
"from torch.utils.data import DataLoader\n",
"\n",
"tokenizer = MethylVocab(k=n_mers)\n",
"dataset = MethylBertFinetuneDataset(\"tmp/data.csv\", \n",
" tokenizer, \n",
" seq_len=seq_len)\n",
"data_loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)\n"
]
},
{
"cell_type": "markdown",
"id": "8179433f-55d6-4f04-8d23-8a816ebdad48",
"metadata": {},
"source": [
"### Load the fine-tuned `MethylBERT` model\n",
"\n",
"`load` function in `MethylBertFinetuneTrainer` automatically detects `config.json`, `pytorch_model.bin`, `dmr_encoder.pickle` and `read_classification_model.pickle` files in the given directory and load the fine-tuned model. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d732913f-01be-4bae-8064-aecdcd7b63d4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No detected GPU device. Load the model on CPU\n",
"The model is loaded on CPU\n",
"Restore the pretrained model tmp/fine_tune/\n",
"Restore DMR encoder from tmp/fine_tune/dmr_encoder.pickle\n",
"Restore read classification FCN model from tmp/fine_tune/read_classification_model.pickle\n",
"Total Parameters: 32754130\n"
]
}
],
"source": [
"from methylbert.trainer import MethylBertFinetuneTrainer\n",
"\n",
"restore_dir = \"tmp/fine_tune/\"\n",
"trainer = MethylBertFinetuneTrainer(len(tokenizer), \n",
" train_dataloader=data_loader, \n",
" test_dataloader=data_loader,\n",
" )\n",
"trainer.load(restore_dir) # Load the fine-tuned MethylBERT model"
]
},
{
"cell_type": "markdown",
"id": "7aef54f0-8565-4e89-85bf-b42aa2b64206",
"metadata": {},
"source": [
"### Deconvolution\n",
"For the deconvolution, the training data as a `pandas.DataFrame` object is required for the maginal probability of cell types. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "270cb5c1-da7a-4cc3-812f-20c17643c230",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 614/614 [01:03<00:00, 9.73it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Margins : [0.5114678899082569, 0.48853211009174313]\n",
" name flag ref_name ref_pos \\\n",
"0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360 \n",
"1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248 \n",
"2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206 \n",
"3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656 \n",
"4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524 \n",
"\n",
" map_quality cigar next_ref_name next_ref_pos length \\\n",
"0 42 151M = 131767187 -324 \n",
"1 23 151M = 20254343 244 \n",
"2 40 151M = 1401285 227 \n",
"3 40 149M = 176945572 -233 \n",
"4 40 151M = 20253771 398 \n",
"\n",
" seq ... \\\n",
"0 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ... \n",
"1 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ... \n",
"2 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ... \n",
"3 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ... \n",
"4 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ... \n",
"\n",
" RG \\\n",
"0 diffuse_large_B_cell_lymphoma_test_8 \n",
"1 diffuse_large_B_cell_lymphoma_test_8 \n",
"2 Bcell_noncancer_test_8 \n",
"3 diffuse_large_B_cell_lymphoma_test_8 \n",
"4 diffuse_large_B_cell_lymphoma_test_8 \n",
"\n",
" dna_seq \\\n",
"0 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... \n",
"1 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... \n",
"2 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... \n",
"3 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... \n",
"4 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... \n",
"\n",
" methyl_seq dmr_ctype dmr_label \\\n",
"0 2222222221222212222212222221222222222222222222... T 5 \n",
"1 2222222222222222222222222222222121222222212222... T 19 \n",
"2 2222222222222222222222222222220222222222222222... T 2 \n",
"3 2222222222222222222222212222222222222220222222... T 12 \n",
"4 2122222222222222222222222222122212221221222222... T 19 \n",
"\n",
" ctype pred n_cpg P_N P_T \n",
"0 NA 0 14 0.502180 0.497820 \n",
"1 NA 0 10 0.515485 0.484515 \n",
"2 NA 0 5 0.512614 0.487386 \n",
"3 NA 0 10 0.505069 0.494931 \n",
"4 NA 0 19 0.513832 0.486168 \n",
"\n",
"[5 rows x 27 columns]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 10000/10000 [00:02<00:00, 3849.83it/s]\n"
]
}
],
"source": [
"import pandas as pd\n",
"from methylbert.deconvolute import deconvolute\n",
"\n",
"deconvolute(trainer = trainer,\n",
" tokenizer = tokenizer, \n",
" data_loader = data_loader,\n",
" output_path = output_path,\n",
" df_train = pd.read_csv(\"tmp/train_seq.csv\", sep=\"\\t\"))"
]
},
{
"cell_type": "markdown",
"id": "b43e4135-40d0-4b44-9518-e4c909cf62fe",
"metadata": {},
"source": [
"`deconvolute` function creates three files in the given directiory:\n",
"1. `decconvolution.csv` : tumour deconvolution result\n",
"2. `FI.csv` : the Fisher information value\n",
"3. `res.csv` : read classification result (the classification result for each read is given in `pred` column)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6225f768-020e-4819-9fac-2c2099ba4a91",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cell_type | \n",
" pred | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" T | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 1 | \n",
" N | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cell_type pred\n",
"0 T 0.0\n",
"1 N 1.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import os\n",
"pd.read_csv(os.path.join(output_path, \"deconvolution.csv\"), sep=\"\\t\").head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6647712f-1d4a-4ba3-b770-83ec04e12fad",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fi | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 4.243251 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fi\n",
"0 4.243251"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(os.path.join(output_path, \"FI.csv\"), sep=\"\\t\").head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dbf770b9-8cbc-476e-9603-0770ecbed7b5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" flag | \n",
" ref_name | \n",
" ref_pos | \n",
" map_quality | \n",
" cigar | \n",
" next_ref_name | \n",
" next_ref_pos | \n",
" length | \n",
" seq | \n",
" ... | \n",
" XM | \n",
" XR | \n",
" PG | \n",
" RG | \n",
" dna_seq | \n",
" methyl_seq | \n",
" dmr_ctype | \n",
" dmr_label | \n",
" ctype | \n",
" pred | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" SRR10166000.9089788_9089788_length=151 | \n",
" 147 | \n",
" chr10 | \n",
" 131767360 | \n",
" 42 | \n",
" 151M | \n",
" = | \n",
" 131767187 | \n",
" -324 | \n",
" GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... | \n",
" ... | \n",
" ........xZ.x..Z.x..xZ.....xZ.....x....x..hx...... | \n",
" GA | \n",
" MarkDuplicates-287B47C6 | \n",
" diffuse_large_B_cell_lymphoma_test_8 | \n",
" GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... | \n",
" 2222222221222212222212222221222222222222222222... | \n",
" T | \n",
" 5 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
" | 1 | \n",
" SRR10165998.65829390_65829390_length=150 | \n",
" 163 | \n",
" chr4 | \n",
" 20254248 | \n",
" 23 | \n",
" 151M | \n",
" = | \n",
" 20254343 | \n",
" 244 | \n",
" GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... | \n",
" ... | \n",
" H..............h......xh.h...x..Z.Zx.h..x.Zx..... | \n",
" GA | \n",
" MarkDuplicates-3DAAB091 | \n",
" diffuse_large_B_cell_lymphoma_test_8 | \n",
" GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... | \n",
" 2222222222222222222222222222222121222222212222... | \n",
" T | \n",
" 19 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
" | 2 | \n",
" SRR10165467.85837758_85837758_length=151 | \n",
" 99 | \n",
" chr4 | \n",
" 1401206 | \n",
" 40 | \n",
" 151M | \n",
" = | \n",
" 1401285 | \n",
" 227 | \n",
" AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... | \n",
" ... | \n",
" ...........x..h....hhh.h....hxz.hhhhh............ | \n",
" CT | \n",
" MarkDuplicates-36E4BA78 | \n",
" Bcell_noncancer_test_8 | \n",
" AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... | \n",
" 2222222222222222222222222222220222222222222222... | \n",
" T | \n",
" 2 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
" | 3 | \n",
" SRR10165995.16747267_16747267_length=149 | \n",
" 83 | \n",
" chr2 | \n",
" 176945656 | \n",
" 40 | \n",
" 149M | \n",
" = | \n",
" 176945572 | \n",
" -233 | \n",
" AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... | \n",
" ... | \n",
" x...hh...hh.............Z.....h.........z.h...... | \n",
" CT | \n",
" MarkDuplicates-74536757 | \n",
" diffuse_large_B_cell_lymphoma_test_8 | \n",
" GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... | \n",
" 2222222222222222222222212222222222222220222222... | \n",
" T | \n",
" 12 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
" | 4 | \n",
" SRR10165995.46034072_46034072_length=151 | \n",
" 99 | \n",
" chr4 | \n",
" 20253524 | \n",
" 40 | \n",
" 151M | \n",
" = | \n",
" 20253771 | \n",
" 398 | \n",
" TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... | \n",
" ... | \n",
" .Z...h......................Z.hXZ...Z..Z....H.... | \n",
" CT | \n",
" MarkDuplicates-74536757 | \n",
" diffuse_large_B_cell_lymphoma_test_8 | \n",
" TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... | \n",
" 2122222222222222222222222222122212221221222222... | \n",
" T | \n",
" 19 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 24 columns
\n",
"
"
],
"text/plain": [
" name flag ref_name ref_pos \\\n",
"0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360 \n",
"1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248 \n",
"2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206 \n",
"3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656 \n",
"4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524 \n",
"\n",
" map_quality cigar next_ref_name next_ref_pos length \\\n",
"0 42 151M = 131767187 -324 \n",
"1 23 151M = 20254343 244 \n",
"2 40 151M = 1401285 227 \n",
"3 40 149M = 176945572 -233 \n",
"4 40 151M = 20253771 398 \n",
"\n",
" seq ... \\\n",
"0 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ... \n",
"1 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ... \n",
"2 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ... \n",
"3 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ... \n",
"4 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ... \n",
"\n",
" XM XR \\\n",
"0 ........xZ.x..Z.x..xZ.....xZ.....x....x..hx...... GA \n",
"1 H..............h......xh.h...x..Z.Zx.h..x.Zx..... GA \n",
"2 ...........x..h....hhh.h....hxz.hhhhh............ CT \n",
"3 x...hh...hh.............Z.....h.........z.h...... CT \n",
"4 .Z...h......................Z.hXZ...Z..Z....H.... CT \n",
"\n",
" PG RG \\\n",
"0 MarkDuplicates-287B47C6 diffuse_large_B_cell_lymphoma_test_8 \n",
"1 MarkDuplicates-3DAAB091 diffuse_large_B_cell_lymphoma_test_8 \n",
"2 MarkDuplicates-36E4BA78 Bcell_noncancer_test_8 \n",
"3 MarkDuplicates-74536757 diffuse_large_B_cell_lymphoma_test_8 \n",
"4 MarkDuplicates-74536757 diffuse_large_B_cell_lymphoma_test_8 \n",
"\n",
" dna_seq \\\n",
"0 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... \n",
"1 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... \n",
"2 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... \n",
"3 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... \n",
"4 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... \n",
"\n",
" methyl_seq dmr_ctype dmr_label \\\n",
"0 2222222221222212222212222221222222222222222222... T 5 \n",
"1 2222222222222222222222222222222121222222212222... T 19 \n",
"2 2222222222222222222222222222220222222222222222... T 2 \n",
"3 2222222222222222222222212222222222222220222222... T 12 \n",
"4 2122222222222222222222222222122212221221222222... T 19 \n",
"\n",
" ctype pred \n",
"0 NaN 0 \n",
"1 NaN 0 \n",
"2 NaN 0 \n",
"3 NaN 0 \n",
"4 NaN 0 \n",
"\n",
"[5 rows x 24 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(os.path.join(output_path, \"res.csv\"), sep=\"\\t\").head()"
]
},
{
"cell_type": "markdown",
"id": "9ba25384-4d34-421f-9086-ac5b98088cc4",
"metadata": {},
"source": [
"### Deconvolution with the estimate adjustment\n",
"\n",
"_MethylBERT_ supports the tumour purity estimation adjustment considering the different distribution of tumour-derived reads in DMRs. \n",
"\n",
"You can turn `adjustment` option on for this. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9345dd97-63c5-4e79-a52d-1786ae1faaa3",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 614/614 [00:56<00:00, 10.86it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Margins : [0.5114678899082569, 0.48853211009174313]\n",
" name flag ref_name ref_pos \\\n",
"0 SRR10166000.9089788_9089788_length=151 147 chr10 131767360 \n",
"1 SRR10165998.65829390_65829390_length=150 163 chr4 20254248 \n",
"2 SRR10165467.85837758_85837758_length=151 99 chr4 1401206 \n",
"3 SRR10165995.16747267_16747267_length=149 83 chr2 176945656 \n",
"4 SRR10165995.46034072_46034072_length=151 99 chr4 20253524 \n",
"\n",
" map_quality cigar next_ref_name next_ref_pos length \\\n",
"0 42 151M = 131767187 -324 \n",
"1 23 151M = 20254343 244 \n",
"2 40 151M = 1401285 227 \n",
"3 40 149M = 176945572 -233 \n",
"4 40 151M = 20253771 398 \n",
"\n",
" seq ... \\\n",
"0 GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG... ... \n",
"1 GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT... ... \n",
"2 AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG... ... \n",
"3 AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC... ... \n",
"4 TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT... ... \n",
"\n",
" RG \\\n",
"0 diffuse_large_B_cell_lymphoma_test_8 \n",
"1 diffuse_large_B_cell_lymphoma_test_8 \n",
"2 Bcell_noncancer_test_8 \n",
"3 diffuse_large_B_cell_lymphoma_test_8 \n",
"4 diffuse_large_B_cell_lymphoma_test_8 \n",
"\n",
" dna_seq \\\n",
"0 GTGGAGTGCCGCTGCGCAGCCGGGAGCCGGGAGCAGAACAGCCTGG... \n",
"1 GTTTCTTCTACCTTTGCCATCAGGTGTCTGCCGCGGAGCTGCGGCT... \n",
"2 AAAATGAGAGACTGCTTGTCCCTCTTAACCCGCCCCCAAAAGGGGG... \n",
"3 GAATGGCTTGGTCTACTTCTCTCCGACCAAGCCCAACCCCGAGTAC... \n",
"4 TCGGACTTGGTGTTATTTATTTGGGAAGCGCCCGGACGGCGGAGCT... \n",
"\n",
" methyl_seq dmr_ctype dmr_label \\\n",
"0 2222222221222212222212222221222222222222222222... T 5 \n",
"1 2222222222222222222222222222222121222222212222... T 19 \n",
"2 2222222222222222222222222222220222222222222222... T 2 \n",
"3 2222222222222222222222212222222222222220222222... T 12 \n",
"4 2122222222222222222222222222122212221221222222... T 19 \n",
"\n",
" ctype pred n_cpg P_N P_T \n",
"0 NA 0 14 0.502180 0.497820 \n",
"1 NA 0 10 0.515485 0.484515 \n",
"2 NA 0 5 0.512614 0.487386 \n",
"3 NA 0 10 0.505069 0.494931 \n",
"4 NA 0 19 0.513832 0.486168 \n",
"\n",
"[5 rows x 27 columns]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 10%|█ | 1/10 [00:00<00:00, 11.06it/s]\n"
]
}
],
"source": [
"from methylbert.deconvolute import deconvolute\n",
"import pandas as pd\n",
"\n",
"# Read classification\n",
"deconvolute(trainer = trainer,\n",
" tokenizer = tokenizer, \n",
" data_loader = data_loader,\n",
" output_path = output_path,\n",
" df_train = pd.read_csv(\"tmp/train_seq.csv\", sep=\"\\t\"),\n",
" adjustment = True) # tumour purity estimation adjustment"
]
},
{
"cell_type": "markdown",
"id": "66ce345d-1acf-488f-a728-66b773b22d0f",
"metadata": {},
"source": [
"When the adjustment is applied, the Fisher information is calculated in each DMR. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "15d5c16c-64c7-4949-a4b9-9febb8ce6bb8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dmr_label | \n",
" fi | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 2 | \n",
" 2 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 3 | \n",
" 3 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 4 | \n",
" 4 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" dmr_label fi\n",
"0 0 1.0\n",
"1 1 1.0\n",
"2 2 1.0\n",
"3 3 1.0\n",
"4 4 1.0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(os.path.join(output_path, \"FI.csv\"), sep=\"\\t\").head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dnabert",
"language": "python",
"name": "dnabert"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}