{ "cells": [ { "cell_type": "markdown", "id": "70e78126-74b8-451d-9651-6ca635484225", "metadata": {}, "source": [ "# Preprocessing for _MethylBERT_ fine-tuning training data\n", "\n", "_MethylBERT_ fine-tuning needs DNA methylation data from tumour (T) and normal (N) samples as training data. List the sample files in a tab-delimited file, one file per line, each followed by its annotation (T or N). " ] }, { "cell_type": "code", "execution_count": 1, "id": "baa86abd-334e-4c46-a19b-32e2b79b6170", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "../test/data/T_sample.bam\tT\n", "../test/data/N_sample.bam\tN\n" ] } ], "source": [ "cat ../test/data/bam_list.txt" ] }, { "cell_type": "markdown", "id": "97d1fcb0-3641-48cb-8e6b-f55e68c46e12", "metadata": {}, "source": [ "\n", "As described in the [data preparation](https://github.com/hanyangii/methylbert/blob/main/tutorials/01_Data_Preparation.md) tutorial, DMRs and the reference genome should be prepared in the required format. \n", "\n", "_MethylBERT_ provides the `finetune_data_generate` function to preprocess the given tumour and normal data." 
] }, { "cell_type": "code", "execution_count": 2, "id": "def4854d-9be7-4ff5-b790-ed823ca2e384", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DMRs sorted by areaStat\n", " chr start end length nCG meanMethy1 meanMethy2 \\\n", "1 chr10 134597480 134602875.0 5396 670 0.861029 0.140400 \n", "0 chr7 1268957 1277884.0 8928 753 0.793278 0.129747 \n", "2 chr4 1395812 1402597.0 6786 663 0.831162 0.185272 \n", "5 chr16 54962053 54967980.0 5928 546 0.783631 0.096095 \n", "9 chr18 76736906 76741580.0 4675 510 0.829475 0.104403 \n", "\n", " diff.Methy areaStat abs_areaStat abs_diff.Methy ctype dmr_id \n", "1 0.720629 6144.089331 6144.089331 0.720629 T 0 \n", "0 0.663531 5722.091790 5722.091790 0.663531 T 1 \n", "2 0.645891 4941.410089 4941.410089 0.645891 T 2 \n", "5 0.687536 4714.551799 4714.551799 0.687536 T 3 \n", "9 0.725072 4684.608381 4684.608381 0.725072 T 4 \n", "Number of DMRs to extract sequence reads: 20\n", "../test/data/T_sample.bam processing (T)...\n", "../test/data/N_sample.bam processing (N)...\n", "Fine-tuning data generated: name flag ref_name ref_pos \\\n", "0 SRR10165994.69237235_69237235_length=151 163 chr7 156797584 \n", "1 SRR10165464.24148712_24148712_length=151 99 chr10 131770809 \n", "2 SRR10165994.26664131_26664131_length=151 163 chr10 131766813 \n", "3 SRR10165464.61126854_61126854_length=150 147 chr10 131769430 \n", "4 SRR10165994.14375046_14375046_length=150 83 chr5 1884762 \n", "\n", " map_quality cigar next_ref_name next_ref_pos length \\\n", "0 24 151M = 156797848 344 \n", "1 42 151M = 131770846 188 \n", "2 42 149M = 131767027 365 \n", "3 24 151M = 131769291 -290 \n", "4 40 149M = 1884665 -246 \n", "\n", " seq ... PG XG \\\n", "0 GGGGAAGAAAAAAAACTAAATAATAATTTAACATACATACGTAAAC... ... MarkDuplicates GA \n", "1 GGTTTGTCGGGAAGGTTGTGAGTAGAGGCCAACGGAGGTCTCCCAG... ... MarkDuplicates CT \n", "2 GGGGGCCTCTAAAAACGCTCCAAATTCGTCTTACGCCACGAAATCA... ... 
MarkDuplicates GA \n", "3 GTTGGGTGGTAAGGTGGTTTAGGGTATAGTTAGGGGTTATGTAGAA... ... MarkDuplicates CT \n", "4 AATAATTATTTCTAAATTCTATATTAATTTCGCGACAAACCGCGTT... ... MarkDuplicates GA \n", "\n", " NM XM XR \\\n", "0 23 HHH.z..hhh.h..h...............h...h.....Z..h..... GA \n", "1 11 ..hxz.xZ.......xz.z...x.....HH..Z.....hH.HHX..... CT \n", "2 38 .Z.ZX.....x.h.h.Z......h...Z....h.Z....Zxhh...... GA \n", "3 31 .x....z..h....z..hhx....h....h......hh........... GA \n", "4 18 .......h.....x......x....h.....Z.Zx..xh..Z.Z..... CT \n", "\n", " dna_seq \\\n", "0 GGG GGC GCG CGA GAT ATG TGG GGG GGA GAG AGA GA... \n", "1 GGC GCC CCC CCG CGC GCC CCG CGG GGG GGA GAA AA... \n", "2 CGC GCG CGG GGC GCC CCT CTC TCT CTG TGA GAG AG... \n", "3 GCT CTG TGG GGG GGC GCG CGG GGC GCA CAA AAG AG... \n", "4 AAT ATA TAA AAT ATT TTG TGT GTT TTT TTC TCT CT... \n", "\n", " methyl_seq dmr_ctype dmr_label ctype \n", "0 2202222222222222222222222222222222222212222212... T 17 T \n", "1 2220221222222220202222222222222122222222222222... T 5 N \n", "2 2122222222222212222222222122222212222122222221... T 5 T \n", "3 2222202222222022222222222222222222222222222222... T 5 N \n", "4 2222222222222222222222222222212122222221212222... T 8 T \n", "\n", "[5 rows x 22 columns]\n", "Size - train 3051 seqs , valid 763 seqs \n" ] } ], "source": [ "from methylbert.data import finetune_data_generate as fdg\n", "\n", "f_bam_file_list = \"../test/data/bam_list.txt\"\n", "f_dmr = \"../test/data/dmrs.csv\"\n", "f_ref = \"../../../genome/hg19.fa\"\n", "out_dir = \"tmp/\"\n", "\n", "fdg.finetune_data_generate(\n", " sc_dataset = f_bam_file_list,\n", " f_dmr = f_dmr,\n", " f_ref = f_ref,\n", " output_dir=out_dir,\n", " split_ratio = 0.8, # Split ratio to make training and validation data\n", " n_mers=3, # 3-mer DNA sequences \n", " n_cores=20\n", ")" ] }, { "cell_type": "markdown", "id": "8338c2f6-3d11-48fd-813f-7bce45f2d3e1", "metadata": {}, "source": [ "After the preprocessing, you get three different files:\n", "1. 
`dmrs.csv`: Selected DMRs (when the number of DMRs is given) with a `dmr_label` column\n", "2. `train_seq.csv`: Preprocessed training data\n", "3. `test_seq.csv`: Preprocessed evaluation data (20% of the given data, because `split_ratio=0.8`)" ] }, { "cell_type": "code", "execution_count": 22, "id": "fae1e53b-a4bc-4d9c-a97d-73e98ef52885", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dmrs.csv test_seq.csv train_seq.csv\n" ] } ], "source": [ "ls tmp/" ] }, { "cell_type": "markdown", "id": "e4b4b84f-2d99-483c-b5e1-2642ddfc4356", "metadata": {}, "source": [ "Each preprocessed file is a tab-delimited .csv file in which each column holds one field of the given BAM/SAM file. Three additional columns, `dmr_ctype`, `dmr_label` and `ctype`, are appended:\n", "1. `dmr_ctype`: The specific cell type for each DMR\n", "2. `dmr_label`: DMR label. This is used by the fully-connected read classification network in _MethylBERT_\n", "3. `ctype`: Cell type of the read (as indicated in the input file)" ] }, { "cell_type": "code", "execution_count": 11, "id": "c2d62b4b-c87a-4229-8f69-c6fd0725e9f3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | name | flag | ref_name | ref_pos | map_quality | cigar | next_ref_name | next_ref_pos | length | seq | ... | PG | XG | NM | XM | XR | dna_seq | methyl_seq | dmr_ctype | dmr_label | ctype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SRR10165464.6790597_6790597_length=151 | 83 | chr2 | 176943541 | 40 | 151M | = | 176943475 | -217 | AATTAACAATTTTCATCATAATCTACACATTATTAACATCAAACTT... | ... | MarkDuplicates | GA | 37 | h...hh........z.........x..........h............. | CT | GAT ATT TTG TGG GGC GCA CAA AAT ATT TTT TTT TT... | 2222222222220222222222222222222222222222222222... | T | 12 | N |
| 1 | SRR10165994.18752987_18752987_length=149 | 163 | chr7 | 157486616 | 40 | 149M | = | 157486650 | 183 | AGGCACGCGACCACCCTAAACCTCGAACAAAACTAAAAAAACGCAA... | ... | MarkDuplicates | GA | 51 | ..Z...Z.Zx.......xhh....Zx...xhh...hhhhh..Z..x... | GA | CCG CGC GCA CAC ACG CGC GCG CGG GGC GCC CCA CA... | 1222121222222222222222122222222222222222122222... | T | 11 | T |
| 2 | SRR10165994.2935274_2935274_length=150 | 83 | chr7 | 1270222 | 42 | 150M | = | 1269981 | -391 | ACGAACATTAAAACGCACGGAACCGCCGCGACGCGGACTCGCTCTT... | ... | MarkDuplicates | GA | 27 | h.Z.h....hhh..Z...ZX.h..Z..Z.Zx.Z.ZX....Z........ | CT | GCG CGA GAG AGC GCA CAT ATT TTG TGG GGG GGA GA... | 1222222222221222122222122121221212222212222222... | T | 1 | T |
| 3 | SRR10165464.56090327_56090327_length=151 | 163 | chr2 | 176949511 | 42 | 149M | = | 176949602 | 242 | AGGATTTCTTACTACATAACCACAAAAATACATTAAACCCACACCT... | ... | MarkDuplicates | GA | 36 | h.Z.......h....z.hh..z.zx.hh.h....hhh...z.z...... | GA | GCG CGC GCT CTT TTT TTC TCT CTT TTG TGC GCT CT... | 1222222222222022222020222222222222222202022222... | T | 12 | N |
| 4 | SRR10165464.47924911_47924911_length=150 | 147 | chr7 | 1272480 | 42 | 151M | = | 1272378 | -253 | AATTATTGGGAGTTTGATGTTGATAAGTAAAGTGTTGGAGTGTGGG... | ... | MarkDuplicates | CT | 31 | ......z.....h...................z.xz......z...... | GA | AAT ATT TTA TAT ATC TCG CGG GGG GGA GAG AGC GC... | 2222202222222222222222222222222022022222202220... | T | 1 | N |
5 rows × 22 columns
\n", "