MethylBERT fine-tuning

Once the pure cell-type data is preprocessed following the previous tutorial, you can fine-tune the pre-trained MethylBERT model.

[1]:

from methylbert.utils import set_seed
import warnings
warnings.filterwarnings("ignore") # Ignore warnings for a clear notebook

set_seed(42)
seq_len=150
n_mers=3
batch_size=5
num_workers=20
output_path = "tmp/fine_tune/"

Load training and evaluation data

The preprocessed data must be loaded into the MethylBertFinetuneDataset and the pytorch DataLoader object.

You can change the number of workers and the batch size depending on the CPU/GPU resources you have. Please find more detailed explanation here.

[2]:

from torch.utils.data import DataLoader
from methylbert.data.vocab import MethylVocab
from methylbert.data.dataset import MethylBertFinetuneDataset

# Creat a look-up table
tokenizer = MethylVocab(n_mers)

# Load the data files int a data set object
train_dataset = MethylBertFinetuneDataset("tmp/train_seq.csv",
                                          tokenizer,
                                          seq_len=seq_len)
test_dataset = MethylBertFinetuneDataset("tmp/test_seq.csv",
                                         tokenizer,seq_len=seq_len)

# Load the data into a data loader
train_data_loader = DataLoader(train_dataset, batch_size= batch_size,
                               num_workers= num_workers, pin_memory=False,
                               shuffle=True)
test_data_loader = DataLoader(test_dataset, batch_size= batch_size,
                              num_workers= num_workers, pin_memory=True,
                              shuffle=False)

Building Vocab
Total number of sequences :  3052
# of reads in each label:  [1561. 1491.]
Total number of sequences :  763
# of reads in each label:  [383. 380.]

Create MethylBERT model

MethylBERT FinetuneTrainer gets the DataLoader objects and hyperparameters for the training as input.

[7]:

from methylbert.trainer import MethylBertFinetuneTrainer
import os

trainer = MethylBertFinetuneTrainer(len(tokenizer),
                      save_path=output_path,
                      train_dataloader=train_data_loader,
                      test_dataloader=test_data_loader,
                      lr=1e-4, with_cuda=False,
                      log_freq=1,
                      #eval_freq=10, #activate this only when you want to evaluate the model with test_data_loader
                      warmup_step=3)

The model is loaded on CPU

Load pre-training model from the uploaded models (huggingface.co/hanyangii/)

Pre-trained models with 2, 4, 6, 8, and 12 encoder blocks are available for MethylBERT. You can give an argument “hanyangii/methylbert_hg19_{number of encoder blocks}l” to the load function.

[12]:

trainer.load("hanyangii/methylbert_hg19_4l")

Restore the pretrained model hanyangii/methylbert_hg19_4l

Some weights of MethylBertEmbeddedDMR were not initialized from the model checkpoint at hanyangii/methylbert_hg19_4l and are newly initialized: ['dmr_encoder.0.weight', 'read_classifier.0.bias', 'read_classifier.0.weight', 'read_classifier.3.bias', 'read_classifier.3.weight', 'read_classifier.4.bias', 'read_classifier.4.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Total Parameters: 46929874

Load pre-training model from a local directory

If you have the pre-trained model saved in a directory containing config.json and pytorch_model.bin files

[8]:

ls ../test/data/pretrained_model/

config.json@  pytorch_model.bin@

You can directly load the model using load function to MethylBertFinetuneTrainer.

[9]:

trainer.load("../test/data/pretrained_model/")

Restore the pretrained model ../test/data/pretrained_model/
Total Parameters: 32754130

Train the MethylBERT model without pre-training

If you do not have a pre-trained MethylBERT model, you can create a new MethylBERT model from a configuration file. For example, the baseline BERT model config file can be downlaoded from from https://huggingface.co/bert-base-uncased/blob/main/config.json.

[6]:

cat ../test/data/config.json

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Then create_model function of the trainer creates a new MethylBERT from the config file.

[7]:

trainer.create_model(config_file="../test/data/pretrained_model/config.json")

Total Parameters: 32754130

train function starts the training of MethylBERT model during the given number of steps.

[8]:

trainer.train(steps=10)


Train Step 0 iter - loss : 0.802449 / lr : 0.000033
Running time for iter = 0.6947791576385498
Step 0 loss (0.695615) is lower than the current min loss (inf). Save the model at tmp/fine_tune/
Step:0 Model Saved on: tmp/fine_tune/

Train Step 1 iter - loss : 0.823174 / lr : 0.000067
Running time for iter = 0.621410608291626

Train Step 2 iter - loss : 1.258267 / lr : 0.000100
Running time for iter = 0.5103580951690674

Train Step 3 iter - loss : 0.490022 / lr : 0.000100
Running time for iter = 0.5122716426849365

Train Step 4 iter - loss : 1.356685 / lr : 0.000100
Running time for iter = 0.5860369205474854

Train Step 5 iter - loss : 0.959048 / lr : 0.000100
Running time for iter = 0.6873147487640381

Train Step 6 iter - loss : 0.939311 / lr : 0.000100
Running time for iter = 0.4517858028411865

Train Step 7 iter - loss : 0.776077 / lr : 0.000100
Running time for iter = 0.6021583080291748
Step 7 loss (0.691268) is lower than the current min loss (0.695615). Save the model at tmp/fine_tune/
Step:7 Model Saved on: tmp/fine_tune/

Train Step 8 iter - loss : 0.657497 / lr : 0.000100
Running time for iter = 0.4712505340576172
Step 8 loss (0.685933) is lower than the current min loss (0.691268). Save the model at tmp/fine_tune/
Step:8 Model Saved on: tmp/fine_tune/

Train Step 9 iter - loss : 0.690212 / lr : 0.000100
Running time for iter = 0.4791133403778076

It creates six files in the given save_path:

config.json and pytorch_model.bin: model configuration and the trained MethylBERT model
dmr_encoder.pickle : The trained DMR encoder in the MethylBERT model
read_classification_model.pickle : The trained fully connected neural network for read classification
train.csv and eval.csv : tracked training and evaluation loss and accuracy values during the training

[9]:

ls tmp/fine_tune/

config.json         eval.csv           read_classification_model.pickle
dmr_encoder.pickle  pytorch_model.bin  train.csv

[10]:

import pandas as pd
df_train  = pd.read_csv("tmp/fine_tune/train.csv", sep="\t")
df_train.head()

[10]:

	step	loss	ctype_acc	lr
0	0	0.802449	0.2	0.000033
1	1	0.823174	0.0	0.000067
2	2	1.258267	0.2	0.000100
3	3	0.490022	0.8	0.000100
4	4	1.356685	0.2	0.000100

[11]:

df_eval  = pd.read_csv("tmp/fine_tune/eval.csv", sep="\t")
df_eval.head()

[11]:

	step	loss	ctype_acc
0	0	0.695615	0.522936
1	1	0.876368	0.501966
2	2	0.932430	0.501966
3	3	0.942029	0.501966
4	4	0.849511	0.501966

[14]:

import seaborn as sns
import matplotlib.pyplot as plt
sns.lineplot(data=df_train, x="step", y="loss", label="train loss")
sns.lineplot(data=df_eval, x="step", y="loss", label="eval loss")
plt.show()

../_images/tutorials_04_Fine-tuning_MethylBERT_model_22_0.png

[15]:

sns.lineplot(data=df_train, x="step", y="lr", label="learning rate", color="m")
plt.show()

../_images/tutorials_04_Fine-tuning_MethylBERT_model_23_0.png