CRISPR/Cas9 on-target knockout efficacy prediction

A project for the Machine Learning Sessional course that applies traditional machine learning with sequence-based properties, as well as deep learning techniques.

Building on the work of Wang et al., we implemented and experimented with several deep learning models for predicting CRISPR/Cas9 on-target knockout efficacy. We used the dataset provided with that paper. During the experimentation phase, we implemented CNN, LSTM, GRU, bidirectional LSTM, bidirectional LSTM with attention, and Hierarchical Attention Network (HAN) models, among others. This project uses only the sgRNA sequence as input, without manually preparing any epigenetic features beforehand.

Before training the models, we had to encode the sgRNA sequences. We used two different methods: one-hot encoding and word embedding with Word2Vec. We then trained the deep learning models and evaluated their performance with the Spearman correlation coefficient. At the time of this implementation, two papers had used the same dataset. The Spearman correlation coefficients they reported are listed in the following table.

| Model | WT-SpCas9 | eSpCas9 | SpCas9-HF1 |
|---|---|---|---|
| DeepHF | 0.867 | 0.862 | 0.86 |
| CRISPRPred(SEQ) | 0.838 | 0.83 | 0.821 |
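
The two sequence encodings mentioned above can be illustrated with a minimal sketch. This is not the project's actual preprocessing code: the helper names `one_hot_encode` and `kmer_tokens`, the example sequence, and the choice of overlapping 3-mers as the "words" for Word2Vec are all assumptions made for illustration. The source does not state how sequences were tokenized before embedding.

```python
# Hypothetical helpers; names and the k=3 default are illustrative assumptions.
NUCLEOTIDES = "ACGT"


def one_hot_encode(seq):
    """Return a len(seq) x 4 matrix with a single 1 per row (column order A, C, G, T)."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    matrix = []
    for base in seq.upper():
        row = [0] * len(NUCLEOTIDES)
        row[index[base]] = 1  # raises KeyError for characters other than A, C, G, T
        matrix.append(row)
    return matrix


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers that can serve as 'words' for an embedding model."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


example = "GACGTTAGCA"  # made-up sequence, not taken from the dataset
print(one_hot_encode(example)[:2])  # rows for 'G' and 'A'
print(kmer_tokens(example))
```

The one-hot matrix can be fed directly to a CNN or recurrent model, while the token list is the kind of input a Word2Vec implementation expects as one "sentence" per sgRNA.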

A summary of the results we obtained through our experiments can be found in the following table.

| Preprocessing technique & model | Total params | Cas9 type | Cross-validation score (avg) | Test score with best fold |
|---|---|---|---|---|
| One-hot encoding & CNN | 64,585 | WT-SpCas9 | 0.8202634655 | 0.8242259448 |
| | | eSpCas9 | 0.8029278236 | 0.8003575945 |
| | | SpCas9-HF1 | 0.7937839469 | 0.7853817265 |
| One-hot encoding & CRNNCrispr (seq branch only) | 968,273 | WT-SpCas9 | 0.8326987799 | 0.8383132948 |
| | | eSpCas9 | 0.8168465796 | 0.8180295835 |
| | | SpCas9-HF1 | 0.8174757287 | 0.813113697 |
| Word embedding & BiLSTM (Attention) | 170,635 | WT-SpCas9 | 0.8409261036 | 0.8463582093 |
| | | eSpCas9 | 0.8258162304 | 0.822620635 |
| | | SpCas9-HF1 | 0.836215959 | 0.8359173926 |
| Word embedding & Hierarchical Attention Networks | 547,678 | WT-SpCas9 | 0.8436791472 | 0.847940327 |
| | | eSpCas9 | 0.8217506609 | 0.8185988637 |
| | | SpCas9-HF1 | 0.8323302035 | 0.8303764046 |
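
All scores reported here are Spearman correlation coefficients, which compare the rank order of predicted and measured efficacies rather than their raw values. The sketch below shows one way to compute the metric from scratch (average ranks for ties, then Pearson correlation on the ranks). The function names and the toy numbers are illustrative assumptions, not the project's evaluation code; in practice `scipy.stats.spearmanr` would be the usual choice.

```python
def average_ranks(values):
    """Assign ranks 1..n, giving tied values the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


# Toy values, made up for illustration only.
measured = [0.10, 0.45, 0.30, 0.80, 0.65]
predicted = [0.20, 0.50, 0.35, 0.60, 0.90]
print(round(spearman(measured, predicted), 4))
```

Because only ranks matter, a model can score well on this metric even if its predicted efficacies are systematically shifted or rescaled relative to the measured ones.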

As the test-set results show, our deep learning models performed well, but none of them surpassed DeepHF. Because of the time constraints of the course, we had to limit our experiments. Further experiments with model architectures and hyperparameter tuning could show whether a sequence-only model can perform as well as models that also use epigenetic features.