CRISPR/Cas9 on-target knockout efficacy prediction
A project for the Machine Learning Sessional course that uses traditional machine learning with sequence-based properties as well as deep learning techniques.
Based on the work of Wang et al., we implemented and experimented with several deep learning models for predicting CRISPR/Cas9 on-target knockout efficacy, using the dataset provided with that paper. The models we tried include CNN, LSTM, GRU, bidirectional LSTM, bidirectional LSTM with attention, and Hierarchical Attention Networks (HAN). This project uses only the sgRNA sequence as input, without manually preparing any epigenetic features beforehand.
Before training the models, we had to encode the sgRNA sequences. We used two different encodings: one-hot encoding and word embedding with Word2Vec. We then trained the deep learning models and evaluated their performance using the Spearman correlation coefficient. At the time of this implementation, two papers had used the same dataset. The Spearman correlation coefficients they reported are listed in the following table.
| Model | WT-SpCas9 | eSpCas9 | SpCas9-HF1 |
|---|---|---|---|
| DeepHF | 0.867 | 0.862 | 0.86 |
| CRISPRPred(SEQ) | 0.838 | 0.83 | 0.821 |
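
The two sequence encodings mentioned above can be illustrated with a minimal sketch. This is not the project's actual preprocessing code: the function names, the A/C/G/T column order, and the choice of overlapping 3-mers as the "words" passed to Word2Vec are all assumptions made for illustration.

```python
# Illustrative sketch only; names, column order and k-mer size are assumptions.
NUCLEOTIDES = "ACGT"


def one_hot_encode(seq):
    """Return a len(seq) x 4 matrix with a single 1 per row (A, C, G, T order)."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    matrix = []
    for base in seq.upper():
        row = [0] * len(NUCLEOTIDES)
        row[index[base]] = 1  # raises KeyError for characters outside ACGT
        matrix.append(row)
    return matrix


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers that can serve as 'words'."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    example = "GACGT"  # toy sequence, much shorter than a real sgRNA
    print(one_hot_encode(example))
    print(kmer_tokens(example))
```

The one-hot matrix can be fed directly to a convolutional or recurrent model, while the k-mer token lists would first be passed to a Word2Vec trainer to obtain an embedding per token.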
A summary of the results we obtained through our experiments can be found in the following table.
| Preprocessing technique & model | Cas9 type | Cross-validation score (avg) | Test score with best fold |
|---|---|---|---|
| One-hot encoding & CNN (total params: 64,585) | WT-SpCas9 | 0.8202634655 | 0.8242259448 |
| | eSpCas9 | 0.8029278236 | 0.8003575945 |
| | SpCas9-HF1 | 0.7937839469 | 0.7853817265 |
| One-hot encoding & CRNNCrispr seq branch only (total params: 968,273) | WT-SpCas9 | 0.8326987799 | 0.8383132948 |
| | eSpCas9 | 0.8168465796 | 0.8180295835 |
| | SpCas9-HF1 | 0.8174757287 | 0.813113697 |
| Word embedding & BiLSTM (attention) (total params: 170,635) | WT-SpCas9 | 0.8409261036 | 0.8463582093 |
| | eSpCas9 | 0.8258162304 | 0.822620635 |
| | SpCas9-HF1 | 0.836215959 | 0.8359173926 |
| Word embedding & Hierarchical Attention Networks (total params: 547,678) | WT-SpCas9 | 0.8436791472 | 0.847940327 |
| | eSpCas9 | 0.8217506609 | 0.8185988637 |
| | SpCas9-HF1 | 0.8323302035 | 0.8303764046 |
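
For readers unfamiliar with the metric, the sketch below shows one way a Spearman correlation and a cross-validation average like the ones in the table could be computed. It is a plain-Python illustration under our own assumptions, not the project's evaluation code: the function names are hypothetical, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
# Illustrative sketch only; function names are hypothetical.

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # 1-based mean rank of the tie group
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; fails with ZeroDivisionError if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(y_true, y_pred):
    """Spearman correlation = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


def mean_fold_score(folds):
    """Average Spearman score over (y_true, y_pred) pairs, one pair per fold."""
    scores = [spearman(y_true, y_pred) for y_true, y_pred in folds]
    return sum(scores) / len(scores)
```

Because Spearman correlation depends only on ranks, it rewards a model for ordering guides correctly by efficacy even when the predicted values are on a different scale from the measured ones.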
Our deep learning models performed well on the test set, but none of them beat DeepHF. Because of the time constraints of the course, we had to limit our experiments. Further experiments with model architectures and hyperparameter tuning could show whether a sequence-only model can perform as well as models that also use epigenetic features.