
Biomedical Journal of Scientific & Technical Research

October 2019, Volume 21, Issue 5, pp. 16131-16135

Research Article

Crispr/Cas9 Target Prediction with Deep Learning

Özlem Aktaş, Elif Doğan* and Tolga Ensari

Author Affiliations

Department of Computer Engineering, Faculty of Engineering, Turkey

Received: September 24, 2019 | Published: October 03, 2019

Corresponding author: Elif Doğan, Department of Computer Engineering, Faculty of Engineering, Dokuz Eylül University, Turkey

DOI: 10.26717/BJSTR.2019.21.003652

Abstract

The CRISPR/Cas9 system is a powerful tool for editing damaged genome sequences. MicroRNAs (miRNAs) are small non-coding RNA molecules, and miRNAs with damaged sequences are the targets considered here. The miRNAs targeted by multiple-promoter sgRNAs (single guide RNAs) are cut or regulated by the CRISPR/Cas9 method. sgRNAs directed at the wrong miRNAs may provoke undesired genome abnormalities. To minimize these genome distortions, this study performs sgRNA target prediction for CRISPR/Cas9 with deep learning. Convolutional neural network (CNN), multilayer perceptron (MLP) and bidirectional long short-term memory (BLSTM) models are used for the experimental analysis, and the performance of the three models is compared.

Abbreviations: CNN: Convolutional Neural Networks; BLSTM: Bidirectional Long-Short Term Memory; MLP: Multi Layer Perceptron

Introduction

An immune system named CRISPR (clustered regularly interspaced short palindromic repeats)/Cas9 has been discovered in Escherichia coli (E. coli) bacteria investigated with the "in silico" (simulation) method, which plays an important role today [1]. In this system, an E. coli bacterium infected by a virus adds the viral DNA to its memory and recognizes it when another viral attack occurs. The recognized viral DNA is then cut out of the bacterium's own DNA by the Cas9 enzyme, after which DNA repair occurs. According to recent research, the leading factor in gene regulation is the microRNA (miRNA) targeted by single-guide RNAs (sgRNAs). miRNAs are small, non-coding RNA molecules [2]. In disease, for example, the different miRNAs involved in all stages of cancer make them useful for cancer diagnosis and treatment [3]. CRISPR/Cas9 is used to destroy targeted miRNAs in cells [4]. It uses a guide RNA (gRNA) to direct the CRISPR/Cas9 nuclease to the DNA sequence and trigger a double-strand break at the desired location. Breaking and repairing these strands can cause random insertions and alterations in the DNA [5].

miRNA activity differs in each cell type [6], so the miRNA activity of a human cell and that of a bacterial cell will not be the same. It is therefore important to find a data set for the genomes under study. Various miRNA target data sets used in "in silico" studies are used for this purpose [7]. The aim of this study is to predict miRNA targets with machine learning algorithms. A wrongly targeted miRNA may cause undesirable gene mutations [8]. The study aims to minimize the Type 1 and Type 2 errors (false positives and false negatives) in miRNA targeting by applying machine learning and deep learning methods, so that wrong target predictions are minimized. In this way, gene damage could be repaired reliably by correctly targeting miRNAs that would otherwise be mistargeted. Genetic diseases could then be eliminated by repairing the gene damage.

The purpose of the "in silico" technique is to increase the accuracy of disease prevention through single-point targeting with the help of automation. The large volume of base reads provided by automation also contributes greatly. In studies of miRNA targeting, support vector machine (SVM) [9], deep learning, constraint logic programming [10] and one-class classification (OCC) [11] methods have been used. In addition, big data may in future be reduced to its most valuable parts with digital data forgetting concepts, which makes computations faster and lets machines store less data on disk [37].

Materials and Methods

CRISPR

The CRISPR/Cas9 system is a powerful genome editing mechanism used in many biotechnology applications [12]. miRNAs are RNA molecules that play an important role in gene regulation in animals and plants [13]. Damaged miRNAs are targeted by sgRNAs. However, the effectiveness of sgRNAs at a given target site has not been firmly established [14]. In CRISPR (clustered regularly interspaced short palindromic repeats)/Cas9 systems, the Cas9 nuclease is guided by a short RNA sequence (about 20 nucleotides) to create a double-strand break at the specified region of the DNA. CRISPR/Cas9 is a unique technology that allows genetic and medical researchers to add, remove or modify DNA sequences in various parts of the genome. In contrast to ZFNs (zinc-finger nucleases) and TALENs (transcription activator-like effector nucleases), CRISPR/Cas9 is not man-made; it originates as part of the bacterial immune system, where it helps protect against invading phages. Specificity in CRISPR is achieved using an RNA molecule that is complementary to the gene of interest. After binding, this RNA molecule (also called guide RNA) recruits a Cas9 nuclease, which makes a double-strand break that causes a frame shift when repaired by NHEJ (non-homologous end joining). CRISPR is the most effective gene repair and editing tool to date. Efficiency varies depending on the organism and the target site. Because it is the simplest, most versatile and most precise method of genetic manipulation currently available, it attracts great attention in the scientific world. In summary, CRISPR/Cas9 differs from ZFNs and TALENs in an important aspect that makes it superior for genome editing applications: ZFNs and TALENs bind to DNA through a direct protein-DNA interaction that requires redesigning the protein for each new target, whereas CRISPR/Cas9 is retargeted by changing only the guide RNA.

Data Set

For the applied "in silico" research and review, BLAST (Basic Local Alignment Search Tool) was used for CRISPR [15]. BLAST is a search tool that analyzes protein amino acid sequences and DNA sequences and finds similarities between them. Besides BLAST, data set resources such as the National Human Genome Research Institute (NHGRI) [16], miRBase [17], Genome Crispr [18], CrisprInc [19], ENSEMBL [20], ENCODE [21], CRISPRz [22], CRISPOR [23] and CRISPR Local [24] have been used. In algorithm studies performed with these data sets, prediction tools such as mirWalk [25], TargetScan (ID2 PPI analysis network) [26], miRanda [27], mirBase [28], mirTarget [29] and TarBase [30] have been developed.

In this study, the CRISPR Local data set was used for the analysis [31]. The source of the CRISPR Local data set is ENSEMBL Plants. The original contains approximately 854,610 rows of CRISPR data; because of technical constraints, 34,200 rows were used in this study. The data set has 11 feature columns, and the examples come from the alga "Cyanidioschyzon merolae". The features are: the gene in which the sgRNA is located; the on-target predicted chromosome and its coordinate; the 23-nt sgRNA sequence; the on-target prediction score; the off-target predicted gene with the highest CFD score; the predicted target chromosome, its coordinate and start position; the off-target predicted sequence; the number of mismatches between the sgRNA and the off-target sequence; the exon name; the exon start position; and all off-targets together with the sgRNA having the highest CFD score. Sequences are encoded over 4 channels, with adenine (A), cytosine (C), guanine (G) and thymine (T) represented as [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1], respectively. Using these channels, each RNA sequence was converted to a binary representation, and the sequences were used in binary form in the data set. Each base in the sequence is treated as a separate column and feature. Of the available features, the 23-nt sgRNA sequence and the on-target prediction score were used. Figure 1 shows an illustration of the sample data set.
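The four-channel encoding described above can be sketched in a few lines of Python. This is only an illustration: the example sequence is hypothetical and not taken from the CRISPR Local data set, and the flattening into one feature row is an assumption about how the per-base columns might be laid out.

```python
# Channel order A, C, G, T follows the encoding described in the text.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence):
    """Return one 4-element binary vector per base of the sequence."""
    return [ONE_HOT[base] for base in sequence.upper()]


def flatten(encoded):
    """Concatenate the per-base vectors into one flat feature row (assumed layout)."""
    return [bit for vector in encoded for bit in vector]


# Hypothetical 23-nt sgRNA sequence, used only for illustration.
example_sgrna = "GACGTTACCGGATCAAGTCTAGG"
encoded = one_hot_encode(example_sgrna)
features = flatten(encoded)
print(len(encoded), len(features))
```

With a 23-nt sequence this yields 23 per-base vectors, or 92 binary values once flattened.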

Figure 1: Sample Data Set.

Experimental Studies

In this study, CNN, MLP and BLSTM models were built and compared.

Multilayer Perceptron - MLP

Figure 2: MLP Model.

A 4-layer MLP model was developed and run on a Google Colab GPU (https://colab.research.google.com). The MLP model is a fully connected structure of dense layers. Details of the model are shown in Figure 2. The accuracy and loss of the MLP model are shown in Figure 3 and Figure 4, respectively. The MLP model reached an accuracy of 80.38%.
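To make the idea of a fully connected stack of dense layers concrete, the following is a minimal forward-pass sketch in plain Python. It is not the model in Figure 2: the layer widths, the ReLU hidden activation, the sigmoid output and the random untrained weights are all assumptions chosen for illustration, and the 92 inputs assume a flattened one-hot encoding of a 23-nt sequence.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b), written with plain lists."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


# Assumed sizes: 92 inputs (23 bases x 4 channels) and hypothetical hidden widths.
LAYER_SIZES = [92, 64, 32, 16, 1]
rng = random.Random(0)
layers = [
    init_layer(rng, n_in, n_out)
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])
]


def mlp_forward(features):
    """Pass one flat feature row through the four dense layers."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        is_last = index == len(layers) - 1
        activations = dense(activations, weights, biases, sigmoid if is_last else relu)
    return activations[0]


# Hypothetical input: a one-hot row in which every base is "A".
sample = [1, 0, 0, 0] * 23
probability = mlp_forward(sample)
print(probability)
```

Every unit in a layer is connected to every output of the previous layer, which is what "fully connected" means here; training would adjust the weights that this sketch leaves random.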

Figure 3: Model Accuracy.

Figure 4: Model loss.

Convolutional Neural Network - CNN

Convolutional neural networks are a type of artificial neural network used successfully in image processing, bioinformatics, robotics, data mining, finance and many other areas. Beyond image analysis, surprisingly high accuracy has also been obtained in sentiment analysis, text classification and question answering applications. In this model, an mxm filter is applied to an nxn matrix by a sliding dot product, where n>m. This allows features to be identified and classified. In Figure 5, a 3x3 matrix is obtained as the result of the dot product of a 5x5 matrix with a filter. In this study, the data set was divided into two groups: 70 per cent for training and 30 per cent for testing. Non-linearity was introduced into the model by using the hyperbolic tangent and sigmoid activation functions. The convolution layer is used to extract features: it creates a new matrix from the results of multiplying the matrices. To prevent overfitting, a max pooling layer was used, which selects the element with the maximum value from each pool of the specified size. The resulting CNN model is shown in Figure 6, and its accuracy is shown in Figure 7.
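The sliding dot product and the max pooling step can be illustrated with a small plain-Python sketch. The 5x5 input, the 3x3 filter values, the stride of 1 without padding and the 4x4 pooling input are all hypothetical choices; they are not the values shown in Figure 5 or the layers of the model in Figure 6.

```python
def convolve2d_valid(matrix, kernel):
    """Slide an m x m kernel over an n x n matrix (n > m), stride 1, no padding."""
    n, m = len(matrix), len(kernel)
    out_size = n - m + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Element-wise product of the kernel and the window under it, summed.
            total = sum(
                matrix[i + a][j + b] * kernel[a][b]
                for a in range(m)
                for b in range(m)
            )
            row.append(total)
        output.append(row)
    return output


def max_pool2d(matrix, size):
    """Keep the maximum of each non-overlapping size x size window."""
    n = len(matrix)
    return [
        [
            max(matrix[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, n - size + 1, size)
        ]
        for i in range(0, n - size + 1, size)
    ]


# Hypothetical 5x5 input and 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = convolve2d_valid(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]

# Hypothetical 4x4 matrix pooled with 2x2 windows.
pool_input = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
pooled = max_pool2d(pool_input, 2)
print(pooled)  # [[6, 4], [7, 9]]
```

With these assumptions the output side length is n - m + 1, so a 5x5 input and a 3x3 filter give the 3x3 result mentioned in the text.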

Figure 5: Convolution Process.

Figure 6: CNN Model.

Figure 7: Model Accuracy.

Bidirectional Long-Short Term Memory - BLSTM

A bidirectional LSTM differs from feed-forward neural network models in that it has feedback (recurrent) connections. The resulting bidirectional LSTM model is shown in Figure 8, and its accuracy is shown in Figure 9. The bidirectional LSTM model reached an accuracy of 80.88%, whereas the CNN model reached 96.7%. The Stochastic Gradient Descent (SGD) optimization method was used, with the learning rate set to 0.0005, and binary cross-entropy (logarithmic loss) was used as the loss function. Table 1 shows the MLP, CNN and bidirectional LSTM accuracy rates. Table 2 shows the accuracy, precision, recall and F-measure values of the MLP, CNN and bidirectional LSTM models.

In another study [35], the authors used the same data set. The source of CRISPR Local, the Ensembl Plants [36] data, contains a set of plant and animal cell data. In that study a Support Vector Machine (SVM) algorithm was applied, and the reported accuracy is 87%. Table 3 compares the accuracy of our algorithm with that of previous studies.
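As a rough illustration of the training setup named above, the sketch below computes binary cross-entropy and applies one SGD update with the stated learning rate of 0.0005. For brevity it updates a single sigmoid unit rather than an LSTM; the toy features, label and zero initial weights are hypothetical.

```python
import math

LEARNING_RATE = 0.0005  # value stated in the text


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss over a batch; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def predict(weights, bias, features):
    """Output probability of a single sigmoid unit."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def sgd_step(weights, bias, features, label, learning_rate=LEARNING_RATE):
    """One stochastic gradient step on one sample for a single sigmoid unit."""
    # For sigmoid output with cross-entropy loss, dLoss/dz = prediction - label.
    error = predict(weights, bias, features) - label
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# Hypothetical toy sample with zero-initialised parameters.
weights, bias = [0.0, 0.0, 0.0, 0.0], 0.0
features, label = [1, 0, 0, 1], 1

loss_before = binary_cross_entropy([label], [predict(weights, bias, features)])
weights, bias = sgd_step(weights, bias, features, label)
loss_after = binary_cross_entropy([label], [predict(weights, bias, features)])
print(loss_before, loss_after)
```

Because the learning rate is small, a single step lowers the loss only slightly; many passes over the training data would be needed in practice.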

Figure 8: Bidirectional LSTM Model.

Figure 9: Model Accuracy.

Table 1: MLP, CNN and Bidirectional LSTM accuracy.

Table 2: MLP, CNN and BLSTM results.

Table 3: Comparing with other studies.
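For readers unfamiliar with the measures listed for Table 2, the sketch below shows how accuracy, precision, recall and F-measure are computed from binary predictions, and how false positives and false negatives correspond to Type 1 and Type 2 errors. The labels and predictions are made-up toy values, not results from this study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1  # Type 1 error (false positive)
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1  # Type 2 error (false negative)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F-measure as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical toy labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts, metrics)
```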

Conclusion

In this study, the Multilayer Perceptron (MLP), Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BLSTM) algorithms were compared on a CRISPR data set. On this data set, the accuracy was 80.38% for the MLP model and 80.88% for the bidirectional LSTM model, whereas the CNN model reached 96.7%. The CNN model thus obtained the highest accuracy of the three. On the ENSEMBL Plants data set, the CNN model was built with up to 7 layers, the MLP with 4 layers and the BLSTM with 2 layers. Our algorithm also performs better than the other algorithm it was compared with: the study that used the SVM algorithm reported an accuracy of 87.0%, whereas our model achieved 96.7%. Any mistargeted position causes unwanted genome distortions. For this reason, accuracy is critical in sgRNA targeting.

References
