### 1. Introduction

*ab initio*methods, comparative genomics, and hybrid methods. The first one primarily relies on: 1) statistical patterns extracted from accumulative protein coding database, 2) prior knowledge obtained in existing biological techniques, and 3) signal processing that takes advantage of the periodicity of amino acid codons [7]. Comparative genomics can accurately locate conserved coding exons by pair-wise or multiple sequence alignments of homogeneous species and have played an important role in predicting novel genes and exons that had escaped transcriptome profiling and corrected previous gene annotation. Especially, with the recent increase in annotation data, the homology-based comparative methods significantly improve predictive performance via investigating the homologous database of annotated mRNAs, proteins, peptides, and other biological elements. However, computational methods are still far from satisfactory in both sensitivity and specificity. More scientists realize that using only one method is not enough for predicting genes with high sensitivity and high specificity:

*ab initio*methods often lack the high sensitivity on the predictive results while comparative methods are prone to lose specificity. Thus,

*ab initio*and comparative methods are often combined to maximize the predictive performance.

*ab initio*, comparative methods, and hybrid methods, are discussed, respectively. In Section 6, the evaluation metrics and the typical dataset are given to assess the performance of computational methods. In Section 7, we conclude this review, discuss the related issues, and provide a possible direction in regards to future work in gene prediction.

### 2. Preliminaries

### 2.1 Gene Structure

### 2.2 Exon and Intron

#### 2.3 3-Periodicity

### 2.4 Splice Sites

### 2.5 Start and Stop Codons

### 2.6 Promoter

### 3. *Ab initio* Methods

*Ab initio*methods can detect genes by systematically examining and discriminating signal sensors and distinct biological patterns as well as being able to distinguish gene regions in a single input sequence. The only criteria this type of method adopts to identify the genes relies on the extracted intrinsic information of DNA sequences. Many

*ab initio*methods largely depend on probabilistic models. Of them, hidden Markov models (HMMs) are the most generative, where the transitions of the nucleotide over finite hidden states are ruled by the probabilities of present and previous appearances.

*Ab initio*methods are indispensable for gene prediction because these methods use statistical patterns and intrinsic information, especially signal sensors, to detect the boundaries of content and can greatly increase the specificity of prediction performance. On the other hand, one of the disadvantages of an

*ab initio*method is that it requires a large volume of training sets to collect the near-ground-truth statistical properties of various signal sensors, which inherently limits their applicability to low sample sets. Another disadvantage is that since the boundaries are often variable, it results in over-fitting models on small training sets.

### 3.1 Fourier Transform and Digital Signal Processing

*x*

*[*

_{C}*n*],

*x*

*[*

_{T}*n*],

*x*

*[*

_{A}*n*], and

*x*

*[*

_{G}*n*], where 1 or 0 represents the presence or absence, respectively, in the corresponding positions. Other representations include the quaternion approach [15], the electron-ion interaction potential method [20], real-number representation [21], complex number mapping [22], etc.

*x*[

*n*] is a finite numerical sequence length of length

*N*,

*n*is the sequence index,

*k*is the period of

*N/k*samples and the discrete frequency of 2

*πk*/

*N*. The total Fourier spectra of coding sequences typically have a peak at the frequency

*k = N*/3, whereas the Fourier spectra of non-coding sequences generally do not have any significant peaks [17].

*k = N*/3, as shown in Eq. (2).

*m*∈{

*C, T, A, G*} and

*Ŝ*is the average of the spectral content of

*S*.

*P*is assigned to 4 as a critical point where the bulk of coding sequences is distinct from almost 90% of non-coding regions having

*P*<4. Based on spectral content measures, the optimization technique [24] is applied to the calculation of coefficients for the spectral content of each spectral sequence according to the known genes of a given organism.

*π*/3, the spectral-rotation method [17] studies the argument by using an argument plane to rotate the four Fourier vectors,

*X*

*[*

_{m}*k*],

*m*∈{

*C, T, A, G*}, clockwise for a certain angle that is equivalent to the average phase angle in coding sequences so that each of them can point to the same direction. Consequently, the vectors in coding regions can approximately point along the real axis, whereas the vectors in non-coding regions point in different directions. The spectral-rotation measure

*V*is defined as the square of the ratio of the spectral rotation over the corresponding angular deviation, as shown in Eq. (3).

*μ*is the average phase angle,

*δ*is the angular deviation in coding regions, and

*S*is the same spectral content, as defined in Eq. (2). It makes a contribution to spectral analysis methodology by converting the analysis from frequency to argument.

*x*[

*n*], as a function of the period

*k =*3, shown in the equation below.

*N*represents the window size. The

*AMDF*[

*k*] can produce a deep null at

*k =*3 where significant correlation exists. If the period

*k ≠*3 is detected, the relatively low AMDF values are produced at non-coding regions compared with high values at coding regions [32]. The time-domain method and the frequency-domain method can be combined to improve the accuracy of the 3-periodicity prediction [33]. Different from the widely adopted parameter of 351 in window size in the frequency domain, the frame size is set to 117 for optimizing the performance of the AMDF method in a time domain [33].

*π*/3 was applied to numerical sequences, an impulse train of 3-periodicity was multiplied with the filtered numerical sequences in order to focus on the 3-periodicity property in the exonic region, as shown in Eq. (5).

### 3.2 Hidden Markov Model and Statistical Methods

*k*previous nucleotides (

*k*is the order of HMM), namely, the conditional probabilities

*P*(

*X*

_{k+1}*|X*

_{1}*,X*

_{2}*,…,X*

*), where*

_{k}*X*∈{

*C, T, A, G*}. The zero-order Markov model is the simplest Markov model, which means that each nucleotide occurs independently with a given frequency. It is believed that the large-order Markov model can better characterize the dependencies between adjacent nucleotides. Some gene prediction methods are constructed as a 5th-order Markov model, which uses compositional words that are 6 in gene characteristic length. However, in [38], it was revealed that models with an order higher than 5 do not make a distinct difference in discriminating the coding and non-coding regions while they significantly increase the computational load.

*φ*using probabilistic models. A parse or an estimator is composed of an ordered set of states

*q̄*= {

*q*

_{1},

*q*

_{2}, ···,

*q*

*} with an associated set of length or duration*

_{n}*d̄*= {

*d*

_{1},

*d*

_{2}, ···,

*d*

*} and the total length of a sequence*

_{n}*q*

*is defined as one of 27 biological components, such as a promoter; 5′ or 3′ UTR′ poly-A-signal; initial, middle, or terminal exon; intron; and so forth. These states are classified as the forward strand and reverse strand, and the nucleotides located at different positions within a codon or an intron are differentiated. For example, the first base and the second base of an intron are specifically marked as a donor.*

_{i}*q*

_{1}in terms of an initial distribution on the states

*π⃗*=

*P*{

*q*

_{1}}, where

*q*

_{1}belongs to one of 27 components, and the conditional distribution of the length for

*q*

_{2}; 2) generating a sequence

*s*

_{1}that is conditional on

*d*

_{1}and

*q*

_{1}, in terms of the sequence generating model; and 3) generating

*q*

_{2}, which is conditional on the generation of

*q*

_{1}, according to the first order Markov state transition of Matrix

*T*

*=*

_{ij}*P*

_{qk+1|qk}. The three steps are repeated until the sum of the state duration equals or exceeds the total length

*L*, at which point the final sequence consists of a set of ordered sequence segments,

*S = s*

_{1}*s*

_{2}*…s*

*. Four main probabilistic components [40] are needed during the training stage, a vector*

_{n}*π⃗*for initial probabilities, a matrix

*T*for state transition probabilities, a set of length distributions

*f*for state length estimation, and a set of sequence generation models

*P*for various states.

*Ω*=

*Φ*

*×*

_{L}*Ψ*

*contains all possible sets of parses and DNA sequences for the fixed length*

_{L}*L*where

*Φ*is the set of all possible parses/estimators and

*Ψ*is the set of all possible DNA sequences. The HMM model can be thought of as a measuring function that assigns a probability density to each parse/sequence pair [39,42]. Thus, via the Bayesian theorem, the conditional probability of a particular parse

*φ*

_{i}*∈ Φ*

*for a particular sequence*

_{L}*S ∈ Ψ*

*can be calculated, as shown in Eq. (6):*

_{L}*φ*.

*X*=

*x*

_{1},

*x*

_{2}, ···,

*x*

*. The first two steps are learned from the training sequence data and the last step is the estimation of the probability for a generated input sequence. For example, a 6-bp WMM model [49,50] can be applied for polyadenylation signal prediction by using the annotated sequences of polyA in the GenBank database. Similarly, promoter prediction [12,40] can be modeled by differentiating between TATA-containing and TATA-less promoters and by using two separate WMM models.*

_{n}*x*

*at position*

_{i}*i*, given nucleotide

*x*

_{i}_{−1}at position

*i*− 1. The conditional probability can be acquired by estimating the corresponding conditional frequencies in the set of the aligned sequences for signal sensors from the training set. In many literature [40,52–54], the WAM method was applied for detecting splice acceptor and donor sites based on 2-order weight matrices.

*k*-1 level at most, where

*k*is the consensus length. The MDD tree can be constructed by the following steps:

Step 1. For each position of consensus, calculating the sum of the χ

^{2}statistics for matching the variable*C*versus each position_{i}*X*for all pairs_{j}*i*,*j*with*i ≠ j*.Step 2. Choosing the maximal sum value in each iteration and dividing the chosen group into two subsets at a certain position

*i*according to the consensus._{1}

### 3.3 Traditional and Deep Neural Networks

*k-*mer preferences are computed through the observed

*k-*mers in both coding DNA and DNA genomes; dinucleotide fractal dimension represents the differences in dinucleotide occurrence between the intron and an examined window; word commonality is calculated by summing all

*k-*mer commonalities in the analysis window; the frame bias matrix provides the usage of an amino acid to calculate the correlation coefficient for a reading frame; and the Fickett algorithm considers several properties of coding sequences. The multiple-classifier based method has been experimentally verified to provide better performance than the single classifier neural network [57].

*h*and the iterative estimation of

*x*

*can be expressed by calculating their weights, as illustrated in Fig. 4(b). The iteration becomes stable when it has the minimum distance between*

^{*}*x*and

*x*

*. The preliminary ideas of a shallow/deep neural network have been discussed since the 1990s. However, mature concepts about deep learning, including a deep neural network, were not proposed until the mid-2000s [66–68]. Since then, only a small number of works [65,69,70] have applied deep learning in the life sciences, even though it has shown tremendous promise [64]. As a promising method for the future directions of gene prediction, deep learning and other emerging methods are further discussed in Section 7.*

^{*}### 3.4 Support Vector Machines and Kernel Methods

*f*(

*x*)

*=<w, x>+b*, where,

*x*denotes a vector in

*M*-dimensional vector space,

*w*represents a weight vector,

*b*is a bias, and the component is defined as the dot product or scalar product,

*l*-mer, and so forth. For example, for splice site recognition in a spectrum kernel [81], the long sub-strings as input features are more informative than short ones. However, due to mismatching in the long sub-strings, a sufficient length may cause downgrading in predictive performance [83]. The weighted degree kernel shift [82] extends the WD kernel and allows some flexibility in matching sub-strings. Similarly, the Oligo kernel [85] and others closely related to the spectrum kernel [72,84] are extensions of allowing for gaps and mismatches and for achieving the goal via subtly different means.

### 4. Comparative Methods

### 4.1 Alignment Techniques

*k*-mer seed [98,99]. A binary mask [99] assigned to seeds aims to find more flexible error-tolerance patterns or some specific purpose patterns, such as splice sites. For example, a spaced seed that is 11 in length and 8 in weight in the mask 11110110011 allows for mismatching in the 0 positions and exact match in the 1 positions. In human and mouse genome alignments [97] different seeds are applied iteratively to detect the orthologous and paralogous genes between the two mammal species.

### 4.2 Intergenomic Comparison

### 4.3 Intragenomic Comparison

*ab initio*gene finding becomes difficult. Thus, the latest methods for gene identification in RNA-Seq transcripts often involve a variety of methods. For example, the novel method in [47] integrates unassembled RNA-Seq read alignments into the self-training procedure of the GeneMark-ES program to improve the accuracy of gene prediction.

### 4.4 Chaining Technique

### 5. The Hybrid Trend: Combining *Ab initio* and Comparative Methods

*ab initio*and comparative methods into a particular application. The innovation of hybrid methods primarily relies on a novel combination of techniques in the two mainstream methods in order to achieve performance improvement in a particular application. Hybrid methods are comprised of two categories: one is the combination of methodologies and the second is the combination of overlapping results. Even though the latter one looks simpler than the former, the success of a hybrid method heavily depends on the final performance and its particular constraints.

*ab initio*method, such as the artificial neural network, with the comparative method, such as a homology search in proteins and the EST database, to improve the performance of gene prediction. A comparative method in [117] is used by BLASTX query against a protein database to identify the protein-coding regions in ESTs. After finding a candidate from the coding region in sequences, it uses

*ab initio*for sequence prediction.

*ab initio*[119] employs the profiles of multiple protein sequence alignments and models human dynein heavy chain proteins as an evidence to improve the accuracy of the prediction. A similar computational combination between machine learning and comparative methods [136] are used to discover the splice sites. JIGSAW [137], a hybrid method, absorbs several sources of evidence and automates the process of predicting the gene structure. It calculates the relative weight of evidence using statistics generated from the training set, and then uses dynamic programming to combine the evidence. The results in the EGASP evaluation experiment [138] show that hybrid methods are superior to

*ab initio*methods in finding genes and achieves the highest rank in both specificity and sensitivity.

*ab initio*HMM model for the unsupervised training procedure and the novel combination improved the accuracy of gene prediction. Similarly, in [140] the alignment information was fed to the HMM model for the automatic training process.

*ab initio*+ Comparative + Script Tools model, results from many mature tools can be overlapped and combined for the best collection of predictive candidates, which is a pragmatic paradigm in gene prediction for improving the specificity and the efficiency of computational methods [142]. In [143], it combined results from three computational tools, GenScan [40], HMMGene [144], and Glimmer [145], and integrated them into an artificial neural network to refine the best combined outcome.

### 6. Performance Assessment

### 6.1 Evaluation and Dataset

*C. elegans*genome that has been well annotated by scientists [150]. About 10% of the

*C. elegans*genomes were selected as the dataset in the evaluation experiment, including 1M genomic sequence regions separately for the training set and testing set. Meanwhile, the multiple genome alignments between

*C. elegans*,

*C. briggsae,*and

*C. remanei*, and the alignments of ESTs, mRNAs, and proteins with the

*C. elegans*genome were provided as additional data for different types of methods [9]. The reference gene sets were derived from WormBase [151,152] and were used as benchmarks to measure the results of different programs.

*ab initio*statistical methods were the 1) HAVANA dataset and 2) the human

*PAX5*gene on chromosome 9.

### 6.2 Metrics

Sensitivity, S

_{n}= TP/(TP + FN)Specificity, S

_{p}= TN/(TN + FP)Accuracy, A

_{cc}= (TP + TN)/(TP + FP + FN + TN)Mean correlation coefficient, M

_{cc}= (TP×TN-FN×FP)/[(TP+FN) × (TN+FP) × (TP+FP) × (TN+FN)]^{1/2}Positive predictive value, P

_{pv}= TP/(TP + FP)Performance coefficient, P

_{c}= TP/(TP + FN + FP)F1 score, the harmonic mean of precision and sensitivity

F1 = 2×TP/(2×TP + FP + FN)

### 7. Discussion and Future Direction

### 7.1 Relevant Issues

#### 1) RNA-Seq and Alternative Splicing Sites

#### 2) Non-coding Elements

#### 3) Metagenomic Sequences

*ab initio*gene prediction algorithms on short-read fragment metagenomic datasets, especially error-prone sequencing data. The results showed that the HMM-based FragGeneScan [171] is more sensitive than others, while the Prodigal [172], MetaGeneAnnotator [173], and MetaGeneMark [38] are better suited for low-error sequences. Although many metagenomic methods have been studied (e.g., [169]), it must be noted that most of the proposed computational methods for gene prediction on mixed metagenomic datasets are still thought of as a largely unresolved question.

#### 4) Polymorphism and Gene Finding

#### 5) CpG Island and Methylation Region

*Alu*repeats [177,178], which were often regarded as junk pieces in the DNA genome. In CGIs, some unrevealed principles may play a critical role in transcription regulation [179] and forming an evolutionary force [177]. The redefinition of a CGI [180] and its identification [181] may contribute to gene finding and the studies on CGI and methylation may establish a direct link between epigenetic analysis and genomic studies.

### 7.2 Emerging Techniques and Future Direction

#### 1) Deep Learning

#### 2) Cloud and Parallel Computing

#### 3) Information Theory

#### 4) Machine Learning

*k*-mer-based sequence binning methods and sequence property-based methods are often seen as the input for training models. For example, a newly developed SVM-based algorithm [80] can be implemented in a three-stage strategy to predict genes. First, short reads are classified into phylogenetic groups by a

*k*-mer-based sequence binning method. Second, SVM classifiers integrate entropy density profiles of codon usage, translation initiation site scores, and open reading frame length as input patterns for supervised universal model training. Then, protein coding sequences are identified in each group independently with these SVM classifiers.