Electronic Journal of Biotechnology, Vol. 1, No. 3, December, 1998

Using a neural network to backtranslate amino acid sequences

Gilbert White1 and William Seffens*2

1Department of Biological Sciences, Clark Atlanta University, 223 James Brawley Dr., S.W. Atlanta, GA 30314, USA

Financial Support: This work was supported (or partially supported) by NIH grant GM08247, Research Centers in Minority Institutions award G12RR03062 from the Division of Research Resources, National Institutes of Health and NSF CREST Center for Theoretical Studies of Physical Systems (CTSPS) Cooperative Agreement #HRD-9632844.

Received 26 August 1998

Code Number: ej98023

A neural network (NN) was trained on amino and nucleic acid sequences to test the NN's ability to predict a nucleic acid sequence given only an amino acid sequence. A multi-layer backpropagation network with one hidden layer of 5 to 9 neurons was used. Different network configurations were used with varying numbers of input neurons to represent amino acids, while a constant representation was used for the output layer representing nucleic acids. In the best-trained network, 93% of the overall bases, 85% of the degenerate bases, and 100% of the fixed bases were correctly predicted from randomly selected test sequences. The training set was composed of 60 human sequences in a window of 10 to 25 codons at the coding sequence start site. Different NN configurations involving the encoding of amino acids under increasing window sizes were evaluated to predict the behavior of the NN with a significantly larger training set. This genetic data analysis effort will assist in understanding human gene structure. Benefits include computational tools that could more reliably predict the backtranslation of amino acid sequences, which is useful for degenerate PCR cloning, and that may assist the identification of human gene coding sequences (CDS) from open reading frames in DNA databases.

Keywords: Amino acids, Backtranslation, Genetic code, Neural network, Nucleic acids

Degenerate primers or probes, usually designed from partially sequenced peptides or from conserved regions identified by comparing several proteins, have been widely used in the polymerase chain reaction (PCR), DNA library screening, and Southern blot analysis. The degenerate nature of the genetic code prevents backtranslation of amino acids into codons with certainty. Numerous statistical studies have established that codon frequencies are not random (Karlin and Brendel, 1993).
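The ambiguity that degeneracy introduces can be made concrete with a small sketch. The code below is an illustration only, not part of the original study: it builds the standard genetic code from the conventional TCAG ordering and counts how many distinct nucleotide sequences could encode a given peptide. The function names `synonymous_codons` and `count_backtranslations` are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_codons(aa):
    """Return all codons that encode the one-letter amino acid `aa`."""
    return sorted(codon for codon, a in CODON_TABLE.items() if a == aa)


def count_backtranslations(peptide):
    """Count the distinct nucleotide sequences that translate to `peptide`."""
    total = 1
    for aa in peptide:
        total *= len(synonymous_codons(aa))
    return total


if __name__ == "__main__":
    # Methionine and tryptophan have one codon each; leucine, serine and
    # arginine have six, so short peptides quickly become highly ambiguous.
    for peptide in ("MW", "MLS", "MLSR"):
        print(peptide, count_backtranslations(peptide))
```

A lookup table alone cannot choose among these alternatives, which is why any method that aims to pick a single codon per residue needs additional information such as codon usage or sequence context.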
Many cDNA sequences have been mapped onto a "DNA-walk" and long-range power law correlations were found (Peng et al., 1992). In consideration of the long-range correlations in DNA, a neural network approach may identify sequence patterns in coding regions that could be used to improve the accuracy of backtranslation. Neural networks are able to form generalizations and can identify patterns in noisy data sets. To list just a few biological applications, neural networks have been used successfully to identify coding regions in genomic DNA (Snyder and Stormo, 1993), to detect mRNA splice sites (Ogura et al., 1997), and to predict the secondary structure of proteins (Holley and Karplus, 1989; Chandonia and Karplus, 1996).

Neural networks have also been used to study the structure of the genetic code. One such network was trained to classify the 61 nucleotide triplets of the genetic code into 20 amino acid categories (Tolstrup et al., 1994). This network was able to correlate the structure of the genetic code to measures of amino acid hydrophobicity.

Most neural network methods for identifying patterns in sequences can be classified as a search by signal or a search by content (Granjeon and Tarroux, 1995). Search by signal consists of identifying specific sites, such as splice sites. This method suffers from a lack of reliability when variable signals delimit the regions of interest. Search-by-content algorithms use local constraints, such as compositional bias, to characterize regions of DNA.

The goal of the research reported here is to apply these successful NN techniques to analyze and generalize codon usage in mRNA sequences beginning at the CDS start site. Local and global patterns of codon usage in genes may be identifiable by neural networks of suitable architecture. This paper reports on some initial trials of altering the encoding of amino acids for the input neural layer.
Future studies will address the architecture of the hidden layer to optimize the NN's ability to detect codon usage patterns in genes.

Materials and Methods

Training set. Human mRNA sequences were obtained from GenBank on the basis of several criteria. The coding sequences were relatively short in order to avoid splicing and other variants of the mRNA. The sequences were identified by keywords indicating that a complete mRNA could be reconstructed. Such keywords include complete coding sequence (CDS), 5' and 3' untranslated regions (UTR), and poly(A) site. Multiple members from gene families were excluded to prevent overtraining on those sequences. The sequences were downloaded from Entrez at the NIH web site (http://www.ncbi.nlm.nih.gov/Entrez/) and the coding sequence from each was saved into a file. Up to the first 75 nucleotides of the CDS were selected for this study in a window starting at the methionine ATG start site.

Binary representations. In order to train the neural network (NN) it is necessary to formulate an encoding scheme, because the architecture of the NN is binary and does not allow a direct representation of nucleic or amino acid sequences. Therefore, a binary numeric representation was used to encode the amino acid data. Several Microsoft Word 97 macros were recorded to convert amino acids and nucleic acids into numerical values. The macros used the find and replace commands in Microsoft Word 97 for each of the twenty amino acids and for the four nucleotides. The individual numeric-encoded sequence files were then joined together into groups. For this study a total of sixty mRNAs were examined with different window lengths, which changed the total size of the training set (White, 1998). The nomenclature for each group identifies the number of sequences used and the number of codons taken from each sequence. For example, in Training Set 60S-10C there are sixty sequences with a window of ten codons taken from each sequence.
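The window selection and set naming described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study (which relied on Word macros): the helper names `first_codons`, `training_set_name`, and `build_training_set` are hypothetical, and the toy sequences are invented for the example.

```python
def first_codons(cds, n_codons):
    """Return up to `n_codons` codons from the start of a coding sequence.

    Assumes `cds` begins at the ATG start codon, as in the windows described
    in the text.
    """
    cds = cds.upper()
    if not cds.startswith("ATG"):
        raise ValueError("coding sequence is expected to start at ATG")
    usable = min(len(cds) // 3, n_codons)
    return [cds[3 * i:3 * i + 3] for i in range(usable)]


def training_set_name(n_sequences, n_codons):
    """Build a label in the style of the text, e.g. '60S-10C'."""
    return f"{n_sequences}S-{n_codons}C"


def build_training_set(cds_list, n_codons):
    """Window every sequence and report the label and total codon count."""
    windows = [first_codons(cds, n_codons) for cds in cds_list]
    total_codons = sum(len(w) for w in windows)
    return training_set_name(len(windows), n_codons), windows, total_codons


if __name__ == "__main__":
    # Toy data: sixty invented sequences, each long enough for a 25-codon window.
    toy_cds = ["ATG" + "GCT" * 30 for _ in range(60)]
    for size in (10, 25):
        name, _, total = build_training_set(toy_cds, size)
        print(name, total, "codons,", 3 * total, "nucleotides")
```

With sixty sequences, a 25-codon window gives 1,500 codons, or 4,500 nucleotides, which corresponds to the 75-nucleotide upper limit stated for the window.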
Since ten codons were taken from each sequence, there are 600 codons in this set. A related study of predicting bases in tRNA sequences used a window size of 15 bases (Sun et al., 1995), while this study used a window of 10 codons or 30 bases.

Neural network. All work with the NN was performed on a Sun SPARCstation™ 20 computer. The NN used was a utility of Partek 2.0b4 called a multi-layer perceptron (MLP). An MLP is a NN that has at least three layers (the input layer, the output layer, and one or more hidden layers). Each layer is attached to the next layer by connection weights that are changed during the training process to reduce the overall error. This allows the network to "learn" patterns in the mRNA sequences. Training was stopped when the change in the total output error became less than 0.1% from the previous iteration. This usually occurred after 500 to 1200 iterations using the backpropagation learning method. Test sets were assembled to assess the predictive accuracy of the trained NN. The test sets consisted of 3 randomly selected human gene sequences from the same group of sequences from which the training set was selected. The predicted output was measured in 3 categories: the overall percent correct, percent correct for degenerate bases, and percent correct for fixed bases. These measures allow the assessment of the various schemes used to encode the amino acids.

Encoding the amino acids

Adding degeneracy information

Binary encoding

Comparing the schemes

One of the possible uses of this research is to improve the design of oligonucleotide probes (Eberhardt, 1992). One primer-design study found an overall homology greater than 82% between the predicted probe and the target sequence when codon utilization and dinucleotide frequencies were taken into account (Lathe, 1985). When sequence stretches lacking serine, arginine, and leucine were selected, the overall homology became 85.7% in Lathe's study. Our best network predicted 85% of the degenerate bases and 93% of the overall bases. The data set used in Lathe's study contained 13,000 nucleotides, whereas our largest training set had 4,500 nucleotides. Therefore, an increase in our network or training set size could lead to even greater accuracy by detecting patterns of codon choice within the mRNA sequences.

The architecture of the amino acid encoding method apparently does not have a large impact on predictive accuracy in this study. Therefore other factors, such as computational time or memory size, may be the criteria used to select an encoding scheme for a larger training set. It is also interesting to note that the network that predicted the highest percentage of correct overall bases did so on a test set that had eight leucines, one arginine, and two serines. These amino acids present difficulties for algorithms based on codon lookup tables, such as Lathe's work or common primer selection programs (such as Nash, 1993). The work reported here demonstrates that a NN approach may yield improvements in predictive accuracy for PCR primer selection.
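The three accuracy measures reported in this paper (overall, degenerate, and fixed bases) can be illustrated with a short scoring sketch. The definition used here is an assumption, since the text does not spell it out: a codon position is treated as "fixed" when every synonymous codon for that amino acid carries the same base there, and as "degenerate" otherwise. The names `fixed_mask` and `score_prediction` are hypothetical, and the example prediction is invented.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def fixed_mask(aa):
    """For each codon position, True if all synonymous codons share one base.

    Assumed definition of a 'fixed' base; positions that vary are 'degenerate'.
    """
    codons = [codon for codon, a in CODON_TABLE.items() if a == aa]
    return [len({codon[i] for codon in codons}) == 1 for i in range(3)]


def score_prediction(peptide, true_codons, predicted_codons):
    """Return percent correct for overall, fixed, and degenerate bases."""
    counts = {"overall": [0, 0], "fixed": [0, 0], "degenerate": [0, 0]}
    for aa, true, pred in zip(peptide, true_codons, predicted_codons):
        for i, is_fixed in enumerate(fixed_mask(aa)):
            hit = int(true[i] == pred[i])
            for key in ("overall", "fixed" if is_fixed else "degenerate"):
                counts[key][0] += hit
                counts[key][1] += 1
    # None marks a category with no bases to score.
    return {k: (100.0 * h / n if n else None) for k, (h, n) in counts.items()}


if __name__ == "__main__":
    # Invented example: Met-Gly-Leu with two wrong bases in the prediction.
    result = score_prediction("MGL", ["ATG", "GGC", "CTG"], ["ATG", "GGT", "TTG"])
    for category, percent in result.items():
        print(category, percent)
```

Under this definition leucine, arginine, and serine each have two degenerate positions rather than one, which is one way to see why they are harder for lookup-table methods.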
Table 1. Degeneracy codes for nucleic acids.
Shown is the percentage of correctly predicted degenerate bases in a test set composed of three sequences selected randomly from the same group of sequences from which the training set was assembled.
Supported by UNESCO / MIRCEN network
© 1998 by Universidad Católica de Valparaíso -- Chile