Abstract
Messenger RNA polyadenylation is an essential processing step in eukaryotic gene expression. The site of polyadenylation [poly(A) site] marks the end of a transcript, which in most cases is also the end of a gene. A computational program that recognizes poly(A) sites would therefore be useful both for finding gene ends during genome annotation and for predicting alternative poly(A) sites. PASS [Poly(A) Site Sleuth] and PAC [Poly(A) site Classifier] were developed to predict poly(A) sites in plants. PASS is built on the Generalized Hidden Markov Model (GHMM) and consists of four functional modules: an input module, a poly(A) site recognition module, a graphic process module, and an output module. PAC is a classification model that integrates several features defining poly(A) sites, including K-gram patterns, the Z-curve, a position-specific scoring matrix, and a first-order inhomogeneous Markov sub-model. PAC can be used to predict poly(A) sites in species whose polyadenylation profile is unknown. Both PASS and PAC produce several output files, one of which contains the score or probability of being a poly(A) site for each position of a given sequence. Although the models were built mostly from poly(A) profile data from Arabidopsis, they also work in other higher plants because their poly(A) profiles are quite similar.
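To make two of the sequence features named above more concrete, the following is a minimal Python sketch of K-gram counting and of the standard cumulative Z-curve transform. It is an illustration only, not the PAC implementation: the function names `kgram_counts` and `z_curve`, the example window, and the choice to treat U as T are all assumptions made for this sketch.

```python
from collections import Counter


def kgram_counts(seq, k):
    """Count every overlapping K-gram (substring of length k) in seq."""
    seq = seq.upper().replace("U", "T")  # assumption: RNA input is mapped to DNA letters
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def z_curve(seq):
    """Return the cumulative Z-curve point (x, y, z) after each base of seq.

    x contrasts purines (A, G) with pyrimidines (C, T),
    y contrasts amino bases (A, C) with keto bases (G, T),
    z contrasts weak pairs (A, T) with strong pairs (G, C).
    """
    seq = seq.upper().replace("U", "T")
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq:
        if base in counts:  # ambiguous symbols such as N leave the counts unchanged
            counts[base] += 1
        a, c, g, t = (counts[b] for b in "ACGT")
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)))
    return points


# Hypothetical 6-nt window used only for illustration.
window = "AATAAA"
print(kgram_counts(window, 3))
print(z_curve(window)[-1])
```

In a classifier of this kind, such values would typically be computed over windows around each candidate position and combined with the other features into one feature vector; how PAC itself weights or combines them is not specified here.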