Polyadenylation [poly(A)] is a vital step in post-transcriptional processing of pre-mRNA. Alternative polyadenylation is a widespread mechanism of regulating gene expression in eukaryotes. Defining poly(A) sites contributes to the annotation of transcript ends and the study of gene regulatory mechanisms. Here, we survey methods for collecting poly(A) sites using high-throughput sequencing technologies and summarize the general processes for genome-wide poly(A) site identification. We also compare the performance of various poly(A) site prediction models and discuss the relationship between poly(A) site identification from sequencing projects and predictive modeling. Moreover, we attempt to address some potential problems in current research and propose future directions for polyadenylation research.
Publications
2015
BACKGROUND: Messenger RNA polyadenylation is an essential step for the maturation of most eukaryotic mRNAs. Accurate determination of poly(A) sites helps define the 3'-ends of genes, which is important for genome annotation and gene function research. Genomic studies have revealed the presence of poly(A) sites in intergenic regions, which may be attributed to 3'-UTR extensions and novel transcript units. However, there has been no systematic evaluation of intergenic poly(A) sites in plants.
RESULTS: Approximately 16,000 intergenic poly(A) site clusters (IPACs) in Arabidopsis thaliana were discovered and evaluated at the whole-genome level. Based on the distributions of distance from IPACs to nearby sense and antisense genes, these IPACs were classified into three categories. About 70% of them were from previously unannotated 3'-UTR extensions to known genes, which would extend 6985 transcripts of the TAIR10 genome annotation beyond their 3'-ends, with a mean extension of 134 nucleotides. A total of 1317 IPACs originated from novel intergenic transcripts, 37 of which were likely to be associated with protein-coding transcripts. A further 2957 IPACs corresponded to antisense transcripts for genes on the reverse strand, which might affect 2265 protein-coding genes and 39 non-protein-coding genes, including long non-coding RNA genes. The rest of the IPACs could have originated from transcriptional read-through or gene mis-annotations.
CONCLUSIONS: The identified IPACs corresponding to novel transcripts, 3'-UTR extensions, and antisense transcription should be incorporated into the current Arabidopsis genome annotation. Comprehensive characterization of IPACs from this study provides insights into alternative polyadenylation and antisense transcription in plants.
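The distance-based classification described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not give the distance cutoffs or the order in which the categories were tested, so the constants `MAX_EXTENSION_NT` and `MIN_NOVEL_DISTANCE_NT`, the `Ipac` record, and the function `classify_ipac` are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoffs; the abstract does not state the values actually used.
MAX_EXTENSION_NT = 1000       # max distance past a sense gene's 3' end for an extension
MIN_NOVEL_DISTANCE_NT = 3000  # min distance from any sense gene for a novel transcript


@dataclass
class Ipac:
    # Distance (nt) downstream of the nearest same-strand gene's annotated 3' end,
    # or None if there is no same-strand gene upstream.
    dist_to_sense_3prime: Optional[int]
    # True if the cluster lies within a gene annotated on the opposite strand.
    overlaps_antisense_gene: bool


def classify_ipac(ipac: Ipac) -> str:
    """Assign one intergenic poly(A) site cluster to a category (assumed rule order)."""
    dist = ipac.dist_to_sense_3prime
    if dist is not None and dist <= MAX_EXTENSION_NT:
        return "3'-UTR extension"
    if ipac.overlaps_antisense_gene:
        return "antisense"
    if dist is None or dist >= MIN_NOVEL_DISTANCE_NT:
        return "novel intergenic"
    # Left over: candidates for read-through or mis-annotation.
    return "unclassified"


if __name__ == "__main__":
    for example in (Ipac(134, False), Ipac(5000, True), Ipac(5000, False), Ipac(2000, False)):
        print(example, "->", classify_ipac(example))
```

Testing the extension rule first reflects only the observation that most IPACs fall into that category; a real analysis would derive the cutoffs from the observed distance distributions.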
2014
The polyadenylation of mRNA in eukaryotes is an important biological process. In recent years, significant progress has been made in the field of mRNA polyadenylation owing to the advent of next-generation DNA sequencing technologies. These high-throughput sequencing capabilities have resulted in the direct experimental determination of large numbers of polyadenylation sites, analysis of which has revealed a vast potential for the regulation of gene expression in eukaryotes. These collections have been generated using specialized sequencing methods that are targeted to the junction of the 3'-UTR and the poly(A) tail. Here we present three variations of such a protocol that have been used for the analysis of alternative polyadenylation in plants. While all these methods use oligo-dT as an anchor to the 3'-end, they differ in the means of generating an anchor for the 5'-end in order to produce PCR products suitable for effective Illumina sequencing; the use of different methods to append 5' adapters expands the possible utility of these approaches. These methods are versatile, reproducible, and may be used for gene expression analysis as well as global determinations of poly(A) site choice.
Messenger RNA 3'-end formation is an essential posttranscriptional processing step for most eukaryotic genes. Unlike plants and animals, in which AAUAAA and its variants are routinely found as the main poly(A) signal, Chlamydomonas reinhardtii uses UGUAA as the major poly(A) signal. Advances in sequencing technology provide an enormous amount of sequencing data with which to explore the variations of poly(A) signals, alternative polyadenylation (APA), and its relationship with splicing in this algal species. Through genome-wide analysis of poly(A) sites in C. reinhardtii, we identified a large number of poly(A) sites: 21,041 from Sanger expressed sequence tags, 88,184 from 454, and 195,266 from Illumina sequence reads. In comparison with previous collections, more new poly(A) sites are found in coding sequences, introns, and intergenic regions by deep sequencing. Interestingly, G-rich signals are particularly abundant in intron and intergenic regions. The difference in prevalent poly(A) signals between coding sequences and 3'-untranslated regions implies potentially different polyadenylation mechanisms. Our data suggest that APA occurs in about 68% of C. reinhardtii genes. Using Gene Ontology analysis, we found that most of the APA genes are involved in RNA regulation and metabolic processes, protein synthesis, and hydrolase and ligase activities. Moreover, intronic poly(A) sites are more abundant in constitutively spliced introns than in retained introns, suggesting an interplay between polyadenylation and splicing. Our results support the view that APA, as in higher eukaryotes, may play significant roles in increasing transcriptome diversity and regulating gene expression in this algal species. Our datasets also provide useful information for accurate annotation of transcript ends in C. reinhardtii.
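As an illustration of how poly(A) signal usage such as UGUAA might be tallied, the sketch below checks whether a signal occurs within a window upstream of each cleavage site. It assumes that each input sequence ends immediately before the cleavage site, and the window bounds (`near=5`, `far=35` nucleotides upstream) are placeholder values that do not come from the abstract. The function names `signal_in_window` and `signal_usage` are hypothetical.

```python
from typing import Sequence


def signal_in_window(upstream_seq: str, signal: str = "UGUAA",
                     near: int = 5, far: int = 35) -> bool:
    """Return True if `signal` lies entirely between `far` and `near` nt upstream.

    `upstream_seq` is assumed to end immediately before the cleavage site.
    DNA input is accepted and converted to RNA letters.
    """
    seq = upstream_seq.upper().replace("T", "U")
    n = len(seq)
    # Clamp both bounds so short sequences do not wrap around via negative indices.
    region = seq[max(0, n - far):max(0, n - near)]
    return signal in region


def signal_usage(upstream_seqs: Sequence[str], signal: str = "UGUAA",
                 near: int = 5, far: int = 35) -> float:
    """Fraction of sites whose upstream window contains the signal."""
    if not upstream_seqs:
        return 0.0
    hits = sum(signal_in_window(s, signal, near, far) for s in upstream_seqs)
    return hits / len(upstream_seqs)


if __name__ == "__main__":
    sites = [
        "A" * 20 + "TGTAA" + "C" * 10,  # signal inside the window
        "UGUAA" + "C" * 40,             # signal too far upstream
    ]
    print(signal_usage(sites))
```

Running the same tally separately for sites in coding sequences, introns, intergenic regions, and 3'-UTRs, with different `signal` values, would give the kind of per-region comparison the abstract describes.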
The ability to integrate environmental and developmental signals with physiological responses is critical for plant survival. How this integration is done, particularly through posttranscriptional control of gene expression, is poorly understood. Previously, it was found that the 30 kD subunit of Arabidopsis cleavage and polyadenylation specificity factor (AtCPSF30) is a calmodulin-regulated RNA-binding protein. Here we demonstrate that mutant plants (oxt6) deficient in AtCPSF30 possess a novel range of phenotypes: reduced fertility, reduced lateral root formation, and altered sensitivities to oxidative stress and a number of plant hormones (auxin, cytokinin, gibberellic acid, and ACC). While the wild-type AtCPSF30 (C30G) was able to restore normal growth and responses, a mutant AtCPSF30 protein incapable of interacting with calmodulin (C30GM) could only restore wild-type fertility and responses to oxidative stress and ACC. Thus, the interaction with calmodulin is important for a subset of AtCPSF30 functions in the plant. Global poly(A) site analysis showed that the C30G and C30GM proteins can restore wild-type poly(A) site choice to the oxt6 mutant. Genes associated with hormone metabolism and auxin responses are also affected by the oxt6 mutation. Moreover, 19 genes that are linked with calmodulin-dependent CPSF30 functions were identified through genome-wide expression analysis. These data, in conjunction with previous results from the analysis of the oxt6 mutant, indicate that the polyadenylation factor AtCPSF30 is a regulatory hub where different signaling cues are transduced, presumably via differential mRNA 3' end formation or alternative polyadenylation, into specified phenotypic outcomes. Our results suggest a novel function of a polyadenylation factor in environmental and developmental signal integration.
BACKGROUND: Alternative polyadenylation (APA) plays an important role in the post-transcriptional regulation of gene expression. Little is known about how APA sites may evolve in homologous genes in different plant species. To this end, comparative studies of APA sites in different organisms are needed. In this study, a collection of poly(A) sites in Medicago truncatula, a model system for legume plants, has been generated and compared with APA sites in Arabidopsis thaliana.
RESULTS: The poly(A) tags from a deep-sequencing protocol were mapped to the annotated M. truncatula genome, and the identified poly(A) sites were used to update the annotations of 14,203 genes. The results show that 64% of M. truncatula genes possess more than one poly(A) site, comparable to the percentages reported for Arabidopsis and rice. In addition, the poly(A) signals associated with M. truncatula genes were similar to those seen in Arabidopsis and other plants. The 3'-UTR lengths are correlated in pairs of orthologous genes between M. truncatula and Arabidopsis. Very little conservation of intronic poly(A) sites was found between Arabidopsis and M. truncatula, which suggests that such sites are likely to be species-specific in plants. In contrast, there is greater conservation of CDS-localized poly(A) sites in these two species. A sizeable number of M. truncatula antisense poly(A) sites were found. A high percentage of the associated target genes possess Arabidopsis orthologs that are also associated with antisense sites. This is suggestive of important roles for antisense regulation of these target genes.
CONCLUSIONS: Our results reveal some distinct patterns of sense and antisense poly(A) sites in Arabidopsis and M. truncatula. In so doing, this study lends insight into general evolutionary trends of alternative polyadenylation in plants.
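A statistic such as "64% of genes possess more than one poly(A) site" can be computed along the following lines. This is a sketch under stated assumptions, not the pipeline used in the study: mapped cleavage positions are grouped into clusters when adjacent positions are at most `max_gap` nucleotides apart (the default of 24 is an illustrative value, not taken from the abstract), and a gene counts as an APA gene when it has more than one cluster. The names `cluster_sites` and `apa_fraction` are hypothetical.

```python
from typing import Dict, List


def cluster_sites(positions: List[int], max_gap: int = 24) -> List[List[int]]:
    """Group sorted cleavage positions into clusters separated by more than `max_gap` nt."""
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)  # close to the previous site: same cluster
        else:
            clusters.append([pos])    # start a new cluster
    return clusters


def apa_fraction(sites_by_gene: Dict[str, List[int]], max_gap: int = 24) -> float:
    """Fraction of genes with at least one site that have more than one site cluster."""
    genes = [gene for gene, sites in sites_by_gene.items() if sites]
    if not genes:
        return 0.0
    apa_genes = sum(len(cluster_sites(sites_by_gene[g], max_gap)) > 1 for g in genes)
    return apa_genes / len(genes)


if __name__ == "__main__":
    example = {"geneA": [100, 110, 300], "geneB": [50, 60], "geneC": []}
    print(cluster_sites(example["geneA"]))
    print(apa_fraction(example))
```

Genes with no detected sites are excluded from the denominator here; whether to count them is a design choice that changes the reported percentage and should be stated alongside it.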
BACKGROUND: The biological world is replete with phenomena that appear to be ideally modeled and analyzed by one archetypal statistical framework - the Graphical Probabilistic Model (GPM). The structure of GPMs is a uniquely good match for biological problems that range from aligning sequences to modeling the genome-to-phenome relationship. The fundamental questions that GPMs address involve making decisions based on a complex web of interacting factors. Unfortunately, while GPMs ideally fit many questions in biology, they are not an easy solution to apply. Building a GPM is not a simple task for an end user. Moreover, applying GPMs is also impeded by the insidious fact that the "complex web of interacting factors" inherent to a problem might be easy to define yet intractable to compute upon.
DISCUSSION: We propose that the visualization sciences can contribute to many domains of the bio-sciences by developing tools to address archetypal representation and user interaction issues in GPMs, and in particular in a variety of GPM called a Conditional Random Field (CRF). CRFs bring additional power, and additional complexity, because the CRF dependency network can be conditioned on the query data.
CONCLUSIONS: In this manuscript we examine the shared features of several biological problems that are amenable to modeling with CRFs, highlight the challenges that existing visualization and visual analytics paradigms induce for these data, and document an experimental solution called StickWRLD which, while leaving room for improvement, has been successfully applied in several biological research projects. Software and tutorials are available at http://www.stickwrld.org/.
2013
BACKGROUND: Yeast and human Pcf11 function in both constitutive and regulated transcription and pre-mRNA processing. The constitutive roles of Pcf11 are largely mediated by its direct interaction with the RNA Polymerase II C-terminal domain and a polyadenylation factor, Clp1. However, little is known about the mechanism of the regulatory roles of Pcf11. Though similar to Pcf11 in multiple aspects, Arabidopsis Pcf11-similar-4 protein (PCFS4) plays only a regulatory role in Arabidopsis gene expression. To understand how PCFS4 regulates the expression of its direct target genes at the genome level, a ChIP-Seq approach was employed in this study to identify PCFS4 enrichment sites (ES) and the ES-linked genes within the Arabidopsis genome.
RESULTS: A total of 892 PCFS4 ES sites linked to 839 genes were identified. Distribution analysis of the ES sites along the gene bodies suggested that PCFS4 is preferentially located on the coding sequences of the genes, consistent with its regulatory role in transcription and pre-mRNA processing. Gene ontology (GO) analysis revealed that the ES-linked genes were specifically enriched in a few GO terms, including categories of known PCFS4 functions in Arabidopsis development. More interestingly, GO analysis suggested novel roles of PCFS4. An example is its role in circadian rhythm, which was experimentally verified herein. Sequence analysis of the ES sites identified some over-represented sequence motifs shared by subsets of ES sites. The motifs may explain the specificity of PCFS4 for its target genes and the functions of PCFS4 in multiple aspects of Arabidopsis development and behavior.
CONCLUSIONS: Arabidopsis PCFS4 has been shown to specifically target, and physically interact with, subsets of genes. Its targeting specificity is likely mediated by cis-elements shared by the genes of each subset. The potential regulation of each subset of genes at both the transcription and mRNA processing levels may explain the functions of PCFS4 in multiple aspects of Arabidopsis development and behavior.
2012
The Arabidopsis thaliana ortholog of the 30-kD subunit of the mammalian Cleavage and Polyadenylation Specificity Factor (CPSF30) has been implicated in the responses of plants to oxidative stress, suggesting a role for alternative polyadenylation. To better understand this, poly(A) site choice was studied in a mutant (oxt6) deficient in CPSF30 expression using a genome-scale approach. The results indicate that poly(A) site choice in a large majority of Arabidopsis genes is altered in the oxt6 mutant. A number of poly(A) sites were identified that are seen only in the wild type or oxt6 mutant. Interestingly, putative polyadenylation signals associated with sites that are seen only in the oxt6 mutant are decidedly different from the canonical plant polyadenylation signal, lacking the characteristic A-rich near-upstream element (where AAUAAA can be found); this suggests that CPSF30 functions in the handling of the near-upstream element. The sets of genes that possess sites seen only in the wild type or mutant were enriched for those involved in stress and defense responses, a result consistent with the properties of the oxt6 mutant. Taken together, these studies provide new insights into the mechanisms and consequences of CPSF30-mediated alternative polyadenylation.