Alternative polyadenylation (APA), in which a transcript uses one of several poly(A) sites to define its 3'-end, is a common regulatory mechanism in eukaryotic gene expression. However, the potential of APA in determining crop agronomic traits remains elusive. This study systematically tallied poly(A) sites of 14 different rice tissues and developmental stages using the poly(A) tag sequencing (PAT-seq) approach. The results indicate significant involvement of APA in developmental and quantitative trait loci (QTL) gene expression. About 48% of all expressed genes use APA to generate transcriptomic and proteomic diversity. Some genes switch APA sites, allowing differentially expressed genes to use alternate 3' UTRs. Interestingly, APA in mature pollen is distinct: a set of poly(A) factors is differentially expressed and APA sites are distributed differently, indicating a unique regulation of mRNA 3'-end formation during gametophyte development. Equally interesting, statistical analyses showed that QTL genes tend to use APA to regulate the expression of genes underlying many agronomic traits, suggesting a potentially important role of APA in rice production. These results provide the most comprehensive and high-resolution resource to date for advanced analysis of APA in crops and shed light on how APA is associated with trait formation in eukaryotes.
Publications
2016
Alternative polyadenylation (APA) is an important layer of gene regulation that produces mRNAs that have different 3' ends and/or encode diverse protein isoforms. Up to 70% of annotated genes in plants undergo APA. Increasing numbers of poly(A) sites collected in various plant species demand new methods and tools to access and mine these data. We have created an open-access web service called PlantAPA (http://bmi.xmu.edu.cn/plantapa) to visualize and analyze genome-wide poly(A) sites in plants. PlantAPA provides various interactive and dynamic graphics and seamlessly integrates a genome browser that can profile heterogeneous cleavage sites and quantify expression patterns of poly(A) sites across different conditions. Particularly, through PlantAPA, users can analyze poly(A) sites in extended 3' UTR regions, intergenic regions, and ambiguous regions owing to alternative transcription or RNA processing. In addition, it also provides tools for analyzing poly(A) site selections, 3' UTR lengthening or shortening, non-canonical APA site switching, and differential gene expression between conditions, making it more powerful for the study of APA-mediated gene expression regulation. More importantly, PlantAPA offers a bioinformatics pipeline that allows users to upload their own short reads or ESTs for poly(A) site extraction, enabling users to further explore poly(A) site selection using stored PlantAPA poly(A) sites together with their own poly(A) site datasets. To date, PlantAPA hosts the largest database of APA sites in plants, including Oryza sativa, Arabidopsis thaliana, Medicago truncatula, and Chlamydomonas reinhardtii. As a user-friendly web service, PlantAPA will be a valuable addition to the community of biologists studying APA mechanisms and gene expression regulation in plants.
2015
Polyadenylation [poly(A)] is an essential posttranscriptional processing step in the maturation of eukaryotic mRNA. The advent of next-generation sequencing (NGS) technology has offered feasible means to generate large-scale data and new opportunities for intensive study of polyadenylation, particularly deep sequencing of the transcriptome targeting the junction of the 3'-UTR and the poly(A) tail of the transcript. To take advantage of this unprecedented amount of data, we present an automated workflow to identify polyadenylation sites by integrating NGS data cleaning, processing, mapping, normalizing, and clustering. In this pipeline, a series of Perl scripts are seamlessly integrated to iteratively map the single- or paired-end sequences to the reference genome. After mapping, the poly(A) tags (PATs) at the same genome coordinate are grouped into one cleavage site, and internal priming artifacts are removed. Then the ambiguous region is introduced to parse the genome annotation for cleavage site clustering. Finally, cleavage sites that lie within 24 nucleotides of each other, including sites from different samples, are clustered into poly(A) clusters. This procedure can be used to identify thousands of reliable poly(A) clusters from millions of NGS sequences in different tissues or treatments.
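The grouping and clustering steps described above can be sketched in a few lines of Python. This is only an illustration, not the published Perl pipeline: the function names, the tuple layout of a PAT, and the reading of "within 24 nucleotides" as the distance between consecutive sites on the same chromosome and strand are all assumptions made for this example.

```python
from collections import Counter

def group_pats(pats):
    """Count PATs per cleavage site, keyed by (chromosome, strand, position)."""
    return Counter(pats)

def cluster_sites(site_counts, max_gap=24):
    """Merge cleavage sites into poly(A) clusters.

    Assumption: a site joins the current cluster when it is on the same
    chromosome and strand and lies within max_gap nt of the previous site.
    """
    clusters = []
    for (chrom, strand, pos), count in sorted(site_counts.items()):
        last = clusters[-1] if clusters else None
        if (last is not None and last["chrom"] == chrom
                and last["strand"] == strand
                and pos - last["end"] <= max_gap):
            last["end"] = pos          # extend the open cluster
            last["count"] += count     # pool PAT counts
        else:
            clusters.append({"chrom": chrom, "strand": strand,
                             "start": pos, "end": pos, "count": count})
    return clusters

# Hypothetical PATs pooled from several samples.
pats = [("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "+", 110),
        ("chr1", "+", 200), ("chr1", "-", 105)]
clusters = cluster_sites(group_pats(pats))
```

Pooling PATs from all samples before clustering gives every sample the same cluster coordinates, which makes per-sample expression counts directly comparable.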
The NCBI manages the SRA (Sequence Read Archive) database to store RNA-Seq data generated from different NGS technologies. With an ever-increasing number of finished and ongoing genome and transcriptome sequencing projects, the data in SRA expand rapidly and present a treasure trove for mining useful information to facilitate our understanding of biological issues like mRNA 3'-end formation and alternative polyadenylation. We developed a bioinformatics pipeline that can process raw SRA sequence data and obtain high-quality poly(A) sites and poly(A) cluster sites with detailed expression information. This pipeline is designed to be generic and can be utilized for polyadenylation studies in any eukaryotic species.
Messenger RNA in eukaryotic cells is initially produced as a nascent transcript (pre-mRNA) without a polyadenosine [poly(A)] tail at the 3' end. The precise cleavage of the pre-mRNA and addition of a poly(A) tract require communication between cis-elements in the pre-mRNA sequence and the trans-acting protein factors that recognize them. Based on homology analyses, the Arabidopsis cleavage and polyadenylation specificity factor (AtCPSF) complex should play a critical role in pre-mRNA 3' end processing. Here we describe the isolation of the AtCPSF complex by using a tandem affinity purification (TAP) method. We demonstrate that TAP is a potent approach for isolating protein complexes that is suitable for downstream protein identification based on mass spectrometry techniques.
Messenger RNA polyadenylation in eukaryotes marks the end of a transcript, and the process is associated with transcription termination. Increasing evidence reveals the potential of gene expression regulation through alternative polyadenylation. The site of poly(A) addition is defined by poly(A) signals that reside in the transcribed pre-mRNA. To gain further insight into poly(A) signals and their functions in defining alternative polyadenylation sites that lie within different genomic regions, SignalSleuth2 was developed to extract and analyze cis-elements from a set of data with known poly(A) sites. After obtaining the sequences surrounding the poly(A) sites, the program performs an exhaustive search for short sequence motifs of variable size within a specified range of nucleotide positions and ranks the detected motifs by their occurrence frequencies. It also has new functions, including Position-Specific Scoring Matrix (PSSM) score calculation and multiple scanning modes. This program is powerful in revealing underlying sequence motifs surrounding any target regions in a given dataset.
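The two ideas named above, exhaustive motif counting in a positional window and PSSM scoring, can be illustrated with a short Python sketch. It is not SignalSleuth2's code: the function names, the uniform 0.25 background, the pseudocount, and the restriction to the letters A, C, G, and T are assumptions chosen for this example.

```python
import math
from collections import Counter

def count_motifs(seqs, k, start, end):
    """Count every k-mer lying fully inside positions [start, end) of each
    sequence and return (motif, count) pairs ranked by frequency."""
    counts = Counter()
    for seq in seqs:
        window = seq[start:end]
        for i in range(len(window) - k + 1):
            counts[window[i:i + k]] += 1
    return counts.most_common()

def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Build a log2-odds PSSM from aligned, equal-length sites (ACGT only)."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        column = Counter(site[i] for site in sites)
        pssm.append({base: math.log2((column[base] + pseudocount)
                                     / total / background)
                     for base in "ACGT"})
    return pssm

def pssm_score(pssm, seq):
    """Sum the per-position log-odds scores of seq against the PSSM."""
    return sum(column[base] for column, base in zip(pssm, seq))

# Hypothetical sequences upstream of two poly(A) sites.
ranked = count_motifs(["AATAAA", "TTAATAAA"], k=6, start=0, end=8)
pssm = build_pssm(["AATAAA", "AATAAA", "AATGAA"])
```

Calling `count_motifs` for several values of `k` reproduces the variable motif size search, and a higher `pssm_score` means a sequence looks more like the aligned sites than like background.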
Alternative polyadenylation has been demonstrated as a tier of gene expression regulation in eukaryotes. However, its role has not been elucidated at the cellular level. With techniques to isolate single cells by fluorescence-activated cell sorting (FACS) and laser capture microdissection, analysis of alternative polyadenylation in specific cell types becomes possible. We present a method to generate poly(A) tags for high-throughput sequencing (PAT-seq) libraries from very low amounts of total RNA. This protocol targets the junction of the 3'-UTR and poly(A) tail of transcripts. Ten nanograms of total RNA isolated from the FACS-sorted cells was reverse-transcribed to double-stranded cDNA with an anchored oligo dT(18) primer containing maximal T7 promoter sequence. Then, an RNA amplification step using in vitro transcription by T7 RNA polymerase was carried out. The resulting cRNA was fragmented by partial digestion. First-strand synthesis was carried out by using a partial adaptor sequence with a random 9-nt primer to introduce the adaptor at the 5' end. An anchored oligo dT primer containing the adaptor sequence on the 3' end was introduced through second-strand cDNA synthesis. This new method has been applied to investigate polyadenylation using nanogram amounts of total RNA from Arabidopsis cells.
Messenger RNA polyadenylation is one of the essential processing steps during eukaryotic gene expression. The site of polyadenylation [poly(A) site] marks the end of a transcript, which is also the end of a gene in most cases. A computational program that is able to recognize poly(A) sites would be useful not only for finding gene ends in genome annotation, but also for predicting alternative poly(A) sites. PASS [Poly(A) Site Sleuth] and PAC [Poly(A) site Classifier] were developed to predict poly(A) sites in plants. PASS was built on the Generalized Hidden Markov Model (GHMM) and consists of four functional modules: input module, poly(A) site recognition module, graphic process module, and output module. PAC is a classification model that integrates several features defining poly(A) sites, including K-gram pattern, Z-curve, position-specific scoring matrix, and first-order inhomogeneous Markov sub-model. PAC can be used to predict poly(A) sites from species whose polyadenylation profile is unknown. PASS and PAC each output a few files, one of which contains the score or probability of being a poly(A) site for each position of a given sequence. While the models were built mostly on poly(A) profile data from Arabidopsis, they are also functional in other higher plants since their profiles are quite similar.
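Of the features listed for PAC, the Z-curve is the least self-explanatory, so a minimal Python sketch of the standard Z-curve transform follows. It shows only the general definition (cumulative purine/pyrimidine, amino/keto, and weak/strong hydrogen bond contrasts); how PAC turns these coordinates into classifier features is not stated in the abstract, and the function name and error handling here are assumptions.

```python
# Per-base steps along the three Z-curve axes:
# x: purine (A, G) vs pyrimidine (C, T)
# y: amino (A, C) vs keto (G, T)
# z: weak H-bond (A, T) vs strong H-bond (C, G)
STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}

def z_curve(seq):
    """Return the cumulative (x, y, z) Z-curve coordinates of a sequence.

    U is treated as T; any other non-ACGT symbol raises ValueError.
    """
    x = y = z = 0
    points = []
    for base in seq.upper().replace("U", "T"):
        if base not in STEPS:
            raise ValueError(f"unsupported base: {base!r}")
        dx, dy, dz = STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points

# Hypothetical A-rich sequence around a poly(A) signal.
points = z_curve("AATAAA")
```

Because the three coordinates together determine the cumulative count of each base, the transform loses no information while exposing composition biases as geometric trends.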
Genome-wide studies revealed the prevalence of multiple transcripts resulting from alternative polyadenylation (APA) of a single given gene in higher eukaryotes. Several studies in the past few years attempted to address how those APA events are regulated and what the biological consequences of those regulations are. Common to these efforts is the comparison of unbiased transcriptome data, derived from either whole-genome tiling arrays or next-generation sequencing, to identify the specific APA events in a given condition. RADPRE (Ratio-based Analysis of Differential mRNA Processing and Expression) is an R program developed to serve such a purpose using data from whole-genome tiling arrays. RADPRE takes a set of tiling array data as input and performs a series of calculations, including a correction of the probe affinity variation, a hierarchy of statistical tests, and an estimation of the false discovery rate (FDR) of the differentially processed genes (DPG). The result is an output of a few tabular files including DPG and their corresponding FDR. This chapter is written for scientists with limited programming experience.
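For readers unfamiliar with FDR estimation, the idea can be illustrated with the Benjamini-Hochberg adjustment, sketched below in Python. The abstract does not say which FDR procedure RADPRE uses, so Benjamini-Hochberg is a stand-in chosen for this example, and the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank, then made monotone by taking the
    running minimum from the largest p-value downward.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values from a differential processing test.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Genes whose adjusted value falls below a chosen threshold, such as 0.05, would be reported as differentially processed at that FDR level.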