TSSAR: Transcription Start Site Annotation Regime Web Service
Introduction to TSSAR statistics
TSSAR is mainly used to analyze differential RNA-seq experimental data. In such an experiment, transcription start sites (TSS) are enriched in a TEX-treated library compared to an untreated library. TEX specifically degrades RNA fragments that are not protected by a triphosphate at their 5' end; a 5' triphosphate is characteristic of fragments originating from primary transcription starts. Since the depletion is not infallible, not every signal represents a genuine TSS. Hence, a statistical analysis has to be performed to discriminate significantly enriched positions from background noise. The following sections briefly illustrate how TSSAR pursues this goal.
Method
Background Modeling
To account for the varying transcription dynamics across the genome, each site is evaluated in the context of its local surroundings using a sliding window approach.
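The sliding window idea can be sketched in a few lines of Python. This is only an illustration, not the TSSAR implementation: the function name `sliding_windows` and the default values for `window_size` and `step` are assumptions chosen for the example.

```python
def sliding_windows(counts, window_size=1000, step=500):
    """Yield (start, window) pairs that cover a list of per-position counts.

    window_size and step are illustrative defaults (assumptions), not
    values taken from TSSAR. The last window is shortened so that it
    ends exactly at the end of the input.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    n = len(counts)
    if n == 0:
        return
    start = 0
    while True:
        end = min(start + window_size, n)
        yield start, counts[start:end]
        if end == n:
            break
        start += step


if __name__ == "__main__":
    # Toy read start counts for ten genomic positions.
    toy = [0, 0, 3, 1, 0, 7, 2, 0, 0, 1]
    for start, window in sliding_windows(toy, window_size=4, step=2):
        print(start, window)
```

With overlapping windows, each position is covered by more than one window, so a real pipeline has to decide how to combine the per-window results for the same position. That step is left out here.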
RNA-seq read start counts within a transcribed genomic region are count data and can therefore be described by a Poisson distribution. An arbitrary region of the genome is more complex, since it may be a mixture of transcribed and non-transcribed sections. While the former can be described by a Poisson distribution, the latter is ideally expected to consist of zeros only. To estimate the parameter that describes the Poisson part alone, namely its mean, TSSAR applies a so-called zero-inflated Poisson regression model. This can be seen as estimating the number of zeros which, together with all observed non-zero values, form a proper Poisson distribution. All excess zeros are assumed to originate from non-transcribed regions and are removed from the sample. Finally, the mean of the remaining sample (the transcriptionally active part of the original sample) is determined. This mean is the parameter lambda (λ) describing the background distribution of the transcribed part of the considered window. The procedure is applied to both the treated and the untreated library.
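The estimation of lambda can be illustrated with a small, self-contained sketch. It does not reproduce the zero-inflated Poisson regression used by TSSAR. As a simplifying assumption it fits a zero-truncated Poisson to the non-zero counts, which answers the same question: how many zeros are needed to complete the observed non-zero values to a Poisson sample, and what is the mean of that completed sample? The function names are hypothetical.

```python
import math


def estimate_lambda(counts, tol=1e-12, max_iter=200):
    """Estimate the Poisson mean of the transcribed part of a window.

    Assumption: the non-zero counts are treated as a zero-truncated
    Poisson sample, so lambda solves
        mean(non-zero counts) = lambda / (1 - exp(-lambda)).
    The equation is solved by bisection.
    """
    positives = [c for c in counts if c > 0]
    if not positives:
        return 0.0
    m = sum(positives) / len(positives)
    if m <= 1.0:
        # All non-zero counts equal 1: the estimate degenerates to 0.
        return 0.0
    lo, hi = 1e-12, m  # the root lies below m because lambda/(1-exp(-lambda)) > lambda
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if mid / -math.expm1(-mid) < m:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0


def expected_zeros(n_positive, lam):
    """Number of zeros that completes n_positive non-zero values to a Poisson sample."""
    if lam <= 0.0:
        return float("inf")
    return n_positive / math.expm1(lam)


if __name__ == "__main__":
    # Toy window: 30 zeros (mostly non-transcribed) plus 10 non-zero counts.
    window = [0] * 30 + [1, 2, 3, 2, 1, 4, 2, 3, 1, 2]
    lam = estimate_lambda(window)
    kept = expected_zeros(10, lam)
    print("lambda:", lam)
    print("zeros kept:", kept, "excess zeros removed:", 30 - kept)
```

By construction, the mean of the non-zero values together with the retained zeros equals the estimated lambda, which mirrors the "mean of the remaining sample" step described above. The estimate does not depend on how many excess zeros the window contains.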
TSS Annotation
In the next step, given the expected background, TSSAR aims to find positions where the signal in the TEX-treated library is significantly enriched compared to the untreated library, taking into account the variability expected from the background model. A straightforward approach is to consider, for each position, the difference between the treated and the untreated library. Since the counts in both libraries are modeled as Poisson distributed, their difference is expected to follow a Skellam distribution. The shape and position of this distribution are characterized by the two lambda (λ) parameters deduced in the previous step. Each observed difference can then be evaluated for how well it fits this background model. Given a P-value cutoff alpha, all positions whose difference has a probability below alpha of arising from the background Skellam distribution are annotated as TSS.
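The Skellam test can be sketched as follows. This is a minimal illustration under stated assumptions, not the TSSAR code: it uses a one-sided upper-tail P-value (enrichment in the treated library), it takes the two lambda values as given inputs, and the function names and the default `alpha` are hypothetical. The Skellam tail probability is computed directly from its definition as the difference of two independent Poisson variables, using only the Python standard library.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam), computed in log space."""
    if lam == 0.0:
        return 1.0 if x == 0 else 0.0
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def poisson_upper_tail(t, lam):
    """P(X >= t) for X ~ Poisson(lam), summed until terms become negligible."""
    if t <= 0:
        return 1.0
    total, x = 0.0, t
    while True:
        term = poisson_pmf(x, lam)
        total += term
        if x > lam and term < 1e-16 * max(total, 1e-300):
            return total
        x += 1


def skellam_upper_tail(k, lam_treated, lam_untreated):
    """P(X1 - X2 >= k) with X1 ~ Poisson(lam_treated), X2 ~ Poisson(lam_untreated)."""
    total, j = 0.0, 0
    while True:
        weight = poisson_pmf(j, lam_untreated)
        total += weight * poisson_upper_tail(j + k, lam_treated)
        if j > lam_untreated and weight < 1e-16:
            return total
        j += 1


def call_tss(treated, untreated, lam_treated, lam_untreated, alpha=0.001):
    """Return (position, P-value) pairs whose P-value is below alpha.

    alpha is an illustrative default (assumption).
    """
    hits = []
    for pos, (t, u) in enumerate(zip(treated, untreated)):
        p = skellam_upper_tail(t - u, lam_treated, lam_untreated)
        if p < alpha:
            hits.append((pos, p))
    return hits


if __name__ == "__main__":
    treated = [2, 1, 3, 40, 2]
    untreated = [2, 2, 1, 1, 3]
    for pos, p in call_tss(treated, untreated, lam_treated=2.0, lam_untreated=2.0):
        print("TSS candidate at position", pos, "P =", p)
```

In this toy example only the position with a large excess of treated reads is reported. Library normalization and the combination of overlapping windows are deliberately not covered by the sketch.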
Details
A more detailed description of the TSSAR algorithm can be found in our manuscript, which has been submitted to the peer-reviewed journal BMC Bioinformatics. We will give the corresponding reference here once it is published.
Stand-alone Version
The algorithm illustrated above is also available as a stand-alone version. It is implemented in Perl and depends on R, a free software environment for statistical computing and graphics. The TSSAR stand-alone tool is available for download.