RNAlien is a tool for automatic construction of RNA family models from a single sequence. The sections of this help document cover the use of the webservice and the commandline tool, as well as details of the pipeline used in the backend.
If you look for source code or a installation guide for the tool please refer to the Tool subpage.
Please note that constructions can take up to 24h, if you want to construct multiple families please use the tool instead of the webservice.
This manual is included with the Tool as manual.pdf.


Table of contents:



Webservice:


Input:

RNAlien Webservice accepts a fasta file with a single sequence as obligatory input. Example GlmS fasta file content:
>AARQ02000011.1/391-585
AAUUGAAUAGAAGCGCCAGAACUGAUUGGGACGAAAAUGCUUGAAGGUGAAAUCCCUGAA
AAGUAUCGAUCAGUUGACGAGGAGGAGAUUAAUCGAAGUUUCGGCGGGAGUCUCCCGGCU
GUGCAUGCAGUCGUUAAGUCUUACUUACAAAUCAUUUGGGUGACCAAGUGGACAGAGUAG
UAAUGAAACAUGCUU
The organism where this sequence originates from can either be supplied via the autocomplete function, as name or via the NCBI Taxononomy id. In out current example that would be the organism (Listeria monocytogenes FSL N3-165) or Taxononomy id(393124). The id can be retrieved as described here (Retrieve Taxonomy id).

Output:

RNAlien collects potential family members in multiple iterations. While the construction process continues, the progress and intermediate results are summarized in a table. Each iteration goes through multiple states (loading, sequence search, sequence retrieval, candidate alignment, candidate filtering, query selection, model calibration, done).

Once the construction is finished all result files are available in the result table. The output contains only the sequences (and corresponding organisms) that actually were used to construct the final RNA family model. If you want to find all sequences covered by the result model scan your database of choice with cmsearch.
The organisms of found non-redundant sequences used in the model are visualized in the taxonomic overview.

The evaluation results for the constructed covariance model and the stockholm alignment used to build it are summarized in the evaluation table.

The taxonomic tree of the organsims the included sequences originated from can be found in the zoom and collapsable taxonomy tree

Finally all included sequences are listed in the included sequences table.


Tool:


Input:

The RNAlien command line utility accepts several command line parameters(see table). Please note that RNAlien requires a active and uninterrupted internet connection, which it also verifies before starting.
Parameter Switch Required Description Default value
inputFastaFilePath -i yes Path to input fasta file
outputPath -o yes Path to output directory
inputTaxId -t no NCBI taxonomy ID number of input RNA organism
inputnSCICutoff -z no Only candidate sequences with a nSCI higher than this value are accepted. 1.0
inputEvalueCutoff -e no Evalue cutoff for cmsearch filtering. 0.001
inputBlastDatabase -b no Specify name of blast database to use. nt
coverageFilter -a no Filter blast hits by coverage of at least 80%. True
singleHitperTax -s no Only the best blast hit per taxonomic entry is considered. False
blastsoftmasking -f no Toggles blast softmasking, meaning exclusion of low complexity (repetative) regions in lookup table. True
inputqueryselectionmethod -m no Method for selection of queries (filtering,clustering). filtering
threads -c no Number of available cpu slots/cores. 1
taxonomyRestriction -r no Restrict search space to taxonomic kingdom (bacteria,archea,eukaryia).
sessionIdentificator -d no Optional session id that is used instead of automatically generated one.
Help -? no Print this help

Example call for RNAlien:
RNAlien -i /home/user/newrna.fa -c 5 -t 562 -o /home/user/temp/ -d construction1

This command starts RNAlien with the input fasta file /home/user/newrna.fa, 5 cores, the taxonomy id set to E. coli. RNAlien will create a directory called construction1 in /home/user/temp/. On how to obtain the taxonomy id for the organism the sequence originates from see Retrieve Taxonomy id.

Output:

RNAlien creates a output folder which contains all files used in the construction and log files. The most relevant output files and the logfile are directly deposited in the main folder. Each step of model expansion has an own subfolder. Following is a description of the output folder contents, of the different iteration directories that are created during model construction (empty,initial model construction, model expansion), the blast log directory that is contained in each iteration directory and the Log file content.
The output contains only the non-redundant sequences (and corresponding organisms) that were actually were used to construct the final RNA family model. If you want to find all sequences covered by the result model scan your database of choice with cmsearch.
+RNAlien output folder
|
|--Log file: Summary for construction process, tool versions, iteration info, evaluation results (see Log file)
|--result.cm: Result covariance model
|--result.stockholm: Result stockholm alignment
|--result.fa: Result fasta
|--result.csv
|--evaluation: contains RNAz and cmstat output for result-files
|--log: contains non-iteration specific log files
|--1 iteration directories (see description below)
|--2 ..
|--3 ..


+iterationdirectory: initial model construction
|
|--model.cm: Result covariance model
|--model.stockholm: Result stockholm alignment
|--model.fa: Result fasta
|--log: Raw and processed blast hits and accepted /rejected candidates (see iteration log directory)
|--input.fa: The input fasta sequence |--input.fold: RNAfold output of the input fasta sequence |--1.fa: fasta file for first candidate |--1.alifold: RNAalifold file for the input sequence and the first candidate
|--1.fold: RNAfold output for the first candidate
|--2.fa: Fasta file for the second candidate
|-..



+iterationdirectory: modelexpansion
|
|--model.cm: Result covariance model
|--model.stockholm: Result stockholm alignment
|--model.fa: Result fasta
|--log: Directory that contains raw and processed blast hits (see blastdirectory)
|--1.fa: fasta file for first candidate
|--1.cmsearch
|--2.fa
|--2.cmsearch
|--..


+iteration log directory
|
|--1_1blastOutput: Raw blast output for first query, indicated by leading 1_
|--1_2blastHits: Parsed blast output
|--1_3blastHitsFilteredByLength: Blasthits filtered by exceeding 3* query length
|--1_3ablastHitsFilteredByLength: Blasthits filtered by having >80% coverage
|--1_4blastHitsFilteredByParentTaxId: Only one blasthit per parent taxid
|--1_5filteredBlastResult: Only one blasthit per taxid
|--1_6requestedSequenceElements: Blasthit derived sequences requested from Entrez
|--1_10afullSequencesWithSimilars: Sequences retrieved from Entrez
|--1_10fullSequences: fullSequencesWithSimilars filtered for only containing unique sequences
|--2_1blastOutput: Raw blast output for second query, indicated by leading 2_
|--..
|--11candidates: All query specific sequences merged
|--12candidatesFilteredByCollected: Filter for sequences not identical with collected
|--13selectedCandidates: Sequences selected either by nSCI or cmsearch for inclusion in model by set evalue cutoff
|--14rejectedCandidates: Sequences that were rejected
|--15potentialCandidates: Sequences that are within a 10^3 interval of the set evalue cutoff,
will be reevaluated at end of modelconstruction

Log file

Benchmark


RNAfamilies constructed for benchmark are available as archive:

Rfam id Family name Archive link (.tar.gz)
RF00001 5S rRNA archive
RF00002 5_8S rRNA archive
RF00003 U1 archive
RF00004 U2 archive
RF00005 tRNA archive
RF00008 Hammerhead_3 archive
RF00010 RNaseP_bact_a archive
RF00011 RNaseP_bact_b archive
RF00017 Metazoa_SRP archive
RF00023 tmRNA archive
RF00026 U6 archive
RF00028 Intron_gpI archive
RF00029 Intron_gpII archive
RF00032 Histone3 archive
RF00037 IRE_I archive
RF00044 Phage_pRNA archive
RF00050 FMN archive
RF00059 TPP archive
RF00114 S15 archive
RF00162 SAM archive
RF00164 s2m archive
RF00167 Purine archive
RF00168 Lysine archive
RF00169 Bacteria_small_SRP archive
RF00174 Cobalamin archive
RF00175 HIV-1_DIS archive
RF00177 SSU_rRNA_bacteria archive
RF00207 K10_TLS archive
RF00209 IRES_Pesti archive
RF00234 glmS archive
RF00374 Gammaretro_CES archive
RF00380 ykoK archive
RF00458 IRES_Cripavirus archive
RF00480 HIV_FE archive
RF00500 TCV_H5 archive
RF00504 Glycine archive
RF00843 mir-228 archive
RF00871 mir-689 archive
RF01051 c-di-GMP-I archive
RF01054 preQ1-II archive
RF01073 GP_knot1 archive
RF01118 PK-G12rRNA archive
RF01380 HIV-1_SD archive
RF01510 MFR archive
RF01689 AdoCbl-variant archive
RF01734 crcB archive
RF01786 c-di-GMP-II archive
RF01831 THF archive
RF01852 tRNA-Sec archive
RF01856 Protozoa_SRP archive
RF01857 Archaea_SRP archive
RF01998 group-II-D1D4-1 archive
RF02001 group-II-D1D4-3 archive
RF02095 mir-2985-2 archive
RF02253 IRE_II archive
RF02519 ToxI archive

Values for the specificity and recall plots were computed with the RNAlienStatistics executable and the rnalienstatistics.pl script included in RNAlien.


RNAalien pipeline


RNAlien pipeline


Retrieve Taxnonomy id


To obtain the Taxonomy id go to NCBI Taxonomy.
Enter the organisms name in the Taxonomy field (e.g Escherichia coli) and click search. You are then redirected to a Summary page. Click the organism name. Depending on how specific your search was you can either select from a list of organisms or you are directed to a organism page Escherichia coli str. K-12 substr. MG1655star. You can find the taxonomy id right below the organisms name (Escherichia coli str. K-12 substr. MG1655star Taxonomy ID: 879462).