catRAPID accepts protein sequences with a length between 50 and 750 amino acids and RNA sequences between 50 and 1200 nucleotides. If the protein sequence exceeds these limits, the user can perform interaction predictions using the catRAPID fragments and catRAPID omics (with domain selection) modules. If the RNA sequence, or both query sequences, exceed these limits, the user can perform interaction predictions using the catRAPID fragments module.
There is no limit on the number of jobs that can be submitted. However, all submitted jobs are queued for execution, and computation can take some time.
Submission results are stored on the server for one week and then deleted automatically. Please remember to download the results to your local system within this time frame.
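The length rules above can be summarized as a small pre-submission check. This is only a sketch: the function name `suggest_module` and the returned strings are hypothetical, and the limits are assumed to be inclusive.

```python
# Length limits taken from the text above (assumed inclusive).
PROTEIN_RANGE = (50, 750)   # amino acids
RNA_RANGE = (50, 1200)      # nucleotides


def suggest_module(protein_len: int, rna_len: int) -> str:
    """Suggest which module fits the given sequence lengths (hypothetical helper)."""
    p_min, p_max = PROTEIN_RANGE
    r_min, r_max = RNA_RANGE
    if protein_len < p_min or rna_len < r_min:
        return "too short"
    protein_ok = protein_len <= p_max
    rna_ok = rna_len <= r_max
    if protein_ok and rna_ok:
        return "catRAPID"
    if rna_ok:
        # Only the protein exceeds the limit.
        return "catRAPID fragments or catRAPID omics (domain selection)"
    # The RNA, or both sequences, exceed the limits.
    return "catRAPID fragments"


if __name__ == "__main__":
    print(suggest_module(300, 800))
    print(suggest_module(900, 800))
    print(suggest_module(300, 5000))
```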
The Interaction Propensity is a measure of the interaction probability between one protein (or region) and one RNA (or region). This measure is based on the observed tendency of the components of ribonucleoprotein complexes to exhibit specific properties in their physico-chemical profiles, which can be used to make a prediction. The Discriminative Power (DP) is a statistical measure introduced to evaluate the Interaction Propensity with respect to the catRAPID training set. It represents the confidence of the prediction and ranges from 0% (unpredictability) to 100% (predictability). DP values above 50% indicate that the interaction is likely to take place, whereas DP values above 75% represent high-confidence predictions. In the catRAPID fragments module, the Normalized Score is the Interaction Propensity normalized with the mean and standard deviation of all fragments. In the catRAPID omics module, the Z-score is the Interaction Propensity normalized using mean = 23.25 and standard deviation = 37.90, which were calculated on two reference sets.
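The normalizations and DP thresholds described above can be illustrated as follows. This is a minimal sketch, not the server's code: the function names and the returned labels are hypothetical, and the use of the population standard deviation for the fragment normalization is an assumption.

```python
from statistics import mean, pstdev

# Values stated above for the catRAPID omics module.
OMICS_MEAN = 23.25
OMICS_STD = 37.90


def omics_z_score(ip: float) -> float:
    """Z-normalize an Interaction Propensity with the omics reference values."""
    return (ip - OMICS_MEAN) / OMICS_STD


def normalized_scores(fragment_ips: list[float]) -> list[float]:
    """Normalize fragment propensities by their own mean and standard deviation.

    Assumption: population standard deviation (pstdev) is used.
    """
    mu = mean(fragment_ips)
    sigma = pstdev(fragment_ips)
    if sigma == 0:
        raise ValueError("all fragments have the same propensity")
    return [(ip - mu) / sigma for ip in fragment_ips]


def interpret_dp(dp_percent: float) -> str:
    """Map a Discriminative Power value (0-100%) to a label (labels are hypothetical)."""
    if not 0 <= dp_percent <= 100:
        raise ValueError("DP must be between 0 and 100")
    if dp_percent > 75:
        return "high-confidence prediction"
    if dp_percent > 50:
        return "interaction likely"
    return "below threshold"
```

For instance, `omics_z_score(23.25)` is 0 by construction, since the propensity equals the reference mean.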
The catRAPID fragments module was designed to calculate the interaction propensities of fragments of the sequences provided by the user. This tool is helpful for the analysis of sequences whose size exceeds catRAPID requirements. The Fragmentation option allows the user to fragment both protein and RNA sequences (Uniform option) or just the RNA sequence (Weighted option). In particular, the Uniform option generates a maximum of 100 overlapping segments per sequence, according to the sequence length (for a total of up to 10^4 interactions), while the Weighted option generates predicted energetically stable RNA fragments of ~100-200 nucleotides. If the secondary structure of the input sequences (protein domains, RNA stems and loops) is known, users are encouraged to use the catRAPID strength module to predict the binding specificity.
The interaction strength is computed using a reference set composed of 100 random protein and 100 random RNA sequences having the same lengths as the molecules under investigation [Agostini et al. 2012] [Cirillo et al. 2013]. Using the reference set, we generate the cumulative distribution function (CDF) of the Interaction Propensities and measure the score of the protein-RNA pair under investigation. The Strength ranges from 0% (non-specific) to 100% (specific). Strength values above 50% indicate high specificity for the interaction. In addition to the interaction strength (10^4 interactions), we also provide two other types of strength: RNA Strength (Interaction Propensities of all the associations between the protein under investigation and 100 RNA reference sequences) and Protein Strength (Interaction Propensities of all the associations between the RNA under investigation and 100 protein reference sequences).
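One way to read the CDF-based definition above is as an empirical CDF evaluated at the query score. The sketch below assumes that the Strength is the percentage of reference Interaction Propensities that do not exceed the propensity of the pair under investigation. The function name `strength` is hypothetical, and the exact estimator used by the server is not stated here.

```python
from bisect import bisect_right


def strength(query_ip: float, reference_ips: list[float]) -> float:
    """Empirical CDF of the reference propensities at query_ip, in percent.

    Assumption: Strength = 100 * (number of reference IPs <= query IP) / N.
    """
    if not reference_ips:
        raise ValueError("reference set is empty")
    ordered = sorted(reference_ips)
    # bisect_right counts how many reference values are <= query_ip.
    return 100.0 * bisect_right(ordered, query_ip) / len(ordered)
```

The same function would cover the three Strength types by changing which reference propensities are passed in: all reference protein-RNA pairs, the query protein against the reference RNAs, or the query RNA against the reference proteins.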
catRAPID strength is a tool that allows the user to compute the statistical strength of the interaction propensity of a protein-RNA pair with respect to a reference set with the same sequence lengths as the pair under investigation. The reference set is a collection of 100 protein and 100 RNA sequences generated in different ways according to the Reference Set option chosen by the user. A Random reference set is composed of non-redundant protein and RNA sequences in which all amino acids and all nucleotides occur with equal probability (random sequences). A Mutation reference set is produced by replacing a single nucleotide and a single amino acid of the protein-RNA pair under investigation with another nucleotide and amino acid (point-mutated sequences). Once a reference set has been chosen, this pool of reference sequences is used to compute three different Strength types: the Interaction Strength, the RNA Interaction Strength and the Protein Interaction Strength.
The transcriptome sequences of all model organisms have been retrieved from the Ensembl genome database (release 68). The complete proteomes have been downloaded from the UniProtKB database (release 2012_11).
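The two Reference Set options can be sketched as follows. This is an illustration under stated assumptions, not the server's generator: the function names, the fixed seed, and the 20-letter and 4-letter alphabets are hypothetical choices, and non-redundancy is enforced simply by collecting sequences in a set.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (assumed alphabet)
NUCLEOTIDES = "ACGU"                  # RNA alphabet (assumed)


def random_sequence(length: int, alphabet: str, rng: random.Random) -> str:
    """Sequence in which every letter is drawn with equal probability."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def point_mutant(sequence: str, alphabet: str, rng: random.Random) -> str:
    """Copy of the sequence with exactly one position replaced by a different letter."""
    pos = rng.randrange(len(sequence))
    choices = [c for c in alphabet if c != sequence[pos]]
    return sequence[:pos] + rng.choice(choices) + sequence[pos + 1:]


def reference_set(sequence: str, alphabet: str, mode: str,
                  size: int = 100, seed: int = 0) -> list[str]:
    """Build `size` distinct reference sequences with the same length as `sequence`."""
    if mode not in ("random", "mutation"):
        raise ValueError("mode must be 'random' or 'mutation'")
    # Guard against asking for more distinct sequences than can exist.
    if mode == "mutation":
        max_distinct = len(sequence) * (len(alphabet) - 1)
    else:
        max_distinct = len(alphabet) ** len(sequence)
    if size > max_distinct:
        raise ValueError("sequence too short for the requested set size")
    rng = random.Random(seed)
    refs: set[str] = set()
    while len(refs) < size:
        if mode == "random":
            refs.add(random_sequence(len(sequence), alphabet, rng))
        else:
            refs.add(point_mutant(sequence, alphabet, rng))
    return sorted(refs)
```

Calling `reference_set(rna, NUCLEOTIDES, "mutation")` and `reference_set(protein, AMINO_ACIDS, "mutation")` would give the two halves of a Mutation reference set for a given pair.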
Protein and RNA sequences necessary for the predictions are updated regularly. The user can upload custom libraries.
The identification of nucleic acid binding domains is performed by analyzing the protein sequence for Pfam matches annotated with DNA- and RNA-related terms [Finn et al. 2010]. The identification of disordered regions is performed by analyzing the protein sequence with the IUPred algorithm [Dosztányi et al. 2005].
Yes, it is possible to use a personal reference set. Moreover, users can contact us to request support for the analysis of customized reference sets or genomes/proteomes not included in the current version of catRAPID omics.
The star rating system helps the user to rank the results.
The score is the sum of three individual values: 1) catRAPID normalized propensity, 2) presence of RNA/DNA binding domains and disordered regions, and 3) presence of known RNA-binding motifs.
1) catRAPID normalized propensity: the Interaction Propensity is linearly normalized between 0 and 1 and multiplied by the Distribution Score.
2) RNA/DNA binding domains and disordered regions: the protein regions are assigned the following scores: RNA domain = 1, DNA domain = 0.5, disorder = 0.5, and DNA domain + disorder = 1.
3) Known RNA-binding motifs: the presence of a motif is assigned a value of 1, and its absence a value of 0.
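The three components listed above can be combined as in the sketch below. Several details are assumptions: the bounds `ip_min` and `ip_max` used for the linear normalization are not given in the text and appear here as hypothetical parameters, an RNA domain is assumed to take precedence so that the second component never exceeds 1, and all function names are illustrative.

```python
def propensity_component(ip: float, ip_min: float, ip_max: float,
                         distribution_score: float) -> float:
    """Linearly scale the Interaction Propensity to [0, 1], then weight it
    by the Distribution Score (a probability between 0 and 1)."""
    if ip_max <= ip_min:
        raise ValueError("ip_max must be greater than ip_min")
    scaled = (ip - ip_min) / (ip_max - ip_min)
    return scaled * distribution_score


def domain_component(rna_domain: bool, dna_domain: bool, disorder: bool) -> float:
    """RNA domain = 1, DNA domain = 0.5, disorder = 0.5, DNA domain + disorder = 1."""
    if rna_domain:
        return 1.0  # assumption: an RNA domain alone already gives the maximum
    if dna_domain and disorder:
        return 1.0
    if dna_domain or disorder:
        return 0.5
    return 0.0


def motif_component(has_motif: bool) -> float:
    """1 if a known RNA-binding motif is present, 0 otherwise."""
    return 1.0 if has_motif else 0.0


def ranking_score(ip: float, ip_min: float, ip_max: float,
                  distribution_score: float, rna_domain: bool,
                  dna_domain: bool, disorder: bool, has_motif: bool) -> float:
    """Sum of the three components; under these assumptions it lies in [0, 3]."""
    return (propensity_component(ip, ip_min, ip_max, distribution_score)
            + domain_component(rna_domain, dna_domain, disorder)
            + motif_component(has_motif))
```

As a worked example under these assumptions, a propensity halfway between the bounds with a Distribution Score of 0.8 contributes 0.4, a DNA domain plus a disordered region contributes 1, and a known motif contributes 1, for a total of 2.4.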
By considering the distribution of interaction propensities, it is possible to discriminate between RNA-binding and non-nucleic-acid-binding proteins. A support vector machine (LIBSVM 3.17) was trained on 176 positive cases (intersection of two independent experimental measurements [Castello et al. 2012][Baltz et al. 2012]) and 250 negative cases ([Stawiski et al. 2003]). Ten-fold cross-validation yields an accuracy of 86% for the discrimination between positive and negative cases. The Distribution Score is defined as the probability that the query is assigned to the positive set.
A number of large-scale analyses are reported in our main publication. All the top interactions are associated with a p-value < 0.025 (Chi-squared test). In addition, large-scale predictions allow accurate discrimination between RNA-binding and non-nucleic-acid-binding proteins. We randomly extracted 10 RNA-binding proteins (RBPs) from 176 cases ([Castello et al. 2012] [Baltz et al. 2012]) and 10 non-nucleic-acid-binding proteins (NNBPs) from a list of 250 negatives ([Stawiski et al. 2003]) and computed their interaction propensities (IPs) against human RNAs. We repeated this procedure until we reached high coverage (>95%) of the smaller dataset (20 times). From low to high IPs, we measured how many times RBPs are associated with larger scores than NNBPs. We observed a monotonic enrichment ranging from 0.3 (IP: -100) to 0.73 (IP: 200), which indicates that the positive and negative distributions can be efficiently discriminated with an average accuracy of 73%. Furthermore, we employed RBPs and NNBPs to Z-normalize the interaction propensities, which helps to highlight which cases deviate from the average. We found that very few cases are associated with a Z-score higher than 4 (IP: 200), and the statistics are less significant for these extreme values (negative distributions can have a few outliers).
The mapping of the RNA-binding proteins to their associated recognition motifs was obtained from publicly available databases (RBPDB and SpliceAid-F) and from the literature (e.g. Ray et al. 2013 and Hogan et al. 2008). We are constantly updating our knowledge database with information derived from the most recent genome-wide analyses and studies.
Turn the Help ON (top-left corner of the output page) and click on the question marks beside the table headers.