catRAPID Frequently Asked Questions


Is there any restriction on protein/RNA sequence length?
Is there a maximum number of jobs a user can submit to the catRAPID server?
How long will my submission be stored on the server?
What are the Interaction Propensity, the Discriminative Power, the Normalized Score and the Z-score?
What is the difference between the Uniform and Weighted options in the catRAPID fragments module?
What are the Interaction Strength, the RNA Interaction Strength and the Protein Interaction Strength?
What is the difference between the "Random" and "Mutations" options in the catRAPID strength module?
Which sequence databases are used in the catRAPID omics module?
Are the sequence databases updated automatically?
How have the protein domains been defined in the catRAPID omics module? How have they been selected as nucleic acid binding and disorder regions?
Can I submit a personal reference set or interrogate a different model organism in the catRAPID omics module?
What is the star rating system?
What is the Distribution Score?
What is the accuracy on large-scale predictions?
How did you collect the RNA motifs?
I do not understand the variables in the catRAPID omics module!


Is there any restriction on protein/RNA sequence length?


catRAPID accepts protein sequences of 50 to 750 amino acids and RNA sequences of 50 to 1200 nucleotides. If only the protein sequence exceeds these limits, the user can perform interaction predictions with the catRAPID fragments module or the catRAPID omics module (with domain selection). If the RNA sequence, or both query sequences, exceed these limits, the user can perform interaction predictions with the catRAPID fragments module.
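
The length rules above can be checked before submission. The following is a minimal sketch, not part of the server: the function name `suggest_module` is hypothetical, and the "below the minimum length" result is an assumption, since this FAQ does not say how the server handles sequences shorter than 50 residues.

```python
# Length limits quoted in the answer above.
PROTEIN_RANGE = (50, 750)   # amino acids
RNA_RANGE = (50, 1200)      # nucleotides


def suggest_module(protein: str, rna: str) -> str:
    """Suggest a catRAPID module from the lengths of the two query sequences."""
    p_len, r_len = len(protein), len(rna)
    if p_len < PROTEIN_RANGE[0] or r_len < RNA_RANGE[0]:
        # Assumption: the FAQ does not describe this case.
        return "below the minimum length"
    protein_ok = p_len <= PROTEIN_RANGE[1]
    rna_ok = r_len <= RNA_RANGE[1]
    if protein_ok and rna_ok:
        return "catRAPID"
    if rna_ok:
        # Only the protein is too long.
        return "catRAPID fragments or catRAPID omics (with domain selection)"
    # The RNA, or both sequences, are too long.
    return "catRAPID fragments"
```

For example, a 900-residue protein paired with an 800-nucleotide RNA falls into the second case, while any pair with an RNA longer than 1200 nucleotides falls into the last one.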



Is there a maximum number of jobs a user can submit to the catRAPID server?


There is no limit on the number of jobs that can be submitted. However, the user should keep in mind that all submitted jobs are queued for execution and that computation can take some time.



How long will my submission be stored on the server?


Submission results are stored on the server for one week and are then deleted automatically. Please remember to download the results to your local system within this timeframe.



What are the Interaction Propensity, the Discriminative Power, the Normalized Score and the Z-score?


The Interaction Propensity is a measure of the interaction probability between one protein (or region) and one RNA (or region). It is based on the observed tendency of the components of ribonucleoprotein complexes to exhibit specific physico-chemical properties, which can be used to make a prediction. The Discriminative Power (DP) is a statistical measure introduced to evaluate the Interaction Propensity with respect to the catRAPID training set. It represents the confidence of the prediction and ranges from 0% (unpredictability) to 100% (predictability). DP values above 50% indicate that the interaction is likely to take place, whereas DP values above 75% represent high-confidence predictions. In the catRAPID fragments module, the Normalized Score is the Interaction Propensity normalized with the mean and standard deviation of all fragments. In the catRAPID omics module, the Z-score is the Interaction Propensity normalized with a mean of 23.25 and a standard deviation of 37.90, which were calculated on two reference sets.
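
The two normalizations described above can be written out as a short sketch. The function names are hypothetical, and the use of the sample standard deviation for the fragments is an assumption: this FAQ does not say whether the sample or the population formula is used.

```python
from statistics import mean, stdev

# Reference values quoted above for the catRAPID omics module.
OMICS_MEAN = 23.25
OMICS_SD = 37.90


def omics_z_score(interaction_propensity: float) -> float:
    """Z-score of one Interaction Propensity against the fixed reference values."""
    return (interaction_propensity - OMICS_MEAN) / OMICS_SD


def normalized_scores(fragment_propensities: list[float]) -> list[float]:
    """Normalize fragment propensities by their own mean and standard deviation."""
    mu = mean(fragment_propensities)
    sd = stdev(fragment_propensities)  # assumption: sample standard deviation
    return [(ip - mu) / sd for ip in fragment_propensities]
```

With these reference values, an Interaction Propensity of 61.15 lies one standard deviation above the mean, so its Z-score is approximately 1.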



What is the difference between the Uniform and Weighted options in the catRAPID fragments module?


The catRAPID fragments module was designed to calculate the interaction propensities of fragments of the sequences provided by the user. This tool is helpful for analyzing sequences whose size exceeds catRAPID requirements. The Fragmentation option allows the user to fragment both the protein and the RNA sequence (Uniform option) or just the RNA sequence (Weighted option). In particular, the Uniform option generates a maximum of 100 overlapping segments per sequence, according to the sequence length (a total of 10^4 interactions), while the Weighted option generates predicted energetically stable RNA fragments of ~100-200 nucleotides. If the secondary structure of the input sequences (protein domains, RNA stems and loops) is known, users are encouraged to use the catRAPID strength module to predict the binding specificity.
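
To illustrate the idea of cutting a sequence into at most 100 overlapping segments, here is a rough sketch. It is not the server's algorithm: the function name, the default window of 50 residues, and the rule of roughly half-overlapping windows are all assumptions, because this FAQ does not state the window lengths or the overlap used.

```python
import math


def overlapping_fragments(sequence: str, window: int = 50, max_fragments: int = 100):
    """Return (start, fragment) pairs for evenly spaced, overlapping windows."""
    if window < 2 or max_fragments < 1:
        raise ValueError("window must be >= 2 and max_fragments >= 1")
    length = len(sequence)
    # Assumption: widen the window for long sequences so that the capped
    # number of half-overlapping windows still covers the whole sequence.
    window = max(window, math.ceil(2 * length / (max_fragments + 1)))
    if length <= window:
        return [(0, sequence)]
    last_start = length - window
    step = window // 2  # neighbouring windows overlap by about half
    count = min(max_fragments, math.ceil(last_start / step) + 1)
    # Spread the start positions evenly from 0 to the last possible start.
    starts = [i * last_start // (count - 1) for i in range(count)]
    return [(s, sequence[s:s + window]) for s in starts]
```

Fragmenting both the protein and the RNA this way gives at most 100 x 100 = 10^4 fragment pairs, which matches the total number of interactions quoted above.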



What are the Interaction Strength, the RNA Interaction Strength and the Protein Interaction Strength?


The Interaction Strength is computed using a reference set composed of 100 random protein and 100 random RNA sequences having the same lengths as the molecules under investigation [Agostini et al. 2012] [Cirillo et al. 2013]. Using the reference set, we generate the cumulative distribution function (CDF) of the Interaction Propensities and measure the score of the protein-RNA pair under investigation. The Strength ranges from 0% (non-specific) to 100% (specific). Strength values above 50% indicate high specificity for the interaction. In addition to the Interaction Strength (10^4 interactions), we provide two other types of strength: the RNA Strength (Interaction Propensities of all the associations between the protein under investigation and the 100 RNA reference sequences) and the Protein Strength (Interaction Propensities of all the associations between the RNA under investigation and the 100 protein reference sequences).
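
Reading a score off an empirical CDF, as described above, can be sketched as follows. The function name is hypothetical, the propensities themselves are assumed to come from catRAPID (they are not computed here), and counting reference scores less than or equal to the query is an assumption about how ties are handled.

```python
from bisect import bisect_right


def interaction_strength(query_score: float, reference_scores: list[float]) -> float:
    """Percentage of reference propensities that do not exceed the query score."""
    if not reference_scores:
        raise ValueError("the reference set must not be empty")
    ordered = sorted(reference_scores)
    # bisect_right counts the reference values <= query_score.
    return 100.0 * bisect_right(ordered, query_score) / len(ordered)
```

The same function covers all three strength types: pass the 10^4 reference-pair propensities for the Interaction Strength, or the 100 propensities against the RNA or protein reference sequences for the RNA Strength and Protein Strength.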



What is the difference between the "Random" and "Mutations" options in the catRAPID strength module?


catRAPID strength is a tool that allows the user to compute the statistical strength of the interaction propensity of a protein-RNA pair with respect to a reference set whose sequences have the same lengths as the pair under investigation. The reference set is a collection of 100 protein and 100 RNA sequences generated in different ways according to the Reference Set option chosen by the user. A Random reference set is composed of non-redundant random sequences in which every amino acid (for proteins) or nucleotide (for RNAs) is equally probable at each position. A Mutations reference set is produced by replacing a single nucleotide of the RNA, or a single amino acid of the protein, under investigation with a different one (point-mutated sequences). Once a reference set has been chosen, this pool of reference sequences is used to compute three different Strength types: the Interaction Strength, the RNA Interaction Strength and the Protein Interaction Strength.
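
The two kinds of reference set can be sketched as below. This only illustrates the description above and is not the server's code: the function names, the seeding, and the use of the 20 standard amino acids and the four RNA bases as alphabets are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues
NUCLEOTIDES = "ACGU"                  # assumption: RNA alphabet


def random_sequence(length: int, alphabet: str, rng: random.Random) -> str:
    """Sequence in which every symbol is equally probable at each position."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def point_mutant(sequence: str, alphabet: str, rng: random.Random) -> str:
    """Copy of `sequence` with exactly one position changed to another symbol."""
    pos = rng.randrange(len(sequence))
    choices = [c for c in alphabet if c != sequence[pos]]
    return sequence[:pos] + rng.choice(choices) + sequence[pos + 1:]


def reference_set(query: str, alphabet: str, mode: str, size: int = 100, seed: int = 0):
    """Build `size` non-redundant reference sequences of the same length as `query`."""
    if mode == "random":
        possible = len(alphabet) ** len(query)
    elif mode == "mutations":
        possible = len(query) * (len(alphabet) - 1)
    else:
        raise ValueError("mode must be 'random' or 'mutations'")
    if size > possible:
        raise ValueError("query too short for the requested number of sequences")
    rng = random.Random(seed)
    refs = set()  # a set keeps the collection non-redundant
    while len(refs) < size:
        if mode == "random":
            refs.add(random_sequence(len(query), alphabet, rng))
        else:
            refs.add(point_mutant(query, alphabet, rng))
    return sorted(refs)
```

Calling `reference_set(rna, NUCLEOTIDES, "mutations")` and `reference_set(protein, AMINO_ACIDS, "mutations")` would give the two halves of a Mutations reference set; replacing the mode with `"random"` gives a Random one.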



Which sequence databases are used in the catRAPID omics module?


The transcriptome sequences of all model organisms were retrieved from the Ensembl genome database (release 68). The complete proteomes were downloaded from the UniProtKB database (release 2012_11).



Are the sequence databases updated automatically?


Protein and RNA sequences necessary for the predictions are updated regularly. The user can upload custom libraries.



How have the protein domains been defined in the catRAPID omics module? How have they been selected as nucleic acid binding and disorder regions?


Nucleic acid binding domains are identified by scanning the protein sequence for Pfam matches annotated with DNA- and RNA-related terms [Finn et al. 2010]. Disordered regions are identified by analyzing the protein sequence with the IUPred algorithm [Dosztányi et al. 2005].



Can I submit a personal reference set or interrogate a different model organism in the catRAPID omics module?


Yes, it is possible to use a personal reference set. Moreover, users can contact us to request support for the analysis of customized reference sets or genomes/proteomes not included in the current version of catRAPID omics.



What is the star rating system?


The star rating system helps the user to rank the results. The score is the sum of three individual values: 1) catRAPID normalized propensity, 2) presence of RNA/DNA binding domains and disordered regions, and 3) presence of known RNA-binding motifs.
1) catRAPID normalized propensity: the Interaction Propensity is linearly normalized between 0 and 1 and multiplied by the Distribution Score.
2) RNA/DNA binding domains and disordered regions: the protein regions are assigned the following scores: RNA domain = 1, DNA domain = 0.5, disorder = 0.5, and DNA domain + disorder = 1.
3) Known RNA-binding motifs: the presence of a motif is assigned a value of 1, and its absence a value of 0.
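
The three components listed above add up as in the following sketch. The function names are hypothetical. The bounds `ip_min` and `ip_max` of the linear normalization are not given in this FAQ, so they are left as parameters, and clamping to the 0-1 range is an assumption. The FAQ also does not list a score for an RNA domain combined with disorder; the sketch assumes the region score stays at 1 in that case.

```python
def normalized_propensity(ip: float, ip_min: float, ip_max: float,
                          distribution_score: float) -> float:
    """Component 1: propensity scaled linearly to 0-1, times the Distribution Score."""
    scaled = (ip - ip_min) / (ip_max - ip_min)
    scaled = min(1.0, max(0.0, scaled))  # assumption: clamp values outside the bounds
    return scaled * distribution_score


def region_score(rna_domain: bool, dna_domain: bool, disorder: bool) -> float:
    """Component 2: RNA domain = 1, DNA domain = 0.5, disorder = 0.5, DNA + disorder = 1."""
    if rna_domain:
        return 1.0  # assumption: an RNA domain alone already gives the maximum
    return 0.5 * dna_domain + 0.5 * disorder


def star_score(ip: float, ip_min: float, ip_max: float, distribution_score: float,
               rna_domain: bool, dna_domain: bool, disorder: bool,
               has_motif: bool) -> float:
    """Sum of the three components; the result lies between 0 and 3."""
    motif_score = 1.0 if has_motif else 0.0  # component 3
    return (normalized_propensity(ip, ip_min, ip_max, distribution_score)
            + region_score(rna_domain, dna_domain, disorder)
            + motif_score)
```

For instance, a propensity halfway between the bounds with a Distribution Score of 0.8 contributes 0.4; a DNA-binding domain plus a disordered region adds 1, and a known motif adds 1 more.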



What is the Distribution Score?


By considering the distribution of interaction propensities, it is possible to discriminate between RNA-binding and non-nucleic-acid-binding proteins. A support vector machine (LIBSVM 3.17) was trained on 176 positive cases (the intersection of two independent experimental measurements [Castello et al. 2012] [Baltz et al. 2012]) and 250 negative cases [Stawiski et al. 2003]. Ten-fold cross-validation yields an accuracy of 86% for the discrimination between positive and negative cases. The Distribution Score is defined as the probability that the query is assigned to the positive set.



What is the accuracy on large-scale predictions?


A number of large-scale analyses are reported in our main publication. All the top interactions are associated with a p-value < 0.025 (Chi-squared test). In addition, large-scale predictions allow accurate discrimination between RNA-binding and non-nucleic-acid-binding proteins. We randomly extracted 10 RNA-binding proteins (RBPs) from 176 cases ([Castello et al. 2012] [Baltz et al. 2012]) and 10 non-nucleic-acid-binding proteins (NNBPs) from a list of 250 negatives ([Stawiski et al. 2003]) and computed their interaction propensities (IPs) against human RNAs. We repeated this procedure until we reached high coverage (>95%) of the smaller dataset (20 times). From low to high IPs, we measured how many times RBPs are associated with larger scores than NNBPs. We observed a monotonic enrichment ranging from 0.3 (IP: -100) to 0.73 (IP: 200), which indicates that the positive and negative distributions can be efficiently discriminated with an average accuracy of 73%. Furthermore, we employed RBPs and NNBPs to Z-normalize the interaction propensities, which helps to highlight the cases that deviate from the average. We found that very few cases are associated with a Z-score higher than 4 (IP: 200), and the statistics are less significant for these extreme values (negative distributions can have a few outliers).



How did you collect the RNA motifs?


The mappings between RNA-binding proteins and their associated recognition motifs were obtained from publicly available databases (RBPDB and SpliceAid-F) and from the literature (e.g. Ray et al. 2013 and Hogan et al. 2008). We constantly update our knowledge base with information derived from the most recent genome-wide analyses and studies.



I do not understand the variables in the catRAPID omics module!


Turn the Help ON (top-left corner of the output page) and click on the question marks beside the table headers.