Score network generation

Several approaches have been developed to derive confidence values for potential protein-protein interactions from noisy and incomplete purification data sets. The approaches implemented in ProCope are briefly presented here, along with references to the original literature.

All original PPI score networks generated by the authors of the respective publications can be found in the data/scores/ folder of this package.

Socio affinity scores

The socio affinity scoring system was proposed by Gavin et al., 2006. It calculates the log-odds of the observed frequency of a protein interaction relative to the frequency expected if the interaction occurred at random. Two different models for protein interactions in purification data are employed:
  1. Spokes model: How often does a protein i retrieve protein j if i is bait?
  2. Matrix model: How often do two proteins co-occur in the same purification experiment?
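
The two counting models can be illustrated with a minimal Python sketch. It is not ProCope's API and not the exact formula of Gavin et al.; the toy data, the function names and the choice to count only prey-prey pairs in the matrix model are assumptions made for illustration. The log_odds helper only shows the general shape of a log-odds score; how the chance expectation is estimated is left to the original publication.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each purification is (bait, set of preys).
purifications = [
    ("A", {"B", "C"}),
    ("A", {"B"}),
    ("B", {"A", "C"}),
]

def spoke_counts(purifications):
    """Spoke model: count how often each bait retrieves each prey."""
    counts = Counter()
    for bait, preys in purifications:
        for prey in preys:
            counts[(bait, prey)] += 1
    return counts

def matrix_counts(purifications):
    """Matrix model: count how often two preys share a purification.

    Assumption: only prey-prey pairs are counted here.
    """
    counts = Counter()
    for _bait, preys in purifications:
        for p, q in combinations(sorted(preys), 2):
            counts[(p, q)] += 1
    return counts

def log_odds(observed, expected):
    """Generic log-odds of an observed count versus a chance expectation."""
    return math.log(observed / expected)

spokes = spoke_counts(purifications)
matrix = matrix_counts(purifications)
```

In this toy data set, bait A retrieves B twice, and B and C co-occur as preys in one purification.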

Purification enrichment scores

Another measure for the confidence of potential PPIs is the purification enrichment scoring system proposed by Collins et al., 2007. It also uses the spokes and matrix models but incorporates a Bayes classifier, which additionally takes negative evidence for each protein-protein interaction into account.

Scoring scheme based on the hypergeometric distribution

Hart et al., 2007 defined another statistical measure, based on the hypergeometric distribution, to score protein-protein interactions. This model only incorporates matrix model interactions in the purification experiments (see above). It calculates a p-value which represents the probability that a given protein-protein interaction occurs randomly, based on the given set of interactions. The Hart scoring method accepts multiple purification data sets as input; the resulting p-values are multiplied to generate a single p-value for each protein-protein interaction.
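
The following Python sketch shows one way such a p-value could be computed. It is a simplified illustration, not the exact parametrization used by Hart et al. or by ProCope: it assumes that the p-value is the upper tail of a hypergeometric distribution over purifications, and all function and parameter names are hypothetical.

```python
from math import comb, prod

def hypergeom_pvalue(n_total, n_i, n_j, k):
    """Probability that proteins i and j share at least k purifications by chance.

    n_total: number of purifications in the data set
    n_i, n_j: number of purifications containing protein i and protein j
    k: number of purifications containing both proteins
    """
    upper = min(n_i, n_j)
    total = comb(n_total, n_j)
    # Sum the hypergeometric probabilities for k, k+1, ..., upper shared purifications.
    tail = sum(comb(n_i, x) * comb(n_total - n_i, n_j - x)
               for x in range(k, upper + 1))
    return tail / total

def combined_pvalue(pvalues):
    """Multiply the per-data-set p-values into a single value."""
    return prod(pvalues)

# Hypothetical example: two data sets scoring the same protein pair.
p_first = hypergeom_pvalue(n_total=10, n_i=3, n_j=3, k=3)
p_second = hypergeom_pvalue(n_total=20, n_i=4, n_j=5, k=2)
p_combined = combined_pvalue([p_first, p_second])
```

A small p-value means that the observed number of shared purifications is unlikely to arise by chance.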

Dice coefficients

Dice coefficients are another scoring measure, proposed by Zhang et al., 2008. Basically, the Dice coefficient of two proteins is the ratio of the number of experiments both proteins are involved in to the total number of experiments the proteins are involved in.
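
As a minimal sketch, the standard Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). The Python function below applies it to the sets of experiments in which each protein was observed; that ProCope uses exactly this form, and the function and variable names, are assumptions.

```python
def dice_coefficient(experiments_i, experiments_j):
    """Standard Dice coefficient of the experiment sets of two proteins."""
    a, b = set(experiments_i), set(experiments_j)
    if not a and not b:
        # Neither protein was observed: define the score as 0.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical example: protein i seen in experiments 1-3, protein j in 2-4.
score = dice_coefficient({1, 2, 3}, {2, 3, 4})
```

Two proteins that always occur together get a coefficient of 1, and two proteins that never share an experiment get 0.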

Bootstrap scores

Finally, ProCope contains the methods needed to score protein interactions using the bootstrap approach, which is part of our own work (see also: References).

Here is a summary of the steps needed to generate the bootstrap score network. Note that inflation coefficients and efficiency are concepts of the MCL algorithm; please read the original literature for more information.
  1. From a given purification data set, draw n random experiments with replacement, where n is the total number of experiments in the set. The result is one bootstrap sample, which can be used like any other purification data set.
  2. Generate socio affinity scores from the bootstrap samples and cluster the resulting networks with the Markov Clustering program developed by van Dongen, 2000, using different inflation coefficients.
  3. Determine the inflation coefficient producing the best average efficiency.
  4. From each bootstrap sample, use the one clustering produced by the best inflation coefficient. This list of clusterings is then used to generate the bootstrap network. The weights of the edges in this network represent the frequency of co-occurrence of two proteins in the given set of clusterings. For instance, if two proteins appear in the same complex in 60% of the clusterings, their edge gets a weight of 0.6.
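
Steps 1 and 4 of the list above can be sketched in Python as follows. This is an illustration under assumptions, not ProCope's implementation: the function names and toy data are hypothetical, and the scoring and MCL clustering of steps 2 and 3 are not shown, so the clusterings are simply given.

```python
import random
from collections import Counter
from itertools import combinations

def bootstrap_sample(experiments, rng):
    """Step 1: draw n experiments with replacement, n = size of the data set."""
    n = len(experiments)
    return [rng.choice(experiments) for _ in range(n)]

def cooccurrence_network(clusterings):
    """Step 4: edge weight = fraction of clusterings placing both proteins
    in the same cluster."""
    counts = Counter()
    for clustering in clusterings:
        for cluster in clustering:
            for p, q in combinations(sorted(cluster), 2):
                counts[(p, q)] += 1
    total = len(clusterings)
    return {pair: count / total for pair, count in counts.items()}

rng = random.Random(0)
experiments = ["e1", "e2", "e3", "e4"]  # hypothetical experiment identifiers
sample = bootstrap_sample(experiments, rng)

# Hypothetical clusterings, one per bootstrap sample.
clusterings = [
    [{"A", "B"}, {"C"}],
    [{"A", "B", "C"}],
    [{"A"}, {"B", "C"}],
    [{"A", "B"}, {"C"}],
    [{"A", "C"}, {"B"}],
]
network = cooccurrence_network(clusterings)
```

Here A and B share a cluster in three of the five clusterings, so their edge gets a weight of 0.6.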

The bootstrap method incorporates shared protein calculation. In a living cell, proteins can be contained in multiple complexes, whereas normal clustering methods produce disjoint sets of proteins from the score networks. We use an approach which adds proteins to other complexes in a post-processing step. The decision whether a protein is added to another complex is based on its scores to the members of the complex and the average score within the complex.
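
A post-processing step of this kind could look like the following Python sketch. The text above does not state the exact decision rule, so the one used here is purely an assumption: a protein is added to a complex if its mean score to the members reaches a chosen factor times the mean score within the complex. All names, the factor parameter and the toy scores are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def pair_score(scores, p, q):
    """Look up an undirected edge score; missing edges count as 0."""
    return scores.get((p, q), scores.get((q, p), 0.0))

def add_shared_proteins(complexes, scores, factor=1.0):
    """Return the complexes extended by proteins from other complexes.

    Assumed rule: add a candidate if its mean score to the members is at
    least factor * (mean score within the complex).
    """
    proteins = set().union(*complexes) if complexes else set()
    extended = []
    for members in complexes:
        internal = mean(pair_score(scores, p, q)
                        for p in members for q in members if p < q)
        added = set()
        if len(members) > 1:  # singletons have no internal score to compare with
            for candidate in proteins - members:
                to_members = mean(pair_score(scores, candidate, m) for m in members)
                if to_members >= factor * internal:
                    added.add(candidate)
        extended.append(set(members) | added)
    return extended

# Hypothetical toy scores: C scores highly against both members of {A, B}.
scores = {("A", "B"): 1.0, ("C", "D"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.2}
result = add_shared_proteins([{"A", "B"}, {"C", "D"}], scores)
```

In this example C is added to the first complex and also stays in its original one, so it ends up shared between both.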


ProCope documentation