Score network generation
Several approaches have been developed to derive confidence values
for potential protein-protein interactions from noisy and
incomplete purification data sets. The approaches
implemented in ProCope are briefly presented here, along with
references to the original literature.
All original PPI score networks generated by the authors of the
respective publications can be found in the data/scores/
folder of this package.
Socio affinity scores
The socio affinity scoring system was proposed by Gavin et al., 2006.
It calculates the log-odds of the observed frequency of a protein
interaction relative to the frequency expected if this interaction occurred randomly.
Two different models for protein interactions in purification data are
employed:
- Spokes model: How often does a protein i retrieve protein j if i is bait?
- Matrix model: How often do two proteins co-occur in the same purification experiments?
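The two counting models can be illustrated with a minimal Python sketch. It is not ProCope code and does not reproduce the exact formula of Gavin et al.; the toy data, the function names and the restriction of the matrix model to prey pairs are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each purification is (bait, set of preys).
purifications = [
    ("A", {"B", "C"}),
    ("A", {"B"}),
    ("B", {"A", "C"}),
]

def spoke_counts(purifications):
    """Spokes model: count how often each bait retrieves each prey."""
    counts = Counter()
    for bait, preys in purifications:
        for prey in preys:
            if prey != bait:
                counts[(bait, prey)] += 1
    return counts

def matrix_counts(purifications):
    """Matrix model: count how often two preys share a purification.

    Assumption: only prey-prey pairs are counted here.
    """
    counts = Counter()
    for _bait, preys in purifications:
        for p, q in combinations(sorted(preys), 2):
            counts[(p, q)] += 1
    return counts

def log_odds(observed, expected):
    """Generic log-odds: log of the observed over the expected frequency."""
    return math.log(observed / expected)

spokes = spoke_counts(purifications)
matrix = matrix_counts(purifications)
```

In this sketch, a positive `log_odds` value means that an interaction was observed more often than expected by chance. How the expected frequency is estimated is defined in the original publication and is deliberately left open here.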
Purification enrichment scores
Another measure for the confidence of potential PPIs is the purification enrichment
scoring system proposed by Collins et al., 2007.
It uses the spokes/matrix model as well, but it additionally
incorporates a Bayes classifier which also takes into account
negative evidence for each protein-protein interaction.
Scoring scheme based on the hypergeometric distribution
Hart et al., 2007 defined another statistical scoring measure, based on
the hypergeometric distribution, to score protein-protein interactions.
This model only incorporates matrix model interactions in the
purification experiments (see above). It calculates a p-value which
represents the probability that a given protein-protein interaction
occurs randomly, based on the given set of interactions. The Hart
scoring method accepts multiple purification data sets as input; the
resulting p-values are multiplied to generate a single p-value for each
protein-protein interaction.
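The following Python sketch shows how a hypergeometric tail probability can be computed and how per-data-set p-values can be multiplied. It is a generic illustration, not the ProCope implementation; the function names and the mapping of the parameters `N`, `K`, `n` and `k` onto interaction counts are assumptions, and the exact definition should be taken from Hart et al., 2007.

```python
from math import comb, prod

def hypergeom_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked items, n draws).

    Assumed reading: the probability of seeing at least k co-occurrences
    of two proteins by chance.
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return hits / total

def combined_p_value(per_dataset_p_values):
    """Multiply the p-values obtained from the individual data sets."""
    return prod(per_dataset_p_values)

# Hypothetical numbers: one p-value per purification data set.
p_values = [hypergeom_tail(3, 10, 3, 3), hypergeom_tail(2, 20, 4, 5)]
combined = combined_p_value(p_values)
```

A smaller combined value indicates that the observed co-occurrence is less likely to be a chance event.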
Dice coefficients
Dice coefficients are another scoring measure, proposed by Zhang et al., 2008.
Basically, the Dice coefficient of two proteins is the ratio of the
number of experiments both proteins are involved in to the total number
of experiments the proteins are involved in.
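As an illustration, the standard set-based Dice coefficient can be written in a few lines of Python. That ProCope or Zhang et al. use exactly this normalization is an assumption; the function name and the representation of experiments as sets of identifiers are hypothetical.

```python
def dice_coefficient(experiments_a, experiments_b):
    """Standard Dice coefficient of two sets of experiment identifiers.

    Computes 2 * |A and B| / (|A| + |B|); returns 0.0 if both sets are empty.
    """
    a, b = set(experiments_a), set(experiments_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical example: the proteins share experiments 2 and 3.
score = dice_coefficient({1, 2, 3}, {2, 3, 4})
```

The value ranges from 0 (the proteins never share an experiment) to 1 (the proteins always occur together).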
Bootstrap scores
Finally, ProCope contains the methods needed to score proteins using the bootstrap approach, which is part of our own work (see also: References).
Here is a summary of the steps needed to generate the bootstrap score
network. Note that inflation coefficients and efficiency are concepts of the
MCL algorithm; please read the original literature for more information.
- From a given purification data set, draw n random experiments with replacement, where n is the total number of experiments in the set. The result is one bootstrap sample, which can be used like any other purification data set.
- Generate socio affinity scores from the bootstrap samples and cluster the resulting networks with the Markov Clustering program developed by van Dongen, 2000, using different inflation coefficients.
- Determine the inflation coefficient producing the best average efficiency.
- From each bootstrap sample, use the one clustering produced by the best inflation coefficient. This list of clusterings is then used to generate the bootstrap network. The weights of the edges in this network represent the frequency of co-occurrence of two proteins in the given set of clusterings. For instance, if two proteins appear in the same complex in 60% of the clusterings, their edge gets a weight of 0.6.
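The resampling step and the final edge weighting can be sketched in Python as follows. The scoring and MCL clustering steps are omitted, and all names and data are hypothetical; this is an illustration under those assumptions, not the ProCope implementation.

```python
import random
from collections import Counter
from itertools import combinations

def bootstrap_sample(experiments, rng):
    """Draw n experiments with replacement, n being the size of the data set."""
    n = len(experiments)
    return [rng.choice(experiments) for _ in range(n)]

def cooccurrence_weights(clusterings):
    """Edge weight = fraction of clusterings in which two proteins share a cluster.

    Assumption: within one clustering, each protein belongs to one cluster only.
    """
    counts = Counter()
    for clustering in clusterings:
        for cluster in clustering:
            for p, q in combinations(sorted(cluster), 2):
                counts[(p, q)] += 1
    total = len(clusterings)
    return {pair: count / total for pair, count in counts.items()}

# Hypothetical experiments, identified by name only.
experiments = ["exp1", "exp2", "exp3", "exp4"]
sample = bootstrap_sample(experiments, random.Random(0))

# Hypothetical clusterings, one per bootstrap sample.
# A and B share a cluster in 3 of 5 clusterings.
clusterings = [
    [{"A", "B"}, {"C"}],
    [{"A", "B", "C"}],
    [{"A"}, {"B", "C"}],
    [{"A", "B"}, {"C"}],
    [{"A"}, {"B"}, {"C"}],
]
weights = cooccurrence_weights(clusterings)
```

With these toy clusterings, the edge between A and B receives a weight of 0.6, matching the example given in the text.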
The bootstrap method incorporates shared protein calculation.
In a living cell, proteins can be contained in multiple complexes,
whereas normal clustering methods produce disjoint sets of proteins
from the score networks. We use an approach which adds proteins to
other complexes in a post-processing step. The decision whether a protein
is added to another complex is based on its score to the members of the
complex and the average score within the complex.
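One possible form of such a decision rule is sketched below in Python. The text does not specify the actual criterion, so the comparison used here (mean score to the complex members against a fraction of the mean score within the complex), the `fraction` parameter and all names are assumptions made purely for illustration.

```python
from itertools import combinations

def mean(values):
    """Arithmetic mean; 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def score(scores, p, q):
    """Look up an undirected edge score; missing edges count as 0."""
    return scores.get((p, q), scores.get((q, p), 0.0))

def should_add(protein, complex_members, scores, fraction=0.9):
    """Hypothetical rule: add the protein if its mean score to the members
    reaches a given fraction of the mean score within the complex."""
    members = [m for m in complex_members if m != protein]
    to_members = mean(score(scores, protein, m) for m in members)
    within = mean(score(scores, p, q) for p, q in combinations(members, 2))
    return within > 0 and to_members >= fraction * within

# Hypothetical score network and complex.
scores = {
    ("A", "B"): 1.0, ("A", "C"): 0.8, ("B", "C"): 0.9,
    ("D", "A"): 0.9, ("D", "B"): 0.95, ("D", "C"): 0.85,
    ("E", "A"): 0.1,
}
complex_abc = {"A", "B", "C"}
add_d = should_add("D", complex_abc, scores)
add_e = should_add("E", complex_abc, scores)
```

In this toy network, protein D scores about as well against the complex as its members do among themselves and would be added, whereas protein E would not.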