Complex set evaluation
Protein clusters produced by clustering score networks cannot be
accepted directly as protein complexes that exist in vivo.
We need experimental data or high-confidence
protein annotations to verify the quality of our predictions and,
with them, of our methods. Several approaches have been proposed in the
literature and are implemented in ProCope. This section gives a short
introduction to these evaluation methods.
Comparison with reference protein complex sets
A very straightforward approach to determine the correctness of a given
set of predicted protein complexes is to compare it with a reference
set of protein complexes, which must of course be of high
quality.
See Datasets for more
information on existing protein complex reference sets.
Brohee et al., 2006 proposed a
similarity measure consisting of sensitivity,
positive predictive value
(PPV) and accuracy. Basically, the sensitivity
describes how much of the reference complex set is also contained in
the candidate complex set, whereas the positive predictive
value (PPV) measures how much of the prediction in the
candidate set is correct. The accuracy is then the
geometric mean of sensitivity and PPV.
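The three measures can be illustrated with a minimal Python sketch. It assumes the contingency-table formulation commonly attributed to Brohee et al., where each table entry counts the proteins shared by one reference complex and one predicted complex. The function name `brohee_measures` and the representation of complexes as Python sets are assumptions made for this illustration and are not part of the ProCope API.

```python
from math import sqrt


def brohee_measures(reference, predicted):
    """Return (sensitivity, ppv, accuracy) for two lists of protein sets.

    Hypothetical helper for illustration; not a ProCope function.
    """
    if not reference or not predicted:
        return 0.0, 0.0, 0.0

    # t[i][j] = number of proteins shared by reference complex i
    # and predicted complex j
    t = [[len(ref & pred) for pred in predicted] for ref in reference]

    # Sensitivity: best coverage of each reference complex,
    # normalized by the total size of all reference complexes
    sn_den = sum(len(ref) for ref in reference)
    sn = sum(max(row) for row in t) / sn_den if sn_den else 0.0

    # PPV: for each predicted complex, the largest overlap with any
    # reference complex, normalized by the sum of all its overlaps
    columns = list(zip(*t))
    ppv_den = sum(sum(col) for col in columns)
    ppv = sum(max(col) for col in columns) / ppv_den if ppv_den else 0.0

    # Accuracy: geometric mean of sensitivity and PPV
    return sn, ppv, sqrt(sn * ppv)
```

Note that in this formulation proteins of a predicted complex which occur in no reference complex do not enter the PPV denominator.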
Another comparison method is the mapping of complexes between the
prediction and the reference set. Two complexes are defined to be mappable
if they have a sufficiently large overlap. Mappings can be used as a
measure of how well real complexes from a reference set are matched in
the prediction.
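As a rough sketch of such a mapping, the following Python code pairs every reference complex with every predicted complex whose overlap is large enough. The overlap criterion used here (a Jaccard index of at least 0.5) is only an assumption chosen for illustration, since the text above does not fix a specific criterion; the names `map_complexes`, `matched_reference_fraction` and `min_overlap` are likewise hypothetical.

```python
def jaccard(a, b):
    """Shared proteins divided by the union of both complexes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_complexes(reference, predicted, min_overlap=0.5):
    """Return index pairs (i, j) of mappable complexes.

    min_overlap is an assumed threshold, not a value taken from ProCope.
    """
    return [
        (i, j)
        for i, ref in enumerate(reference)
        for j, pred in enumerate(predicted)
        if jaccard(ref, pred) >= min_overlap
    ]


def matched_reference_fraction(reference, predicted, min_overlap=0.5):
    """Fraction of reference complexes mapped to at least one prediction."""
    if not reference:
        return 0.0
    matched = {i for i, _ in map_complexes(reference, predicted, min_overlap)}
    return len(matched) / len(reference)
```

The fraction of matched reference complexes is one simple way to summarize the mappings in a single number.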
Note:
As the reference
data sets do not provide full coverage and the clusterings contain many
more complexes than the reference sets, 100% PPV cannot
be the final goal of this evaluation approach.
GO semantic similarity
Another way to evaluate the quality of predicted protein complexes is
the comparison of Gene Ontology (GO) terms
associated with the proteins within a complex. Schlicker et al., 2006
proposed a new scoring method for proteins based on this GO
information. This scoring method basically generates another
PPI score network containing functional relationships based on the GO
annotations. The network is then used to calculate average
inner-complex scores for the predicted complex set. The higher this
score, the better our prediction (assuming that the GO annotations
are correct).
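The averaging step can be sketched in Python as follows, given a score network that has already been computed. The sketch makes several assumptions which are not taken from ProCope: the network is a dictionary keyed by unordered protein pairs, protein pairs without an entry receive a default score of 0, complexes with fewer than two proteins are skipped, and the overall value is the unweighted mean over the remaining complexes.

```python
from itertools import combinations


def inner_complex_score(complex_, scores, missing=0.0):
    """Average score over all protein pairs within one complex.

    scores maps frozenset({p1, p2}) to a similarity value (assumed layout).
    Returns None for complexes with fewer than two proteins.
    """
    pairs = list(combinations(sorted(complex_), 2))
    if not pairs:
        return None
    total = sum(scores.get(frozenset(pair), missing) for pair in pairs)
    return total / len(pairs)


def average_inner_complex_score(complexes, scores, missing=0.0):
    """Unweighted mean of the per-complex averages (assumed aggregation)."""
    per_complex = [
        score
        for score in (inner_complex_score(c, scores, missing) for c in complexes)
        if score is not None
    ]
    return sum(per_complex) / len(per_complex) if per_complex else 0.0
```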
Colocalization
Proteins which belong to the same complex should be localized in the
same intracellular compartment. If we have localization data
for
the proteins of an organism, we can use this information to evaluate the
localization consistency of our predicted complexes. For yeast, two such
datasets are available, one published by Kumar et al., 2002 and the other
by Huh et al., 2003. Note that a
protein may have more than one localization annotation.
ProCope implements two different methods which score a complex based
on localization data. Both measures take into account the overlap of
localization information among the proteins within each complex. Both
methods first determine the cellular localization with the highest
annotation frequency for the given complex (e.g. nucleus, which is annotated for 6 of 8 proteins in a complex). The Colocalization score,
which was proposed in our own work, divides this frequency by the size of
the complex (in our example we would get a colocalization score of
0.75). Pu et al., 2007 defined the Colocalization PPV.
In this measure, the highest annotation frequency is divided by the
total sum of annotation frequencies of all localizations present in the
complex.
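Both measures for a single complex can be written down in a few lines of Python. This is only a sketch of the definitions given above, not ProCope code: the function names and the dictionary mapping each protein to its localizations are assumptions made for the example, and proteins without any annotation are assumed to still count towards the complex size.

```python
from collections import Counter


def localization_counts(complex_, localizations):
    """Count for each localization how many proteins of the complex carry it."""
    counts = Counter()
    for protein in complex_:
        # A protein may have several localizations; count each one once
        counts.update(set(localizations.get(protein, ())))
    return counts


def colocalization_score(complex_, localizations):
    """Highest annotation frequency divided by the size of the complex."""
    counts = localization_counts(complex_, localizations)
    if not complex_ or not counts:
        return 0.0
    return max(counts.values()) / len(complex_)


def colocalization_ppv(complex_, localizations):
    """Highest annotation frequency divided by the sum of all frequencies."""
    counts = localization_counts(complex_, localizations)
    if not counts:
        return 0.0
    return max(counts.values()) / sum(counts.values())
```

For a complex of 8 proteins in which nucleus is annotated for 6 proteins, cytoplasm for 2 and mitochondrion for 1, the colocalization score is 6/8 = 0.75, while the colocalization PPV is 6/(6+2+1), roughly 0.67.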