Complex set evaluation

Protein clusters produced by clustering score networks cannot be directly accepted as protein complexes that exist in vivo. We need experimental data or highly confident protein annotations to verify the quality of our predictions and, in turn, of our methods. Several evaluation approaches have been proposed in the literature and are implemented in ProCope. This page gives a short introduction to these evaluation methods.

Comparison with reference protein complex sets

A very straightforward approach to determine the correctness of a given set of predicted protein complexes is to compare it with a reference set of protein complexes, which must of course be of high quality. See Datasets for more information on existing protein complex reference sets.

Brohee et al., 2006 proposed a similarity measure consisting of sensitivity, positive predictive value (PPV) and accuracy. Basically, the sensitivity describes how much of the reference complex set is also contained in the candidate complex set, whereas the PPV measures how much of the prediction in the candidate set is correct. The accuracy is then the geometric mean of sensitivity and PPV.
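The three values can be sketched in a few lines of Python. This is only an illustration and not the ProCope API: the function name compare_to_reference is hypothetical, complexes are represented as plain sets of protein identifiers, and the formulas assume the usual contingency-table formulation, in which entry (i, j) counts the proteins shared by reference complex i and predicted complex j.

```python
from math import sqrt


def compare_to_reference(reference, predicted):
    """Return (sensitivity, ppv, accuracy) for two lists of protein sets.

    Hypothetical helper for illustration; not part of ProCope.
    """
    # Contingency table: t[i][j] = number of proteins shared by
    # reference complex i and predicted complex j.
    t = [[len(ref & pred) for pred in predicted] for ref in reference]

    # Sensitivity: for each reference complex take its best-covering
    # predicted complex, then normalize by the total reference size.
    reference_size = sum(len(ref) for ref in reference)
    best_per_reference = sum(max(row, default=0) for row in t)
    sensitivity = best_per_reference / reference_size if reference_size else 0.0

    # PPV: for each predicted complex take its best-matching reference
    # complex, then normalize by all table entries (assumed formulation).
    columns = list(zip(*t)) if t else []
    best_per_prediction = sum(max(col, default=0) for col in columns)
    table_total = sum(sum(col) for col in columns)
    ppv = best_per_prediction / table_total if table_total else 0.0

    # Accuracy: geometric mean of sensitivity and PPV.
    accuracy = sqrt(sensitivity * ppv)
    return sensitivity, ppv, accuracy


if __name__ == "__main__":
    reference = [{"A", "B", "C", "D"}, {"E", "F"}]
    predicted = [{"A", "B", "C"}, {"D", "E", "F"}, {"X", "Y"}]
    print(compare_to_reference(reference, predicted))
```

In this toy example the predicted complex {X, Y} shares no protein with any reference complex, so it contributes nothing to either value under the assumed formulation.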

Another comparison method is the mapping of complexes between the prediction and the reference set. Two complexes are defined to be mappable if they have a sufficiently large overlap. Mappings can be used as a measure of how well real complexes from a reference set are matched in the prediction.
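A minimal sketch of such a mapping is given below. The overlap criterion is an assumption made for this example only: two complexes count as mappable if they share at least min_shared proteins. The function names and the default threshold are hypothetical and do not describe the criterion used by ProCope.

```python
def map_complexes(reference, predicted, min_shared=2):
    """Return (reference_index, predicted_index) pairs of mappable complexes.

    Assumed criterion: at least `min_shared` proteins in common.
    """
    mappings = []
    for i, ref in enumerate(reference):
        for j, pred in enumerate(predicted):
            if len(ref & pred) >= min_shared:
                mappings.append((i, j))
    return mappings


def fraction_of_reference_mapped(reference, predicted, min_shared=2):
    """Fraction of reference complexes mapped to at least one prediction."""
    mapped = {i for i, _ in map_complexes(reference, predicted, min_shared)}
    return len(mapped) / len(reference) if reference else 0.0


if __name__ == "__main__":
    reference = [{"A", "B", "C", "D"}, {"E", "F"}]
    predicted = [{"A", "B", "C"}, {"D", "E", "F"}, {"X", "Y"}]
    print(map_complexes(reference, predicted))
    print(fraction_of_reference_mapped(reference, predicted))
```

A relative criterion (for example, shared proteins divided by the size of the smaller complex) could be substituted for the absolute count without changing the structure of the code.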

Note: As the reference data sets do not provide full coverage and the clusterings contain many more complexes than the reference sets, 100% PPV cannot be the final goal of this evaluation approach.

GO semantic similarity

Another way to evaluate the quality of predicted protein complexes is to compare the Gene Ontology (GO) terms associated with the proteins within a complex. Schlicker et al., 2006 proposed a scoring method for proteins based on this GO information. This scoring method basically generates another PPI score network containing functional relationships based on the GO annotations. The network is then used to calculate average inner-complex scores for the predicted complex set. The higher this score, the better our prediction (assuming that the GO annotations are correct).
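The averaging step can be sketched as follows, given any pairwise score network. The sketch does not compute the GO similarity itself, and it rests on assumptions that are not taken from ProCope: the network is a dictionary keyed by unordered protein pairs, pairs missing from the network count as 0, complexes with fewer than two proteins are skipped, and the final value is the unweighted mean over complexes. The function name is hypothetical.

```python
from itertools import combinations


def average_inner_complex_score(complexes, scores):
    """Mean over complexes of the mean pairwise score inside each complex.

    `scores` maps frozenset({protein_a, protein_b}) to a float.
    Pairs missing from the network are treated as 0.0 (assumption).
    """
    per_complex = []
    for proteins in complexes:
        pairs = list(combinations(sorted(proteins), 2))
        if not pairs:
            # A single protein has no inner pairs to score.
            continue
        total = sum(scores.get(frozenset(pair), 0.0) for pair in pairs)
        per_complex.append(total / len(pairs))
    return sum(per_complex) / len(per_complex) if per_complex else 0.0


if __name__ == "__main__":
    scores = {
        frozenset({"A", "B"}): 0.9,
        frozenset({"A", "C"}): 0.6,
        frozenset({"D", "E"}): 0.5,
    }
    complexes = [{"A", "B", "C"}, {"D", "E"}]
    print(average_inner_complex_score(complexes, scores))
```

Treating missing pairs as 0 penalizes complexes containing unannotated proteins; ignoring such pairs instead would be an equally plausible choice and changes the result.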

Colocalization

Proteins which belong to the same complex should be localized in the same subcellular compartment. If we have localization data for the proteins of an organism, we can use this information to evaluate the localization consistency of our predicted complexes. For yeast two such datasets are available, one published by Kumar et al., 2002 and the other by Huh et al., 2003. Note that a protein may have more than one localization annotation.

ProCope implements two different methods which score a complex based on localization data. Both take into account the overlap of localization annotations among the proteins within each complex, and both first determine the cellular localization with the highest annotation frequency for the given complex (e.g. nucleus, which is annotated for 6 of 8 proteins in a complex). The Colocalization score, which was proposed in our own work, divides this value by the size of the complex (in our example we would get a colocalization score of 0.75). Pu et al., 2007 defined the Colocalization PPV. In this measure the highest annotation frequency is divided by the sum of the annotation frequencies of all localizations present in the complex.
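Both measures can be illustrated for a single complex with a short Python sketch. The function names and the data layout (a dictionary mapping each protein to its set of localizations) are assumptions for this example and are not the ProCope API. Proteins without any annotation are assumed to still count towards the complex size.

```python
from collections import Counter


def localization_counts(proteins, localizations):
    """Count how many proteins of the complex carry each localization."""
    counts = Counter()
    for protein in proteins:
        # A protein may have several localizations; count each one once.
        counts.update(set(localizations.get(protein, ())))
    return counts


def colocalization_score(proteins, localizations):
    """Highest annotation frequency divided by the complex size."""
    counts = localization_counts(proteins, localizations)
    if not counts:
        return 0.0
    return max(counts.values()) / len(proteins)


def colocalization_ppv(proteins, localizations):
    """Highest annotation frequency divided by the sum of all frequencies."""
    counts = localization_counts(proteins, localizations)
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


if __name__ == "__main__":
    # 8 proteins: 6 nuclear (2 of them also cytoplasmic), 2 mitochondrial.
    localizations = {
        "P1": {"nucleus"}, "P2": {"nucleus"},
        "P3": {"nucleus"}, "P4": {"nucleus"},
        "P5": {"nucleus", "cytoplasm"}, "P6": {"nucleus", "cytoplasm"},
        "P7": {"mitochondrion"}, "P8": {"mitochondrion"},
    }
    proteins = set(localizations)
    print(colocalization_score(proteins, localizations))  # 6 / 8
    print(colocalization_ppv(proteins, localizations))    # 6 / (6 + 2 + 2)
```

The example reproduces the case from the text: nucleus is annotated for 6 of 8 proteins, giving a colocalization score of 0.75, while the colocalization PPV is lower because the additional cytoplasm and mitochondrion annotations enlarge the denominator.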


