Sample Use Case 1: Generating score networks, clustering, cluster evaluation

This sample use case guides you through calculating score networks from purification data sets, clustering them, and evaluating the resulting complexes using ProCope. Below you will find snippets of source code along with detailed explanations. The full source code of this example is available in the procope.examples package of the src/ folder. Note that the program may take a few seconds to finish because many tasks are performed.

For more detailed documentation of all methods and classes, please check out the JavaDocs.

Important note: All parameters like cutoffs and linkage methods used in this example are arbitrary and for explanation purposes only. For the practical prediction of protein complexes these parameters need to be tuned carefully.

Loading purification data

First of all we need to load the purification data from which the score networks will be derived. This example is based on datasets for the yeast Saccharomyces cerevisiae. We use the datasets of Gavin et al., 2006 and Krogan et al., 2006, which are also delivered in the data/ folder of this package. For evaluation, the MIPS gold standard set for yeast is used. The PurificationDataReader class provides methods to load purification data from files or streams. It will throw an exception if an IO error occurs or if the file format is invalid. In this example we catch all kinds of exceptions in one catch block:

PurificationData dataKrogan=null, dataGavin=null;
try {
    dataKrogan = PurificationDataReader.readPurifications("data/purifications/krogan_raw.txt");
    dataGavin = PurificationDataReader.readPurifications("data/purifications/gavin_raw.txt");
} catch (Exception e) {
    // something went wrong, output error message
    System.err.println("Could not load purification data:");
    System.err.println(e.getMessage());
    System.exit(1);
}


Merging purification data

Next we merge both purification data sets into one. The experiment sets are simply concatenated, which is equivalent to concatenating the data files and reading the merged file. The merged data set is referred to as dataMerged in the snippets below.

Calculating Socio Affinity scores

We calculate Socio Affinity scores according to Gavin et al., 2006. All classes which implement the interface ScoresCalculator can be used for automatic score network generation using the NetworkGenerator class. Note that Socio Affinity scores do not require any further parameters. We use a score threshold of 0.0 here which means that no negative scores will be inserted into the scores network.

ScoresCalculator calcSocios = new SocioAffinityCalculator(dataMerged);
ProteinNetwork scoresSocios = NetworkGenerator.generateNetwork(calcSocios, 0f);


Calculating Purification Enrichment scores

We also calculate Purification Enrichment (PE) scores according to Collins et al., 2007. As proposed in the paper, the combined PE network is a weighted combination of the PE networks derived from the Gavin and the Krogan data sets. The Krogan network gets a weight of 0.5 whereas the Gavin network gets a weight of 1.0. We accomplish this by performing a scalar multiplication with 0.5 on the Krogan network and then combining both networks by adding up the edge weights. The CombinationRules class contains different settings which control the merging of two given networks.

Note that we create the scores calculator and feed it into the scores network generator in one step this time:

ProteinNetwork scoresPEGavin = NetworkGenerator
        .generateNetwork(new PECalculator(dataGavin, 0.62f, 10f), 0f);
ProteinNetwork scoresPEKrogan = NetworkGenerator
        .generateNetwork(new PECalculator(dataKrogan, 0.51f, 20f), 0f);

And the combination:

scoresPEKrogan.scalarMultiplication(0.5f);
CombinationRules combiRules = new CombinationRules(CombinationRules.CombinationType.MERGE);
combiRules.setWeightMergePolicy(CombinationRules.WeightMergePolicy.ADD);
ProteinNetwork scoresPE = scoresPEGavin.combineWith(scoresPEKrogan, combiRules);
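
To make the arithmetic of this combination concrete, here is a minimal, library-independent sketch. It does not use any ProCope classes: the class WeightedMergeSketch, its method names and the string edge keys are hypothetical. It only illustrates scaling one network and then adding up edge weights, under the assumption that an edge present in only one network keeps its single weight.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of ProCope: a network is modelled as
// a map from an edge key such as "A|B" to an edge weight.
final class WeightedMergeSketch {

    // Multiply every edge weight by the given factor.
    static Map<String, Float> scale(Map<String, Float> network, float factor) {
        Map<String, Float> scaled = new HashMap<>();
        for (Map.Entry<String, Float> edge : network.entrySet()) {
            scaled.put(edge.getKey(), edge.getValue() * factor);
        }
        return scaled;
    }

    // Union of both edge sets; weights of shared edges are added up.
    // Assumption: an edge found in only one network keeps its weight.
    static Map<String, Float> mergeByAdding(Map<String, Float> a, Map<String, Float> b) {
        Map<String, Float> merged = new HashMap<>(a);
        for (Map.Entry<String, Float> edge : b.entrySet()) {
            merged.merge(edge.getKey(), edge.getValue(), Float::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Float> gavin = new HashMap<>();
        gavin.put("A|B", 2.0f);
        Map<String, Float> krogan = new HashMap<>();
        krogan.put("A|B", 4.0f);
        krogan.put("B|C", 1.0f);
        // Krogan weights count half, Gavin weights count fully.
        System.out.println(mergeByAdding(gavin, scale(krogan, 0.5f)));
    }
}
```

In this toy input the shared edge ends up with 2.0 + 0.5 * 4.0 = 4.0, while the edge found only in the second network keeps its scaled weight of 0.5.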


Clustering

Next we cluster the score networks we calculated using hierarchical agglomerative clustering with average-linkage (UPGMA). Two arbitrary cutoffs are chosen here which yield a reasonable number of clusters. All classes which perform clustering should implement the Clusterer interface.

Clusterer clustererSocios = new HierarchicalClusterer(HierarchicalLinkage.UPGMA, 2.7f);
ComplexSet clusteringSocios = clustererSocios.cluster(scoresSocios);

Clusterer clustererPE = new HierarchicalClusterer(HierarchicalLinkage.UPGMA, 0.5f);
ComplexSet clusteringPE = clustererPE.cluster(scoresPE);

Note that new HierarchicalClusterer... could be replaced by any other clusterer, e.g. the Markov Clusterer delivered with this library (MarkovClusterer class).
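
For readers unfamiliar with average-linkage clustering, the following sketch shows the general idea on a small similarity matrix. It is not ProCope's HierarchicalClusterer: the class AverageLinkageSketch and its methods are hypothetical, missing edges are assumed to be stored as 0, and merging is assumed to stop as soon as the best average similarity falls below the cutoff. It recomputes averages naively in every round, which is fine for a toy example but not efficient.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of average-linkage agglomerative clustering.
final class AverageLinkageSketch {

    // Mean similarity over all pairs with one member in each cluster.
    static double averageSimilarity(List<Integer> a, List<Integer> b, double[][] sim) {
        double sum = 0.0;
        for (int i : a) {
            for (int j : b) {
                sum += sim[i][j];
            }
        }
        return sum / (a.size() * b.size());
    }

    // Repeatedly merge the most similar pair of clusters until no pair
    // reaches the cutoff. Assumption: sim is symmetric, missing edges are 0.
    static List<List<Integer>> cluster(double[][] sim, double cutoff) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > 1) {
            int bestA = -1;
            int bestB = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = averageSimilarity(clusters.get(a), clusters.get(b), sim);
                    if (s > best) {
                        best = s;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            if (best < cutoff) {
                break; // no remaining pair is similar enough
            }
            // bestA < bestB, so removing bestB leaves bestA's index intact.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }
}
```

A higher cutoff stops the merging earlier and therefore yields more, smaller clusters; a lower cutoff yields fewer, larger ones.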


Loading a reference complex set

We now want to compare our resulting clusters against the reference complex set from MIPS. First we need to load this complex set, which is also contained in the data/ folder of this package. The ComplexSetReader class contains methods to load complex sets from files or streams.


ComplexSet setMIPS = null;
try {
    setMIPS = ComplexSetReader.readComplexes("data/complexes/mips_complexes.txt");
} catch (Exception e) {
    // something went wrong, output error message
    System.err.println("Could not load complex set:");
    System.err.println(e.getMessage());
    System.exit(1);
}

Again the load method throws different kinds of exceptions (see JavaDoc) which we catch in a single block here.


Complex set comparison (using the Brohee measure)

The ComplexSetComparison class contains different comparison methods; here we use the measure proposed in Brohee et al., 2006. Check out the JavaDocs of ComplexSetComparison for more information on complex set comparison.

System.out.println("Socio clusters against MIPS: "
        + ComplexSetComparison.broheeComparison(clusteringSocios, setMIPS));
System.out.println("PE clusters against MIPS:    "
        + ComplexSetComparison.broheeComparison(clusteringPE, setMIPS));

Note that ComplexSetComparison.broheeComparison produces a BroheeStats object. This object overrides toString() to produce a readable string representation of the comparison result, which is why we can directly append the result of the method call in the print statement.
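
The following sketch illustrates the toString() pattern with a small stand-in class. It is not the actual BroheeStats implementation: the class ComparisonResultSketch and its fields are hypothetical, and the text format merely imitates the sample output shown at the end of this page. The accuracy is computed as the geometric mean of sensitivity and PPV, which is how Brohee et al. define it.

```java
// Hypothetical stand-in for a comparison result; not ProCope's BroheeStats.
final class ComparisonResultSketch {

    private final float sensitivity;
    private final float ppv;

    ComparisonResultSketch(float sensitivity, float ppv) {
        this.sensitivity = sensitivity;
        this.ppv = ppv;
    }

    // Geometric mean of sensitivity and positive predictive value.
    float accuracy() {
        return (float) Math.sqrt(sensitivity * ppv);
    }

    // Called implicitly whenever the object is concatenated with a String.
    @Override
    public String toString() {
        return "Sensitivity: " + sensitivity + ", PPV: " + ppv
                + ", Accuracy: " + accuracy();
    }

    public static void main(String[] args) {
        ComparisonResultSketch result = new ComparisonResultSketch(0.5f, 0.5f);
        // String concatenation invokes result.toString() automatically.
        System.out.println("Clusters against reference: " + result);
    }
}
```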


Calculating the colocalization score

The colocalization score of a clustering is a measure of how much the members of all complexes are colocalized according to a set of localization data. We use the localization data published by Huh et al., 2003 which are contained in the data/ folder of the package:

LocalizationData locData = null;
try {
    locData = LocalizationDataReader.readLocalizationData("data/localizations/huh_loc_070804.txt");
} catch (Exception e) {
    // something went wrong, output error message
    System.err.println("Could not load localization data:");
    System.err.println(e.getMessage());
    System.exit(1);
}

Now we calculate the colocalization score of each clustering, which is the average colocalization score of all its complexes. In this case we calculate a complex-size-weighted average:

Colocalization coloc = new Colocalization(locData);
System.out.println("Average colocalization score of socio clusters: "
        + coloc.getAverageColocalizationScore(clusteringSocios, true, true));
System.out.println("Average colocalization score of PE clusters:    "
        + coloc.getAverageColocalizationScore(clusteringPE, true, true));

We create a Colocalization object using the localization data we read and then calculate the colocalization score. The second argument of the method indicates if we calculate a weighted mean. The third argument defines if complexes for which no localization data are present will be scored with zero or ignored (true means they are ignored).
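
The effect of the two boolean arguments can be illustrated with a small, library-independent sketch. It does not reproduce how ProCope computes the colocalization score of a single complex; per-complex scores are simply taken as given. The class WeightedAverageSketch, its method signature and the use of Float.NaN to mark complexes without localization data are assumptions made for this illustration only.

```java
// Hypothetical illustration of a (size-weighted) average over complexes.
final class WeightedAverageSketch {

    // scores[i]: score of complex i, Float.NaN if no data is available.
    // sizes[i]:  number of proteins in complex i.
    static float average(float[] scores, int[] sizes,
                         boolean weighted, boolean ignoreMissing) {
        float sum = 0f;
        float totalWeight = 0f;
        for (int i = 0; i < scores.length; i++) {
            boolean missing = Float.isNaN(scores[i]);
            if (missing && ignoreMissing) {
                continue; // complex does not contribute at all
            }
            float score = missing ? 0f : scores[i]; // otherwise scored with zero
            float weight = weighted ? sizes[i] : 1f;
            sum += weight * score;
            totalWeight += weight;
        }
        // No usable complex at all: the average is undefined.
        return totalWeight == 0f ? Float.NaN : sum / totalWeight;
    }
}
```

For three complexes of sizes 2, 6 and 4 with scores 1.0, 0.5 and "no data", the weighted average is 5/8 = 0.625 when the third complex is ignored and 5/12 when it is scored with zero, so the choice can noticeably change the result.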


Name mappings

As the GO annotation files (see below) contain Primary SGDIDs (e.g. S000000099) whereas the purification files and complexes contain systematic names (e.g. YBL003C) we need a name mapping. Name mappings are represented as directed networks where each directed edge represents one name mapping from the source to the target node of the edge. The mapping needed in our case is contained in data/yeastmappings.txt.

try {
    ProteinManager.addNameMappings(NetworkReader.readNetwork("yeastmappings_080415.txt", true), true);
} catch (Exception e) {
    // something went wrong, output error message
    System.err.println("Could not load name mapping network:");
    System.err.println(e.getMessage());
    System.exit(1);
}

The ProteinManager handles the name mappings and the mapping of protein identifiers to internal IDs. Please check out the JavaDocs of that class for more information.

Note that the second argument of readNetwork indicates whether we are reading a directed network (in this case: yes).

The second argument of the addNameMappings method tells the ProteinManager that the file contains the targets in the first column and the synonyms in the second column (targetFirst == true). That is, a mapping S000000061 => YAL066W in the file looks like this:

YAL066W    S000000061

The alternative would be (targetFirst == false, which is not the case for data/yeastmappings.txt):

S000000061    YAL066W
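
The two column orders can be illustrated with a small parsing sketch. This is not how ProCope reads mappings (it represents them as directed networks, as described above): the class NameMappingSketch and its parse method are hypothetical and only show how a targetFirst flag decides which column is treated as the target name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the targetFirst convention.
final class NameMappingSketch {

    // Returns a map from synonym to target name.
    static Map<String, String> parse(List<String> lines, boolean targetFirst) {
        Map<String, String> synonymToTarget = new HashMap<>();
        for (String line : lines) {
            String[] columns = line.trim().split("\\s+");
            if (columns.length < 2) {
                continue; // skip empty or incomplete lines
            }
            String target = targetFirst ? columns[0] : columns[1];
            String synonym = targetFirst ? columns[1] : columns[0];
            synonymToTarget.put(synonym, target);
        }
        return synonymToTarget;
    }
}
```

With either column order and the matching flag, looking up S000000061 in the resulting map yields YAL066W.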

GO semantic similarity

Another measure for the quality of complexes can be calculated using gene annotations to the Gene Ontology for the specific organism. We implemented a measure proposed by Schlicker et al., 2006. It defines a term similarity which calculates a similarity measure between two given GO terms based on the given gene annotations. In addition it defines a functional similarity measure which integrates the term similarities of all pairs of terms the two genes are annotated with to obtain a GO semantic similarity score for two given proteins.

First we need to load the GO annotations for Yeast and the GO Network:

GOAnnotations goAnno=null;
GONetwork goNet=null;
try {
    goAnno = GOAnnotationReader.readAnnotations("data/go/gene_association_080504.sgd");
    goNet = new GONetwork("data/go/gene_ontology_edit_080504.obo",
            GONetwork.Namespace.BIOLOGICAL_PROCESS,
            GONetwork.Relationships.BOTH);
} catch (Exception e) {
    // something went wrong, output error message
    System.err.println("Could not load GO data:");
    System.err.println(e.getMessage());
    System.exit(1);
}

We use the biological process ontology. To generate the GO network we follow both is_a and part_of relationships.

Next we create the term and functional similarity calculators. For more information about the similarity measures please consult the original literature referenced above.

TermSimilarities termSim = new TermSimilaritiesSchlicker(goNet, goAnno,
        TermSimilaritiesSchlicker.TermSimilarityMeasure.RELEVANCE, true);

FunctionalSimilarities funSim = new FunctionalSimilaritiesSchlicker(
        goNet, goAnno, termSim,
        FunctionalSimilaritiesSchlicker.FunctionalSimilarityMeasure.TOTALMAX);

Note that a FunctionalSimilarities object is a ScoresCalculator and can be used to calculate score networks or complex scores as described in the following section.


Calculation of complex scores

We now use the GO scores calculator we just defined to assign a quality measure to the complex sets which resulted from the clusterings above. The class ComplexScoreCalculator contains methods to calculate complex scores and average complex scores over whole complex sets. The score of a complex is defined as the average score of all inner-complex protein interactions, where undefined scores (e.g. missing edges in the network) are treated as 0. That is, each complex score is the average of n*(n-1)/2 different inner-complex interaction scores for a complex of n proteins. Again we calculate a complex-size-weighted average:

System.out.println("Functional similarity of socio clusters: " +
        ComplexScoreCalculator.averageComplexSetScore(funSim, clusteringSocios, true, true));
System.out.println("Functional similarity of PE clusters:    " +
        ComplexScoreCalculator.averageComplexSetScore(funSim, clusteringPE, true, true));
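
The definition of a single complex score given above can be sketched as follows. This is a library-independent illustration, not the ComplexScoreCalculator implementation: the class ComplexScoreSketch, the string-keyed score map and the handling of complexes with fewer than two proteins (scored as 0 here) are assumptions.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the complex score definition.
final class ComplexScoreSketch {

    // Order-independent key for an undirected protein pair.
    static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // Average over all n*(n-1)/2 protein pairs; missing pairs count as 0.
    static float complexScore(List<String> members, Map<String, Float> pairScores) {
        int n = members.size();
        if (n < 2) {
            return 0f; // assumption: no pairs means a score of 0
        }
        float sum = 0f;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                sum += pairScores.getOrDefault(key(members.get(i), members.get(j)), 0f);
            }
        }
        return sum / (n * (n - 1) / 2f);
    }
}
```

For a complex {A, B, C} with scores 0.9 for A-B and 0.3 for B-C and no score for A-C, the three pairs average to (0.9 + 0.3 + 0) / 3 = 0.4.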


Output

The output of the program should look like this:

Loading purifications...
Merging datasets...
Calculating socio affinity scores...
Calculating PE networks...
Merging PE networks...
Clustering...
Socio clusters against MIPS: Sensitivity: 0.54275095, PPV: 0.71098727, Accuracy: 0.62119967
PE clusters against MIPS:    Sensitivity: 0.63791823, PPV: 0.71283257, Accuracy: 0.6743359
Calculating colocalization scores...
Average colocalization score of socio clusters: 0.6771759
Average colocalization score of PE clusters:    0.7355031
Loading name mappings...
Loading GO network...
Functional similarity of socio clusters: 0.57627916
Functional similarity of PE clusters:    0.6019231



