File formats

ProCope expects specific file formats when loading data objects from the file system. Short specifications of all file formats are given below. Note that <TAB> represents the tab character.

Overview

GZIP
Protein networks
Complex sets
Purification data
Gene Ontology
Localization data

GZIP

The graphical user interface and all command line tools automatically detect and decompress files that were compressed with the gzip program. When using the Java API, use the GZIPInputStream class to decompress files on the fly (see for instance: Java API Sample Use Case 2).
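The following is a minimal sketch of reading a possibly gzip-compressed file with the standard java.util.zip.GZIPInputStream class. The class name GzipAwareReader and the method open are hypothetical and not part of the ProCope API. Checking the first two bytes for the gzip magic number is only one way to imitate the automatic detection described above; it is an assumption, not a description of how ProCope detects compressed files.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of the ProCope API.
class GzipAwareReader {

    /** Opens a text file and decompresses it on the fly if it starts with the gzip magic bytes. */
    static BufferedReader open(Path file) throws IOException {
        InputStream in = new BufferedInputStream(Files.newInputStream(file));
        // Peek at the first two bytes without consuming them.
        in.mark(2);
        int b1 = in.read();
        int b2 = in.read();
        in.reset();
        // gzip streams start with the two bytes 0x1f 0x8b.
        if (b1 == 0x1f && b2 == 0x8b) {
            in = new GZIPInputStream(in);
        }
        // Assumption: the data files are plain text readable as UTF-8.
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

The returned reader can then be consumed line by line, whether or not the underlying file was compressed.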

Protein networks

Protein networks in ProCope are edge-induced. That is, a protein network file contains only edges, one edge per line. Each line has the following format:

protein1<TAB>protein2<TAB>edgeweight[<TAB>annotations]

Where protein1 and protein2 are identifiers of the two involved proteins, edgeweight is the decimal weight or score for that edge and annotations are optional edge annotations containing arbitrary key/value pairs. For directed networks, protein1 is the source of the directed edge whereas protein2 is its target node.

Note that ProCope automatically treats numeric values in annotations as numbers within the program (which can for instance be used for numeric filtering).

Here is an example of two lines in a network file:

proteinAB    proteinDE    0.2

proteinAB    proteinFG    1.0    otherscore=0.3212;source=database2;verified=no

Leaving out the edge weight is also valid, but be sure to keep the two consecutive <TAB> characters. The edge will then have no associated weight and consists only of its annotations:

proteinAB    proteinFG        otherscore=0.3212;source=database2;verified=no
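To make the line format concrete, here is a minimal sketch of a parser for a single network line. The class NetworkLine is hypothetical and not part of the ProCope API. The annotation syntax (key=value pairs separated by semicolons) is inferred from the example lines above, and converting numeric annotation values to Double is only an approximation of the numeric handling described earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for illustration; not part of the ProCope API.
class NetworkLine {
    final String protein1;   // source node for directed networks
    final String protein2;   // target node for directed networks
    final Double weight;     // null when the weight field is left empty
    final Map<String, Object> annotations = new LinkedHashMap<>();

    NetworkLine(String line) {
        // A limit of -1 keeps empty fields, so an omitted weight stays at index 2.
        String[] fields = line.split("\t", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected at least 3 tab-separated fields: " + line);
        }
        protein1 = fields[0];
        protein2 = fields[1];
        weight = fields[2].isEmpty() ? null : Double.valueOf(fields[2]);
        if (fields.length > 3 && !fields[3].isEmpty()) {
            // Assumption: annotations are key=value pairs separated by ';'.
            for (String pair : fields[3].split(";")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue; // skip pairs without '='
                }
                annotations.put(pair.substring(0, eq), toNumberIfPossible(pair.substring(eq + 1)));
            }
        }
    }

    /** Returns a Double if the value parses as a number, otherwise the original string. */
    static Object toNumberIfPossible(String value) {
        try {
            return Double.valueOf(value);
        } catch (NumberFormatException e) {
            return value;
        }
    }
}
```

With this sketch, a line with an empty weight field yields a null weight while its annotations are still parsed, which mirrors the two-<TAB> rule described above.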

Complex sets

Complex set files contain one complex per line. Each line is a <TAB>-separated list of the protein identifiers which make up the respective complex. Example complex line from the MIPS reference complex set:

ydr004w    ydr076w    yer095w    ygl163c    yml032c

Purification data

ProCope employs the purification dataset file format used in the data files of the Krogan Lab Interactome Database. Each line describes one bait-prey interaction and contains the following <TAB>-separated values:

purification name
identification
day
number
bait
prey
score

ProCope only uses the purification name, bait and prey information. All bait-prey interactions with the same purification name (which of course should also all have the same bait) will be combined into one purification experiment.
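The grouping described above can be sketched as follows. The classes PurificationGrouping and Experiment are hypothetical and not part of the ProCope API. The column positions follow the field list above (purification name first, bait fifth, prey sixth). Rejecting lines whose bait differs within one purification is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical classes for illustration; not part of the ProCope API.
class PurificationGrouping {

    /** One purification experiment: a single bait and all preys found with it. */
    static final class Experiment {
        final String bait;
        final List<String> preys = new ArrayList<>();

        Experiment(String bait) {
            this.bait = bait;
        }
    }

    /** Groups bait-prey lines into experiments keyed by purification name. */
    static Map<String, Experiment> group(List<String> lines) {
        Map<String, Experiment> experiments = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t", -1);
            if (fields.length < 7) {
                throw new IllegalArgumentException("expected 7 tab-separated fields: " + line);
            }
            // Only the purification name, bait and prey columns are used.
            String name = fields[0];
            String bait = fields[4];
            String prey = fields[5];
            Experiment experiment = experiments.computeIfAbsent(name, n -> new Experiment(bait));
            // Assumption: differing baits within one purification are treated as an error.
            if (!experiment.bait.equals(bait)) {
                throw new IllegalArgumentException("inconsistent bait for purification " + name);
            }
            experiment.preys.add(prey);
        }
        return experiments;
    }
}
```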

Gene Ontology

ProCope supports the ontology and gene annotation file formats provided on the Gene Ontology website. Please use the .obo ontology file format.

Localization data

Localization data files contain the localization information of one protein per line. Each line consists of the protein identifier and a <TAB> character, followed by a comma-separated list of cellular localizations. An example:

YLR132C    cytoplasm,mitochondrion,nucleus
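As a minimal sketch, localization lines could be read into a map from protein identifier to localization list as shown below. The class LocalizationParser is hypothetical and not part of the ProCope API. The sketch assumes that each protein appears on at most one line; a later line for the same protein would simply overwrite the earlier one.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for illustration; not part of the ProCope API.
class LocalizationParser {

    /** Maps each protein identifier to its list of cellular localizations. */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            // The first tab separates the identifier from the localization list.
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("missing tab character: " + line);
            }
            String protein = line.substring(0, tab);
            List<String> localizations = Arrays.asList(line.substring(tab + 1).split(","));
            result.put(protein, localizations);
        }
        return result;
    }
}
```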
