Sample Use Case 1: Loading and filtering data

This sample use case shows how to load and filter data from microarray or mRNA-seq experiments with the HALO API. Each step is described in detail below, together with the corresponding code snippet. The complete source code of this example is available in the package halo.examples in the src/ directory.
To work with your mRNA level measurements, you first have to load them from your data files. Afterwards you can load additional files containing attributes that correspond to your probesets or protein sequences. You can then filter the data with a set of different filtering methods and write the filtered probesets to an output file. This sample use case gives a source code example for each of these steps. For more detailed information on all methods and classes, please see the JavaDocs.

You can find the data used as examples in folder HALO/data.

Important note: All variable parameters (methods, thresholds, etc.) used in this example are chosen arbitrarily and serve only as illustration. For practical use, these parameters have to be chosen carefully depending on the data and the goals of the analysis.

Loading expression data

The first important step in data preparation is loading the probesets and the corresponding RNA measurements from a file containing expression data. For details on the required input file format, please see File formats. This example uses the file data/Example_mouse.txt as input. From this file a Data-object is created, which stores all necessary information.
The example also shows how to load not only RNA measurements but also additional attributes that are part of the original data file.

//Define the names of columns containing newly transcribed RNA
ArrayList<String> colNew = new ArrayList<String>();
colNew.add("E1");
colNew.add("E2");
colNew.add("E3");

//Define the names of columns containing pre-existing RNA
ArrayList<String> colPre = new ArrayList<String>();
colPre.add("U1");
colPre.add("U2");
colPre.add("U3");

//Define the names of columns containing total RNA
ArrayList<String> colTot = new ArrayList<String>();
colTot.add("T1");
colTot.add("T2");
colTot.add("T3");

//Define the name of the column containing the gene names
ArrayList<String> geneAtt = new ArrayList<String>();
geneAtt.add("Gene Symbol");
//Is the data in log scale?
boolean log = false;

//Read in the data and load the gene name attribute along with it
Data data = new Data("data/Example_mouse.txt", colTot, colNew, colPre, geneAtt, log);
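
The three column lists above follow the same pattern (a prefix followed by a running number), so they can be built more compactly. The following is a minimal sketch in plain Java; the class ColumnNames and its method numbered are hypothetical helpers introduced here for illustration and are not part of the HALO API.

```java
import java.util.ArrayList;

// Hypothetical helper (not part of HALO): builds lists such as [E1, E2, E3].
final class ColumnNames {
    private ColumnNames() {}

    // Returns the names prefix1 .. prefixN in ascending order,
    // e.g. prefix "E" and count 3 give E1, E2, E3.
    static ArrayList<String> numbered(String prefix, int count) {
        ArrayList<String> names = new ArrayList<>(count);
        for (int i = 1; i <= count; i++) {
            names.add(prefix + i);
        }
        return names;
    }
}
```

With this helper, colNew could be created as ColumnNames.numbered("E", 3) and then passed to the Data constructor exactly as before.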


Loading additional attributes (optional)

Besides RNA measurements you can also load attributes such as gene names or present/absent calls, which may be needed for later procedures or which you want to preserve in the output. You can either load these attributes together with your data, if they are contained in the input file, or load them separately. This step can therefore be performed at any point in time. In the example below, the original file "data/Example_mouse.txt" serves as the source for the additional attributes, which contain present calls. Attributes can also be loaded from a separate file (shown in the last, commented-out line below); this file is then used to complete the previously created Data-object. It is always advisable to load present calls separately from other attributes, since they are handled differently internally. For this reason there are two methods for each attribute loading procedure: one for regular attributes, and one for present/absent calls or similar attributes.
Please note that you don't have to load sequence files prior to using them for evaluation.


//Define the names of columns containing present call attributes
ArrayList<String> colAtt = new ArrayList<String>();
String[] attributes = new String[]{"Call_T1", "Call_T2", "Call_T3", "Call_E1", "Call_E2", "Call_E3","Call_U1", "Call_U2", "Call_U3"};
for (String t : attributes) {
    colAtt.add(t);
}

//Load present calls separately from the data file
data.loadPresentCallsFromDatafile(colAtt);

//Alternatively, load attributes from a separate file
// data.addAttributes(ATTRIBUTEFILE);
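
In the example file, each present call column is named after its measurement column with the prefix "Call_". Under that assumption, the list of call columns can be derived from the measurement column lists instead of being typed out. This is a minimal sketch in plain Java; the class PresentCallColumns is a hypothetical helper and not part of the HALO API, and the "Call_" prefix is a convention of this example file only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of HALO): derives present call column names.
final class PresentCallColumns {
    // Naming convention assumed from the example file, e.g. "Call_T1" for "T1".
    static final String PREFIX = "Call_";

    private PresentCallColumns() {}

    // Returns PREFIX + column for every column of every group, in the given order.
    @SafeVarargs
    static ArrayList<String> forColumns(List<String>... groups) {
        ArrayList<String> callColumns = new ArrayList<>();
        for (List<String> group : groups) {
            for (String column : group) {
                callColumns.add(PREFIX + column);
            }
        }
        return callColumns;
    }
}
```

Calling PresentCallColumns.forColumns(colTot, colNew, colPre) yields the same nine names, in the same order, as the attributes array above.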


Filtering

To filter the loaded data, we perform a filtering step that consists of several successive filtering methods. In this sample, filtering by a threshold and filtering by present/absent calls are used as examples. Each method can be repeated as often as desired, since every filtering step creates a new Data-object that can be used for subsequent analysis or for further filtering.

//Filter according to a given threshold
double threshold = 50;
data = Filter.filter(data, threshold);

String call = "A"; //The present/absent call used for filtering
int callNumber = 1; //Number of occurrences of this call required to discard the probeset

//Filter according to present/absent calls
data = Filter.filterAbsent(data, colAtt, call, callNumber);
data = Filter.filterAbsent(data, colAtt, "M", callNumber);
//Filter out probesets with no annotated gene name
data = Filter.filterAbsent(data, geneAtt, "---", callNumber);
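
To make the role of call and callNumber concrete, the following standalone sketch mimics the rule described in the comments above: a probeset is discarded as soon as the given call occurs at least callNumber times among its call columns. It is plain Java written for illustration only. The class CallFilterSketch and its data layout (a map from probeset ID to its calls) are assumptions; it does not use or reproduce HALO's Filter implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; HALO's own Filter class may work differently.
final class CallFilterSketch {
    private CallFilterSketch() {}

    // Counts how often the given call occurs among a probeset's calls.
    static int countCalls(List<String> calls, String call) {
        int count = 0;
        for (String c : calls) {
            if (c.equals(call)) {
                count++;
            }
        }
        return count;
    }

    // Returns a new map with the probesets in which the call occurs fewer
    // than callNumber times; the input map is left unchanged.
    static Map<String, List<String>> filterByCall(
            Map<String, List<String>> callsByProbeset, String call, int callNumber) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : callsByProbeset.entrySet()) {
            if (countCalls(entry.getValue(), call) < callNumber) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

As in the HALO example, each call returns a new collection, so several filters can be chained, for instance first for "A" and then for "M".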


Output data

After filtering the data, you may want to write the remaining probesets, together with a subset or the complete set of RNA measurement values and attributes, to an output file. The example below shows how to do this.

//choose name for output
String output = "Example_mouse_filtered.txt";

//add the gene name column to the attributes that are written out
colAtt.addAll(geneAtt);
//write output
data.writeOutput(output, colTot, colNew, colPre, colAtt);
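
Note that colAtt.addAll(geneAtt) changes colAtt itself, so afterwards the list also contains the gene name column. If colAtt is to be reused later, for example for another present/absent call filter, a separate list can be built for the output instead. The following is a minimal sketch in plain Java; ColumnLists.concat is a hypothetical helper and not part of the HALO API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of HALO): combines lists without side effects.
final class ColumnLists {
    private ColumnLists() {}

    // Returns a new list holding the elements of first followed by those of
    // second; neither input list is modified.
    static ArrayList<String> concat(List<String> first, List<String> second) {
        ArrayList<String> combined = new ArrayList<>(first);
        combined.addAll(second);
        return combined;
    }
}
```

The result of ColumnLists.concat(colAtt, geneAtt) could then be passed to writeOutput as the last argument in place of colAtt.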


Output

The output produced by HALO should look like this:

Loading data...
Done loading data.
You have 31451 probesets.
------------------------------
Loading attributes...
Done loading attributes.
------------------------------
Filtering data...
Done filtering data.
You have 11031 probesets.
------------------------------
Filtering data...
Done filtering data.
You have 10984 probesets.
------------------------------
Filtering data...
Done filtering data.
You have 10937 probesets.
------------------------------
Filtering data...
Done filtering data.
You have 10731 probesets.
------------------------------
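
To see how many probesets each filtering step removed, the counts can be read from the "You have N probesets." lines of this log. The following is a minimal sketch in plain Java that relies only on the log format shown above; the class ProbesetLog is a hypothetical helper and not part of the HALO API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HALO): evaluates the console log.
final class ProbesetLog {
    // Matches lines such as "You have 31451 probesets."
    private static final Pattern COUNT = Pattern.compile("You have (\\d+) probesets\\.");

    private ProbesetLog() {}

    // Extracts all probeset counts from the log, in order of appearance.
    static List<Integer> counts(String log) {
        List<Integer> result = new ArrayList<>();
        Matcher matcher = COUNT.matcher(log);
        while (matcher.find()) {
            result.add(Integer.parseInt(matcher.group(1)));
        }
        return result;
    }

    // Returns the difference between each count and the one following it.
    static List<Integer> removedPerStep(List<Integer> counts) {
        List<Integer> removed = new ArrayList<>();
        for (int i = 1; i < counts.size(); i++) {
            removed.add(counts.get(i - 1) - counts.get(i));
        }
        return removed;
    }
}
```

For the log above, the four filtering steps remove 20420, 47, 47 and 206 probesets, respectively.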

The following file should be produced: Example_mouse_filtered.txt