R: Predicting an optimal model for classifying gene-expression data based on random sampling

pipeline {StabPerf}

R Documentation

Predicting an optimal model for classifying gene-expression data based on random sampling

Description

Predicts and assesses an optimal model for classifying gene-expression data based on random sampling.

Usage

pipeline(data.file=NULL, config=NULL)

Arguments

`data.file`	Optional parameter giving the source of the dataset
`config`	Optional path to the file from which the configuration is read

Details

Configures the pipeline from the given config file or, if none is given, from the default configuration. Reads the selectors, classifiers and predictors from the corresponding files (~.conf). The dataset is read from the given data.file or from the configured datafile. Classes of the dataset are adjusted according to pipeline.classes.set and pipeline.classes.combine.

If cv is given and is sufficiently large, k-fold cross-validation is performed. In this case, for each fold and each selector, feature vectors are drawn (each draw using a fraction 1 - 1/sampling_cv of the training set), and the accuracy of each selector/classifier combination is calculated. These accuracies are returned.

The best model is chosen by assessing the stability of the sampled features together with the median accuracy of the sampled classifiers and its deviation. Finally, the best model is determined on the full dataset and dumped to modelfile.

The number of drawn samples is sampling_steps * sampling_repeats. All samples can be computed on a cluster and are saved to checkpoints (under tmpdir), so that an interrupted run can be resumed.
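The sampling arithmetic described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the parameter values, the training-set size, the simulated feature selections and accuracies, the use of sd() as the "deviation", and the mean pairwise Jaccard index as a stability measure are all hypothetical stand-ins.

```r
# Illustrative sketch only; all values and measures below are assumptions.
set.seed(1)

sampling_cv      <- 5    # each draw uses 1 - 1/sampling_cv of the training set
sampling_steps   <- 4    # hypothetical configuration value
sampling_repeats <- 3    # hypothetical configuration value
n_train          <- 100  # assumed number of training samples
n_features       <- 50   # assumed number of features (genes)

# Number of drawn samples and size of each draw, as stated in the Details.
n_draws   <- sampling_steps * sampling_repeats
draw_size <- round(n_train * (1 - 1 / sampling_cv))

# One random subsample of training-set indices per draw.
draws <- lapply(seq_len(n_draws), function(i) sample.int(n_train, draw_size))

# Placeholder for a selector: each draw "selects" 10 features at random.
selected <- lapply(seq_len(n_draws), function(i) sample.int(n_features, 10))

# Placeholder accuracies for one selector/classifier combination.
acc <- runif(n_draws, min = 0.7, max = 0.9)

# Assumed stability measure: mean pairwise Jaccard index of the feature sets.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
pairs <- combn(n_draws, 2)
stability <- mean(apply(pairs, 2, function(p) {
  jaccard(selected[[p[1]]], selected[[p[2]]])
}))

# Summary used to compare models: median accuracy and its spread (sd assumed).
model_summary <- c(stability = stability,
                   median_accuracy = median(acc),
                   deviation = sd(acc))
print(model_summary)
```

With these assumed values there are 12 draws of 80 training samples each. A real run would replace the random placeholders with the configured selectors and classifiers.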

Value

Invisible NULL

References

Davis CA, Gerick F, Hintermair V: "Assessment of signature stability and dependant classifier performance by random sampling", (2005)

Examples

pipeline(data.file="dataset.RData", config="config/pipeline.cfg")

[Package StabPerf version 0.5 Index]