pipeline {StabPerf} | R Documentation |
Predicting and assessing an optimal model for classifying gene-expression data based on random sampling
pipeline(data.file=NULL, config=NULL)
data.file |
Optional parameter indicating the source of the dataset |
config |
Optional parameter naming the file from which the configuration is read |
Configures the pipeline from the given config file or the provided default configuration. Reads the selectors, classifiers and predictors from the appropriate files (~.conf). The dataset is read from the given data file or from datafile. Classes of the dataset are adjusted according to pipeline.classes.set and pipeline.classes.combine. If cv is given and is large enough, k-fold cross-validation is performed. In this case, feature vectors are drawn for each fold and selector (using a fraction 1-1/sampling_cv of the training set), and the accuracies are calculated for each selector/classifier combination. These accuracies are returned. The best model is chosen by assessing the stability of the sampled features together with the median accuracy of the sampled classifiers and its deviation. Finally, the best model is determined on the full dataset and dumped to modelfile. The number of drawn samples is sampling_steps * sampling_repeats. All samples can be computed on a cluster and are saved to checkpoints (under tmpdir), so that an interrupted run can be resumed.
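The fold and sampling arithmetic described above can be illustrated with a minimal base-R sketch. It is not StabPerf code: the parameter values are made up, draw_fold_samples is a hypothetical helper, and applying the sample count per fold is an assumption.

```r
# Illustrative values only; in StabPerf these come from the configuration.
cv               <- 5    # number of cross-validation folds
sampling_cv      <- 4    # each sample uses 1 - 1/4 = 75% of the training set
sampling_steps   <- 10
sampling_repeats <- 3
n_obs            <- 100  # hypothetical number of observations

set.seed(1)
# Assign every observation to one of cv folds of (nearly) equal size.
fold_id <- sample(rep_len(seq_len(cv), n_obs))

# Number of drawn samples as stated in the description
# (assumed here to apply to each fold).
n_samples <- sampling_steps * sampling_repeats

# Hypothetical helper: draw the row indices of all samples for one fold.
draw_fold_samples <- function(fold) {
  train  <- which(fold_id != fold)   # training rows of this fold
  n_draw <- round((1 - 1 / sampling_cv) * length(train))
  replicate(n_samples,
            sort(train[sample.int(length(train), n_draw)]),
            simplify = FALSE)
}

samples_by_fold <- lapply(seq_len(cv), draw_fold_samples)
```

Each element of samples_by_fold holds the subsamples for one fold; in a real run, a selector would be applied to each subsample to obtain one feature vector.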
Invisible |
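Because the result is returned invisibly, nothing is printed unless the value is assigned or printed explicitly. The following base-R sketch shows this behaviour with a stand-in function; run_quietly and its values are hypothetical and do not reflect the structure of what pipeline actually returns.

```r
# Stand-in for a function that returns its result invisibly.
run_quietly <- function() invisible(c(fold1 = 0.81, fold2 = 0.79))

run_quietly()          # nothing is printed at top level
acc <- run_quietly()   # the value can still be captured
mean(acc)              # and used like any other object
```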
Davis CA, Gerick F, Hintermair V: "Assessment of signature stability and dependant classifier performance by random sampling", (2005)
predict.edamodel, .save.model, confParser, changeLabels, best.model
pipeline(data.file="dataset.RData", config="config/pipeline.cfg")
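
How feature stability and accuracy might be weighed against each other can be illustrated with a toy example. The sketch below is an assumption, not the StabPerf implementation: it measures stability as the mean pairwise Jaccard index of the sampled feature sets, summarises accuracy by its median and MAD, and ranks candidates by a made-up score. The helper jaccard_stability, the toy data and the score are all hypothetical.

```r
# Hypothetical helper: mean pairwise Jaccard index of a list of feature sets.
jaccard_stability <- function(feature_sets) {
  pairs <- combn(length(feature_sets), 2)
  mean(apply(pairs, 2, function(p) {
    a <- feature_sets[[p[1]]]
    b <- feature_sets[[p[2]]]
    length(intersect(a, b)) / length(union(a, b))
  }))
}

# Toy data: feature sets and accuracies from three samples per candidate.
candidates <- list(
  stable = list(
    features = list(c("g1", "g2", "g3"), c("g1", "g2", "g3"),
                    c("g1", "g2", "g4")),
    accuracy = c(0.80, 0.82, 0.81)
  ),
  unstable = list(
    features = list(c("g1", "g2", "g3"), c("g4", "g5", "g6"),
                    c("g7", "g8", "g9")),
    accuracy = c(0.90, 0.60, 0.75)
  )
)

# One summary row per candidate selector/classifier combination.
summary_table <- do.call(rbind, lapply(names(candidates), function(nm) {
  x <- candidates[[nm]]
  data.frame(model      = nm,
             stability  = jaccard_stability(x$features),
             median_acc = median(x$accuracy),
             mad_acc    = mad(x$accuracy),
             stringsAsFactors = FALSE)
}))

# Made-up score: reward stability and median accuracy, penalise deviation.
summary_table$score <- with(summary_table, stability + median_acc - mad_acc)
best <- summary_table$model[which.max(summary_table$score)]
```

In this toy data the candidate with the lower peak accuracy wins, because its features and accuracies vary far less across samples.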