R: Gene selection based on internally computed 'Over-representation analysis (ORA)' results.

doORA {StabPerf}

R Documentation

Description

Performs gene selection on a set of RefSeq identifiers, which are passed as their indices in the background data. The selection is based on internally computed over-representation analysis (ORA) results.

Necessary mappings between GO terms and RefSeq identifiers are provided. Various parameters for a custom setup of ORA are available.

Default settings are provided for inexperienced users.

Usage

doORA(feature_vector,all_refseqs=all_refseqs,thresh=NULL,GOgreater=NULL,GOsmaller=NULL,fold,s,nr)

Arguments

`feature_vector`	A list of indices corresponding to the positions of RefSeq identifiers in the provided background data.
`all_refseqs`	A list of all RefSeq identifiers contained in the background data.
`thresh`	Optional. P-value threshold: only GO categories with p-values below this threshold are considered significant. Default: 0.01 (p-value < 0.01).
`GOgreater`	Optional. Only GO categories whose total size (in the background data) is larger than GOgreater are considered significant. Default: GOgreater=5.
`GOsmaller`	Optional. Only GO categories whose total size (in the background data) is smaller than GOsmaller are considered significant. Default: GOsmaller=100.
`fold`	Required in the context of the pipeline, where it denotes the current cross-validation run. Standalone users should set fold=0.
`s`	Required in the context of the pipeline, where it denotes the number of the sampling method currently in use. Standalone users should set s=0.
`nr`	Required in the context of the pipeline, where it denotes the number of the current sampling repeat. Standalone users should set nr=1.
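
Because `feature_vector` holds positions rather than identifiers, it usually has to be derived from a set of RefSeq identifiers first. The following is a minimal sketch of one way to do that with base R; the identifiers and the variable `genes_of_interest` are hypothetical and are not part of the package. Whether doORA() expects a plain vector or a list is not stated here, so `as.list()` may be needed.

```r
# Hypothetical background data: RefSeq identifiers in their original order.
all_refseqs <- c("NM_000014", "NM_000015", "NM_000016", "NM_000017", "NM_000018")

# Hypothetical genes of interest; the last one is not in the background.
genes_of_interest <- c("NM_000016", "NM_000014", "NM_999999")

# match() returns the position of each identifier in all_refseqs, or NA.
idx <- match(genes_of_interest, all_refseqs)

# Drop identifiers that are absent from the background data.
feature_vector <- idx[!is.na(idx)]
feature_vector  # positions 3 and 1
```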

Details

All necessary mapping files (refseq2go and go2refseq) are provided by the package.

doORA() takes a set of RefSeq identifier indices. These indices correspond to the positions of the RefSeq identifiers in the background data. Only RefSeq identifiers with available GO annotation are processed. Using the go2refseq mapping, all GO categories associated with the current RefSeq set are retrieved, and each occurrence of a GO term in any RefSeq identifier is counted.

In the next step, p-values are calculated for all GO categories in the background data using the hypergeometric distribution, as described in Draghici et al. (see References). All GO categories with p-value < thresh, total size > GOgreater and total size < GOsmaller are considered significant.

These significant GO terms are then used in a back-mapping step to single out the RefSeq identifiers associated with them. These are the selected genes/RefSeq identifiers. Their corresponding indices in the background data are returned by doORA().

All generated ORA information is saved to a *.RData object with the file signature ora_fold_s_nr_thresh_GOgreater_GOsmaller.RData. If doORA() is used in the pipeline context, the ORA file is saved to the folder "./tmp/ora/". In standalone mode, the file is saved to ".".
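
The steps described here can be illustrated on toy data. This is only a sketch under stated assumptions, not the package's implementation: the GO identifiers, the `go2refseq` list structure, the one-sided test P(X >= hits) and the restriction of the back-mapping to the passed identifiers are all assumptions made for illustration.

```r
# Toy background of 10 hypothetical RefSeq identifiers.
all_refseqs <- sprintf("NM_%06d", 1:10)

# ASSUMPTION: go2refseq is a named list mapping a GO term to its RefSeq ids.
go2refseq <- list(
  "GO:A" = all_refseqs[1:4],
  "GO:B" = all_refseqs[3:10],
  "GO:C" = all_refseqs[1:2]
)

feature_vector <- c(1, 2, 3, 9)            # indices into all_refseqs
selected <- all_refseqs[feature_vector]

N <- length(all_refseqs)                   # background size
n <- length(selected)                      # size of the passed set

# One hypergeometric test per GO category.
ora <- do.call(rbind, lapply(names(go2refseq), function(go) {
  members <- go2refseq[[go]]
  K <- length(members)                     # total size in the background
  k <- sum(selected %in% members)          # hits within the passed set
  # ASSUMPTION: one-sided p-value P(X >= k) for over-representation.
  p <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
  data.frame(go = go, size = K, hits = k, pvalue = p, stringsAsFactors = FALSE)
}))

# Toy thresholds, chosen to suit the tiny example (not the package defaults).
thresh <- 0.15; GOgreater <- 2; GOsmaller <- 8
sig <- ora$go[ora$pvalue < thresh & ora$size > GOgreater & ora$size < GOsmaller]

# Back-mapping: keep the passed identifiers annotated to a significant term.
indices <- feature_vector[selected %in% unlist(go2refseq[sig])]
```

In this toy setup only "GO:A" passes both the p-value and the size filters, so index 9, which is annotated to "GO:B" alone, is dropped from the result.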

Value

`indices`	A list of indices corresponding to the positions of the selected RefSeq identifiers in the provided background data.

References

Draghici S, Khatri P, Martins RP, Ostermeier GC, and Krawetz SA: "Global functional profiling of gene expression." Genomics. 81(2):98-104. 2003.

Barnes MG, Aronow BJ, Luyrink LK, Moroldo MB, Pavlidis P, Passo MH, Grom AA, Hirsch R, Giannini EH, Colbert RA, Glass DN, and Thompson SD: "Gene expression in juvenile arthritis and spondyloarthropathy: pro-angiogenic ELR+ chemokine genes relate to course of arthritis." Rheumatology. 43(8):973-9. 2004.

Pruitt KD, Tatusova T, and Maglott DR: "NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins." Nucleic Acids Res. 33(1):D501-D504. 2005.

The Gene Ontology: "http://www.geneontology.org".

Examples

# Standalone call: fold=0, s=0, nr=1; dataset$refseq holds the background RefSeq identifiers
selectedgenes_indices <- doORA(feature_vector, all_refseqs=dataset$refseq, thresh=0.001, GOgreater=4, GOsmaller=100, fold=0, s=0, nr=1)
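
After a standalone run, the saved ORA file can be inspected. The sketch below builds the expected file name from the signature given in Details and loads it if present. The exact separator and number formatting of the file name are assumptions, and the names of the objects stored in the file are not documented here, so the code simply lists whatever `load()` restores.

```r
# Argument values from the example call above.
fold <- 0; s <- 0; nr <- 1
thresh <- 0.001; GOgreater <- 4; GOsmaller <- 100

# ASSUMPTION: values are joined with "_" and printed in R's default format.
ora_file <- paste0(
  paste("ora", fold, s, nr, thresh, GOgreater, GOsmaller, sep = "_"),
  ".RData"
)
ora_path <- file.path(".", ora_file)       # standalone mode saves to "."

if (file.exists(ora_path)) {
  ora_env <- new.env()
  # load() returns the names of the restored objects.
  loaded <- load(ora_path, envir = ora_env)
  print(loaded)
} else {
  message("No ORA file found at ", ora_path)
}
```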

[Package StabPerf version 0.5 Index]