BDVAL/Define-splits

From Icbwiki

Revision as of 21:40, 20 April 2009

This mode partitions a training set into splits for training and testing. A typical split design is cross-validation, but other splitting strategies are possible. The mode generates a file, the split plan, that describes precisely how the samples in the whole training set should be distributed into splits.

The generated file consists of lines of the form:

 repeat#  split#  fold-type    sample-id   numeric-class-label sample-index

where:

 repeat# (integer) identifies a (random) repetition of the split strategy.
 split# (integer) uniquely identifies a split.
 fold-type (string) identifies the purpose of the fold in a given split. Samples with fold-type=training
 should be used to train the model, whereas samples with fold-type=test should be used to test it.
 sample-id (string) identifies a sample that belongs to the split/fold described on that line.
 The last two columns (numeric-class-label and sample-index) are optional.
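
The line format described above can be read with a few lines of code. The following is a minimal Python sketch, not part of BDVAL: the names SplitPlanEntry and parse_split_plan_line are hypothetical, and it assumes that columns are separated by whitespace (spaces or tabs), that sample identifiers contain no whitespace, and that the optional class label can be read as a number.

```python
from typing import NamedTuple, Optional


class SplitPlanEntry(NamedTuple):
    """One line of a split plan (hypothetical helper type, not a BDVAL class)."""
    repeat: int
    split: int
    fold_type: str
    sample_id: str
    class_label: Optional[float] = None   # optional fifth column
    sample_index: Optional[int] = None    # optional sixth column


def parse_split_plan_line(line: str) -> SplitPlanEntry:
    """Parse one split-plan line, assuming whitespace-separated columns."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns, got {len(fields)}: {line!r}")
    repeat, split, fold_type, sample_id = fields[:4]
    # The last two columns are optional, so only read them when present.
    class_label = float(fields[4]) if len(fields) > 4 else None
    sample_index = int(fields[5]) if len(fields) > 5 else None
    return SplitPlanEntry(int(repeat), int(split), fold_type, sample_id,
                          class_label, sample_index)


entry = parse_split_plan_line("1 2 test sample2")
print(entry.fold_type, entry.sample_id)  # prints "test sample2"
```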

The following encodes a leave-one-out split strategy with three samples (six folds in three splits, one training fold and one test fold per split):

 1 1 training sample2
 1 1 training sample3
 1 1 test sample1
 1 2 training sample1
 1 2 training sample3
 1 2 test sample2
 1 3 training sample1
 1 3 training sample2
 1 3 test sample3
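
The listing above can be reproduced programmatically. The following Python sketch is an illustration only and is not how DefineSplitsMode is implemented; the function name leave_one_out_plan is hypothetical. Each sample is held out once as the test fold of its own split, and the remaining samples form the training fold.

```python
def leave_one_out_plan(sample_ids, repeat=1):
    """Return split-plan lines for a leave-one-out design (illustrative sketch)."""
    lines = []
    for split, held_out in enumerate(sample_ids, start=1):
        # Every sample except the held-out one goes to the training fold.
        for sample_id in sample_ids:
            if sample_id != held_out:
                lines.append(f"{repeat} {split} training {sample_id}")
        # The held-out sample is the only member of the test fold.
        lines.append(f"{repeat} {split} test {held_out}")
    return lines


for line in leave_one_out_plan(["sample1", "sample2", "sample3"]):
    print(line)
```

With three samples this prints the same nine lines as the listing above.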

This encoding makes it possible to devise strategies that define several partitions of the input samples. For instance, it is possible to define feature-selection, training and test fold-types, in the context of cross-validation with a number of random repeats. The split plan can also be generated independently of DefineSplitsMode and given to execute-splits.
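
As an illustration of a plan generated outside DefineSplitsMode, the following Python sketch writes lines for repeated k-fold cross-validation with an optional feature-selection fold. It is one possible design under stated assumptions, not BDVAL's algorithm: the function name repeated_cv_plan, the round-robin fold assignment, the choice of the fold following the test fold as the feature-selection fold, and the numbering of splits continuously across repeats are all assumptions made for this example.

```python
import random


def repeated_cv_plan(sample_ids, folds, repeats=1,
                     feature_selection_fold=False, seed=0):
    """Return split-plan lines for repeated k-fold cross-validation (sketch)."""
    # A feature-selection fold needs a third fold so that training data remains.
    min_folds = 3 if feature_selection_fold else 2
    if not min_folds <= folds <= len(sample_ids):
        raise ValueError(f"folds must be between {min_folds} and the sample count")
    rng = random.Random(seed)
    lines = []
    split = 0  # assumption: split numbers keep increasing across repeats
    for repeat in range(1, repeats + 1):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        # Deal the shuffled samples round-robin into `folds` groups.
        groups = [shuffled[i::folds] for i in range(folds)]
        for test_idx in range(folds):
            split += 1
            fs_idx = (test_idx + 1) % folds if feature_selection_fold else None
            for idx, group in enumerate(groups):
                if idx == test_idx:
                    fold_type = "test"
                elif idx == fs_idx:
                    fold_type = "feature-selection"
                else:
                    fold_type = "training"
                for sample_id in group:
                    lines.append(f"{repeat} {split} {fold_type} {sample_id}")
    return lines


samples = [f"sample{i}" for i in range(1, 7)]
plan = repeated_cv_plan(samples, folds=3, repeats=2, feature_selection_fold=True)
print("\n".join(plan[:6]))  # the six lines describing the first split
```

Because each group is the test fold of exactly one split per repeat, every sample is tested once per repeat and never appears in more than one fold of the same split.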

This mode is implemented by the class org.bdval.DefineSplitsMode.

Mode Parameters

The following options are available in this mode:

 Flag                      Arguments               Required  Description
 --pathway-components-dir  pathway-components-dir  no        Directory where pathway components will be stored. (default: pathway-components)
 (-f | --folds)            folds                   yes       Number of cross-validation folds.
 --cv-repeats              cv-repeats              no        Number of cross-validation repeats. The default performs one round of cross-validation. Values larger than one cause the cross-validation to be repeated and the results averaged over the rounds. (default: 1)
 --stratification          stratification          no        When true, each random fold is constrained to contain the same proportion of positive samples as the whole input set (up to integer rounding). (default: true)
 --feature-selection-fold  feature-selection-fold  no        When true, one fold is labeled for feature selection (split-type=feature-selection) and excluded from the training split. (default: false)
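
The constraint described for --stratification can be illustrated with a short Python sketch. This is not the BDVAL implementation; the function name stratified_folds and the round-robin assignment are assumptions chosen to show one simple way of keeping class proportions equal across folds, up to integer rounding.

```python
import random
from collections import defaultdict


def stratified_folds(labels, folds, seed=0):
    """Assign samples to folds so each fold keeps the overall class proportions.

    `labels` maps sample ID to class label. Illustrative sketch only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)
    result = [[] for _ in range(folds)]
    next_fold = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin; continuing the counter across classes
        # keeps the overall fold sizes balanced as well.
        for sample_id in members:
            result[next_fold % folds].append(sample_id)
            next_fold += 1
    return result


# 12 samples, 4 of them positive (label 1), split into 4 folds.
labels = {f"sample{i}": (1 if i <= 4 else 0) for i in range(1, 13)}
for fold in stratified_folds(labels, folds=4):
    positives = sum(labels[s] for s in fold)
    print(len(fold), positives)  # each fold: 3 samples, 1 positive
```

Within each class, the number of samples per fold differs by at most one, which corresponds to the rounding tolerance mentioned in the option description.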