BDVAL/Define-splits
This mode is used to partition a training set into various splits for training and testing. A typical split design is cross-validation, but other splitting strategies are possible. This tool generates a file which precisely describes how the samples in the whole training set should be distributed into splits.
The generated file consists of lines of the form:
repeat# split# fold-type sample-id numeric-class-label sample-index
where:
- Repeat# (integer) identifies a (random) repetition of the split strategy.
- Split# (integer) uniquely identifies a split.
- Fold-type (string) identifies the purpose of the fold in a given split. Samples with fold-type=training should be used to train the model, whereas samples with fold-type=test should be used to test it.
- Sample-id (string) identifies a sample that belongs to the split/fold described on that line.
- Numeric-class-label and sample-index, the last two columns, are optional.
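As an illustration of the column layout described above, here is a minimal Python sketch that parses one line of a split plan. It is not part of BDVAL: the names `SplitPlanEntry` and `parse_split_plan_line` are hypothetical, and the sketch assumes the columns are separated by whitespace and that the optional class label is numeric.

```python
from typing import NamedTuple, Optional


class SplitPlanEntry(NamedTuple):
    """One line of a split plan (hypothetical helper type, not a BDVAL class)."""
    repeat: int
    split: int
    fold_type: str
    sample_id: str
    class_label: Optional[float] = None   # optional fifth column
    sample_index: Optional[int] = None    # optional sixth column


def parse_split_plan_line(line: str) -> SplitPlanEntry:
    # Assumption: columns are whitespace-separated.
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns, got {len(fields)}: {line!r}")
    repeat, split, fold_type, sample_id = fields[:4]
    # The last two columns may be absent.
    class_label = float(fields[4]) if len(fields) > 4 else None
    sample_index = int(fields[5]) if len(fields) > 5 else None
    return SplitPlanEntry(int(repeat), int(split), fold_type, sample_id,
                          class_label, sample_index)
```

For example, `parse_split_plan_line("1 2 test sample2")` yields an entry with `repeat=1`, `split=2`, `fold_type="test"` and `sample_id="sample2"`, with both optional fields left as `None`.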
The following encodes a leave-one-out split strategy with three samples (three splits, each with one training fold and one test fold, for six folds in total):
    1 1 training sample2
    1 1 training sample3
    1 1 test sample1
    1 2 training sample1
    1 2 training sample3
    1 2 test sample2
    1 3 training sample1
    1 3 training sample2
    1 3 test sample3
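The leave-one-out plan above can be produced with a few lines of Python. This is a sketch for illustration only, not the DefineSplitsMode implementation; the function name `leave_one_out_plan` is hypothetical, and only the four required columns are written, separated by single spaces.

```python
def leave_one_out_plan(sample_ids, repeat=1):
    """Return split-plan lines in which each sample is held out once for testing."""
    lines = []
    for split, held_out in enumerate(sample_ids, start=1):
        # Every sample except the held-out one goes to the training fold.
        for sample_id in sample_ids:
            if sample_id != held_out:
                lines.append(f"{repeat} {split} training {sample_id}")
        # The held-out sample forms the test fold of this split.
        lines.append(f"{repeat} {split} test {held_out}")
    return lines


if __name__ == "__main__":
    print("\n".join(leave_one_out_plan(["sample1", "sample2", "sample3"])))
```

Called with `["sample1", "sample2", "sample3"]`, the function returns the nine lines of the example, in the same order.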
This encoding makes it possible to devise strategies that define several partitions of the input samples. For instance, it is possible to define feature-selection, training and test fold-types, in the context of cross-validation with a number of random repeats. The split plan can also be generated independently of DefineSplitsMode and given to execute-splits.
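To make the idea of several fold-types concrete, the sketch below writes a plan with feature-selection, training and test folds over a number of random repeats. It is a hypothetical illustration, not BDVAL code: the function name, the choice of the fold that follows the test fold as the feature-selection fold, and the numbering of splits continuously across repeats are all assumptions made for this example.

```python
import random


def cv_plan_with_feature_selection(sample_ids, folds, repeats=1, seed=0):
    """Return split-plan lines with test, feature-selection and training folds."""
    if folds < 3:
        raise ValueError("at least 3 folds are needed for three fold-types")
    rng = random.Random(seed)
    lines = []
    split = 0  # assumption: split numbers keep increasing across repeats
    for repeat in range(1, repeats + 1):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)  # a fresh random partition for each repeat
        # Deal the shuffled samples round-robin into `folds` groups.
        groups = [shuffled[i::folds] for i in range(folds)]
        for test_fold in range(folds):
            split += 1
            # Assumption: the group after the test group is used for feature selection.
            fs_fold = (test_fold + 1) % folds
            for fold_index, group in enumerate(groups):
                if fold_index == test_fold:
                    fold_type = "test"
                elif fold_index == fs_fold:
                    fold_type = "feature-selection"
                else:
                    fold_type = "training"
                for sample_id in group:
                    lines.append(f"{repeat} {split} {fold_type} {sample_id}")
    return lines
```

Within each split, every sample appears exactly once, so the three fold-types partition the input samples.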
It is implemented by org.bdval.DefineSplitsMode.
Mode Parameters
The following options are available in this mode:
Flag | Arguments | Required | Description |
---|---|---|---|
--pathway-components-dir | pathway-components-dir | no | Directory where pathway components will be stored. (default: pathway-components) |
-f, --folds | folds | yes | Number of cross validation folds. |
--cv-repeats | cv-repeats | no | Number of cross validation repeats. The default of 1 does one round of cross-validation. Values larger than one cause the cross validation to be repeated and results averaged over the rounds. (default: 1) |
--stratification | stratification | no | When true, each random fold is constrained to contain the same proportion of positive samples as the whole input set (modulo integer rounding errors). Default is true. |
--feature-selection-fold | feature-selection-fold | no | When true, one fold is labeled for feature selection (fold-type=feature-selection) and excluded from the training split. Default is false. |
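The effect of the --stratification option can be illustrated with a small Python sketch. This is one possible way to build stratified folds and is not taken from the BDVAL source; the function name `stratified_folds` and its arguments are hypothetical. Samples are shuffled within each class and then dealt round-robin into folds, so each fold receives about the same proportion of each class as the whole input set, up to integer rounding.

```python
import random
from collections import defaultdict


def stratified_folds(labels, folds, seed=0):
    """Map each sample id to a 0-based fold index, keeping class proportions.

    `labels` maps sample id -> class label (for example 0 or 1).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)

    assignment = {}
    next_slot = 0  # carried across classes so that fold sizes stay balanced
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random order within the class
        for sample_id in members:
            assignment[sample_id] = next_slot % folds
            next_slot += 1
    return assignment
```

With 6 positive and 12 negative samples and `folds=3`, each fold ends up with 2 positives and 4 negatives, matching the 1:2 ratio of the whole set.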