BDVAL/Ga-wrapper

From Icbwiki


This mode discovers sets of features that maximize a given performance measure, using a genetic algorithm for the optimization. Classification is performed with a support vector machine (linear or RBF kernel). Starting with the entire set of N features presented as input, the genetic algorithm optimizes the 10-fold cross-validation (CV10) performance of a set of N*ratio features. A typical choice for ratio is 0.5, which keeps 50% of the features at each iteration.

Several parameters of the optimization affect both the computational resources required and how close the solution found is to the optimal solution of the optimization problem. Larger values of population size and number of iterations (see the mode parameters) favor optimal solutions, but increase computation time. As usual with optimization algorithms, there is no guarantee that the optimal solution will be found. In the case of biomarker discovery this is probably acceptable, since the fitness function (cross-validation F-1 on a finite training set) is itself an imperfect measure.
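
As a rough illustration of how the ratio shrinks the feature set, the sketch below lists the feature-set sizes visited from N features down to a requested number of features. The function name `feature_count_schedule`, the truncation to an integer, and the rule of never going below the target are assumptions made for illustration; they are not taken from the BDVAL source.

```python
def feature_count_schedule(n_features, ratio=0.5, target=50):
    """Return the feature-set sizes visited when shrinking by `ratio` until `target`.

    Assumption: each iteration keeps int(current * ratio) features, but never
    fewer than `target`. BDVAL's actual rounding and stopping rules may differ.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    if target < 1:
        raise ValueError("target must be at least 1")
    sizes = [n_features]
    current = n_features
    while current > target:
        current = max(target, int(current * ratio))
        sizes.append(current)
    return sizes


print(feature_count_schedule(1000, ratio=0.5, target=50))
# prints [1000, 500, 250, 125, 62, 50]
```

Under these assumptions, a ratio of 0.5 takes 1000 input features to 50 in five iterations; a ratio closer to 1 shrinks the set more gently at the cost of more iterations.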

This method performs aggressive feature selection that optimizes cross-validation performance. Additionally, it is capable of optimizing any performance measure for any classifier type. Unfortunately, methods using genetic algorithms tend to scale poorly with the number of features and the training set size.
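
The wrapper idea described above can be sketched in a few lines: each chromosome is a fixed-size subset of feature indices, and the fitness function is whatever performance measure should be maximized. This is a minimal sketch, not BDVAL's implementation: the function `ga_select_features`, the crossover and mutation operators, the elitism rule, and the toy fitness function are all assumptions chosen for illustration. In real use, the fitness function would train a classifier on the chosen features and return its cross-validated performance.

```python
import random


def ga_select_features(n_features, subset_size, fitness,
                       population_size=10, steps=100,
                       mutation_rate=0.1, seed=0):
    """Search for `subset_size` feature indices that maximize `fitness`."""
    if population_size < 2:
        raise ValueError("population_size must be at least 2")
    rng = random.Random(seed)
    all_features = range(n_features)

    def random_subset():
        return frozenset(rng.sample(all_features, subset_size))

    def crossover(a, b):
        # The child draws its features from the union of both parents.
        return frozenset(rng.sample(sorted(a | b), subset_size))

    def mutate(subset):
        # Occasionally swap one selected feature for an unselected one.
        if rng.random() < mutation_rate and subset_size < n_features:
            removed = rng.choice(sorted(subset))
            added = rng.choice([f for f in all_features if f not in subset])
            subset = (subset - {removed}) | {added}
        return subset

    population = [random_subset() for _ in range(population_size)]
    for _ in range(steps):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, population_size // 2)]
        # Elitism: the best subset found so far always survives unchanged.
        children = [ranked[0]]
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = children
    return max(population, key=fitness)


# Toy fitness (hypothetical): count how many "informative" features are chosen.
INFORMATIVE = frozenset({0, 1, 2, 3, 4})


def toy_fitness(subset):
    return len(subset & INFORMATIVE)


best = ga_select_features(20, 5, toy_fitness, population_size=10, steps=100)
print(sorted(best), toy_fitness(best))
```

The fitness function is evaluated for every chromosome at every step, which is why larger populations, more steps, and more expensive cross-validation all increase the running time.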

Genetic Algorithm

This mode is implemented by org.bdval.DiscoverWithGeneticAlgorithm.

Mode Parameters

The following options are available in this mode:

| Flag | Arguments | Required | Description |
|---|---|---|---|
| -r, --ratio | ratio | no | The ratio of the new number of features to the original number of features, for each iteration. (default: 0.5) |
| -n, --number-of-steps | number-of-steps | no | The number of genetic algorithm evolution steps. Larger values increase the chance that the optimal solution will be found, but increase computation time. (default: 100) |
| -s, --population-size | population-size | no | Number of chromosomes for genetic algorithm optimization. The larger the population size, the more diversity can be represented in the population, and the more effective cross-over will be at combining successful solutions into better offspring. Larger population sizes are more computationally expensive, since the fitness function must be evaluated for each chromosome at each evolution step. (default: 10) |
| --discrete-parameters | discrete-parameters | no | A list of discrete classifier parameters to optimize at the same time as the feature set. Parameters must be described in the format param1=value1,value2,...[:param2=value1,value2,...]. For instance, alpha=1,2,3,4:beta=0.2,.5,.33 will optimize the parameters alpha and beta alongside the feature set. The combination of features and parameter values that optimizes CV performance will be kept. Optimal parameter values will be written to stderr, or to the file named by the argument --optimal-parameters-out. |
| -f, --folds | folds | no | Number of cross-validation folds. (default: 10, i.e. CV10) |
| --cv-repeats | cv-repeats | no | Number of cross-validation repeats. Values larger than one cause the cross-validation to be repeated and the results averaged over the rounds. (default: 1, i.e. one round of cross-validation) |
| --output-gene-list | n/a | no | Write features to the output in the tissueinfo gene list format. |
| --roc | n/a | no | Optimize the area under the ROC curve. If neither this option nor --maximize is provided, the F-1 measure (harmonic mean of precision and recall) is maximized. Otherwise, the parameter --maximize names the objective function. |
| --num-features | num-features | no | Number of features to select. (default: 50) |
| --optimal-parameters-out | optimal-parameters-out | no | Name of the file where optimal parameters will be written (as Java properties). |
| --maximize | maximize | no | Select the objective measure that the GA process will try to maximize. Valid measure names include auc, mat, acc. For a complete list of measure names, see the ROCR documentation. |
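
To make the --discrete-parameters format concrete, the sketch below parses a specification string into a mapping from parameter name to candidate values and counts the resulting value combinations. The helper names `parse_discrete_parameters` and `parameter_combinations` are hypothetical, and treating every value as a floating-point number is an assumption; BDVAL's own parser may interpret values differently.

```python
from itertools import product


def parse_discrete_parameters(spec):
    """Parse 'p1=v1,v2:p2=v3,v4' into {'p1': [v1, v2], 'p2': [v3, v4]}.

    Assumption: all values are numeric and are converted to float.
    """
    grid = {}
    for group in spec.split(":"):
        name, sep, values = group.partition("=")
        if not sep or not name.strip() or not values.strip():
            raise ValueError(f"malformed parameter group: {group!r}")
        grid[name.strip()] = [float(v) for v in values.split(",")]
    return grid


def parameter_combinations(grid):
    """List every combination of candidate values, one dict per combination."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[n] for n in names))]


grid = parse_discrete_parameters("alpha=1,2,3,4:beta=0.2,.5,.33")
print(grid)
# prints {'alpha': [1.0, 2.0, 3.0, 4.0], 'beta': [0.2, 0.5, 0.33]}
print(len(parameter_combinations(grid)))
# prints 12
```

The count only shows the size of the parameter search space for the example from the table (4 values of alpha times 3 values of beta). According to the description above, the genetic algorithm optimizes these parameters together with the feature set rather than trying each combination separately.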