BDVAL/Example

An example of using BDVal with the publicly available GEO dataset GSE8402 is shown here. The example is referred to as the "prostate example" throughout the binary distribution.

Prerequisites

This example assumes the binary distribution has been downloaded and extracted to a local directory. As described in the installation and configuration sections, a Java runtime (1.6 or later), ant, and R with Rserve and ROCR should be installed. The Rserve process should also be running at this point.

You may wish to consult the options reference page for BDVal while walking through the examples.

Detailed Walkthrough

This section breaks the Complete Example down into distinct steps for the baseline sequence, to illustrate how each step in the process works and to give a better understanding of how to apply BDVal to your own datasets. For this dataset, the condition of interest is the "fusion" state, which is divided into three classes: "YES", "NO" and "UNKNOWN". Several endpoints can be defined from this dataset; for example, Training, Test and Validation sets are possible. We will use the endpoint name "GSE8402_FusionYesNo_TrainingSplit".

The commands shown in this section assume the current working directory is the base installation directory of the example (i.e., <install-dir>/bdval_20080619143434), which should be the location of the bdval.jar file. Directories are relative to the working directory unless explicitly specified otherwise.

Downloading the dataset

Download the dataset from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE8402/GSE8402_family.soft.gz and place the file in the directory data/bdval/GSE8402/norm-data. (Note: this file is also included in the distribution.)

Generate "cids" files

As mentioned previously, there are three classes for the fusion state in this dataset. We can extract this information directly from the data file. The original dataset file contains lines such as:

 ^SAMPLE = GSM208116
 !Sample_title = Fusion: Yes, Deletion: No (prostate_21055)

This indicates that sample GSM208116 belongs to the class "YES". A complete cids file is located at data/bdval/GSE8402/cids/GSE8402-FusionYesNo-TrainingSplit.cids of the distribution.
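The label extraction described above can be sketched in a few lines of Python. This is only an illustration: the function name `extract_fusion_labels` is hypothetical, the sketch assumes every sample title follows the "Fusion: <value>" wording of the line quoted above, and it returns sample/label pairs rather than writing the cids format, whose layout is not described on this page.

```python
import re

# Matches titles such as "Fusion: Yes, Deletion: No (prostate_21055)".
# Assumption: all sample titles use this "Fusion: <value>" wording.
FUSION_RE = re.compile(r"Fusion:\s*(\w+)", re.IGNORECASE)


def extract_fusion_labels(lines):
    """Return {sample_id: LABEL} from an iterable of SOFT-format lines."""
    labels = {}
    current_sample = None
    for line in lines:
        line = line.strip()
        if line.startswith("^SAMPLE"):
            # "^SAMPLE = GSM208116" -> "GSM208116"
            current_sample = line.split("=", 1)[1].strip()
        elif line.startswith("!Sample_title") and current_sample is not None:
            match = FUSION_RE.search(line)
            if match:
                labels[current_sample] = match.group(1).upper()
    return labels


example = [
    "^SAMPLE = GSM208116",
    "!Sample_title = Fusion: Yes, Deletion: No (prostate_21055)",
]
print(extract_fusion_labels(example))  # {'GSM208116': 'YES'}
```

To run this against the real file, the lines could be read with `gzip.open(path, "rt")` from the Python standard library. The resulting pairs would still need to be written in the layout of the cids file shipped with the distribution.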

Determine and download the platform file(s)

From the GSE8402 dataset file, we see that the platform used is GPL5474 because the original file contains the following line:

 !Sample_platform_id = GPL5474

Download the platform data from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_platform/GPL5474/GPL5474_family.soft.gz and place the file in the directory data/bdval/GSE8402/platforms. (Note: this file is also included in the distribution.)
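For datasets where the platform is not known in advance, the lookup described in this section could be automated roughly as follows. This is a sketch only: `platform_ids` is a hypothetical helper name, and it assumes the platform is always announced on lines beginning with `!Sample_platform_id`, as in the line quoted above.

```python
def platform_ids(lines):
    """Return the sorted, de-duplicated platform ids named in SOFT-format lines."""
    ids = set()
    for line in lines:
        if line.startswith("!Sample_platform_id"):
            # "!Sample_platform_id = GPL5474" -> "GPL5474"
            ids.add(line.split("=", 1)[1].strip())
    return sorted(ids)


example = [
    "^SAMPLE = GSM208116",
    "!Sample_platform_id = GPL5474",
    "!Sample_platform_id = GPL5474",
]
print(platform_ids(example))  # ['GPL5474']
```

A dataset that reports more than one id would need one platform file per id in the platforms directory.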

Define parameters for the runs

Number of cross validation folds

You will need to decide how many folds are appropriate for the dataset you are interested in. For this example we use 5.
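One way to judge whether a fold count is reasonable is to look at how many samples of each class would land in each fold. The sketch below does this for the class sizes of this example (196 "NO" and 39 "YES" samples, taken from the task file). It assumes samples are distributed evenly per class; how BDVal actually assigns samples to folds is not described on this page, and `fold_sizes` is a hypothetical helper.

```python
def fold_sizes(n_samples, n_folds):
    """Split n_samples into n_folds groups whose sizes differ by at most one."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1] * extra + [base] * (n_folds - extra)


# Class sizes taken from the task file of this example.
for label, count in (("NO", 196), ("YES", 39)):
    print(label, fold_sizes(count, 5))
# NO [40, 39, 39, 39, 39]
# YES [8, 8, 8, 8, 7]
```

With 5 folds, each held-out fold would contain only 7 or 8 samples of the smaller class, which is worth keeping in mind before choosing a larger number of folds.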

Tasks for the dataset

For the training split, the task file data/bdval/GSE8402/tasks/GSE8402-FusionYesNo-TrainingSplit.tasks contains the following line:

 GSE8402-FusionYesNo-TrainingSplit       NO      YES     196     39

This indicates that the dataset GSE8402 has 196 samples in the "NO" class and 39 samples in the "YES" class.
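As a sanity check, a task line like the one above can be parsed and its counts compared with the number of labelled samples. The sketch below assumes the five whitespace-separated fields seen in this single example (task name, two class names, two class sizes); real task files may contain other columns, and the names `Task` and `parse_task_line` are hypothetical.

```python
from collections import namedtuple

# Field layout inferred from the one example line on this page.
Task = namedtuple("Task", "name first_class second_class first_count second_count")


def parse_task_line(line):
    """Parse one whitespace-separated task line into a Task tuple."""
    name, first_class, second_class, first_count, second_count = line.split()
    return Task(name, first_class, second_class, int(first_count), int(second_count))


task = parse_task_line(
    "GSE8402-FusionYesNo-TrainingSplit       NO      YES     196     39"
)
print(task.first_class, task.first_count)      # NO 196
print(task.first_count + task.second_count)    # 235
```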

Split plans

At this point there is enough information to define split plans using the Define Splits mode of BDVal as follows:

 java -jar bdval.jar --mode define-splits --input data/bdval/GSE8402/norm-data/GSE8402_family.soft.gz
 --conditions data/bdval/GSE8402/cids/GSE8402-FusionYesNo-TrainingSplit.cids --folds 5
 --platform-filenames data/bdval/GSE8402/platforms/GPL5474_family.soft.gz
 --task-list data/bdval/GSE8402/tasks/GSE8402-FusionYesNo-TrainingSplit.tasks
 --output data/bdval/GSE8402/GSE8402_FusionYesNo_TrainingSplit-split-plan-fs=true-CV-5-R-1.txt

This creates the split plan file at the path given by the --output parameter.
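Because the command is long and is shown wrapped over several lines, it can be convenient to assemble it as an argument list in a small script. The sketch below only rearranges the options and paths from the command above; the function name `define_splits_command` is hypothetical, and nothing here is part of BDVal itself.

```python
def define_splits_command():
    """Return the define-splits invocation shown above as an argument list."""
    base = "data/bdval/GSE8402"
    return [
        "java", "-jar", "bdval.jar",
        "--mode", "define-splits",
        "--input", base + "/norm-data/GSE8402_family.soft.gz",
        "--conditions", base + "/cids/GSE8402-FusionYesNo-TrainingSplit.cids",
        "--folds", "5",
        "--platform-filenames", base + "/platforms/GPL5474_family.soft.gz",
        "--task-list", base + "/tasks/GSE8402-FusionYesNo-TrainingSplit.tasks",
        "--output",
        base + "/GSE8402_FusionYesNo_TrainingSplit-split-plan-fs=true-CV-5-R-1.txt",
    ]


# Print the command on a single line for copying into a shell.
print(" ".join(define_splits_command()))
```

The list could also be passed to `subprocess.run(..., check=True)` from the base installation directory, so that the relative paths resolve as described earlier.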

Execute the plans

Now that the split plans have been created, we can begin the biomarker discovery and model creation process by executing the following command:

 java -jar bdval.jar --mode execute-splits --input data/bdval/GSE8402/norm-data/GSE8402_family.soft.gz
 --conditions data/bdval/GSE8402/cids/GSE8402-FusionYesNo-TrainingSplit.cids
 --platform-filenames data/bdval/GSE8402/platforms/GPL5474_family.soft.gz
 --task-list data/bdval/GSE8402/tasks/GSE8402-FusionYesNo-TrainingSplit.tasks
 --splits data/bdval/GSE8402/GSE8402_FusionYesNo_TrainingSplit-split-plan-fs=true-CV-5-R-1.txt
 --sequence-file data/sequences/baseline.sequence --num-features 50

This uses the "baseline.sequence" file that is part of the binary distribution. This sequence uses svm-weights mode to discover markers, write-model mode to train and create models, and predict mode to evaluate the model predictions.

Complete Example

A complete example using this dataset is included in the binary distribution, along with all the data it requires. Running it requires ant and R, as described in the software requirements section of the distribution requirements.

Generating the evaluation models

The entire process of generating the models can be run by executing:

 ant -f prostate-example.xml

from the <install-dir>/bdval_20080619143434/data directory, where <install-dir> is the directory where the binary distribution was extracted. Note that the directory name "bdval_20080619143434" is based on the release number of the distribution and may differ from the one shown here.

Producing final models

The class org.bdval.GenerateFinalModels is used to identify features by consensus across the cross-validation splits and to train the final models.
