RUtils

About RUtils

RUtils is a collection of utilities that assist in connecting to and using R from Java. It builds on Rserve and adds connection pooling to ease the configuration and use of multiple Rserve processes running on different hosts and in different configurations.

Software Requirements

Implementation Details

Getting the library

The RUtils library is available as precompiled jar files and also in source code form. Two jar files are provided:

  • One includes all the external classes needed to run (i.e., Rserve Java API, commons, etc.).
  • The other includes only the project classes and requires external jar files when used in your own projects.

Building

Note that this section is meant only for those with the source distribution or Subversion access. Users of the binary distribution should skip this section.

Compiling and packaging

The target used to build the RUtils package is called "jar". Executing ant jar will produce the jar files described earlier.

Running JUnit Tests

The RUtils library is built using ant and a build.xml file located in the <install-dir>. The default target will compile the source and run the JUnit tests.

Subversion Access

This project's Subversion repository can be checked out with the following command:

 svn co https://pbtech-vc.med.cornell.edu/public/svn/icb/trunk/icb-commons/RUtils

Browse the RUtils package in the Subversion repository.

NOTE: Use username "guest" and your email address at the login prompt if you do not have an account with the ICB.

Documentation

The RUtils Javadoc API is available here.

Using the RUtils package

Configuring Rserve Instances in the Connection Pool

Rserve processes available to the pool are configured using a fairly simple XML file. The root element is called RConnectionPool and the server nodes are called RServer; in the example below they are nested inside an RConfiguration element. There should be one RServer node per Rserve process you wish to make available in the pool. Each RServer node has the following attributes:

host
The host/ip Rserve is running on (required)
port
The TCP port Rserve is listening on (default = 6311)
username
Username to supply for the connection
password
Password to supply for the connection
command
The full path to start Rserve on the host
embedded
If true, the connection pool will attempt to manage the Rserve processes by starting the servers on pool initialization and terminating the servers on JVM shutdown. There is no need to start or stop the Rserve process manually as described below when running in embedded mode - the server will start as a daemon thread from the main Java process of your application.

The following configuration would make three servers available to the pool.

    <RConnectionPool>
       <RConfiguration>
          <RServer host="localhost"/>
          <RServer host="127.0.0.1" port="6312"/>
          <RServer host="foobar.med.cornell.edu" port="1234" username="me" password="mypassword"/>
       </RConfiguration>
    </RConnectionPool>
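
As a rough illustration of how this file is structured, the following sketch reads the RServer entries with the JDK's DOM parser and applies the default port of 6311 when no port attribute is given. The class name PoolConfigReader and its method are hypothetical and are not part of RUtils; the library's own parser may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, NOT part of RUtils: lists "host:port" for each RServer node.
final class PoolConfigReader {
    // Default Rserve port, as stated in the attribute list above.
    static final int DEFAULT_PORT = 6311;

    static List<String> listServers(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("RServer");
            List<String> servers = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element server = (Element) nodes.item(i);
                String host = server.getAttribute("host");
                // getAttribute returns an empty string when the attribute is absent.
                String port = server.getAttribute("port");
                int resolvedPort = port.isEmpty() ? DEFAULT_PORT : Integer.parseInt(port);
                servers.add(host + ":" + resolvedPort);
            }
            return servers;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse pool configuration", e);
        }
    }
}
```

Passing the three-server configuration above to listServers would yield one entry per RServer node, with the first entry falling back to the default port.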

The preferred way to specify the configuration file is through a system property called RConnectionPool.configuration. The property should be a valid URL or the name of a resource file that exists on the classpath. If the system property RConnectionPool.configuration is not defined, the resource defaults to RConnectionPool.xml.
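
The lookup described above can be sketched with plain JDK calls. The class PoolConfigLocation is hypothetical and only mirrors the documented behavior (property name and default value); it is not the code RUtils uses internally.

```java
// Hypothetical helper, NOT part of RUtils: mirrors the documented lookup rule.
final class PoolConfigLocation {
    // System property name and default resource, both taken from the text above.
    static final String PROPERTY = "RConnectionPool.configuration";
    static final String DEFAULT_RESOURCE = "RConnectionPool.xml";

    // Returns the configured location, or the default when the property is unset.
    static String resolve() {
        return System.getProperty(PROPERTY, DEFAULT_RESOURCE);
    }
}
```

The property can be supplied on the command line with the standard JVM option -DRConnectionPool.configuration=<url or resource name>, or set with System.setProperty. Setting it before the pool is first used is the safe choice, on the assumption that the pool reads the property when it is initialized.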

Rserve

Installing

Base "Official" Distribution

Assuming R is installed, run the following from the R command line:

 install.packages('Rserve',,'http://www.rforge.net/')

Note that if you do not specify the rforge URL, you will likely get an older version of the Rserve package. This seems to work also, but the latest one from rforge should match the Java libraries used by the connection pool. Also, R 2.6+ is recommended.

Source Distribution

Some changes have been made to the Rserve code to work around issues with the base distribution, so until these are resolved in the official distribution, it may be necessary to install Rserve directly from the source code. The best way to do this is to get the Rserve source code from the local ICB Subversion repository at https://pbtech-vc.med.cornell.edu/public/svn/icb/3rdparty/Rserve. There is a script file called mkdist in the root directory of the distribution. Running this script will create a gzipped tar file called Rserve_X.Y-Z.tar.gz, which is the file to be installed into the R library. To install this file, run the following from the Unix command line:

 R CMD INSTALL Rserve_X.Y-Z.tar.gz

This should overwrite any previous versions. If you would like to install the test version of Rserve in parallel with the existing version, you can specify a different local library location using the --library option of R CMD INSTALL.

Starting

The ant build script in the RUtils project contains a startup task. The startup command uses ssh to start remote processes if applicable.

Via R Command Line
 library("Rserve")
 Rserve()
From a configuration file
 java -jar icb-rutils.jar --startup --configuration <filename>

Stopping

The ant build script in the RUtils project contains a shutdown task. Unfortunately, the shutdown task does not use ssh and may not work if ports are not open in the firewall.

A single instance
 java -jar icb-rutils.jar --shutdown [--host <hostname>] [--port <port>]
From a configuration file
 java -jar icb-rutils.jar --shutdown --configuration <filename>
Kill

The good old fashioned "kill" or "kill -9" still works well.

Validating R Configuration

The ant build script in the RUtils project contains a target that will display the status of Rserve instances configured in a file as follows.

A single instance
 java -jar icb-rutils.jar --validate [--host <hostname>] [--port <port>]
From a configuration file
 java -jar icb-rutils.jar --validate --configuration <filename>
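
For a quick check from your own code, the sketch below tests whether anything is accepting TCP connections on a given host and port. It is not what the --validate option does internally, and the class name RservePortCheck is hypothetical. A successful connection only shows that some process is listening on the port, not that it is Rserve.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, NOT part of RUtils: a plain TCP reachability check.
final class RservePortCheck {
    // Default Rserve port, as documented for the pool configuration.
    static final int DEFAULT_PORT = 6311;

    // Returns true if a TCP connection can be opened within the timeout.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolved host: treat as not reachable.
            return false;
        }
    }
}
```

For example, RservePortCheck.isReachable("localhost", RservePortCheck.DEFAULT_PORT, 2000) reports whether a local process is listening on the default port.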

Using Connections from the Pool

The connection pool is implemented using the Singleton pattern. Instances of the pool are not created by calling the constructor, but are retrieved using the static method getInstance(). Connections are retrieved from the pool using either borrowConnection() or borrowConnection(long, java.util.concurrent.TimeUnit). Both versions return a valid RConnection immediately if one is available. If no connection is available, the borrow method with no parameters blocks until one becomes available, while the timed form waits until the timeout expires and then returns null. Connections borrowed from the pool are NOT closed upon return: if a connection is still open when returned, it will be reused by the next borrow. Returning a closed connection is also fine; a new connection will be established on the next borrow.

When your Java process is complete, the pool can be shut down in order to disconnect any connections in the cleanest way possible. This is done by calling the shutdown method on the pool. A JVM shutdown hook will do this if the application does not explicitly call shutdown.
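
The difference between the two borrow methods can be illustrated with a toy stand-in built on a JDK BlockingQueue. This is not the RUtils pool and makes no claim about how RUtils is implemented; the class ToyConnectionPool and the method name returnConnection are hypothetical and exist only to make the sketch runnable without Rserve.

```java
import java.util.Collection;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy stand-in that mimics the borrow semantics described above.
// It is NOT the RUtils pool; returnConnection is a hypothetical name.
final class ToyConnectionPool<T> {
    private final BlockingQueue<T> idle = new LinkedBlockingQueue<>();

    ToyConnectionPool(Collection<T> connections) {
        idle.addAll(connections);
    }

    // Blocks until a connection becomes available.
    T borrowConnection() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a connection", e);
        }
    }

    // Waits up to the timeout, then gives up and returns null.
    T borrowConnection(long timeout, TimeUnit unit) {
        try {
            return idle.poll(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // Hands the connection back so the next borrow can reuse it.
    void returnConnection(T connection) {
        idle.add(connection);
    }
}
```

The practical consequence for callers is that the result of the timed borrow must be checked for null before use, and that a borrowed connection is best returned in a finally block so that it goes back to the pool even when the R call fails.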

Using RScript to call R Scripts from a Java Program

Once you have Rserve running and your RConnectionPool.xml file configured, you are ready to call R from Java. The easiest way to do this is using the RScript class, which is part of icb-rutils.jar and icb-rutils-api.jar. RScript provides a simple way to

  1. Specify an R script to execute
  2. Specify input and output values for the script
  3. Execute the script and retrieve output values from R back into Java

1. Specify an R script to execute

RScript provides two factory methods for specifying the script to run. Both methods return an RScript object.

  • RScript.createFromResource(path to resource) loads an R script file within the classpath.
  • RScript.createFromScriptString(script string) allows you to supply an R script from a Java String.
  final String ksTest =
     "q <- ks.test(x,y)" + "\n"
     + "p_value <- q$p.value" + "\n"
     + "test_statistic <- q$statistic[[1]]";
  final RScript rscript = RScript.createFromScriptString(ksTest);

Here we have defined an R script to perform a Kolmogorov-Smirnov (KS) test. The KS test will be performed on the values "x" and "y" and then the script defines the output of two variables, "p_value" and "test_statistic". What appears to be missing is the definition of the values for "x" and "y". These values will be provided from Java. It is important that you don't define these input values in your R script as those values would override the values coming from Java.
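
To give a feel for what the script stores in "test_statistic", the following sketch computes the two-sample KS statistic in plain Java: the largest absolute difference between the empirical distribution functions of the two samples. It is an independent illustration using a hypothetical class name, not part of RUtils and not a replacement for calling R; it does not compute the p-value.

```java
import java.util.Arrays;

// Hypothetical helper, NOT part of RUtils: two-sample KS statistic D.
final class KsStatistic {
    // Returns the maximum absolute difference between the two empirical CDFs.
    static double twoSample(double[] x, double[] y) {
        double[] xs = x.clone();
        double[] ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < xs.length && j < ys.length) {
            // Advance both samples past the next smallest value (handles ties).
            double v = Math.min(xs[i], ys[j]);
            while (i < xs.length && xs[i] <= v) {
                i++;
            }
            while (j < ys.length && ys[j] <= v) {
                j++;
            }
            double diff = Math.abs((double) i / xs.length - (double) j / ys.length);
            d = Math.max(d, diff);
        }
        return d;
    }
}
```

For two samples that do not overlap at all, such as {0.1, 0.2, 0.3, 0.4, 0.5} and {0.6, 0.7, 0.8, 0.9, 1.0}, the statistic reaches its maximum value of 1.0.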

2. Specify input values from Java for the script

We need to provide values for the "x" and "y" variables from Java and tell the RScript object that we want the output values stored in the R variables named "p_value" and "test_statistic" once the script has executed. To do this, we add the following code:

  final double[] xValues = new double[] {0.1, 0.2, 0.3, 0.4, 0.5};
  final double[] yValues = new double[] {0.6, 0.7, 0.8, 0.9, 1.0};
  // Specify the input variable names and values for the script.
  rscript.setInput("x", xValues);
  rscript.setInput("y", yValues);
  // Specify the variable names and types for the script output.
  // Outputs should be specified before we execute the script.
  rscript.setOutput("p_value", RDataObjectType.Double);
  rscript.setOutput("test_statistic", RDataObjectType.Double);

3. Execute the script and retrieve output values from R back into Java

All that is left now is to execute the script and retrieve the output values from R back into Java.

  rscript.execute();
  final double pvalue = rscript.getOutputDouble("p_value");
  final double testStat = rscript.getOutputDouble("test_statistic");

In the above example, we used double and double[], but RScript supports bi-directional use of other variable types. The complete list is defined by the enum RDataObjectType and consists of

  • RDataObjectType.String, represented in both Java and R as a String.
    Retrieve data using rscript.getOutputString(R script variable name).
  • RDataObjectType.StringArray, in Java this is a String[] and in R this is a c(string values).
    Retrieve data using rscript.getOutputStringArray(R script variable name).
  • RDataObjectType.Double, represented in both Java and R as a double.
    Retrieve data using rscript.getOutputDouble(R script variable name).
  • RDataObjectType.DoubleArray, in Java this is a double[], in R this is a c(double values).
    Retrieve data using rscript.getOutputDoubleArray(R script variable name).
  • RDataObjectType.Double2DArray, in Java this is a double[][], in R this is a matrix.
    Retrieve data using rscript.getOutputDouble2DArray(R script variable name).

Miscellaneous Notes

General

  • Rserve processes need to be running when the connection pool is started. If the connection cannot be made at startup time, the connection is excluded from the available servers in the pool.
  • Rserve processes sometimes crash when given invalid data, so watch out for this.
  • The various different OS and R configurations on our systems make paths confusing. Don't assume that paths are the same on every machine.
  • Firewalls are great, but for networked computing, they are a headache.

Windows Specific

  • It seems that multiple Rserve instances running on different ports on a single Windows machine interfere with each other. At this point, we recommend running only one Rserve process per Windows machine.

More Information

The R Project for Statistical Computing
Rserve - Binary R server
CruiseControl Test Results
