Friday, June 15, 2012

Integrating R with Ruby

We are close to integrating our first set of survival analysis models with our ETL process. The latter is a set of scripts written in Ruby, using the Sequel gem, that take data from our operational PostgreSQL database, transform it, and load the results into a set of fact and dimension tables in the data warehouse, which is also in PostgreSQL. I spent some time exploring ways to deploy predictive models within applications. Opinions seemed to be broadly divided into two schools of thought:

  1. Some people suggest using R only for building and experimenting with the model offline, and then re-implementing the predictive algorithms in C++, Java, or whatever language the main application is written in.
  2. The other school is in favor of maintaining the model in R and integrating it with the app through some sort of "bridge". Some of the most popular "bridges" seemed to be rApache, Rserve, PMML and RSRuby.

There are, of course, pros and cons to either approach. With the first, you can probably expect more control over the algorithms, since you develop them yourself. On the other hand, it takes more time and resources, and it needs a lot of testing to become as reliable as the standard R libraries, which have been around for a while. This works in favor of the second choice: the R libraries have a pretty big community supporting them and have been tested thoroughly over the years. The problem with the second choice, however, is that not all of these bridges have a big enough community behind them.

After some discussion among ourselves, we opted for RSRuby. It lets you call the standard R functions from Ruby once you create a "wrapper" around R, which is an instance of the class RSRuby. Here are some blog entries to get started on RSRuby, which I found pretty useful.

However, when we develop a Cox model, the coxph() function demands that the response variable be a survival object, as returned by the Surv() function. This entry shows how to use lm() in RSRuby; however, lm() needs only a simple vector, or a column of a data frame, as the response variable, which makes things easier. Although RSRuby has a built-in class for data frames, it has none for survival objects, and the documentation is scarce.

We therefore chose to create our own application-level package in R (using the concepts of generic functions and method dispatch) that takes the data as simple vectors or data frames, and returns the fitted survival function for a given child or a cohort group as a data frame. This way the survival object is built inside R and never has to cross the bridge. I found this tutorial by Friedrich Leisch on how to do this an excellent one. The approach works fine for our volume of data. Until RSRuby matures more, this seems like a nice workaround.
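
To make the Ruby side of this workaround a little more concrete, here is a minimal sketch, not our actual ETL code. It assumes the rows arrive as an array of hashes (the shape Sequel returns from a dataset) and turns them into plain column vectors, which is all the R-side package needs. The column names, the `to_columns` helper, the package name `ourpkg` and the R function `fit_survival` are all made up for illustration; only `RSRuby.instance` is the real entry point, and the RSRuby call is left in comments so the snippet runs without R installed.

```ruby
# Turn row hashes into plain column vectors (one Ruby array per column).
# Plain arrays are the simplest thing to hand across the bridge to R.
def to_columns(rows, keys)
  keys.each_with_object({}) do |key, columns|
    # fetch raises KeyError if a row is missing the column
    columns[key] = rows.map { |row| row.fetch(key) }
  end
end

# Hypothetical rows; in the ETL these would come from a Sequel dataset.
rows = [
  { time: 12.0, event: 1, age: 4.0 },
  { time: 30.0, event: 0, age: 7.0 },
  { time: 18.0, event: 1, age: 5.0 }
]

columns = to_columns(rows, [:time, :event, :age])

# With the rsruby gem and R available, the call into a hypothetical
# in-house R function might then look like this:
#
#   require 'rsruby'
#   r = RSRuby.instance
#   r.library('ourpkg')   # hypothetical package name
#   curve = r.fit_survival(columns[:time], columns[:event], columns[:age])
#
# Surv() and coxph() would be called inside fit_survival, on the R side.
```

The point of the design is that only vectors go in and a data frame comes out, so Ruby never needs a representation of the survival object.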