Showing posts with label Casebook. Show all posts

Tuesday, December 4, 2012

Survival Analysis, Part II

I continued my work on survival analysis, this time on real Casebook data, as Casebook went live on July 5. In child welfare, finding the length of stay by entry cohort has always been hard: caseworkers had to wait until all or most of the children who entered care in a given period had exited care before they could get an idea of the aggregate length of stay. To give caseworkers an idea of how long the kids currently in care are going to stay, we treated exits from care as "events" in survival data and used Cox regression to predict the length of stay. We used a number of covariates: entry cohort (2008-2012), age at the time of entering care, gender, race, parents' alcohol and drug abuse problems, the child's behavioral problems, the child's disability, history of domestic violence, family structure, and so on.
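The way covariates enter a Cox model is easy to sketch: they scale a shared baseline hazard multiplicatively, h(t | x) = h0(t) * exp(beta . x). Here is a minimal plain-Ruby illustration of that relative-hazard form; the coefficient values and covariate names are made up for the sketch, not estimates from Casebook data.

```ruby
# Cox proportional hazards: h(t | x) = h0(t) * exp(beta . x).
# All coefficient values below are invented for illustration only.
BETA = {
  age_at_entry:             -0.04,
  parental_substance_abuse:  0.35,
  domestic_violence:         0.20
}

# Hazard of exiting care relative to a child with all covariates at zero.
def relative_hazard(x)
  Math.exp(BETA.sum { |name, b| b * x.fetch(name, 0) })
end

hr = relative_hazard(age_at_entry: 6, parental_substance_abuse: 1)
puts format("relative hazard: %.3f", hr)
```

A relative hazard above 1 means the child is predicted to exit care faster than the baseline, i.e., a shorter stay; below 1, a longer stay.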

Since a Cox model returns a survival function, and not a single value of length of stay, computing the residual for the model is not as simple as computing (expected - observed), even for a kid whose actual length of stay we know; it has to be done rather indirectly. There are various residuals that focus on different aspects of a Cox model, and I was mostly interested in one that gives the overall goodness of fit of the model with respect to the observations. The Cox-Snell residual is one such residual. I got fascinated by the underlying idea: it is based on the probability integral transform, which says that if X is a continuous random variable with cumulative distribution function F_X, then the random variable defined as Y = F_X(X) has a uniform distribution on [0, 1].
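The probability integral transform is easy to check empirically in plain Ruby (a pure simulation; nothing here touches Casebook data): draw standard normal variates via Box-Muller, push them through the normal CDF, and the results should look uniform on [0, 1].

```ruby
# Probability integral transform check: if X ~ N(0, 1), then
# Y = Phi(X) should be uniform on [0, 1].
srand(42)
n = 100_000

def std_normal_cdf(z)
  0.5 * (1.0 + Math.erf(z / Math.sqrt(2.0)))
end

y = Array.new(n) do
  # Box-Muller: turn two uniforms into one standard normal variate
  z = Math.sqrt(-2.0 * Math.log(1.0 - rand)) * Math.cos(2.0 * Math::PI * rand)
  std_normal_cdf(z)
end

mean = y.sum / n
var  = y.sum { |v| (v - mean)**2 } / n
puts format("mean %.3f (uniform: 0.500), variance %.4f (uniform: 0.0833)", mean, var)
```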

Now, for any continuous random variable X, the survival function is S(x) = Pr(X > x) = 1 - F_X(x). Since F_X(X) is uniform on [0, 1], and 1 - U is uniform whenever U is, S(X) is also uniform on [0, 1]. It follows that the cumulative hazard evaluated at the actual exit time, H(X) = -ln[S(X)], has an exponential distribution with rate 1, because Pr(-ln U > t) = Pr(U < e^-t) = e^-t for a uniform U.
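The same simulation idea verifies the Exp(1) claim: for standard normal draws X, the sample of H(X) = -ln S(X) = -ln(1 - Phi(X)) should have mean and variance near 1, as an Exp(1) variable does. Again a plain-Ruby sketch, independent of any Casebook code:

```ruby
# If X is continuous with survival function S, then -ln S(X) ~ Exp(1).
# Demonstrated for X ~ N(0, 1), where S(x) = 1 - Phi(x).
srand(7)
n = 100_000

def std_normal_cdf(z)
  0.5 * (1.0 + Math.erf(z / Math.sqrt(2.0)))
end

h = Array.new(n) do
  z = Math.sqrt(-2.0 * Math.log(1.0 - rand)) * Math.cos(2.0 * Math::PI * rand)
  s = 1.0 - std_normal_cdf(z)   # survival function at the draw
  -Math.log(s)                  # cumulative hazard H(z) = -ln S(z)
end

mean = h.sum / n
var  = h.sum { |v| (v - mean)**2 } / n
puts format("mean %.3f, variance %.3f (Exp(1): both 1)", mean, var)
```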

Now, the Cox-Snell residual is defined as the fitted cumulative hazard evaluated at the actual survival time X. Hence, by what we discussed in the paragraph above, if the estimates of the coefficients are close to the actual values of the coefficients, the Cox-Snell residuals should have an exponential distribution with rate 1.
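Concretely, the residual for a subject with observed time t and covariates x is the fitted cumulative hazard H0_hat(t) * exp(beta_hat . x). A small sketch, where the baseline hazard shape, the coefficient values, and the covariate names are all assumptions for illustration, not Casebook estimates:

```ruby
# Cox-Snell residual: r = H0_hat(t) * exp(beta_hat . x),
# the fitted cumulative hazard evaluated at the observed time.
# Every number below is invented for illustration.
BETA_HAT = { age_at_entry: -0.04, parental_substance_abuse: 0.35 }

# Assumed Weibull-shaped baseline cumulative hazard, t in days.
def baseline_cum_hazard(t)
  0.001 * t**1.1
end

def cox_snell_residual(t, x)
  lp = BETA_HAT.sum { |name, b| b * x.fetch(name, 0) }
  baseline_cum_hazard(t) * Math.exp(lp)
end

r = cox_snell_residual(400, age_at_entry: 6, parental_substance_abuse: 1)
puts format("Cox-Snell residual: %.3f", r)
```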

However, the Cox-Snell residual is itself a continuous random variable, so, by the same probability integral transform argument, its own cumulative hazard evaluated at the residual should also follow an exponential distribution with rate 1; and for an Exp(1) variable, the cumulative hazard function is simply H(t) = t.

Combining all of this: if we plot the estimated cumulative hazard of the Cox-Snell residuals against the residuals themselves, the points should lie approximately along the line y = x. In practice, the cumulative hazard of the Cox-Snell residuals is estimated with the Nelson-Aalen estimator, and I found this piece of code pretty useful in doing all this. For a theoretical discussion, one should refer to Section 11.2 of Klein and Moeschberger.
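The whole diagnostic can be mimicked end to end in a few lines of plain Ruby, with simulated Exp(1) draws standing in for the residuals of a well-fitting model. This sketch assumes no ties and no censoring, which is simplistic compared to real Casebook data:

```ruby
# Nelson-Aalen estimate of the cumulative hazard: at each event time,
# add d_i / n_i (events over number still at risk). No ties, no censoring.
def nelson_aalen(times)
  at_risk = times.size
  h = 0.0
  times.sort.map do |t|
    h += 1.0 / at_risk
    at_risk -= 1
    [t, h]
  end
end

srand(11)
# Exp(1) draws play the role of Cox-Snell residuals from a good fit.
residuals = Array.new(5_000) { -Math.log(1.0 - rand) }
points = nelson_aalen(residuals)

# Under Exp(1), the true cumulative hazard is H(t) = t, so the estimated
# points should track the line y = x; print a few interior quantiles.
[1_250, 2_500, 3_750].each do |k|
  t, hh = points[k]
  puts format("t = %.3f, H_hat = %.3f", t, hh)
end
```

In the real check, `residuals` would instead be the Cox-Snell residuals from the fitted model, and a large gap between the two columns would flag a poor fit.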

Friday, June 15, 2012

Integrating R with Ruby

We are close to integrating our first set of survival analysis models with our ETL process. The latter is a set of Ruby scripts, built on the Sequel gem, that take data from our operational PostgreSQL database, transform it, and load the results into a set of fact and dimension tables in the data warehouse, which is also in PostgreSQL. I spent some time exploring ways to deploy predictive models within applications. It seemed like opinions are broadly divided into two schools of thought:

  1. Some people suggest using R only for building and experimenting with the model offline, and then re-implementing the predictive algorithms in C++, Java, or whatever language the main application is written in.
  2. The other school favors maintaining the model in R and integrating it with the app through some sort of "bridge"; some of the most popular bridges seemed to be rApache, Rserve, PMML, and RSRuby.
There are, of course, pros and cons to either approach. With the first, one can expect more control over the algorithms, since one implements them oneself; but it is more time- and resource-consuming, and it takes a lot of testing to make the re-implementation as reliable as the standard R libraries, which have been around for a while. That works in favor of the second choice: the R libraries have a pretty big community supporting them and have been tested thoroughly over the years. The problem with the second choice, however, is that not all of these bridges have a big enough community behind them.

After some discussion among ourselves, we opted for RSRuby. It lets you call standard R functions from Ruby once you create a "wrapper" around R, which is an instance of the RSRuby class. Here are some blog entries on getting started with RSRuby that I found pretty useful.

However, when we fit a Cox model, the coxph() function demands that the response variable be a survival object, returned by the Surv() function. This entry shows how to use lm() with RSRuby; lm(), however, needs only a simple vector, or a column of a data frame, as the response variable, which makes things easier. RSRuby has a built-in class for data frames, but none for survival objects, and the documentation is scarce.

We therefore chose to create our own application-level package in R (using the concepts of generic functions and method dispatch) that takes the data as simple vectors or data frames, and returns the fitted survival function for a given child or cohort group as a data frame. I found this tutorial by Friedrich Leisch excellent. The package works fine for our volume of data; until RSRuby matures more, this seems like a nice workaround.