Friday, December 21, 2012

Accelerated Failure Time Models

I continued my experiments with survival analysis, this time with Accelerated Failure Time (AFT) models. Although the Cox-Snell residuals, as explained in this post, indicated that the Cox model was a pretty good fit, that model rests on the Proportional Hazards assumption, which holds only approximately for our data, and I was interested in seeing whether we could do away with that assumption. The Accelerated Failure Time model instead assumes that the event times (in our case, the actual times when kids exit from care) come from a theoretical distribution, such as the Weibull, Log-logistic or exponential. The graphical test of whether the exit times follow a Weibull distribution (apart from drawing a histogram of exit times with equal bucket widths, which I did too) is to plot log(-log(S(t))) against log(t), where S(t) is the survival function estimated by Kaplan-Meier; this should give a straight line (an explanation is here). What we got was roughly a straight line. The equivalent check for the Log-logistic distribution is to plot log((1-S(t))/S(t)) against log(t), which should also be a straight line if the assumption holds; the curve I got was almost quadratic, so I pursued Weibull. I used the survreg function from the survival library. For prediction with the AFT Weibull model, I used R's generic predict function, computing the quantiles and plotting them in a reversed way, using ideas as in this lecture. A rough sketch of these steps is below.
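
The sketch assumes a data frame exits with a length-of-stay column los (in days), an event indicator exited (1 if the kid has exited care, 0 if censored), and a couple of made-up covariates; none of these are the actual Casebook column names.

library(survival)

## Kaplan-Meier estimate of S(t), used for the graphical distribution checks
km <- survfit(Surv(los, exited) ~ 1, data = exits)
ok <- km$surv > 0 & km$surv < 1     # drop endpoints where the logs blow up
t <- km$time[ok]
s <- km$surv[ok]

## Weibull check: should look roughly like a straight line
plot(log(t), log(-log(s)), xlab = "log(t)", ylab = "log(-log(S(t)))")

## Log-logistic check: should also be a straight line if that assumption holds
plot(log(t), log((1 - s) / s), xlab = "log(t)", ylab = "log((1-S(t))/S(t))")

## AFT Weibull fit; the survival curve for one fictitious kid is obtained by
## predicting quantiles and plotting them "in reverse"
aft <- survreg(Surv(los, exited) ~ age_at_entry + gender, data = exits, dist = "weibull")
p <- seq(0.01, 0.99, by = 0.01)
q <- predict(aft, newdata = data.frame(age_at_entry = 6, gender = "F"),
             type = "quantile", p = p)
plot(q, 1 - p, type = "l", xlab = "days in care", ylab = "S(t)")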

However, when I did the graphical check with Cox-Snell residuals and their cumulative hazards, as I had done for the Cox model, it seemed that the Cox model was the better fit. The reason was that the estimated survival probabilities of the children who exited, evaluated at their actual exit times, did not follow a uniform distribution very well, and the graphical check with Cox-Snell residuals rests on the assumption that they would.

However, when we plot (one plot per entry cohort year: 2008, 2009, 2010, 2011 or 2012) the three survival functions estimated by Kaplan-Meier, Cox regression and AFT regression (for the last two, we apply the model to an "average kid", i.e., a fictitious kid with all covariate values set to their respective means), the last two come pretty close to each other. Both also stay pretty close to the Kaplan-Meier curve for the older entry cohorts, and drift farther from it for the more recent ones. This is actually the way we want it to be: for the older cohorts, most (93-95%) of the kids have exited care, but for the more recent cohorts only very few have, so the Kaplan-Meier curve is essentially a biased estimate based mostly on the kids who have already left, and for those kids the length of stay in the 2012 entry cohort, as of today, cannot exceed roughly 350 days, and so on. The pattern seen in the older cohorts establishes our trust in the model, and we expect the Kaplan-Meier estimate and the Cox/AFT estimates to come closer to each other as more kids from the recent cohorts exit care and the model gets updated. A sketch of how one such comparison plot is drawn is below.
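
This is roughly how the comparison is drawn for a single cohort, continuing with the made-up column names from the sketch above and assuming gender is coded 0/1 so that taking its mean makes sense.

library(survival)

## one entry cohort
cohort <- subset(exits, entry_year == 2008)

km <- survfit(Surv(los, exited) ~ 1, data = cohort)
cox <- coxph(Surv(los, exited) ~ age_at_entry + gender01, data = cohort)
aft <- survreg(Surv(los, exited) ~ age_at_entry + gender01, data = cohort, dist = "weibull")

## the "average kid": every covariate set to its mean
avg_kid <- data.frame(age_at_entry = mean(cohort$age_at_entry),
                      gender01 = mean(cohort$gender01))

plot(km, conf.int = FALSE, xlab = "days in care", ylab = "S(t)")        # Kaplan-Meier
lines(survfit(cox, newdata = avg_kid), conf.int = FALSE, col = "red")   # Cox
p <- seq(0.01, 0.99, by = 0.01)
lines(predict(aft, newdata = avg_kid, type = "quantile", p = p), 1 - p, col = "blue")  # AFT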

Tuesday, December 4, 2012

Survival Analysis, Part II

I continued my work on survival analysis, this time on real Casebook data, as Casebook went live on July 5. In child welfare, finding the length of stay by entry cohort has always been hard: caseworkers had to wait until all or most of the children who entered care in a given period had exited care to get an idea of the aggregate length of stay. To give caseworkers an idea of how long kids who are currently in care are going to stay, we treated exits from care as the "events" in survival data and used Cox regression to predict the length of stay. We used a bunch of covariates, like entry cohort (2008-2012), age at the time of entering care, gender, race, parents' alcohol and drug abuse problems, child's behavioral problems, child's disability, history of domestic violence, family structure etc.
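
A minimal sketch of such a fit with coxph from the survival package; the data frame exits and the covariate names are placeholders, not the real schema.

library(survival)

## Cox proportional hazards fit on the exits data
cox <- coxph(Surv(los, exited) ~ factor(entry_year) + age_at_entry + gender + race +
               parental_substance_abuse + child_behavior_problems + child_disability +
               domestic_violence + family_structure,
             data = exits)
summary(cox)   # hazard ratios and significance for each covariate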

Since the Cox model returns a survival function rather than a single value of length of stay, computing a residual is not as simple as computing (expected - observed), even for a kid whose actual length of stay we know; it has to be done rather indirectly. There are various residuals that focus on different aspects of a Cox model, and I was mostly interested in one that gives the overall goodness of fit of the model with respect to the observations. The Cox-Snell residual is one such residual. I got fascinated by the underlying idea: it is based on the probability integral transform, which says that if X is a continuous random variable with cumulative distribution function F_X, then the random variable Y = F_X(X) has a uniform distribution on [0, 1].

Now, for any continuous random variable X, since S(x) = Pr(X > x) = 1 - F_X(x), the fact that F_X(X) has a uniform distribution on [0, 1] implies that S(X) does too. It follows that the cumulative hazard evaluated at the actual exit time, H(X) = -ln[S(X)], has an exponential distribution with rate 1, because Pr(-ln[S(X)] > h) = Pr(S(X) < e^{-h}) = e^{-h}.
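
Here is a quick sanity check of that chain in R, with a known distribution standing in for the exit times (a toy example, not Casebook data).

## X ~ Weibull, so S(X) is Uniform[0, 1] and H(X) = -log(S(X)) should be Exponential(1)
set.seed(1)
x <- rweibull(100000, shape = 1.5, scale = 30)
h <- -log(1 - pweibull(x, shape = 1.5, scale = 30))
c(mean(h), var(h))   # both should be close to 1 for an Exponential(1) variable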

Now, the Cox-Snell residual is defined as the estimated cumulative hazard evaluated at the actual survival time X, so, by the argument in the paragraph above, if the estimated coefficients are close to the true coefficients, the Cox-Snell residuals should have an exponential distribution with rate 1.

However, the Cox-Snell residual is itself a continuous random variable, so, by the probability integral transform and the argument above, the cumulative hazard of the Cox-Snell residual should also follow an exponential distribution with rate 1; and since the cumulative hazard of an Exponential(1) variable is the identity function, it should be approximately equal to the residual itself.

Combining all of this, if we plot the estimated cumulative hazard of the Cox-Snell residuals against the residuals themselves, the points should lie approximately along the line y = x. In practice, the cumulative hazard of the Cox-Snell residuals is estimated with the Nelson-Aalen estimator, and I found this piece of code pretty useful in doing all this. For a theoretical discussion, one should refer to Section 11.2 of Klein and Moeschberger.
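
The following is not that piece of code, just a minimal sketch of the same diagnostic, assuming the coxph fit cox and the exits data frame from the sketch earlier in this post; the Cox-Snell residuals are recovered from the martingale residuals.

library(survival)

## Cox-Snell residual = event indicator - martingale residual
cs <- exits$exited - residuals(cox, type = "martingale")

## Nelson-Aalen estimate of the cumulative hazard of the residuals, treating the
## residuals as "survival times" with the original censoring indicators
fit <- survfit(Surv(cs, exits$exited) ~ 1)
na <- cumsum(fit$n.event / fit$n.risk)

plot(fit$time, na, xlab = "Cox-Snell residual", ylab = "estimated cumulative hazard")
abline(0, 1, lty = 2)   # a good fit keeps the points close to the line y = x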

Wednesday, November 14, 2012

Completed ML course

Received my course completion certificate today for the Machine Learning course taught by Stanford professor Dr. Andrew Ng on Coursera. It turned out to be a very useful course. What I enjoyed most was that I could apply some of the concepts immediately at work, as I was working at the time on a classification problem: predicting permanency outcomes (adoption, reunification, emancipation etc.) for children in Casebook.

Friday, June 15, 2012

Integrating R with Ruby

We are close to integrating our first set of survival analysis models with our ETL process. The latter is a set of Ruby scripts, using the Sequel gem, that take data from our operational PostgreSQL database, transform it, and load the results into a set of fact and dimension tables in the data warehouse, which is also in PostgreSQL. I spent some time exploring ways to deploy predictive models within applications. It seemed like the opinions are broadly divided into two schools of thought:

  1. Some people suggest using R only for building and experimenting with the model offline, and then re-implementing the predictive algorithms in C++, Java, or whatever language the main application is written in.
  2. The other school is in favor of maintaining the model in R, and integrating it with the app by using some sort of "bridge", and it seemed like some of the most popular "bridges" were rApache, Rserve, PMML and RSRuby.
There are, of course, pros and cons to either approach. With the first, one can expect more control over the algorithms, since one implements them oneself, but it is more time- and resource-consuming, and it takes a lot of testing to make the re-implementation as reliable as the standard R libraries, which have been around for a while. That works in favor of the second choice: the R libraries have a pretty big community supporting them and have been tested thoroughly over the years. With the second choice, however, the problem is that not all of these bridges have a big enough community behind them.

After some discussion among ourselves, we opted for RSRuby. It lets you call standard R functions from Ruby, once you create a "wrapper" around R, which is an instance of the class RSRuby. Here are some blog entries on getting started with RSRuby, which I found pretty useful.

However, when we fit a Cox model, the coxph() function demands that the response variable be a survival object, returned by the Surv() function. This entry shows how to use lm() in RSRuby; lm(), however, needs only a simple vector, or a column of a data frame, as the response variable, which makes things easier. RSRuby has a built-in class for data frames, but none for survival objects, and the documentation is scarce.

We therefore chose to create our own application-level package in R (using the concepts of generic functions and method dispatch) that takes the data as plain vectors or data frames, and returns the fitted survival function for a given child or cohort group as a data frame. I found this tutorial by Friedrich Leisch an excellent one. It works fine for our volume of data. Until RSRuby matures further, this seems like a nice workaround.
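
A minimal sketch of the kind of interface such a package can expose; the function name and the column names are illustrative, not our actual API.

library(survival)

## generic function with method dispatch
fitted_survival <- function(x, ...) UseMethod("fitted_survival")

## data-frame method: RSRuby can build a plain data frame on the Ruby side and
## gets a plain data frame back, so no survival objects ever cross the bridge
fitted_survival.data.frame <- function(x, ...) {
  fit <- survfit(Surv(x$los, x$exited) ~ 1)
  data.frame(time = fit$time, surv = fit$surv)
}

fitted_survival(exits)   # the fitted survival curve, as a data frame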

Thursday, May 10, 2012

Survival analysis

Over the last couple of weeks, I have been experimenting with survival analysis techniques. We need this for Casebook analytics, as we will eventually show the quantiles of how long children in placement stay in foster homes etc., and how that length of stay varies across cohort groups (so that users can ask questions like "Has the median length of stay gone down in 2011 compared to 2010?"). The technique we chose was the Kaplan-Meier estimator, a popular nonparametric estimator of the survival function. We chose it for a few reasons: a) it's simple to implement, and b) it deals well with staggered entry (not all children who entered placements in the same year entered on the same day) and with censored data (many children who enter placements in a year do not exit by the end of the year, so those observations are right-censored). While surveying the literature, however, I did not find much work dealing with staggered entries, except this one, which has been cited repeatedly. I was interested in using the survival function to find the median length of stay, which this paper does not address directly; moreover, it takes the estimate of the survival function at regular calendar intervals, which is not very useful for finding the median, for the following reason: say the median length of stay is 90 days (with an s.d. of 10 days or so), and entries start in January, but we do not see many exits until April. Then, up to the end of March, the KM estimate stays very close to 1, and it does not hit 0.5 until June. So what should the estimate of the median be? Six months? That's far longer than the theoretical median of 90 days! The life_table output below, from simulations of 1000 children with true medians of 90 and 180 days, illustrates the problem.

life_table(1000, 90, 10);
Median length of stay = 90
   point_in_time no_at_risk no_of_exists fraction_remains survival_prob
1     2008-01-31         83            0        1.0000000    1.00000000
2     2008-02-29        152            0        1.0000000    1.00000000
3     2008-03-31        240           16        0.9333333    0.93333333
4     2008-04-30        303           69        0.7722772    0.72079208
5     2008-05-31        332           73        0.7801205    0.56230466
6     2008-06-30        360           87        0.7583333    0.42641437
7     2008-07-31        356           76        0.7865169    0.33538209
8     2008-08-31        359          103        0.7130919    0.23915826
9     2008-09-30        335           96        0.7134328    0.17062335
10    2008-10-31        310           81        0.7387097    0.12604112
11    2008-11-30        317           77        0.7570978    0.09542546
12    2008-12-31        322           87        0.7298137    0.06964280


life_table(1000, 180, 10);
Median length of stay = 181
   point_in_time no_at_risk no_of_exists fraction_remains survival_prob
1     2008-01-31         83            0        1.0000000     1.0000000
2     2008-02-29        158            0        1.0000000     1.0000000
3     2008-03-31        241            0        1.0000000     1.0000000
4     2008-04-30        315            0        1.0000000     1.0000000
5     2008-05-31        395            0        1.0000000     1.0000000
6     2008-06-30        474           10        0.9789030     0.9789030
7     2008-07-31        552           73        0.8677536     0.8494466
8     2008-08-31        574           88        0.8466899     0.7192178
9     2008-09-30        573           75        0.8691099     0.6250794
10    2008-10-31        595           75        0.8739496     0.5462879
11    2008-11-30        593           83        0.8600337     0.4698260
12    2008-12-31        596           84        0.8590604     0.4036089

However, as this article discusses, with staggered entry it is common practice to shift the intervals so that they all start from a common point in time, keeping the lengths of the intervals intact, and treating the right endpoint (which is now the same as the length of the interval itself) as the "time-to-event".
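
In code, the left-shifting simply amounts to working with durations instead of calendar dates; a sketch, with an assumed placements table and column names (and today's date as the censoring point):

## time-to-event = exit date - entry date, right-censored at the date of the data pull
as_of <- as.Date("2012-05-10")
time <- as.numeric(pmin(placements$exit_date, as_of, na.rm = TRUE) - placements$entry_date)
exited <- as.numeric(!is.na(placements$exit_date) & placements$exit_date <= as_of)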

With this left-shifting, I was curious whether the accuracy of the Kaplan-Meier estimate would depend on entries happening uniformly across the year. So I experimented with two synthetic datasets (a sketch of the simulation follows the list):

a) the entries happen uniformly across the year: the entry time into placement is chosen uniformly at random within the year, and the length of stay is drawn from a Normal distribution with a given mean and standard deviation;

b) the entries happen mostly in the first half of the year: toss a coin that comes up heads with p = 0.8; if it comes up heads, choose the date of entry from the first half of the year, otherwise from the second half.
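
A rough sketch of the simulation; the function and its parameters are illustrative, not the original code.

library(survival)
set.seed(42)

## p_first_half = 0.5 gives (roughly) uniform entries over the year,
## p_first_half = 0.8 skews them towards January-June
simulate_cohort <- function(n, mean_los, sd_los, p_first_half = 0.5) {
  first_half <- runif(n) < p_first_half
  entry <- ifelse(first_half, sample(1:182, n, replace = TRUE),
                  sample(183:365, n, replace = TRUE))
  los <- round(rnorm(n, mean = mean_los, sd = sd_los))
  ## censor at the end of the year; the intervals are already left-shifted to start at 0
  data.frame(time = pmin(los, 365 - entry),
             exited = as.numeric(los <= 365 - entry))
}

km <- survfit(Surv(time, exited) ~ 1, data = simulate_cohort(1000, 90, 10, p_first_half = 0.8))
quantile(km, probs = 0.5)   # should come out close to the 90-day median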

With both datasets, once all intervals were left-shifted to start at 0, Kaplan-Meier recovered the median nicely: the point on the X-axis where the survival function crosses 0.5 came close to the mean/median of the Normal distribution the lengths of stay were generated from. Here is an example from a sample of 1000 children, with entries uniform across the year, a median length of stay of 90 days and s.d. of 2 days, times to events bucketed in buckets of size 3, and 90% confidence intervals for the survival probability. As the table shows, the median is around 89 days.

  start_bucket end_bucket #risk #exits #censored fraction_remains surv_prob    std_error              90% CI
1           83         85  1000      6         1        0.9939970     0.994 2.443356e-03 [0.98998 - 0.99802]
2           86         88   993    179        54        0.8146998     0.810 1.991064e-03 [0.80672 - 0.81328]
3           89         91   760    397       138        0.4254703     0.345 8.480460e-04 [0.34361 - 0.34639]
4           92         94   225    158        53        0.2040302     0.070 1.720673e-04 [0.06972 - 0.07028]
5           95         97    14      9         5        0.2173913     0.015 3.687156e-05 [0.01494 - 0.01506]

Saturday, March 31, 2012

STOMP with Rails

I am looking for message-oriented middleware to make the ETL between the operational Casebook database and the data warehouse DB (with its star schema) real-time (Pentaho does real-time ETL over JMS, but only in the enterprise edition). This week, I experimented with Apache ActiveMQ, which speaks STOMP among other protocols (though ActiveMQ does not yet support AMQP). I checked this tutorial on the ActiveMessaging plugin, and also this tutorial from the Rails Recipes book. Finally, I decided to write a PoC of my own. ActiveMessaging brings in the idea of a Processor class, which is basically a subscriber that listens to a message queue. In my demo app, the user can create an order through the front-end, which gets saved to the database through the usual controller-model handshake with an initial status of 'NEW'. The controller, however, also writes the data about the newly created order to the message broker. The OrderProcessor, which listens to the same broker, parses the XML-formatted data using the REXML API, finds the order by its id, and updates the status to 'SHIPPED'. Note that although the controller and the processor have a publisher-subscriber relationship with each other, they are both clients of the ActiveMQ server, which, in this case, speaks STOMP.

Next, I will look at how the "publish" to the message broker can be triggered from the operational database as inserts, updates and deletes take place there.

Friday, March 23, 2012

d3 for analytics dashboard

This week, I worked on developing a proof-of-concept for the analytics dashboard for Casebook. We are using a star schema in PostgreSQL for the data warehouse. For the front-end rendering, where we need histograms, box-and-whisker plots and line charts (maybe more... we are advocates of agile, after all), I am using the d3 JavaScript library. I found it pretty cool for a number of reasons. First, all the rendering is done with SVG, so if you want to draw a histogram, you draw a set of rectangles; if you want to draw a box plot, you draw all the lines yourself, and so on. You learn a small set of basic tools and use them repeatedly to get the job done. I have used Google Charts before for implementing analytics dashboards, but since SVG lets you draw things with geometric primitives, d3 gives you finer control. Second, there is the concept of "joining" your data with the SVG components of the page. A histogram can be drawn with the following piece of nifty code:

chart.selectAll("rect")
.data(histogram)
.enter().append("rect")
.attr("width", x.rangeBand())
.attr("x", function(d) { return x(d.x); })
.attr("y", function(d) { return height - y(d.y); })
.attr("height", function(d) { return y(d.y); });

where histogram is an array of JavaScript objects, each holding the start point, width and frequency of a bucket. There are no SVG "rect" elements to begin with, so we can think of this as an "outer join" between the SVG rectangle elements and the histogram buckets, with the rectangles' attributes set on the elements of the resulting set. The enter() operator comes in handy here (and in most situations where the nodes we are trying to select do not yet exist in the DOM tree): we then append the missing nodes to the document, which, in this case, are the SVG rect elements.

One other thing I found interesting here is the object named "histogram". It's obtained by the following method call:

var histogram = d3.layout.histogram()(data);

it's the result of invoking the layout returned by d3.layout.histogram(), which is both an object and a function (a function with configuration methods attached). Since the layout is a function, it can be invoked with the parameter "data", producing the array of buckets that the data() call above binds to the rect elements; if the underlying data changes (which can happen if you change the filtering criteria of your query), the new buckets can be rebound to the nodes of the document.

I attended a talk on d3 by Mike Dewar of bit.ly at a meetup recently, where I first got introduced to it. For more details, check the paper by the original authors.

Tuesday, March 20, 2012

EMC acquires Pivotal

Pivotal Labs, our collaborators on Casebook, got acquired by EMC. We had a talk in our office today on "Social meets Big Data". Big Data meets Agile. :)

Wednesday, March 7, 2012

Defended PhD thesis

Was back in Ames for a few days to defend my PhD thesis. Have never seen Ames being so nice in March! Good to wrap up the pending work!

Friday, March 2, 2012

Strata 2012 (SF Bay Area)

I was in the Bay Area for the last 3 days, attending the Strata conference on Big Data: a gathering of data scientists from all over the US and abroad, companies demoing their big data products, and data scientists explaining how they are using machine learning in various domains. A few talks that I particularly liked:

1) Robert Lefkowitz's talk on how an array-based database model makes certain database operations, like comparing data points in a time series, faster. Pretty in-depth.

2) Jake Porway's talk on how Big Data can be used for the public sector, and what non-profits are doing for it. Good to see someone working on a set of problems that are relevant for a bigger audience.

3) Dr. Hal Varian's talk on how Google's search data can be used for predicting economic recessions.

Above all, it was great to meet Doug Cutting and Julian Hyde in person. Julian was generous with his time and suggestions about the right tools for building our data warehouse for Casebook. A very useful conference indeed, in a place that I doubt shares the same sun with New York :)

Friday, January 6, 2012

RPostgreSQL for data analysis

Since the operational database for Casebook is PostgreSQL, and I use R for exploratory data analysis, I was looking for a tool that marries the two, and I found RPostgreSQL to be quite handy. Make a connection to your PostgreSQL DB using the DBI interface, run the query, get the result as a data frame, and the rest is plain R.
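
A minimal example of that workflow; the connection parameters, the table and the column names are placeholders.

library(RPostgreSQL)

## connect to the PostgreSQL DB through the DBI interface
con <- dbConnect(PostgreSQL(), host = "localhost", dbname = "casebook_dev",
                 user = "analyst", password = "secret")

## run the query and get the result back as a data frame
placements <- dbGetQuery(con, "SELECT entry_date, exit_date FROM placements")

## from here on it is plain R
los <- as.numeric(as.Date(placements$exit_date) - as.Date(placements$entry_date))
summary(los)

dbDisconnect(con)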