Thursday, May 10, 2012

Survival analysis

Over the last couple of weeks, I have been experimenting with survival analysis techniques. We need this for Casebook analytics, as we will eventually show quantiles of how long children in placement stay in foster homes, etc., and how that length of stay varies across different cohort groups (so that users can ask questions like "Has the median length of stay decreased in 2011 compared to 2010?"). The technique we chose is the Kaplan-Meier estimator, a popular nonparametric estimator of the survival function. We chose it for a few reasons: a) it is simple to implement, and b) it deals well with staggered entry (not all children who entered placements in the same year entered on the same day) and with censored data (many children who enter placements in a year do not exit by the end of the year, so those observations are right-censored).

However, while surveying the literature, I did not find much work dealing with staggered entries, except this one, which has been cited repeatedly. I was interested in using the survival function to find the median length of stay, which that paper does not address directly; moreover, it takes the estimate of the survival function at regular points in time, which is not very useful for finding the median, for the following reason: say the median length of stay is 90 days (with an s.d. of 10 days or so), and entries occur starting in January, but we do not see many exits until April. Until the end of March, the KM estimate then remains very close to 1, and it does not hit 0.5 until as late as June. So what should the estimate of the median be? 6 months? That is far longer than the theoretical median of 90 days!
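
For reference, the survival_prob column in the tables below is the Kaplan-Meier product-limit estimate: at each point in time, compute the fraction of at-risk children who remain, and take the running product of those fractions. A minimal sketch of that step in R (the column names mirror the output below; this is not the full life_table code):

# `lt` is a life table with one row per point in time and columns
# no_at_risk and no_of_exits.
km_estimate <- function(lt) {
  lt$fraction_remains <- (lt$no_at_risk - lt$no_of_exits) / lt$no_at_risk
  lt$survival_prob    <- cumprod(lt$fraction_remains)
  lt
}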

life_table(1000, 90, 10);
Median length of stay = 90
   point_in_time no_at_risk no_of_exits fraction_remains survival_prob
1     2008-01-31         83            0        1.0000000    1.00000000
2     2008-02-29        152            0        1.0000000    1.00000000
3     2008-03-31        240           16        0.9333333    0.93333333
4     2008-04-30        303           69        0.7722772    0.72079208
5     2008-05-31        332           73        0.7801205    0.56230466
6     2008-06-30        360           87        0.7583333    0.42641437
7     2008-07-31        356           76        0.7865169    0.33538209
8     2008-08-31        359          103        0.7130919    0.23915826
9     2008-09-30        335           96        0.7134328    0.17062335
10    2008-10-31        310           81        0.7387097    0.12604112
11    2008-11-30        317           77        0.7570978    0.09542546
12    2008-12-31        322           87        0.7298137    0.06964280


life_table(1000, 180, 10);
Median length of stay = 181
   point_in_time no_at_risk no_of_exits fraction_remains survival_prob
1     2008-01-31         83            0        1.0000000     1.0000000
2     2008-02-29        158            0        1.0000000     1.0000000
3     2008-03-31        241            0        1.0000000     1.0000000
4     2008-04-30        315            0        1.0000000     1.0000000
5     2008-05-31        395            0        1.0000000     1.0000000
6     2008-06-30        474           10        0.9789030     0.9789030
7     2008-07-31        552           73        0.8677536     0.8494466
8     2008-08-31        574           88        0.8466899     0.7192178
9     2008-09-30        573           75        0.8691099     0.6250794
10    2008-10-31        595           75        0.8739496     0.5462879
11    2008-11-30        593           83        0.8600337     0.4698260
12    2008-12-31        596           84        0.8590604     0.4036089

However, as this article discusses, with staggered entry it is common practice to shift the intervals so that all of them start from a common point in time, keeping the lengths of the intervals intact and treating the right endpoints (which are now the same as the lengths of the intervals themselves) as the "time-to-event".
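
Concretely, the left-shift amounts to measuring every spell from its own entry date; a tiny sketch in R on a toy data frame (all names made up):

# Each shifted interval starts at 0, and its right endpoint -- the
# "time-to-event" -- is simply the length of the spell in days.
spells <- data.frame(entry_date = as.Date(c("2008-01-15", "2008-03-02")),
                     exit_date  = as.Date(c("2008-04-10", "2008-06-01")))
spells$time_to_event <- as.numeric(spells$exit_date - spells$entry_date)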

With this left-shifting, I was curious whether the accuracy of the Kaplan-Meier estimate would depend on the assumption that entries happen uniformly across the year. So, I experimented with two synthetic datasets:

a) the entries happen uniformly across the year: the entry time into placement is chosen uniformly at random within the year, and the length of stay is drawn from a Normal distribution with a given mean and standard deviation;

b) the entries happen mostly in the first half of the year: a coin that comes up heads with p = 0.8 is tossed; if it lands heads, the date of entry is chosen from the first half of the year, otherwise from the second half (a sketch of this simulation follows below).
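
Here is a minimal sketch in R of how dataset (b) can be simulated and fed to a left-shifted Kaplan-Meier fit. This is not the code behind the tables in this post: it uses the survival package's survfit rather than the life_table code above, and the specific numbers and names are made up for illustration.

library(survival)

set.seed(42)
n          <- 1000
year_start <- as.Date("2008-01-01")
mid_year   <- as.Date("2008-07-01")
year_end   <- as.Date("2008-12-31")

# 80% of the entries fall in the first half of the year, 20% in the second half
first_half <- runif(n) < 0.8
entry <- rep(year_start, n)
entry[first_half]  <- sample(seq(year_start, mid_year - 1, by = "day"),
                             sum(first_half), replace = TRUE)
entry[!first_half] <- sample(seq(mid_year, year_end, by = "day"),
                             sum(!first_half), replace = TRUE)

# Length of stay ~ Normal(90, 2); children still in placement at year end are
# right-censored
length_of_stay <- round(rnorm(n, mean = 90, sd = 2))
exit           <- entry + length_of_stay
observed_end   <- pmin(exit, year_end)
event          <- as.numeric(exit <= year_end)   # 1 = exit observed, 0 = censored

# Left-shift: time-to-event is measured from each child's own entry date
time_to_event <- as.numeric(observed_end - entry)

fit <- survfit(Surv(time_to_event, event) ~ 1, conf.int = 0.90)
print(fit)   # the reported median should come out close to 90 days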

With both datasets, and with all intervals left-shifted to start at 0, Kaplan-Meier worked nicely for recovering the median: the point on the X-axis where the survival function crosses 0.5 came close to the mean/median of the Normal distribution the lengths of stay were generated from. Here is an example from a sample of 1000 children, with entries spread uniformly across the year, a median length of stay of 90 days and an s.d. of 2 days, times-to-event bucketed into buckets of size 3, and 90% confidence intervals for the survival probability. As the table shows, the median is around 89 days.

  start_bucket end_bucket #risk #exits #censored fraction_remains surv_prob    std_error              90% CI
1           83         85  1000      6         1        0.9939970     0.994 2.443356e-03 [0.98998 - 0.99802]
2           86         88   993    179        54        0.8146998     0.810 1.991064e-03 [0.80672 - 0.81328]
3           89         91   760    397       138        0.4254703     0.345 8.480460e-04 [0.34361 - 0.34639]
4           92         94   225    158        53        0.2040302     0.070 1.720673e-04 [0.06972 - 0.07028]
5           95         97    14      9         5        0.2173913     0.015 3.687156e-05 [0.01494 - 0.01506]
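
To read the median off such a table, take the smallest time at which the survival probability drops to 0.5 or below; a tiny sketch in R, with the surv_prob column of the table above copied into a hypothetical data frame km:

km <- data.frame(start_bucket = c(83, 86, 89, 92, 95),
                 surv_prob    = c(0.994, 0.810, 0.345, 0.070, 0.015))
km$start_bucket[min(which(km$surv_prob <= 0.5))]   # 89 days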

Saturday, March 31, 2012

STOMP with Rails

I am looking for message-oriented middleware to make the ETL between the operational Casebook database and the data warehouse DB (with a star schema) real-time (in fact, Pentaho uses JMS for real-time ETL, but only in the enterprise edition). This week, I experimented with Apache ActiveMQ, which speaks STOMP, among other protocols (ActiveMQ does not yet support AMQP, though). I checked this tutorial on the ActiveMessaging plugin, and also this tutorial from the Rails Recipes book. Finally, I decided to write a PoC of my own.

ActiveMessaging brings in the idea of a Processor class, which is basically a subscriber that listens to a message queue. In my demo app, the user can create orders through the front-end, which get saved to the database through the usual controller-model handshake with an initial status of 'NEW'. The controller also publishes the data about the newly created order to the message broker. The OrderProcessor, which listens to the same broker, parses the XML-formatted data using the REXML API, finds the order by its id, and updates its status to 'SHIPPED'. Note that although the controller and the processor have a client-server (or publisher-subscriber) relationship between themselves, they are both clients of the ActiveMQ server, which, in this case, speaks STOMP.

Next, I will look at how the "publish" to the message broker can be triggered from the operational database itself, as inserts, updates, and deletes take place there.

Friday, March 23, 2012

d3 for analytics dashboard

This week, I worked on developing a proof-of-concept for the analytics dashboard for Casebook. We are using a star schema in PostgreSQL for the data warehouse. For the front-end rendering, where we need histograms, box-and-whisker plots, and line charts (maybe more... we are advocates of agile, after all), I am using the d3 javascript library. I found it pretty cool for a number of reasons. First, all the rendering is done with SVG, so if you want to draw a histogram, you draw a set of rectangles; if you want to draw a box plot, you draw all the lines yourself, and so on. You learn a set of basic tools and use them repeatedly to get your job done. I have used Google Charts before for implementing analytics dashboards, but since SVG lets you draw things with geometric primitives, it gives you much finer control. Second, there is the concept of "joining" your data with the SVG components of the page. A histogram can be drawn with the following piece of nifty code:

chart.selectAll("rect")
.data(histogram)
.enter().append("rect")
.attr("width", x.rangeBand())
.attr("x", function(d) { return x(d.x); })
.attr("y", function(d) { return height - y(d.y); })
.attr("height", function(d) { return y(d.y); });

where histogram is an array of Javascript objects, each object describing the start point, width, and frequency of a bucket. There are no SVG "rect" elements to begin with, so we can think of this as an "outer join" between the SVG rectangle elements and the histogram buckets, with the rectangles' attributes set on the elements of the resulting set. The enter() operator comes in handy here (and in most situations where the nodes we are trying to select do not exist yet in the DOM tree); append() then takes the name of the node to add to the document, which, in this case, is the SVG rect element.

One other thing I found interesting here is the object named "histogram". It's obtained by the following method call:

var histogram = d3.layout.histogram()(data);

It's a layout returned by d3.layout.histogram(), and it is both an object and a function. Since it is a function, it can be invoked with the parameter "data", which produces the array of buckets that gets bound to the rects above via data(); and since data() can also take a key function, if the data changes (which can happen if you change the filtering criteria of your query), the new buckets can be rebound to the right nodes of the document.

I attended a talk on d3 by Mike Dewar of bit.ly at a meetup recently, where I first got introduced to it. For more details, check the paper by the original authors.

Tuesday, March 20, 2012

EMC acquires Pivotal

Pivotal Labs, our collaborators on Casebook, got acquired by EMC. We had a talk in our office today on "Social meets Big Data". Big Data meets Agile. :)

Wednesday, March 7, 2012

Defended PhD thesis

Was back in Ames for a few days to defend my PhD thesis. Have never seen Ames so nice in March! Good to wrap up the pending work!

Friday, March 2, 2012

Strata 2012 (SF Bay Area)

I was in the Bay Area for the last 3 days, attending the Strata conference on Big Data: a gathering of data scientists from all over the US and abroad, companies demoing their big data products, and data scientists explaining how they are using machine learning in various domains. A few talks that I particularly liked:

1) Robert Lefkowitz's talk on how an array-based database model makes certain database operations, like comparing data points in a time series, faster. Pretty in-depth.

2) Jake Porway's talk on how Big Data can be used for the public sector, and what non-profits are doing for it. Good to see someone working on a set of problems that are relevant for a bigger audience.

3) Dr. Hal Varian's talk on how Google's search data can be used for predicting economic recessions.

Above all, it was great to meet Doug Cutting and Julian Hyde in person. Julian was generous with his time and suggestions about what might be the right tools for building our data warehouse for Casebook. A very useful conference indeed, in a place that I doubt shares the same sun as New York :)

Friday, January 6, 2012

RPostgreSQL for data analysis

Since the operational database for Casebook is PostgreSQL and I use R for exploratory data analysis, I was looking for a tool that marries the two, and I found RPostgreSQL quite handy. You make a connection to your PostgreSQL DB through the DBI interface, run a query, get the result back as a data frame, and the rest is plain R.
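
For instance, a minimal sketch (the connection details, table, and column names here are made up):

library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "casebook_dev", host = "localhost",
                 user = "me", password = "secret")

# The query result comes back as a plain data frame
placements <- dbGetQuery(con,
  "SELECT child_id, start_date, end_date FROM placements")
head(placements)   # the rest is plain R: plots, summaries, survival fits, ...

dbDisconnect(con)
dbUnloadDriver(drv)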