Thursday, May 10, 2012

Survival analysis

Over last couple of weeks, I have been experimenting with survival analysis techniques. We need this for Casebook analytics as we would eventually show the quantiles of how long children in placement stay in foster homes etc, and how that length of stay varies across different cohorts groups (so that users can ask questions like "Have the median length of stay reduced in 2011 compare to 2010?"). The technique we chose was the Kaplan-Meier estimator, which is a popular nonparametric estimator for the survival function  We chose it for a few reasons, a) it's simple to implement, b) it deals well with staggered entry (not all children who entered placements in the same year entered on the same day) and censored data (many children who enter placements in a year do not exit till the end of the year, so these observations are right-censored). However, while surveying the literature, I did not find many work which dealt with staggered entries, except this one which has got cited repeatedly. I was interested in using the survival function for finding the median length of stay, which this paper does not address directly; and moreover, it takes the estimate of the survival function at regular intervals, which is not very useful when trying to find the median for the following reason: let's say the median length of stay is 90 days (with an s.d. of 10 days or so), and entries occur starting from January, but we do not see many exits until April. Thus, till the end of March, the KM estimate remains very close to 1, and it hits 0.5 as late as in June. So what should the estimate of median be? 6 months? But that's too longer than the theoretical median of 90 days!

life_table(1000, 90, 10);
Median length of stay = 90
   point_in_time no_at_risk no_of_exists fraction_remains survival_prob
1     2008-01-31         83            0        1.0000000    1.00000000
2     2008-02-29        152            0        1.0000000    1.00000000
3     2008-03-31        240           16        0.9333333    0.93333333
4     2008-04-30        303           69        0.7722772    0.72079208
5     2008-05-31        332           73        0.7801205    0.56230466
6     2008-06-30        360           87        0.7583333    0.42641437
7     2008-07-31        356           76        0.7865169    0.33538209
8     2008-08-31        359          103        0.7130919    0.23915826
9     2008-09-30        335           96        0.7134328    0.17062335
10    2008-10-31        310           81        0.7387097    0.12604112
11    2008-11-30        317           77        0.7570978    0.09542546
12    2008-12-31        322           87        0.7298137    0.06964280


life_table(1000, 180, 10);
Median length of stay = 181
   point_in_time no_at_risk no_of_exists fraction_remains survival_prob
1     2008-01-31         83            0        1.0000000     1.0000000
2     2008-02-29        158            0        1.0000000     1.0000000
3     2008-03-31        241            0        1.0000000     1.0000000
4     2008-04-30        315            0        1.0000000     1.0000000
5     2008-05-31        395            0        1.0000000     1.0000000
6     2008-06-30        474           10        0.9789030     0.9789030
7     2008-07-31        552           73        0.8677536     0.8494466
8     2008-08-31        574           88        0.8466899     0.7192178
9     2008-09-30        573           75        0.8691099     0.6250794
10    2008-10-31        595           75        0.8739496     0.5462879
11    2008-11-30        593           83        0.8600337     0.4698260
12    2008-12-31        596           84        0.8590604     0.4036089

        However, as this article discusses, with staggered entry, it is common practice to shift the intervals so that all of them start from a common point in time, keeping the lengths of the intervals intact, and dealing with the right endpoints (which is now same as the length of the interval itself) as the "time-to-event".

     With this left-shifting, I was curious about whether the accuracy of the Kaplan-Meier estimate would depend on the assumption of entries happening uniformly across the year. So, I experimented with two synthetic datasets:

a) the entries happen uniformly across the year: choosing the entry time into placement uniformly at random in a year, drawing the length of stay from a Normal distribution with some input mean and input standard deviation

b) the entries happen mostly in the first half of the year: tossing a coin that comes up heads with p = 0.8, choosing a date of entry from the first half of the year if the coin turns heads; otherwise choosing a date of entry from the second half of the year.

   With both datasets, and with all intervals left-shifted to start at 0, however, Kaplan-Meier worked nicely to return the median - the line on the X-axis corresponding to the survival function being at 0.5 came close to the mean/median of the Normal distribution the length of stay was generated from. Here is an example from a sample of 1000 children, entries uniformly across the year with a median length of stay of 90 days and s.d. 2 days, times to events bucketed in buckets of size 3, and 90% confidence intervals for the survival probability. As the table points out, the median is around 89 days.

   start_bucket end_bucket #risk      #exits       #censored fraction_remains surv_prob std_error            90% CI
1              83            85       1000           6                1        0.9939970         0.994      2.443356e-03     [0.98998 - 0.99802]
2              86            88        993         179             54        0.8146998         0.810      1.991064e-03     [0.80672 - 0.81328]
3              89            91        760         397            138       0.4254703         0.345      8.480460e-04     [0.34361 - 0.34639]
4              92            94        225         158              53       0.2040302         0.070      1.720673e-04     [0.06972 - 0.07028]
5              95            97         14           9                   5       0.2173913         0.015      3.687156e-05     [0.01494 - 0.01506]