Over last couple of weeks, I have been experimenting with survival analysis techniques. We need this for Casebook analytics as we would eventually show the quantiles of how long children in placement stay in foster homes etc, and how that length of stay varies across different cohorts groups (so that users can ask questions like "Have the median length of stay reduced in 2011 compare to 2010?"). The technique we chose was the Kaplan-Meier estimator, which is a popular nonparametric estimator for the survival function We chose it for a few reasons, a) it's simple to implement, b) it deals well with staggered entry (not all children who entered placements in the same year entered on the same day) and censored data (many children who enter placements in a year do not exit till the end of the year, so these observations are right-censored). However, while surveying the literature, I did not find many work which dealt with staggered entries, except this one which has got cited repeatedly. I was interested in using the survival function for finding the median length of stay, which this paper does not address directly; and moreover, it takes the estimate of the survival function at regular intervals, which is not very useful when trying to find the median for the following reason: let's say the median length of stay is 90 days (with an s.d. of 10 days or so), and entries occur starting from January, but we do not see many exits until April. Thus, till the end of March, the KM estimate remains very close to 1, and it hits 0.5 as late as in June. So what should the estimate of median be? 6 months? But that's too longer than the theoretical median of 90 days!
life_table(1000, 90, 10);
Median length of stay = 90
point_in_time no_at_risk no_of_exists fraction_remains survival_prob
1 2008-01-31 83 0 1.0000000 1.00000000
2 2008-02-29 152 0 1.0000000 1.00000000
3 2008-03-31 240 16 0.9333333 0.93333333
4 2008-04-30 303 69 0.7722772 0.72079208
5 2008-05-31 332 73 0.7801205 0.56230466
6 2008-06-30 360 87 0.7583333 0.42641437
7 2008-07-31 356 76 0.7865169 0.33538209
8 2008-08-31 359 103 0.7130919 0.23915826
9 2008-09-30 335 96 0.7134328 0.17062335
10 2008-10-31 310 81 0.7387097 0.12604112
11 2008-11-30 317 77 0.7570978 0.09542546
12 2008-12-31 322 87 0.7298137 0.06964280
life_table(1000, 180, 10);
Median length of stay = 181
point_in_time no_at_risk no_of_exists fraction_remains survival_prob
1 2008-01-31 83 0 1.0000000 1.0000000
2 2008-02-29 158 0 1.0000000 1.0000000
3 2008-03-31 241 0 1.0000000 1.0000000
4 2008-04-30 315 0 1.0000000 1.0000000
5 2008-05-31 395 0 1.0000000 1.0000000
6 2008-06-30 474 10 0.9789030 0.9789030
7 2008-07-31 552 73 0.8677536 0.8494466
8 2008-08-31 574 88 0.8466899 0.7192178
9 2008-09-30 573 75 0.8691099 0.6250794
10 2008-10-31 595 75 0.8739496 0.5462879
11 2008-11-30 593 83 0.8600337 0.4698260
12 2008-12-31 596 84 0.8590604 0.4036089
However, as this article discusses, with staggered entry, it is common practice to shift the intervals so that all of them start from a common point in time, keeping the lengths of the intervals intact, and dealing with the right endpoints (which is now same as the length of the interval itself) as the "time-to-event".
With this left-shifting, I was curious about whether the accuracy of the Kaplan-Meier estimate would depend on the assumption of entries happening uniformly across the year. So, I experimented with two synthetic datasets:
a) the entries happen uniformly across the year: choosing the entry time into placement uniformly at random in a year, drawing the length of stay from a Normal distribution with some input mean and input standard deviation
b) the entries happen mostly in the first half of the year: tossing a coin that comes up heads with p = 0.8, choosing a date of entry from the first half of the year if the coin turns heads; otherwise choosing a date of entry from the second half of the year.
With both datasets, and with all intervals left-shifted to start at 0, however, Kaplan-Meier worked nicely to return the median - the line on the X-axis corresponding to the survival function being at 0.5 came close to the mean/median of the Normal distribution the length of stay was generated from. Here is an example from a sample of 1000 children, entries uniformly across the year with a median length of stay of 90 days and s.d. 2 days, times to events bucketed in buckets of size 3, and 90% confidence intervals for the survival probability. As the table points out, the median is around 89 days.
start_bucket end_bucket #risk #exits #censored fraction_remains surv_prob std_error 90% CI
1 83 85 1000 6 1 0.9939970 0.994 2.443356e-03 [0.98998 - 0.99802]
2 86 88 993 179 54 0.8146998 0.810 1.991064e-03 [0.80672 - 0.81328]
3 89 91 760 397 138 0.4254703 0.345 8.480460e-04 [0.34361 - 0.34639]
4 92 94 225 158 53 0.2040302 0.070 1.720673e-04 [0.06972 - 0.07028]
5 95 97 14 9 5 0.2173913 0.015 3.687156e-05 [0.01494 - 0.01506]
life_table(1000, 90, 10);
Median length of stay = 90
point_in_time no_at_risk no_of_exists fraction_remains survival_prob
1 2008-01-31 83 0 1.0000000 1.00000000
2 2008-02-29 152 0 1.0000000 1.00000000
3 2008-03-31 240 16 0.9333333 0.93333333
4 2008-04-30 303 69 0.7722772 0.72079208
5 2008-05-31 332 73 0.7801205 0.56230466
6 2008-06-30 360 87 0.7583333 0.42641437
7 2008-07-31 356 76 0.7865169 0.33538209
8 2008-08-31 359 103 0.7130919 0.23915826
9 2008-09-30 335 96 0.7134328 0.17062335
10 2008-10-31 310 81 0.7387097 0.12604112
11 2008-11-30 317 77 0.7570978 0.09542546
12 2008-12-31 322 87 0.7298137 0.06964280
life_table(1000, 180, 10);
Median length of stay = 181
point_in_time no_at_risk no_of_exists fraction_remains survival_prob
1 2008-01-31 83 0 1.0000000 1.0000000
2 2008-02-29 158 0 1.0000000 1.0000000
3 2008-03-31 241 0 1.0000000 1.0000000
4 2008-04-30 315 0 1.0000000 1.0000000
5 2008-05-31 395 0 1.0000000 1.0000000
6 2008-06-30 474 10 0.9789030 0.9789030
7 2008-07-31 552 73 0.8677536 0.8494466
8 2008-08-31 574 88 0.8466899 0.7192178
9 2008-09-30 573 75 0.8691099 0.6250794
10 2008-10-31 595 75 0.8739496 0.5462879
11 2008-11-30 593 83 0.8600337 0.4698260
12 2008-12-31 596 84 0.8590604 0.4036089
However, as this article discusses, with staggered entry, it is common practice to shift the intervals so that all of them start from a common point in time, keeping the lengths of the intervals intact, and dealing with the right endpoints (which is now same as the length of the interval itself) as the "time-to-event".
With this left-shifting, I was curious about whether the accuracy of the Kaplan-Meier estimate would depend on the assumption of entries happening uniformly across the year. So, I experimented with two synthetic datasets:
a) the entries happen uniformly across the year: choosing the entry time into placement uniformly at random in a year, drawing the length of stay from a Normal distribution with some input mean and input standard deviation
b) the entries happen mostly in the first half of the year: tossing a coin that comes up heads with p = 0.8, choosing a date of entry from the first half of the year if the coin turns heads; otherwise choosing a date of entry from the second half of the year.
With both datasets, and with all intervals left-shifted to start at 0, however, Kaplan-Meier worked nicely to return the median - the line on the X-axis corresponding to the survival function being at 0.5 came close to the mean/median of the Normal distribution the length of stay was generated from. Here is an example from a sample of 1000 children, entries uniformly across the year with a median length of stay of 90 days and s.d. 2 days, times to events bucketed in buckets of size 3, and 90% confidence intervals for the survival probability. As the table points out, the median is around 89 days.
start_bucket end_bucket #risk #exits #censored fraction_remains surv_prob std_error 90% CI
1 83 85 1000 6 1 0.9939970 0.994 2.443356e-03 [0.98998 - 0.99802]
2 86 88 993 179 54 0.8146998 0.810 1.991064e-03 [0.80672 - 0.81328]
3 89 91 760 397 138 0.4254703 0.345 8.480460e-04 [0.34361 - 0.34639]
4 92 94 225 158 53 0.2040302 0.070 1.720673e-04 [0.06972 - 0.07028]
5 95 97 14 9 5 0.2173913 0.015 3.687156e-05 [0.01494 - 0.01506]