Tuesday, December 4, 2012

Survival Analysis, Part II

I continued my work on survival analysis, this time on real Casebook data, as Casebook went live on July 5. In child welfare, finding the length of stay by entry cohort has always been hard, as caseworkers had to wait till all or most of the children who entered care in a given time period exited care, to get an idea of the aggregate length of stay. To enable caseworkers have an idea of how long kids who are currently in care are going to stay in care, we treated the exits from care as "events" in survival data, and used Cox regression to predict the length of stay. We used a bunch of covarites, like entry cohort (2008-2012), age at the time of entering care, gender, race, parents' alcohol and drug abuse problems, child's behavioral problems, child's disability, history of domestic violence, family structure etc.

Since Cox model returns a survival function, and not a single value of length of stay, even for a kid for whom we know the actual value of length of stay, computing the residual for the model is not as simple as computing (expected - observed), and had to be done rather indirectly. There are various residuals that focus on different aspects of a Cox model, and I was mostly interested in a residual that gives the overall goodness of fit of the model with respect to the observations. The Cox-Snell residual is one such residual. I got fascinated by the underlying idea: it is based on the concept of probability integral transform , which says that if X is a continuous random variable with cumulative distribution function F_X, then the random variable defined as Y = F_X(X) has a uniform distribution in [0, 1]. 

Now, for any continuous random variable X, then, since S(X) = Pr(X > x) = 1 - F_X(X), F_X(X) having a uniform distribution in [0, 1] implies S(X) too has a uniform distribution in [0, 1]. It can be shown that the cumulative hazard rate evaluated at the actual exit time, H(X), which is known to be equal to -ln[S(X)], then, follows an exponential distribution with rate 1. 

Now, the Cox-Snell residual is defined as the cumulative hazard rate of the actual survival time X, and hence, by what we discussed in the above paragraph, if the estimates of the coefficients are close to the actual values of the coefficients, then the Cox-Snell residual should have an exponential distribution with rate 1.

However, the Cox-Snell residual is itself also a continuous random variable, so, by the probability integral transform property and what else we discussed above, the cumulative hazard rate of the Cox-Snell residual should also follow an exponential distribution with rate 1.

Combining all we just said, if we plot the cumulative hazard rate of the Cox-Snell residual vs the Cox-Snell residual, then, the P-P plot should give rise to a scatterplot where the points lie approximately along the line y = x. In practice, the cumulative hazard rate of the Cox-Snell residual is evaluated by the Nelson-Aalen Estimator, and I found this piece of code pretty useful in doing all this. For a theoretical discussion, one should refer to Section 11.2 of Klein and Moeschberger.

No comments:

Post a Comment