Thursday, December 19, 2013

Feature design for classification of network traffic

For the last few days, I have been working on designing a classification system that can separate bot-infected sessions from benign ones. Our training data came in the form of two tables: one described the sessions along with their bot/user labels, and the other described the HTTP requests made during those sessions. I created two tables in PostgreSQL with some common columns (client IP, server IP, session ID) to join the two. Some of the HTTP requests were for the actual pages, whereas many more were for the objects embedded within those pages (e.g., gif files); we identified which requests were for pages with some heuristics. Since I was free to choose the features for building the classifier, I did some initial exploratory analysis, which revealed important differences between the two types of traffic:

1) The median number of page requests per user session was much lower than that per bot session. This was probably because users stay on pages and read/view the actual content, whereas bots just ask for pages through automated scripts.

2) The median number of distinct pages requested per user session was also much lower than that per bot session.

3) The median duration of a user session was also much shorter than that of a bot session, probably because real users leave when they are done with the content, whereas bots stay on to do their damage.
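To make this concrete, here is a minimal pandas sketch of how these three per-session metrics could be computed from the requests table; the column names (session_id, url, timestamp, is_page) are placeholders rather than our actual schema, and the real computation was done in PostgreSQL.

```python
import pandas as pd

def session_features(requests: pd.DataFrame) -> pd.DataFrame:
    """Per-session page-request count, distinct-page count, and duration.

    Assumes hypothetical columns: session_id, url, timestamp (datetime), is_page (bool).
    """
    pages = requests[requests["is_page"]]                  # page-level requests only
    by_session = pages.groupby("session_id")
    feats = pd.DataFrame({
        "n_page_requests": by_session["url"].size(),       # metric 1
        "n_distinct_pages": by_session["url"].nunique(),   # metric 2
    })
    # Metric 3: session duration from first to last request (pages and embedded objects)
    ts = requests.groupby("session_id")["timestamp"]
    feats["duration_sec"] = (ts.max() - ts.min()).dt.total_seconds()
    return feats
```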

We created three features based on these three metrics. Additionally, since we had data on the sequence of page URLs visited in a session, we extracted 2-grams out of those sequences and, for each session and each 2-gram, kept a flag indicating whether that 2-gram of page URLs appeared in that session (i.e., whether the user/bot in that session visited those two URLs in that order). The 2-grams are features that capture the sequential nature of the data. Since each 2-gram thus became a feature on its own, the total number of features rose to more than 5,500. And since most 2-grams do not occur in most sessions, this gave rise to a very sparse, high-dimensional matrix, as we often see in text mining.
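A rough sketch of the 2-gram extraction, assuming the page URLs of each session are already available as an ordered list (the data structures below are purely illustrative):

```python
def bigram_flags(session_urls):
    """session_urls: dict mapping session_id -> list of page URLs in visit order.

    Returns a dict mapping session_id -> set of (url_i, url_{i+1}) pairs; a session's
    binary feature vector has a 1 for each 2-gram present in its set, 0 elsewhere.
    """
    return {sid: set(zip(urls, urls[1:])) for sid, urls in session_urls.items()}

# Example: the session /home -> /login -> /account yields the 2-grams
# ('/home', '/login') and ('/login', '/account').
flags = bigram_flags({"s1": ["/home", "/login", "/account"]})
```

The union of all observed 2-grams defines the feature space; since each session contains only a handful of them, the resulting session-by-2-gram matrix is extremely sparse.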

We computed the information gain of these 2-gram-based (binary) features and ordered them in descending order of information gain, so that we could select as many of the top features as we want when building the model. More on that to follow...
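A minimal numpy sketch of that ranking, assuming the 2-gram flags sit in a 0/1 matrix X (sessions by features) and the bot/user labels in a 0/1 vector y; this illustrates the computation rather than our actual pipeline.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a 0/1 label vector."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """IG(Y; X) = H(Y) - H(Y|X) for a binary feature x and binary label y."""
    gain = entropy(y)
    for v in (0, 1):
        mask = (x == v)
        if mask.any():
            gain -= mask.mean() * entropy(y[mask])
    return gain

def rank_features(X, y):
    """Column indices of X in descending order of information gain."""
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-gains)
```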

Sunday, September 15, 2013

Completed Linear Algebra course

I completed the Linear Algebra course by Dr. Philip Klein this summer, and here is my certificate. It was a rigorous course in which I revised important concepts of Linear Algebra and implemented related algorithms in Python.

Monday, August 19, 2013

Joined Impetus

I joined Impetus Technologies as a Data Scientist in New York. Looking forward to working on challenging analytics projects with a more diverse mix of technologies...

Wednesday, April 3, 2013

Low-dimensional visualization of Casebook data

I recently started exploring biplot approximation for Casebook data, as some aspects of Casebook data make it inherently high-dimensional (e.g., when a "removal episode" starts for a child, we collect, through quizzes, a bunch of data on the substance abuse history of the child and the parents, whether the caretaker was unable to cope, whether the child has a disability, and so on), and hence comparing entry cohorts, and even children, in terms of these attributes is interesting. I have tried two types of biplots so far: 1) a biplot based on PCA and 2) correspondence analysis. Both techniques are based on SVD, but they are applied in different situations.

The PCA-based biplot is used mostly when we have observations on attributes that are continuous in nature. For example, for each entry cohort from 2008 to 2012, I took the percentage of removal episodes with male children, and the percentages of removal episodes where the children were reported to have problems like substance abuse, disability, etc. So my "observations" were the entry cohorts, and my "attributes" were these percentages. The nice thing about the PCA-based biplot is that it plots the "observations" and the "attributes" on the same 2D plot, which reveals how close or far the observations are from each other, how close or far the attributes are from each other, and how the attributes are related to the observations. This paper by Gabriel laid the foundation of the PCA-based biplot. In summary, the PCA-based biplot approximation takes the projection of each observation and each attribute along the first two principal components and uses them as the coordinates on a 2D plot.

Some of the interesting findings from our data using the PCA-based biplot were: the child's drug abuse, the child's alcohol abuse and physical abuse were closely related; the parent's alcohol abuse, the caretaker's inability to cope, inadequate housing for the child and relinquishment were once again very close; and so were the child's behavioral problems and the incarceration of a parent. The first principal component explained 49% of the variation in the data, while the second explained 32%, so 81% of the total variance was explained by the first two. Each principal component was a linear combination of 16 variables, and the total variance was explained by 5 principal components.
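As a rough illustration (not the actual code I used), a PCA-based biplot can be produced with numpy and matplotlib by taking the SVD of the column-centered cohort-by-attribute matrix:

```python
import numpy as np
import matplotlib.pyplot as plt

def pca_biplot(X, obs_labels, attr_labels):
    """X: observations (entry cohorts) in rows, continuous attributes in columns."""
    Xc = X - X.mean(axis=0)                       # center each attribute
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    obs_xy = U[:, :2] * s[:2]                     # observation scores on PC1, PC2
    attr_xy = Vt[:2].T                            # attribute loadings on PC1, PC2
    explained = s**2 / np.sum(s**2)               # variance explained per component

    fig, ax = plt.subplots()
    ax.scatter(obs_xy[:, 0], obs_xy[:, 1])
    for (x, y), label in zip(obs_xy, obs_labels):
        ax.annotate(label, (x, y))
    for (x, y), label in zip(attr_xy, attr_labels):
        ax.arrow(0, 0, x, y, head_width=0.02)     # attributes drawn as arrows from the origin
        ax.annotate(label, (x, y))
    ax.set_xlabel(f"PC1 ({explained[0]:.0%} of variance)")
    ax.set_ylabel(f"PC2 ({explained[1]:.0%} of variance)")
    return fig
```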

Correspondence analysis, on the other hand, is used to see how two (or more) categorical covariates are related to each other; in that sense, it is closely related to the chi-square test of independence. I found this document a fairly easy-to-understand and very intuitive explanation of CA, while this one is more mathematical. I also liked the way this article compares the inertia of a row in a contingency table with the physical concept of angular inertia, and especially the fact that it makes the following point:

"Correspondence analysis provides a means of representing a table of  distances in a graphical form, with rows represented by points, so that the distances between points approximate the  distances between the rows they represent.".

In our case, the permanency outcome of a child can be adoption, guardianship or reunification (the more desirable ones); or transfer to another placement, transfer to collaborative care, or the kid running away resulting in a dismissal of wardship (the less desirable ones). On the other hand, the type of the provider for a removal episode can be a placement provider, a residential resource, a foster family, or even a person. We created the contingency table with the permanency outcomes as the rows and the provider types as the columns, and applied correspondence analysis to it. The important structure that CA revealed was that placement in a foster family leads more often to adoption, guardianship and, to some extent, reunification - in summary, to the more desirable outcomes - whereas placement with a placement provider or a residential resource leads more often to emancipation (aging out), transfer to another agency, or the child being placed in collaborative care.

The first two eigenvalues were 0.065 and 0.0145, and the total inertia (which can be shown to be the chi-square statistic for the contingency table divided by the sum of all cell values) was 0.084, so the first two dimensions explained 0.065/0.084 ≈ 77% and 0.0145/0.084 ≈ 17% of the inertia, respectively. Among the rows, the biggest contributors to the inertia of 0.084 were adoption (26%), transfer to another agency (25%) and guardianship (22%); among the columns, the biggest contributors were residential resource (40.6%), placement provider (26%) and foster family (24%).
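For reference, here is a compact numpy sketch of simple correspondence analysis on a contingency table like ours (outcomes as rows, provider types as columns); it follows the standard SVD-of-standardized-residuals formulation rather than the exact library call I used.

```python
import numpy as np

def correspondence_analysis(N):
    """N: contingency table of counts (rows = outcomes, columns = provider types)."""
    P = N / N.sum()                                      # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = s**2                                       # principal inertias (eigenvalues)
    total_inertia = inertia.sum()                        # = chi-square statistic / grand total
    row_coords = (U * s) / np.sqrt(r)[:, None]           # row principal coordinates
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]        # column principal coordinates
    return row_coords, col_coords, inertia / total_inertia
```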

Wednesday, January 30, 2013

Numeric column functions in VoltDB 3.0

I made my first contribution to the in-memory database VoltDB, the product of the database startup of the same name, founded by the database pioneer Dr. Michael Stonebraker. The contribution was a set of numeric column functions: ABS, FLOOR, CEILING, SQRT, EXP and POWER. The feature shipped as part of VoltDB 3.0, and I got a mention on Twitter from VoltDB.


Friday, December 21, 2012

Accelerated Failure Time Models

I continued my experiments with survival analysis, this time with Accelerated Failure Time (AFT) models. Although the Cox-Snell residuals, as explained in this post, indicated that the Cox model was a pretty good fit, the Cox model relies on the proportional hazards assumption, which holds only approximately for our data, and I was interested in seeing whether we could do away with it. The Accelerated Failure Time model instead assumes that the event times (in our case, the actual times when kids exit from care) come from a theoretical distribution such as the Weibull, log-logistic or exponential. The graphical test to check whether the exit times follow a Weibull distribution (apart from drawing a histogram of exit times with equal bucket widths, which I did too) is to plot log(-log(S(t))) against log(t), where S(t) is the survival function estimated by Kaplan-Meier; if the assumption holds, the points should fall on a straight line (an explanation is here). What we got was roughly a straight line. The equivalent check for the log-logistic distribution is to plot log((1 - S(t))/S(t)) against log(t), which should again be a straight line if the assumption holds; the curve I got was almost quadratic, so I pursued the Weibull. I used the survreg function from the survival library. For prediction with the AFT Weibull model, I used the generic predict function of R, computing the quantiles and plotting them in a reversed way, using ideas as in this lecture.
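The analysis itself was done in R, but the Weibull check is easy to sketch in Python as well; the snippet below (using lifelines, with placeholder arrays standing in for the Casebook data) simply estimates S(t) by Kaplan-Meier and plots log(-log(S(t))) against log(t).

```python
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

def weibull_check(durations, event_observed):
    """Graphical Weibull check: the points should fall roughly on a straight line."""
    kmf = KaplanMeierFitter().fit(durations, event_observed)
    surv = kmf.survival_function_["KM_estimate"]
    surv = surv[(surv > 0) & (surv < 1)]          # keep values where both logs are defined
    t = surv.index.values
    plt.scatter(np.log(t), np.log(-np.log(surv.values)))
    plt.xlabel("log(t)")
    plt.ylabel("log(-log(S(t)))")
    plt.show()
```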

However, when I did the graphical check with the Cox-Snell residuals and their cumulative hazards, as I had done for the Cox model, it seemed that the Cox model was a better fit. The reason was that the estimated survival probabilities of the children who exited, computed at their actual exit times, did not follow a uniform distribution very well, and the graphical check with Cox-Snell residuals rests on the assumption that they would.

However, when we plot, per entry cohort year (2008 through 2012), the three estimated survival functions given by Kaplan-Meier, Cox regression and AFT regression (for the last two, we apply the model to an "average kid", i.e., a fictitious kid with all covariate values set to their respective means), the last two come pretty close to each other. Both also remain pretty close to the Kaplan-Meier curve for the older entry cohorts, but farther from it for the more recent entry cohorts. This is actually what we want: for the older cohorts, most (93-95%) of the kids have exited care, but for the more recent cohorts, only a few have, so what Kaplan-Meier gives is essentially a biased estimate, based mostly on the kids who have already left - and for those kids, the length of stay in care for the 2012 entry cohort, as of today, cannot be more than about 350 days, and so on. The pattern seen in the older cohorts establishes our trust in the model, and we expect the Kaplan-Meier estimate and the Cox/AFT estimates to come closer to each other as more kids from the recent cohorts exit care and the model gets updated.
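A hedged sketch of that per-cohort comparison, written with Python's lifelines rather than the R functions I actually used (the column names and the WeibullAFTFitter choice are assumptions):

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter, WeibullAFTFitter

def compare_models(df, duration_col="los_days", event_col="exited"):
    """Overlay KM, Cox, and Weibull AFT survival curves for one entry cohort."""
    kmf = KaplanMeierFitter().fit(df[duration_col], df[event_col], label="Kaplan-Meier")
    cph = CoxPHFitter().fit(df, duration_col=duration_col, event_col=event_col)
    aft = WeibullAFTFitter().fit(df, duration_col=duration_col, event_col=event_col)

    # The "average kid": every covariate fixed at its mean value
    avg = df.drop(columns=[duration_col, event_col]).mean().to_frame().T

    ax = kmf.plot_survival_function()
    cox_sf = cph.predict_survival_function(avg)
    cox_sf.columns = ["Cox (average kid)"]
    cox_sf.plot(ax=ax)
    aft_sf = aft.predict_survival_function(avg)
    aft_sf.columns = ["Weibull AFT (average kid)"]
    aft_sf.plot(ax=ax)
    ax.set_xlabel("days in care")
    ax.set_ylabel("estimated S(t)")
    plt.show()
```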

Tuesday, December 4, 2012

Survival Analysis, Part II

I continued my work on survival analysis, this time on real Casebook data, as Casebook went live on July 5. In child welfare, estimating the length of stay by entry cohort has always been hard, as caseworkers had to wait until all or most of the children who entered care in a given period had exited care to get an idea of the aggregate length of stay. To give caseworkers an idea of how long kids who are currently in care are going to stay, we treated exits from care as the "events" in survival data and used Cox regression to predict the length of stay. We used a bunch of covariates, like entry cohort (2008-2012), age at the time of entering care, gender, race, parents' alcohol and drug abuse problems, child's behavioral problems, child's disability, history of domestic violence, family structure, etc.

Since the Cox model returns a survival function, and not a single value of length of stay, computing the residual for the model, even for a kid whose actual length of stay we know, is not as simple as computing (expected - observed), and has to be done rather indirectly. There are various residuals that focus on different aspects of a Cox model, and I was mostly interested in one that gives the overall goodness of fit of the model with respect to the observations. The Cox-Snell residual is one such residual. I was fascinated by the underlying idea: it is based on the concept of the probability integral transform, which says that if X is a continuous random variable with cumulative distribution function F_X, then the random variable Y = F_X(X) has a uniform distribution on [0, 1].

Now, for any continuous random variable X, since S(x) = Pr(X > x) = 1 - F_X(x), the fact that F_X(X) has a uniform distribution on [0, 1] implies that S(X) = 1 - F_X(X) does too. It can then be shown that the cumulative hazard evaluated at the actual exit time, H(X), which is equal to -ln[S(X)], follows an exponential distribution with rate 1.
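This fact is easy to verify numerically; here is a quick numpy check (the Weibull below is arbitrary, chosen only for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
shape, scale = 1.5, 100.0
x = scale * rng.weibull(shape, size=100_000)   # "exit times" from an arbitrary Weibull
s_of_x = np.exp(-(x / scale) ** shape)         # true survival function evaluated at X
h_of_x = -np.log(s_of_x)                       # cumulative hazard H(X) = -ln S(X)

# For an Exp(1) variable both the mean and the variance are 1
print(h_of_x.mean(), h_of_x.var())
```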

Now, the Cox-Snell residual is defined as the estimated cumulative hazard evaluated at the actual survival time X, and hence, by the argument in the paragraph above, if the estimated coefficients are close to the actual coefficients, the Cox-Snell residuals should follow an exponential distribution with rate 1.

However, the Cox-Snell residual is itself a continuous random variable, so, by the probability integral transform property and the same argument, the cumulative hazard of the Cox-Snell residual should also follow an exponential distribution with rate 1.

Combining all of this: if we plot the cumulative hazard of the Cox-Snell residuals against the Cox-Snell residuals themselves, the resulting scatterplot should have its points lying approximately along the line y = x. In practice, the cumulative hazard of the Cox-Snell residuals is estimated by the Nelson-Aalen estimator, and I found this piece of code pretty useful in doing all this. For a theoretical discussion, one should refer to Section 11.2 of Klein and Moeschberger.
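For completeness, here is what the same diagnostic might look like in Python with lifelines (the original analysis used R; the martingale-residual route and the column names below are assumptions): the Cox-Snell residual equals the event indicator minus the martingale residual, and its Nelson-Aalen cumulative hazard is plotted against the residual itself.

```python
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter, NelsonAalenFitter

def cox_snell_plot(df, duration_col="los_days", event_col="exited"):
    cph = CoxPHFitter().fit(df, duration_col=duration_col, event_col=event_col)
    martingale = cph.compute_residuals(df, kind="martingale")["martingale"]
    cs = df[event_col].values - martingale.values        # Cox-Snell residuals

    # Treat the residuals as "survival times", keeping the original censoring flags
    naf = NelsonAalenFitter().fit(cs, event_observed=df[event_col])
    plt.scatter(cs, naf.cumulative_hazard_at_times(cs))
    plt.plot([0, cs.max()], [0, cs.max()])               # reference line y = x
    plt.xlabel("Cox-Snell residual")
    plt.ylabel("Nelson-Aalen cumulative hazard of residual")
    plt.show()
```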