For the last few days, I have been working on designing a classification system that can separate bot-infected sessions from benign ones. Our training data came in the form of two tables: one describing the sessions, along with their bot/user labels, and the other describing the HTTP requests made during those sessions. I created the two tables in PostgreSQL with some common columns (client IP, server IP, session ID) so that they could be joined. Some of the HTTP requests were for actual pages, whereas many more were for objects embedded within those pages (e.g., GIF files); we identified which requests were for pages using some heuristics. Since I was free to choose the features for building the classifier, I did some initial exploratory analysis, which revealed some important differences between the two types of traffic (a rough sketch of how the corresponding metrics could be computed from the joined tables follows the list):
1) The median number of page requests per user session was much lower than for bot sessions. This was probably because users stay on a page to read/view its actual content, whereas bots just request pages through automated scripts.
2) The median number of distinct pages requested per user session was also much lower than for bot sessions.
3) The median duration of user sessions was also much shorter than that of bot sessions, probably because real users leave once they are done with the content, whereas bots stay around to do their damage.
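To make that concrete, here is a minimal sketch, in pandas rather than raw SQL, of how the two tables could be joined on their common columns and the three session-level metrics computed. Every table and column name here (sessions.csv, requests.csv, client_ip, server_ip, session_id, url, is_page, ts) is an illustrative assumption, not our actual schema.

```python
import pandas as pd

keys = ["client_ip", "server_ip", "session_id"]   # the common join columns

# Hypothetical dumps of the two PostgreSQL tables
sessions = pd.read_csv("sessions.csv")                      # one row per session, with a bot/user label
requests = pd.read_csv("requests.csv", parse_dates=["ts"])  # one row per HTTP request

# Keep only the requests that the heuristics flagged as page requests
# (is_page assumed to be a boolean column)
pages = requests[requests["is_page"]]

# Metrics 1 and 2: number of page requests and number of distinct pages per session
page_counts = pages.groupby(keys)["url"].agg(
    n_page_requests="size",
    n_distinct_pages="nunique",
)

# Metric 3: session duration, taken as the span between first and last request
durations = requests.groupby(keys)["ts"].agg(
    duration=lambda t: t.max() - t.min(),
)

# Join everything back onto the labeled sessions table via the common columns
features = sessions.set_index(keys).join([page_counts, durations], how="left")
```

The same aggregation could of course be pushed down into PostgreSQL as a GROUP BY over the joined tables; the pandas version above is just easier to show in a few lines.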
We created three features based on these three metrics. Additionally, since we had data on the sequence of page URLs visited in each session, we extracted 2-grams from those sequences and, for each session and each 2-gram, kept a binary flag indicating whether that 2-gram of page URLs appeared in the session (i.e., whether the user/bot in that session visited those two URLs in that order). These 2-grams are the features that capture the sequential nature of the data. Since each 2-gram became a feature in its own right, the total number of features grew to more than 5,500, and because most 2-grams do not occur in most sessions, the result was a very sparse, high-dimensional matrix of the kind we often see in text mining.
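As a toy illustration of these 2-gram flags (the variable names and example URLs here are made up, not from the real data), something like scikit-learn's CountVectorizer with a custom analyzer and binary=True produces exactly this kind of sparse session-by-2-gram matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

def url_bigrams(urls):
    """Return the ordered pairs of consecutive page URLs in one session."""
    return [u1 + " -> " + u2 for u1, u2 in zip(urls, urls[1:])]

# Toy stand-in: the ordered page-URL sequence of each session
page_urls = [
    ["/home", "/products", "/cart", "/checkout"],   # e.g., a user-like session
    ["/home", "/home", "/home", "/products"],       # e.g., a bot-like session
]

# binary=True stores a 1/0 flag per 2-gram rather than a count
vectorizer = CountVectorizer(analyzer=url_bigrams, binary=True)
X_bigrams = vectorizer.fit_transform(page_urls)     # sparse sessions x 2-grams matrix

print(vectorizer.get_feature_names_out())
print(X_bigrams.toarray())
```

On the real sessions, this kind of construction is what yields the 5,500+ column sparse matrix mentioned above.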
We computed the information gain of these 2-gram-based (binary) features and sorted them in descending order of information gain, so that we could select as many of the top features as we wanted to build the model. More on that to follow...
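For a binary feature and a bot/user label, information gain is just the mutual information between the feature and the label, so one way to get this ranking (again a sketch with made-up stand-in data, not our actual code) is scikit-learn's mutual_info_classif with discrete_features=True:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import mutual_info_classif

# Toy stand-ins: 4 sessions x 3 binary 2-gram flags, plus bot/user labels
X = csr_matrix(np.array([[1, 0, 0],
                         [1, 1, 0],
                         [0, 1, 1],
                         [0, 0, 1]]))
y = np.array(["user", "user", "bot", "bot"])

# Information gain of each binary feature with respect to the label
ig = mutual_info_classif(X, y, discrete_features=True)

# Descending order of information gain; take as many top features as needed
order = np.argsort(ig)[::-1]
k = 2
top_k = order[:k]
print(top_k, ig[order])
```

(scikit-learn reports mutual information in nats rather than bits, but the ranking of the features is the same either way, which is all we need for selecting the top k.)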