Saturday, March 31, 2012

STOMP with Rails

I am looking for some message-oriented middleware to make the ETL between the operational Casebook database and the data warehouse DB (with a star schema) real-time (in fact, Pentaho uses JMS for real-time ETL, but that comes only with the enterprise edition). This week, I experimented with Apache ActiveMQ, which speaks STOMP, among other protocols (it does not yet support AMQP, though). I checked this tutorial on the ActiveMessaging plugin, and also this tutorial from the Rails Recipes book. Finally, I decided to write a PoC of my own.

ActiveMessaging brings in the idea of a Processor class, which is basically a subscriber that listens to a message queue. In my demo app, the user can create orders through the front-end; an order gets saved to the database through the usual controller-model handshake, with an initial status of 'NEW'. The controller also writes the data about the newly created order to the message broker. The OrderProcessor, which listens to the same message broker, parses the XML-formatted data using the REXML API, finds the order by its id, and updates the status to 'SHIPPED'. Note that, although the controller and the processor have a publisher-subscriber relationship with each other, they are both clients of the ActiveMQ server, which, in this case, speaks STOMP.
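
For reference, here is a rough sketch of the two ends of the PoC. The :orders destination, the queue name and the controller details are illustrative rather than the exact code; the STOMP connection itself lives in ActiveMessaging's config/broker.yml, pointed at ActiveMQ's STOMP port.

# config/messaging.rb -- map a logical destination to a queue on the broker
ActiveMessaging::Gateway.define do |s|
  s.destination :orders, '/queue/Orders'
end

# app/controllers/orders_controller.rb -- save the order, then publish it
class OrdersController < ApplicationController
  include ActiveMessaging::MessageSender
  publishes_to :orders

  def create
    @order = Order.create(params[:order].merge(:status => 'NEW'))
    publish :orders, @order.to_xml        # the order's XML goes to the broker
    redirect_to @order
  end
end

# app/processors/order_processor.rb -- the subscriber
require 'rexml/document'

class OrderProcessor < ApplicationProcessor
  subscribes_to :orders

  def on_message(message)
    doc = REXML::Document.new(message)    # message is the XML body published above
    id  = doc.elements['order/id'].text
    Order.find(id).update_attribute(:status, 'SHIPPED')
  end
end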

Next, I will look at how the "publish" step to the message broker can be triggered from the operational database as inserts, updates and deletes take place there.

Friday, March 23, 2012

d3 for analytics dashboard

This week, I worked on a proof-of-concept for the analytics dashboard for Casebook. We are using a star schema in PostgreSQL for the data warehouse. For the front-end rendering, where we need histograms, box-and-whisker plots and line charts (maybe more... we are advocates of agile, after all), I am using the d3 JavaScript library. I found it pretty cool for a number of reasons. First, all the rendering is done with SVG, so if you want to draw a histogram, you draw a set of rectangles; if you want to draw a box plot, you draw all the lines yourself; and so on. You learn a small set of basic tools and use them repeatedly to get the job done. I have used Google Charts before for implementing analytics dashboards, but since SVG lets you draw with geometric primitives, d3 gives you much finer control over the output.

The second thing I found cool is the concept of "joining" your data with the SVG elements of the page. A histogram can be drawn with the following piece of nifty code:

chart.selectAll("rect")                                  // empty selection: no rects exist yet
    .data(histogram)                                     // join the buckets to the (missing) rects
  .enter().append("rect")                                // one new rect per unmatched bucket
    .attr("width", x.rangeBand())
    .attr("x", function(d) { return x(d.x); })
    .attr("y", function(d) { return height - y(d.y); })  // SVG's y axis grows downward
    .attr("height", function(d) { return y(d.y); });

where histogram is an array of JavaScript objects, each describing the start point, width and frequency of a bucket. There are no SVG "rect" elements to begin with, so we can think of this as an "outer join" between the SVG rectangle elements and the histogram buckets, with the rectangles' attributes set on the elements of the resulting set. The enter() operator comes in handy here (and in most situations where the nodes we are trying to select do not yet exist in the DOM tree): it returns a placeholder selection for the unmatched buckets, and append() then creates the named node, in this case an SVG rect, for each of them.
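
To make the snippet above self-contained, here is a minimal sketch of the setup it assumes, written against the d3 2.x API in use here; the sample data and chart dimensions are placeholders of my own:

var data = [1, 2, 2, 3, 5, 8, 8, 13];          // placeholder for the numbers returned by the warehouse query
var histogram = d3.layout.histogram()(data);   // buckets, each with x (start), dx (width) and y (frequency)

var width = 400, height = 200;

var x = d3.scale.ordinal()                     // one band per bucket
    .domain(histogram.map(function(d) { return d.x; }))
    .rangeBands([0, width]);

var y = d3.scale.linear()                      // frequency -> pixel height
    .domain([0, d3.max(histogram, function(d) { return d.y; })])
    .range([0, height]);

var chart = d3.select("body").append("svg")    // the container the rects are appended to
    .attr("width", width)
    .attr("height", height);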

One other thing I found interesting here is the object named "histogram". It's obtained by the following method call:

var histogram = d3.layout.histogram()(data);

d3.layout.histogram() returns a layout that is both an object and a function: it carries configuration methods such as bins() and value(), and, since it is a function, it can be invoked with the parameter "data" to produce the array of buckets that gets bound above. That also makes rebinding straightforward: if the data changes (which can happen if you change the filtering criteria of your query), you invoke the layout on the new data and rebind the resulting buckets to the nodes of the document.
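
To make the rebinding concrete, here is a hypothetical update function for the dashboard; the bins(10) setting and the reuse of the chart, x, y and height variables from the sketch above are assumptions of mine:

var layout = d3.layout.histogram().bins(10);   // the layout itself, with a configuration method applied

function update(newData) {
  var bins = layout(newData);                  // recompute the buckets from the new data

  x.domain(bins.map(function(d) { return d.x; }));                 // re-derive the scales
  y.domain([0, d3.max(bins, function(d) { return d.y; })]);

  var rects = chart.selectAll("rect").data(bins);
  rects.enter().append("rect");                // rects for buckets that did not exist before
  rects.exit().remove();                       // drop rects for buckets that disappeared

  chart.selectAll("rect")                      // style every rect, old and new, from its bound bucket
      .attr("width", x.rangeBand())
      .attr("x", function(d) { return x(d.x); })
      .attr("y", function(d) { return height - y(d.y); })
      .attr("height", function(d) { return y(d.y); });
}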

I attended a talk on d3 by Mike Dewar of bit.ly at a meetup recently, where I first got introduced to it. For more details, check the paper by the original authors.

Tuesday, March 20, 2012

EMC acquires Pivotal

Pivotal Labs, our collaborators on Casebook, got acquired by EMC. We had a talk in our office today on "Social meets Big Data". Big Data meets Agile. :)

Wednesday, March 7, 2012

Defended PhD thesis

Was back in Ames for a few days to defend my PhD thesis. I have never seen Ames so nice in March! Good to wrap up the pending work!

Friday, March 2, 2012

Strata 2012 (SF Bay Area)

I was in the Bay Area for the last three days, attending the Strata conference on Big Data: a gathering of data scientists from all over the US and abroad, companies demoing their big data products, and data scientists explaining how they are using machine learning in various domains. A few talks that I particularly liked are the following:

1) Robert Lefkowitz's talk on how an array-based database model makes certain database operations, like comparing data points in a time series, faster. Pretty in-depth.

2) Jake Porway's talk on how Big Data can be used for the public sector, and what non-profits are doing with it. Good to see someone working on a set of problems that are relevant to a broader audience.

3) Dr. Hal Varian's talk on how Google's search data can be used for predicting economic recessions.

Above all, it was great to meet Doug Cutting and Julian Hyde in person. Julian was very generous with his time and suggestions about what might be the right tools for building our data warehouse for Casebook. A very useful conference indeed, in a place that makes me wonder whether it shares the same sun as New York :)