Reflections on the Open Analytics Summit

Back in June, I attended the one-day Open Analytics Summit. We aren't really doing much with analytics or big data here at CCNMTL (yet), but there are many conversations and projects happening around campus, and I wanted to get a better sense of the kinds of value these methods are yielding. These issues are sure to be central to much of the research and instruction at the Institute for Data Sciences and Engineering, and they have already come up in a number of Columbia projects we have been involved with, such as the Declassification Engine and the Open Syllabus Project.

The conference was interesting, but I was a bit puzzled by the format. The talks were all 15 or 30 minutes long, and the speakers rarely left any time for questions. It was almost like a day of long lightning talks: the talks weren't really long enough to get into much depth, but I did get a flavor of the kind of work happening in this field.

Some of the conference highlights:

  • Nearly everyone is using Hadoop. No real surprises there.
  • I saw an impressive demo of Elasticsearch (a distributed search engine in the same space as Solr), combined with logstash (which we are now using) and a web-based querying tool, Kibana. See demo.kibana.com and this writeup and video to see how this stack is used to query the Twitter firehose. It's interesting to think about the different kinds of data that resemble log formats and can be coaxed into this style of analysis (a rough query sketch appears after this list).
  • Apache Drill, a FOSS implementation of Google's Dremel, for using SQL to query multiple data stores, including NoSQL ones. The speaker quipped that the Apache foundation is borrowing its roadmap from Google's whitepapers.
  • DataNitro - I thought this was super cool, even though it's not open (though it is gratis for students) and Windows-only. It basically treats Excel as a front-end client (or the View in an MVC system) for interacting with server-side Python, and it includes a Python interpreter inside of Excel for manipulating data. It looked really powerful for teaching, with plenty of IPython overlap, but it has a pretty well-defined niche. The author hopes that tools like these might do a better job with provenance and prevent data disasters like the Reinhart & Rogoff spreadsheet error (a sketch of the Excel-as-View idea follows this list).
  • Luigi (from a PyCon talk) is a tool "for batch data processing including dependency resolution and monitoring". It will be interesting to compare it to CCNMTL's Wardenclyffe (soon to be released!), a web-based workflow orchestration tool that we use for batch processing of videos, and more (a minimal Luigi task is sketched below this list).
  • Chartbeat is a service that lets sites track, in minute detail, where users are spending time on their pages. Their software sends data back to Chartbeat every second to report how long you have spent on the page and where your mouse is pointing. An interesting finding: once you eliminate users who leave a page right away, most users spend most of their time scrolled part-way down the page (a toy version of that kind of calculation appears after the list).
  • Finally, I saw a fascinating talk about how big data and predictive modeling were used in the Obama campaign to strategize its media buys. I am pretty sure some of this was covered in the press, but the presenters were part of the campaign and shared some juicy details (like how they spent something like $400k on "set top box" data). Here is their Prezi presentation. They claimed that these techniques resulted in the Obama campaign spending $100 less per TV minute per voter than the Romney campaign.
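
To make the log-querying idea concrete, here is a minimal sketch of hitting Elasticsearch's REST search API from Python - the same API Kibana builds its dashboards on. The host, the logstash-* index pattern, and the message/@timestamp fields are just logstash's default conventions, not anything taken from the demo itself.

```python
import json
import requests

# A minimal sketch (assumptions: Elasticsearch running locally, logs shipped
# by logstash into its default daily "logstash-YYYY.MM.DD" indices, with the
# usual "message" and "@timestamp" fields).
query = {
    "query": {
        "query_string": {"query": "message:error"}  # full-text search over log lines
    },
    "sort": [{"@timestamp": {"order": "desc"}}],     # newest events first
    "size": 10,
}

resp = requests.post(
    "http://localhost:9200/logstash-*/_search",      # logstash-* matches the daily indices
    data=json.dumps(query),
)

for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```

Kibana is essentially a nicer front-end for exactly this kind of query, which is why anything that can be coaxed into a log-like shape gets this style of analysis almost for free.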
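For the Excel-as-View idea, here is roughly what a DataNitro-style script looks like: Python does the computation, and the spreadsheet just displays the result. I'm going from memory of the docs, so treat the Cell accessor (and the fact that it's available inside the DataNitro environment without an import) as assumptions rather than verified API.

```python
# Rough sketch of the Excel-as-View pattern, NOT verified DataNitro code.
# Assumption: this runs inside Excel via DataNitro, where (as I recall from
# the docs) a Cell accessor is available for reading and writing cells.
grades = [88, 92, 75, 63, 97]

# Python does the work...
average = sum(grades) / float(len(grades))

# ...and the spreadsheet acts as the View: raw data goes into column A,
# the computed summary into B1.
for row, grade in enumerate(grades, start=1):
    Cell("A{0}".format(row)).value = grade

Cell("B1").value = average
```

The appeal for provenance is that the logic lives in a script you can read and version, instead of being scattered across cell formulas.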
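And here is a minimal Luigi pipeline to show what "dependency resolution" means in practice: the second task declares that it requires the first, and Luigi works out the run order and skips any task whose output already exists. The word-count job and file names are made up for illustration.

```python
import luigi

class FetchText(luigi.Task):
    """Produce the raw input file."""

    def output(self):
        return luigi.LocalTarget("raw.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("some raw text to process\n")

class CountWords(luigi.Task):
    """Count words in the file produced by FetchText."""

    def requires(self):
        return FetchText()  # declares the dependency; Luigi runs FetchText first

    def output(self):
        return luigi.LocalTarget("word_count.txt")

    def run(self):
        with self.input().open("r") as infile, self.output().open("w") as outfile:
            outfile.write(str(len(infile.read().split())))

if __name__ == "__main__":
    # e.g. run with: python wordcount.py CountWords --local-scheduler
    luigi.run()
```

Wardenclyffe solves a similar orchestration problem for our video workflows, so it will be interesting to see where the two approaches overlap.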
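Finally, a toy version of the Chartbeat-style calculation, just to show the shape of the data: per-second pings per user, bouncers filtered out, and the remaining time bucketed by scroll position. None of this is Chartbeat's actual code or schema; the field names and the five-second bounce threshold are invented.

```python
from collections import defaultdict

# Toy illustration only: each ping says how long a user has been on the page
# and roughly how far down they are scrolled (0.0 = top, 1.0 = bottom).
pings = [
    {"user": "a", "seconds_on_page": 2,  "scroll_depth": 0.0},  # bounces
    {"user": "b", "seconds_on_page": 40, "scroll_depth": 0.5},
    {"user": "b", "seconds_on_page": 41, "scroll_depth": 0.5},
    {"user": "c", "seconds_on_page": 90, "scroll_depth": 0.7},
]

# Drop users who never got past an (arbitrary) 5-second bounce threshold.
max_time = defaultdict(int)
for p in pings:
    max_time[p["user"]] = max(max_time[p["user"]], p["seconds_on_page"])
engaged_users = {u for u, t in max_time.items() if t > 5}

# Each remaining ping is roughly one second of attention at some scroll depth.
time_by_depth = defaultdict(int)
for p in pings:
    if p["user"] in engaged_users:
        time_by_depth[round(p["scroll_depth"], 1)] += 1

print(dict(time_by_depth))  # most engaged seconds land part-way down the page
```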

Overall, this summit was a pretty interesting mixture of sectors and tools. It wasn't quite as technical as I had hoped, and the format prevented anyone from diving into the detail I was looking for. I was also left wondering what kind of real value this kind of analytics provides, but there were a few examples from marketing that demonstrated the payoff, and everyone in that room believes enough in these methods to invest serious resources in finding out the answer.