Monday, October 26, 2015

My story of SQL as a data analysis equalizer

So, the title of this post is a little bit extreme, but I chose it specifically because I think it will help you (gentle reader) in thinking in the direction that I'd like you to think.

Back when I was a young & bright eyed computer scientist who was slowly realizing that social dynamics were a fascinating and complex environment within which to practice technology development, I was invited to intern at the Wikimedia Foundation with a group of 7 other researchers.  It turns out that we all had very different backgrounds going in.  Fabian, Yusuke, and I had a background in computer science.  Jonathan had expertise in technology design methods (but not really programming).  Melanie's expertise was in rhetoric and language.  Stuart was trained in sociology and philosophy of science (but he'd done a bit of casual programming to build some bots for Wikipedia).  I think this diverse set of backgrounds enabled us to think very well about the problem that we had to face, but that's a subject for another blog entry.  Today I want to talk to you about the technology that we ended up rallying around and taking massive advantage of: the Structured Query Language (SQL) and a Relational Database Management System (RDBMS).

Up until my time at the Wikimedia Foundation, I had to do my research of Wikipedia the hard way.  1st, I downloaded Wikipedia's 10 terrabyte XML dump (compressed to ~100 gigabytes).  Then I write Python script that used a streaming p7zip decompressor and a unix pipe to read the XML with a streaming processor.  This workflow was complex.  It tapped many of the skills I had learned in my training as a computer scientist.  And yet, it was still incredibly slow to perform basic analyses.

It didn't take me long to start using this XML processing strategy to produce intermediate datasets that could be loaded into a postgres RDBMS for high-speed querying.  This was invaluable to making quick progress.  Still I learned some lessons about including *all the metadata I reasonably could* in this dataset since going back to the XML was another headache and week-long processing job.  As a side-note, I've since learned that many other computer scientists working with this dataset went through a similar process and have since polished and published their code that implements these workflows.  TL;DR: This was a really difficult analysis workflow even for people with a solid background in technology.  I don't think it's unreasonable to say that social scientist or rhetoric scholar would have found it intractable to do alone.

When I was helping organize the work we'd be doing at the Wikimedia Foundation, I'd heard that there was a proposal in the works to get us researchers a replica of Wikipedia's databases to query directly.  Honestly, I was amazed that people weren't doing this already.  I put my full support behind it and thanks to others who saw the value, it was made reality.  Suddenly I didn't need to worry about processing new XML dumps to update my analyses, I could just run a query against a database that was already indexes and up to date at all times.  This was a breakthrough for me and I found myself doing explorations of the dataset that I had never considered before because the speed of querying the relevancy of the data made them possible.  Stuart and I had a great time writing SQL queries for both our own curiosity and to explore what we ought to be exploring.

For my coworkers who had no substantial background in programming, they saw yet another language that the techies were playing around with.  So, they took advantage of us to help them answer their questions by asking us to produce excel-sized that they could explore.  But as these things go when people gets busy, these requests would often remain unanswered for days at a time.  I've got to hand it to Jonathan.  Rather than twiddling his thumbs while he waited for a query request to be resolved, he decided to pick up SQL as a tool.  I think he set an example.  It's one thing to have a techy say to you, "Try SQL!  It's easy and powerful." and a totally different thing for someone without such a background to agree.  By the end of the summer internship, I don't think we had anyone (our managers included) who were not writing a bit of SQL here and there.  All would agree that they were substantially empowered by it.

Since then, Jonathan has made it part of his agenda to bring SQL and basic data analysis techniques to larger audiences.  He's been a primary advocate (and volunteer product manager) of Quarry our new open querying to for Wikimedia data.  That service has taken off like wildfire -- threatening to take down our MariaDB servers.  Check it out https://quarry.wmflabs.org/  Specifically, I'd like to point you to the list of recent queries: https://quarry.wmflabs.org/query/runs/all  Here, you can learn SQL techniques by watching others use them!

No comments:

Post a Comment