Big data technology for absolute beginners: a meetup report

After years of reading-intensive formal education, I’ve come to the conclusion that I’m actually best at hands-on learning. You can talk to me till next year about technical concepts but until I can see them in action, they often don’t make much sense to me. That’s why this morning’s Boston Data Mining meetup was so valuable: we worked directly with an Amazon Web Services Elastic MapReduce cluster and an AWS Redshift database. Once you can make those connections, the learning opportunities are probably boundless.

The good people of Data Kitchen (Gil Benghiat, Eric Estabrooks, and Chris Bergh) first gave us high-level overviews of Hadoop, AWS EMR (Amazon’s branding of Hadoop), Redshift, and associated frameworks like Impala. Also covered at a high level were MapReduceHive and Pig, which you can use to retrieve data from a Hadoop/AWS EMR cluster. Each technology has its strengths and weaknesses, and the DK guys gave some expert advice in those areas too. Questions from the 50+ people in the room were of high quality and brought up some good discussion points. Also on hand with deep subject matter expertise and critical helpful hints for newbs like me was William Lee of Imagios.

Before long it was time to try connecting to an AWS EMR cluster. Setup for these connections is not a trivial matter, but fortunately there were good instructions posted on the DK blog before the meetup convened. Amazingly, even though many people arrived at the meetup without having completed the setup prerequisites, and all three major desktop O/S were well represented in the crowd, most people were able to run a SQL query against an AWS EMR cluster by the end of the morning. Yes, even me. (My biggest challenge was trying to get SQL Workbench to run on Debian without gnome or KDE installed. FOSS and I have a Stockholm syndrome type of relationship. Long story short, make sure you have one of those desktop environments installed before you try to use SQL Workbench on Linux.)

The Data Kitchen guys and William Lee also put in a few extra hours to make sure we all could put together a Redshift database to which we could connect from our desktops. I was flabbergasted that I was able to get up and running with the AWS technologies for a couple of cents on the dollar. Last week I enrolled in an online Hadoop course that promised I could run labs in the cloud, only to find out after the first couple of lectures that there was no cloud and that the desktop software I would need required a six-core processor at a minimum. Needless to say, I quickly unenrolled from the course.

You can create an AWS EMR cluster that costs a few cents an hour to run. The configuration options you choose during cluster creation apparently can affect the price greatly, so be careful. (The Data Kitchen slides provided specific information on this point.) Also critical: if you’re not going to keep using the EMR cluster or Redshift database you create, remember to terminate it (EMR) or shut it down (Redshift), or face a big credit card bill later. Another great thing for us cheapskates: public big data is only a Google away.

Slides from the workshop will be posted to Slideshare – I would imagine that Data Kitchen will announce the postings on their blog. All told, this was a morning well spent. My lunchtime visit to Tatte Bakery on 3rd Street didn’t hurt.