This is the second part of a series of posts about BigData and MapReduce at Leaseweb. The first post can be found here.
In this one we will focus on understanding MapReduce and building a job that will run on the cluster. Let's start by looking at a diagram of what a MapReduce job looks like (Source: http://discoproject.org/media/images/diagram.png):
When I first saw this image I started thinking: "That's quite simple! I just need to provide 2 functions and things will fly!" After a while it became clear that this is not the case…
So, what is the problem? First, let me update the image above to what I think is a more realistic view of the steps you need to effectively get something out of a MapReduce cluster:
So, we moved from 2 problems (writing a Map and a Reduce function) to 4. It could be 3, or still 2, depending on the case, but I think the majority of people will face 4 or even 5 (getting the data from wherever it lives onto the cluster).
Summing up, this is what you need to know to make it work:
- Get the data to the cluster (could be a non-issue)
- Slice the data in appropriate chunks
- Write the Map
- Write the Reduce
- Do something with the output! Store it somewhere, generate files, etc…
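To make these steps concrete, here is a minimal word-count sketch in plain Python (no cluster, no Disco; all names here are my own illustration) that mimics what the framework does: take the input data, slice it into chunks, run a map over each chunk, group the intermediate keys, reduce, and do something with the output:

```python
from collections import defaultdict

# 1. Get the data "to the cluster": here it is just an in-memory string.
data = "the quick brown fox jumps over the lazy dog the fox"

# 2. Slice the data into chunks (a real cluster spreads these over nodes).
words = data.split()
chunks = [words[i:i + 4] for i in range(0, len(words), 4)]

# 3. The map function: emit a (key, value) pair per word.
def map_fn(chunk):
    for word in chunk:
        yield word, 1

# Shuffle: group all intermediate values by key (the framework does this for you).
grouped = defaultdict(list)
for chunk in chunks:
    for key, value in map_fn(chunk):
        grouped[key].append(value)

# 4. The reduce function: combine all values emitted for one key.
def reduce_fn(key, values):
    return key, sum(values)

# 5. Do something with the output: here we just build a dict and print it.
result = dict(reduce_fn(k, v) for k, v in grouped.items())
print(result)  # e.g. 'the' maps to 3, 'fox' maps to 2
```

On a real cluster the interesting part is that steps 3 and 4 run in parallel on many machines; the sequential loop above is only there to make the data flow visible.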
Today BigData is an enormous buzzword. So what is it all about? And how difficult is it to use?
I already wrote in a previous post (http://www.leaseweb.com/labs/2011/11/big-data-do-i-need-it/) about the need for it and gave some insight into our BigData structure here at Leaseweb. In this post I will dive deeper into how we process this data using MapReduce and Python.
First of all, as the name says, BigData is actually a lot of data, so to retrieve information from it in a timely manner (and we do it in real time) you need to build an infrastructure that can handle it. So without further delay, this is our cluster structure:
- 6 machines
- 48 processing units (4 cores per machine with hyper-threading)
- 1Gbit network
As said before, we want to use our in-house Python knowledge (although Java knowledge also exists). So we went with Disco (http://discoproject.org/) to build our MapReduce infrastructure. We didn't benchmark Disco against a classic Hadoop setup (the all-knowing "Internet" says you take a hit on performance), but we are satisfied with the results we are getting, and the cluster is not even under that much load.
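One of the nice things about Disco is that a job is just plain Python functions. As a hedged sketch (the function names are mine, and the actual `disco.core.Job` submission to the cluster is deliberately omitted), a word-count job's map and reduce callables follow Disco's `(input, params)` convention and can even be tried locally without a cluster:

```python
# Sketch of the map/reduce callables a Disco word-count job would use.
# Signatures follow Disco's (input, params) convention; running them on a
# real cluster requires submitting them via disco.core.Job, not shown here.

def wc_map(line, params):
    # Emit a (word, 1) pair for every word in one input line.
    for word in line.split():
        yield word, 1

def wc_reduce(iterable, params):
    # Group the intermediate (key, value) pairs by key and sum the values.
    from itertools import groupby
    for word, pairs in groupby(sorted(iterable), key=lambda kv: kv[0]):
        yield word, sum(v for _, v in pairs)

# Because they are plain Python generators, we can test them locally:
pairs = list(wc_map("to be or not to be", None))
counts = dict(wc_reduce(pairs, None))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Being able to unit-test the functions outside the cluster is a big part of why sticking to our Python knowledge paid off.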
Once the hardware and the framework are up and running, the second part starts: programming the MapReduce jobs.
MapReduce is not some miracle technology that will solve all your problems with data. So before starting to program the jobs, you will have to actually understand how it works. But for that, wait for part 2 🙂