This is the second part of a series of posts about the current Leaseweb BigData and MapReduce setup. The first post can be found here.
In this one we will focus on understanding MapReduce and building a job that will run on the cluster. Let’s start by looking at a diagram of what a MapReduce job looks like (Source: http://discoproject.org/media/images/diagram.png):
When I first saw this image I started thinking “That’s quite simple! I just need to provide 2 functions and things will fly!” After a while it became clear that this is not the case…
So, what is the problem? First, let me update the image above to what I think is a more realistic view of the steps you need to take to effectively get something out of a MapReduce cluster:
So, we have now moved from 2 problems (writing a Map and a Reduce function) to 4. It could be 3, or still 2, depending on the case. But I do think the majority of people will face 4 or 5 (getting the data from wherever it lives onto the cluster).
Summing up, here is what you need to do to make it work:
- Get the data to the Cluster (could be a non-issue)
- Slice the data in appropriate chunks
- Write the Map
- Write the Reduce
- Do something with the output! Store it somewhere, generate files, etc…
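To make the Map and Reduce steps concrete, here is a minimal, framework-free sketch of the map → shuffle → reduce flow using the classic word-count example. This is my own illustrative code, not part of our actual cluster jobs: in a real deployment, Hadoop (or a similar framework) handles the chunking and the shuffle for you, and you only supply the two functions.

```python
# A toy simulation of a MapReduce job: map, shuffle (group by key), reduce.
# The chunks below are hypothetical; a real framework slices the input for you.
from collections import defaultdict

def map_fn(chunk):
    # Emit a (word, 1) pair for every word in the input chunk.
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Sum all the counts emitted for the same word.
    return (word, sum(counts))

def run_job(chunks):
    # Shuffle phase: group every mapped value by its key.
    grouped = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            grouped[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, v) for k, v in grouped.items())

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_job(chunks))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The point of the sketch is that the two functions really are simple; the work that surrounds them (getting data in, slicing it, and collecting the output) is where the remaining steps from the list come in.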
This was the second post on BigData and MapReduce. The first post can be found here.