Today BigData is a enormous Buzz word. So what all about it? How difficult is to use it?
I already wrote on a previous post (http://www.leaseweb.com/labs/2011/11/big-data-do-i-need-it/) about the need for it and gave some insight on our BigData structure here at Leaseweb. On this post I will dive more into how we process this data using MapReduce and Python.
First off all, as the name says, BigData is actually a lot of data, so to retrieve information from it in a timely manner (And we do it in real time) you need to build a infrastructure that can handle it. So without more delay, this is our cluster structure:
- 6 machines
- 48 processing units (4 cores per machine with hyper-threading)
- 1Gb Network
As said before, we want to use our in-house python knowledge (although Java knowledge also exists). So we went with Disco (http://discoproject.org/) to build our MapReduce infrastructure. We didn’t benchmark Disco vs Classic Hadoop structure (the all knowing “Internet” says you take a hit on performance), but we are satisfied we the results we are getting and the cluster is not even getting that much work.
After the hardware and the framework is up and running, it starts the second part, programming the MapReduce Jobs.
MapReduce is not some miracle technology that will solve all your problems with data. So before starting programming the jobs you will have to actually understand how it works. But for that, wait for part 2 🙂