Leaseweb BigData and MapReduce (Part 2)


This is the second part of a series of posts about the current Leaseweb BigData and MapReduce setup. The first post can be found here.

In this one we will focus on understanding MapReduce and on building a job that will run on the cluster. Let’s start by looking at a diagram of what a MapReduce job looks like (source: http://discoproject.org/media/images/diagram.png):

MapReduce Example

When I first saw this image I started thinking “That’s quite simple! I just need to provide 2 functions and things will fly!” After a while it became clear that this is not the case…

So, what is the problem? First, let me update the image above with what I think is a more realistic view of the steps you need to take to effectively get something out of a MapReduce cluster:

My version

So now we have moved from 2 problems (writing a Map and a Reduce function) to 4. It could be 3, or still 2, depending on the case, but I think the majority of people will face 4 or even 5 (getting the data from wherever it lives onto the cluster).

To sum up, this is what you need to take care of to make it work (a minimal job sketch follows the list):

  • Get the data to the Cluster (could be a non-issue)
  • Slice the data in appropriate chunks
  • Write the Map
  • Write the Reduce
  • Do something with the output! Store it somewhere, generate files, etc…
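
To give an idea of what the Map and Reduce parts look like in practice, here is a minimal word-count job in the style of the Disco tutorial. It is only a sketch, not our production code: the input URL is a placeholder and the exact API can differ between Disco versions.

from disco.core import Job, result_iterator

def fun_map(line, params):
    # Map: emit (word, 1) for every word in an input line
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    # Reduce: sum the counts for each word
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    # Placeholder input; in a real job this points at data the cluster can reach.
    job = Job().run(input=["http://example.com/data.txt"],
                    map=fun_map,
                    reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)

Even in this tiny example you can see the other steps from the list: the input has to be reachable by the cluster, the framework slices it into lines for the map function, and the final loop is the “do something with the output” part.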

This was the second post on BigData and MapReduce. The first post can be found here.


Leaseweb BigData and MapReduce (Part 1)

Today BigData is an enormous buzzword. So what is it all about? How difficult is it to use?

I already wrote in a previous post (http://www.leaseweb.com/labs/2011/11/big-data-do-i-need-it/) about the need for it and gave some insight into our BigData structure here at Leaseweb. In this post I will dive more into how we process this data using MapReduce and Python.

First of all, as the name says, BigData is actually a lot of data, so to retrieve information from it in a timely manner (and we do it in real time) you need to build an infrastructure that can handle it. So without further delay, this is our cluster structure:

  • 6 machines
  • 48 processing units (4 cores per machine with hyper-threading)
  • 1Gb Network

As said before, we want to use our in-house Python knowledge (although Java knowledge also exists), so we went with Disco (http://discoproject.org/) to build our MapReduce infrastructure. We didn’t benchmark Disco against a classic Hadoop setup (the all-knowing “Internet” says you take a hit on performance), but we are satisfied with the results we are getting, and the cluster is not even under that much load.

Once the hardware and the framework are up and running, the second part starts: programming the MapReduce jobs.

MapReduce is not some miracle technology that will solve all your data problems. So before you start programming the jobs, you will have to actually understand how it works. But for that, wait for part 2 🙂


Distributed storage for HA clusters (based on GlusterFS)

Nowadays everyone is searching for HA (high availability) solutions for running more powerful applications or websites. We already have a post on our blog about configuring load balancers on Ubuntu (http://www.leaseweb.com/labs/2011/09/setting-up-keepalived-on-ubuntu-load-balancing-using-haproxy-on-ubuntu-part-2/); in this post I will expand on it with a highly available and balanced setup that you can use for your shared hosting or a high-traffic website without having a single point of failure.

The actual problem with balanced / clustered solutions is often the content server where you keep all the data: databases, static files, uploaded files. All this content needs to be distributed across all your servers, and you have to keep track of modifications, which is not really convenient; otherwise the content server becomes a single point of failure in your balanced solution. You can use rsync or lsyncd for this, but to keep things simple from an administrative perspective and to be better prepared for the future, you can use a DFS (distributed file system). Nowadays there are plenty of suitable open source options: OCFS, GlusterFS, MogileFS, PVFS2, ChironFS, XtreemFS, etc.

Why do we want to use a DFS?

  • Data sharing among multiple users
  • User mobility
  • Location transparency
  • Location independence
  • Backups and centralized management

For this post, I chose GlusterFS to start with. Why? It is open source, it has a modular, pluggable interface, and you can run it on any Linux-based server without upgrading your kernel. Maybe I will give an overview of other DFS options in one of my next blog posts.

We will use 3 dedicated servers like the HP120G6:

1. HP120G6 / 1 x QC X3440 CPU / 2 x 1Gbit NICs / 4GB RAM / 2 x 1TB SATA2 (disks can be added later on the fly :))

Let’s assume that we already have a dedicated server with CentOS 6 installed on the first 1TB hard drive.

Just SSH to your host, prepare the HDDs, and install the GlusterFS server and agent.

#ssh root@85.17.xxx.xx

Now we need to prepare the HDDs; we will use the second 1TB drive for the distributed storage:

#parted /dev/sdb
mklabel gpt
unit TB
mkpart primary 0 1TB
print
quit
mkfs.ext4 /dev/sdb1

After that we will install all required packages for gluster:

#yum -y install wget fuse fuse-libs automake bison gcc flex libtool
#yum -y install compat-readline5 compat-libtermcap
#wget http://packages.sw.be/rsync/rsync-3.0.7-1.el5.rfx.x86_64.rpm

Let’s download the gluster packages from their website:

#wget http://download.gluster.com/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-core-3.2.4-1.x86_64.rpm
#wget http://download.gluster.com/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-fuse-3.2.4-1.x86_64.rpm
#wget http://download.gluster.com/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-geo-replication-3.2.4-1.x86_64.rpm
#wget http://download.gluster.com/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-rdma-3.2.4-1.x86_64.rpm
#rpm -Uvh glusterfs-core-3.2.4-1.x86_64.rpm
#rpm -Uvh glusterfs-geo-replication-3.2.4-1.x86_64.rpm

We also want to run our storage traffic over a separate network card, so let’s configure eth1 for that using an internal IP range:

#nano /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE="eth1"
HWADDR="XX:XX:XX:XX:XX:XX"
NM_CONTROLLED="yes"
BOOTPROTO="static"
IPADDR="10.0.10.X"
NETMASK="255.255.0.0"
ONBOOT="yes"

We also want to use hostnames while configuring gluster:

#nano /etc/hosts
10.0.10.1       gluster1-server
10.0.10.2       gluster2-server
10.0.10.3       gluster3-server

Ensure that TCP ports 111, 24007, 24008, and 24009 through (24009 + number of bricks across all volumes) are open on all Gluster servers.

You can use the following chains with iptables:

#iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24047 -j ACCEPT
#iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT
#iptables -A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 111 -j ACCEPT
#iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 38465:38467 -j ACCEPT
#service iptables save
#service iptables restart

Now check the version of installed glusterfs:

#/usr/sbin/glusterfs -V

To configure Red Hat-based systems to automatically start the glusterd daemon every time the system boots, enter the following from the command line:

#chkconfig glusterd on

GlusterFS offers a single command line utility known as the Gluster Console Manager to simplify configuration and management of your storage environment. The Gluster Console Manager provides functionality similar to the LVM (Logical Volume Manager) CLI or ZFS Command Line Interface, but across multiple storage servers. You can use the Gluster Console Manager online, while volumes are mounted and active.

You can run the Gluster Console Manager on any Gluster storage server. You can run Gluster commands either by invoking the commands directly from the shell, or by running the Gluster CLI in interactive mode.

To run commands directly from the shell, for example:

#gluster peer status

To run the Gluster Console Manager in interactive mode:

#gluster

Upon invoking the Console Manager, you will get an interactive shell where you can execute gluster commands, for example:

gluster > peer status

Before configuring a GlusterFS volume, you need to create a trusted storage pool consisting of the storage servers that will make up the volume. A storage pool is a trusted network of storage servers. When you start the first server, the storage pool consists of that server alone. To add additional storage servers to the storage pool, you can use the probe command from a storage server.

To add servers to the trusted storage pool, use the following command for each additional server you have. For example:

gluster peer probe gluster2-server
gluster peer probe gluster3-server

Verify the peer status from the first server using the following command:

# gluster peer status
Number of Peers: 2

Hostname: gluster2-server
Uuid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
State: Peer in Cluster (Connected)

Hostname: gluster3-server
Uuid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
State: Peer in Cluster (Connected)

This way, you can add additional storage servers to your storage pool on the fly. To remove a server from the storage pool, use the following command:

# gluster peer detach gluster3-server
Detach successful

Now we can create a replicated volume.
A volume is a logical collection of bricks where each brick is an export directory on a server in the trusted storage pool. Most of the gluster management operations happen on the volume.
Replicated volumes replicate files throughout the bricks in the volume. You can use replicated volumes in environments where high-availability and high-reliability are critical.

First we SSH to each server to create a directory for its brick and mount the second HDD on it:

On gluster1-server:
#mkdir /storage1
#mount /dev/sdb1 /storage1

On gluster2-server:
#mkdir /storage2
#mount /dev/sdb1 /storage2

On gluster3-server:
#mkdir /storage3
#mount /dev/sdb1 /storage3

Create the replicated volume using the following command:

# gluster volume create test-volume replica 3 transport tcp gluster1-server:/storage1 gluster2-server:/storage2 gluster3-server:/storage3
Creation of test-volume has been successful
Please start the volume to access data.

You can optionally display the volume information using the following command:

# gluster volume info
Volume Name: test-volume
Type: Replicate
Status: Created
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: gluster1-server:/storage1
Brick2: gluster2-server:/storage2
Brick3: gluster3-server:/storage3

You must start your volume before you try to mount it. Start the volume using the following command:

# gluster volume start test-volume
Starting test-volume has been successful

Now we need to set up a GlusterFS client to access our volume. Gluster offers multiple ways to access gluster volumes:

  • Gluster Native Client – This method provides high concurrency, performance and transparent failover in GNU/Linux clients. The Gluster Native Client is POSIX conformant. For accessing volumes using gluster native protocol, you need to install gluster native client.
  • NFS – This method provides access to gluster volumes with NFS v3 or v4.
  • CIFS – This method provides access to volumes when using Microsoft Windows as well as SAMBA clients. For this access method, Samba packages need to be present on the client side.

We will use the Gluster Native Client, as it is a POSIX-conformant, FUSE-based client running in user space. The Gluster Native Client is recommended for accessing volumes when high concurrency and high write performance are required.

Verify that the FUSE module is installed:

# modprobe fuse
# dmesg | grep -i fuse
fuse init (API version 7.XX)

Install required prerequisites on the client using the following command:

# sudo yum -y install openssh-server wget fuse fuse-libs openib libibverbs

Ensure that TCP and UDP ports 24007 and 24008 are open on all Gluster servers. Apart from these ports, you need to open one port for each brick starting from port 24009. For example: if you have five bricks, you need to have ports 24009 to 24014 open.

You can use the following chains with iptables:

# sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -j ACCEPT
# sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24009:24014 -j ACCEPT

Install the Gluster Native Client (FUSE component) using the following commands:

#rpm -Uvh glusterfs-fuse-3.2.4-1.x86_64.rpm
#rpm -Uvh glusterfs-rdma-3.2.4-1.x86_64.rpm

After that we need to mount the volume we created before.

To manually mount a Gluster volume, use the following commands on each server (the mount point has to exist first):

# mkdir -p /mnt/glusterfs
# mount -t glusterfs gluster1-server:/test-volume /mnt/glusterfs

You can also configure your system to automatically mount the Gluster volume each time the system starts.
Edit the /etc/fstab file and add the following line on each server:

gluster1-server:/test-volume /mnt/glusterfs glusterfs defaults,_netdev 0 0

To check the mounted volume we can simply use the following command:

# df -h /mnt/glusterfs
Filesystem                    Size  Used Avail Use% Mounted on
gluster1-server:/test-volume  1.0T  1.3G  1.0T   1% /mnt/glusterfs

Now you can use the /mnt/glusterfs mount as HA storage for your files, backups or even Xen virtualization, and you don’t need to worry if one server goes down because of a hardware problem or planned maintenance.
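
Because the volume is mounted like any other POSIX filesystem, applications need no Gluster-specific code to benefit from the replication. Here is a small illustrative sketch; the directory name under /mnt/glusterfs is just an example:

import os

# Ordinary file I/O on the mount point: the data ends up on all three bricks.
backup_dir = "/mnt/glusterfs/backups"          # example path on the Gluster mount
os.makedirs(backup_dir, exist_ok=True)

path = os.path.join(backup_dir, "hello.txt")
with open(path, "w") as f:
    f.write("stored on the replicated Gluster volume\n")

with open(path) as f:
    print(f.read())    # readable from any server that mounts the volume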

Next time, I will write about how to integrate Gluster with NFS without having a single point of failure, so you can use it with any solution that supports the NFS protocol.

Any suggestions and comments are appreciated. Thank you!


Scalable RDBMS

My name is Mukesh. I worked with fairly large (or medium-large) scale websites in my previous assignments, and I am now in LeaseWeb’s cloud team as an innovation engineer. When I say large scale, I’m talking about a website serving 300 million web pages per day (all rendered within a second), a website storing about half a billion photos and videos, a website with an active user base of ~10 million, a web application with 3000 servers… and so on!

We all know it takes a lot to keep sites like these running, especially if the company has decided to run them on commodity hardware. Coming from this background, I’d like to dedicate my first blog post to the subject of scalable databases.

A friend of mine, a marketing manager by profession but inspired by technology, asked me why we are using MySQL when we know it does not scale (or is there some special Harry Potter# magic?). What he wanted to ask was: for what reasons did we choose MySQL, and are there any plans to move to another database?

Well, the answer to the latter is easy: “No, we’re not planning to move to another database.” The former question, however, can’t be answered in a single line.

# Talking of Harry Potter, what do you think about ‘The Deathly Hallows Part II’?

Think about Facebook, a well-recognised social networking website. Facebook handles more than 25 billion page views per day, and even they use MySQL.

The bottleneck is not MySQL (or any common database). Generally speaking, every database product in the market has the following characteristics to some extent:

  1. PERSISTENCE:  Storage and (random) retrieval of data
  2. CONCURRENCY:  The ability to support multiple users simultaneously (lock granularity is often an issue here)
  3. DISTRIBUTION:  Maintenance of relationships across multiple databases (support of locality of reference, data replication)
  4. INTEGRITY:  Methods to ensure data is not lost or corrupted (features including automatic two-phase commit, use of dual log files, roll-forward recovery)
  5. SCALABILITY:  Predictable performance as the number of users or the size of the database increase

This post deals with scalability, a term we hear quite often when we talk about large systems / big data.

Data volume can be managed if you shard it. If you split the data across different servers at the application level, the scalability of MySQL is not such a big problem. Of course, you cannot make a JOIN with data from different servers, but choosing a non-relational database doesn’t help either. There is no evidence that even Facebook uses Cassandra (back in early 2008, its very own creation) as primary storage, and it seems that the only thing it is needed for there is searching incoming messages.
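
To make the application-level sharding idea concrete, here is a minimal sketch. The shard list, table and column names are hypothetical, and the actual MySQL client calls are left out:

# Minimal application-level sharding sketch: derive a shard from the user id
# and send the query to that server. Cross-shard JOINs are not possible;
# those queries have to be reassembled in the application.

SHARDS = [
    {"host": "db1.example.com", "db": "app_shard_0"},
    {"host": "db2.example.com", "db": "app_shard_1"},
    {"host": "db3.example.com", "db": "app_shard_2"},
]

def shard_for(user_id):
    """Pick a shard deterministically from the user id."""
    return SHARDS[user_id % len(SHARDS)]

def fetch_user_query(user_id):
    shard = shard_for(user_id)
    # With any MySQL client library you would connect to shard["host"],
    # select shard["db"] and run the parameterised query below.
    return shard, ("SELECT * FROM users WHERE id = %s", (user_id,))

print(fetch_user_query(42))   # routed to one of the three shards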

In reality, distributed databases such as Cassandra, MongoDB and CouchDB (or any new database, for that matter) lack scalability and stability until they have some real users; I keep seeing posts from users running into issues or annoyances. For example, the guys at Twitter spent about a year trying to move from MySQL to Cassandra (it is great to see they have a bunch of features working).
I’m not saying these databases aren’t good; they are getting better with time. My point is that any new database needs more time, and a few large deployments, to be mature enough to be considered over MySQL. Of course, if someone tells me how they used one of these databases as primary storage for a billion records over a year, then I’ll change my opinion.

I believe it’s a bad idea to risk your main database on new technology. It would be a disaster to lose or damage the database, and you may not be able to restore everything. Besides, if you’re not a developer of one of these newfangled databases, or one of the few who actually run them in production, you can only pray that the developers will fix bugs and scalability issues as they come up.

In fact, you can go very far on a single MySQL server without even caring about partitioning data at the application level. While it’s easy to scale a server up with a bunch of cores and tons of RAM, do not forget about replication. In addition, if the database sits behind a memcached layer (which scales trivially), the only thing it really has to care about is writes. For storing large objects, you can use S3 or any other distributed hash table. Until you are sure that you need to scale the database as it grows, do not shoulder the burden of making it an order of magnitude more scalable than you need.
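
As an illustration of that read-offloading pattern, here is a minimal cache-aside sketch. It uses the python-memcached client as an example; the key naming and the load_user_from_db helper are hypothetical placeholders for the real MySQL query:

import memcache  # python-memcached client, used here as an example

cache = memcache.Client(["127.0.0.1:11211"])
CACHE_TTL = 300  # seconds

def load_user_from_db(user_id):
    # Placeholder for the real MySQL query, e.g.
    #   SELECT id, name FROM users WHERE id = %s
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = "user:%d" % user_id
    user = cache.get(key)
    if user is not None:
        return user                       # served from memcached, no DB hit
    user = load_user_from_db(user_id)     # cache miss: go to MySQL
    cache.set(key, user, time=CACHE_TTL)
    return user

def update_user(user_id, fields):
    # Writes always go to the database; invalidate the cached entry so the
    # next read repopulates it with fresh data.
    # ... UPDATE users SET ... WHERE id = %s ...
    cache.delete("user:%d" % user_id)

This way reads are absorbed by the cache layer and MySQL mostly sees writes, which is exactly the situation described above.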

Most problems arise when you try to split the data over a large number of servers, but you can put an intermediate layer between the application and the database that is responsible for partitioning, as FriendFeed does, for example.
I believe that the relational model is the correct way of structuring data in most applications: content that users create. Schemas keep data in a well-defined form across new versions of the service; they also serve as documentation and help avoid a heap of errors. SQL allows you to process only as much data as needed instead of fetching tons of raw information that then still needs to be reprocessed in the application. I think once the whole hype around NoSQL is over, someone will finally develop a relational database with free semantics.

Conclusion!

  1. Use MySQL or other classic databases for important, persistent data.
  2. Use caching delivery mechanisms, or maybe even NoSQL, for fast delivery.
  3. Wait until the dust settles and the next-generation, free-semantics relational database rises up.