Big data – do I need it?

Big Data?

Big data is one of the most recent “buzz words” on the Internet. The term is normally associated with data sets so big that they are really complicated to store, process, and search through.

Big data is known to be a three-dimensional problem (as defined by Gartner, Inc*), i.e. it has three problems associated with it:
1. increasing volume (amount of data)
2. velocity (speed of data in/out)
3. variety (range of data types, sources)

Why Big Data?
The bigger a dataset grows, the more information you can extract from it, and the better the precision of the results you get (assuming you’re using the right models, but that is not relevant for this post). Bigger datasets also allow better and more diverse analyses to be run against the data. Many corporations keep growing their datasets to get more “juice” out of them: some to build better business models, some to improve the user experience, others to get to know their audience better – the choices are virtually unlimited.
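To make the precision point a little more concrete, here is a toy simulation (not from the original post, and not tied to any particular big data stack): estimating a simple statistic from samples of increasing size, where the bigger sample yields a visibly smaller error.

```python
# Toy illustration: estimating a population mean from samples of increasing
# size. The true mean of uniform values in [0, 1) is 0.5; larger samples give
# estimates with smaller error, mirroring the "more data, better precision" idea.
import random

random.seed(42)

def estimation_error(sample_size):
    """Draw sample_size uniform values in [0, 1) and return |mean - 0.5|."""
    sample = [random.random() for _ in range(sample_size)]
    estimate = sum(sample) / sample_size
    return abs(estimate - 0.5)

errors = {n: estimation_error(n) for n in (100, 10_000, 1_000_000)}
for n, err in errors.items():
    print(f"n={n:>9}: error={err:.5f}")
```

Real workloads are of course more subtle than estimating a uniform mean, but the trend is the same: more data narrows the error bars.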

In the end, and in my opinion, big data analysis/management can be a competitive advantage for corporations. In some cases, a crucial one.

Big Data Management

Big data management software is not something you normally buy on the market as an “off-the-shelf” product (maybe Oracle wants to change this?). One of the biggest questions of big data management is: what do you want to do with it? Knowing this is essential to minimize the problems related to huge data sets. Of course you can just store everything and later try to make some sense of the data you have. Again, in my opinion, this is the way to get a problem and not a solution/advantage.
Since you cannot just buy a big data management solution, a strategy has to be designed and followed until something is found that can work as a competitive advantage to the product/company.

Internally at LeaseWeb we’ve got a big data set, and we can work on it at real-time speed (we are using Cassandra** at the moment) and obtain the results we need. To get this working, we went through several trial-and-error iterations, but in the end we got what we needed, and so far it is living up to expectations. How much hardware? How much development time? It all depends; the question you have to ask yourself is “What do I need?”, and once you have an answer to that, normal software planning/development time applies. It may even be the case that you don’t need big data at all, or that you can solve your problem using standard SQL technologies.
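As a small aside on the “standard SQL technologies” option: a question like per-page traffic often needs nothing more than a single aggregation query. The sketch below uses Python’s built-in SQLite with invented table and column names, purely for illustration – it is not how our setup works.

```python
# A minimal sketch (illustrative names only): before reaching for big data
# tooling, a plain SQL aggregation often answers the question directly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (url TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("/home", 1200), ("/home", 900), ("/about", 300)],
)

# "How much traffic does each page get?" -- one query, no cluster required.
rows = conn.execute(
    "SELECT url, COUNT(*) AS hits, SUM(bytes) AS total_bytes "
    "FROM requests GROUP BY url ORDER BY total_bytes DESC"
).fetchall()
print(rows)  # [('/home', 2, 2100), ('/about', 1, 300)]
```

Only when queries like this stop being fast enough, or the data stops fitting the model, does the big data question really start.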

In the end, our answer to “What do I need?” provided us with all the data we needed to find out what was best for us. In our case it was a mix of technologies, one of them being a NoSQL database.



Choosing the right tool for automated GUI testing

Within LeaseWeb we took our first steps in automated GUI testing, and we use a tool called “Selenium” for our automated tests. Before choosing Selenium as the automated test execution tool, we also checked:

– Fitnesse
– Rational Robot
– Selenium IDE and WebDriver
– AutoIT3

We preselected these tools for research because they all claim to have a low learning curve and can be used (more or less) out of the box. We also wanted open source tooling. Rational is not open source, but it was an alternative because I had worked with the Rational tooling with good results in the past.

Fitnesse
Fitnesse works with input from wiki-like pages, and is very usable if you want to test a lot of data that comes from the business – so very usable for testing data that goes through the software from a business point of view. Fitnesse has no record and playback options (yet), so the learning curve is somewhat steeper for testers who are just starting to automate their tests. Maybe something to use in combination with another tool in the future.

AutoIT3
AutoIT3 is more of a macro recorder. It has record and playback, but that is very limited: it clicks around on applications based on pixel positions on the screen. If you play back an AutoIT script on another PC with a different resolution, or when a button has moved, the script will fail. It is usable for quick recording, editing, and playback when you need “throw-away automation”. Nevertheless, it has full programming capabilities, so it can be used if you’re writing tests that go beyond the user interface.

Selenium IDE and WebDriver
Selenium is the best-known automated test tool for web applications, and I could immediately play around with the record & playback tool of this product. It was reasonably easy to export the recorded tests to a programming language such as Python, and then the options for using it as a test tool become almost limitless.

IBM Rational Robot
I used Rational Robot a few years ago and tried to install the latest version, but I couldn’t get it to start within my time box for this research. Since I checked it after checking Selenium, I quickly went back to Selenium as the automated GUI testing platform.

This was a post about the selection of an automated test tool. We have chosen the Selenium platform because it is a web-based automated GUI testing tool that can be used almost immediately with little programming knowledge. It doesn’t stop there, because you can use a programming environment to adapt it to your needs. It also has lots of plug-ins for the record and playback software (Selenium IDE) that make more sophisticated test scripts possible in Selenium IDE. I’ve heard of combinations of Selenium with the Fitnesse tool, so that could be a good combination for data-driven testing in the future too. Open source, open platform. What’s not to like?

Just try it for yourself and see the possibilities.


PFCongres 2011 – a quick recap

Last Saturday, the 17th of September 2011, was the day the PFCongres took place. For those who don’t know it: PFCongres is a web development conference in the Netherlands that has been gathering web enthusiasts for the sixth year in a row. This year’s edition was split into two simultaneous tracks and hosted fourteen well-known speakers, such as:

Zeev Suraski – an Israeli programmer, PHP developer and co-founder of Zend Technologies. With the help of Andi Gutmans he wrote PHP 3 in 1997 and the Zend Engine in 1999.

Derick Rethans – author of the mcrypt, input_filter, dbus and date/time extensions in PHP. He maintains the well-known PHP profiler Xdebug and is a contributor to the Apache Zeta Components.

Juozas Kaziukenas – founder and CEO of Web Species Ltd, speaker at web technology conferences, and blogger.

Joshua Thijssen – senior software engineer at Enrise/4Worx and owner of the privately held company NoxLogic.

Unfortunately we could only follow sessions on the English track, but there were a lot of interesting topics there:

– Mastering Namespaces in PHP
– The new era of PHP frameworks
– PHP Extensions, why and what?
– SPL Data Structures and their Complexity
– 15 Pro tips for MySQL users

Three of them proved really valuable to me as a PHP developer:

The greatest talk, in my opinion, was given by Joshua Thijssen, a MySQL specialist, who in a simple and concise form presented several tips that can speed up our database queries. I think the best description of what he did is the presentation placed here: [slideshare]. Remember – don’t trust varchars! 🙂

Another amazing talk was given by Jurriën Stutterheim who, in a very accessible way, went from the algorithmic complexity basics to the really interesting data structures present in PHP. It was a great pleasure to learn that besides common PHP arrays, we can choose from more sophisticated structures such as SplDoublyLinkedList, SplStack, SplQueue, SplHeap, SplMaxHeap, SplMinHeap, SplPriorityQueue, SplFixedArray and SplObjectStorage. A link to the presentation can be found here: [slideshare] – even if that is just the tip of the iceberg, it really encourages you to take a closer look into this topic.
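The SPL classes themselves are PHP-specific, but the complexity argument behind them is language-independent. As a rough analogue of SplMinHeap/SplPriorityQueue, here is a small Python sketch using the standard heapq module, where retrieving the smallest element costs O(log n) per pop instead of an O(n) scan of a plain array:

```python
# A min-heap (comparable to SplMinHeap) keeps the smallest element cheap to
# retrieve: building the heap is O(n) once, and each pop is O(log n), versus
# an O(n) scan of a plain array for every single "give me the minimum" lookup.
import heapq

tasks = [(5, "low"), (1, "urgent"), (3, "normal")]
heapq.heapify(tasks)  # O(n) one-off build of the heap invariant

# Pop all tasks in priority order; each pop is O(log n).
order = [heapq.heappop(tasks)[1] for _ in range(3)]
print(order)  # ['urgent', 'normal', 'low']
```

The difference is invisible with three elements, but exactly the kind of thing that matters once your arrays hold millions of entries.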

I’d like to mention the session by Nick Belhomme here as well; he described a new functionality of PHP called namespaces, which are abstract containers created to hold a logical grouping of unique identifiers. His presentation can be found here: [slideshare]. Like the previous one, this session was truly educational, and it additionally provided lots of great code examples.

Last but not least, I’d like to mention a pretty interesting talk given by Juozas Kaziukėnas. He tried to depict what has changed in the PHP framework world over the last six years. And I must admit that he did it very well – impressive knowledge, an objective look, and plenty of accurate observations prove his expertise and skills in this topic.

He pointed out several frameworks, including Symfony2, Zend Framework, Lithium, Alloy, Fuel, Fat-Free Framework and FLOW3. Among them, the most admired was the Symfony2 framework, mostly for its bundles, dependency injection, community-driven development (Git), interoperability and, of course, speed. He is also eagerly awaiting the stable release of ZF2, which will probably take place within a year – or maybe even sooner? If we want to try something else in the meantime, there is always the option of trying out one of the micro frameworks. Although they are intended for small projects, they should be interesting and are certainly worth attention – one of them is Silex.

To sum it up, I’m really glad that I could be one of the PFCongres attendees. I learned lots of useful stuff and met a few interesting people. I hope to be there next year again, and if you are a PHP enthusiast, you should be there too!


Setting up keepalived on Ubuntu (load balancing using HAProxy on Ubuntu part 2)

In our previous post we set up an HAProxy load balancer to balance the load of our web application over three webservers. Here’s a diagram of the situation we ended up with:

              +---------+
              |  uplink |
              +---------+
                   |
              +---------+
              | loadb01 |
              +---------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+

As we already concluded in the last post, there’s still a single point of failure in this setup. If the loadbalancer dies for some reason the whole site will be offline. In this post we will add a second loadbalancer and setup a virtual IP address shared between the loadbalancers. The setup will look like this:

              +---------+
              |  uplink |
              +---------+
                   |
+---------+   +---------+   +---------+
| loadb01 |---|virtualIP|---| loadb02 |
+---------+   +---------+   +---------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+

So our setup now is:
– Three webservers, web01, web02, and web03, each serving the application
– The first load balancer (loadb01)
– The second load balancer (loadb02); configure this in the same way as we configured the first one.

To set up the virtual IP address we will use keepalived (as also suggested by Warren in the comments):

loadb01$ sudo apt-get install keepalived

Good, keepalived is now installed. Before we proceed with configuring keepalived itself, edit the following file:

loadb01$ sudo vi /etc/sysctl.conf

And add this line to the end of the file:

net.ipv4.ip_nonlocal_bind = 1
This option is needed to allow applications (HAProxy in this case) to bind to non-local addresses (IP addresses which do not belong to an interface on the machine). To apply the setting, run the following command:

loadb01$ sudo sysctl -p

Now let’s add the configuration for keepalived. Open the file:

loadb01$ sudo vi /etc/keepalived/keepalived.conf

And add the following contents (see the comments for details on the configuration!):

# Settings for notifications
global_defs {
    notification_email {			# Email address for notifications
    }
    notification_email_from loadb01@domain.ext  # The from address for the notifications
    smtp_server				# You can specify your own smtp server here
    smtp_connect_timeout 15
}

# Define the script used to check if haproxy is still working
vrrp_script chk_haproxy {
    script "killall -0 haproxy"
    interval 2
    weight 2
}

# Configuration for the virtual interface
vrrp_instance VI_1 {
    interface eth0
    state MASTER 				# set this to BACKUP on the other machine
    priority 101				# set this to 100 on the other machine
    virtual_router_id 51
    smtp_alert					# Activate email notifications
    authentication {
        auth_type AH
        auth_pass myPassw0rd		# Set this to some secret phrase
    }
    # The virtual ip address shared between the two loadbalancers
    virtual_ipaddress {
    }
    # Use the script above to check if we should fail over
    track_script {
        chk_haproxy
    }
}

And start keepalived:

loadb01$ sudo /etc/init.d/keepalived start

The next step is to install and configure keepalived on our second load balancer as well; redo the steps starting from apt-get install keepalived. In the configuration step for keepalived, be sure to change these two settings:

    state MASTER 				# set this to BACKUP on the other machine
    priority 101				# set this to 100 on the other machine

to:

    state BACKUP
    priority 100

That’s it! We have now configured a virtual IP shared between our two load balancers. You can try loading the HAProxy statistics page on the virtual IP address; you should get the statistics for loadb01. Then switch off loadb01 and refresh: the virtual IP address will now be assigned to the second load balancer, and you should see the statistics page for that one.

In a next post we will focus on adding MySQL to this setup as requested by Miquel in the comments on the previous post in this series. If there’s anything else you’d like us to cover, or if you have any questions please leave a comment!