LeaseWeb Labs celebrates 670k unique visitors in 2014

Now that 2014 is over, it is a good time to look back with some facts on our blog’s traffic.

The 5 most popular posts of 2014

#1 – 10 tools for PHP developers in Ubuntu 14.04
This post explains what a developer needs to install in Ubuntu 14.04. It is found by people searching for a PHP IDE in Ubuntu Linux.

#2 – Text mode 2048 game in C, algorithm explained
People searching for the 2048 game algorithm seem to find this post a lot, and the source code on GitHub has received over 100 stars.

#3 – PyCon 2014, full conference on YouTube
PyCon was great and so was the traffic on our website when we listed all the talks. Thank you PyCon for publishing them!

#4 – Blocking Google and Facebook tracking using Ad Block Plus and Ghostery
The popular opinion that Google and Facebook know way too much about us may be the reason for this post's popularity.

#5 – Testing framework using Maven, TestNG and Webdriver
This is a very good and technical post by Grygorii Gerashchenko on automated testing based on Java technology.

Unique visitors in 2014

We are also very proud of the number of unique visitors the blog has seen this year (compared to last year):

[Figure: blog traffic, 2013 vs 2014]

As you can see the year 2014 was a very good year for LeaseWeb Labs. Some facts:

  • We have welcomed 670k unique visitors in 2014 as counted by Count Per Day!
  • We have maintained the high post rate of 2013 with more than 6 posts per month.
  • Even without viral hits on social bookmarking sites we now reach 60k unique visitors per month!
  • We went from 1500 unique visitors per day in 2013 to 2000 unique visitors per day in 2014.
  • This year almost half of our visitors came from Google (Search), where it was one third in 2013.

Post frequency

It takes quite some effort and writing discipline to build a popular blog. These are the statistics on our posting behavior:

[Figure: number of posts per month]

Let's hope 2015 will again be a great year with lots of interesting posts. Please keep reading LeaseWeb Labs!

 

Affordable real-time Big Data with streaming analytics

In our CDN PoPs in various locations worldwide we analyze between 10,000 and 100,000 access log lines per second. We collect this data to be able to provide analytics and, even more importantly, billing information. The most important question is: how much data was served for each customer? This is information we need to collect and present in our control panel, and obviously it needs to be 100% correct. In our systems each log line holds over 20 fields, and these fields need to be analyzed in real time. Real time is a bit of a flexible concept: some people claim that a 24-hour delay is still real time. When we talk about real time, we aim for an update interval of half a minute.

So effectively (over 20 fields times up to 100,000 log lines) our clusters need to analyze 2 million values per second and turn them into a set of 30 real-time statistics that update every 30 seconds.

There is only so little time

Let's assume you are only willing to spend 10% extra on hardware to add analytics to your product; the amount of time you can spend per data point is then limited. If at peak time a machine handles on average 1000 requests per second and each machine is a dual hexa-core server (12 cores), then a single-threaded application can spend at most (1/1000) * 12 * 0.10 = 0.0012 seconds = 1.2 milliseconds of analytics CPU time per log line. For I/O we can do a similar calculation with disks instead of cores. You must agree that we are on a tight budget, considering that a 15k SAS disk's average seek time is 2 ms. Also note that it does not matter on which machine or tier this calculation happens, as the budget stays the same.
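
The same back-of-the-envelope calculation, written out as a few lines of Python for clarity (the numbers are the ones used above):

```python
# Rough per-log-line CPU budget for analytics, using the figures from the text.
requests_per_second = 1000   # average requests one machine handles at peak time
cores = 12                   # dual hexa-core server
overhead_fraction = 0.10     # willing to spend 10% extra on analytics

budget_seconds = (1.0 / requests_per_second) * cores * overhead_fraction
print("%.1f ms of CPU time per log line" % (budget_seconds * 1000))  # 1.2 ms
```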

Traditional Big Data approach

You collect all the raw data you have and write it to a DFS (like the Hadoop Distributed File System). Then you define MapReduce jobs that actually go over the data and aggregate it into a result format. That result is stored in a NoSQL database, on a DFS, or, if sufficiently reduced, in a relational database (like MySQL). To aggregate already aggregated data, we read the result from disk and aggregate once more. In a very naive approach the disk I/O is high, as the data is written to log files on the edge, then read from the logs and written to the DFS, and then read and written again for each aggregation level and/or different kind of statistic.

Streaming analytics approach

[Figure 1: An example of a streaming analytics infrastructure with the data flowing from left to right.]

This approach is much simpler and performs better. We read the data directly from the log files just after it has been written, while it is probably still memory mapped, which reduces the I/O on the nodes. Updating all counters for the resulting statistics in RAM directly after the data is read is also relatively cheap. At a certain interval (the flush rate) all these statistics are removed from RAM and sent upstream to an aggregation server. On the aggregation server all counters are again updated in RAM and then written to disk in the final format. This final format is either MySQL or files on disk, in the format in which they will be queried.
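
To illustrate the node-side part of this flow, here is a minimal sketch; the helpers follow_access_log, parse_log_line and send_upstream are hypothetical placeholders for the log tailing, parsing and HTTP upload described above:

```python
import time
from collections import defaultdict

FLUSH_INTERVAL = 30  # seconds between flushes to the aggregation server

counters = defaultdict(lambda: {"hits": 0, "bytes": 0})
last_flush = time.time()

for line in follow_access_log("/var/log/access.log"):  # hypothetical tail-like reader
    fields = parse_log_line(line)                       # hypothetical log line parser
    key = (fields["customer"], fields["status"])
    counters[key]["hits"] += 1                          # update counters in RAM
    counters[key]["bytes"] += fields["bytes"]

    if time.time() - last_flush >= FLUSH_INTERVAL:
        send_upstream(dict(counters))                   # hypothetical HTTP POST upstream
        counters.clear()                                # statistics are removed from RAM
        last_flush = time.time()
```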

I/O is limited, plenty of CPU

Most calculations done in analytics are extremely simple: they merely consist of counting occurrences. Sometimes only occurrences of certain combinations are counted, and sometimes other fields are summed, like the number of bytes transferred or seconds spent. Note that the combination of a sum and a count can then lead to an average. These simple calculations do not require much CPU, but every time a number is incremented it needs to be read and written, and the amount of available I/O is also limited. Faster disks are expensive, especially when there is a need for high capacity.
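
For example, an average object size per customer is nothing more than a running sum divided by a running count (the names below are purely illustrative):

```python
stats = {"hits": 0, "bytes": 0}

def record(bytes_sent):
    stats["hits"] += 1             # count occurrences
    stats["bytes"] += bytes_sent   # sum the bytes transferred

record(512)
record(2048)
average_bytes = stats["bytes"] / float(stats["hits"])  # 1280.0
```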

RAM to the rescue

The only way to reduce the I/O is to keep the numbers in RAM (memory). This way they do not have to be read from or written to disk. This has its own disadvantage: your RAM capacity needs to be high or your result data needs to be small. Keeping the result data small is actually valuable. Humans cannot process tens of thousands of data points anyway, so why not reduce it to the most valuable information possible? And let's do it as early in the process as possible, so as not to waste any expensive CPU, I/O or RAM. This is how the streaming analytics approach works. Log files are only written to disk once (on the edge nodes) and then processed in RAM, aggregated in RAM, and sent over HTTP to a central server, where they are again aggregated in RAM and finally inserted into a traditional relational database, since by then they are sufficiently reduced in size.

MySQL is not so bad

MySQL can actually handle many insert or update queries per second. On my SSD-driven i5 laptop, using a non-optimized (default) MySQL with extended queries, I got over 15,000 values updated per second, which was a lot higher than I assumed was possible. If needed, you can store large sets of values in text fields in MySQL (in the resulting JSON format) to reduce the load on MySQL, at the cost of the flexibility to query them any way you like. The flush rate tells us how often the aggregated analytics are sent upstream. Lowering the flush rate increases the total delay in the analytics, but on the other hand lowers the number of writes needed on MySQL. This is easy to see, as every counter that is updated is then updated less frequently and with a larger increment.
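
As an illustration, this is roughly what applying a batch of counter increments with a single extended query could look like; the stats table and its unique key on (customer, metric) are assumptions made for the sake of the example:

```python
import MySQLdb  # any DB-API compatible driver will do

def flush_to_mysql(conn, batch):
    """batch maps (customer, metric) to the increment collected since the last flush."""
    rows = list(batch.items())
    placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
    sql = ("INSERT INTO stats (customer, metric, value) VALUES " + placeholders +
           " ON DUPLICATE KEY UPDATE value = value + VALUES(value)")
    args = []
    for (customer, metric), increment in rows:
        args.extend([customer, metric, increment])
    cursor = conn.cursor()
    cursor.execute(sql, args)  # one extended query updates many counters at once
    conn.commit()
```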

High availability

When a node in the system goes down, it clearly does not serve any traffic anymore, so it also does not have to produce statistics. If the aggregation or MySQL server goes down, the nodes can pause further log analysis and only continue once the server(s) are available again. Web nodes are typically configured to hold several days of logging information on disk, so this would not cause a problem. The nodes may have a somewhat higher load while they are catching up, but the speed during recovery can easily be throttled to minimize this problem.

Optimizations

The most important rule is to keep the resulting statistics limited in size (relatively small); otherwise you are using too much RAM and have to fall back to slow disk I/O. By having a retention policy per statistic, you can make sure you delete any data that is no longer relevant. Obviously MySQL can be optimized by using (and omitting) indexes and by changing its disk writing strategy. One of the most powerful knobs is making the flush rate configurable per statistic, so you can tune the performance of the system and either lower the delay or lower the number of writes. For top lists you can also apply length limitations that lower the correctness slightly, but greatly reduce the data transfer and RAM usage. A last resort is to only process a sample of the log lines for certain less important statistics of certain higher-volume customers.
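
Such per-statistic tuning can be as simple as a small configuration table; the statistic names and numbers below are made up purely for illustration:

```python
# Hypothetical per-statistic knobs: flush rate, retention, top-list length and sampling.
STATISTICS = {
    "bytes_per_customer": {"flush_seconds": 30,  "retention_days": 400},
    "top_referrers":      {"flush_seconds": 300, "retention_days": 30, "top_n": 100},
    "status_codes":       {"flush_seconds": 60,  "retention_days": 90, "sample_rate": 0.1},
}
```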

Conclusion

It seems that for real-time log file analysis this “Streaming Analytics” approach is much more suitable than a more traditional Big Data approach. The only real downside is that you have to keep the resulting data small in size, so that it fits in RAM. We’ve calculated that we are able to achieve more than a factor 10 higher performance compared to our previous implementation. Now that is what I call innovation!

LeaseWeb Labs celebrates a successful 2013

Today we share with you some insights into our LeaseWeb Labs traffic of 2013. This post also shows which tools we use to analyze the traffic (the underlined parts of the titles are links). Some facts:

  • We published 72 articles in 2013, compared to 29 articles in 2012.
  • The number of articles posted per month in 2013 (Jan-Dec) is: 6, 5, 6, 4, 5, 8, 8, 6, 4, 6, 5, 9, where that series was 2, 2, 0, 0, 4, 2, 3, 5, 1, 2, 4, 4 in 2012.
  • At the end of 2012 we had about 4-5k unique visitors per month and now at the end of 2013 we are at 40-50k!
  • We are very proud that we reached 1500 unique visitors per day on average in December.
  • Google Webmaster Tools shows that about one third of those visitors are sent to us by Google (Search).

Check out screenshots from our statistics below.

Statistics from Count-per-day

[Figure: visitors per day]

[Figure: visitors by country]

[Figure: visitors by browser]

[Figure: visitors per post]

Statistics from Google Webmaster Tools

[Figure: search keywords]

[Figure: Google Search statistics]

Statistics from Alexa

[Figure: Alexa reach graph]

[Figure: global Alexa rank]

I want to thank you all for helping to make this blog a success, whether it was by reading, commenting, writing, promoting or sharing article feedback and new ideas. It is much appreciated!

Happy holidays and keep visiting in 2014!

Leaseweb BigData and MapReduce (Part 2)


This is the second part of a series of posts about the current Leaseweb BigData and MapReduce. The first post can be found here.

In this one we will focus on understanding MapReduce and building a job that will run on the cluster. Let's start by looking at a diagram of what a MapReduce job looks like (source: http://discoproject.org/media/images/diagram.png):

[Figure: a MapReduce job (diagram from the Disco project)]

When I first saw this image I started thinking: "That's quite simple! I just need to provide 2 functions and things will fly!" After a while it became clear that this is not the case…

So, what is the problem? First, let me update the image above with what I think is a more realistic view of the steps you need to take to effectively get something out of a MapReduce cluster:

[Figure: my updated version of the diagram]

So now we moved from 2 problems (writing a Map and a Reduce function) to 4. It could be 3 or still 2, depending on the case, but I do think the majority of people will face 4 or 5 (getting the data from wherever it is to the cluster).

To recap, this is what you need to take care of to make it work (a minimal code sketch follows the list):

  • Get the data to the Cluster (could be a non-issue)
  • Slice the data in appropriate chunks
  • Write the Map
  • Write the Reduce
  • Do something with the output! Store it somewhere, generate files, etc.
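
To make steps 3 and 4 concrete, here is a minimal word-count job in the style of the Disco tutorial (Disco is the Python MapReduce framework running on our cluster, see part 1); the input URL is just an example text file:

```python
from disco.core import Job, result_iterator

def map(line, params):
    # emit a (word, 1) pair for every word on the input line
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    # kvgroup groups the sorted (word, count) pairs by word
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print("%s %d" % (word, count))
```

Getting the input data onto the cluster (steps 1 and 2) and doing something useful with the output (step 5) are still up to you.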

This was the second post on BigData and MapReduce. The first post can be found here.

Leaseweb BigData and MapReduce (Part 1)

Today BigData is an enormous buzzword. So what is it all about? How difficult is it to use?

I already wrote in a previous post (http://www.leaseweb.com/labs/2011/11/big-data-do-i-need-it/) about the need for it and gave some insight into our BigData structure here at Leaseweb. In this post I will dive more into how we process this data using MapReduce and Python.

First of all, as the name says, BigData is actually a lot of data, so to retrieve information from it in a timely manner (and we do it in real time) you need to build an infrastructure that can handle it. So without further delay, this is our cluster structure:

  • 6 machines
  • 48 processing units (4 cores per machine with hyper-threading)
  • 1Gb Network

As said before, we want to use our in-house Python knowledge (although Java knowledge also exists), so we went with Disco (http://discoproject.org/) to build our MapReduce infrastructure. We didn't benchmark Disco against a classic Hadoop setup (the all-knowing "Internet" says you take a hit on performance), but we are satisfied with the results we are getting, and the cluster is not even getting that much work.

Once the hardware and the framework are up and running, the second part starts: programming the MapReduce jobs.

MapReduce is not some miracle technology that will solve all your problems with data, so before you start programming the jobs you will have to actually understand how it works. But for that, wait for part 2 🙂