In our CDN pops in various locations world-wide we analyze between 10.000 and 100.000 access log lines per second. We collect this data in order to be able to provide analytics and, even more important, billing information. The most important question is: How much data was served for each customer? This is information we need to collect and present in our control panel. Obviously it needs to be 100% correct. In our systems each log line holds over 20 fields. These fields need to analyzed in real-time. Real-time is a bit of a flexible concept. Some people claim that a 24 hour delay is still real-time. When we talk about real-time we aim for an update interval of half a minute.
So effectively our clusters need to analyze 2 millions of values per second and turn them into a set of 30 real-time analytics that update every 30 seconds.
There is only so little time
Let’s assume you are only willing to spend 10% extra on hardware to add analytics to your product, then the amount of time you can spend per data point is limited. If at peak time the average requests per second a machine is handling is 1000 and each machine is a dual hexacore server, then you can spend maximum of (1/1000)*16*0.10 = 0.0016 second = 1.6 millisecond per log line in a single threaded application for doing analytics CPU time. On I/O we can do a similar kind of calculation with disks instead of cores. You must agree that we are on a tight budget considering that a 15k SAS disk’s average seek time is 2 ms. Also note that it does not matter on which machine or tier this calculation happens as the budget stays the same.
Traditional Big Data approach
You collect all raw data that you have and you write it to a DFS (like Hadoop Distributed File System). Then you define MapReduce jobs to actually go over the data and aggregate it into a result format. That result is either stored in a NoSQL database, on a DFS or if sufficiently reduced into a relational database (like MySQL). To aggregate aggregated data we can read the result from disk and aggregate once more. In a very naive approach the disk I/O is high as the data would be written to log files on the edge, then read from the logs and written to the DFS and then read and written for each aggregation level and/or different kind of statistic.
Streaming analytics approach
Figure 1: An example of a streaming analytics infrastructure with the data flowing from left to right.
This approach is much simpler and performs better. We read the data directly from the log files as it is just written and probably still memory mapped, which reduces the I/O on the nodes. Updating all counters in RAM directly after the data is read for all resulting statistics is also relatively cheap. All these statistics are removed from RAM and sent upstream to an aggregation server with a certain interval (flush rate). On the aggregation server all counters are again updated in RAM and then written to disk in the final format. This final format is either in MySQL or in files on disk in the format that they will be queried.
I/O is limited, plenty of CPU
Most calculations done in analytics are extremely easy. It merely consists of counting occurrences. Sometimes only occurrences of certain combinations are counted and sometimes other fields are summed, like the amount of bytes transfered or seconds spent. Also note that the combination of a sum and a count can then lead to an average. These simple calculations do not require much CPU. But every time an number is to be incremented it needs to be read and written. And we need to realize that the amount of available I/O is also limited. Faster disks are expensive, especially when there is a need for high capacity.
RAM to the rescue
The only way to reduce the I/O is to keep the numbers in RAM (memory). This way they do not have to be read or written. This has it’s own disadvantage: your RAM capacity needs to be high or your result data needs to be small. Keeping the result data small is actually valuable. Humans cannot process tens of thousands of data points anyway, so why not reduce it to the most valuable information possible. Also, lets do it as early in the process as possible to not waste any expensive CPU, I/O or RAM. This is how the streaming analytics work. Log files are only written to disk once (on the edge nodes) and then processed in RAM, aggregated in RAM, sent over HTTP to a central server, where they are again aggregated in RAM to be removed and inserted in a traditional relational database, since they are sufficiently reduced in size.
MySQL is not so bad
MySQL can actually do many inserts or update queries per second. On my SSD-driven i5 laptop when using a non-optimized (default) MySQL with extended queries I got over 15.000 values updated per second. Which was a lot higher than I assumed was possible. If it is needed you can store large sets of values in text fields in MySQL (in the resulting json format) to reduce load on MySQL, but also give up the flexibility to query it any way you like. The flush rate tells us how often the aggregated analytics are sent upstream. Lowering the flush rate increases the total delay in the analytics, but on the other hand lowers the amount of writes needed on MySQL. This is easy to see, as every counter that is updated is then updated less frequently and with a larger increment.
When a node in the system goes down it clearly does not do any traffic anymore so it does also not have to do statistics. If the aggregation or MySQL server goes down, the nodes can stop further log analysis to only continue once the server(s) are available again. Web nodes are typically configured to hold several days of logging information on disks, so this would not cause a problem. The nodes may have a little higher load when they are trying to keep up for lost time, but the speed during recovery can easily be throttled to minimize this problem.
The most important rule is to keep the resulting statistics limited in size (relatively small). Otherwise you are using too much RAM and you have to fall back to slow disk I/O. By having a retention policy per statistic, you can make sure you delete any data that is no longer relevant. Obviously MySQL can be optimized using (and omitting) indexes and changing its disk writing strategy. One of the most powerful knobs is making this flush rate configurable per statistic, so you can tune the performance of the system and lower the delay or lower the writes. Also for top lists you can apply length limitations that would lower the correctness slightly, but highly reduce the data transfer and RAM usage. A last resort would be to only process a sample of the log lines for certain less important statistics on certain higher volume customers.
It seems that for real-time log file analysis this “Streaming Analytics” approach is much more suitable than a more traditional Big Data approach. The only real downside is that you have to keep the resulting data small in size, so that it fits in RAM. We’ve calculated that we are able to achieve more than a factor 10 higher performance compared to our previous implementation. Now that is what I call innovation!