Why speed is more important than scalability

Software developers creating web applications like to talk about scalability as if it were totally unrelated to computing efficiency. They also like to argue that abstractions are important. They will tell you that their DBAL, Router, View Engine, Dependency Injection and even ORM do not have that much overhead (only a little). They will argue that the learning curve pays off and that the performance loss is not that bad. In this post I’ll argue that page loading speed is more important than scalability (or pretty abstractions).

Orders of magnitude of speed on the web

Just to get an idea of speed, I tried to find a typical web action for every order of magnitude:

  • 0.1 ms – A simple database lookup from memory
  • 1 ms – Serving small static content from RAM
  • 10 ms – A very simple API that only does a DB lookup
  • 100 ms – A complex page load that calls multiple APIs
  • 1000 ms – Nothing should take this long… 🙂

But I’m sure that your website has pages that take a full second or more (even this site does). How can that be?

CPUs and web farms

Most servers have 1 to 4 CPUs. Each CPU has 2 to 32 cores. If you do a single web request, then you are using (at most) a single core of a single CPU. Maybe that is why people say that page loading speed is irrelevant. If you have more visitors, then they will use other cores or even other CPUs. This is true, but what if you have more concurrent requests than you have cores? You can simply add machines and configure a web farm, as most people do.

At some point you may have 16 servers running your popular web application with page loads that average 300 ms. If you could bring the page load time down to 20 ms, you could run this on a single box! Imagine how much simpler that is! The “one big box” strategy is also called the “big iron” strategy. Often programmers are not very careful with resources. This is because software developers tend to aim for beautiful abstractions and not for fast software.
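
To make the arithmetic concrete, here is a back-of-envelope sketch in Go; the server count, core count and latencies are assumptions for illustration, not measurements:

package main

import "fmt"

func main() {
	// Assumed numbers, for illustration only.
	servers := 16        // machines in the web farm
	coresPerServer := 8  // cores available per machine
	slowLatency := 0.300 // seconds per request with the current stack
	fastLatency := 0.020 // seconds per request after optimization

	// If one request occupies one core, throughput = cores / latency.
	farmThroughput := float64(servers*coresPerServer) / slowLatency
	oneBoxThroughput := float64(coresPerServer) / fastLatency

	fmt.Printf("16-server farm at 300 ms: ~%.0f requests/s\n", farmThroughput)   // ~427
	fmt.Printf("single box at 20 ms:      ~%.0f requests/s\n", oneBoxThroughput) // ~400
}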

Programming languages matter

Only when hardware enthusiasts and software developers work together may you get efficient software. Nice frameworks may have to be removed. Toys like Dependency Injection that mainly bring theoretical value may have to be sacrificed. Languages also need to be chosen for execution speed. Languages that typically score well are: C, C# (Mono), Go and even Java (OpenJDK). Languages that typically score very badly: PHP, Python, Ruby and Perl (source: benchmarksgame). This is understandable, as these are all interpreted languages. But it should be considered when building a web application.

Stop using web application frameworks

But even if you use an interpreted language (which may be 10x slower), you can still have good performance. Unless of course you build applications consisting of tens of thousands of lines of code that all need to be loaded. Normally people would not do that, as it would take too much time to write. (warning: sarcasm) In order to be able to fail – against all odds – people have created frameworks. These are tens of thousands of lines of code that you can install on your server. With one of those installed, your small and lean application will still be slow (/sarcasm). I think it is fair to say that frameworks will typically make your web application 10-100x slower.

Now let that be exactly the approach that the industry sees as “best practice”. Unbelievable, right?

5 reasons your application is slow

If you are on shared hardware (a VM or shared webhosting), then you need to fix that first. I’m sure switching to dedicated hardware will give you better (and more consistent) performance. Still, each of the following problems may give you an order of magnitude of speed decrease:

  1. Not enough RAM
  2. No SSD disks (or wrong controllers)
  3. Using an interpreted programming language
  4. A bloated web framework
  5. Not using Memcache where possible (see the sketch after this list)
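
To illustrate the last point, here is a minimal sketch of caching an expensive lookup with Memcache from Go, using the gomemcache client; the cache key and the “expensive query” are hypothetical placeholders:

package main

import (
	"fmt"
	"log"

	"github.com/bradfitz/gomemcache/memcache"
)

// expensiveDatabaseQuery stands in for a slow lookup you want to cache.
func expensiveDatabaseQuery() string {
	return "42 visitors online"
}

func main() {
	mc := memcache.New("127.0.0.1:11211")

	// Try the cache first; only hit the database on a miss.
	item, err := mc.Get("homepage:stats") // hypothetical key
	if err == nil {
		fmt.Println("cache hit:", string(item.Value))
		return
	}
	if err != memcache.ErrCacheMiss {
		log.Fatal(err)
	}

	value := expensiveDatabaseQuery()
	mc.Set(&memcache.Item{Key: "homepage:stats", Value: []byte(value), Expiration: 60})
	fmt.Println("cache miss, stored:", value)
}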

How many of these apply to you? Let me know in the comments.

Finally

This post will probably be considered offensive by programmers who like VMs, frameworks and interpreted languages. Please, don’t use that negative energy in the comments. Use it to “Go” and try Gorilla, I dare you! It is not a framework and it is fast, very fast! Not only is it going to be interesting and a lot of fun, it may also change your mind about this article.

NB: Go is almost as fast as the highly optimized C code of Nginx (about 10-100x faster than PHP).
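
If you want to give it a spin, here is a minimal sketch of an HTTP handler using the gorilla/mux router; the route and port are arbitrary examples:

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/gorilla/mux"
)

func main() {
	r := mux.NewRouter()

	// A simple route with a URL parameter.
	r.HandleFunc("/hello/{name}", func(w http.ResponseWriter, req *http.Request) {
		vars := mux.Vars(req)
		fmt.Fprintf(w, "Hello, %s!\n", vars["name"])
	})

	log.Fatal(http.ListenAndServe(":8080", r))
}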

The lies about the necessity of a Big Rewrite

This post is about real world software products that make money and have required multiple man-years of development time to build. This is an industry in which quality and costs, and thus professional software development, matter. The pragmatism and realism of Joel Spolsky’s blog on this type of software development is refreshing. He is also not afraid to speak up when he believes he is right, as on the “Big Rewrite” subject. In this post, I will argue why Joel Spolsky is right. Next I will show the real reasons software developers want to do a Big Rewrite and how they justify it. Finally, Neil Gunton has a quote that will help you convince software developers that there is a different path that should be taken.

Big Rewrite: The worst strategic mistake?

Whenever a developer says a software product needs a complete rewrite, I always think of Joel Spolsky saying:

… the single worst strategic mistake that any software company can make: (They decided to) rewrite the code from scratch. – Joel Spolsky

You should definitely read the complete article, because it holds a lot of strong arguments to back the statement up, which I will not repeat here. He made this statement in the context of the Big Rewrite Netscape did that led to Mozilla Firefox. In an interesting, very well written counter-post, Adam Turoff writes:

Joel Spolsky is arguing that the Great Mozilla rewrite was a horrible decision in the short term, while Adam Wiggins is arguing that the same project was a wild success in the long term. Note that these positions do not contradict each other.

Indeed! I fully agree that these positions do not contradict each other. So the result was not bad, but it was still the worst mistake the software company could make. Then he continues:

Joel’s logic has got more holes in it than a fishing net. If you’re dealing with a big here and a long now, whatever work you do right now is completely inconsequential compared to where the project will be five years from today or five million users from now. – Adam Turoff

Wait, what? Now he chooses Netscape’s side?! And this argument makes absolutely no sense to me. Who knows what the software will require five years or five million users from now? For this to be true, the guys at Netscape must have been able to look into the future. If so, then why did they not buy Apple stock? In my opinion the observation that one cannot predict the future is enough reason to argue that deciding to do a Big Rewrite is always a mistake.

But what if you don’t want to make a leap into the future, but you are trying to catch up? What if your software product has gathered so much technical debt that a Big Rewrite is necessary? While this argument also feels logical, I will argue why it is not. Let us look at the different technical causes of technical debt and what should be done to counter them:

  • Lack of a test suite – this can easily be countered by adding tests
  • Lack of documentation – writing it is not a popular task, but it can be done
  • Lack of loosely coupled components – dependency injection can be introduced one software component at a time; your test suite will guarantee there is no regression (see the sketch after this list)
  • Parallel development – do not rewrite big pieces of code; keep the change sets small and merge often
  • Delayed refactoring – is refactoring much more expensive than rewriting? It may seem so due to the 80/20 rule, but it probably is not; just start doing it
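
To illustrate the loose coupling point, here is a minimal sketch in Go (the package and type names are hypothetical) of introducing dependency injection in a single component: extract an interface for the dependency, inject it through the constructor, and use a fake implementation in the tests so regressions are caught:

package billing

// Mailer is the interface extracted from the existing, concrete mail code.
// The rest of the application keeps working; only this seam is new.
type Mailer interface {
	Send(to, subject, body string) error
}

// Invoicer used to call the SMTP code directly; the dependency is now injected.
type Invoicer struct {
	mailer Mailer
}

func NewInvoicer(m Mailer) *Invoicer {
	return &Invoicer{mailer: m}
}

func (i *Invoicer) SendInvoice(customer string) error {
	return i.mailer.Send(customer, "Your invoice", "Thanks for your business.")
}

// In the test suite, a fake Mailer guards against regressions:
type fakeMailer struct{ sent int }

func (f *fakeMailer) Send(to, subject, body string) error { f.sent++; return nil }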

And then we immediately get back to reality, which normally prevents us from doing a big rewrite – we need to tend the shop. We need to keep the current software from breaking down and we need to implement critical bug fixes and features. If this takes up all our time because there is so much technical debt, then that debt may become a hurdle that seems too big to ever overcome. So realize that not being able to reserve time (or people) to get rid of technical debt can be the real reason to ask for a Big Rewrite.

To conclude: a Big Rewrite is always a mistake, since we cannot look into the future, and if there is technical debt, it should be acknowledged and countered the normal way.

The lies to justify a Big Rewrite

When a developer suggests a “complete rewrite”, this should be a red flag to you. The developer is most probably lying about the justification. The real reasons the developer is suggesting a Big Rewrite or “build from scratch” are:

  1. Not-Invented-Here syndrome (not understanding the software)
  2. Hard-to-solve bugs (which are not fun to work on)
  3. Technical debt, including debt caused by missing tests and documentation (which is not fun to work on)
  4. The developer wants to work on a different technology (which is more fun to work on)

The lie is that the bugs and technical debt are presented as requiring structural/fundamental changes to the software that cannot realistically be achieved without a Big Rewrite. Five other typical lies (according to Chad Fowler) that the developer will promise in return for a Big Rewrite include:

  1. The system will be more maintainable (less bugs)
  2. It will be easier to add features (more development speed)
  3. The system will be more scalable (lower computation time)
  4. System response time will improve for our customers (less on-demand computation)
  5. We will have greater uptime (better high availability strategy)

Any code can be replaced incrementally and all code must be replaced incrementally. Just like bugs need to be solved and technical debt needs to be removed. Even when technology migrations are needed, they need to be done incrementally, one part or component at a time and not with a Big Bang.

Conclusion

Joel Spolsky is right: you don’t need a Big Rewrite. Doing a Big Rewrite is the worst mistake a software company can make. Or as Neil Gunton puts it more gently and positively:

If you have a very successful application, don’t look at all that old, messy code as being “stale”. Look at it as a living organism that can perhaps be healed, and can evolve. – Neil Gunton

If a software developer is arguing that a Big Rewrite is needed, then remind him that the software is alive and that he is responsible for keeping it healthy and growing it to full maturity.

Features of PostgreSQL

Database systems can cost a lot of money; this is fairly well known. Products like Microsoft SQL Server and Oracle Standard Edition are billed per CPU (or even per core) and may also require client licenses. The costs and issues of licensing may drive people to free (not only as in beer) software. When free database systems are discussed, most people immediately think of MySQL (also owned by Oracle). But there is another, maybe even better, player in the open source market that is less known. Its name is “PostgreSQL”.

An enterprise class database, PostgreSQL boasts sophisticated features such as Multi-Version Concurrency Control (MVCC), point in time recovery, tablespaces, asynchronous replication, nested transactions (savepoints), online/hot backups, a sophisticated query planner/optimizer, and write ahead logging for fault tolerance. It supports international character sets, multibyte character encodings, Unicode, and it is locale-aware for sorting, case-sensitivity, and formatting. It is highly scalable both in the sheer quantity of data it can manage and in the number of concurrent users it can accommodate. There are active PostgreSQL systems in production environments that manage in excess of 4 terabytes of data. — source: http://www.postgresql.org/about/

This database system is free and very powerful. It supports almost all of the features that its paid (and free) counterparts have. Some of the interesting features are: GIS support, hot standby for high availability and index-only scans (great for big data). Still not convinced? Check out the impressive Feature Matrix of PostgreSQL yourself. It has excellent support on Linux systems (also OSX and Windows) and integrates well with PHP and frameworks like Symfony2.
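
As a quick illustration, here is a minimal sketch of querying PostgreSQL from Go using the standard database/sql package with the lib/pq driver; the connection string and query are assumptions you will need to adjust:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // registers the "postgres" driver with database/sql
)

func main() {
	// Hypothetical connection string; adjust host, user and database name.
	db, err := sql.Open("postgres", "host=localhost user=app dbname=app sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var version string
	if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
		log.Fatal(err)
	}
	fmt.Println(version)
}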

Download & try PostgreSQL (for OSX or Windows) or follow installation instructions (for Ubuntu Linux)

Note: If you want a tool like PHPMyAdmin for PostgreSQL you might consider Adminer.

Big data – do I need it?

Big Data?

Big data is one of the most recent “buzz words” on the Internet. The term is normally associated with data sets so big that they are really complicated to store, process, and search through.

Big data is known to be a three-dimensional problem (defined by Gartner, Inc*), i.e. it has three problems associated with it:
1. increasing volume (amount of data)
2. velocity (speed of data in/out)
3. variety (range of data types, sources)

Why Big Data?

The bigger your dataset grows, the more information you can extract from it, and the better the precision of your results (assuming you’re using the right models, but that is not relevant for this post). You can also run better and more diverse analyses against the data. Corporations are growing their datasets more and more to get additional “juice” out of them: some to build better business models, others to improve user experience, others to get to know their audience better – the choices are virtually unlimited.

In the end, and in my opinion, big data analysis/management can be a competitive advantage for corporations. In some cases, a crucial one.

Big Data Management

Big data management software is not something you normally buy on the market as an “off-the-shelf” product (maybe Oracle wants to change this?). One of the biggest questions of big data management is: what do you want to do with the data? Knowing this is essential to minimize the problems related to huge data sets. Of course you can just store everything and later try to make some sense of the data you have. Again, in my opinion, this is the way to get a problem and not a solution/advantage.
Since you cannot just buy a big data management solution, a strategy has to be designed and followed until something is found that can work as a competitive advantage for the product/company.

Internally at LeaseWeb we’ve got a big data set, and we can work on it at real-time speed (we are using Cassandra** at the moment) and obtain the results we need. To get this working, we went through several trial-and-error iterations, but in the end we got what we needed and so far it is living up to expectations. How much hardware? How much development time? This all depends; the question you have to ask yourself is “What do I need?”, and once you have an answer to that, normal software planning and development time applies. It may even be the case that you don’t need Big Data at all, or that you can solve your problem using standard SQL technologies.
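
Purely as an illustration (this is not our actual code), here is a minimal sketch of querying Cassandra from Go with the gocql driver; the cluster address, keyspace, table and column names are hypothetical:

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical cluster and keyspace.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "analytics"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Hypothetical table holding page views per day.
	var views int
	if err := session.Query(
		`SELECT views FROM pageviews_by_day WHERE day = ?`, "2012-08-01",
	).Scan(&views); err != nil {
		log.Fatal(err)
	}
	fmt.Println("views:", views)
}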

In the end, our answer to “What do I need?” provided us with all the information we needed to find out what was best for us. In our case it was a mix of technologies, one of them being a NoSQL database.

* http://www.gartner.com/it/page.jsp?id=1731916
** http://cassandra.apache.org/

Setting up keepalived on Ubuntu (load balancing using HAProxy on Ubuntu part 2)

In our previous post we set up an HAProxy loadbalancer to balance the load of our web application between three webservers. Here’s the diagram of the situation we ended up with:

              +---------+
              |  uplink |
              +---------+
                   |
                   +
                   |
              +---------+
              | loadb01 |
              +---------+
                   |
     +-------------+-------------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+

As we already concluded in the last post, there’s still a single point of failure in this setup. If the loadbalancer dies for some reason, the whole site will be offline. In this post we will add a second loadbalancer and set up a virtual IP address shared between the loadbalancers. The setup will look like this:

              +---------+
              |  uplink |
              +---------+
                   |
                   +
                   |
+---------+   +---------+   +---------+
| loadb01 |---|virtualIP|---| loadb02 |
+---------+   +---------+   +---------+
                   |
     +-------------+-------------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+

So our setup now is:
– Three webservers, web01 (192.168.0.1), web02 (192.168.0.2) and web03 (192.168.0.3), each serving the application
– The first loadbalancer (loadb01, IP: 192.168.0.100)
– The second loadbalancer (loadb02, IP: 192.168.0.101); configure this in the same way as we configured the first one

To set up the virtual IP address we will use keepalived (as also suggested by Warren in the comments):

loadb01$ sudo apt-get install keepalived

Good, keepalived is now installed. Before we proceed with configuring keepalived itself, edit the following file:

loadb01$ sudo vi /etc/sysctl.conf

And add this line to the end of the file:

net.ipv4.ip_nonlocal_bind=1

This option is needed for applications (haproxy in this case) to be able to bind to non-local addresses (IP addresses which do not belong to an interface on the machine). To apply the setting, run the following command:

loadb01$ sudo sysctl -p
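
To see why this matters: if your haproxy frontend binds to the virtual IP address explicitly (192.168.0.200 in this series), that address is not a local address on the BACKUP machine, so haproxy could not start there without this setting. A minimal sketch of such a frontend (the backend name is just an example):

frontend http-in
    bind 192.168.0.200:80
    default_backend webservers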

Now let’s add the configuration for keepalived, open the file:

loadb01$ sudo vi /etc/keepalived/keepalived.conf

And add the following contents (see the comments for details on the configuration!):

# Settings for notifications
global_defs {
    notification_email {
        your@emailaddress.com			# Email address for notifications 
    }
    notification_email_from loadb01@domain.ext  # The from address for the notifications
    smtp_server 127.0.0.1			# You can specify your own SMTP server here
    smtp_connect_timeout 15
}
 
# Define the script used to check if haproxy is still working
vrrp_script chk_haproxy { 
    script "killall -0 haproxy" 
    interval 2 
    weight 2 
}
 
# Configuration for the virtual interface
vrrp_instance VI_1 {
    interface eth0
    state MASTER 				# set this to BACKUP on the other machine
    priority 101				# set this to 100 on the other machine
    virtual_router_id 51
 
    smtp_alert					# Activate email notifications
 
    authentication {
        auth_type AH
        auth_pass myPassw0rd			# Set this to some secret phrase
    }
 
    # The virtual ip address shared between the two loadbalancers
    virtual_ipaddress {
        192.168.0.200
    }
    
    # Use the script above to check if we should fail over
    track_script {
        chk_haproxy
    }
}

And start keepalived:

loadb01$ /etc/init.d/keepalived start

Now the next step is to install and configure keepalived on our second loadbalancer as well; redo the steps starting from apt-get install keepalived. In the configuration step for keepalived, be sure to change these two settings:

    state MASTER 				# set this to BACKUP on the other machine
    priority 101				# set this to 100 on the other machine

To:

    state BACKUP 			
    priority 100			

That’s it! We have now configured a virtual IP address shared between our two loadbalancers. You can try loading the haproxy statistics page on the virtual IP address and you should get the statistics for loadb01. Then switch off loadb01 and refresh: the virtual IP address will now be assigned to the second loadbalancer and you should see the statistics page for that one.
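
You can also check which machine currently holds the virtual IP address; it should show up under eth0 on the active loadbalancer:

loadb01$ ip addr show eth0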

In the next post we will focus on adding MySQL to this setup, as requested by Miquel in the comments on the previous post in this series. If there’s anything else you’d like us to cover, or if you have any questions, please leave a comment!