The lies about the necessity of a Big Rewrite

This post is about real-world software products that make money and have required multiple man-years of development time to build. This is an industry in which quality and costs, and thus professional software development, matter. The pragmatism and realism of Joel Spolsky’s blog on this type of software development are refreshing. He is also not afraid to speak up when he believes he is right, as on the “Big Rewrite” subject. In this post, I will argue why Joel Spolsky is right. Next, I will show the real reasons software developers want to do a Big Rewrite and how they justify it. Finally, Neil Gunton has a quote that will help you convince software developers that there is a different path that should be taken.

Big Rewrite: The worst strategic mistake?

Whenever a developer says a software product needs a complete rewrite, I always think of Joel Spolsky saying:

… the single worst strategic mistake that any software company can make: (They decided to) rewrite the code from scratch. – Joel Spolsky

You should definitely read the complete article, because it holds a lot of strong arguments to back the statement up, which I will not repeat here. He made this statement in the context of the Big Rewrite that Netscape did, which eventually led to Mozilla Firefox. In an interesting, very well-written counter-post, Adam Turoff writes:

Joel Spolsky is arguing that the Great Mozilla rewrite was a horrible decision in the short term, while Adam Wiggins is arguing that the same project was a wild success in the long term. Note that these positions do not contradict each other.

Indeed! I fully agree that these positions do not contradict each other. So the result was not bad, but it was still the worst strategic mistake the software company could make. But then he continues:

Joel’s logic has got more holes in it than a fishing net. If you’re dealing with a big here and a long now, whatever work you do right now is completely inconsequential compared to where the project will be five years from today or five million users from now. – Adam Turoff

Wait, what? Now he chooses Netscape’s side?! This argument makes absolutely no sense to me. Who knows what the software will require five years or five million users from now? For this to be true, the guys at Netscape must have been able to look into the future. If so, why did they not buy Apple stock? In my opinion, the observation that one cannot predict the future is enough reason to argue that deciding to do a Big Rewrite is always a mistake.

But what if you don’t want to make a leap into the future, but you are trying to catch up? What if your software product has gathered so much technical debt that a Big Rewrite is necessary? While this argument also feels logical, I will argue why it is not. Let us look at the different technical causes of technical debt and what should be done to counter them:

  • Lack of a test suite: this can be countered by adding tests, one at a time
  • Lack of documentation: writing it is not a popular task, but it can be done
  • Lack of loosely coupled components: dependency injection can be introduced one software component at a time (see the sketch after this list); your test suite will guarantee there is no regression
  • Parallel development: do not rewrite big pieces of code; keep the change sets small and merge often
  • Delayed refactoring: is refactoring much more expensive than rewriting? It may seem so due to the 80/20 rule, but it probably is not; just start doing it
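
To make the dependency injection point concrete, here is a minimal sketch (the class names are hypothetical and not tied to any framework): the refactored class keeps the old behavior as a default, so existing callers are untouched while tests can inject a fake.

<?php
// Sketch only: "SmtpMailer" and "InvoiceService" are hypothetical names.
class SmtpMailer {
  public function send($invoiceId) {
    echo "sending invoice $invoiceId\n";
  }
}

// Before: the class creates its own dependency, which makes it hard to test.
class InvoiceService {
  public function send($invoiceId) {
    $mailer = new SmtpMailer(); // hard-coded dependency
    $mailer->send($invoiceId);
  }
}

// After: the dependency is injected, with the old behavior as the default,
// so existing callers keep working and tests can pass in a fake mailer.
class RefactoredInvoiceService {
  private $mailer;

  public function __construct($mailer = null) {
    $this->mailer = $mailer ?: new SmtpMailer();
  }

  public function send($invoiceId) {
    $this->mailer->send($invoiceId);
  }
}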

And then we immediately get back to reality, which is what normally prevents us from doing a Big Rewrite: we need to tend the shop. We need to keep the current software from breaking down and we need to implement critical bug fixes and features. If this takes up all our time, because there is so much technical debt, then that debt may become a hurdle that seems too big to ever overcome. So realize that not being able to reserve time (or people) to get rid of technical debt can be the real reason someone asks for a Big Rewrite.

To conclude: a Big Rewrite is always a mistake. We cannot look into the future, and if there is technical debt, it should be acknowledged and countered the normal way.

The lies to justify a Big Rewrite

When a developer suggests a “complete rewrite”, this should be a red flag to you. The developer is most probably lying about the justification. The real reasons the developer is suggesting a Big Rewrite or “build from scratch” are:

  1. Not-Invented-Here syndrome (not understanding the software)
  2. Hard-to-solve bugs (which are not fun working on)
  3. Technical debt, including debt caused by missing tests and documentation (which are not fun working on)
  4. The developer wants to work on a different technology (which is more fun working on)

The lie is that the bugs and technical debt are presented as structural/fundamental problems that cannot realistically be fixed without a Big Rewrite. Five other typical lies (according to Chad Fowler) that the developer will promise in return for a Big Rewrite include:

  1. The system will be more maintainable (less bugs)
  2. It will be easier to add features (more development speed)
  3. The system will be more scalable (lower computation time)
  4. System response time will improve for our customers (less on-demand computation)
  5. We will have greater uptime (better high availability strategy)

Any code can be replaced incrementally, and all code must be replaced incrementally, just like bugs need to be solved and technical debt needs to be removed. Even when technology migrations are needed, they should be done incrementally, one part or component at a time, not with a Big Bang.

Conclusion

Joel Spolsky is right: you don’t need a Big Rewrite. Doing a Big Rewrite is the worst strategic mistake a software company can make. Or, as Neil Gunton puts it more gently and positively:

If you have a very successful application, don’t look at all that old, messy code as being “stale”. Look at it as a living organism that can perhaps be healed, and can evolve. – Neil Gunton

If a software developer is arguing that a Big Rewrite is needed, then remind him that the software is alive and that he is responsible for keeping it healthy and helping it grow to maturity.


Session locking: Non-blocking read-only sessions in PHP

I ran into an excellent article titled PHP Session Locks – How to Prevent Blocking Requests and it inspired me. Judging by the comments, it seems that not everybody fully understands what session locking is, how it works, and why it is necessary. This post tries to clear these things up and also gives you a dirty way of speeding up your AJAX calls significantly.

What session locking is

To understand this, we first need to know that a web server does not run your PHP code in a single process. Multiple worker processes run concurrently, and they all handle requests. Normally, a visitor’s requests for your web pages are serialized. This is also where HTTP persistent connections (a.k.a. keep-alives) come into play: by keeping the connection open for requesting all the assets of a page, the connection overhead is avoided. Browsers are quite smart and will always try to serialize requests for HTML pages. For the assets (images, scripts, etc.) on a page there is another strategy: the browser downloads multiple assets in parallel from each unique hostname it sees referenced in the HTML, either by opening multiple TCP connections or by pipelining. So whenever the browser thinks it is downloading assets, it may download them for a single visitor in parallel. Session locking removes this parallelism (by blocking) to provide reliable access to the session data in that situation.

How session locking works

This is quite easy: when you call “session_start()”, PHP will block (wait) in this call until the previous request holding the session lock has released it, either by calling “session_write_close()” or by finishing. On Linux, the default session handler does this by relying on the “flock()” call. This is an advisory locking mechanism that blocks until the lock is released. NB: This locking time is not counted as part of the “max_execution_time” (see: set_time_limit()).
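
This also means that a script which only reads from the session can release the lock early. A minimal sketch (the “user_id” key is just an example):

<?php
session_start();               // blocks until no other request holds the lock
$userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null;
session_write_close();         // write the session and release the lock now

// From here on, other requests for the same session are no longer blocked,
// but changes to $_SESSION will no longer be saved.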

Why session locking is necessary

Session locking prevents race conditions on the shared storage that holds the session data. Every PHP process reads the entire session data when it starts the session and writes it back when it closes it. This means that to reliably store the logging-in of a user (which is typically done in the session data), you must make sure that no other process has already read the session data and will overwrite your data after you have written it (since the last write wins). This is needed even more when using AJAX or iframes, since the browser considers those loads to be assets and not HTML pages (so they will be parallelized).
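
If session_start() did not take the lock, a classic lost update could occur. The sketch below (with a hypothetical “page_views” counter) shows the interleaving for two concurrent requests A and B:

<?php
// Both requests run this same code at the same time, without locking:
session_start();              // A reads page_views = 10, B also reads 10
$_SESSION['page_views'] += 1; // A computes 11, B computes 11 (each in its own copy)
session_write_close();        // A writes 11, then B writes 11: one view is lost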

Read-only sessions to the rescue

Many websites use AJAX calls to load data. While retrieving this data we would like to know whether the user is logged in, so we can deny access if needed. Moreover, we would not like the loading of this AJAX data to be serialized by the session locking, since that slows down the website. This is where the following (arguably dirty) code comes into play. It allows you to gain read-only access to the session data (call it instead of “session_start()”). This way you can check permissions in your AJAX call, but without locking, thus without blocking and serializing the requests. It may speed up your PHP powered AJAX website significantly!

            // Read-only session access: parses the session file directly and
            // fills $_SESSION without acquiring the session lock. This assumes
            // the default "files" save handler and the "php" serialize handler.
            function session_readonly()
            {
                    // Sanitize the session id from the cookie before using it in a path
                    $session_name = preg_replace('/[^\da-z]/i', '', $_COOKIE[session_name()]);
                    $session_data = file_get_contents(session_save_path().'/sess_'.$session_name);

                    $return_data = array();
                    $offset = 0;
                    // The "php" serialize handler stores the data as: name|serialized_value ...
                    while ($offset < strlen($session_data)) {
                        if (!strstr(substr($session_data, $offset), "|")) break;
                        $pos = strpos($session_data, "|", $offset);
                        $num = $pos - $offset;
                        $varname = substr($session_data, $offset, $num);
                        $offset += $num + 1;
                        $data = unserialize(substr($session_data, $offset));
                        $return_data[$varname] = $data;
                        // Skip past the value that was just read
                        $offset += strlen(serialize($data));
                    }
                    $_SESSION = $return_data;
            }
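
As a usage sketch (assuming a hypothetical “user_id” key that is set in the session at login), an AJAX endpoint could then look like this:

<?php
// ajax-data.php: hypothetical AJAX endpoint using the function above
require 'session_readonly.php';     // wherever the function above lives

session_readonly();                 // fill $_SESSION without taking the lock

if (empty($_SESSION['user_id'])) {  // 'user_id' is an assumed session key
    http_response_code(403);        // not logged in: deny access
    exit;
}

header('Content-Type: application/json');
echo json_encode(array('status' => 'ok'));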

I think this call should be added to the next PHP version. What do you think? Let me know in the comments.


Redis sorted set stores score as a floating point number

Today I was playing a little with our statistics. I was writing some Go (golang) code that requested the top 20 customers from Redis, using a “Sorted Set” and the “ZREVRANGEBYSCORE” command. Then I found out that the score of a sorted set is actually stored as a double precision floating point number. Normally this would not bother me and I would simply use the float storage for integer values.

But this time I wanted to make a top 20 of the real-time monthly data traffic of our entire CDN platform. A little background: hits on the Nginx edges are measured in bytes and the logs are streamed to our statistics cluster. Therefore, the real-time statistics counters for the traffic are in bytes. Normally we use 64 bit integers (in the worst case they are signed and you lose 1 bit).

2^63 - 1 = 9,223,372,036,854,775,807
           EB, PB, TB, GB, MB, kB,  b

If you Google for: “9,223,372,036,854,775,807 bytes per month in gigabits per second” you will find that this is about 26 Tbps on average. We do not have such big customers yet, so that will do for now. So an “int64” will do, but how about the double precision float? Since it has a floating point, theory says it cannot count reliably when numbers become too large. But how large is too large? I quickly implemented a small script in golang to find out:

package main

import (
	"fmt"
)

func main() {
	bits := 1
	float := float64(1)
	for float+1 != float {
		float *= 2
		bits++
	}
	fmt.Printf("%.0f = %d bits\n", float, bits)
}

Each step the script doubles the number and tries to add 1, until adding 1 no longer changes the number. This is the output, showing where the counting goes wrong:

9007199254740992 = 54 bits

So from 54 bits onward the counting is no longer precise (to the byte). What does that mean for our CDN statistics? Let’s do the same calculation we did before:

2^53 = 9,007,199,254,740,992
       PB, TB, GB, MB, kB,  b

If you Google for: “9,007,199,254,740,992 bytes per month in gigabits per second” you will find that this is about 25 Gbps on a monthly average. We definitely have customers that do much more than that.

I quickly calculated that the deviation would be less than 0.000000000000001%. But then I realized I was wrong: at 26 Tbps on average, the counter approaches 2^63 and the deviation per addition may be as big as 1 kB (the 10 “lost” bits). Imagine that the customer is mainly serving images and JavaScript from the CDN and has an average file size of 10 kB. In that case the traffic added during the last days of the month can be off by up to 10%!
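
You can see both effects directly in PHP as well, since PHP floats are also IEEE 754 double precision numbers. A small sketch (mine, not related to Redis itself):

<?php
$x = (float) pow(2, 53);
var_dump($x + 1 == $x);    // true: above 2^53 a double can no longer count by 1

// A byte counter near 2^63 (the ~26 Tbps per month scenario) has a
// granularity of 2^(63-52) = 2048 bytes, so small additions can vanish:
$y = pow(2, 63);           // already a float, it does not fit in a signed int64
var_dump($y + 1000 == $y); // true: a 1000 byte hit is rounded away completely
var_dump($y + 2048 == $y); // false: only steps of 2 kB actually register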

Okay, this may be the worst-case scenario, but I still would not sleep well ignoring it. I feel that when it comes to CDN statistics, accuracy is very important. You are dealing with large numbers and lots of calculations, and as you can see this may have unexpected side effects. That is why these kinds of seemingly small things keep me busy.


How to use the “yield” keyword in PHP 5.5 and up

The “yield” keyword is new in PHP 5.5. This keyword allows you to program “generators”. Wikipedia explains generators accurately:

A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values. However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.
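
As a minimal illustration (a made-up example, not from the original script below), a generator in PHP looks like an ordinary function that uses “yield” instead of “return”:

<?php
// Yields the numbers 1..$max one at a time, without building an array.
function numbers($max) {
  for ($i = 1; $i <= $max; $i++) {
    yield $i;
  }
}

foreach (numbers(3) as $number) {
  echo $number."\n"; // prints 1, 2 and 3
}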

The concept of generators is not new. The “yield” keyword exists in other programming languages as well. As far as I know C#, Ruby, Python, and JavaScript have this keyword. The first usage that comes to mind for me is when I want to read a big text file line-by-line (for instance a log file). Instead of reading the whole text file into RAM you can use an iterator and still have a simple program flow containing a “foreach” loop that iterates over all the lines. I wrote a small script in PHP that shows how to do this (efficiently) using the “yield” keyword:

<?php
class File {

  private $file;
  private $buffer;

  function __construct($filename, $mode) {
    $this->file = fopen($filename, $mode);
    $this->buffer = false;
  }

  // Yields the file contents in chunks of 8192 bytes.
  public function chunks() {
    while (true) {
      $chunk = fread($this->file,8192);
      if (strlen($chunk)) yield $chunk;
      elseif (feof($this->file)) break;
    }
  }

  // Splits the chunks into lines; $this->buffer holds the trailing partial
  // line until the next chunk (or the end of the file) completes it.
  function lines() {
    foreach ($this->chunks() as $chunk) {
      $lines = explode("\n",$this->buffer.$chunk);
      $this->buffer = array_pop($lines);
      foreach ($lines as $line) yield $line;
    }
    if ($this->buffer!==false) {
      yield $this->buffer;
    }
  }

  // ... more methods ...
}

$f = new File("data.txt","r");
foreach ($f->lines() as $line) {
  echo memory_get_usage(true)."|$line\n";
}

One of my colleagues asked me why I used “fread” and did not simply call PHP’s “fgets” function (which reads a line from a file). I assumed he was right and that using “fgets” would be faster. To my surprise, the above implementation is (on my machine) actually faster than the “fgets” variant shown below:

<?php
class File {

  private $file;

  function __construct($filename, $mode) {
    $this->file = fopen($filename, $mode);
  }

  function lines() {
    while (($line = fgets($this->file)) !== false) {
        yield $line;
    }
  }

  // ... more methods ...
}

$f = new File("data.txt","r");
foreach ($f->lines() as $line) {
  echo memory_get_usage(true)."|$line";
}

I played around with the two implementations above and found out that the execution speed and memory usage of the first implementation depend on the number of bytes read by “fread”. So I made a benchmark script:

<?php
class File {

  private $file;
  private $buffer;
  private $size;

  function __construct($filename, $mode, $size = 8192) {
    $this->file = fopen($filename, $mode);
    $this->buffer = false;
    $this->size = $size;
  }

  public function chunks() {
    while (true) {
      $chunk = fread($this->file,$this->size);
      if (strlen($chunk)) yield $chunk;
      elseif (feof($this->file)) break;
    }
  }

  function lines() {
    foreach ($this->chunks() as $chunk) {
      $lines = explode("\n",$this->buffer.$chunk);
      $this->buffer = array_pop($lines);
      foreach ($lines as $line) yield $line;
    }
    if ($this->buffer!==false) { 
      yield $this->buffer;
    }
  }
}

echo "size;memory;time\n";
for ($i=6;$i<20;$i++) {
  $size = ceil(pow(2,$i));
  // "data.txt" is a text file of 897MB holding 40 million lines
  $f = new File("data.txt","r", $size);
  $time = microtime(true);
  foreach ($f->lines() as $line) {
    $line .= '';
  }
  echo $size.";".(memory_get_usage(true)/1000000).";".(microtime(true)-$time)."\n";
}

You can generate the “data.txt” file yourself. The first step is to save the above script as “yield.php”. After that, save the following bash code in a file and run it:

#!/bin/bash
cp /dev/null data_s.txt
for i in {1..1000}
do
 cat yield.php >> data_s.txt
done
cp /dev/null data.txt
for i in {1..1000}
do
 cat data_s.txt >> data.txt
done
rm data_s.txt

I executed the benchmark script on my workstation and loaded its output into a spreadsheet so I could plot the graph below.

[Graph: benchmark results, plotting fread size against memory usage and execution time]

As you can see, the best score is for the 16384 bytes (16 kB) fread size. With that fread size, the 40 million lines from the 897 MB text file were iterated in 11.88 seconds using less than 1 MB of RAM. I do not understand why the performance graph looks the way it does. I can reason that reading small chunks of data is not efficient, since it requires many I/O operations that each have overhead. But why is reading large chunks inefficient? It is a mystery to me, but maybe you know why? If you do, then please use the comments and enlighten me (and the other readers).


ZFS – One File System to Rule Them All

ZFS [1] is one of the few enterprise-grade file systems with advanced storage features, such as in-line deduplication, in-line compression, copy-on-write, and snapshotting. These features are handy in a variety of scenarios, from backups to virtual machine image storage. A native port of ZFS is also available for Linux [2]. Here we take a look at the ZFS compression and deduplication features using some examples.

Setting ZFS up

ZFS handles disks very much like operating systems handle memory. This way, ZFS creates a logical separation between the file system and the physical disks. This logical separation is called a “pool” in ZFS terms.

Here we simply create a large file to mimic a disk via a loopback device and we create a pool on top:

# fallocate -l10G test1.img
# losetup /dev/loop0 test1.img
# zpool create testpool /dev/loop0
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testpool 9.94G 124K 9.94G 0% 1.00x ONLINE -

Let’s create a file; note that the pool gets mounted at /$POOLNAME:

# cd /testpool
# dd if=/dev/urandom of=randfile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 15.3166 s, 6.8 MB/s
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testpool 9.94G 100M 9.84G 0% 1.00x ONLINE -

Deduplication/compression

ZFS supports in-line deduplication and compression. This means that, if these features are enabled, the file system automatically finds duplicate data and deduplicates it, and compresses data that has compression potential. Here we show how deduplication can help save disk space:

# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testpool 9.94G 100M 9.84G 0% 1.00x ONLINE -
# cp randfile randfile2
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testpool 9.94G 200M 9.74G 1% 1.00x ONLINE -
# zfs create -o dedup=on testpool/deduplicated
# ls
deduplicated randfile randfile2
# mv randfile deduplicated/
# mv randfile2 deduplicated/
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testpool 9.94G 101M 9.84G 0% 2.00x ONLINE -

Here we show how compression, using the gzip algorithm, can help save disk space:

# zfs create -o compression=gzip testpool/compressed
# ls
compressed deduplicated linux-3.12.6
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testpool 9.94G 532M 9.42G 5% 1.00x ONLINE -
# mv linux-3.12.6 compressed/
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testpool 9.94G 155M 9.79G 1% 1.00x ONLINE -

Discussion

As you can see, deduplication and compression can save you some serious disk space. You can also enable deduplication and compression together, according to your needs. Deduplication is especially useful when there is a lot of duplicate data within or across files (e.g. virtual machine images). Compression is useful when the data itself is compressible (e.g. text, source code). Benefits aside, deduplication needs a hash table for detecting duplicates; depending on the data, you may need a couple of GB of memory per TB of data. Compression and decompression, on the other hand, burn a lot of your CPU cycles.

[1] http://en.wikipedia.org/wiki/ZFS
[2] http://zfsonlinux.org
