How to use the “yield” keyword in PHP 5.5 and up

The “yield” keyword is new in PHP 5.5. It allows you to write “generators”. Wikipedia explains generators accurately:

A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values. However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.

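To see the difference in its simplest form, here is a small made-up example (the function names are just for illustration): “squares_array” builds the complete array in memory before returning, while “squares_generator” produces the same values one at a time:

<?php
// Builds the whole array in memory and returns it at once.
function squares_array($n) {
  $result = array();
  for ($i = 1; $i <= $n; $i++) {
    $result[] = $i * $i;
  }
  return $result;
}

// Looks like a function, but the "yield" keyword turns it into a generator:
// each value is produced only when the caller asks for the next one.
function squares_generator($n) {
  for ($i = 1; $i <= $n; $i++) {
    yield $i * $i;
  }
}

// Both can be consumed with a plain foreach loop.
foreach (squares_generator(5) as $square) {
  echo $square."\n";
}
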
The concept of generators is not new. The “yield” keyword exists in other programming languages as well; as far as I know, C#, Ruby, Python, and JavaScript have it. The first use case that comes to mind for me is reading a big text file line by line (for instance a log file). Instead of reading the whole file into RAM, you can use an iterator and still keep a simple program flow: a “foreach” loop that iterates over all the lines. I wrote a small script in PHP that shows how to do this (efficiently) using the “yield” keyword:

<?php
class File {

  private $file;
  private $buffer;

  function __construct($filename, $mode) {
    $this->file = fopen($filename, $mode);
    $this->buffer = false;
  }

  // Yields the file contents in chunks of 8192 bytes.
  public function chunks() {
    while (true) {
      $chunk = fread($this->file, 8192);
      if (strlen($chunk)) yield $chunk;
      elseif (feof($this->file)) break;
    }
  }

  // Splits the chunks into lines; the last (possibly incomplete) line of each
  // chunk is buffered and prepended to the next chunk.
  public function lines() {
    foreach ($this->chunks() as $chunk) {
      $lines = explode("\n", $this->buffer.$chunk);
      $this->buffer = array_pop($lines);
      foreach ($lines as $line) yield $line;
    }
    if ($this->buffer !== false) {
      yield $this->buffer;
    }
  }

  // ... more methods ...
}

$f = new File("data.txt","r");
foreach ($f->lines() as $line) {
  echo memory_get_usage(true)."|$line\n";
}
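
One thing to keep in mind is that the class above never closes its file handle. If that matters for your use case, a destructor along these lines would take care of it:

  // Close the file handle when the object is destroyed.
  function __destruct() {
    if ($this->file) {
      fclose($this->file);
    }
  }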

One of my colleagues asked me why I used “fread” and did not simply call PHP’s “fgets” function (which reads a line from a file). I assumed he was right and that “fgets” would be faster. To my surprise, the above implementation is (on my machine) actually faster than the “fgets” variant shown below:

<?php
class File {

  private $file;

  function __construct($filename, $mode) {
    $this->file = fopen($filename, $mode);
  }

  // Yields the file line by line; note that fgets keeps the trailing newline.
  public function lines() {
    while (($line = fgets($this->file)) !== false) {
      yield $line;
    }
  }

  // ... more methods ...
}

$f = new File("data.txt","r");
foreach ($f->lines() as $line) {
  echo memory_get_usage(true)."|$line";
}
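
If you want to check this on your own machine, a quick and rather crude way to compare the two is to time a full pass over the file with each variant, for example like this (assuming you rename the two classes to “FreadFile” and “FgetsFile” so they can live in one script):

<?php
// Hypothetical comparison: the two classes from above, renamed so that they
// do not clash, each iterated once over the same file while measuring the time.
foreach (array('FreadFile', 'FgetsFile') as $class) {
  $f = new $class("data.txt", "r");
  $start = microtime(true);
  $count = 0;
  foreach ($f->lines() as $line) {
    $count++;
  }
  echo $class.": ".$count." lines in ".(microtime(true)-$start)." seconds\n";
}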

I played around with the two implementations above and found out that the execution speed and memory usage of the first implementation depend on the number of bytes read by “fread”. So I wrote a benchmark script:

<?php
class File {

  private $file;
  private $buffer;
  private $size;

  function __construct($filename, $mode, $size = 8192) {
    $this->file = fopen($filename, $mode);
    $this->buffer = false;
    $this->size = $size;
  }

  // Yields the file contents in chunks of $size bytes.
  public function chunks() {
    while (true) {
      $chunk = fread($this->file, $this->size);
      if (strlen($chunk)) yield $chunk;
      elseif (feof($this->file)) break;
    }
  }

  // Splits the chunks into lines, buffering the last partial line of each chunk.
  public function lines() {
    foreach ($this->chunks() as $chunk) {
      $lines = explode("\n", $this->buffer.$chunk);
      $this->buffer = array_pop($lines);
      foreach ($lines as $line) yield $line;
    }
    if ($this->buffer !== false) {
      yield $this->buffer;
    }
  }
}

echo "size;memory;time\n";
for ($i=6;$i<20;$i++) {
  $size = ceil(pow(2,$i));
  // "data.txt" is a text file of 897MB holding 40 million lines
  $f = new File("data.txt","r", $size);
  $time = microtime(true);
  foreach ($f->lines() as $line) {
    $line .= '';
  }
  echo $size.";".(memory_get_usage(true)/1000000).";".(microtime(true)-$time)."\n";
}

You can generate the “data.txt” file yourself. The first step is to save the above benchmark script as “yield.php”. After that, save the following Bash code in a file and run it:

#!/bin/bash
# concatenate 1000 copies of yield.php into an intermediate file
cp /dev/null data_s.txt
for i in {1..1000}
do
  cat yield.php >> data_s.txt
done
# concatenate 1000 copies of that file: 1,000,000 copies of yield.php in total
cp /dev/null data.txt
for i in {1..1000}
do
  cat data_s.txt >> data.txt
done
rm data_s.txt

I executed the benchmark script on my workstation and loaded its output into a spreadsheet so I could plot the graph below.

[Graph: memory usage and execution time per fread size]

As you can see, the best result is for an fread size of 16384 bytes (16 kB). With that fread size, the 40 million lines of the 897 MB text file were iterated in 11.88 seconds using less than 1 MB of RAM. I do not understand why the performance graph looks the way it does. I can see that reading small chunks of data is not efficient, since it requires many I/O operations, each with its own overhead. But why is reading large chunks also inefficient? It is a mystery to me, but maybe you know why. If you do, please use the comments and enlighten me (and the other readers).
