Iterators and 10 other Python 3 upgrade reasons

I ran into a very interesting presentation by Aaron Meurer titled:

10 awesome features of Python that you can’t use because you refuse to upgrade to Python 3

Aaron gave the presentation on April 9, 2014 at the Austin Python User Group (APUG). Austin is a bit too far from Amsterdam to just go and see the talk, but I am glad we get access to his slides, because they are awesome! Below I quote the slides on the “iterator” improvements in Python 3, which he explains exceptionally well.

Why iterators are good

Aaron Meurer lists the advantages of using iterators (and generators) on slide 58 (a short Python sketch follows the list):

  • Only one value is computed at a time. Low memory impact (see range example below).
  • Can break in the middle. Don’t have to compute everything just to find out you needed none of it. Compute just what you need. If you often don’t need it all, you can gain a lot of performance here.
  • If you need a list (e.g., for slicing), just call list() on the generator.
  • Function state is “saved” between yields.
  • This leads to interesting possibilities, à la coroutines…
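
To make these points concrete, here is a minimal Python 3 sketch of my own (not from Aaron’s slides); the names are purely illustrative:

import sys

# range() is lazy: this "sequence" of a billion numbers occupies only a
# handful of bytes, because each value is computed on demand.
print(sys.getsizeof(range(10**9)))  # e.g. 48 bytes on CPython

def squares():
    """A generator: function state is saved between yields."""
    n = 0
    while True:
        yield n * n
        n += 1

# Break in the middle: only the values we actually ask for are computed.
for square in squares():
    if square > 100:
        break

# If you need a list after all (e.g., for slicing), just call list().
first_ten = list(range(10))  # [0, 1, 2, ..., 9]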

Why Python 2 was not so good

Then he states that in Python 2 things were not that good (slides 42-50), since you could accidentally forget to use iterators (see the sketch after the list):

  • Don’t write range or zip or dict.values or …. If you do…
  • (screenshot: python_iterator)
  • Instead write some variant (xrange, itertools.izip, dict.itervalues, …).
  • Inconsistent API, anyone?
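
For illustration, a rough sketch of what this looked like; this snippet is mine, not from the slides, and runs only under Python 2:

# Python 2: the plain spellings build full lists in memory,
# proportional to the size of the input.
numbers = range(10**6)         # a real list of one million ints
pairs = zip(numbers, numbers)  # another full list, of a million tuples

# To stay lazy, you had to remember the variant spellings instead:
from itertools import izip
lazy_numbers = xrange(10**6)                   # lazy counterpart of range
lazy_pairs = izip(lazy_numbers, lazy_numbers)  # lazy counterpart of zip
lazy_values = {"a": 1}.itervalues()            # lazy counterpart of dict.values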

Why Python 3 is much better

On slide 51, Aaron Meurer states that things have improved in Python 3: iterators are used by default. He explains that you can still get a list if you need one (see the example below):

  • In Python 3, range, zip, map, dict.values, etc. are all iterators.
  • If you want a list, just wrap the result with list.
  • Explicit is better than implicit.
  • Harder to write code that accidentally uses too much memory, because the input was bigger than you expected.
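
A small Python 3 example of my own showing the unified behavior:

# Python 3: the plain spellings are lazy by default.
r = range(5)
print(r)            # range(0, 5) -- not a materialized list
print(list(r))      # [0, 1, 2, 3, 4] -- explicit when you want a list

pairs = zip([1, 2], "ab")
print(list(pairs))  # [(1, 'a'), (2, 'b')]

d = {"x": 1, "y": 2}
print(list(d.values()))  # [1, 2] -- dict.values() is a lazy view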

The 10 other features

Interested in the 10 other features “you can’t use because you refuse to upgrade to Python 3”? Take a look at the awesome presentation by Aaron Meurer on his GitHub account:

http://asmeurer.github.io/python3-presentation/slides.html

How to use the “yield” keyword in PHP 5.5 and up

The “yield” keyword is new in PHP 5.5. It allows you to write “generators”. Wikipedia explains generators accurately:

A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values. However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.

The concept of generators is not new; the “yield” keyword exists in other programming languages as well. As far as I know, C#, Ruby, Python, and JavaScript have it. The first use that comes to mind for me is reading a big text file line by line (for instance, a log file). Instead of reading the whole file into RAM, you can use an iterator and still keep a simple program flow: a “foreach” loop that iterates over all the lines. I wrote a small script in PHP that shows how to do this (efficiently) using the “yield” keyword:

<?php
class File {

  private $file;
  private $buffer;

  function __construct($filename, $mode) {
    $this->file = fopen($filename, $mode);
    $this->buffer = false;
  }

  // Generator: yields the file contents in chunks of 8 kB.
  public function chunks() {
    while (true) {
      $chunk = fread($this->file, 8192);
      if (strlen($chunk)) yield $chunk;
      elseif (feof($this->file)) break;
    }
  }

  // Generator: splits the chunks into lines, buffering the (possibly
  // incomplete) last line of each chunk until the next chunk arrives.
  public function lines() {
    foreach ($this->chunks() as $chunk) {
      $lines = explode("\n", $this->buffer.$chunk);
      $this->buffer = array_pop($lines);
      foreach ($lines as $line) yield $line;
    }
    // Yield the remainder, unless the file ended with a newline
    // (in which case the buffer holds an empty string).
    if ($this->buffer !== false && $this->buffer !== '') {
      yield $this->buffer;
    }
  }

  // ... more methods ...
}

$f = new File("data.txt", "r");
foreach ($f->lines() as $line) {
  echo memory_get_usage(true)."|$line\n";
}

One of my colleagues asked me why I used “fread” and did not simply call PHP’s “fgets” function (which reads a single line from a file). I assumed that he was right and that “fgets” would be faster. To my surprise, the implementation above is (on my machine) actually faster than the “fgets” variant shown below:

<?php
class File {

  private $file;

  function __construct($filename, $mode) {
    $this->file = fopen($filename, $mode);
  }

  // Generator: fgets() reads one line at a time, newline included.
  public function lines() {
    while (($line = fgets($this->file)) !== false) {
      yield $line;
    }
  }

  // ... more methods ...
}

$f = new File("data.txt", "r");
foreach ($f->lines() as $line) {
  // No "\n" needed here: fgets() keeps the trailing newline.
  echo memory_get_usage(true)."|$line";
}

I played around with the two implementations above and found out that the execution speed and memory usage of the first implementation depend on the number of bytes read per “fread” call. So I made a benchmark script:

<?php
class File {

  private $file;
  private $buffer;
  private $size;

  function __construct($filename, $mode, $size = 8192) {
    $this->file = fopen($filename, $mode);
    $this->buffer = false;
    $this->size = $size;
  }

  // Generator: yields the file contents in chunks of $size bytes.
  public function chunks() {
    while (true) {
      $chunk = fread($this->file, $this->size);
      if (strlen($chunk)) yield $chunk;
      elseif (feof($this->file)) break;
    }
  }

  // Generator: splits the chunks into lines, buffering the partial last line.
  public function lines() {
    foreach ($this->chunks() as $chunk) {
      $lines = explode("\n", $this->buffer.$chunk);
      $this->buffer = array_pop($lines);
      foreach ($lines as $line) yield $line;
    }
    if ($this->buffer !== false && $this->buffer !== '') {
      yield $this->buffer;
    }
  }
}

echo "size;memory;time\n";
// Try fread sizes from 2^6 (64 bytes) up to 2^19 (512 kB).
for ($i = 6; $i < 20; $i++) {
  $size = pow(2, $i);
  // "data.txt" is a text file of 897 MB holding 40 million lines
  $f = new File("data.txt", "r", $size);
  $time = microtime(true);
  foreach ($f->lines() as $line) {
    $line .= ''; // no-op: we only need to consume the iterator
  }
  echo $size.";".(memory_get_usage(true)/1000000).";".(microtime(true)-$time)."\n";
}

You can generate the “data.txt” file yourself. The first step is to save the benchmark script above as “yield.php”. After that, save the following bash code in a file and run it:

#!/bin/bash
# Concatenate yield.php 1000 times, then concatenate the result another
# 1000 times, producing a data.txt of roughly 900 MB.
cp /dev/null data_s.txt
for i in {1..1000}
do
  cat yield.php >> data_s.txt
done
cp /dev/null data.txt
for i in {1..1000}
do
  cat data_s.txt >> data.txt
done
rm data_s.txt

I executed the benchmark script on my workstation and loaded its output into a spreadsheet so I could plot the graph below.

(Graph: fread size versus memory usage and execution time, plotted from the benchmark output.)

As you can see, the best score is for a fread size of 16384 bytes (16 kB). With that fread size, the 40 million lines of the 897 MB text file were iterated in 11.88 seconds, using less than 1 MB of RAM. I do not understand why the performance graph looks the way it does. I can see why reading small chunks is inefficient: it requires many I/O operations, each with its own overhead. But why is reading large chunks also inefficient? It is a mystery to me, but maybe you know why. If you do, please use the comments and enlighten me (and the other readers).
