How to use the “yield” keyword in PHP 5.5 and up

The “yield” keyword is new in PHP 5.5. It allows you to write “generators”. Wikipedia explains generators accurately:

A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values. However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.
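
To make this concrete, here is a minimal example of my own (not from Wikipedia) that contrasts a generator with its array-building counterpart:

<?php
// Builds the complete array in memory before returning it.
function firstNumbersArray($limit) {
  $numbers = array();
  for ($i = 1; $i <= $limit; $i++) {
    $numbers[] = $i;
  }
  return $numbers;
}

// Looks like a function, but behaves like an iterator: each value
// is produced only when the caller asks for the next one.
function firstNumbersGenerator($limit) {
  for ($i = 1; $i <= $limit; $i++) {
    yield $i;
  }
}

foreach (firstNumbersGenerator(3) as $number) {
  echo $number."\n"; // prints 1, 2 and 3
}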

The concept of generators is not new; the “yield” keyword exists in other programming languages as well. As far as I know C#, Ruby, Python, and JavaScript have it. The first use case that comes to mind is reading a big text file line by line (for instance a log file). Instead of reading the whole file into RAM you can use an iterator and still have a simple program flow: a “foreach” loop that iterates over all the lines. I wrote a small script in PHP that shows how to do this (efficiently) using the “yield” keyword:

<?php
class File {

  private $file;
  private $buffer;

  function __construct($filename, $mode) {
    $this->file = fopen($filename, $mode);
    $this->buffer = false;
  }

  // Yield the file contents in chunks of 8192 bytes.
  public function chunks() {
    while (true) {
      $chunk = fread($this->file,8192);
      if (strlen($chunk)) yield $chunk;
      elseif (feof($this->file)) break;
    }
  }

  // Split the chunks into lines; the last (possibly incomplete) line
  // of each chunk is buffered until the next chunk completes it.
  public function lines() {
    foreach ($this->chunks() as $chunk) {
      $lines = explode("\n",$this->buffer.$chunk);
      $this->buffer = array_pop($lines);
      foreach ($lines as $line) yield $line;
    }
    // Yield the last line, unless the file ended with a newline.
    if ($this->buffer!==false && $this->buffer!=='') {
      yield $this->buffer;
    }
  }

  // ... more methods ...
}

$f = new File("data.txt","r");
foreach ($f->lines() as $line) {
  echo memory_get_usage(true)."|$line\n";
}
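
Note that because “lines()” is a generator, the caller may also stop early, in which case the rest of the file is never read. A minimal sketch (reusing the File class above):

<?php
// Sketch: print only the first 10 lines; the remaining chunks
// of the file are never read from disk.
$f = new File("data.txt","r");
$count = 0;
foreach ($f->lines() as $line) {
  echo $line."\n";
  if (++$count == 10) break;
}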

One of my colleagues asked me why I used “fread” instead of simply calling PHP’s “fgets” function (which reads a line from a file). I assumed he was right and that fgets would be faster. To my surprise the above implementation is (on my machine) actually faster than the “fgets” variant shown below:

<?php
class File {

  private $file;

  function __construct($filename, $mode) {
    $this->file = fopen($filename, $mode);
  }

  // Yield the file line by line; note that fgets keeps the
  // trailing newline character on each line.
  public function lines() {
    while (($line = fgets($this->file)) !== false) {
      yield $line;
    }
  }

  // ... more methods ...
}

$f = new File("data.txt","r");
foreach ($f->lines() as $line) {
  echo memory_get_usage(true)."|$line";
}

I played around with the two implementations above and found that the execution speed and memory usage of the first implementation depend on the number of bytes read by “fread”. So I made a benchmark script:

<?php
class File {

  private $file;
  private $buffer;
  private $size;

  function __construct($filename, $mode, $size = 8192) {
    $this->file = fopen($filename, $mode);
    $this->buffer = false;
    $this->size = $size;
  }

  // Yield the file contents in chunks of $size bytes.
  public function chunks() {
    while (true) {
      $chunk = fread($this->file,$this->size);
      if (strlen($chunk)) yield $chunk;
      elseif (feof($this->file)) break;
    }
  }

  // Split the chunks into lines, buffering incomplete lines.
  public function lines() {
    foreach ($this->chunks() as $chunk) {
      $lines = explode("\n",$this->buffer.$chunk);
      $this->buffer = array_pop($lines);
      foreach ($lines as $line) yield $line;
    }
    if ($this->buffer!==false && $this->buffer!=='') {
      yield $this->buffer;
    }
  }
}

echo "size;memory;time\n";
for ($i=6;$i<20;$i++) {
  $size = ceil(pow(2,$i));
  // "data.txt" is a text file of 897MB holding 40 million lines
  $f = new File("data.txt","r", $size);
  $time = microtime(true);
  foreach ($f->lines() as $line) {
    $line .= '';
  }
  echo $size.";".(memory_get_usage(true)/1000000).";".(microtime(true)-$time)."\n";
}

You can generate the “data.txt” file yourself. The first step is to save the above script as “yield.php”. After that, save the following bash code in a file and run it:

#!/bin/bash
# Write 1000 copies of "yield.php" to a temporary file...
cp /dev/null data_s.txt
for i in {1..1000}
do
  cat yield.php >> data_s.txt
done
# ...then write 1000 copies of that file to "data.txt", resulting
# in one million concatenated copies of "yield.php".
cp /dev/null data.txt
for i in {1..1000}
do
  cat data_s.txt >> data.txt
done
rm data_s.txt
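
If you prefer to stay in PHP, a rough equivalent of this bash script (a sketch, producing the same one million concatenated copies of “yield.php”) is:

<?php
// Sketch: build "data.txt" out of one million copies of
// "yield.php", just like the bash script above.
$source = file_get_contents("yield.php");
$out = fopen("data.txt", "w");
for ($i = 0; $i < 1000000; $i++) {
  fwrite($out, $source);
}
fclose($out);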

I executed the benchmark script on my workstation and loaded its output into a spreadsheet so I could plot the graph below.

[Graph: fread size in bytes (x-axis) versus memory usage in MB and execution time in seconds (y-axis)]

As you can see, the best score is achieved at an fread size of 16384 bytes (16 kB). With that fread size, the 40 million lines of the 897 MB text file were iterated in 11.88 seconds using less than 1 MB of RAM. I do not understand why the performance graph looks the way it does. I can see that reading small chunks of data is inefficient, since it requires many I/O operations that each carry overhead. But why is reading large chunks also inefficient? It is a mystery to me, but maybe you know why? If you do, please use the comments and enlighten me (and the other readers).

8 Responses to “How to use the “yield” keyword in PHP 5.5 and up”

  • Mirco:

    A file is a stream in PHP’s core C code. Each stream has an internal C-level buffer. My guess is that this buffer is 16384 bytes in size, so when fread-ing 16384-byte chunks you internally have the least amount of overhead.

    When reading bigger chunks, things must be concatenated in memory, and that is expensive.

  • Maurits van der Schee (Innovation Engineer):

    @Mirco: That is a good explanation! Thank you.

  • Joshua:

    Mirco is right about the file-is-a-stream comment. However, file streams in PHP actually use an 8 kB stream buffer by default (not 100% sure, but I think every stream uses this default).

    One way to test this is by looking at the output of strace. When you read, say, 10 bytes from a file, you will notice it reads 8192 bytes once, even though we do two fread()s.

    The function responsible for reading from a stream buffers data inside an internal stream buffer. The size of this buffer is most often 8 kB, but it actually depends on a variable inside the stream resource. A really nice feature is that this variable CAN be modified with the stream_set_chunk_size() function.

    Try and strace it yourself with the following two snippets: https://gist.github.com/jaytaph/8d0e9b481cdf47e7742e
    Here you will see that the first snippet reads one 8 kB block, while the other reads a 5-byte block twice.

    But performance depends on many other factors as well, mostly the OS and the way it caches and buffers files. Different setups will most likely produce different benchmark results.
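
    The idea behind those two snippets is roughly the following (a sketch, assumed rather than copied from the gist):

    <?php
    // Snippet 1 (assumed): with the default chunk size, two small
    // fread() calls show up in strace as one 8192-byte read() syscall.
    $f = fopen('data.txt', 'r');
    fread($f, 5);
    fread($f, 5);

    // Snippet 2 (assumed): after shrinking the chunk size, the same
    // two fread() calls become two 5-byte read() syscalls.
    $g = fopen('data.txt', 'r');
    stream_set_chunk_size($g, 5);
    fread($g, 5);
    fread($g, 5);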

  • Maurits van der Schee (Innovation Engineer):

    @Joshua: Great reply, you got me enthusiastic! I immediately tried calling “stream_set_chunk_size()” in the constructor. Unfortunately I did not see any performance improvement. I hoped it would make a big difference, but it did not. Probably some OS behavior, right? I’m on Ubuntu 14.04.
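
    A minimal sketch of that attempt (reusing $size as the chunk size is an assumption, not the exact code I ran):

    <?php
    // Sketch: inside the File class of the benchmark script, align
    // the stream's internal chunk size with the fread size.
    function __construct($filename, $mode, $size = 8192) {
      $this->file = fopen($filename, $mode);
      $this->buffer = false;
      $this->size = $size;
      stream_set_chunk_size($this->file, $size); // assumption: $size reused
    }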

  • Vladimir:

    Don’t you forget the last line in the file? When all chunks are processed, the buffer may not be empty.

  • Maurits van der Schee (Innovation Engineer):

    @Vladimir: Good point. Thank you, corrected!

  • Laurence:

    Also worth considering is the CPU’s internal cache. If you start to exceed it, data has to be moved to and from main memory multiple times.

  • Maurits van der Schee (Innovation Engineer):

    @Laurence: Good point, the L2 cache size can make a difference. But how do I measure that? And how does that translate into code? Can it help me squeeze more throughput out of PHP?
