Linux context switching costs of inotify in Python

“A context switch is the process of storing and restoring the state (context) of a process or thread”, it is also “an essential feature of a multitasking operating system” according to Wikipedia.

In this post we go a little deeper into why context switching can be expensive and how you can spot these hidden costs. First, read the fantastic article on Tsuna’s blog titled “How long does it take to make a context switch?”. At the end, Benoit Sigoure writes:

Context switching is expensive. My rule of thumb is that it’ll cost you about 30µs of CPU overhead.

Okay, that is quite clear :-) Now how do we measure the context switches of an application? I wrote a little script in bash:

T="$(date +%s)"
"$@" > /dev/null &
pid=$!
watch -n1 grep ctxt /proc/$pid/status
V="$(cat /proc/$pid/status | grep ^voluntary_ctxt_switches | cut -f2)"
N="$(cat /proc/$pid/status | grep ^nonvoluntary_ctxt_switches | cut -f2)"
T="$(($(date +%s)-T))"
echo "voluntary_ctxt_switches: $((V/T))/sec"
echo "nonvoluntary_ctxt_switches: $((N/T))/sec"
kill -9 $pid

This script records the start time in seconds. Then it starts whatever you pass as arguments and pipes the output to /dev/null, backgrounding the process and storing its PID. We watch the context switch counters in real time until you press Ctrl-C. After that, we read the total number of context switches and the number of elapsed seconds, calculate the per-second averages, and finally (force) kill the application.
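If you prefer to read the counters programmatically instead of via `grep`, the same /proc fields can be parsed in a few lines of Python (the function name is mine, just for illustration):

```python
def ctxt_switches(pid):
    """Parse the (non)voluntary context switch counters from /proc/<pid>/status."""
    counters = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches",
                                "nonvoluntary_ctxt_switches")):
                key, value = line.split(":")
                counters[key] = int(value)  # int() strips the tab and newline
    return counters
```

Sample the function twice with a known delay between the calls and you get the same per-second averages the bash script prints.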

Inotify in Python

After the last post we are particularly interested in the context switching overhead of inotify in Python. I tested a simple Python inotify script using:

bash python2 /var/log/nginx/access.log
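The inotify script itself is not included in the post; a minimal sketch of what it could look like, written for Python 3 using ctypes against glibc (Linux only — the helper names and the watched path are mine, not the original script’s):

```python
import ctypes
import os
import struct

libc = ctypes.CDLL("libc.so.6", use_errno=True)
IN_MODIFY = 0x00000002
EVENT_HEADER = struct.Struct("iIII")  # wd, mask, cookie, name length

def add_watch(path, mask=IN_MODIFY):
    """Return an inotify fd that watches `path` for modifications."""
    fd = libc.inotify_init()
    if libc.inotify_add_watch(fd, path.encode(), mask) < 0:
        raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
    return fd

def read_event(fd):
    """Block until the next event and return its mask.
    Every wakeup from this blocking read costs a voluntary context switch."""
    buf = os.read(fd, 4096)
    wd, mask, cookie, name_len = EVENT_HEADER.unpack_from(buf)
    return mask
```

Each write to the access log wakes the script up, which is exactly where the context switches below come from.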

And in a separate terminal I ran:

siege -c 100 http://localhost/

I’ve found that the inotify script does about:

voluntary_ctxt_switches: 125/sec
nonvoluntary_ctxt_switches: 23/sec

148 × 30 microseconds = 4.44 ms of CPU spent on context switching every second. 4.44 ms out of 1000 ms ≈ 0.4% CPU load. And note that this is expected to increase linearly with a higher load on the Nginx server, because the benchmark is currently doing “only” 200 requests per second.
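A quick sanity check of that arithmetic in Python (the 30 µs per switch is the rule of thumb quoted earlier, not a measured value):

```python
switches_per_sec = 125 + 23    # voluntary + nonvoluntary, from the output above
cost_per_switch = 30e-6        # seconds of CPU per context switch (rule of thumb)
cpu_seconds = switches_per_sec * cost_per_switch
# cpu_seconds is the CPU time lost per wall-clock second
print("%.2f ms/sec = %.2f%% CPU" % (cpu_seconds * 1000, cpu_seconds * 100))
```

At 200 requests per second this is negligible, but at 20,000 requests per second the same math predicts a 44% CPU overhead.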

Nginx buffered access logging

My colleague Carlos pointed out that I was wrong in my last post and that inotify is a great choice for watching Nginx access logs (and he is right). The only thing you need to change is the number of I/O operations that Nginx does for the logging. By default the access log is not buffered, but you can easily change that in /etc/nginx/nginx.conf:

access_log /var/log/nginx/access.log;

To enable buffered logging change this into:

access_log /var/log/nginx/access.log combined buffer=1024k flush=5s;

Now Nginx will only write to disk once the buffer holds 1024 kB, or every 5 seconds, whichever comes first. You can easily verify the improved behavior using the bash script from the context switching post.
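The buffer-or-flush behavior is easy to model; a toy Python sketch of the logic (the class name and thresholds are illustrative, not Nginx internals):

```python
import time

class BufferedLog:
    """Toy model of Nginx 'buffer=... flush=...' logging: write to the sink
    only when the buffer fills up or the flush interval expires."""

    def __init__(self, sink, buffer_bytes=1024 * 1024, flush_seconds=5.0):
        self.sink = sink
        self.buffer_bytes = buffer_bytes
        self.flush_seconds = flush_seconds
        self.buf = []
        self.size = 0
        self.last_flush = time.monotonic()

    def log(self, line):
        self.buf.append(line)
        self.size += len(line)
        if (self.size >= self.buffer_bytes or
                time.monotonic() - self.last_flush >= self.flush_seconds):
            self.flush()

    def flush(self):
        self.sink(b"".join(self.buf))  # one big write instead of many small ones
        self.buf, self.size = [], 0
        self.last_flush = time.monotonic()
```

With a 1024 kB buffer, thousands of log lines collapse into a single write() call, which is exactly why the inotify watcher wakes up so much less often.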

Buffered Nginx log reading using PHP and logrotate

The problem: We want to continuously read the Nginx access logs on our high-traffic web nodes for real-time analytic purposes.

Logrotate every 10 seconds

Not possible! The minimum interval that logrotate allows is hourly. Apart from that, I do not think rotating every 10 seconds is very practical. I would still prefer to keep the log files hourly, for easy disk management. But I do not like an hour of delay on my statistics either; I would prefer something like one minute until the data reaches the MySQL database.

Using a named pipe with mkfifo

The idea is that you can make a pipe with “mkfifo” and configure Nginx to log to that file. Sounds good, until you realize that the buffer size of a pipe is quite limited (64 kB by default on Linux) and log lines will just silently be dropped on buffer overflow. Also, Igor Sysoev advised against it, and since he knows what he is talking about, I suggest you stay away from it. :-)
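You can measure the pipe capacity yourself; a small Python sketch that fills a pipe until further writes would block (the helper name is mine):

```python
import fcntl
import os

def pipe_capacity():
    """Fill a pipe without reading from it to find its buffer size in bytes."""
    r, w = os.pipe()
    fcntl.fcntl(w, fcntl.F_SETFL, os.O_NONBLOCK)
    total = 0
    chunk = b"x" * 4096
    try:
        while True:
            total += os.write(w, chunk)
    except BlockingIOError:
        # Buffer full: a blocking writer would stall here, and a logger
        # that refuses to stall has to drop the data instead.
        pass
    os.close(r)
    os.close(w)
    return total
```

On a default Linux box this prints 65536: a busy Nginx can produce that much access log in well under a second, so the warning about dropped lines is not theoretical.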

Watch the log file with inotify

The idea is to install a watcher on the log file that triggers an event whenever the file is written to. This is a good idea, but the number of events on a busy server may just be a little too high for this to be really efficient. We would rather not be notified on every single change the filesystem sees on the Nginx access log file. EDIT: This is actually not a bad idea as long as you enable buffered Nginx access logging, see next post.

Open the file and read until EOF every x seconds

Open the file and keep it open. Sleep for x seconds and read the file until the end. Send that data to your statistics cluster and return to the sleep. This works great until the file gets rotated: once that happens, you first need to finish reading the old file before you start reading the new one. Fortunately, logrotate moves the file by default, and an open moved file can still be read using the original file descriptor. When we find that the file was moved y seconds ago and has not been written to for z seconds, we can close the rotated file and open the new current one. This way we ensure that we do not lose any log lines. Obviously x, y, and z need values found by trial and error; I think five, three, and one seconds should be a good starting point.

Configure logrotate to rotate hourly for 48 hours

This can be done using the following logrotate script:

/var/log/nginx/*.log {
        hourly
        rotate 48
        missingok
        notifempty
        create 0640 www-data adm
        sharedscripts
        prerotate
                if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
                        run-parts /etc/logrotate.d/httpd-prerotate; \
                fi \
        endscript
        postrotate
                [ -s /run/ ] && kill -USR1 `cat /run/`
        endscript
}

Make sure you also move the logrotate cron job from daily to hourly using:

sudo mv /etc/cron.daily/logrotate /etc/cron.hourly/logrotate

Read the log files

The following code is an example in PHP that demonstrates how this log reader works. Note that Python and Go may be more suitable languages for implementing these kinds of long-running programs. This script just prints the data it reads. Streaming the log lines to your big data analytics cluster is left as an exercise to the reader. ;-)

$logfile = "/var/log/nginx/access.log";
$var_x = 5; // seconds to sleep between reads
$var_y = 3; // minimum seconds since the rename
$var_z = 1; // minimum seconds since the last write
$f = fopen($logfile, "r");
while (true) {
  $data = stream_get_contents($f);
  echo $data; // or send the data somewhere
  $s1 = fstat($f);
  clearstatcache(); // PHP caches stat() results
  $s2 = stat($logfile);
  $renamed  = $s1["dev"] != $s2["dev"] || $s1["ino"] != $s2["ino"];
  if ($renamed && time()-$s1["ctime"] > $var_y && time()-$s1["mtime"] > $var_z) {
    echo "renamed\n";
    fclose($f); // release the rotated file
    while (!file_exists($logfile)) { clearstatcache(); sleep(1); }
    $f = fopen($logfile, "r");
  } else sleep($var_x);
}

While running the script above you can make sure the access log is written to another terminal using:

siege -c 100 http://localhost/

Now you can force the rotation to be executed using:

sudo logrotate -f /etc/logrotate.d/nginx

And see how the script gracefully handles it.

Best Linux commercials (TV ads)

There are a few commercials (TV ads) for Linux. There are not many and they are not very well known either. So for old times’ sake and to promote the good cause, I will share my three favorite videos.

#1. IBM: The Future is Open

This is a Linux advertisement from IBM, released in September 2003 titled “The Future is Open”.

#2. RedHat Linux Commercial

This is a Linux advertisement from RedHat. I have not found any background information on it (does anyone know?).

#3. Linux Foundation: How Linux is Built

The following video was published in 2012 by the Linux Foundation. In my opinion, it is not as powerful as the above two.

Ideas or suggestions? Use the comments!

MySQL inserts duplicate records despite unique index

Okay, I thought I knew quite a bit about databases, so I did not expect any surprises when I added a unique index to my MySQL table. For those who do not know what a unique index is:

A UNIQUE index creates a constraint such that all values in the index must be distinct. An error occurs if you try to add a new row with a key value that matches an existing row. – MySQL manual

After adding the index the table looked like this:

CREATE TABLE `table` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  `c` int(11) DEFAULT NULL,
  UNIQUE KEY `a_b_c` (`a`,`b`,`c`)

And I was trying to insert or increment values in it like this:


I expected to get the following result:

mysql> select * from `table`;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 | NULL |    6 |
+------+------+------+
1 row in set (0.00 sec)

But it did not do that! I got this result instead:

mysql> select * from `table`;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 | NULL |    3 |
|    1 | NULL |    3 |
+------+------+------+
2 rows in set (0.00 sec)

I was clueless until I read the documentation of MySQL a little better:

This constraint does not apply to NULL values except for the BDB storage engine. For other engines, a UNIQUE index permits multiple NULL values for columns that can contain NULL. – MySQL manual

Especially the part where it says “does not apply to NULL values” makes things much, much clearer. :-) Note that Microsoft SQL Server behaves differently (the way I expected). So keep this in mind when using a unique index in MySQL, because I certainly did not expect this behavior!
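SQLite happens to treat NULLs in a UNIQUE index the same way as MySQL, so the surprise is easy to reproduce with a few lines of Python (the table and column names mirror the example above):

```python
import sqlite3

# SQLite, like MySQL, allows duplicate NULLs under a UNIQUE index,
# so this reproduces the two-row surprise from the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE UNIQUE INDEX a_b_c ON t (a, b, c)")
conn.execute("INSERT INTO t VALUES (1, NULL, 3)")
conn.execute("INSERT INTO t VALUES (1, NULL, 3)")  # no constraint violation!
count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 2 rows, not 1
```

Making column `b` NOT NULL (or using a sentinel value instead of NULL) restores the uniqueness guarantee you probably expected.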

Redis sorted set stores score as a floating point number

Today I was playing a little with our statistics. I was writing some “Go” (golang) code that requests the top 20 customers from Redis using a “Sorted Set” and the “ZREVRANGEBYSCORE” command. Then I found out that the score of a sorted set is actually stored as a double precision floating point number. Normally I would not be bothered by this and would just use the float storage for integer values.

But this time I wanted to make a top 20 of the real-time monthly data traffic of our entire CDN platform. A little background: hits on the Nginx edges are measured in bytes and the logs are streamed to our statistics cluster. Therefore, the real-time statistics counters for the traffic are in bytes. Normally we use 64-bit integers (in the worst case they are signed and you lose 1 bit).

2^63 - 1 = 9,223,372,036,854,775,807
            EB, PB, TB, GB, MB, kB,  b

If you Google for “9,223,372,036,854,775,807 bytes per month in gigabit per second” you will find that this is about 26 Tbps on average. We do not have such big customers yet, so that will do for now. So an “int64” will do, but what about the double precision float? Since it has a floating point, theory says it cannot count reliably once numbers become too large. But how large is too large? I quickly implemented a small script in Go to find out:

package main

import (
	"fmt"
)

func main() {
	bits := 1
	float := float64(1)
	for float+1 != float {
		float *= 2
		bits++
	}
	fmt.Printf("%.0f = %d bits\n", float, bits)
}
Every step the script doubles the number and tries to add 1 until it fails. This is the output showing when the counting goes wrong:

9007199254740992 = 54 bits

So from 54 bits the counting is no longer precise (to the byte). What does that mean for our CDN statistics? Let’s do the same calculation we did before:

2^53 = 9,007,199,254,740,992
        PB, TB, GB, MB, kB,  b

If you Google for: “9,007,199,254,740,992 bytes per month in gigabits per second” you will find that this is about 25 Gbps on a monthly average. We definitely have customers that do much more than that.

I quickly calculated that the deviation would be less than 0.000000000000001%. But then I realized I was wrong: at 26 Tbps average, the rounding error per increment can be as big as 1 kB (10 bits). Imagine that the customer is mainly serving images and JavaScript from the CDN and has an average file size of 10 kB. In that case the statistics may be off by 10% during the last days of the month!

Okay, this may be the worst-case scenario, but still I would not sleep well ignoring it. I feel that when it comes to CDN statistics, accuracy is very important. You are dealing with large numbers and lots of calculations, and as you see, this may have some unexpected side effects. That is why these kinds of seemingly small things keep me busy.