Affordable real-time Big Data with streaming analytics

In our CDN PoPs in various locations worldwide we analyze between 10,000 and 100,000 access log lines per second. We collect this data to provide analytics and, even more importantly, billing information. The most important question is: how much data was served for each customer? This is information we need to collect and present in our control panel, and obviously it needs to be 100% correct. In our systems each log line holds over 20 fields, and these fields need to be analyzed in real time. Real-time is a bit of a flexible concept: some people claim that a 24-hour delay is still real-time. When we talk about real-time, we aim for an update interval of half a minute.

So effectively our clusters need to analyze 2 million values per second and turn them into a set of 30 real-time statistics that update every 30 seconds.

There is only so little time

Let’s assume you are only willing to spend 10% extra on hardware to add analytics to your product; then the amount of time you can spend per data point is limited. If at peak time a machine handles on average 1000 requests per second and each machine is a dual hexacore server (12 cores), then you can spend a maximum of (1/1000) * 12 * 0.10 = 0.0012 seconds = 1.2 milliseconds of single-threaded CPU time on analytics per log line. For I/O we can do a similar calculation with disks instead of cores. You must agree that we are on a tight budget, considering that a 15k SAS disk’s average seek time is about 2 ms. Also note that it does not matter on which machine or tier this calculation happens, as the budget stays the same.
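As a back-of-the-envelope check, the budget works out as in the minimal sketch below; the request rate, core count and 10% share are the assumptions from the paragraph above.

# Back-of-the-envelope CPU budget per log line (assumed figures:
# 1000 requests/s per machine, 12 cores, 10% of capacity reserved for analytics).
requests_per_second = 1000
cores = 12                    # dual hexacore server
analytics_share = 0.10

budget_seconds = (1.0 / requests_per_second) * cores * analytics_share
print("CPU budget per log line: %.2f ms" % (budget_seconds * 1000))   # ~1.20 ms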

Traditional Big Data approach

You collect all the raw data you have and write it to a DFS (like the Hadoop Distributed File System). Then you define MapReduce jobs that actually go over the data and aggregate it into a result format. That result is either stored in a NoSQL database, on a DFS, or, if sufficiently reduced, in a relational database (like MySQL). To aggregate already aggregated data we read the result from disk and aggregate once more. In a very naive approach the disk I/O is high, as the data is written to log files on the edge, then read from the logs and written to the DFS, and then read and written again for each aggregation level and/or each different kind of statistic.
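For comparison, a naive Hadoop-streaming style job answering the "bytes per customer" question could look roughly like the sketch below. The tab-separated field positions are made up for illustration, and the map and reduce steps are combined in one process purely to keep the example self-contained.

# Hypothetical Hadoop-streaming style job: total bytes served per customer.
# Assumes tab-separated log lines with the customer id in column 3 and the
# response size in bytes in column 7.
import sys
from itertools import groupby

def map_lines(lines):
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[2], int(fields[6])              # (customer, bytes)

def reduce_pairs(pairs):
    for customer, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield customer, sum(size for _, size in group)

if __name__ == "__main__":
    for customer, total_bytes in reduce_pairs(map_lines(sys.stdin)):
        print("%s\t%d" % (customer, total_bytes))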

Streaming analytics approach

Figure 1: An example of a streaming analytics infrastructure with the data flowing from left to right.

This approach is much simpler and performs better. We read the data directly from the log files right after it is written, while it is probably still memory mapped, which reduces the I/O on the nodes. Updating all counters for the resulting statistics in RAM directly after the data is read is also relatively cheap. At a certain interval (the flush rate) all these statistics are removed from RAM and sent upstream to an aggregation server. On the aggregation server all counters are again updated in RAM and then written to disk in the final format. This final format is either MySQL or files on disk in the format in which they will be queried.
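A minimal sketch of the node-side part of this pipeline is shown below. It assumes tab-separated log lines with the customer and byte fields at fixed positions, and an aggregation server that accepts a JSON POST; the endpoint, field positions and log path are hypothetical.

# Sketch of a node-side streaming aggregator: follow the access log, update
# counters in RAM and flush them upstream every 30 seconds.
import json
import time
import urllib.request
from collections import defaultdict

FLUSH_INTERVAL = 30                                       # seconds (the flush rate)
AGGREGATOR_URL = "http://aggregator.example.com/flush"    # hypothetical endpoint

def new_counters():
    return defaultdict(lambda: {"hits": 0, "bytes": 0})

counters = new_counters()

def process_line(line):
    fields = line.rstrip("\n").split("\t")
    customer, size = fields[2], int(fields[6])            # assumed field positions
    counters[customer]["hits"] += 1
    counters[customer]["bytes"] += size

def flush():
    global counters
    payload, counters = counters, new_counters()
    request = urllib.request.Request(
        AGGREGATOR_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)                       # send counters upstream

def follow(path):
    with open(path) as handle:
        last_flush = time.time()
        while True:
            line = handle.readline()
            if line:
                process_line(line)
            else:
                time.sleep(0.1)                           # wait for new log lines
            if time.time() - last_flush >= FLUSH_INTERVAL:
                flush()
                last_flush = time.time()

if __name__ == "__main__":
    follow("/var/log/nginx/access.log")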

I/O is limited, plenty of CPU

Most calculations done in analytics are extremely simple. They merely consist of counting occurrences. Sometimes only occurrences of certain combinations are counted, and sometimes other fields are summed, like the number of bytes transferred or seconds spent. Note that the combination of a sum and a count can then lead to an average. These simple calculations do not require much CPU. But every time a number is incremented it needs to be read and written, and the amount of available I/O is also limited. Faster disks are expensive, especially when there is a need for high capacity.
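Because an average can always be derived afterwards from a sum and a count, the counters themselves stay trivially mergeable across nodes and aggregation levels. A tiny illustration (the numbers are made up):

# Counters that only count and sum can be merged at any level;
# the average is derived at the end, never stored.
node_a = {"hits": 1200, "bytes": 9600000}
node_b = {"hits": 800, "bytes": 5200000}

merged = {key: node_a[key] + node_b[key] for key in node_a}
average_object_size = merged["bytes"] / merged["hits"]    # 7400.0 bytes per hit
print(merged, average_object_size)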

RAM to the rescue

The only way to reduce the I/O is to keep the numbers in RAM (memory). This way they do not have to be read from or written to disk. This has its own disadvantage: your RAM capacity needs to be high or your result data needs to be small. Keeping the result data small is actually valuable. Humans cannot process tens of thousands of data points anyway, so why not reduce them to the most valuable information possible? And let's do it as early in the process as possible, so as not to waste any expensive CPU, I/O or RAM. This is how streaming analytics works. Log files are only written to disk once (on the edge nodes) and then processed in RAM, aggregated in RAM, sent over HTTP to a central server, where they are again aggregated in RAM and finally inserted into a traditional relational database, since by then they are sufficiently reduced in size.

MySQL is not so bad

MySQL can actually handle many insert or update queries per second. On my SSD-driven i5 laptop, using a non-optimized (default) MySQL with extended queries, I got over 15,000 values updated per second, which was a lot higher than I assumed possible. If needed, you can store large sets of values in text fields in MySQL (in the resulting JSON format) to reduce the load on MySQL, but you then also give up the flexibility to query them any way you like. The flush rate tells us how often the aggregated analytics are sent upstream. Lowering the flush rate increases the total delay in the analytics, but on the other hand lowers the number of writes needed on MySQL. This is easy to see, as every counter is then updated less frequently and with a larger increment.
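To give an idea, the final write could be a single extended upsert per flush, so every counter row is touched once per flush interval instead of once per log line. This is only a sketch; the traffic_stats table and its columns are assumptions.

# Build one extended INSERT ... ON DUPLICATE KEY UPDATE statement per flush.
def build_upsert(rows):
    """rows: list of (customer_id, day, hits, bytes) tuples."""
    placeholders = ",".join("(%s, %s, %s, %s)" for _ in rows)
    return (
        "INSERT INTO traffic_stats (customer_id, day, hits, bytes) "
        "VALUES " + placeholders + " "
        "ON DUPLICATE KEY UPDATE "
        "hits = hits + VALUES(hits), bytes = bytes + VALUES(bytes)"
    )

rows = [(42, "2014-06-01", 1200, 9600000), (7, "2014-06-01", 800, 5200000)]
sql = build_upsert(rows)
params = [value for row in rows for value in row]
# cursor.execute(sql, params)   # with any MySQL driver, e.g. MySQLdb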

High availability

When a node in the system goes down it clearly does not serve any traffic anymore, so it also does not have to do statistics. If the aggregation or MySQL server goes down, the nodes can stop further log analysis and only continue once the server(s) are available again. Web nodes are typically configured to hold several days of logging information on disk, so this would not cause a problem. The nodes may have a slightly higher load while they are catching up for lost time, but the speed during recovery can easily be throttled to minimize this problem.

Optimizations

The most important rule is to keep the resulting statistics limited in size (relatively small); otherwise you are using too much RAM and have to fall back to slow disk I/O. By having a retention policy per statistic, you can make sure you delete any data that is no longer relevant. Obviously MySQL can be optimized by using (and omitting) indexes and changing its disk writing strategy. One of the most powerful knobs is making the flush rate configurable per statistic, so you can tune the performance of the system and either lower the delay or lower the number of writes. For top lists you can also apply length limitations that lower the correctness slightly, but greatly reduce the data transfer and RAM usage. A last resort is to only process a sample of the log lines for certain less important statistics of certain higher-volume customers.
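One way to expose these knobs is a small per-statistic configuration plus a trim step for top lists. The names and values below are purely illustrative.

# Illustrative per-statistic tuning knobs: flush rate, retention, top-list
# length and a sampling rate as a last resort.
STATISTIC_CONFIG = {
    "bytes_per_customer":   {"flush_seconds": 30,  "retention_days": 400, "sample_rate": 1.0},
    "top_referrers":        {"flush_seconds": 60,  "retention_days": 30,  "top_n": 100, "sample_rate": 1.0},
    "user_agent_breakdown": {"flush_seconds": 300, "retention_days": 7,   "top_n": 50,  "sample_rate": 0.1},
}

def trim_top_list(counter, top_n):
    """Keep only the top_n largest entries of a top list before flushing it."""
    ranked = sorted(counter.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_n])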

Conclusion

It seems that for real-time log file analysis this “Streaming Analytics” approach is much more suitable than a more traditional Big Data approach. The only real downside is that you have to keep the resulting data small in size, so that it fits in RAM. We’ve calculated that we are able to achieve more than a factor of 10 higher performance compared to our previous implementation. Now that is what I call innovation!

Building software is like building a LEGO house

One of the problems of the software development industry is non-software people making decisions about how software is built. Many of these people believe that building software is like building houses, which it is not. People look at the software business and approach it like building a house: specifying upfront and asking when it will be finished, assigning project managers as general contractors and hiring cheap labor to execute the build. I can hear some managers who are reading this think: Wow! Can we do that? That is the answer! No, it is not; we tried it for more than a decade and it failed spectacularly. They even came up with a name for it: “out-sourcing”.

But even though we have moved from Big Design Up Front (BDUF) to Agile, the analogy of building houses is hard to get rid of. Yes, we know we do Agile with sprints, no deadlines but sprint goals, a backlog with user stories and a road-map with epics. But we are still building a house. Well.. no.. we are actually not building a house. The whole house analogy is wrong. We are building a LEGO house. Yes, something like this:

[Image: a LEGO house]

Was it a coincidence that Paul Hammant used a LEGO picture in his post about software architecture? I think not. Here are some of the points that show why building a LEGO house is a good analogy for building product software:

  • Innovation: Best builds come from experimenting constantly and thinking out-of-the-box.
  • Flexible: LEGO house foundations can be replaced carefully. Anything can be replaced any time.
  • Looks: Building a pretty LEGO house is very hard and requires special skills.
  • Agility: There is no such thing as a “done” LEGO house. It is alive and constantly changing.
  • Requirements: You don’t plan a LEGO house; you discover what you want as it evolves.
  • Specs: LEGO is enjoyed best by building whatever comes to mind, not by following specs.
  • Compatibility: Everything connects to everything, with hardly any effort.
  • Freedom: LEGO can be built without having to comply with all kinds of regulations.
  • Creative joy: Creating software is as much fun as creating with LEGO.
  • Value: People pay for the promise of easy integration and high flexibility.
  • Validation: A real-life test by the target audience tells you things you could not imagine.
  • Mindset: Grown-ups do not understand LEGO, like business does not understand software development.
  • Out-sourcing: If somebody else builds your LEGO house, you will get something you don’t like.
  • Standards: Build with non-standard blocks and you’ll get stuck quickly.

Can you think of other ways in which the analogy holds? Use the comments! Don’t comment saying that I do not take software development seriously. I believe that software development is a serious business, but it is also seriously fun. To put it more strongly: if it does not remind you of playing with LEGO when you were still a kid, you are doing it wrong.

About hiring programmers and Asperger’s syndrome

Managers of software teams are often confronted with highly intelligent programmers; programmers who all suffer to some extent from Asperger’s syndrome. These people are very intelligent, question authority and do not do what they are told unless they are satisfied with the reasons why. They have low social sensibility and love endless discussions about principles. They like to stick to rules once they are settled, can focus for hours on a single piece of code, love to be fully focused and hate to be disturbed.

Am I generalizing? Yes, you bet I am, but I am sure: If you are a programmer you will recognize some of it.

“Managing programmers is like herding cats” – Meilir Page-Jones

I love that quote, because it holds so much value. Some say it only applies to senior programmers, but that leads to the following dilemma: hire easy-to-steer junior programmers who may contribute little or even negatively to the software you are building, or hire hard-to-steer seniors who can bring big value. Whether it is true or not, I think the latter is preferable.

Dealing with senior programmers is hard and I believe it is a challenge that has no match in other industries. One of the factors making matters worse is the global shortage of senior programmers. These programmers can afford to be arrogant and non-compliant, because the software business needs them so badly. They can always find another well-paying job. All they have to do is mention on their social media profile that they are “available” and it will rain job offers on them.

One of the methods I see people use to get a grip on their developers is to pay them above average or give them secondary benefits so good that they cannot be matched elsewhere. But then again, you must be careful that this does not lead to arrogance, a lack of self-criticism and a feeling of superiority. These are definite enemies of a well-functioning team, so this cure may be worse than the disease.

Further reading on Asperger’s syndrome

The following links give some more insight into Asperger’s syndrome:

  1. http://archive.wired.com/wired/archive/9.12/aspergers_pr.html
  2. http://www.dailymail.co.uk/home/you/article-2557765/Is-man-wired-differently.html
  3. http://www.healthcentral.com/autism/c/1443/153287/asperger-difficulty/
  4. http://edition.cnn.com/2013/04/11/health/aspergers-work-irpt/

Linux web browsers Midori and (Gnome) Web

The browser has become the most important application you run on your computer. Google Chrome, Firefox, and Internet Explorer seem to be the most commonly used browsers today. But there are other options. The following things are worth considering when choosing a browser:

  1. Security
  2. Stability
  3. Speed
  4. Adblocker
  5. Support for media
  6. Development tools

If you are on the Linux platform, you can install a variety of browsers, but Firefox and Chromium are probably the most popular choices. This post will introduce two other browsers for Linux that you might not know yet: (Gnome) Web and Midori.

Gnome Web 3.10.3

[Screenshot: Gnome Web]

Gnome Web is an elegant, polished and minimalistic browser that only runs on Linux and feels much like Safari. It used to be called Epiphany and was originally based on the Gecko engine (it currently uses WebKit). It has a good set of development tools hidden under the right-click menu (Inspect Element).

  1. Engine: WebKit
  2. Search: DuckDuckGo
  3. AdBlocker: Yes
  4. Platforms: Linux only

Installation command for Debian-based Linux systems:

sudo apt-get install epiphany-browser

User-agent for this browser on Xubuntu Linux 14.04:

Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/538.15 (KHTML, like Gecko) Safari/538.15 
Version/6.0 Ubuntu/14.04 (3.10.3-0ubuntu2) Epiphany/3.10.3

Midori 0.4.3

[Screenshot: Midori]

Midori is a fast and full-featured browser with nice development tools. The only downside I found is that it does not have as many plugins available as Chrome or Firefox. Other than that I think the browser is just as good (and with some better defaults). It also runs on Windows, but as far as I know it has not been ported to OS X.

  1. Engine: WebKit
  2. Search: DuckDuckGo
  3. AdBlocker: Yes
  4. Platforms: Linux and Windows

Installation command for Debian-based Linux systems:

sudo apt-get install midori

User-agent for this browser on Xubuntu Linux 14.04:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/537+ (KHTML, like Gecko) 
Version/5.0 Safari/537.6+ Midori/0.4

The lies about the necessity of a Big Rewrite

This post is about real-world software products that make money and have required multiple man-years of development time to build. This is an industry in which quality and costs, and thus professional software development, matter. The pragmatism and realism of Joel Spolsky’s blog on this type of software development is refreshing. He is also not afraid to speak up when he believes he is right, as on the “Big Rewrite” subject. In this post I will argue why Joel Spolsky is right. Next I will show the real reasons software developers want to do a Big Rewrite and how they justify it. Finally, Neil Gunton has a quote that will help you convince software developers that there is a different path that should be taken.

Big Rewrite: The worst strategic mistake?

Whenever a developer says a software product needs a complete rewrite, I always think of Joel Spolsky saying:

… the single worst strategic mistake that any software company can make: (They decided to) rewrite the code from scratch. – Joel Spolsky

You should definitely read the complete article, because it holds a lot of strong arguments to back the statement up, which I will not repeat here. He made this statement in the context of the Big Rewrite that Netscape did, which led to Mozilla Firefox. In an interesting, very well written counter-post, Adam Turoff writes:

Joel Spolsky is arguing that the Great Mozilla rewrite was a horrible decision in the short term, while Adam Wiggins is arguing that the same project was a wild success in the long term. Note that these positions do not contradict each other.

Indeed! I fully agree that these positions do not contradict each other. So the result was not bad, but it was still the worst strategic mistake the software company could make. Then he continues:

Joel’s logic has got more holes in it than a fishing net. If you’re dealing with a big here and a long now, whatever work you do right now is completely inconsequential compared to where the project will be five years from today or five million users from now. – Adam Turoff

Wait, what? Now he chooses Netscape’s side?! And this argument makes absolutely no sense to me. Who knows what the software will require five years or five million users from now? For this to be true, the guys at Netscape must have been able to look into the future. If so, then why did they not buy Apple stock? In my opinion the observation that one cannot predict the future is enough reason to argue that deciding to do a Big Rewrite is always a mistake.

But what if you don’t want to make a leap into the future, but you are trying to catch up? What if your software product has gathered so much technical debt that a Big Rewrite is necessary? While this argument also feels logical, I will argue why it is not. Let us look at the different technical causes of technical debt and what should be done to counter them:

  • Lack of a test suite: this can easily be countered by adding tests
  • Lack of documentation: writing it is not a popular task, but it can be done
  • Lack of loosely coupled components: dependency injection can be introduced one software component at a time; your test suite will guarantee there is no regression (see the sketch after this list)
  • Parallel development: do not rewrite big pieces of code; keep the change sets small and merge often
  • Delayed refactoring: is refactoring much more expensive than rewriting? It may seem so due to the 80/20 rule, but it probably is not; just start doing it
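As a tiny illustration of the dependency injection point above: a collaborator can be made injectable in one component without touching its callers. The class and method names below are hypothetical.

# Hypothetical example of retrofitting dependency injection on a single
# component; existing callers keep working because the old default stays in place.
class SmtpClient:
    """Stand-in for the real mail client used in production."""
    def __init__(self, host):
        self.host = host

    def send_mail(self, recipient, body):
        print("sending %d bytes to %s via %s" % (len(body), recipient, self.host))

class ReportMailer:
    def __init__(self, smtp_client=None):
        # The collaborator can now be injected (e.g. a fake in a test),
        # while untouched callers still get the hard-wired default.
        self.smtp_client = smtp_client or SmtpClient("localhost")

    def send(self, recipient, text):
        self.smtp_client.send_mail(recipient, text)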

And then we immediately get back to reality, which normally prevents us from doing a Big Rewrite: we need to tend the shop. We need to keep the current software from breaking down and we need to implement critical bug fixes and features. If this takes up all our time, because there is so much technical debt, then that debt may become a hurdle that seems too big to ever overcome. So realize that not being able to reserve time (or people) to get rid of technical debt can be the real reason behind a request for a Big Rewrite.

To conclude: a Big Rewrite is always a mistake, since we cannot look into the future, and if there is technical debt, it should be acknowledged and countered in the normal way.

The lies to justify a Big Rewrite

When a developer suggests a “complete rewrite”, this should be a red flag to you. The developer is most probably lying about the justification. The real reasons the developer is suggesting a Big Rewrite or a “build from scratch” are:

  1. Hard-to-solve bugs (which are not fun working on)
  2. Technical debt, including debt caused by missing tests and documentation (which are not fun working on)
  3. The developer wants to work on a different technology (which is more fun working on)

The lie is that the bugs and technical debt are presented as structural/fundamental problems with the software that cannot realistically be fixed without a Big Rewrite. Five other typical lies (according to Chad Fowler) that the developer will promise in return for a Big Rewrite include:

  1. The system will be more maintainable (less bugs)
  2. It will be easier to add features (more development speed)
  3. The system will be more scalable (lower computation time)
  4. System response time will improve for our customers (less on-demand computation)
  5. We will have greater uptime (better high availability strategy)

Any code can be replaced incrementally, and all code must be replaced incrementally, just as bugs need to be solved and technical debt needs to be removed. Even when technology migrations are needed, they should be done incrementally, one part or component at a time, and not with a Big Bang.

Conclusion

Joel Spolsky is right: you don’t need a Big Rewrite. Doing a Big Rewrite is the worst mistake a software company can make. Or, as Neil Gunton puts it more gently and positively:

If you have a very successful application, don’t look at all that old, messy code as being “stale”. Look at it as a living organism that can perhaps be healed, and can evolve. – Neil Gunton

If a software developer argues that a Big Rewrite is needed, then remind him that the software is alive, and that he is responsible for keeping it healthy and growing it up to full maturity.