Archive for July 2011
Welcome to LeaseWeb Labs!
This might not be the first post on here, but it still seems like a good idea to have this welcome post.
The LeaseWeb Labs blog is an initiative from LeaseWeb’s IT department to create a place where technically focused people can post about technical subjects. As the research, product and maintenance departments, we get confronted with all the dark and murky corners of development and engineering. It seems a shame to keep all that knowledge and learning to ourselves, so we decided to start sharing it here.
The things we’ll be posting about will vary: we have engineers working on high-traffic delivery systems, developers building front-ends for our various systems, and teams working on a pretty interesting new cloud platform – so you can expect information, how-tos and insights about almost any (IT-related) subject. As we work in the hosting industry, there is definitely a bias towards hosting-related information.
Some of these posts are aimed at basic system administration or development, and some will be more about complex scalability and high-performance systems. We try to have something for everyone, but do let us know if there is something specific you want to hear about – you can contact us via the Twitter account, on Facebook, or use the comment system below.
To get you started, we’ve already placed a few articles that might interest you:
- High availability load balancing using HAProxy on Ubuntu: Sander gives you a practical howto on implementing a basic high availability setup – part #1 in a series on high availability.
- Scalable RDBMS: Mukesh is our scalability guru, and used to work on extremely high traffic (web) systems. He’s writing a first post in a series on scalable databases.
- Tuning Zend Framework and Doctrine: Alexander works mainly in PHP, and gives you some pointers on starting a customized Zend/Doctrine project to make it an even better combination.
We hope this gives you a general feeling for what we’re planning to do, and we will be posting more. Any requests or comments are welcome!
In this post we will show you how to easily set up load balancing for your web application. Imagine you currently have your application on one web server called web01:
+---------+
| uplink  |
+---------+
     |
+---------+
|  web01  |
+---------+
But traffic has grown and you’d like to increase your site’s capacity by adding more web servers (web02 and web03), as well as eliminate the single point of failure in your current setup (if web01 has an outage, the site will be offline).
              +---------+
              | uplink  |
              +---------+
                   |
     +-------------+-------------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+
In order to spread traffic evenly over your three web servers, we could install an extra server to proxy all the traffic and balance it over the web servers. In this post we will use HAProxy, an open source TCP/HTTP load balancer (see: http://haproxy.1wt.eu/), to do that:
              +---------+
              | uplink  |
              +---------+
                   |
                   +
                   |
              +---------+
              | loadb01 |
              +---------+
                   |
     +-------------+-------------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+
So our setup now is:
- Three web servers, web01 (192.168.0.1), web02 (192.168.0.2), and web03 (192.168.0.3), each serving the application.
- A new server, loadb01 (192.168.0.100), with Ubuntu installed.
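To make the idea of balancing concrete, here is a minimal Python sketch of round-robin distribution over the three web servers listed above. It only illustrates the concept; it is not how HAProxy is implemented or configured, and the names `make_picker` and `pick_backend` are hypothetical helpers made up for this example.

```python
from itertools import cycle

# Backend addresses taken from the setup list above.
BACKENDS = ["192.168.0.1", "192.168.0.2", "192.168.0.3"]


def make_picker(backends):
    """Return a function that hands out backends in round-robin order."""
    rotation = cycle(backends)

    def pick_backend():
        # Each call returns the next backend, wrapping around at the end.
        return next(rotation)

    return pick_backend


pick_backend = make_picker(BACKENDS)

# Six consecutive requests visit every backend exactly twice.
first_six = [pick_backend() for _ in range(6)]
```

A real load balancer also has to notice when a backend is down and skip it, which this sketch deliberately leaves out.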
All right, now let’s get to work:
Start by installing haproxy on your load balancing machine:
loadb01$ sudo apt-get install haproxy
My name is Mukesh. In my previous assignments I worked with fairly large (or medium-large) scale websites, and I now work in LeaseWeb’s cloud team as an innovation engineer. When I say large scale, I’m talking about a website serving 300 million webpages per day (all rendered within a second), a website storing about half a billion photos & videos, a website with an active user base of ~10 million, a web application with 3000 servers …and so on!
We all know it takes a lot to keep sites like these running, especially if the company has decided to run them on commodity hardware. Coming from this background, I’d like to dedicate my first blog post to the subject of scalable databases.
A friend of mine, a marketing manager by profession who is inspired by technology, asked me why we are using MySQL, knowing that it does not scale (or is there some special Harry Potter# magic?). He wanted to know for what reasons we chose MySQL, and whether there are any plans to move to another database.
Well, the answer to the latter question is easy: “No, we’re not planning to move to another database”. The former question, however, can’t be answered in a single line.
#Talking of Harry Potter, what do you think about ‘The Deathly Hallows Part II’?
Think about Facebook – a well recognised social networking website. Facebook handles more than 25 billion page views per day; even they use MySQL.
The bottleneck is not MySQL (or any common database). Generally speaking, every database product on the market has the following characteristics to some extent:
- PERSISTENCE: Storage and (random) retrieval of data
- CONCURRENCY: The ability to support multiple users simultaneously (lock granularity is often an issue here)
- DISTRIBUTION: Maintenance of relationships across multiple databases (support of locality of reference, data replication)
- INTEGRITY: Methods to ensure data is not lost or corrupted (features including automatic two-phase commit, use of dual log files, roll-forward recovery)
- SCALABILITY: Predictable performance as the number of users or the size of the database increases
This post deals with scalability, a term we hear quite often when we talk about large systems and big data.
Data volume can be managed if you shard it. If you split the data across different servers at the application level, the scalability of MySQL is not such a big problem. Of course, you cannot JOIN data that lives on different servers, but choosing a non-relational database doesn’t help with that either. There is no evidence that even Facebook uses Cassandra, which back in early 2008 was its very own project, as primary storage; it seems the only thing it is needed for there is search over incoming messages.
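Application-level sharding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular site’s setup: the shard host names and the `shard_for` helper are hypothetical, and a simple hash-modulo scheme is assumed.

```python
import hashlib

# Hypothetical shard hosts; the names are placeholders for illustration.
SHARDS = ["db01", "db02", "db03", "db04"]


def shard_for(user_id, shards=SHARDS):
    """Map a user id to one shard using a stable hash of the id."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# The same user always lands on the same shard, so all of that user's
# rows can be kept together and joined locally on that server.
home_shard = shard_for(42)
```

One trade-off of the modulo approach is that changing the number of shards remaps most keys, so data has to be moved when shards are added.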
I believe it’s a bad idea to stake your main database on new technology. It would be a disaster to lose or damage the database, and you may not be able to restore everything. Besides, unless you are a developer of one of these newfangled databases, or one of the few who actually run them in production, you can only pray that the developers will fix bugs and scalability issues as they come up.
In fact, you can go very far on a single MySQL server without even caring about partitioning data at the application level. While it’s easy to scale a server up with a bunch of cores and tons of RAM, do not forget about replication. In addition, if a memcached layer (which scales easily) sits in front of the server, the only thing your database has to care about is writes. For storing large objects, you can use S3 or any other distributed hash table. Until you are sure that you need to scale the database as it grows, do not shoulder the burden of making it an order of magnitude more scalable than you need.
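The effect of a cache layer in front of the database can be illustrated with a small Python sketch. Plain dictionaries stand in for memcached and MySQL here so the example runs anywhere; the `read` and `write` helpers and the `stats` counter are hypothetical names, and the read-through-with-invalidation pattern shown is one common approach rather than the only one.

```python
# Plain dicts stand in for memcached and MySQL in this sketch.
cache = {}
database = {"user:1": "Alice"}
stats = {"db_reads": 0}


def read(key):
    """Serve from the cache; fall back to the database only on a miss."""
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    value = database.get(key)
    if value is not None:
        cache[key] = value  # populate the cache for later reads
    return value


def write(key, value):
    """Writes always hit the database; the stale cache entry is dropped."""
    database[key] = value
    cache.pop(key, None)


# Two reads of the same key cause only one database read.
read("user:1")
read("user:1")
```

After the first read, repeated reads never reach the database, which is why writes become the main load the database has to handle.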
- Use MySQL or other classic databases for important, persistent data.
- Use caching delivery mechanisms – or maybe even NoSQL – for fast delivery.
- Wait until the dust settles, and the next generation, free-semantics relational database rises up.
In principle, combining Zend Framework with Doctrine is not too difficult. But first let’s talk about the preparations. According to the author of Zend Framework, the default file structure of a project could be a bit more optimal.
Here is the default structure of the Zend Framework project files:
/
  application/
    default/
      controllers/
      layouts/
      models/
      views/
  html/
  library/
Often you will have a number of applications (e.g., frontend and backend), and you will want to use the same models for them. In this case, it can be good practice to create your models folder in library/, in which case the new structure would look as follows:
/
  application/
    default/
      controllers/
      layouts/
      views/
  html/
  library/
    Model/
In addition, the folder models/ is renamed to Model. We now proceed as follows:
- Download a fresh copy of Doctrine-xxx-Sandbox.tgz from the official website.
- Copy the contents of the lib/ folder from the archive to our project’s library/ folder.
- Create another folder bin/sandbox/ in the root of our project and copy the rest of the archive there (except models/ folder and the index.php file).
Now the structure of our project should look like this:
/
  application/
    default/
      controllers/
      layouts/
      views/
  bin/
    sandbox/
      data/
      lib/
      migrations/
      schema/
      config.php
      doctrine
      doctrine.php
  html/
  library/
    Doctrine/
    Model/
    Doctrine.php
Clear the content of the folder bin/sandbox/lib/ – we now have the library in another place.
Now it’s time to configure Doctrine to work with the new file structure.
Change the value of the constant MODELS_PATH in the file bin/sandbox/config.php:
SANDBOX_PATH . DIRECTORY_SEPARATOR . '..' . DIRECTORY_SEPARATOR . '..' . DIRECTORY_SEPARATOR . 'library' . DIRECTORY_SEPARATOR . 'Model'
Next, change the connection settings for the database. Change the value of the DSN constant to reflect your database settings. For example, if you use the DBMS MySQL, the DSN might look like this:
Configure the include path on the first line of the config file, so our script can find files in their new locations:
set_include_path(
    '.' . PATH_SEPARATOR
    . '..' . DIRECTORY_SEPARATOR . '..' . DIRECTORY_SEPARATOR . 'library' . DIRECTORY_SEPARATOR . PATH_SEPARATOR
    . '.' . DIRECTORY_SEPARATOR . 'lib' . PATH_SEPARATOR
    . get_include_path()
);
Then include the main Doctrine library file directly after setting the include path, and register the autoload function:
require_once 'Doctrine.php';

/**
 * Setup autoload function
 */
spl_autoload_register(array('Doctrine', 'autoload'));