Posts Tagged ‘scalability’
Features of PostgreSQL
Database systems can cost a lot of money; this is fairly well known. Products like Microsoft SQL Server and Oracle Standard Edition are billed per CPU (or even per core) and may also require client licenses. The costs and complications of licensing may drive people to free (not only as in beer) software. When free database systems are discussed, most people immediately think of MySQL (also owned by Oracle). But there is another, maybe even better, player in the open source market that is less well known. Its name is “PostgreSQL”.
An enterprise class database, PostgreSQL boasts sophisticated features such as Multi-Version Concurrency Control (MVCC), point in time recovery, tablespaces, asynchronous replication, nested transactions (savepoints), online/hot backups, a sophisticated query planner/optimizer, and write ahead logging for fault tolerance. It supports international character sets, multibyte character encodings, Unicode, and it is locale-aware for sorting, case-sensitivity, and formatting. It is highly scalable both in the sheer quantity of data it can manage and in the number of concurrent users it can accommodate. There are active PostgreSQL systems in production environments that manage in excess of 4 terabytes of data. — source: http://www.postgresql.org/about/
This database system is free and very powerful. It supports almost all of the features that its paid (and free) counterparts have. Some of the interesting features are GIS support, hot standby for high availability, and index-only scans (great for big data). Still not convinced? Check out the impressive Feature Matrix of PostgreSQL yourself. It has excellent support on Linux systems (and also on OSX and Windows) and integrates well with PHP and frameworks like Symfony2.
Download & try PostgreSQL (for OSX or Windows) or follow installation instructions (for Ubuntu Linux)
Note: If you want a tool like phpMyAdmin for PostgreSQL, you might consider Adminer.
Big data – do I need it?
Big Data?
Big data is one of the most recent “buzz words” on the Internet. The term is normally associated with data sets so big that they are really complicated to store, process, and search through.
Big data is known as a three-dimensional problem (as defined by Gartner, Inc.*), i.e. it has three challenges associated with it:
1. increasing volume (amount of data)
2. velocity (speed of data in/out)
3. variety (range of data types, sources)
Why Big Data?
As datasets grow, you can extract more information from them, and the precision of your results improves (assuming you’re using the right models, but that is not relevant for this post). Better and more diverse analyses can also be run against the data. Corporations of all kinds keep growing their datasets to get more “juice” out of them: some to build better business models, others to improve user experience, others to get to know their audience better. The choices are virtually unlimited.
In the end, and in my opinion, big data analysis/management can be a competitive advantage for corporations. In some cases, a crucial one.
Big Data Management
Big data management software is not something you normally buy on the market as an “off-the-shelf” product (maybe Oracle wants to change this?). One of the biggest questions in big data management is: what do you want to do with the data? Knowing this is essential to minimize the problems that come with huge data sets. Of course, you can just store everything and try to make sense of the data later. Again, in my opinion, that is a way to get a problem, not a solution or an advantage.
Since you cannot just buy a big data management solution, a strategy has to be designed and followed until something is found that can work as a competitive advantage to the product/company.
Internally at LeaseWeb we have a big data set that we can work on at real-time speed (we are using Cassandra** at the moment) while obtaining the results we need. Getting this working took several trial-and-error iterations, but in the end we got what we needed, and so far it is living up to expectations. How much hardware? How much development time? That all depends. The question you have to ask yourself is “What do I need?”, and once you have an answer to that, normal software planning and development time applies. It may even turn out that you don’t need Big Data at all, or that you can solve your problem using standard SQL technologies.
In the end, our answer to “What do I need?” gave us all the information we needed to find what was best for us. In our case that was a mix of technologies, one of them being a NoSQL database.
* http://www.gartner.com/it/page.jsp?id=1731916
** http://cassandra.apache.org/
Setting up keepalived on Ubuntu (load balancing using HAProxy on Ubuntu part 2)
In our previous post we set up an HAProxy loadbalancer to balance the load of our web application between three webservers. Here’s the diagram of the situation we ended up with:
              +---------+
              | uplink  |
              +---------+
                   |
                   +
                   |
              +---------+
              | loadb01 |
              +---------+
                   |
     +-------------+-------------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+
As we already concluded in the last post, there’s still a single point of failure in this setup: if the loadbalancer dies for some reason, the whole site will be offline. In this post we will add a second loadbalancer and set up a virtual IP address shared between the two loadbalancers. The setup will look like this:
              +---------+
              | uplink  |
              +---------+
                   |
                   +
                   |
+---------+   +---------+   +---------+
| loadb01 |---|virtualIP|---| loadb02 |
+---------+   +---------+   +---------+
                   |
     +-------------+-------------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+
So our setup now is:
- Three webservers, web01 (192.168.0.1), web02 (192.168.0.2), and web03 (192.168.0.3), each serving the application
- The first load balancer (loadb01, IP: 192.168.0.100)
- The second load balancer (loadb02, IP: 192.168.0.101), configured in the same way as the first one
To set up the virtual IP address we will use keepalived (as also suggested by Warren in the comments):
loadb01$ sudo apt-get install keepalived
Good, keepalived is now installed. Before we proceed with configuring keepalived itself, edit the following file:
loadb01$ sudo vi /etc/sysctl.conf
And add this line to the end of the file:
net.ipv4.ip_nonlocal_bind=1
This option is needed for applications (haproxy in this case) to be able to bind to non-local addresses (IP addresses which do not belong to an interface on the machine). To apply the setting, run the following command:
loadb01$ sudo sysctl -p
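If you want to confirm that the setting took effect, you can read it back from the kernel. Below is a minimal Python sketch; the function name is our own invention, and it assumes a Linux machine where the setting is exposed under /proc/sys, as sysctl settings normally are:

```python
from pathlib import Path

# Where Linux exposes the net.ipv4.ip_nonlocal_bind setting.
NONLOCAL_BIND_PATH = "/proc/sys/net/ipv4/ip_nonlocal_bind"


def nonlocal_bind_enabled(path: str = NONLOCAL_BIND_PATH) -> bool:
    """Return True if the kernel allows binding to non-local IPv4 addresses.

    `path` is a parameter only so the check can be tried against another file.
    """
    return Path(path).read_text().strip() == "1"
```

Calling nonlocal_bind_enabled() on loadb01 after running sysctl -p should return True.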
Now let’s add the configuration for keepalived. Open the file:
loadb01$ sudo vi /etc/keepalived/keepalived.conf
And add the following contents (see the comments for details on the configuration!):
# Settings for notifications
global_defs {
    notification_email {
        your@emailaddress.com # Email address for notifications
    }
    notification_email_from loadb01@domain.ext # The from address for the notifications
    smtp_server 127.0.0.1 # You can specify your own smtp server here
    smtp_connect_timeout 15
}
# Define the script used to check if haproxy is still working
vrrp_script chk_haproxy {
    script "killall -0 haproxy"
    interval 2
    weight 2
}
# Configuration for the virtual interface
vrrp_instance VI_1 {
    interface eth0
    state MASTER # set this to BACKUP on the other machine
    priority 101 # set this to 100 on the other machine
    virtual_router_id 51
    smtp_alert # Activate email notifications
    authentication {
        auth_type AH
        auth_pass myPassw0rd # Set this to some secret phrase
    }
    # The virtual ip address shared between the two loadbalancers
    virtual_ipaddress {
        192.168.0.200
    }
    # Use the script above to check if we should fail over
    track_script {
        chk_haproxy
    }
}
And start keepalived:
loadb01$ sudo /etc/init.d/keepalived start
The next step is to install and configure keepalived on our second loadbalancer as well: redo the steps above, starting from apt-get install keepalived. In the configuration step for keepalived, be sure to change these two settings:
state MASTER # set this to BACKUP on the other machine
priority 101 # set this to 100 on the other machine
To:
state BACKUP
priority 100
That’s it! We have now configured a virtual IP shared between our two loadbalancers. Try loading the haproxy statistics page on the virtual IP address; you should get the statistics for loadb01. Then switch off loadb01 and refresh: the virtual IP address will now be assigned to the second loadbalancer, and you should see the statistics page for loadb02.
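To see why the priorities 101 and 100 combined with weight 2 lead to a failover, here is a deliberately simplified Python model of the election. It is only a sketch: the Node class and elect_master function are hypothetical names of ours, and real VRRP also involves advertisement timers, preemption and tie-breaking that are left out. It assumes that a positive weight is added to a node's priority while its check script succeeds.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    priority: int            # base priority from keepalived.conf
    weight: int = 2          # weight from the vrrp_script block
    haproxy_ok: bool = True  # would "killall -0 haproxy" succeed?
    alive: bool = True       # False if the machine is switched off

    def effective_priority(self) -> int:
        # The weight only counts while the haproxy check passes.
        return self.priority + (self.weight if self.haproxy_ok else 0)


def elect_master(nodes):
    """Return the live node with the highest effective priority, or None."""
    candidates = [node for node in nodes if node.alive]
    if not candidates:
        return None
    return max(candidates, key=Node.effective_priority)


loadb01 = Node("loadb01", priority=101)
loadb02 = Node("loadb02", priority=100)
```

With both machines healthy, loadb01 wins with 103 against 102. If haproxy dies on loadb01, its priority drops to 101 while loadb02 stays at 102, so the virtual IP moves to loadb02. This is why the weight has to be larger than the difference between the two base priorities.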
In the next post we will focus on adding MySQL to this setup, as requested by Miquel in the comments on the previous post in this series. If there’s anything else you’d like us to cover, or if you have any questions, please leave a comment!
High availability load balancing using HAProxy on Ubuntu (part 1)
In this post we will show you how to easily set up loadbalancing for your web application. Imagine you currently have your application on one webserver called web01:
+---------+
| uplink  |
+---------+
     |
+---------+
|  web01  |
+---------+
But traffic has grown and you’d like to increase your site’s capacity by adding more webservers (web02 and web03), as well as eliminate the single point of failure in your current setup (if web01 has an outage, the site will be offline).
              +---------+
              | uplink  |
              +---------+
                   |
     +-------------+-------------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+
In order to spread traffic evenly over your three webservers, we could install an extra server to proxy all the traffic and balance it over the webservers. In this post we will use HAProxy, an open source TCP/HTTP load balancer (see: http://haproxy.1wt.eu/), to do that:
              +---------+
              | uplink  |
              +---------+
                   |
                   +
                   |
              +---------+
              | loadb01 |
              +---------+
                   |
     +-------------+-------------+
     |             |             |
+---------+   +---------+   +---------+
|  web01  |   |  web02  |   |  web03  |
+---------+   +---------+   +---------+
So our setup now is:
- Three webservers, web01 (192.168.0.1), web02 (192.168.0.2), and web03 (192.168.0.3), each serving the application
- A new server (loadb01, IP: 192.168.0.100) with Ubuntu installed.
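To make the idea of spreading traffic evenly a bit more concrete, here is a tiny Python sketch of round-robin selection over the three webservers. Round-robin is only one of several balancing strategies and is an assumption on our part for illustration; the sketch does not show how HAProxy itself is implemented.

```python
from itertools import cycle

# The three webservers from the setup above (web01, web02, web03).
WEB_SERVERS = ["192.168.0.1", "192.168.0.2", "192.168.0.3"]


def round_robin(servers):
    """Return an endless iterator that hands out servers in turn."""
    return cycle(servers)


backends = round_robin(WEB_SERVERS)
# Six incoming requests end up on web01, web02, web03, web01, web02, web03.
first_six = [next(backends) for _ in range(6)]
```

Each webserver receives every third request, so the load is spread evenly as long as requests cost roughly the same.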
All right, now let’s get to work:
Start by installing haproxy on your loadbalancing machine:
loadb01$ sudo apt-get install haproxy
Scalable RDBMS
My name is Mukesh. In my previous assignments I worked with fairly large (or medium-large) scale websites, and I am now an innovation engineer in LeaseWeb’s cloud team. When I say large scale, I’m talking about a website serving 300 million webpages per day (all rendered within a second), a website storing about half a billion photos & videos, a website with an active user base of ~10 million, a web application with 3000 servers… and so on!
We all know it takes a lot to keep sites like these running, especially if the company has decided to run them on commodity hardware. Coming from this background, I’d like to dedicate my first blog post to the subject of scalable databases.
A friend of mine, a marketing manager by profession who is inspired by technology, asked me why we are using MySQL when we know that it does not scale (or is there some special Harry Potter# magic?). What he wanted to know was: for what reasons did we choose MySQL? And are there any plans to move to another database?
Well, the answer to the latter is easy: “No, we’re not planning to move to another database”. The former question, however, can’t be answered in a single line.
#Talking of Harry Potter, what do you think about ‘The Deathly Hallows part -II’?
Think about Facebook – a well recognised social networking website. Facebook handles more than 25 billion page views per day; even they use MySQL.
The bottleneck is not MySQL (or any common database). Generally speaking, every database product in the market has the following characteristics to some extent:
- PERSISTENCE: Storage and (random) retrieval of data
- CONCURRENCY: The ability to support multiple users simultaneously (lock granularity is often an issue here)
- DISTRIBUTION: Maintenance of relationships across multiple databases (support of locality of reference, data replication)
- INTEGRITY: Methods to ensure data is not lost or corrupted (features including automatic two-phase commit, use of dual log files, roll-forward recovery)
- SCALABILITY: Predictable performance as the number of users or the size of the database increase
This post deals with scalability, a word we hear quite often when we talk about large systems and big data.
Data volume can be managed if you shard it. If you split the data across different servers at the application level, the scalability of MySQL is not such a big problem. Of course, you cannot JOIN data that lives on different servers, but choosing a non-relational database doesn’t help with that either. There is no evidence that even Facebook uses Cassandra (its very own creation, back in early 2008) as primary storage; it seems that the only thing it is needed for there is searching incoming messages.
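Sharding at the application level can be as simple as a function that maps a key to a server. The Python sketch below is a minimal illustration under our own assumptions: the host names are hypothetical, and a plain hash-modulo scheme is used for brevity.

```python
import zlib

# Hypothetical MySQL servers, each holding one slice of the users.
SHARDS = ["db01.example", "db02.example", "db03.example", "db04.example"]


def shard_for(user_id: int, shards=SHARDS) -> str:
    """Pick the server that stores all rows belonging to `user_id`."""
    # crc32 gives the same result in every process and on every machine.
    key = str(user_id).encode("utf-8")
    return shards[zlib.crc32(key) % len(shards)]
```

The application calls shard_for(user_id) before every query and connects to the server it returns. Note that with plain modulo, adding a server moves most keys to a different shard, so a scheme like this needs a migration plan (or consistent hashing) before the number of shards changes.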
I believe it’s a bad idea to risk your main database on new technology. It would be a disaster to lose or damage the database, and you may not be able to restore everything. Besides, unless you are a developer of one of these newfangled databases, or one of the few who actually use them in production, you can only pray that the developers will fix bugs and scalability issues as they surface.
In fact, you can go very far on a single MySQL server without even caring about partitioning data at the application level. While it’s easy to scale a server up with a bunch of cores and tons of RAM, do not forget about replication. In addition, if a memcached layer (which scales easily) sits in front of the server, the only thing your database has to care about is writes. For storing large objects, you can use S3 or any other distributed hash table. Until you are sure that you need to scale the database as it grows, do not shoulder the burden of making it an order of magnitude more scalable than you need.
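The caching idea can be sketched as a read-through cache with invalidation on write. In this Python sketch plain dictionaries stand in for both memcached and MySQL, and the CachedStore class is a hypothetical name of ours, so treat it as an illustration of the pattern rather than production code.

```python
class CachedStore:
    """Serve reads from a cache and only fall back to the database on a miss."""

    def __init__(self, database: dict):
        self.database = database  # stand-in for MySQL
        self.cache = {}           # stand-in for the memcached layer
        self.db_reads = 0         # counts how often the database is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read from the database and remember the result.
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def set(self, key, value):
        # Writes always go to the database; the stale cache entry is dropped.
        self.database[key] = value
        self.cache.pop(key, None)
```

Repeated reads of the same key touch the database only once, which is why the database is left dealing mostly with writes (plus the occasional cache miss).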
Conclusion!
- Use MySQL or other classic databases for important, persistent data.
- Use caching delivery mechanisms – or maybe even NoSQL – for fast delivery.
- Wait until the dust settles, and the next generation, free-semantics relational database rises up.
