I’ve been thinking about this idea for a while, and I thought I would put a name to the thought. I brought up this idea while giving my “Real Life Scaling” presentation at the Utah Open Source Conference in 2009. Here is the problem I think most individuals in web development face:
Hopefully at some point, your website gets a lot of traffic. Yay, you’ve reached your goal of getting good traffic, but it is soon followed by issues with performance and load. I like to call these the growing pains of a website. So as a web developer, I suddenly have the epiphany of “Hey, I need to scale my website!” What follows next is the biggest mistake a web developer can make:
They start looking at articles on how Google scales, or maybe how Facebook manages all of their traffic.
This is a mistake! To be brutally honest, you are not Google. You are not Facebook. You are not Twitter. You are a website that receives less than 0.000001% of the traffic that some major websites receive.
Why is this dangerous for web developers to do? Google, Twitter, Facebook, and others like them are solving complicated problems at a very large scale. I remember a presentation by a Twitter engineer who developed a unique ID generator that can generate millions of IDs per second. The probability of you needing this type of solution is about the same as being struck by lightning. Applying these same practices at a much smaller scale is not realistic. If a locally owned grocery store wanted to open a second store, it would not adopt the same practices that Wal-Mart uses to manage its 8,970 stores.
I’m sure that most of my readers know of StackExchange.com. They power the popular website StackOverflow and several others. They have about two million visitors per day. That is a lot of traffic. StackOverflow is ranked #123 on Alexa. So you would imagine that they have a very large infrastructure serving all of this traffic, right?
Earlier this year, Stack Exchange wrote an article about their production environment. I was surprised by what exactly they were using. In particular, the number of Production Servers*:
- 12 Web Servers (Windows Server 2008 R2)
- 2 Database Servers (Windows Server 2008 R2 and SQL Server 2008 R2)
- 2 Load Balancers (Ubuntu Server and HAProxy)
- 2 Caching Servers (Redis on CentOS)
- 1 Router / Firewall (Ubuntu Server)
- 3 DNS Servers (Bind on CentOS)
That is 22 servers for 2 million visits per day, serving 800 HTTP requests per second. Now, Stack Exchange did clarify that they have other servers for management and failover, but 22 servers handle their production load. This is a website that is ranked the 123rd most visited website in the world.
Honestly, most websites could be run on half a dozen servers if designed and configured correctly, including redundancy. Some really busy websites could run off a dozen servers. Unless you’re in the top 5,000 websites on the web, you really shouldn’t be worried about large-scale techniques.
So when your website is starting to grow, and you leave small scale, you’ll enter the phase of “Middle-Scale.”
Middle-Scale is like being an awkward teenager:
You know that you can’t be the only one suffering through this, but you’re unsure how to proceed. It feels like you’re missing out on things everyone else must already know, but isn’t talking about. Like everyone else is an awesome vampire or something:
But the reality is this: they don’t have some awesome secret! They are just normal teenagers.
This same idea applies to Middle-Scale websites.
Middle-Scale is when the most important things are still the best practices. Only now, when you deviate from them, you can feel the consequences. When you only had 100 users, a couple of nested queries and missing indexes didn’t cause much of a problem. Your database was powerful enough to hide the inefficiencies. However, when you get to 10,000 users, your database can no longer hide them.
Middle-Scale is when simply separating your web server and database server isn’t enough. You’ll probably need to add some sort of cache like memcached. You’ll need to start tweaking your MySQL, Apache, and PHP configurations.
Then, after you’ve ironed out your inefficiencies, you’ll start to use multiple servers. You’ll probably add a load balancer with multiple web servers. After that, you’ll probably set up some sort of Master-Slave replication for your database for backups and failover.
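For the load balancer step, an HAProxy setup like Stack Exchange’s can be surprisingly small. Here is a hypothetical haproxy.cfg fragment (the server names and addresses are made up) that round-robins traffic across two web servers, with health checks so a dead server is pulled from rotation:

```
frontend www
    bind *:80
    default_backend web_servers

backend web_servers
    balance roundrobin
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check
```

Two boxes behind a config like this already buys you both capacity and the ability to take one web server down without an outage.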
You start to leave this “Middle-Scale” classification when you move to multiple data centers, and start to do some load balancing at the DNS layer. This is when you’ll start to have a dedicated sysadmin team.
First off, you must adhere to best practices. If you are working with PHP, research PHP performance and best practices. Do the same for each of your technologies, like Apache and MySQL. You will need to stop treating your application as one big app, and start to understand all of its moving parts.
Second, you must understand your specific problems. Scaling isn’t a problem, nor is it a solution. It is a generic term for many different types of solutions. Without understanding why your website is running slow, or why it cannot handle the load, you will not be able to create an effective solution.
So you don’t have a scaling problem. You have a MySQL performance issue, or an Apache problem, or a PHP problem. Most likely, it is something extremely specific. You have a high volume of MySQL write operations (i.e. UPDATE, INSERT, DELETE, REPLACE), or perhaps you are missing some indexes and have too many full table scans.
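The full-table-scan case is easy to see for yourself. The sketch below uses Python’s built-in SQLite instead of MySQL so it is self-contained, but the effect is the same: the identical WHERE clause goes from a table scan to an index search once the index exists (on MySQL you would run EXPLAIN on the query and watch for `type: ALL`). Table and index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
conn.executemany("INSERT INTO posts (user_id, body) VALUES (?, ?)",
                 [(i % 100, "post") for i in range(1000)])

query = "SELECT * FROM posts WHERE user_id = 42"

# Without an index on user_id, the planner scans the whole table.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)   # plan detail mentions a SCAN of posts

conn.execute("CREATE INDEX idx_posts_user_id ON posts (user_id)")

# With the index, the same query becomes an index search.
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)    # plan detail mentions SEARCH ... USING INDEX
```

At 1,000 rows the difference is invisible; at 10 million rows it is the difference between a usable page and a timeout.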
Third, Googling for help will only get you so far. You are entering a phase where it is harder and harder to find answers to your broad issues. Talking with other experienced people who have gone through the Middle-Scale pains before will help immensely. I cannot recommend highly enough going to user groups. Being able to communicate with someone, either face to face, on the phone, over IRC, etc., is invaluable. While I’ve learned a lot at conference and user group presentations, I’ve learned even more by just talking with the people attending and at the social gatherings.
When you want to scale, it can feel like a very daunting task. It seems like one big, unknown, complicated problem. What in the world am I going to do? I remember feeling these worries when I first started to investigate load balancing and sharding for some websites I was working on.
The thing is, if you start to profile your application, you will discover its inefficiencies. I remember when I spent a solid week, working 12-16 hours a day, profiling and optimizing Dating DNA’s database. I found a lot of bad queries, and I was able to cut our load times from 2-5 seconds to under 0.1 seconds. The CPU utilization on the database server went from 80-90% to under 10%. It was incredible, and then I promptly took the entire next week off. When we migrated to new servers, I was able to move to a less powerful database server and still have the same great performance. So by profiling and optimizing our database, I didn’t need to worry about spinning up multiple master databases and sharding our data.
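If you want to hunt down bad queries like that on MySQL, the slow query log is the usual first step. A sketch of the relevant my.cnf settings; the path and threshold here are illustrative, not what Dating DNA used:

```
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 0.5
```

Any query taking longer than the threshold gets logged, and tools like mysqldumpslow can then summarize the log to show you which queries hurt the most.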
With Clipish, we faced almost the opposite scaling problem. The database was rarely an issue, but our web server CPUs were. We do a lot of ImageMagick manipulations of images, and at high volumes on virtual servers this can be a big issue. So over the last year we’ve introduced some load balancing and CDN tools to help serve all 10 TB of bandwidth for Clipish.
The thing is, when you start to profile your application, you start to understand its slow areas better, so you have a much better idea of what to do. Even if you don’t know your solution, it is much easier to find one with a sound understanding of the problem. For example, “scaling mysql” yields far less helpful results on Google than “mysql full table scans.”
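For PHP specifically, the Xdebug extension’s profiler is a common way to get that understanding: it writes a cachegrind file per request that tools like KCachegrind or Webgrind can read. A sketch of the relevant php.ini settings (the output directory is illustrative, and profiling every request like this is only appropriate on a dev box, not in production):

```
; Enable the Xdebug profiler and choose where cachegrind files go.
xdebug.profiler_enable     = 1
xdebug.profiler_output_dir = /tmp/profiles
```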
Of course not! First off, they do cool stuff. Just because I’ll watch NASA launch a space shuttle doesn’t mean I’ll try to build a rocket system for my broken lawn mower. But you have to put what they are doing into context. People from large websites have published several good “best practices” articles on techniques that help any website, especially on the client/browser side. Just use caution. I cringe when I hear someone say “we’re trying to use Cassandra to solve XYZ problem at work” when it is severe overkill.
Most of the time when I talk about performance and scaling with other people, it is when they are in “critical mode.” Their website is down, slow, unusable, etc., and they are looking to fix the problem. I will say, it is much more difficult to profile in “critical mode” than to profile beforehand, because you are desperately focused on getting things working again instead of understanding the problem.
I’ll be giving a presentation this Thursday at UPHPU on Profiling PHP Applications. I’ll post the slides, and most likely write some articles on the subject afterwards. As always, feel free to email me or leave a comment.