This last week we pushed live a very large architecture change for Dating DNA. For those who know me, and have heard me talk about the Dating DNA Scoring System, they know how big of a problem we faced. For those who don’t know, let me give some background:
With Dating DNA, our goal was to display a compatibility score with every other user. This score is generated by taking two sets of answers to our 20 page survey, and running it through our algorithm. While it is super convenient for our users, this poses a problem. We wanted not only for people to be able to visit a profile and see a score, which is easy to generate a score on demand. We wanted our users to be able to browse other profiles sorted by their score with them. This requires us to pre-generate and store these scores, and then later query them.
So Ultimately, in theory, our “scores” data scaled at the following rate, with X as the number of registered users:
That is at an exponential rate, which is practically impossible to scale at. The very first version of Dating DNA (before I took over the project) had about 1,500 users. The scores were stored in a single table. Every night, a “cron job” would run and get a list of every user, and loop through every possible iteration and re-generate each score. At 1,500 users that was 2,248,500 records. That is a lot for just 1,500 users. With our current user count, we would roughly have 359,999,400,000 score records. Thats 359 Billion records if you don’t want to count the commas.
This old system of daily cron jobs broke at about 2,000 users. We would have problems with the cron job taking over 48 hours to complete, and would end up with 3 scripts running at the say time. One for today, one for yesterday, and one for the day before that.
We solved our first problem by using some common sense and smart logic. I won’t detail the lengthy measures we go through, but we can basically boil down our entire user base to an estimated top 5000 matches for any given user. If we have a heterosexual man named Joe, he doesn’t care about the hundreds of thousands of other heterosexual men who he scores a 2 or 3 with, but the heterosexual women he scores above a 6 with. So we don’t store the score for Frank, Jimmy, and Alan with Joe, but Sally, Rachael, and Tiffany.
The second part we solved was pre-generating scores for a user. After a User has reached a point in the survey where we have all the information we need to generate scores, and they are just filling in some miscellaneous, we put them in a queue. We then have a server process than is continuously running checking this queue, and spinning off multiple generation “threads” that crunch the data and store the score. We’ve spent a lot of time perfecting this system. Currently we typically can generate any given user’s matches in roughly 5 to 20 seconds, depending on how busy our website is.
The problem we now faced was the write through put of MySQL. Even through sharding and partitioning, we wanted to have a goal of sustaining 1,000 registrations per minute in a scalable and high performance manner. Which comes down to about 83,000 records per second that are either being inserted or updated. We then needed to be able to retrieve large volumes of scores just as fast.
I believe we could have bent MySQL to our will and got it to work, but it would be at a high cost of server power, and that cost wouldn’t scale well with our revenue stream. After we moved from the MySQL storage of the scores, I ran a query to see how many scores we were indeed storing. The final total was 950,363,992. Just 50 Million shy of one billion. It took 1 hour 49 min 38.27 sec to calculate that count. It is evidence that even though MySQL wasn’t the best choice for storing this data, it did it pretty well considering this single table was holding 90 times more data than any other table.
In 2009 we started to throw around ideas for a new scoring system. I cannot stress enough when talking to others about “NoSQL” solutions the best solution for any given job is based on your data’s characteristics. User registration data needs to be treated differently than activity logging and basic stats. It might be okay to lose a few minutes of activity logging (depending on the app), but you definitely don’t want to be losing user accounts.
With Dating DNA’s scores, we had one great advantage. The data could be somewhat volatile, because we can always re-generate a set of matches for any given users. Of course, we didn’t want to lose all of it, because having to regenerate everyone’s scores is a major pain and extremely resource intensive. But if we lost a few minutes, anything lost could easily be regenerated. So when we started research for a solution, we were willing to sacrifice some persistance for performance. We wouldn’t be doing the same for our user registration data.
At first, I was contemplating building a completely in-house project to handle the data storage and retrieval of the scores. It would be a lot of work, and decided against. So I then thought about hacking together a custom solution with memcached. The idea would be a user’s set of matches would be stores in a variable in memcached. So the website and generation scripts would interface with memcached, and a server process would write inactive sets of scores (people who weren’t logged in) to a file on the disk. When they logged it and scores were being pulled and stored for that user, it would load the data from disk into memcached.
While the general concept was sound, the actual execution would be difficult. Memcached only supported strings for values, and we would still need some sort of database to manage which users had the data in memory vs disk, and the server process (probably just a php, python, or node.js script running continuously) would have to be running constantly, and if that broke things could get messy.
It boiled down too many points of failure and complexity. But it was a step in the right direction, so we kept looking for a better solution.
I was talking with Joseph Scott, an employee of Automattic as a Bug Exorcist (not joking, his real title), and he mentioned I should look into Redis. He gave me a brief overview of what it was, and I shuffled that info back in my brain. I can’t remember how much longer it was before I checked out and compiled a copy of Redis, but I quickly discovered it could be a viable storage system for our scores.
So I spun up a virtual machine, installed Redis, and started to pound away at it. One of the things I wanted to test was the new feature (at the time) of Virtual Memory for redis. What this allowed was for Redis to make it’s own Virtual Memory on the server and store the lest recently used data to disk. When a Redis object was retrieved and it was in the VM, it would swap it back into memory, and swap older data to disk. This was just like the idea I had before with using an archaic system with memcached, but much more elegant.
The second thing was Redis’s support for multiple data types. So instead of having a json encoded string that held the scores and user ids for another user, we could have a hashtable or even a sorted set. It was a much more elegant solution than what we were thinking of before.
However, there were a few limitations that we faced when implementing Redis. Redis works flawlessly with smaller sets of data. But the larger your data set, the more careful and aware you need to be about a few things that will kill your redis instance.
First off, with memcached, if you set a memory limit, it is a hard limit. I’ve never seen memcached use more memory than what you allow it to use. Redis, on the other hand, has soft memory limits. This is because of the way the Background Saves work so you can have persistant data. When a Background Save is issued, Redis will fork itself, and have one thread save a snapshot to the disk, and the other thread will continue to operate. In order to do this, Redis will exceed the standard memory limits, and your memory usage will go up much quicker. Once the background save is complete, it will close the forked backup process and sync the memory back to one data set. (I’m not a computer science guy, nor do I know a lot of lower level programming, so I might be describing this not 100% accurate, but this is how I envision it in my head).
Now, if you are not using the Virtual Memory, this isn’t that bad. However, when using Virtual Memory, the Background Saves take a great deal longer (from seconds to almost a minute or so), which isn’t too bad, but there is a catch. You will not be able to swap to the VM Disk until after the BG Save. This means that all Redis objects swaped to memory will stay in memory until after the BG Save is complete. This is because, like the memory from the fork, the Virtual Memory file is being used for the BG Save instead of the process handling requests.
So the one limitation we’ve encountered is we cannot run scripts that “query” large amounts of data from Redis. For example, it would be very simple to get a listing of users using the KEYS command, and then loop through the values using a HLEN to read the length. This will cause you to swap from and to the virtual memory a great deal. If a Background Save is occuring, Redis will not swap to disk until the BG Save is complete. This means if you have 10 GB of data in Virtual Memory, and you have a 1 GB instance of Redis, you will suddenly be reading gigs of data into Redis’s memory. If you are on a 2GB machine, you can easily use up all the memory on the server and then start using the System’s Swap.
Once you start using the Operating System’s Virtual Memory, it is game over. Your Redis instance’s performance will tank, and your Background Save might not finish, and you will need to restart redis.
There are some ways to give Redis a “Hard” limit on memory, but we opted to configure our servers in a way that doesn’t require this. If Redis hits the memory limit, it can start throwing write operation errors, which we didn’t want.
After much trial and error and testing internally, we believe we found a sweet spot. We deploy a redis server, and spin up three redis instances on a different port each. Each is configured with 4 GB of Virtual Memory using 4096 size pages, and only 256 MB of “memory” using the vm-max-memory setting. While you would think this would mean a hard limit, it is a soft limit, and more of a goal “we’ll try to only use 256 MB of memory to store the data, if we’ll exceed it if needed.” Given our patterns of usage, Redis’s actually usage fluctuates (based on ps aux’s reporting) between 640 MB to 1100 MB of RAM, depending on if a background save is being executed or not. Redis is configured to perform background saves every 10 minutes, which take about a minute to perform.
So between the other admin services running on the box, and the three redis instances, we use just over 3GB RAM:
The amount of CPU required is extremely low, and almost 100% from writing the background saves to the disk:
So what do we get in return? We estimate each instance with 256 MB of data can hold roughly 2,000 active users. So with a single server we can support 6,000 users online at any given moment. We can store the scores for roughly 360,000 users on a single 4 GB box, which is about 1.8 Billion scores. Then, if we need more, we just provision another box, and our system will start assigning users to the instances on that machine.
Because of our ability to re-generate the scores, we decided to only convert the users who had logged in the past three months to the new system. If a user who hadn’t logged in since then logged in again, it would in the background assign them to a redis instance and rebuild their matches for them.
A high level view of how we use Redis with our PHP based website is we have a class called RedisManager than manages pretty much all the connections to Redis. It supports lazy connections (which is important to us, since we don’t want to have to connect to every instance of Redis we have), and I hope to open source it some day soon.
One key performance trick we’ve noticed is to use Pipelining to the redis instance. We don’t use this so much on the website, but our score generation “threads.” Writing thousands of scores one by one eats up a lot of network overhead versus sending them in batches (we send in batched of 500 or 1000, depending on the situation). Using pipelining is extremely fast for us, and I highly recommend it for any large batch of commands.
Now, it might seem that we’ve put a lot of thought and effort into using Redis, and I want to make sure it was understood that Redis itself wasn’t difficult to use, but the volume of data were were dealing with. On Dating DNA, we also use Redis to power out in-app chat system (which I’ve written about previously), and it works great and is currently only using 146.70 MB of RAM, and serves thousands of requests per second.
I still have a lot of great ideas for Redis and Dating DNA, both with the score system, and outside of it. I plan on writing several reporting tools for Redis and hope to share them on github. I am currently working on the code and scripts for automatic deployment for Redis servers for Dating DNA, so we can scale easily with the push of a button. I’m excited for the work that is being done on Redis, and highly recommend it to anyone.
If there are details you would like to know more about, leave a comment and I’ll try to answer them. If you see me at tek11, feel free to ask me about this, and I can show you in detail how it works (internet permitting).