Update: So I spelt Neal’s name wrong. Sorry Neal! I also added some other links at the bottom that he mentioned to me.
It’s been a little over 48 hours, and StatsD, Graphite, and Graphene have already paid major dividends. At Tek12, Neal Anders gave a talk on Graphite that I wanted to attend, but I was speaking on Redis at the same time, so I couldn’t. Then, a week or so after the conference, someone posted a link about StatsD on Twitter (I can’t remember who), so I started poking around, loved what I saw, and within an hour I had it running on a server for Dating DNA on Friday night (I still do some contract work for them to help keep their infrastructure running well). I quickly added about a dozen metrics to track and went to bed.
I noticed two things after looking at the data this weekend. First off, we had one API in particular that blew me away. We were measuring the time it took to complete API requests, in milliseconds (ms):
Wow… so we had requests taking 5k to 10k milliseconds (5 to 10 seconds), and spiking up to 40 seconds. After 10 minutes of looking at the code, I found the inefficiency, and if you look at it now (the arrow is where the code change was made), it’s taking less than 300 ms.
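For anyone curious what timing an API request looks like on the wire, here is a rough sketch in Python. It is not the actual Dating DNA code. It assumes StatsD is listening on its default UDP port (8125) on localhost, and the metric name `api.matches.time` is made up for illustration. The only StatsD-specific part is the packet format for timers, `<name>:<value>|ms`.

```python
import socket
import time
from contextlib import contextmanager

# Assumption: StatsD running locally on its default UDP port.
STATSD_ADDR = ("127.0.0.1", 8125)


def format_timing(metric, millis):
    """Build a StatsD timer packet of the form <name>:<value>|ms."""
    return f"{metric}:{int(round(millis))}|ms"


def send_udp(packet, addr=STATSD_ADDR):
    """Fire-and-forget send over UDP, so metrics never block a request."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), addr)
    finally:
        sock.close()


@contextmanager
def timed(metric, send=send_udp):
    """Time the wrapped block and report the elapsed milliseconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        send(format_timing(metric, elapsed_ms))


# Hypothetical usage inside a request handler:
# with timed("api.matches.time"):
#     handle_request()
```

Because the timer is reported in a `finally` block, slow requests that end in an exception still show up on the graph, which is usually what you want when hunting for spikes.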
Then, I noticed another oddity. Across the board, most APIs had patterns of massive spikes for a few minutes:
Things would run optimally, but those hourly massive spikes were painful. After a few hours of investigation, I found a Munin MySQL plugin that would aggressively analyze disk usage by querying internal tables on MySQL. This was highly inefficient and taxing on the server, causing massive disk I/O issues. After disabling that particular check (we just check MySQL’s disk usage by looking at the file size on disk, not at how MySQL internally manages space), those spikes went away.
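As a rough illustration of the cheaper approach, checking size on disk rather than asking MySQL, here is a small Python sketch. It is only a sketch under assumptions: the data directory path `/var/lib/mysql` and the metric name `mysql.disk_bytes` are hypothetical, and the result is formatted as a StatsD gauge (`<name>:<value>|g`) that would then be sent to StatsD over UDP.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all files under path without touching MySQL itself."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.lstat(full).st_size
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return total


def format_gauge(metric, value):
    """Build a StatsD gauge packet of the form <name>:<value>|g."""
    return f"{metric}:{value}|g"


# Hypothetical usage (path and metric name are assumptions):
# packet = format_gauge("mysql.disk_bytes", dir_size_bytes("/var/lib/mysql"))
```

This only walks the filesystem, so it puts no query load on the database, at the cost of not knowing how much of that space MySQL considers free internally.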
So within 48 hours I found two serious issues that had flown under the radar until I had a chance to look at the data. Dating DNA now has a dashboard using Graphene to monitor critical metrics. I added tabs to it that auto-rotate after mouse inactivity, and we can easily watch dozens of metrics throughout the day. I have a few more things I want to add to the system, such as sending all our Munin data to StatsD/Graphite so we can compare things like CPU usage against online users or other activity. I’m also really excited to add this to Deseret News; I just didn’t feel comfortable hacking on its production environment over the weekend during my first month on the job.
I’ll have a blog post later this week explaining in more detail how I set everything up. Until then, here are some articles that helped me set up StatsD, Graphite, and Graphene:
Honestly, there shouldn’t be a single reason why any company couldn’t track better metrics for their application. Within an hour I was up and running, within a day I had it nice and polished, and within two days I had fixed two major performance issues. You’ll likely be seeing several posts from me over the next few weeks about even cooler things to do with Graphite. Big thanks to Etsy, Orbitz, and all the contributors to these projects!