On the backend of all these games is a tiny systems team of three people – myself, Christopher Holt and Manchul Park. We built everything from scratch and thought we had our approach to building and scaling backend systems down pretty well. That was until Draw Something came along.
The story of Draw Something’s development actually starts around four years ago when the first version of the game was created on our website OMGPOP.com (then iminlikewithyou.com). It was called “Draw My Thing” and was a real-time drawing game. It was fun and had a somewhat big player base relative to how big our site was at that time. We also made a Facebook version of it, and the game developed a pretty large following.
Last year, we decided to make a mobile version of the game. At that point, OMGPOP was still trying to find its way in the world. We did everything we could to land that hit game. For us, like many developers, that meant working on as many games as possible, and fast. Draw Something was no different. We knew the game had potential, but no one could’ve predicted how big a hit it would become.
From a technical standpoint, we treated Draw Something like a lot of our previous games. The backend team has always built things to be efficient, fast, and scalable. We’ve learned to keep things simple. The original backend for Draw Something was designed as a simple key/value store with versioning. The service was built into our existing Ruby API (using the Merb framework and the Thin web server). Our initial idea was to reuse our existing API for everything we’d done before, like users, signup/login, virtual currency, and inventory, and to write some new key/value code for Draw Something. Since we design for scale, we initially chose Amazon S3 as the data store for all this key/value data. The reasoning was that we could sacrifice some latency in exchange for effectively unlimited scalability and storage.
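For readers who want something concrete, here is a minimal sketch of what a key/value store with versioning can look like. It is written in Python for illustration and is not the production Ruby code. The class name, the `expected_version` parameter, and the optimistic-concurrency behavior are all assumptions, since the article only says the store was versioned.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of a key."""


class VersionedStore:
    """Hypothetical in-memory key/value store that tracks a version per key."""

    def __init__(self):
        # key -> (version, value); a dict stands in for the real data store
        self._data = {}

    def get(self, key):
        """Return (version, value) for the key, or None if it is missing."""
        return self._data.get(key)

    def put(self, key, value, expected_version=0):
        """Write value only if the caller has seen the current version.

        A version of 0 means "the key does not exist yet".
        Returns the new version number.
        """
        current_version = self._data.get(key, (0, None))[0]
        if expected_version != current_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version
```

In this sketch a client reads a game's state together with its version, and a later write succeeds only if nobody else has written in between. That is one common way to use versions; the original system may have used them differently.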
The rest of our stack is pretty standard. Anyone who wants to build scalable systems will try to make every layer of the system scale independently of the rest. On the web frontend we use the NGINX web server, which points to the HAProxy software load balancer, which then hits our Ruby API running on the Thin web server. The main datastore behind this is MySQL – sharded when absolutely necessary. We use memcached heavily, and Redis for our asynchronous queueing via the awesome Ruby library Resque.
A few days after Draw Something launched, we started to notice something…strange. The game was growing — on its own. And it was growing fast. On release day, it reached 30,000 downloads. About 10 days later, the rocket ship lifted off – downloads accelerated exponentially, soon topping a million.
Celebrities started tweeting about the game – from John Mayer to Pauly D – sending us even more traffic. And people playing the game weren’t leaving – they were hooked, so total usage climbed even higher than the number of people downloading the game every day.
Most engineers develop their software to scale, but know that in any complex system, even if you try to benchmark and test it, it’s hard to tell exactly where things will fall over, in what way, what system changes need to be made, and when.
The first issue we ran into came from how we ran the Thin web server. Our usual API is really fast, so running Thin the way we always had – single threaded, one request at a time – was fine. But once requests depend on the public cloud, unpredictable response times can back everything up. We watched things start to back up and knew this was not sustainable. In the meantime we just kept bringing up more and more servers to buy ourselves some time. Fortunately we had anticipated this and designed the Draw Something API so that we could easily break it out from our main API and framework. Always interested in the latest tech, we had been looking at Ruby 1.9, fibers, and in particular EventMachine + synchrony for a while. Combined with the need for a solution ASAP, this led us to Goliath, a non-blocking Ruby app server written by the guys at PostRank. Over the next 24 hours I ported over the key/value code and other supporting libraries, wrote a few tests, and we launched the service live. The result was great. We went from 115 app instances on over 6 servers to just 15 app instances.
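The difference between one-request-at-a-time and non-blocking handling can be sketched in a few lines. The example below uses Python's `asyncio` purely as an analogy; it is not Goliath's API, and the simulated delay and function names are made up for illustration.

```python
import asyncio
import time


async def slow_datastore_call(delay):
    """Stand-in for a request to a remote data store with high latency."""
    await asyncio.sleep(delay)
    return "ok"


async def handle_one_at_a_time(n, delay):
    """Each request waits for the previous one to finish (blocking style)."""
    return [await slow_datastore_call(delay) for _ in range(n)]


async def handle_concurrently(n, delay):
    """All requests wait on the data store at the same time (non-blocking style)."""
    return await asyncio.gather(*(slow_datastore_call(delay) for _ in range(n)))


def timed(handler, n, delay):
    """Run a handler and return (elapsed_seconds, results)."""
    start = time.perf_counter()
    results = asyncio.run(handler(n, delay))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    serial_time, _ = timed(handle_one_at_a_time, 10, 0.05)
    concurrent_time, _ = timed(handle_concurrently, 10, 0.05)
    print(f"one at a time: {serial_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

With a single-threaded blocking server, total time grows with the number of slow requests, so the only fix is more processes. When a process can wait on many slow calls at once, far fewer instances are needed, which is consistent with the drop in instance count described above.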
The smooth sailing was short lived. We quickly started seeing spikes and other really strange performance issues. At this point, we were pretty much working around the clock. Things got really bad around 1am one night, which is when we realized the main issue – our cloud data store was throwing errors on 90% of our requests. Shortly after, we received an email from our vendor telling us we were “too hot” and causing issues, so they would have to start rate limiting us. At this point, the service was pretty much a black box to us, and we needed to gain more control. We were now receiving around 30 drawings per second, a huge number (at least to us at the time). So there we were at 1am, needing a completely new backend that could scale and handle our current traffic. We had been using Membase for a while for some small systems, and decided it made the most sense, since it had worked well for us.
We brought up a small cluster of Membase (a.k.a. Couchbase), rewrote the entire app, and deployed it live at 3am that same night. Instantly our cloud datastore issues eased, although we still relied on the old datastore for a lazy migration of data to our new Couchbase cluster. With these improvements the game continued to grow, onward and upward.
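A lazy migration of this kind is usually done on the read path: look in the new store first, fall back to the old one on a miss, and copy whatever is found. The sketch below shows that pattern in Python with plain dictionaries standing in for both stores. The class and attribute names are hypothetical, and the article does not describe the actual implementation.

```python
class LazyMigratingStore:
    """Read-through migration from an old store to a new one (illustrative)."""

    def __init__(self, new_store, old_store):
        self.new_store = new_store  # e.g. the new cluster
        self.old_store = old_store  # e.g. the original cloud data store

    def get(self, key):
        """Return the value for key, migrating it on first access."""
        value = self.new_store.get(key)
        if value is not None:
            return value
        # Miss in the new store: fall back to the old one.
        value = self.old_store.get(key)
        if value is not None:
            # Copy the value over so later reads never touch the old store.
            self.new_store[key] = value
        return value

    def put(self, key, value):
        """All writes go to the new store only."""
        self.new_store[key] = value
```

The appeal of this approach is that there is no bulk copy step. Load on the old store falls off by itself as active keys get pulled across, while keys nobody asks for are never moved.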
The next week was even more of a blur. Other random datastore problems started to pop up, along with having to scale other parts of the infrastructure. During this time, we were trying to do some diligence and speak to anyone we could about how they would handle our exploding growth.
I must’ve spoken to 10+ smart, awesome people including Tom Pinckney and his great team from Hunch, Frank Speiser and his team from SocialFlow, Fredrik Nylander from Tumblr, Artur Bergman from Fastly, Michael Abbott formerly of Twitter, and many others. The funny part was that every person I spoke to gave me a different, yet equally valid, answer on how they would handle this challenge. All of this was more moral support than anything and made us realize our own answers were just as valid as those of these other teams, for whom we have great respect. So we continued along the path that we started on, and went with our gut on what tech to pick and how to implement it.
Even with the issues we were having with Couchbase, we decided it was too much of a risk to move off our current infrastructure and go with something completely different. At this point Draw Something was being played by 3-4 million players each day. We contacted Couchbase and got some advice, which really came down to expanding our clusters, eventually onto really beefy machines with SSDs and tons of RAM. We did this, made multiple clusters, and sharded between them for even more scalability over the next few days. We were also continuing to improve and scale all of our backend services as traffic continued to skyrocket. We were now averaging hundreds of drawings per second.
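Sharding between clusters means deciding, for every key, which cluster owns it. The simplest scheme hashes the key and takes the result modulo the number of clusters, as in the Python sketch below. The article does not say which scheme was actually used, and the cluster names here are invented.

```python
import hashlib


def shard_for(key, clusters):
    """Pick the cluster that owns a key using a stable hash of the key.

    hashlib is used instead of the built-in hash() so that every process
    and every server maps the same key to the same cluster.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


if __name__ == "__main__":
    # Hypothetical cluster names, for illustration only.
    clusters = ["cluster-a", "cluster-b", "cluster-c"]
    for key in ["game:1001", "game:1002", "game:1003"]:
        print(key, "->", shard_for(key, clusters))
```

One caveat with plain modulo hashing is that adding a cluster changes the owner of most keys. Schemes such as consistent hashing or a lookup table reduce that churn at the cost of extra complexity.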
At one point our growth was so huge that the number of players, already in the millions, was doubling every day. It’s actually hard to wrap your head around the fact that if your usage doubles every day, that probably means your servers have to double every day too. Thankfully our systems were pretty automated, and we were bringing up tons of servers constantly. Eventually we were able to overshoot and catch up with growth by placing one order of ~100 servers. Even with this problem solved, we noticed bottlenecks elsewhere.
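The arithmetic behind daily doubling is worth spelling out. The Python sketch below projects how many servers are needed if load doubles each day and each server handles a fixed load. The starting load and per-server capacity are hypothetical numbers chosen to keep the math easy; they are not figures from Draw Something.

```python
import math


def servers_needed(load, per_server_capacity):
    """Smallest number of servers that can carry the given load."""
    return math.ceil(load / per_server_capacity)


def projected_servers(start_load, per_server_capacity, days):
    """Servers needed on each day, from day 0, if load doubles daily."""
    return [
        servers_needed(start_load * 2 ** day, per_server_capacity)
        for day in range(days + 1)
    ]


if __name__ == "__main__":
    # Hypothetical: 1,000 requests/s today, 500 requests/s per server.
    print(projected_servers(1000, 500, 4))
```

With those made-up inputs the projection is 2, 4, 8, 16, and 32 servers over five days. Adding a fixed number of servers each day falls behind almost immediately, which is why a single large order that overshoots demand can be the only way to get ahead of the curve.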
This had us on our toes and working 24 hours a day. I think at one point we were up for around 60+ hours straight, never leaving the computer. We had to scale out web servers using DNS load balancing, we had to get multiple HAProxies, break tables off MySQL to their own databases, transparently shard tables, and more. This was all being done on demand, live, and usually in the middle of the night.
We were very lucky that most of our layers were scalable with little or no major modification. Helping us along the way were our very detailed custom server monitoring tools, which allowed us to keep a very close eye on load and memory, and even provided real-time usage stats on the game, which helped with capacity planning. We eventually ended up with easy-to-launch “clusters” of our app that included NGINX, HAProxy, and Goliath servers, all independent of everything else; each cluster we launched increased our capacity by a constant amount. At this point our drawings per second were in the thousands, and traffic that looked huge a week ago was just a small bump on the current graphs.
Everyone at OMGPOP was very supportive of our work and fully realized how important what we were doing was for our company. We would walk in to applause, bottles of whiskey on our desk, and positive (but tense) faces.
It’s rare to see growth of this magnitude in such a short period of time. It’s also rare to look under the hood to see what it takes to grow a game at scale. To date, Draw Something has been downloaded more than 50 million times within 50 days. At its peak, about 3,000 drawings are created every second. Along with the game’s success, we’re quite proud to say that although there were a few rough patches, we were able to keep Draw Something up and running. If the game had gone down, our huge growth would’ve come to a dead stop.
This week, we’re thrilled to release some new features in Draw Something like comments and being able to save drawings. Our players have been dying for them.
Now that we’re part of Zynga (technically Zynga Mobile New York) we’re able to re-focus efforts on making Draw Something as good as possible, while still maintaining the culture that makes OMGPOP such a special place. We’re even making plans to move the game over to Zynga’s zCloud – infrastructure that’s tuned and built specially to handle workloads for social games.
Looking back at the roller coaster ride of the last few weeks, it’s crazy to think how far we’ve come in such a short period of time. Coming from a small company where a handful of people needed to do everything to Zynga where we have access to the people and the technology needed to grow – from engineers to the zCloud – it’s amazing.
In the end, we finally found our hit game. Despite the late hours, near misses and near meltdowns, we landed on a backend approach that works. Here at OMGPOP we call that drawsome.
[Originally published on Gamasutra]