By Ken Rudin on Friday, June 24th, 2011 1:45 PM PST

It’s pretty well known that Zynga has a metrics-driven culture. It’s true. We meticulously measure most areas of our business, from user survey results, to engagement levels in our games, to the actions our players take during a game session. As a result, we generate (and analyze) tens of billions of rows of data daily. The question is, how do we store it?

From SQL to Big Data and Hadoop

When we first set out to build our corporate data infrastructure two years ago, the first thing we had to figure out was how to answer that question. It wasn’t an easy task. For the majority of analyses we needed to do, SQL as a query language was a great fit. Plus, a huge population of people knew SQL, including most of our product managers, and there was a vast market of graphical analysis tools that work with SQL databases for those who didn’t know SQL. Given our metrics-based DNA, having a solution that was accessible by everyone (not just engineers) was key to nurturing our culture.

But, conventional wisdom at the time was that SQL databases just couldn’t handle really large scale data.  The NoSQL approaches were rapidly becoming popular, and for many people Big Data was becoming synonymous with Hadoop. Hadoop gives you a high level of scalability and flexibility, which were two things we really needed.  So, we went with Hadoop, right? Actually, no. To understand why, let’s look at why several web companies that have Big Data have adopted Hadoop.

1.   Scale, Baby, Scale: The first reason is scale. Many of these companies generate enormous amounts of web log data, which often exceeds 50TB of new data daily. SQL databases can’t feasibly process these enormous logs. Also, web logs have semi-structured data, making them a poor fit for processing with SQL even at a smaller scale.

2.   Schema and Semantics: The second reason several web companies have adopted Hadoop is SQL’s schema requirement. With SQL, you need to model the data schema up front, so you have a place to put the data you’re loading. But these Big Data web applications are highly dynamic, just like Zynga’s games, with changes happening weekly or even daily. It’s not feasible to update the SQL schema that frequently. Hadoop doesn’t impose a schema, so you can add any new data you want to the system and define the data semantics later, when you write your Hadoop / MapReduce programs to process the data.
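The schema-on-read idea behind Hadoop can be sketched in a few lines of Python; the log format and field names below are hypothetical, invented purely for illustration:

```python
import json

# Hypothetical semi-structured log lines: nothing is modeled up front,
# and a new field ("bonus_xp") can appear without any schema migration.
raw_logs = [
    '{"event": "session_start", "user": 42}',
    '{"event": "level_up", "user": 42, "level": 7, "bonus_xp": 120}',
]

# Schema-on-read: semantics are applied only when the data is processed,
# much as a MapReduce job would interpret the records at query time.
def sessions_started(lines):
    return sum(1 for line in lines
               if json.loads(line).get("event") == "session_start")

print(sessions_started(raw_logs))  # 1
```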

Scaling in SQL

But, we came back to the fact that for the types of analyses we do at Zynga, SQL is the most efficient language, and SQL databases were the most efficient way to process these queries. Hadoop is more flexible and can be applied to a vast set of complex problems, but the downside is that it’s less specifically optimized for running analytical queries. And while solutions like Hive and Pig give you a SQL-like front-end for analysis, they don’t make the MapReduce backend more efficient at processing queries.

So, we turned our attention to identifying an approach that would allow us to use a SQL engine, yet avoid the scale and static schema issues. At the heart of the scale issue is the size of the logs. So, rather than processing 50TB of semi-structured data to get at what for Zynga amounts to around 5TB of structured data, we decided to avoid using the logs as the data source, and instead created a set of functions that our games would call to write data directly into our database. Since we only track the data that the games explicitly want to track, and it’s all structured data with none of the excess cruft found in logs, the scale issue goes away. We just write the 5TB of structured information that we need, and this is a scale that high-end SQL databases can definitely handle.
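A minimal sketch of the direct-write approach described above, assuming a hypothetical track_event helper and using SQLite as a stand-in for the actual warehouse:

```python
import sqlite3

# Stand-in for the warehouse (the real system is a high-end SQL database);
# the table layout here is invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE game_events (game TEXT, user_id INTEGER, action TEXT)")

# Hypothetical tracking function the games would call, instead of writing
# logs that get parsed later: only the fields we care about are ever written.
def track_event(game, user_id, action):
    db.execute("INSERT INTO game_events VALUES (?, ?, ?)", (game, user_id, action))

track_event("farmville", 42, "plant_crop")
track_event("farmville", 42, "harvest")

count = db.execute(
    "SELECT COUNT(*) FROM game_events WHERE action = 'harvest'").fetchone()[0]
print(count)  # 1
```

Because the data arrives fully structured, it is queryable immediately, with no parsing or transformation step in between.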

There are also a few secondary advantages to this approach. Since the data is stored by calling a function, that means that it’s fully structured, so there’s no need to parse or transform the data before you can query it. Of course, the challenge with this approach is that it requires each of our games to be instrumented to call the data tracking functions, rather than inspecting the logs after the fact to figure out what happened. But the benefits for us have massively outweighed the costs.

Focusing on the schema issue, a very simple solution has worked for us. Rather than needing to change the schema weekly, we provide functions that let our games store data as generic “name-value” pairs in our database, for those types of data that don’t naturally fit into existing tables in the schema. Of course, one of the challenges is that because they’re generic name-value pairs that can represent anything the games want, there aren’t any pre-existing semantics we can use to run prebuilt reports against that data. But, we provide reports that let users specify which data is theirs and give them the ability to slice and dice it, applying their own semantics to the data they inserted.
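A hedged sketch of the name-value approach; the table layout and helper function are illustrative assumptions, again with SQLite standing in for the warehouse:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Generic name-value table for data that doesn't fit the fixed schema;
# the table and column names here are invented, not an actual schema.
db.execute("""CREATE TABLE game_attributes (
    game TEXT, user_id INTEGER, name TEXT, value TEXT)""")

def track_attribute(game, user_id, name, value):
    db.execute("INSERT INTO game_attributes VALUES (?, ?, ?, ?)",
               (game, user_id, name, str(value)))

# A new feature ships mid-week: no ALTER TABLE needed.
track_attribute("cityville", 7, "zoo_animals_collected", 3)
track_attribute("cityville", 7, "zoo_animals_collected", 5)

# The game team applies its own semantics when querying its data.
total = db.execute("""SELECT SUM(CAST(value AS INTEGER)) FROM game_attributes
                      WHERE name = 'zoo_animals_collected'""").fetchone()[0]
print(total)  # 8
```

The cost of this flexibility is visible in the query: the value column is untyped, so the semantics (here, casting to an integer and summing) live with whoever inserted the data.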

Addressing the scale and the schema issues meant that we could in fact use a high-end SQL database to meet our needs. In our case, we chose Vertica as our database vendor. Two years later, it seems our approach has worked, giving us an analytics platform that is efficient and accessible not just to our engineers, but to non-engineers in product management, game design, marketing, finance, and customer support.

If you’ve had to deal with massive amounts of data, what criteria did you use in making your technology selection? Please share any and all experiences in the comments below.


13 Responses to “Deciding How to Store Billions of Rows Per Day”

  1. Vishal Kapoor Says:

    “rather than processing 50TB of semi-structured data to get at what for Zynga amounts to around 5TB of structured data”

    Hi Ken,

    How did you determine that 50TB of semi-structured data = 5TB of structured data?

    Values are typically stored in semi-structured Hadoop logs using special control characters as delimiters. From my experience with such logs, I don’t see how 50TB can be compressed down by an order of magnitude by removing the delimiter information.

    Did you run an experiment to determine this 10 fold space reduction?


  2. Ken Rudin Says:

    @Vishal – Sorry for not being clearer. I don’t mean to imply that 50TB of semi-structured data = 5TB of structured data. I’m saying that in our use case (and in many similar use cases), when log files are written, a lot of data goes into the logs that you don’t necessarily care about from an analytics perspective. For example, there’s typically a URL with parameters appended at the end. In many cases, we only want information about users’ actions in our games, which is stored in the parameter fields, so we can discard much of the rest of the information. When we’re all done sifting through the 50TB of logs, the data we really want to keep and analyze amounts to 5TB. In the approach we’ve taken at Zynga, our games write information directly into the warehouse, so we only write the information we want to keep and analyze in the first place, which is 5TB.
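The filtering Ken describes, keeping only the URL parameters from each log line and discarding the rest, can be sketched as follows; the log line format is hypothetical:

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical access-log line: most of it (IP, timestamp, status, bytes)
# is noise for analytics; only the query parameters describe the action.
log_line = ('203.0.113.5 - [24/Jun/2011] '
            '"GET /track?game=farmville&action=harvest&user=42 HTTP/1.1" 200 512')

url = log_line.split('"')[1].split()[1]   # the request path with its parameters
params = parse_qs(urlsplit(url).query)    # {'game': ['farmville'], ...}
print(params["action"][0])  # harvest
```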

  3. Ankur Patel Says:

    I have actually had good success using Google Spreadsheets. It lets me store many rows of data, and I can use the APIs to push and pull in real time.

    You should check it out.

  4. James Rhys Says:

    Hahaha, I agree with Will.

    I think all applications have to plan for future influxes in data requirements and changes in schema. An application needs to respond to changes in clients’ information requirements and gameplay. I made the decision to go with Hadoop for my models because I didn’t want to be stuck between a rock and a hard place when my consumers suddenly wanted to start integrating their accounts with G+.

  5. Ankur Patel Says:

    William, you should try it before speaking out of your butt.

    It has good APIs too. Here’s the PHP version, since Zynga likes PHP:

  6. Ken Rudin Says:

    @James – I agree that being able to handle changes in data requirements is critical, since users’ needs change quickly. The challenge is that we have multiple games, and each one has 1-2 releases per week, often introducing new features into games that would traditionally require schema changes. But we of course can’t update our schema several times a week. We have addressed this problem by having part of our SQL database act like a key-value store, so we can put any data we want into it. Standard data like users, sessions, level in the game, etc. go into the traditional SQL schema, and rapidly changing or transient data elements go into the part of the database that looks like a key-value store. We’ve been using this approach for over two years, and it works well for us.
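The hybrid layout Ken describes might look roughly like this; the table and column names are invented for illustration, with SQLite standing in for the SQL database:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hybrid layout sketch: stable fields live in a fixed schema, while
# fast-changing fields go into a generic key-value table.
db.execute("CREATE TABLE users (user_id INTEGER, level INTEGER)")
db.execute("CREATE TABLE user_kv (user_id INTEGER, key TEXT, value TEXT)")

db.execute("INSERT INTO users VALUES (42, 7)")
# A transient feature shipped this week: no schema change required.
db.execute("INSERT INTO user_kv VALUES (42, 'halloween_pumpkins', '12')")

# Both halves live in one database, so a single query can join them.
row = db.execute("""
    SELECT u.level, kv.value
    FROM users u JOIN user_kv kv ON u.user_id = kv.user_id
    WHERE kv.key = 'halloween_pumpkins'""").fetchone()
print(row)  # (7, '12')
```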

  7. Sunil Mallya Says:

    Are the “name-value” pairs stored in Vertica too?
    There wouldn’t be any benefit to storing these in a column store, right? Or am I missing something?

  8. Ken Rudin Says:

    @Sunil – Yes, the name-value pairs are stored in Vertica too. Note that it’s not truly a name-value pair; there are other columns as well, to provide a bit more structure to the data if desired. But since the table is still pretty narrow, storing the data by columns instead of by rows doesn’t yield quite as significant an advantage as when a table is very wide. There is still an advantage, though, since you only have to scan the columns you’re interested in, instead of scanning the whole row. Additionally, since the data can come from multiple games and can have very different meanings, the cardinality of any given column is higher than if the columns had a single clearly specified semantic meaning. So, you’ll get a little less data compression than you would typically get otherwise.

    But, though the advantages are reduced, there still is a performance advantage by using a column store for this data because we have so much data in this table.

    Additionally, even if there weren’t a meaningful advantage of using a column store database for this portion of our data, I’d still prefer to have it in the same database as the other data to simplify querying and database admin.
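The column-scan advantage mentioned above can be shown with a toy columnar layout in Python; this is purely illustrative and says nothing about Vertica’s actual storage format:

```python
# The same three-column table, laid out two ways.
rows = [("farmville", 42, "harvest"), ("cityville", 7, "plant")]

# Columnar layout: one list per column.
columns = {
    "game":    [r[0] for r in rows],
    "user_id": [r[1] for r in rows],
    "action":  [r[2] for r in rows],
}

# Counting harvests touches only the "action" column; in a row store,
# the same query would read every field of every row.
harvests = sum(1 for a in columns["action"] if a == "harvest")
print(harvests)  # 1
```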

