50 million profiles and 100s of ways to view them. 150GB of data processed and 100 million rows inserted. Charts, reports, optimal counts, headroom computations, monitoring dashboards – that is 24 hours in the life of zPerfmon. And it all runs on one machine.
zPerfmon was conceived and built to answer one question: what went wrong in production?
Zynga has a huge amount of code in production across dozens of games and services. We can unit test, stress test and feature test, but those tests can never cover everything that happens in production. When something unexpected happens in production, which is all the time, and there are no footprints in the sand, investigation is arduous and frustrating.
The more games, the more code, and the more potential problems. We gave it some thought. Sampled profiles, module boundary tracing, extrapolation from system metrics, random full profile collection – all of these left too many questions unanswered. Full execution profiles collected from production at a uniform rate, along with supplementary system metrics, seemed to be the best solution. That is what zPerfmon does.
The zPerfmon client generates execution profiles at a constant rate from a subset of production machines. These profiles are uploaded to the zPerfmon server at staggered intervals. Profile names follow a schema that identifies the page, time, machine configuration and other details.
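To make the naming schema concrete, here is a minimal sketch of parsing such a name. The exact field layout below is invented for illustration; the real zPerfmon schema may differ.

```python
import re
from datetime import datetime, timezone

# Hypothetical schema for illustration only, e.g.:
#   friend_list.1359704700.web-042.8core.xhprof
# (page . unix-timestamp . hostname . machine-config . extension)
PROFILE_RE = re.compile(
    r"^(?P<page>[\w-]+)\.(?P<ts>\d+)\.(?P<host>[\w-]+)\.(?P<config>\w+)\.xhprof$"
)

def parse_profile_name(name):
    """Split a profile filename into its schema fields."""
    m = PROFILE_RE.match(name)
    if not m:
        raise ValueError("not a profile name: %s" % name)
    fields = m.groupdict()
    # Convert the embedded Unix timestamp into a real datetime.
    fields["time"] = datetime.fromtimestamp(int(fields.pop("ts")), tz=timezone.utc)
    return fields
```

Encoding the metadata in the name means the server can route and group profiles without opening them.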
The server is a processing engine with a heartbeat of 30 minutes. All data available at a 30 minute boundary is grouped, sliced and diced. In addition to profiles, the server keeps user and instance counts and system metrics. All of this data is keyed by timestamp. The timestamps make it possible to dig down from an increase in instance count to a spike in CPU to a page with a missing break in a foreach() loop.
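The timestamp keying boils down to flooring every record to the same 30-minute boundary. A minimal sketch of that bucketing (names and shapes are my own, not zPerfmon's):

```python
HEARTBEAT = 30 * 60  # the server's 30-minute heartbeat, in seconds

def bucket_key(ts):
    """Floor a Unix timestamp to its 30-minute boundary."""
    return ts - (ts % HEARTBEAT)

def group_by_heartbeat(records):
    """Group (timestamp, payload) pairs into 30-minute windows.

    Keying profiles, instance counts and system metrics on the same
    boundaries is what makes them joinable later, so a CPU spike can
    be lined up against the page profiles from the same window.
    """
    buckets = {}
    for ts, payload in records:
        buckets.setdefault(bucket_key(ts), []).append(payload)
    return buckets
```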
Processed data is available through a web UI hosted on the processing machine. The UI and profile browser are what a typical user will see and use in zPerfmon. Meanwhile, there are scheduled tasks generating reports, scripts that compute headroom, tasks to do cleanup and more such task-lets infesting cron entries.
zPerfmon isn’t by any stretch of the imagination “light and agile with subroutines connected like a string of pearls.” The client is considered sacred. It shouldn’t cause downtime of its own or interfere with game code execution. We also imposed on ourselves the constraint that we would do all processing required on one machine.
The server has PHP, Python, shell scripts, yaml configs, cron jobs… It has four different ways to access the same DB. The same profile is read from storage many times in the same context. Processes exec others in a daisy chaining frenzy. There are multiple views of many tables and 30+ tables per game. It has 3 different APIs to expose the same data.
Despite all of the above, it crunches through a million profiles and significant chunks of other data every half hour. It is also remarkably stable and resilient.
The client delivers a PHP file which, when included, enables profile collection. That is accomplished by leveraging the constructor and destructor of a static class. There are knobs, valves and switches to control this behavior. It works in CLI or server mode, with or without APC, and can augment profiles with URL parameters, arbitrary strings and such. Profiles are written to a known location from where a cron job picks them up 4 times an hour.
Uploads from clients are dumped into a known directory on a ramdisk. Every 5 minutes those are un-tar-bzipped and grouped by page. Every half hour, the big daddy cron job generates aggregated profiles and extracts the top functions and the tracked ones. All of these are zipped into a blob and inserted into a table. Raw profiles are kept on local disk for a few days. Data extracted from profiles, along with collectd-delivered CPU, memory, network etc. usage metrics, is inserted into tables.
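The aggregation step is conceptually simple: xhprof profiles map "parent==>child" call edges to metric counters (e.g. "ct" for call count, "wt" for wall time), so merging per-request profiles for a page is a sum over those counters. A sketch, with function names of my own choosing:

```python
def aggregate_profiles(profiles):
    """Merge per-request xhprof profiles for one page into one profile.

    Each profile maps "parent==>child" edges to metric dicts such as
    {"ct": calls, "wt": wall_time_us}; summing them gives the
    aggregated view a half-hourly job could insert into the DB.
    """
    merged = {}
    for profile in profiles:
        for edge, metrics in profile.items():
            slot = merged.setdefault(edge, {})
            for name, value in metrics.items():
                slot[name] = slot.get(name, 0) + value
    return merged

def top_functions(profile, metric="wt", n=10):
    """Extract the n most expensive call edges by a given metric."""
    return sorted(profile.items(),
                  key=lambda kv: kv[1].get(metric, 0),
                  reverse=True)[:n]
```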
As long as the uploaded data is in xhprof format, the server doesn’t care about the source language. We have processed ActionScript- and Java-generated profiles as proof points – but there is no reason you couldn’t do the same for Ruby / Go / …
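That language independence falls out of the shape of xhprof data: any runtime that can emit a map of call edges to counters can feed the pipeline. A minimal xhprof-shaped profile, shown here as JSON purely for illustration:

```python
import json

# Keys are "parent==>child" call edges; "ct" is call count and "wt" is
# wall time in microseconds, with "main()" as the conventional root.
profile = {
    "main()": {"ct": 1, "wt": 5200},
    "main()==>render": {"ct": 3, "wt": 4100},
    "render==>db_query": {"ct": 9, "wt": 3600},
}

# Serialize and upload; the server only cares that the shape matches.
blob = json.dumps(profile)
```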
zPerfmon has helped us do post-mortems on a variety of performance issues and root-cause a host of aberrant behaviors in production. In some cases it was even used as a debugger. That is saying a lot for a tool which delivers data with a guaranteed half hour latency. We used it to tune our production deployments, and insights from zPerfmon helped us isolate inefficiencies in code and pinpoint the next focus area.
The code is at https://github.com/zynga/zperfmon to carve, criticize or construct. An amazing bunch of people made it happen, and I salute them for being such a wonderfully animated team. Seemingly stupid thoughts bordering on genius and profound thoughts masquerading as idiocy made design discussions and code battles lively. Tools and tech built on inspiration from the collected data were just as exciting, if not more so.
We had fun building it – hope you can use it and make a difference in your apps.