Aggregation, Part Two
Jul 12, 12:08 PM
So, last time I wrote about aggregation, I told you a little bit about our general aggregation philosophy. Now I’m going to address the age-old question: what the hell do you aggregate, and when?
Less Is More
While the first temptation when collecting gameplay data is to hoard everything, we’ve already shown that at some point, sooner rather than later, you’re going to run out of space. Which is why we’re aggregating in the first place.
But the important thing to note is that while aggregation is compression, it’s lossy compression. That is to say, you’re not putting all your data into a zip file and then uncompressing it later. You’re actively but selectively throwing out data.
The selection process is very important. Let’s illustrate this with an example. Suppose we have a cartoon-style brawl, where our players are attempting to knock each other out with weapons of extreme comedic value. The following table contains the records of an event that’s fired off every time someone hits somebody else with a weapon. It records the time, who did the hitting, who the target was, what weapon was used, whether the weapon’s secondary mode was being used, and whether the hit resulted in a knockout.
| Timestamp | Game | Player | Target | Weapon | Secondary? | Knockout? |
|---|---|---|---|---|---|---|
| 7/12/2006 10:28:00 | 1 | Tyrone | Osbie | Banjo | 0 | 0 |
| 7/12/2006 10:39:31 | 1 | Katje | Roger | Banjo | 0 | 1 |
| 7/12/2006 10:49:27 | 1 | Tyrone | Katje | Harmonica | 0 | 1 |
| 7/12/2006 10:58:48 | 1 | Roger | Jessica | Harmonica | 0 | 0 |
| 7/12/2006 11:11:29 | 1 | Roger | Katje | Banjo | 1 | 1 |
| 7/12/2006 11:11:55 | 1 | Jessica | Tyrone | Harmonica | 0 | 1 |
| 7/12/2006 11:16:14 | 2 | Katje | Tyrone | Pie | 0 | 0 |
| 7/12/2006 11:17:49 | 2 | Pirate | Tyrone | Banana | 0 | 0 |
| 7/12/2006 11:25:27 | 2 | Tyrone | Osbie | Banjo | 1 | 0 |
| 7/12/2006 11:26:36 | 2 | Pirate | Pirate | Harmonica | 0 | 0 |
| 7/12/2006 11:37:33 | 2 | Roger | Katje | Banjo | 0 | 1 |
| 7/12/2006 11:48:29 | 2 | Jessica | Pirate | Banana | 0 | 1 |
| 7/12/2006 11:55:41 | 2 | Katje | Roger | Toilet Bowl | 1 | 1 |
| 7/12/2006 12:07:04 | 2 | Osbie | Roger | Pie | 1 | 0 |
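To make the shape of that event concrete, here is a minimal sketch of one hit record as a Python data structure. The class name `HitEvent`, the field names, and the `parse_row` helper are all made up for illustration; they are not taken from any real logging system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HitEvent:
    """One row of the hit table: a single weapon hit (hypothetical type)."""
    timestamp: datetime
    game: int
    player: str      # who did the hitting
    target: str      # who got hit
    weapon: str
    secondary: bool  # True if the weapon's secondary mode was used
    knockout: bool   # True if the hit resulted in a knockout


def parse_row(row: str) -> HitEvent:
    """Parse one pipe-delimited table row into a HitEvent."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    ts, game, player, target, weapon, secondary, knockout = cells
    return HitEvent(
        timestamp=datetime.strptime(ts, "%m/%d/%Y %H:%M:%S"),
        game=int(game),
        player=player,
        target=target,
        weapon=weapon,
        secondary=secondary == "1",
        knockout=knockout == "1",
    )


event = parse_row("| 7/12/2006 11:11:29 | 1 | Roger | Katje | Banjo | 1 | 1 |")
```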
When discussing aggregation of this data set, there are a few steps you need to go through.
First, figure out who the customers of this data set are. Obviously the game designers care about the data because it’ll help them with balance. Let’s also say you have a leaderboard system, and your community managers care about the data as well. So let’s say those are the two customers.
Next you have to sit down with each type of customer and go over when they need the data, when it becomes deprecated, and what data they would like to stick around for a long time.
You sit down with your lead designer. She says, “We really like this hit data. Probably the most important thing to us is how often each weapon is used, how often the secondary mode is used, and the effectiveness of each weapon.”
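What the designer is asking for boils down to a per-weapon rollup that ignores who was involved. A rough sketch over the sample hits might look like the following. Note that treating “effectiveness” as knockouts per hit is my assumption here, and `weapon_stats` is a hypothetical helper name.

```python
from collections import defaultdict

# (weapon, secondary, knockout) for each hit in the sample table above.
# Player and target are dropped because this view is depersonalized.
hits = [
    ("Banjo", 0, 0), ("Banjo", 0, 1), ("Harmonica", 0, 1), ("Harmonica", 0, 0),
    ("Banjo", 1, 1), ("Harmonica", 0, 1), ("Pie", 0, 0), ("Banana", 0, 0),
    ("Banjo", 1, 0), ("Harmonica", 0, 0), ("Banjo", 0, 1), ("Banana", 0, 1),
    ("Toilet Bowl", 1, 1), ("Pie", 1, 0),
]


def weapon_stats(hits):
    """Count hits, secondary-mode hits, and knockouts for each weapon."""
    stats = defaultdict(lambda: {"hits": 0, "secondary": 0, "knockouts": 0})
    for weapon, secondary, knockout in hits:
        s = stats[weapon]
        s["hits"] += 1
        s["secondary"] += secondary
        s["knockouts"] += knockout
    # Assumed definition of effectiveness: knockouts per hit.
    for s in stats.values():
        s["ko_rate"] = s["knockouts"] / s["hits"]
    return dict(stats)


stats = weapon_stats(hits)
print(stats["Banjo"])
# {'hits': 5, 'secondary': 2, 'knockouts': 3, 'ko_rate': 0.6}
```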
Your community manager says, “We care about knowing the effectiveness of each player, how they did in each game, who their favorite targets are, and what their favorite weapon is.”
Hmm. So your designer wants depersonalized weapon data, and your community manager wants personal information. Keep in mind, though, that we’re trying to figure out rules for how to compress old data. So you need to follow up with the question: “When is data old enough that you don’t care about it anymore? When is data old enough that you would settle for a summary rather than exact data?”
Your lead designer says that she doesn’t need any stats older than six months. And the community manager says that they need all data for all time, so there’s a continuous leaderboard record.
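Those two answers amount to a retention rule, which could be sketched roughly like this. The 183-day approximation of “six months” and the `split_by_age` helper are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Assumption: "six months" is approximated as 183 days.
DESIGNER_WINDOW = timedelta(days=183)


def split_by_age(events, now):
    """Split events into (recent, old) around the designer's cutoff.

    recent: still inside the designer's window, so keep full detail.
    old:    only the community manager still cares, so a per-player
            summary is enough.
    Each event is a tuple whose first element is its timestamp.
    """
    cutoff = now - DESIGNER_WINDOW
    recent = [e for e in events if e[0] >= cutoff]
    old = [e for e in events if e[0] < cutoff]
    return recent, old


events = [
    (datetime(2006, 7, 12, 10, 28, 0), "Tyrone", "Banjo"),
    (datetime(2006, 7, 12, 12, 7, 4), "Osbie", "Pie"),
]

# A few weeks later, both events are still inside the designer's window.
recent, old = split_by_age(events, now=datetime(2006, 8, 1))
```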
At this point, you have a solution. Since your designer doesn’t care about anything older than six months, you don’t have to preserve weapon effectiveness when you’re aggregating old data. You just need what the community manager asked for: player effectiveness, game performance, favored targets, and favored weapons. So you might end up with something like this:
| Game | Player | FavTarget | FavWeapon | Knockouts Received | Knockouts Inflicted |
|---|---|---|---|---|---|
| 1 | Tyrone | Osbie | Banjo | 1 | 1 |
| 1 | Katje | Roger | Banjo | 2 | 1 |
| 1 | Roger | Jessica | Harmonica | 1 | 1 |
| 1 | Jessica | Tyrone | Harmonica | 0 | 1 |
| 2 | Tyrone | Osbie | Banjo | 0 | 0 |
| 2 | Katje | Tyrone | Pie | 1 | 1 |
| 2 | Roger | Katje | Banjo | 1 | 1 |
| 2 | Jessica | Pirate | Banana | 0 | 1 |
| 2 | Osbie | Roger | Pie | 0 | 0 |
| 2 | Pirate | Tyrone | Banana | 1 | 0 |
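For the curious, here is one way a per-game, per-player summary like this could be computed from the raw hits. It is a sketch, not the actual pipeline: the `aggregate` function and the output field names are hypothetical, and I am assuming ties for favorite target or weapon go to whichever was seen first.

```python
from collections import Counter, defaultdict

# (game, player, target, weapon, knockout) from the raw hit table.
events = [
    (1, "Tyrone", "Osbie", "Banjo", 0),
    (1, "Katje", "Roger", "Banjo", 1),
    (1, "Tyrone", "Katje", "Harmonica", 1),
    (1, "Roger", "Jessica", "Harmonica", 0),
    (1, "Roger", "Katje", "Banjo", 1),
    (1, "Jessica", "Tyrone", "Harmonica", 1),
    (2, "Katje", "Tyrone", "Pie", 0),
    (2, "Pirate", "Tyrone", "Banana", 0),
    (2, "Tyrone", "Osbie", "Banjo", 0),
    (2, "Pirate", "Pirate", "Harmonica", 0),
    (2, "Roger", "Katje", "Banjo", 1),
    (2, "Jessica", "Pirate", "Banana", 1),
    (2, "Katje", "Roger", "Toilet Bowl", 1),
    (2, "Osbie", "Roger", "Pie", 0),
]


def aggregate(events):
    """Summarize hits into one row per (game, player) who landed a hit."""
    targets = defaultdict(Counter)  # (game, player) -> Counter of targets
    weapons = defaultdict(Counter)  # (game, player) -> Counter of weapons
    inflicted = Counter()           # (game, player) -> knockouts dealt
    received = Counter()            # (game, target) -> knockouts taken
    for game, player, target, weapon, knockout in events:
        targets[(game, player)][target] += 1
        weapons[(game, player)][weapon] += 1
        inflicted[(game, player)] += knockout
        received[(game, target)] += knockout
    rows = {}
    for key in targets:
        rows[key] = {
            # most_common breaks ties in first-seen order (Python 3.7+).
            "fav_target": targets[key].most_common(1)[0][0],
            "fav_weapon": weapons[key].most_common(1)[0][0],
            "ko_received": received[key],
            "ko_inflicted": inflicted[key],
        }
    return rows


rows = aggregate(events)
print(rows[(1, "Katje")])
# {'fav_target': 'Roger', 'fav_weapon': 'Banjo', 'ko_received': 2, 'ko_inflicted': 1}
```

Someone who only ever got hit in a game (and never hit anyone back) gets no row in this version, since rows are keyed on the player doing the hitting.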
Granted, this doesn’t look like much compression, but that’s because I didn’t include many data points in the original table. If there were thousands of events per game, you would still end up with an aggregate table for that game with just one row per player (essentially, game and player are what the GROUP BY clause in our SQL would group on).
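As a rough illustration of that grouping, here is a sketch using Python’s built-in `sqlite3` module and an in-memory database. The table name `hits` and its column names are invented for this example, and only the first game’s events are loaded to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (game INTEGER, player TEXT, target TEXT,"
    " weapon TEXT, secondary INTEGER, knockout INTEGER)"
)
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Tyrone", "Osbie", "Banjo", 0, 0),
        (1, "Katje", "Roger", "Banjo", 0, 1),
        (1, "Tyrone", "Katje", "Harmonica", 0, 1),
        (1, "Roger", "Jessica", "Harmonica", 0, 0),
        (1, "Roger", "Katje", "Banjo", 1, 1),
        (1, "Jessica", "Tyrone", "Harmonica", 0, 1),
    ],
)

# However many hit events a game has, grouping on (game, player)
# yields one output row per player per game.
rows = conn.execute(
    "SELECT game, player, COUNT(*) AS hit_count,"
    " SUM(knockout) AS knockouts_inflicted"
    " FROM hits GROUP BY game, player ORDER BY game, player"
).fetchall()

for row in rows:
    print(row)
```

Six events go in and four rows come out here. Favorite target and favorite weapon take more work in plain SQL, so they are left out of this sketch.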
The important thing to remember when you’re aggregating metrics is that you have to talk to the customers of the metrics and determine the best compromise to fit all their needs.