Oct 1, 09:45 AM
So I, along with Todd Northcutt of IGN, submitted a joint talk to the 2009 Game Developers Conference. They had a new submission process this year: short submission for a first cut, and then a long submission to make second cut.
I was pretty sure that we would at least make first cut. This is because our presentation was going to be chock full of real data that IGN has collected on top-tier multiplayer games like Command & Conquer 3 and Unreal Tournament 3. IGN did the collecting, and earlier this year, Orbus Gameworks did a bunch of analysis on their data.
As part of the submission, I even sent along some sample data. “Hey, look guys! This is juicy stuff! You know, the sort of real data everyone wishes they could be privy to but nobody ever shows in public? We will show that data.”
Oh yeah, and I give talks at many CMP conferences, including GDC, and I consistently score in the 4.7 out of 5.0 range: a good 0.8 points above the average. So they know I’m a good speaker, and I even took the time to point that out in the application.
Real data + AAA games + consistently good speaker = awesome. Right?
Apparently not awesome enough for GDC.
Perhaps we were rejected because CMP knows that IGN will pony up for a sponsored session. But I made it really clear in the submission that this talk was not going to be about IGN technology, but player behavior.
Maybe we were rejected because someone else had some even cooler metrics to show. I’d be okay with that.
But honestly I have no idea what the actual guidelines are. In related news, Adam Martin has some excellent suggestions for improving conferences, including ways that conference organizers might be able to provide feedback on rejected talks without spending too many man-hours on it.
— Darius Kazemi
Aug 26, 01:07 PM
This will be at least a two-part series, as there’s a lot of material I have to cover, and I’m a little busy getting ready for PAX this weekend.
I had an email discussion the other day with Ken Harward, formerly Lead Programmer and then Studio Director at what was Ritual Entertainment, in Dallas, TX. The discussion began because I saw a comment in a post on Corvus Elrod’s blog talking about how SiN Episodes had extensive metrics tracking for dynamic difficulty adjustment. I said, “Hey, Ken was a lead on that game, I’ll ask him.” Yes, it is nice to know lots of awesome people in the game industry.
Ken’s response was lengthy, and in reading it, it seemed obvious that the metrics work that they did on SiN Episodes was very important, yet unsung. Although they gave a talk on it at GDC, you don’t hear them mentioned in the same breath as Bungie or Valve when it comes to FPS metrics. This probably has to do with SiN Episodes being a pretty low-selling obscure title, even though it was one of the first original third-party games to be released on Steam.
So I wanted to dedicate some space here on the blog to putting the story of the amazing work they did up on the Internet, for all to see. Most of this is me reposting what Ken already wrote in an email. So many thanks to him.
Ken worked on the dynamic difficulty (hereafter abbreviated DD) system for SiN Episodes (hereafter abbreviated SiN) with the help of Aaron Cole. Ken says, “I honestly believe we wrote for Sin Episodes the most sophisticated dynamic measuring system ever, at least for a shipping game.”
The system consisted of four parts: Statistics, Advisors, the DecisionMaker, and Gameplay Variables.
Statistics
The raw metrics collection was pretty simple. The engine for the game was event-driven to begin with, so they created a system where data files would contain the names of messages to look for, along with aggregation instructions. For example, there might be “a ShotgunNPCDamageTakenPerSecond statistic in a data file. The data file would say, hey, every time there is a ShotgunNPC damage event, accumulate it, and make it a ratio against time elapsed.”
The game ended up collecting hundreds of statistics this way, stored locally on the player’s computer.
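To make the data-driven idea concrete, here is a minimal Python sketch of a statistic that is defined as data and fed by events. Nothing here is Ritual's actual code or file format: `StatDefinition`, `StatTracker`, the `"per_second"` aggregation name, and the `ShotgunNPCDamage` event name are all hypothetical, chosen only to mirror Ken's ShotgunNPCDamageTakenPerSecond example.

```python
from dataclasses import dataclass


@dataclass
class StatDefinition:
    """One entry of a hypothetical statistics data file."""
    name: str         # statistic name, e.g. "ShotgunNPCDamageTakenPerSecond"
    event: str        # name of the engine message to listen for
    aggregation: str  # "sum" or "per_second" (assumed instruction names)


class StatTracker:
    """Accumulates event amounts according to the loaded definitions."""

    def __init__(self, definitions):
        self.definitions = {d.name: d for d in definitions}
        self.totals = {d.name: 0.0 for d in definitions}
        self.elapsed = 0.0  # seconds of play time seen so far

    def tick(self, seconds):
        self.elapsed += seconds

    def on_event(self, event, amount):
        # Every statistic listening for this event accumulates the amount.
        for d in self.definitions.values():
            if d.event == event:
                self.totals[d.name] += amount

    def value(self, name):
        total = self.totals[name]
        if self.definitions[name].aggregation == "per_second":
            # Ratio against time elapsed, as in Ken's description.
            return total / self.elapsed if self.elapsed > 0 else 0.0
        return total


tracker = StatTracker([
    StatDefinition("ShotgunNPCDamageTakenPerSecond", "ShotgunNPCDamage", "per_second"),
])
tracker.on_event("ShotgunNPCDamage", 30.0)
tracker.tick(10.0)
tracker.on_event("ShotgunNPCDamage", 20.0)
tracker.tick(10.0)
print(tracker.value("ShotgunNPCDamageTakenPerSecond"))  # 50 damage over 20 s -> 2.5
```

The point of the design is that adding a new statistic means adding a line of data, not new code, which is presumably how they got to hundreds of them.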
Advisors
In the DD system, Advisors were agents that monitored statistics. Advisors would be assigned statistics and then given a target range to shoot for. “A PlayerHealth Advisor had some goal range that the player’s health should be in, for example. Each Advisor monitors many statistics (not just one) and as the player’s health gets out of range (for example, they’re not being damaged enough) then the Advisor makes recommendations on how to fix this.” The target ranges were actually chosen by players at the start of the game! There were two sliders: how much challenge the player wants, and how much help the player wants (in testing, some players were positively allergic to the idea of the game actually helping them, so they included this feature).
Again, the Advisor monitored many statistics to make recommendations about its key statistic. For example, if the player’s health was too high, it did not simply advise bullets to do more damage. “That would be lame,” says Harward, “and that’s what usually dynamic difficulty systems do. Instead, the Advisor had many recommendations for the situation. He might recommend stronger AI to come out, more AI to come out, the AI to hide better, the AI to throw more grenades, etc.”
There were lots of Advisors monitoring lots of statistics, with the goal of making recommendations to keep the statistics in line with their target areas.
Harward says of the Advisors that “most dynamic difficulty systems that I’ve researched were really built to make the game harder. Ours was built specifically so that my mother could finish the game. I promised myself that I would build a game that my mother could finish. She’s never played a first-person shooter in her life. But the game would figure that out, and would continually scale things down. Eventually it would get to the point, if she were bad enough, that the enemies would do 0 damage. But, as soon as she started to live a bit too long, then the enemies would start to do a little damage.”
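Here is a rough Python sketch of what an Advisor along these lines might look like. It is a simplification under stated assumptions: the real Advisors monitored many statistics, while this one watches a single key statistic; "mood" is modeled as a number where 0 means content and larger means unhappier; and the target range and the `weaker_ai` / `fewer_ai` recommendations are made up for illustration. Only the too-high recommendations come from Ken's description.

```python
from dataclasses import dataclass, field


@dataclass
class Advisor:
    """Watches one key statistic and recommends changes when it leaves its range."""
    name: str
    key_stat: str
    target_low: float
    target_high: float
    when_too_high: list = field(default_factory=list)
    when_too_low: list = field(default_factory=list)

    def mood(self, stats):
        # Assumed convention: 0.0 = in range, larger = further out of range.
        value = stats[self.key_stat]
        if value > self.target_high:
            return value - self.target_high
        if value < self.target_low:
            return self.target_low - value
        return 0.0

    def recommendations(self, stats):
        value = stats[self.key_stat]
        if value > self.target_high:
            return list(self.when_too_high)
        if value < self.target_low:
            return list(self.when_too_low)
        return []  # content Advisors recommend nothing


health = Advisor(
    name="PlayerHealth",
    key_stat="player_health",
    target_low=40.0,   # hypothetical range; in SiN the sliders set this
    target_high=70.0,
    when_too_high=["stronger_ai", "more_ai", "ai_hides_better", "more_grenades"],
    when_too_low=["weaker_ai", "fewer_ai"],  # hypothetical
)

stats = {"player_health": 95.0}
print(health.mood(stats))             # 25.0
print(health.recommendations(stats))  # the four "too high" recommendations
```

Note that the Advisor never touches the game directly. It only reports a mood and a list of suggestions, which leaves the choice to something else.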
DecisionMaker
The DecisionMaker was a singular entity that would poll the Advisors, normally every 2 minutes, about their mood. They’d report how they felt, and give recommendations. Each Advisor had many different recommendations, based on how they perceived the game experience. These recommendations could conflict (one Advisor wanting the enemies to hide more, another Advisor wanting them to charge in). The DecisionMaker would pick two recommendations, out of all the possible recommendations, weighted by the Advisor’s mood and each recommendation’s success rate. The two recommendations would cause two Gameplay Variables to be adjusted, “and the player would react/respond, and the stats might change, and the Advisors might adjust, and presto you have a complete feedback loop. Because there was such a variety of recommendations, even as developers we didn’t know exactly what would happen if we started to play better.”
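The weighted pick could be sketched in Python roughly as follows. This is my guess at one way to do it, not Ritual's algorithm: I am assuming the weight is simply mood times success rate, that mood is a non-negative number where 0 means a content Advisor, and that the two picks must be distinct. The function name and the recommendation names are hypothetical.

```python
import random


def pick_recommendations(candidates, rng, count=2):
    """Pick up to `count` distinct recommendations by weighted random choice.

    candidates: list of (recommendation, advisor_mood, success_rate) tuples.
    The weight of each candidate is mood * success_rate (an assumption).
    """
    pool = [(rec, mood * rate) for rec, mood, rate in candidates if mood * rate > 0]
    picked = []
    while pool and len(picked) < count:
        weights = [weight for _, weight in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picked.append(pool.pop(index)[0])  # remove so it cannot be picked twice
    return picked


candidates = [
    ("more_grenades",   0.8, 0.6),
    ("ai_hides_better", 0.8, 0.3),
    ("ai_charges_in",   0.4, 0.5),  # conflicts with hiding; both may be offered
    ("more_ai",         0.0, 0.9),  # content Advisor: weight 0, never picked
]
picks = pick_recommendations(candidates, random.Random(0))
print(picks)
```

Because the choice is random rather than "always take the top two," conflicting recommendations just compete for a slot, and the outcome stays unpredictable in the way Ken describes.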
Gameplay Variables
The Gameplay Variables stored the current settings representing all the different ways in which the game could change. Throughout the code, there were references to these many gameplay variables. So as these variables adjusted, up and down, the game was changing likewise. At any given moment, your game would have a unique value for these many variables, and that would represent you. In other words, it is highly unlikely that any two players ever had the exact same value for all the variables at any given point in time. The variables were like a unique DNA that changed, 2 at a time, every 2 minutes.
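As a small illustration, a store of Gameplay Variables might look something like this in Python. The variable names, the step size, and the clamping limits are all assumptions of mine; the only thing taken from Ken's description is that the variables move up and down and that game code reads them.

```python
class GameplayVariables:
    """Holds the current tuning values that game code reads from."""

    def __init__(self, defaults, step=0.25, low=0.0, high=2.0):
        self.values = dict(defaults)
        self.step = step  # hypothetical size of one adjustment
        self.low = low    # hypothetical lower clamp
        self.high = high  # hypothetical upper clamp

    def adjust(self, name, direction):
        """Nudge one variable up (+1) or down (-1), clamped to [low, high]."""
        new_value = self.values[name] + direction * self.step
        self.values[name] = min(self.high, max(self.low, new_value))
        return self.values[name]

    def get(self, name):
        return self.values[name]


variables = GameplayVariables({"enemy_damage_scale": 1.0, "grenade_frequency": 1.0})

# Ten downward adjustments in a row bottom out at 0 damage, not below it.
for _ in range(10):
    variables.adjust("enemy_damage_scale", -1)

# Game code would read the variable wherever damage is applied.
base_damage = 12.0
print(base_damage * variables.get("enemy_damage_scale"))  # 0.0
```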
Measuring Success
Many DD systems just push around numbers until they meet certain targets, but often the numbers can be right while the desired overall effect is wrong. “It was important to me to be able to say ‘is this working.’ There’s no point in having the AI throwing grenades if it doesn’t accomplish what you want.” Their system would remember when an Advisor’s recommendation was implemented. “If the Advisor became happier after that recommendation was picked, it was assumed that the recommendation may have helped,” and the recommendation would be weighted a bit more heavily towards being picked again. According to Harward, over time, the system would converge on a point where it really did know what the helpful recommendations were.
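The bookkeeping for that could be as simple as the following Python sketch. It assumes mood is tracked as an "unhappiness" number (lower is happier) and that a helpful recommendation gets a fixed additive boost; the function name, the boost size, and the choice to leave the weight alone when a recommendation did not help are all my assumptions.

```python
def reinforce(weights, recommendation, unhappiness_before, unhappiness_after, boost=0.25):
    """Raise a recommendation's weight if its Advisor got happier after it ran.

    weights: dict mapping recommendation name -> current selection weight.
    Recommendations that did not help are left unchanged (an assumption).
    """
    if unhappiness_after < unhappiness_before:
        weights[recommendation] = weights.get(recommendation, 1.0) + boost
    return weights


weights = {"more_grenades": 1.0, "ai_hides_better": 1.0}

# Grenades were picked, and the Advisor was happier at the next poll.
reinforce(weights, "more_grenades", unhappiness_before=25.0, unhappiness_after=10.0)

# Hiding was picked, but the Advisor got unhappier, so nothing changes.
reinforce(weights, "ai_hides_better", unhappiness_before=10.0, unhappiness_after=15.0)

print(weights)  # {'more_grenades': 1.25, 'ai_hides_better': 1.0}
```

Repeated over many polls, the recommendations that tend to precede a happier Advisor drift toward higher weights, which is the convergence Ken is talking about.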
According to Harward, “enough simple things together generated fairly robust results.”
For the next installment, we’ll be looking into some of the metrics collection that Ritual did, along with some of Ken’s conclusions from that. We’ll also look a bit at the players’ reception of the dynamic difficulty system.
— Darius Kazemi
Aug 4, 01:13 AM
GameShadow has announced a new service called GameShadow metrics, which is another in a genre of what, off the top of my head, I would call consumer behavior metrics. This is the kind of service that measures how much time people spend playing certain games. So you can say, “People who like BioShock also play a lot of Civilization 4 but never play racing games” or what have you, in addition to the more obvious “how much time are people spending on game X” metrics.
There are a bunch of these services. Gamestrata comes to mind as one, although they do not publish their own stats: they actually use feeds from XBLA, IGN/Gamespy, etc., then they aggregate these stats into a web package for consumers. Xfire, the IM/chat service for gamers, publishes statistics on popular games as played by their users. Gamestrata is partnered with a few companies; I’m not sure if Xfire actually sells their data to game companies or publishers. Steam has consumer behavior metrics as well.
I stress that these are consumer behavior metrics because all that they measure is consumption, rather than, say, engagement. They are measuring (a) how many people are playing the game, and (b) for how many hours a week. I would say those two metrics are more on the consumption side than the engagement side because engagement actually says something about how the game is being played, not just how long it has been running. I could have Battlefield 1942 open in the background for 3 hours while making a sandwich, watching a movie, and waiting for my buddies to get home so we can fire up a game. That would show up in one of these consumer behavior metrics (CBM) services as 3 hours played, when really it doesn’t tell us all that much about play.
Engagement metrics for video games are things like
- amount of time spent in different activities in the game (versus in a menu screen)
- average number of matches played in a week in a multiplayer deathmatch game
- total items traded, by volume, value, and type, in an MMORPG
- time spent grouped vs ungrouped in an MMORPG
- time spent using modded content in a game (this points to extended shelf life for a title)
These metrics are incredibly difficult to build a universal plug-and-play type of application for. It’s relatively easy to measure what processes are running right now on the OS and keep track of how long they’ve been running. But to understand what a player is doing in game requires partnership with the game developer to actually put hooks into source code and extract the meaty data.
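To show the difference those hooks make, here is a small Python sketch that turns a hypothetical in-game event log into time spent per activity. The log format (a timestamp in seconds plus an activity name) and the function name are invented for illustration; a real game would emit whatever its developers chose to instrument.

```python
from collections import defaultdict


def time_per_activity(events, session_end):
    """Sum seconds spent in each activity.

    events: list of (timestamp_seconds, activity) sorted by time, where each
    entry marks the moment the player entered that activity.
    """
    totals = defaultdict(float)
    boundaries = events[1:] + [(session_end, None)]
    for (start, activity), (end, _) in zip(events, boundaries):
        totals[activity] += end - start
    return dict(totals)


# A 3-hour session (10800 s) that mostly sat in the menu.
events = [
    (0,    "menu"),
    (600,  "match"),
    (1500, "menu"),
    (9000, "match"),
]
totals = time_per_activity(events, session_end=10800)
print(totals)  # {'menu': 8100.0, 'match': 2700.0}
```

A process-level tracker would report this session as 3 hours played. The in-game log says 45 minutes of matches and over two hours of menu screen.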
— Darius Kazemi