In addition to the statistics we’re all generally familiar with, a baseball game is packed with data we don’t generally observe: defensive data, batted-ball data, non-pitching velocity data, and so on. You may or may not know that this data is being collected all the while by radars and cameras all around each Major League ballpark, often for proprietary use by individual teams or by business enterprises with an eye toward licensing whatever information they can cull.
But now, by way of MLB Advanced Media (the media face of the MLB teams), a great deal of that data is going to be available for the public to consume. This is the baseball nerd equivalent of being invited into Willy Wonka’s factory.
You can, and should, read about the plan for the new system here. A snippet:
The goal is to revolutionize the way people evaluate baseball, by presenting for the first time the tools that connect all actions that happen on a field to determine how they work together. This new datastream will enable the industry to understand the whole play on the field — batting, pitching, fielding and baserunning — and enable new metrics for evaluation by clubs, scouts, players and fans.
For instance, on a brilliant, game-saving diving catch by an outfielder, this new system will let us understand what created that outcome. Was it the quickness of his first step, his acceleration? Was it his initial positioning? What if the pitcher had thrown a different pitch? Everything will be connected for the first time, providing a tool for answers to questions like this and more ….
Claudio Silva, PhD and professor of computer science and engineering at NYU Polytechnic School of Engineering, said the biggest challenge was to ensure that the data received reflects actual game play. He said anyone who watches baseball, from club to player to fan, will see a new baseball world that is “completely unexpected.”
“One of the things we had to do to be certain that was the case was to design a whole validation scheme, where we recorded our own video, and designed algorithms that would independently generate some of the metrics to be compared to the data that we were getting out of the vendors,” Silva said. “It’s really very complex algorithms that are going into making this thing work, into the validation process, and actually eventually into all the analysis that people are going to be doing on the metrics.
“One of the goals of what we wanted to achieve was to virtually recreate the game using the geometric data. This actually turns out not to be straightforward. So let’s say you want to compute a player’s speed, you want to compute a ball’s speed. We could actually take the 3D data and match it to a verbal description of the game. This was a very exciting finding for us. You can imagine, it’s kind of like a two-way street. You can use the experts’ opinions to then generate information. You can even imagine other forms of storytelling about a season of a team.”
He called this “a completely new data stream” and added, “To be one of the first few people to have the luxury of looking at the new datastream was a true privilege. I believe that this data is so rich, there are so many interesting things we can do with it, we’re going to be able to comb through this data and find layers and layers of features that we never could see before.”
The system will be operational at Miller Park, Citi Field, and Target Field for 2014, and will come on line at all other parks throughout the year, with a goal of having it working everywhere by Opening Day 2015. The data stream could take a year or so to be fully realized.
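To make Silva’s “compute a player’s speed” example a little more concrete, here is a rough sketch of the basic idea: take timestamped 3D positions and divide distance traveled by elapsed time. Everything about the input is an assumption for illustration. The sample layout (seconds, then x/y/z in feet) and the function name `speeds_from_track` are made up, and none of it reflects the actual format of the MLBAM data stream, which hasn’t been published.

```python
import math

# Conversion factor: feet per second to miles per hour.
FT_PER_S_TO_MPH = 3600 / 5280


def speeds_from_track(samples):
    """Return the average speed (ft/s) between each pair of consecutive samples.

    `samples` is a hypothetical track: a list of (t_seconds, x_ft, y_ft, z_ft)
    tuples sorted by time. This layout is assumed, not taken from any real feed.
    """
    speeds = []
    for (t0, *p0), (t1, *p1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Straight-line distance between the two positions, divided by elapsed time.
        speeds.append(math.dist(p0, p1) / dt)
    return speeds


# Made-up track of an outfielder covering 5 feet every half second.
track = [(0.0, 0.0, 0.0, 0.0), (0.5, 3.0, 4.0, 0.0), (1.0, 6.0, 8.0, 0.0)]
for ft_per_s in speeds_from_track(track):
    print(f"{ft_per_s:.1f} ft/s = {ft_per_s * FT_PER_S_TO_MPH:.1f} mph")
```

Real tracking data would be noisier than this, so some smoothing would presumably be needed before the numbers mean much, which may be part of why Silva says the problem is not straightforward.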
The implications of being able to break down everything that happens in a baseball game at this level of detail are enormous, particularly when it comes to defensive metrics and evaluation. In time, we will almost certainly find that the system offers much more than that. We’ve already seen what this kind of non-baseball-game-outcome statistical data (anyone got a better way to describe that?) can do when it comes to evaluating pitchers, because we’ve had PITCHf/x data available for a while now.
And now folks will debate whether we’re doing harm to our favorite game by focusing on data rather than just watching, and enjoying, the play on the field.
(For my part, I like to enjoy the game both ways – in fact, that’s why baseball remains my favorite sport.)