Analytics Have Taken Over Baseball. What’s Next?
It’s April and baseball season is beginning. More than one author has asserted that if poet Alfred Lord Tennyson had known American habits, his famous line would have been changed to, “It is spring and a young man’s fancy turns to baseball.”
But I won’t comment on poetry or love. That leaves me with baseball. Baseball is where I first learned about data, and now Big Data. The transformation of baseball (and football) to analytic strategies is the story of Big Data: it includes many of the same thought processes and successes, but mostly it portrays the promise of more concrete results still to come.
So, lots of data is great. Well, maybe.
Think about its use in baseball. I grew up a baseball fan. My Cub Scout pack took us to both the Brooklyn Dodgers at minuscule Ebbets Field and the New York Yankees in the majestic stadium. My parents, both born in Germany, had no idea why one would concentrate on a sport when the announcer would simply summarize each half inning as “no runs, no hits, no errors.”
Ah, but when something happens, the stadium fans erupt. Youngsters like me would learn batting averages and earned run averages. After a while, I came to reason that perhaps they invented baseball just for those who loved to analyze numbers. That was before I even knew what a statistician was. Was this what Abner Doubleday had in mind?
Baseball has even been used as a vehicle for teaching elementary statistics. I think the most common example is Jim Albert’s 2003 text, Teaching Statistics Using Baseball.
But today, the statisticians have indeed taken over baseball … at least to an extent. Let’s look at what Big Data did for baseball. Of course, the oft-cited breakthrough occurred with the Oakland A’s, as documented in the 2003 book Moneyball by Michael Lewis. (I own an autographed copy!) For those of you who are not baseball fans, the basic premise had to do with how a low-budget team like the A’s could do so much better than free spenders such as the Yankees.
The answer was found in data analysis. What did the A’s, with their talented assistant general manager, Paul DePodesta, figure out? They learned that on-base percentage and slugging percentage were much better indicators of baseball success (in terms of winning games) than hits, doubles, triples, stolen bases, and home runs.
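For readers who want the formulas behind those two metrics, here is a minimal Python sketch. The formulas are the standard ones; the function names and the sample stat line are my own, purely for illustration:

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF): how often a batter reaches base."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)

def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """SLG = total bases / AB: weights extra-base hits more heavily than singles."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

# A made-up stat line: 150 hits (100 singles, 30 doubles, 5 triples, 15 homers),
# 60 walks, 5 hit-by-pitch, 5 sacrifice flies, in 500 at-bats.
obp = on_base_percentage(150, 60, 5, 500, 5)    # about .377
slg = slugging_percentage(100, 30, 5, 15, 500)  # .470
```

The point of the A’s insight was that these two numbers, not batting average or stolen bases, best predicted runs and wins.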
And, when I say DePodesta was talented, it was in statistical analysis. He never came close to playing professional baseball. The concepts really took off when Moneyball was made into a movie. Probably every statistician wanted to be Jonah Hill, who played the consummate data analyst. Interestingly, after a few baseball front office positions, DePodesta is now applying his analytic talents for the Cleveland Browns football team.
And where is baseball analytics today? In February, I was fortunate enough to attend a sports analysis discussion, sponsored by the ASA’s Washington DC Chapter, the Washington Statistical Society. The baseball half of the program featured Kevin Tennenbaum of the Baltimore Orioles. Kevin discussed evaluating player talent. He captivated the audience for about an hour just talking about the defensive rating of an outfielder.
The field is divided into about 100 segments, and the outfielder is judged on every fielding opportunity. This includes figuring out, in each segment, whether the fielder could have, should have, might have, or actually did catch the ball. Then you add up all the possibilities and can rate one fielder against another. And this is just for defense, not even considering the player’s hitting talent. Further, this was just the “publicly available information.” Kevin didn’t discuss the proprietary analyses the Orioles do.
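To make the idea concrete, here is a toy Python sketch of that kind of segment-by-segment rating. This is my own illustration under a simplifying assumption (compare each play’s outcome to a league-average catch probability for that segment), not the Orioles’ actual method, and the numbers are invented:

```python
def outs_above_average(opportunities, league_catch_prob):
    """Sum, over every fielding opportunity, of (caught? 1 : 0) minus the
    league-average probability of catching a ball hit to that segment.
    Positive values mean the fielder converts more chances than average."""
    return sum((1.0 if caught else 0.0) - league_catch_prob[segment]
               for segment, caught in opportunities)

# Made-up league averages for two of the ~100 field segments.
league_catch_prob = {"shallow-left": 0.90, "deep-center": 0.20}

# Three opportunities for one outfielder: the segment and whether he caught it.
plays = [("shallow-left", True), ("deep-center", True), ("deep-center", False)]
rating = outs_above_average(plays, league_catch_prob)  # 0.1 + 0.8 - 0.2 = 0.7
```

The easy catch earns almost no credit, the spectacular one earns nearly a full out, and the miss costs only a little, which is what lets you compare fielders fairly across different mixes of chances.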
But how do such advanced analyses get integrated into baseball operations? Kevin cited my favorite word, “communication,” as the key: explaining all this analysis, or some useful segment of it, so a decision can be made. Exactly the concern I expressed in my column last month.
I find this so analogous to the Big Data situation facing statisticians. A few pearls are found initially, such as the on-base percentage. Then mountains of data form a deluge that may or may not lead to a substantive improvement in the situation. But analysis proceeds to search for the next pearl.
By the way, many of the criticisms of Big Data, such as the origin and validity of data, as well as the sampling techniques, don’t even enter here. This isn’t sampling. All the data are real, accurate, and readily available. The basic question of what to do with all that data indeed remains.
The other half of the sports analysis discussion featured Daniel Stern presenting an equally fascinating analysis of football. Daniel is employed by the Baltimore Ravens football team. In contrast to Kevin, whose work concentrates on player evaluation in baseball, Daniel’s work involves the actual plays in football. Yup, what to do at second down and seven yards to go. As you can imagine, many variables go into this analysis. But it may not be what you think. I would have thought they would look for plays that would either get a first down or, perhaps, a touchdown. Well, that’s important, but the actual measure they use is expected points, and in a most sophisticated way.
In fact, consider the team contemplating a field goal from a certain yard line. If the kicker has probability p of making the field goal, I would have guessed that the expected value of the “go for the field goal” decision would be 3p. Well, that is so 20th century. Apparently, they look one step further. Since the opposition gets the ball either way after a field goal attempt, you deduct a certain expected value of the opposition’s potential score on the next series of downs.
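As a back-of-the-envelope illustration of that one-step-further idea, with made-up numbers and certainly not the Ravens’ model, the calculation looks something like this in Python:

```python
def field_goal_expected_value(p_make, opp_ev_after_make, opp_ev_after_miss):
    """Net expected points of attempting a field goal.
    Either way the opponent gets the ball next, so we subtract their expected
    points on the ensuing possession; a miss usually hands them better field
    position, so opp_ev_after_miss is typically the larger deduction."""
    return (p_make * (3 - opp_ev_after_make)
            + (1 - p_make) * (0 - opp_ev_after_miss))

# Hypothetical: an 80% kicker; the opponent is expected to score 0.5 points
# after a make (kickoff, touchback) but 1.5 after a miss (better field position).
ev = field_goal_expected_value(0.80, 0.5, 1.5)  # 0.8*2.5 - 0.2*1.5 = 1.7
```

The “so 20th century” answer of 3p would have said 2.4 expected points; looking one possession ahead shrinks that to 1.7, which can flip a close go-for-it decision.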
Perhaps even more intriguing is that the teams are not allowed to use computers in the stadium. And the coach has 20 seconds to call the next play. So, what do you do with all this information? I guess you summarize it somehow, and then, as Kevin would say, it comes back to communication. Daniel clearly was not permitted to divulge the Ravens’ proprietary strategies.
So, is more technology coming to the rescue, or merely adding to the melee? Just before the Super Bowl, Matthew Futterman of the Wall Street Journal reported the following:
Each player in Sunday’s Super Bowl will have a computer chip attached to his shoulder pad that tracks his every movement, part of a two-year-old program that opens the league and its fans to a whole new world of statistical possibility.
An even bigger mystery for league officials, coaches, and dataheads is what to do with this new trove of information. The idea was that, by tracking the speed, location, and movement of players, teams and fans could create metrics that would reveal who moved fastest in key situations, covered the most ground on defense, or found the open areas in a zone defense most often.
I have this vision that, much like the person who sneezes at an auction and ends up with an expensive Monet he never wanted, some football player on defense will twitch and a Super Bowl will be lost. Those of you who follow me on Twitter (@StatisticsBarry) saw my tweet that this situation was lots of data seeking a useful statistic.
Now, what about other sports? I am quite sure basketball has analytic programs to replace all those white boards coaches use for describing plays. I guess things have become more sophisticated since the 1994 NBA draft, in which Jason Kidd was drafted out of the University of California by the Dallas Mavericks. Answering reporters about fitting into his new team, Kidd announced, “We’re going to turn this team around 360 degrees.” Hmmm. I would guess he was not put in charge of advanced analytics.
Sports is a great test field for Big Data, but we can go too far. I think the statistics world can learn and teach a great deal about the uses, abuses, and absurdities of too much data. But hope springs eternal, and the next breakthrough may be just around the corner.