Beware Overfitting Models Even If They Win Baseball Bets

It was the middle of Game 6 of the World Series and the Kansas City Royals were clobbering the San Francisco Giants, when Melanie Winograd made a bold prediction.

“According to our stats, we think the Giants are going to take” the series, she wrote in an e-mail while the Royals were en route to a 10-0 rout of the Giants.

Of course, today, with the Giants making plans for their second World Series parade in three years, that prediction looks pretty smart. And it raises the question: what were these magic stats Winograd was using? Was it some sort of statistical voodoo involving Pecota (player empirical comparison and optimization test algorithm) or FIP (fielding independent pitching) or any of the myriad “Moneyball”-type stats used by the wonkiest wonks of our national pastime?

Not at all. Winograd’s company, the human resources firm Impact Group, simply looked at the economic conditions of the two cities and noted that the past three World Series were won by teams from the town with the lower unemployment rate, or an improving rate.
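To see why a three-for-three record is thin evidence, here is a rough back-of-the-envelope sketch. It assumes, as a deliberate simplification, that an unrelated indicator matches each Series winner with the odds of a coin flip, and that someone screens several such indicators before settling on one. The function name and the indicator counts are illustrative only; nothing here describes how Impact Group actually arrived at its rule.

```python
def chance_of_spurious_streak(n_indicators: int, streak_length: int = 3) -> float:
    """Probability that at least one of n unrelated coin-flip indicators
    'calls' every one of streak_length outcomes purely by luck."""
    p_single = 0.5 ** streak_length  # one indicator gets every call right
    return 1.0 - (1.0 - p_single) ** n_indicators


if __name__ == "__main__":
    # Illustrative counts of indicators someone might have screened.
    for n in (1, 10, 50):
        print(f"{n:>3} indicators screened: {chance_of_spurious_streak(n):.1%}")
```

Even a single coin-flip indicator goes three for three one time in eight, and the odds of finding at least one "perfect" indicator climb quickly as more candidates are screened.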

Whether it be the baseball field or the stock market, powerful computers these days are spitting out endless batches of statistical analysis that at times lead us to believe we’ve deciphered the numerical Rosetta stone that will guarantee past returns ARE indicative of future results. Thank God and Bill Gates for modern computers, right? Well, not so fast.

“Results from simulations run on patternless data indicate that computing power makes it easy to be fooled by randomness,” is how CXO Advisory Group succinctly summed up a recent paper from researchers at Lawrence Berkeley National Laboratory and elsewhere.

‘Backtest Overfitting’

The study looked at the phenomenon known as “backtest overfitting.” In the quant world, a “backtest” is the use of historical data to gauge the performance of a trading strategy. “Overfitting” means tuning a strategy so closely to the historical sample that it captures that sample’s random noise rather than any lasting pattern -- so the way the strategy performed in your historical data will not necessarily be replicated in the future.

(If you need another baseball reference to get you through this part, think of the “backtest overfitting” that the Phillies engaged in when they signed Ryan Howard to a five-year, $125 million contract extension in 2010.)

Here’s how the researchers put it: “It is a relatively simple matter for a present-day computer system to explore thousands, millions or even billions of variations of a proposed strategy, and pick the best performing variant as the ‘optimal’ strategy ‘in sample,’” they wrote. “Unfortunately, such an ‘optimal’ strategy often performs very poorly ‘out of sample’ (i.e. on another dataset), because the parameters of the investment strategy have been overfit to the in-sample data.”
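The effect the researchers describe can be illustrated with a small simulation. This is a minimal sketch, not the study's code. It assumes patternless daily returns drawn from a normal distribution and a hypothetical "variant" that simply goes long or short at random each day, so no variant has any real edge. The Sharpe ratio here is the plain mean divided by the standard deviation of daily returns, not annualized.

```python
import random
import statistics


def sharpe(returns):
    """Mean over standard deviation of per-period returns (not annualized)."""
    return statistics.fmean(returns) / statistics.stdev(returns)


def best_in_sample_sharpe(n_variants, n_days, rng):
    """Best Sharpe ratio among n_variants strategies that have no real edge."""
    # Assumption: patternless market returns, 1% daily volatility.
    market = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    best = float("-inf")
    for _ in range(n_variants):
        # Hypothetical variant: long (+1) or short (-1) at random each day.
        pnl = [rng.choice((1, -1)) * r for r in market]
        best = max(best, sharpe(pnl))
    return best


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be repeated
    for n in (1, 10, 100, 1000):
        print(n, "variants tried, best in-sample Sharpe:",
              round(best_in_sample_sharpe(n, 250, rng), 3))
```

Because every variant is pure noise, any improvement in the best score as more variants are tried reflects the search itself rather than skill, which is the selection effect the quote warns about.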

‘Optimal’ Fail

The Berkeley Lab researchers created an online simulator to demonstrate the concept and they encourage tinkering with it. The tool develops the “optimal” variant of a simple strategy based on random dates in the past, then tests it on a second random time series.

“In most runs using our online tool, the ‘optimal’ strategy derived from the first time series performs poorly on the second time series, demonstrating how hard it is not to overfit a backtest,” the study said.
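The two-series procedure can be imitated in a few lines. The sketch below is not the Berkeley Lab tool. The calendar rule, the parameter grid, the 22-day "month" and the series length are all assumptions chosen for illustration. Each run picks the best rule on one random series, scores that same rule on a second random series, and the script counts how often the second score falls short of the first.

```python
import math
import random

MONTH = 22  # assumed number of trading days per "month"


def sharpe(returns):
    """Mean over standard deviation of per-period returns (not annualized)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    return mean / math.sqrt(var) if var > 0 else 0.0


def rule_returns(series, entry_day, holding, side):
    """Hypothetical rule: each month, hold `side` (+1 long, -1 short) for up to
    `holding` days starting on `entry_day` (cut off at month end); else flat."""
    return [side * r if entry_day <= t % MONTH < entry_day + holding else 0.0
            for t, r in enumerate(series)]


def random_series(n_days, rng):
    """Patternless daily returns (assumed normal, 1% volatility)."""
    return [rng.gauss(0.0, 0.01) for _ in range(n_days)]


def one_run(rng, n_days=MONTH * 12):
    """Pick the best rule on one random series, then score it on a second."""
    first, second = random_series(n_days, rng), random_series(n_days, rng)
    grid = [(d, h, s) for d in range(MONTH) for h in (1, 3, 5, 10) for s in (1, -1)]
    best = max(grid, key=lambda p: sharpe(rule_returns(first, *p)))
    return sharpe(rule_returns(first, *best)), sharpe(rule_returns(second, *best))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run can be repeated
    runs = [one_run(rng) for _ in range(20)]
    worse = sum(out < inn for inn, out in runs)
    print(f"out-of-sample Sharpe below in-sample in {worse} of {len(runs)} runs")
```

Since both series are noise, the rule chosen on the first has no reason to work on the second. Any gap between the two scores comes from the selection step alone.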

So the lesson, which probably goes without saying, is that the next time you see an impressive quant study, think really hard about it before basing any decisions on it. And don’t bet on the Cubs to win it all next year: the Chicago area has a 6.1 percent unemployment rate. Plus, the Cubs stink.