Beware of Data Mining
I keep promising to stop writing about lessons from the election that are applicable to markets, and then I keep finding more examples. So rather than make any promises I cannot keep, let’s just jump right into this.
Since Donald Trump’s surprise victory -- though it wasn’t a surprise to those of you with the power of hindsight -- there have been numerous after-the-fact explanations for why Trump beat Hillary Clinton. Many appear to be delightful exercises in data mining, the finding of “historical patterns that are driven by random, not real, relationships.” Add to this the assumption that these explanations are durable and will repeat in the future, and you have the makings of a terrible investment process.
Consider the various claims as to what the key to the election was:
- Local health outcomes predict Trumpward swings (The Economist)
- Education, not income, predicted who would vote for Trump (FiveThirtyEight)
- Two economic variables perfectly predict election results (Statistical Ideas)
- Clinton won 64 percent of America's economic activity versus Trump's 36 percent (Washington Post)
- Clinton won the cities, Trump won the suburbs (New York Times)
None of these elements “predicted” anything. Each was the result of an analysis of what had already occurred. After the election, the data was sifted, a cutoff was located in each data set that separated the places where Trump voters outnumbered Clinton voters from the places where they did not, and a conclusion was reached.
This is classic data mining, and it should never be relied upon to make future forecasts.
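To see how easily a sifting exercise like this turns up a “key” variable, here is a minimal sketch in Python. Everything in it is made up for illustration: the 50 regions, the 200 candidate variables and the function names are assumptions, and every number is random noise with no real link to any vote. The search still comes back with a variable that looks meaningfully correlated with the outcome.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_regions=50, n_variables=200, seed=0):
    """Search pure-noise variables for the one that best 'explains' the vote.

    Hypothetical setup: both the vote swing and every candidate variable
    are independent random draws, so any correlation found is luck.
    """
    rng = random.Random(seed)
    vote_swing = [rng.gauss(0, 1) for _ in range(n_regions)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_regions)]
        r = pearson(candidate, vote_swing)
        if abs(r) > abs(best):
            best = r  # keep the strongest correlation seen so far
    return best


if __name__ == "__main__":
    r = best_spurious_correlation()
    print(f"strongest correlation found in pure noise: {r:+.2f}")
```

The more variables that get tried, the stronger the best accidental fit tends to be, which is why a striking after-the-fact correlation says little on its own.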
Salil Mehta, former TARP director of analytics and author of "Statistics Topics," has been critical of pollsters’ election forecasts. He spent much of the time before the election lecturing them that their models were underestimating the possibility of a Trump victory. In an e-mail exchange, he observed:
There is an increased craving to slice and dice the recent election data, particularly given that the major pollsters have been shamed as they all immensely errored in projecting this year’s election’s victor. All gave President-elect Trump <15% a faux probability of winning. The risk of now retorting with data-mining this single election result is that they often miss an analysis of the predictive errors in this unique match-up (e.g., record high undecideds on Election eve), don’t take into account budding geospatial patterns to validate evidence, and in most case none of this should deceptively be promoted as an election forecasting model.
Correlations are very different from what is required to create a reliable model that correctly forecasts a future election or investing outcomes. Rather than mine data, Mehta suggests instead we engage in hypothesis testing.
The obvious parallel to investing is the myriad back-tested strategies, many of which engage in the same sort of data mining as the recent election post-mortems. They seem to work perfectly on past data, but they are less robust than they appear. Models that inform us of what has already happened but not what might occur in the future are of limited value.
Cliff Asness of AQR warns us not to confuse factor investing with data mining. He notes that Fama-French factors such as value, momentum and size have all been tested out of sample and proven to be robust. Out-of-sample testing could verify whether an election model's back test is valid: Take the five data claims above, then apply them to Obama versus McCain or Bush versus Gore to see if they are at all predictive. The same is true for investing models. To avoid poorly constructed models that are form-fitted to past experience, test them on data sets other than the ones used to build them.
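The out-of-sample idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anyone's actual model: the data is pure noise, the function names are hypothetical, and a held-out half of the regions stands in for “a different election.” The best variable is mined from the first half only, then scored on the half it never saw.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mine_then_validate(rng, n_regions=100, n_variables=100):
    """Pick the best-fitting noise variable in-sample, then test it out of sample."""
    outcome = [rng.gauss(0, 1) for _ in range(n_regions)]
    variables = [
        [rng.gauss(0, 1) for _ in range(n_regions)] for _ in range(n_variables)
    ]
    half = n_regions // 2
    # Data mining step: choose the variable that fits the first half best.
    best = max(variables, key=lambda v: abs(pearson(v[:half], outcome[:half])))
    fit_in = abs(pearson(best[:half], outcome[:half]))
    # Validation step: score the same variable on regions it was not fitted to.
    fit_out = abs(pearson(best[half:], outcome[half:]))
    return fit_in, fit_out


def average_fit(trials=50, seed=1):
    """Average in-sample and out-of-sample fit over repeated experiments."""
    rng = random.Random(seed)
    results = [mine_then_validate(rng) for _ in range(trials)]
    mean_in = sum(r[0] for r in results) / trials
    mean_out = sum(r[1] for r in results) / trials
    return mean_in, mean_out


if __name__ == "__main__":
    mean_in, mean_out = average_fit()
    print(f"average in-sample fit:     {mean_in:.2f}")
    print(f"average out-of-sample fit: {mean_out:.2f}")
```

Because the mined variable was chosen for how well it fit the first half, its in-sample score is flattering by construction. On the held-out half it is just another noise series, so the fit typically shrinks toward zero. A factor with a real relationship behind it would be expected to hold up better on the unseen data.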
If a gold mine is a hole in the ground with a liar standing on top of it, a successful data miner is a quant with a data set lying to himself. You probably have never seen a sales pitch that didn’t have a back test “proving” market-beating returns. If only you had a time machine to go back to the period of time covered by the data set.
Investing after the fact is easy. Investors should be cautious when presented with results that only tell them what just happened, not what is about to occur.
Lots of “multicollinearities” -- economic inequality, poor health, low educational attainment -- may be associated with Trump voters, but they are not likely to forecast the next election. For example, higher education (and therefore better health and possibly higher income) might present a proclivity toward voting red or blue, but as Mehta points out, not all college degrees are created equal. Some generate much greater potential future incomes than others (“nonhomogeneous”).
This column does not necessarily reflect the opinion of the editorial board or Bloomberg LP and its owners.
To contact the author of this story:
Barry Ritholtz at firstname.lastname@example.org
To contact the editor responsible for this story:
Brooke Sample at email@example.com