The future of prediction: Building a smart consensus of forecasters

Armed with the tools and techniques of data science, today’s professional forecasters would seem to hold significant advantages over their predecessors of prior decades, and certainly over the so-called wisdom of crowds. Yet studies have shown that there are times when the crowd still tends to get it right, and, in turbulent times, the power of prediction matters more than ever.

Measurement matters too, and Bloomberg’s Quant Research team has developed a method for scoring forecasters along several dimensions, evaluating them not only on their accuracy, but also on the timing, direction, and boldness of their predictions. The outcome of the analysis also feeds a form of smart consensus, giving recognition to those who consistently lean in the right direction over time.

To determine the quality of predictions, it is essential to start with clean data. The problem is that any data set will contain edge cases, such as outliers, as well as data points that are simply bad: a fat-finger error or a mistake in units (billions vs. millions), for example. While human annotators can flag some bad data, the collections involved are far too vast for people to catch every instance. It is therefore paramount to have robust statistical methods that can flag errors in an automated way.
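
One standard family of such methods relies on robust statistics, which are not themselves distorted by the bad points they are meant to catch. Below is a minimal sketch of the idea (the function name and threshold are illustrative assumptions, not Bloomberg’s model): flag any point whose deviation from the median, measured in median-absolute-deviation units, is implausibly large.

```python
import numpy as np

def robust_flags(values, threshold=5.0):
    """Flag points whose robust z-score exceeds a threshold.

    Uses the median and the median absolute deviation (MAD), which,
    unlike the mean and standard deviation, are not distorted by the
    very outliers we are trying to detect.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # 1.4826 scales the MAD to match the standard deviation under normality.
    mad = 1.4826 * np.median(np.abs(values - median))
    if mad == 0:
        return np.zeros(len(values), dtype=bool)
    return np.abs(values - median) / mad > threshold

# A forecast entered in millions instead of billions stands out immediately.
forecasts = [2.1, 2.3, 1.9, 2.2, 2150.0, 2.0]
print(robust_flags(forecasts))  # [False False False False  True False]
```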

Some apparent errors are actually outliers; they may make for messy data sets, but they are not wrong. Alternatively, a regime change in the market may lead to a significant shift in the data. Again, these data points are legitimate and, even though they do not match up well with the old regime, should not be flagged as errors. A situation like this unfolded in March 2020 with the outbreak of the coronavirus, when many analysts dropped their revenue forecasts for various industries.
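
One simple way to respect regime changes is to judge each point against a trailing window rather than the full history. The sketch below is illustrative (the window length and threshold are assumptions): a one-off error stands out against any recent window, while a persistent shift stops being flagged once the window fills with new-regime data.

```python
import numpy as np

def rolling_robust_flags(values, window=20, threshold=5.0):
    """Flag each point against a trailing window, not the full history.

    A one-off fat-finger error stands out against any recent window,
    while a persistent regime shift is flagged only briefly: once the
    window fills with new-regime data, the reference level adapts and
    legitimate post-shift points pass.
    """
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(window, len(values)):
        ref = values[i - window:i]
        median = np.median(ref)
        mad = 1.4826 * np.median(np.abs(ref - median))
        if mad > 0:
            flags[i] = abs(values[i] - median) / mad > threshold
    return flags
```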

Other data, however, may be truly erroneous. The two main types of errors are: 1) errors in sign (+/-) and 2) errors in scale (magnitude). There are solutions, but manipulating data sets also brings risks: calibration is noisy, and while models can be calibrated precisely, they can also be overfit, complicating the analysis and raising the potential for false positives and false negatives.
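
To make the two error types concrete, here is a hypothetical classifier (the function and its tolerances are assumptions, not Bloomberg’s logic) that tests a suspect value against the peer consensus: a sign error is plausible when flipping the sign lands near consensus, a scale error when the value is off by roughly a power of ten.

```python
import numpy as np

def classify_error(value, peer_values):
    """Heuristically classify a suspect forecast as a 'sign' or 'scale'
    error relative to the peer consensus; returns None if neither fits.
    Tolerances here are illustrative, not calibrated values."""
    peers = np.asarray(peer_values, dtype=float)
    consensus = np.median(peers)
    if consensus == 0 or value == 0:
        return None
    # Sign error: flipping the sign lands the value close to consensus.
    if np.sign(value) != np.sign(consensus):
        if abs(-value - consensus) < 0.25 * abs(consensus):
            return "sign"
    # Scale error: off by roughly a power of ten (e.g., millions entered
    # as billions), so the log10 ratio sits near a nonzero integer.
    ratio = np.log10(abs(value) / abs(consensus))
    if round(ratio) != 0 and abs(ratio - round(ratio)) < 0.15:
        return "scale"
    return None

print(classify_error(-4.1, [4.0, 4.2, 3.9]))    # sign
print(classify_error(4000.0, [4.0, 4.2, 3.9]))  # scale
```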

A typical machine learning framework requires labeled data (aka “ground truth”) for training, and it is particularly challenging to employ methods that ensure both efficiency and consistency in gathering that truth. As Bloomberg’s Arun Verma notes, “Typically, for the data we are working with, only 0.1% or less will be erroneous. This means that we should source the ground truth very selectively to obtain labels only for the points that are either errors with high confidence, or the edge cases that can help precisely fine-tune the classification boundaries of the model.” He continues, “There is also potential for confusion in the truth itself; different types of experts may see and label errors differently. Therefore, we have to ask: Is there truly an error here? And if so, what kind of error is it? The machine learning algorithm must perform robustly given all of these considerations, and it should also avoid the trap of overfitting while also being interpretable and transparent.”
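
In spirit, this selective sourcing resembles uncertainty sampling from active learning. One illustrative way to implement it, assuming a baseline model that outputs error probabilities (the thresholds below are assumptions):

```python
import numpy as np

def select_for_labeling(error_probs, hi=0.9, band=0.1):
    """Choose which points to send to human annotators: high-confidence
    error candidates plus edge cases near the 0.5 decision boundary.
    Everything else is left unlabeled."""
    p = np.asarray(error_probs, dtype=float)
    confident_errors = p >= hi
    boundary_cases = np.abs(p - 0.5) <= band
    return np.flatnonzero(confident_errors | boundary_cases)

# Six candidate points scored by a baseline model.
probs = [0.02, 0.55, 0.97, 0.48, 0.10, 0.93]
print(select_for_labeling(probs))  # [1 2 3 5]
```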

In the Bloomberg project, a simple baseline model first generates tentative error flags; the researchers then request ground truth only for the flagged instances and for selected non-flagged instances near the baseline algorithm’s classification boundary. Once the “truth” is received, the model is fine-tuned to optimize precision and recall, and it generates the final flags for error correction or remediation.
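
The fine-tuning step can be as simple as re-choosing the flagging threshold to balance precision and recall on the newly labeled points. A minimal sketch, again assuming probabilistic error scores (the F-beta criterion is an illustrative choice, not necessarily the metric Bloomberg optimizes):

```python
import numpy as np

def tune_threshold(probs, labels, beta=1.0):
    """Pick the flagging threshold that maximizes the F-beta score
    (the harmonic mean of precision and recall when beta=1) on the
    labeled subset returned by the annotators."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_t, best_f = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 19):
        flagged = probs >= t
        tp = np.sum(flagged & labels)
        if tp == 0:
            continue
        precision = tp / np.sum(flagged)
        recall = tp / np.sum(labels)
        f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        if f > best_f:
            best_t, best_f = t, f
    return best_t
```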

Then, all of the non-erroneous analyst forecasts are scored against what actually happened. Here the work takes a unique turn: while some might say that accuracy is the most salient attribute of a forecast, others might assert that factors like timing, directionality, consistency, and independence are also deeply important. Directionality, in particular, is an interesting property in the context of financial markets: if an analyst is consistently right on direction, this will affect profitability across a range of market environments. Likewise, if an analyst tends to take a contrarian position and is consistently right in differentiating from the crowd, that deserves closer attention and greater credit. These considerations highlight the importance of going beyond basic accuracy when judging the value of a set of forecasts over time.
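
A toy scoring function makes this concrete. The composite below is hypothetical (the weights and component definitions are assumptions, not the Bloomberg scoring model); it blends directional hit rate, accuracy, and credit for correct contrarianism:

```python
import numpy as np

def score_analyst(forecasts, consensus, actuals, prior_actuals):
    """Composite score blending three of the attributes discussed above.

    - direction: fraction of periods where the forecast called the move
      (up or down from the prior actual) correctly
    - accuracy: mean absolute error mapped into a 0-1 score
    - contrarian: fraction of periods where the analyst deviated from
      consensus and landed closer to the truth than consensus did
    """
    f, c = np.asarray(forecasts, float), np.asarray(consensus, float)
    a, p = np.asarray(actuals, float), np.asarray(prior_actuals, float)
    direction = np.mean(np.sign(f - p) == np.sign(a - p))
    accuracy = 1.0 / (1.0 + np.mean(np.abs(f - a)))
    contrarian = np.mean((np.abs(c - a) > np.abs(f - a)) & (np.abs(f - c) > 0))
    return 0.4 * direction + 0.4 * accuracy + 0.2 * contrarian
```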

The final step is an aggregation of forecasts and forecasters to determine not only who has been best in a specific quarter (Q3 2020, for example), but over longer horizons as well. Returning to the idea of a smart consensus: by applying a consistent set of principles and uniform scoring methods, the aggregate model gives higher weight to forecasters who are consistent over time, and the analysis can be extended to score analysts across instruments, periods, sectors, and geographies. Is a particular forecaster great with FX or commodities? Insightful on events in Europe or Asia? The results show performance clearly, with all forecasts normalized and graded on a bell curve.
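
A minimal sketch of such an aggregation, under assumed definitions (treating consistency as mean past score minus its volatility, and grading on a bell curve via a rank-to-normal mapping; neither is the production methodology):

```python
import numpy as np
from scipy.stats import norm, rankdata

def smart_consensus(current_forecasts, past_scores):
    """Weight each forecaster's current forecast by past consistency,
    defined here as mean historical score penalized by its volatility.
    past_scores has shape (n_forecasters, n_periods)."""
    f = np.asarray(current_forecasts, dtype=float)
    s = np.asarray(past_scores, dtype=float)
    consistency = np.clip(s.mean(axis=1) - s.std(axis=1), 0, None)
    if consistency.sum() == 0:
        return f.mean()  # fall back to the plain, equal-weighted consensus
    return np.dot(consistency / consistency.sum(), f)

def bell_curve_grades(scores):
    """Normalize raw scores onto a bell curve: convert ranks to
    percentiles, then map through the inverse normal CDF."""
    pct = (rankdata(scores) - 0.5) / len(scores)
    return norm.ppf(pct)
```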

Given the volume of data and the analytical techniques now available, the days of mysterious predictions with crystal balls are over. As Will Rogers once said, “Good judgment comes from experience, and a lot of that comes from bad judgment.” Hopefully, with skillful analysis of data on past predictions, we can separate the wheat from the chaff more quickly.

Get in touch with Bloomberg to learn more about the Bloomberg consensus methodology.
