Explaining the Bloomberg News 2020 Election Turnout Model

The reporting of counted ballots in the 2020 election will feature several differences from prior years. Most states made changes to how voters can cast ballots in 2020 to account for the Covid-19 pandemic, particularly allowing for more early voting, either in person or by mail. Those changes have led to record numbers of early votes cast in this election.

News organizations have for decades looked to the percentage of precincts that have reported results as a guide for how much of the vote is in and what remains to be counted. But this measure often doesn’t capture ballots that aren’t cast at a polling location on Election Day. And with tens of millions of voters choosing to mail their ballots or vote early this year, the share of precincts reporting will not be as reliable a gauge of vote-counting progress for most states.

So instead, Bloomberg News created a voter turnout forecasting model that estimates how many votes are expected in this election, and we will use those estimates to understand where we are in the process of counting ballots on election night and the days after.
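As a rough illustration of the idea, counting progress can be expressed as ballots counted so far divided by the forecast total. The function name and the figures below are hypothetical and only meant to show the arithmetic, not how the newsroom's systems actually compute it.

```python
def share_of_expected_vote(votes_counted: int, forecast_votes: float) -> float:
    """Return ballots counted so far as a fraction of the forecast total."""
    if forecast_votes <= 0:
        raise ValueError("forecast_votes must be positive")
    return votes_counted / forecast_votes


# Hypothetical county: 45,000 ballots counted against a forecast of 60,000.
progress = share_of_expected_vote(45_000, 60_000)
print(f"{progress:.0%} of expected vote counted")
```

Unlike a precinct count, this ratio does not depend on where or how a ballot was cast, though it can exceed 100% if turnout beats the forecast.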

Overview

Our 2020 turnout forecasts are generated by a series of predictive models built on historical turnout patterns. These models estimate county-level turnout rates (as a proportion of each county’s citizen voting-age population) in the 2004, 2008, 2012 and 2016 elections based on each county’s demographic characteristics, its voting rules, its turnout rates in past elections, and the competitiveness of the races on the ballot that year. By applying those models to equivalent data for the 2020 election, we are able to generate forecasts of how many votes we expect to be cast in each county, congressional district and state, both for the presidential election and for downballot races. By looking at the variation in turnout rates across elections as well as the accuracy of the models at predicting past turnout, we can also estimate the uncertainty in our predictions and how they would change under different overall levels of voter turnout.
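To make the step from rates to vote counts concrete, here is a minimal sketch that multiplies each county's predicted turnout rate by its citizen voting-age population and sums the results by state. The county records, the field order and the function name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (state, county, citizen voting-age population, predicted rate)
counties = [
    ("PA", "County A", 100_000, 0.60),
    ("PA", "County B", 50_000, 0.70),
    ("OH", "County C", 80_000, 0.55),
]


def forecast_votes_by_state(rows):
    """Convert county turnout rates to expected votes and total them by state."""
    totals = defaultdict(float)
    for state, _county, cvap, rate in rows:
        totals[state] += cvap * rate
    return dict(totals)


state_forecasts = forecast_votes_by_state(counties)
print(state_forecasts)
```

The same roll-up would apply to congressional districts, assuming each county (or county portion) can be assigned to a district.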

Where our data comes from

These forecasts incorporate data from a wide variety of sources. The main sources we use to build our models are:

Historical election returns from the MIT Election Data and Science Lab and Ballotpedia. These archives include state-level election returns for presidential, Senate and gubernatorial races, district-level House election returns and county-level presidential election returns.
Demographic and geographic data from the US Census Bureau. The Census Bureau’s American Community Survey and the Current Population Survey datasets provide information on the number of people in each area and their demographics, including characteristics such as age, education and income which are highly correlated with voter turnout.
Current and historical data on electoral competitiveness from publications such as FiveThirtyEight and Ballotpedia. FiveThirtyEight’s polling archive gives us survey-based indicators of electoral competitiveness across presidential, Senate and gubernatorial races, while Ballotpedia’s compilation of candidate information and publicly released district ratings gives us similar information for House races.
Current and historical voting rules from the National Conference of State Legislatures, the Sentencing Project, the Pew Research Center and other sources. These data sources were used to compile state-by-state data for each election on voting rules and policies related to early voting, absentee voting, voter ID requirements, voter registration, felon disenfranchisement laws and similar characteristics. Where necessary, these sources were validated and supplemented by referring to the original sources (such as states’ own election administration websites) and contemporaneous reporting on the rules in place in each election.

These data sources were combined, analyzed and processed into a form suitable for building predictive models, as described below.

How we build our models

The process of building our forecasts starts by developing models of presidential turnout in each county during the 2004, 2008, 2012 and 2016 elections. We use each county’s turnout rate (the total number of presidential votes cast, divided by the number of citizens of voting age in that county) as our outcome, and model turnout using three different modeling algorithms: random forest regression, extra-trees regression and gradient-boosted decision trees. These three algorithms each have different strengths and weaknesses and handle different types of data in different ways, so we average their predictions together in order to form an ensemble prediction that is more accurate than any individual model can provide on its own. (The improved performance of the ensemble over individual models for this use case was validated during the development process.)
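The outcome variable and the averaging step can be sketched in a few lines. The three lists below stand in for per-county predictions from already-fitted random forest, extra-trees and gradient-boosting models. The numbers are invented, and equal weighting is an assumption based on the description of a simple average.

```python
from statistics import mean


def turnout_rate(presidential_votes: int, cvap: int) -> float:
    """Presidential votes cast divided by citizens of voting age."""
    return presidential_votes / cvap


def ensemble_prediction(model_predictions):
    """Average per-county predictions across models (equal weights assumed)."""
    return [mean(county_preds) for county_preds in zip(*model_predictions)]


# Hypothetical predicted turnout rates for three counties from each algorithm
random_forest = [0.60, 0.52, 0.71]
extra_trees = [0.62, 0.50, 0.69]
gradient_boosting = [0.61, 0.54, 0.70]

ensemble = ensemble_prediction([random_forest, extra_trees, gradient_boosting])
print([round(p, 3) for p in ensemble])
```

Averaging tends to help when the individual models make errors that are not perfectly correlated, since some of those errors cancel out.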

In looking at the impact of each of the input features on the overall prediction accuracy (using a technique called permutation feature importance), we see that the most important features in the model are ones we would expect to find based on decades of scholarly research into voter turnout patterns. First, historical turnout in each county is the best predictor of future turnout, which aligns with the well-established finding that voting is habitual. Beyond that, we see that counties with higher education levels, higher incomes and higher rates of home ownership have higher turnout on average, while urban areas and those with more mobile residents tend to see lower turnout. Voting rules also play a part in our predictions, with access to early and no-excuse absentee voting correlated with higher turnout and strict voter ID laws correlated with lower turnout, but these factors have only a minor impact compared to the characteristics of the individuals in each county. Finally, electoral competitiveness is also a major factor: As we would expect, states with close presidential contests tend to see significantly higher turnout, but we also see that close races for Senate and governor likely bring voters to the polls as well.
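Permutation feature importance works by shuffling one input column at a time and measuring how much the model's error rises. The sketch below is a generic, simplified version of that technique, not the code used for these forecasts. The toy model, its weights and the two feature names are hypothetical.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average rise in error when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild each row with only this one column replaced
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy stand-in model: leans heavily on past turnout, lightly on college share
def toy_model(row):
    past_turnout, college_share = row
    return 0.9 * past_turnout + 0.1 * college_share


X = [[0.50, 0.20], [0.60, 0.35], [0.70, 0.25], [0.55, 0.40], [0.65, 0.30]]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

In this toy setup, shuffling past turnout damages the predictions far more than shuffling college share, which is the pattern one would read as "past turnout matters most."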

To check the accuracy of this modeling approach, we ran a series of four validation tests, each of which held one year’s data out of our models’ training data and then used it to evaluate the performance of models built on the other three years. These tests were meant to approximate the uncertainty involved in predicting turnout for 2020, and showed that our models were able to account for between 68% and 84% of the variation in county-level turnout rates each year. In general, the model was very good at distinguishing high- and low-turnout counties from one another, and most of the remaining variation stemmed from changes in the overall turnout level from one election to the next. These results were used to create the uncertainty estimates we describe in the following section.
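The hold-one-year-out procedure can be sketched as a loop over election years, scoring each held-out year by the share of variation explained (R-squared). Everything concrete here is an assumption for illustration: the toy data, the single feature (previous turnout) and the trivial "average change" model that stands in for the real ensemble.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Share of variation in y_true accounted for by y_pred."""
    avg = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - avg) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def leave_one_year_out(data, fit, predict):
    """data maps year -> list of (feature, turnout_rate) pairs."""
    scores = {}
    for held_out in data:
        train = [pair for year, rows in data.items() if year != held_out for pair in rows]
        model = fit(train)
        y_test = [y for _, y in data[held_out]]
        y_pred = [predict(model, x) for x, _ in data[held_out]]
        scores[held_out] = r_squared(y_test, y_pred)
    return scores


# Toy model: previous turnout plus the average change seen in the training years
def fit(train):
    return mean(y - x for x, y in train)


def predict(avg_change, previous_turnout):
    return previous_turnout + avg_change


# Hypothetical (previous turnout, actual turnout) pairs for three counties per year
data = {
    2004: [(0.55, 0.60), (0.65, 0.68), (0.45, 0.50)],
    2008: [(0.60, 0.63), (0.68, 0.70), (0.50, 0.54)],
    2012: [(0.63, 0.58), (0.70, 0.66), (0.54, 0.49)],
    2016: [(0.58, 0.60), (0.66, 0.67), (0.49, 0.52)],
}
scores = leave_one_year_out(data, fit, predict)
print(scores)
```

Because the held-out year never touches the training data, each score approximates how the approach would fare on an election it has not seen.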

The final versions of our presidential turnout models were built using all four presidential election years in our training set, then predicted for the same characteristics of each county in 2020. The resulting predictions ranged from a low of 35% of eligible voters turning out to a high of 87%, but these extremes were limited to a few, mostly small counties. The overwhelming majority of counties (more than 90%) had estimated turnout rates of between 48% and 75% of voting-age citizens, with a median turnout rate across counties of 61%. Overall, our forecast models expect 62.7% of voting-age citizens (as measured in 2018, the most recent year for which county-level census estimates are available) to turn out in 2020, which is higher than 2004, 2012 and 2016 but lower than 2008.

After building our presidential turnout models and estimating total turnout for 2020, we then build a series of “drop-off” models for Senate, governor and House races. These models estimate the constituency-level (state or congressional district) turnout in these races in relation to the total presidential vote. We have a more limited sample size in this case, with 179 examples across four elections in the combined Senate and governor state-level model and 866 examples across two elections in the House model (which was limited by the availability of presidential vote data at the district level), so the models used here were simpler than those for the presidential model. These models employed a single linear regression and relied on only electoral competitiveness, voting rules and the education and income of the electorate as predictors.

The resulting models predicted that in 2020, the number of House votes would range from 86% to 100% of the presidential vote across districts, while the number of votes in races for Senate and governor would range from 88% to 100% of the presidential vote across states. In both cases, most of this variation can be attributed to the education level of the electorate, the competitiveness of the election and (for House races) the presence of candidates from both parties on the ballot.
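A drop-off model of this kind can be sketched as a regression that predicts the ratio of downballot votes to presidential votes, which is then multiplied by the presidential forecast. The version below is deliberately simplified to a single predictor (a hypothetical 0-to-1 competitiveness score), whereas the models described above use several. The training points are invented.

```python
def fit_simple_ols(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - slope * x_bar, slope


def downballot_votes(presidential_votes, competitiveness, intercept, slope):
    """Presidential forecast scaled by the predicted drop-off ratio."""
    ratio = intercept + slope * competitiveness
    return presidential_votes * ratio


# Hypothetical training data: competitiveness score vs. House-to-presidential vote ratio
competitiveness = [0.0, 0.5, 1.0]
vote_ratio = [0.90, 0.94, 0.98]

intercept, slope = fit_simple_ols(competitiveness, vote_ratio)
house_votes = downballot_votes(300_000, 0.75, intercept, slope)
print(round(house_votes))
```

Modeling the ratio rather than the raw vote count keeps the downballot forecast tied to the presidential forecast, so any adjustment to overall turnout carries through automatically.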

Why and how we include uncertainty

An old saying in the statistics world is that “all models are wrong, but some are useful.” We know that our models can’t predict the number of votes cast exactly, so we want to provide an honest estimate for readers of how much uncertainty we have about our forecasts. That way, as they see votes coming in, they can tell how close the reported numbers are to the totals we expect to see.

There are two main sources of uncertainty in our forecasts: variation across counties and states due to model uncertainty, and variation across elections in the overall level of turnout. The first of these is relatively straightforward to account for. Based on our validation tests (where we built each model on three elections and used it to predict the fourth), we know that the typical range of uncertainty for our model’s predictions at the state, county and congressional district levels is about +/-5% of our prediction. That is, if our overall turnout level prediction is basically correct, the vast majority of turnout totals (80% or more) should be within 5% of our prediction. (The predictions for downballot races have a bit more uncertainty due to the additional downballot models, but not so much that we would expect the overall range to increase substantially—at least in competitive races, the variability in drop-off rates across elections is relatively small.)
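In practice, the +/-5% figure translates into a band around each forecast. The sketch below computes that band and checks whether a reported total falls inside it. The function names and vote totals are hypothetical, and the band is treated as a simple symmetric percentage of the forecast.

```python
def prediction_range(forecast_votes: float, margin: float = 0.05):
    """Return the (low, high) band around a forecast, as a share of the forecast."""
    return forecast_votes * (1 - margin), forecast_votes * (1 + margin)


def within_range(actual_votes: float, forecast_votes: float, margin: float = 0.05) -> bool:
    """True if the reported total falls inside the forecast band."""
    low, high = prediction_range(forecast_votes, margin)
    return low <= actual_votes <= high


# Hypothetical state forecast of 200,000 votes
low, high = prediction_range(200_000)
print(round(low), round(high))
print(within_range(204_000, 200_000))
print(within_range(215_000, 200_000))
```

A final count outside the band does not by itself mean the model failed for that area. It may instead signal that overall turnout differs from the baseline assumption.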

The more difficult type of uncertainty to account for is the year-to-year variation in turnout levels. Our “baseline” forecast is based on historical patterns, so it assumes a more-or-less “typical” election year, and as such the estimated turnout rate of 62.7% is right in the middle of the four elections we used to build it. (Changes to rules around absentee and early voting give us a slightly higher prediction than we would otherwise expect, but the difference is relatively small.)

Most observers, though, don’t expect 2020 to be a “typical” year in terms of turnout, so we’re providing two additional sets of estimates, both of which are based on higher-than-normal levels of turnout. Readers should refer to these alternative estimates if the results from states that complete their counts early suggest that overall levels are more in line with these predictions than with the baseline forecast’s estimates—in that case, we would expect results from states which are still counting to be closer to the alternative forecasts’ predictions.

The first of these estimates, our “high turnout” forecast, estimates what turnout will look like if we see turnout at a level 5 percent higher than our baseline forecast (which is similar to the difference between 2008’s high turnout levels and the lower turnout levels seen in 2012 and 2016, after accounting for population growth). If we end up seeing the high rates of turnout many expect, these forecasts may end up being more accurate for most areas. The second estimate, our “historic turnout” estimate, adds an additional 5 percent to the high turnout forecasts across counties, states and districts. This level of turnout is not very likely to occur, and would be the highest turnout level in modern U.S. history. But if 2020 has taught us anything, it’s to be prepared for the unexpected, and so we’re providing these forecasts just in case.
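The three scenarios can be sketched as simple scalings of the baseline. One assumption to flag: the code treats each 5 percent step as multiplicative (so the historic scenario is the high scenario times 1.05), which is one reading of the description and not a confirmed detail. The helper that picks the scenario closest to an observed total is also hypothetical.

```python
def turnout_scenarios(baseline_votes: float, step: float = 0.05) -> dict:
    """Build baseline, high and historic forecasts (multiplicative steps assumed)."""
    high = baseline_votes * (1 + step)
    historic = high * (1 + step)
    return {"baseline": baseline_votes, "high": high, "historic": historic}


def closest_scenario(actual_votes: float, scenarios: dict) -> str:
    """Name of the scenario whose forecast is nearest the observed total."""
    return min(scenarios, key=lambda name: abs(scenarios[name] - actual_votes))


# Hypothetical state with a baseline forecast of 100,000 votes
scenarios = turnout_scenarios(100_000)
best = closest_scenario(106_000, scenarios)
print(scenarios, best)
```

Applied to states that finish counting early, a comparison like this suggests which set of forecasts to lean on for states that are still counting.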