Imputation of missing ESG data using deep latent variable models
In finance, data is often incomplete: values may be unavailable, inapplicable, or simply unreported. Unfortunately, many classical data analysis techniques, such as linear regression, cannot function when values are missing. To mitigate this problem, imputation techniques replace missing data with values consistent with the observed patterns.
Approaches to imputation range from simplistic mean imputation, which replaces a missing value with the mean of the non-missing values in that field, to more advanced statistical methods. These statistical models formulate imputation as estimating the conditional distribution of the missing values given the observed ones. Sampling methods can then be applied to that distribution to generate a set of likely candidates for the missing values. The major limitation of such classical statistical approaches is that they rely on a simple class of distributions to describe the data, so the model cannot capture complex, non-linear relationships. Recently, advances in machine learning have led to the development of deep latent variable models (DLVMs), a deep-learning approach to statistical modeling that has the capacity to capture these relationships.
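To make the contrast concrete, here is a minimal sketch of both ends of the classical spectrum: mean imputation and sampling from a fitted Gaussian conditional distribution. This is a hypothetical toy example (NumPy only, synthetic data), not code from the white paper; the Gaussian assumption is precisely the kind of simple distributional choice that limits these methods.

```python
# Toy comparison of two classical imputation strategies (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 2] += 0.8 * X[:, 0]               # inject a linear dependence
X[rng.random(500) < 0.3, 2] = np.nan   # make ~30% of column 2 missing

# 1) Mean imputation: replace NaNs with the mean of the observed values.
mean_imputed = np.where(np.isnan(X[:, 2]), np.nanmean(X[:, 2]), X[:, 2])

# 2) Gaussian conditional imputation: fit a multivariate Gaussian to the
#    complete rows, then sample each missing entry from p(x2 | x0, x1).
obs = X[~np.isnan(X[:, 2])]
mu, cov = obs.mean(axis=0), np.cov(obs, rowvar=False)
S_oo, S_mo = cov[:2, :2], cov[2, :2]
cond_var = cov[2, 2] - S_mo @ np.linalg.solve(S_oo, S_mo)
for i in np.flatnonzero(np.isnan(X[:, 2])):
    cond_mean = mu[2] + S_mo @ np.linalg.solve(S_oo, X[i, :2] - mu[:2])
    X[i, 2] = rng.normal(cond_mean, np.sqrt(cond_var))  # one sampled candidate
```

Repeating the sampling step yields the set of candidate values that multiple-imputation methods work with.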
In this paper, Bloomberg researchers show the applicability of DLVMs to imputing missing values in Bloomberg’s Environmental, Social, and Governance (ESG) dataset. In particular, they evaluate a DLVM approach that incorporates conditional and importance-weighted learning schemes into a variational autoencoder (VAE). To assess imputation performance, the researchers introduce several metrics and show that DLVMs outperform both classical imputation models and classical predictive models.
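As a rough illustration of what importance weighting buys at imputation time, the sketch below follows the general MIWAE-style recipe rather than the authors’ exact implementation: draw several latent codes for a row, weight each decoder reconstruction by its importance weight, and average. The `encoder` and `decoder` modules and the unit-variance Gaussian observation model are assumptions made for the sketch.

```python
# Sketch of importance-weighted imputation with a trained VAE (PyTorch).
# `encoder` and `decoder` are assumed, hypothetical trained modules.
import torch

def iw_impute(encoder, decoder, x, mask, K=50):
    """x: (D,) row with zeros at missing entries; mask: (D,) 1 = observed."""
    mu, logvar = encoder(x * mask)                 # q(z | observed part)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn(K, mu.shape[-1])    # K latent samples
    x_hat = decoder(z)                             # (K, D) reconstructions
    # Log importance weights: log p(x_obs|z) + log p(z) - log q(z|x_obs);
    # constants shared across the K samples are dropped.
    log_p_x = (-0.5 * (x - x_hat) ** 2 * mask).sum(-1)
    log_p_z = (-0.5 * z ** 2).sum(-1)
    log_q_z = (-0.5 * ((z - mu) / std) ** 2 - torch.log(std)).sum(-1)
    w = torch.softmax(log_p_x + log_p_z - log_q_z, dim=0)   # (K,)
    x_imputed = (w[:, None] * x_hat).sum(0)        # self-normalized average
    return torch.where(mask.bool(), x, x_imputed)
```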
Bloomberg’s ESG dataset
Bloomberg provides annual self-reported ESG data for over 11,700 global companies, with data going as far back as 2006 and spanning over 1,100 Bloomberg-sourced fields. The reporting of ESG data is often purely voluntary; the consequent low self-reporting rates make the ESG dataset an excellent candidate for imputation techniques.
DLVM for ESG
To test how well DLVMs scale to large amounts of data and to improve imputation performance for the ESG fields, the researchers utilized other datasets, including Bloomberg Industry Classification, Factory Data, and Fundamentals. This gave the models more data in which to find patterns related to ESG fields. Since the goal is to impute ESG fields, these additional fields are incorporated as conditioning variables rather than modeled as part of the joint distribution. Likewise, any columns that are never missing, such as Country of Incorporation, are used as conditioning variables.
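One plausible way to wire in such conditioning, sketched below under the assumption of a standard conditional-VAE layout (the layer sizes and the single concatenated conditioning vector are illustrative, not the paper’s architecture), is to append the never-missing fields to both the encoder and decoder inputs:

```python
# Minimal conditional-VAE skeleton in PyTorch: never-missing side information
# (e.g., an embedded Country of Incorporation code) is concatenated onto both
# the encoder and decoder inputs instead of being modeled jointly.
# Dimensions and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    def __init__(self, x_dim, cond_dim, z_dim=16, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))   # -> mu, logvar
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x, cond):
        mu, logvar = self.enc(torch.cat([x, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        x_hat = self.dec(torch.cat([z, cond], dim=-1))           # p(x | z, cond)
        return x_hat, mu, logvar
```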
Evaluation metrics
It may not be intuitively obvious how to compare the performance of one imputation strategy with another. Since the target data are missing, how do you know if you are getting closer? The researchers measured performance in several ways, including:
- Performance on downstream tasks: as our recent ESG paper showed, using imputed values in a logistic regression resulted in portfolios that beat their index’s Sharpe ratio
- Measuring the calibration error of the imputed cumulative distribution function to evaluate multiple imputation (see the sketch after this list)
- Computing the uncertainty coefficient between columns
- Using a four-pane residual distribution plot to visualize the performance across a full range of data values
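As one concrete recipe for the calibration check in the second bullet (a generic probability-integral-transform approach, not necessarily the exact metric in the white paper): for each artificially masked value, evaluate the empirical CDF of the imputation draws at the true value. A well-calibrated imputer makes these CDF values approximately uniform on [0, 1], so the deviation of their histogram from uniform is a natural error score.

```python
# Generic sketch of CDF calibration error for multiple imputation, using the
# probability integral transform; `samples` and `truth` are hypothetical inputs.
import numpy as np

def cdf_calibration_error(samples, truth, n_bins=10):
    """samples: (N, K) imputation draws per held-out value; truth: (N,)."""
    # Empirical CDF of each imputed distribution evaluated at the true value.
    pit = (samples <= truth[:, None]).mean(axis=1)             # (N,) in [0, 1]
    # Compare the PIT histogram against the uniform it should match.
    observed, _ = np.histogram(pit, bins=n_bins, range=(0.0, 1.0))
    expected = len(pit) / n_bins
    return np.abs(observed - expected).sum() / (2 * len(pit))  # in [0, 1]
```

A score near 0 indicates well-calibrated imputed distributions; scores toward 1 indicate systematic over- or under-confidence.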
Deep-learning imputation
By applying several rigorous evaluation metrics to both simple and DLVM-based imputation techniques, the research demonstrated that implementations of the classical DLVM framework, such as VAEs, perform well on imputing missing ESG data and substantially better than the simpler models. This allows the imputed values to be used across tasks, including downstream machine-learning models.
Access the white paper to learn more about how building advanced deep-learning models can help handle missing values in complex datasets.