Qualifying quality: Best practices for defining financial data quality

Data quality is a top priority for financial firms, and it has only grown in importance because of regulation and the need for better operational efficiency. Yet data quality is hard to measure in the absence of a clear definition. I am often asked by Bloomberg’s Data License clients to help them define and measure data quality. Repeating this exercise with a variety of financial firms, and focusing on it here at Bloomberg, has given me a unique perspective on the characteristics of high-quality data.

Those conversations with clients and with data practitioners across Bloomberg helped me land on the best practices for data quality laid out in this article, which the sophisticated data managers among us may appreciate.

First, the most fundamental best practice is to provide a definition of what the data should be – a definition of quality. For example, for a Bond Duration Field: the value is a Decimal, the unit of measurement is years, and it is a Macaulay Duration calculated on the official closing Bid price. This data about data is metadata, and this metadata drives the definitions used in data quality.
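
To make this concrete, here is a minimal sketch of what such a field definition might look like in code; the class and attribute names are illustrative, not Bloomberg’s actual metadata model, but they show how quality checks fall straight out of the definition:

```python
from dataclasses import dataclass
from decimal import Decimal

# Illustrative field definition; the names and structure are assumptions,
# not Bloomberg's actual metadata model.
@dataclass(frozen=True)
class FieldDefinition:
    name: str         # e.g. "Duration"
    datatype: type    # the expected value type
    unit: str         # unit of measurement
    measure: str      # what is being measured
    price_basis: str  # which price the measure is calculated on

BOND_DURATION = FieldDefinition(
    name="Duration",
    datatype=Decimal,
    unit="years",
    measure="Macaulay Duration",
    price_basis="official closing Bid",
)

def conforms(value, definition: FieldDefinition) -> bool:
    """A simple quality check derived directly from the metadata."""
    return isinstance(value, definition.datatype)

print(conforms(Decimal("7.42"), BOND_DURATION))  # True
print(conforms("7.42", BOND_DURATION))           # False: wrong datatype
```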

Second, the metadata should be readable by both people and machines. For example, the Bloomberg metadata is visible in the Bloomberg Terminal as a CSV, and in OWL as an ontology.
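
The practical payoff is that the same definitions a person reads can be consumed by a program. As a sketch, assuming a hypothetical CSV export of the field metadata with the columns used above:

```python
import csv
import io

# A hypothetical CSV export of field metadata; the column names are assumptions.
FIELDS_CSV = """\
field,datatype,unit,measure,price_basis
Duration,Decimal,years,Macaulay Duration,official closing Bid
"""

# The same text a person can open in a spreadsheet is trivially machine-readable.
for row in csv.DictReader(io.StringIO(FIELDS_CSV)):
    print(row["field"], "->", row["unit"], "/", row["measure"])
```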

The scope of the data definitions should cover Products (data as it is bought or licensed), Datasets (the individual files), Records (the rows and structures in the files), Properties (individual Fields in Records), and Enumerations (the lists of valid values that act as a ‘lookup’). My experience is that these five key concepts – Products, Datasets, Records, Properties, and Enumerations – provide enough sophistication to drive a data quality initiative.
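
One minimal way to model those five concepts in code, with illustrative names and attributes rather than any real product’s schema:

```python
from dataclasses import dataclass

# Illustrative containers for the five key concepts.
@dataclass
class Enumeration:   # a list of valid values acting as a 'lookup'
    name: str
    values: list[str]

@dataclass
class Property:      # an individual Field in a Record
    name: str
    datatype: str
    enumeration: Enumeration | None = None

@dataclass
class Record:        # a row or structure in a file
    name: str
    properties: list[Property]

@dataclass
class Dataset:       # an individual file
    name: str
    records: list[Record]

@dataclass
class Product:       # data as it is bought or licensed
    name: str
    datasets: list[Dataset]

sector = Enumeration("MarketSector", ["Govt", "Corp", "Equity"])
bond = Record("Bond", [Property("Duration", "Decimal"),
                       Property("Sector", "string", sector)])
product = Product("End-of-Day Pricing", [Dataset("bonds_20240501.csv", [bond])])
```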

Where metadata is provided, it should reuse existing, well-defined standards and conventions. This allows the reuse of existing tools and expertise. For example, within Data License we reuse the Dublin Core, Friend-Of-A-Friend (FOAF), and DBpedia metadata. For units of measure, the International System of Units is used.
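
As a sketch of what that reuse looks like in practice, here is how a dataset might be described with the Dublin Core and FOAF vocabularies using the open-source rdflib library; the URIs and values are made up:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF

# Hypothetical dataset and publisher URIs, used only to show reuse of
# existing vocabularies rather than inventing new metadata terms.
g = Graph()
dataset = URIRef("https://example.com/datasets/eod-pricing")
publisher = URIRef("https://example.com/org/data-team")

g.add((dataset, DCTERMS.title, Literal("End-of-Day Pricing")))
g.add((dataset, DCTERMS.publisher, publisher))
g.add((publisher, FOAF.name, Literal("Example Data Team")))

print(g.serialize(format="turtle"))
```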

Separating out the cultural localization from the data simplifies the quality control. This process is known as “canonicalization” – turning the data into a standard form. For example, within Data License we use the ISO 8601 standard for representing dates, so the canonical form of July 30th, 1966 is “1966-07-30Z”.
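
A minimal sketch of canonicalization, turning differently localized renderings of the same date into one ISO 8601 form (the input formats here are just examples):

```python
from datetime import datetime

# Locale-specific renderings of the same date (examples are made up).
LOCAL_FORMS = [
    ("30/07/1966", "%d/%m/%Y"),    # UK-style
    ("07/30/1966", "%m/%d/%Y"),    # US-style
    ("30 July 1966", "%d %B %Y"),  # long form
]

def canonicalize(text: str, fmt: str) -> str:
    """Turn a localized date string into its ISO 8601 canonical form."""
    return datetime.strptime(text, fmt).date().isoformat()

for text, fmt in LOCAL_FORMS:
    print(canonicalize(text, fmt))  # "1966-07-30" every time
```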

For each dataset a license should be captured. That license should cover deriving data, republishing, and sharing within firms and with their customers – all likely scenarios in financial services. Ideally, the license would be captured in machine-readable form; the Open Data Rights Statement Vocabulary is an open standard for expressing data licenses this way.
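
Even without committing to a particular vocabulary, the rights can be captured alongside the dataset metadata. A sketch, with keys that are purely illustrative rather than terms from ODRS or any other standard:

```python
import json

# Illustrative, vocabulary-agnostic license record; the keys are assumptions.
license_record = {
    "dataset": "bonds_20240501.csv",
    "license": {
        "derivation_allowed": True,     # may derived data be created?
        "republishing_allowed": False,  # may the raw data be republished?
        "sharing": ["within-firm", "firm-customers"],
        "attribution_required": True,
    },
}

print(json.dumps(license_record, indent=2))
```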

Bloomberg’s Reference Data service provides the provenance of each dataset. That provenance is machine-readable and open, because it is in the W3C’s PROV format. Provenance allows customers to derive end-to-end data lineage.
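
As a sketch of what one hop of that lineage looks like, using rdflib’s built-in PROV namespace with made-up URIs; chaining such statements across datasets yields the end-to-end picture:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import PROV, RDF

# Hypothetical URIs for a cleaned dataset and the raw feed it was derived from.
g = Graph()
cleaned = URIRef("https://example.com/data/eod-prices-cleaned-20240501")
raw = URIRef("https://example.com/data/vendor-feed-20240501")

g.add((cleaned, RDF.type, PROV.Entity))
g.add((raw, RDF.type, PROV.Entity))
g.add((cleaned, PROV.wasDerivedFrom, raw))  # one hop of the lineage chain

print(g.serialize(format="turtle"))
```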

Finally, advanced data quality uses “coefficients of similarity” to measure changes in data. For example, consider a stock price that was $30 and jumped overnight to $3,000 – a change that should be flagged for review. That data change looks suspicious: the same character, “0”, was added twice – an indication of a likely mis-key. The rest of the market and sector barely moved. The jump is unlikely given the stock’s historic volatility. There was no news, event, or corporate action on that stock to cause the price jump. Taking all these factors into account and determining that the data change is unexplained is a significantly complex calculation, requiring an awareness of how different changes affect prices. Generic data quality tools simply don’t have that financial-services domain awareness, so within Bloomberg we use our Data Utility, PolarLake. PolarLake combines advanced coefficients of similarity with built-in knowledge of financial data.
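
To illustrate the idea – and this is a toy example, not PolarLake’s actual algorithm – a check might combine a volatility test with a character-level coefficient of similarity to spot a likely mis-key:

```python
from difflib import SequenceMatcher
from statistics import stdev

def looks_like_miskey(prev_price: float, new_price: float,
                      daily_returns: list[float]) -> bool:
    """Toy check: flag a change far outside historic volatility whose printed
    digits stay nearly identical (a hint of added or dropped keystrokes)."""
    ret = new_price / prev_price - 1.0
    vol = stdev(daily_returns)
    implausible_jump = abs(ret) > 10 * vol
    # Character-level coefficient of similarity between the printed prices.
    similar_digits = SequenceMatcher(None, f"{prev_price:g}", f"{new_price:g}").ratio() > 0.6
    return implausible_jump and similar_digits

history = [0.01, -0.005, 0.012, -0.008, 0.004]   # made-up daily returns
print(looks_like_miskey(30.0, 3000.0, history))  # True: flag for review
print(looks_like_miskey(30.0, 30.6, history))    # False: an ordinary move
```

A real check also needs the market, sector, news, and corporate-action context described above, which is exactly the domain awareness the generic tools lack.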

Beyond checking individual fields and records, we also look at population changes. For example, a change of name on one stock is plausible, but a change of name on 1% of stocks overnight is implausible. For applying coefficients of similarity to entire datasets, we often use the Jaccard index. These kinds of techniques are part of the discipline of Record Linkage. We help customers by supplying the metadata they need to use Record Linkage tools and engines, but it is much faster and simpler to use the existing PolarLake service.
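
The Jaccard index itself is simple – the size of the intersection over the size of the union – which makes it easy to apply to whole columns or datasets. A sketch with invented security names:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

yesterday = {"ACME CORP", "GLOBEX INC", "INITECH LTD", "HOOLI PLC"}
today     = {"ACME CORP", "GLOBEX INC", "INITECH LTD", "HOOLI GROUP PLC"}

print(f"{jaccard(yesterday, today):.2f}")  # 0.60: one name in four changed
```

On a full universe of securities, a single plausible rename barely moves the index, while an implausible overnight shift across 1% of names shows up clearly.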

Matthew Rawlings has over 20 years’ experience of buy-side and sell-side data and technology leadership, and is currently making financial data easy to use for developers.
