Editorial November 2016

Data Vault as a Flexible Solution to Data Warehousing

by Dan Linstedt

Data Vault was developed in the 1990s in a U.S. government environment to support sophisticated requirements for enterprise data warehousing (EDW). The goal was to provide the initial client with usable information about cost allocations, timelines, and the mitigation strategies needed to streamline their business. In addition to these business requirements, the solution had to meet the highest security and privacy standards and provide full auditability.

The reason I started developing the Data Vault model was that no feasible solution in the industry could meet these and other sophisticated requirements of the client.

The industry followed a vision called the "single version of the truth." I was dealing with audit requirements for a variety of government agencies, including internal commercial lines of business. All previous efforts to provide a "single version of the truth" or a "360-degree view of the customer" had failed miserably, for one reason: they could not be audited successfully.

In other words, there was no way for these systems to prove how they arrived at the answers they produced. We had a new requirement on our hands: "single version of the facts for any specific point in time." The Data Vault solution began by separating concerns: facts (or data) from information. This changed the definition of the data warehouses we built from "analytical" to "system of record". It was the only way to construct an auditable solution.

An additional goal of the Data Vault warehouse solution was to provide the business with business process cycle times, and to report on long-running business processes, broken data sets, and the elements of the source applications that no longer matched the perception of how the business should be run. To accomplish this, the Data Vault methodology had to incorporate Key Process Areas (KPAs) and Key Performance Indicators (KPIs).

As a result, the solution began providing specific data marts as well as information marts. The data marts were in fact "error marts" used to solve gap problems (showing the dirty data and the mismatched or misaligned business processes). This is how we met the KPI side of the house. The KPAs were the business processes themselves, tracked, of course, by business keys.

But getting back to the requirements of the initial client: one thing that long plagued (and still plagues) their previous data warehousing efforts was that they inherited any and all changes from the source systems. They needed a solution that we could alter and enhance without forcing them to re-engineer any of their existing solutions. This extended not only to the Data Vault model, but also to the loading and querying processes. It was with this thought that I began to think about how the data model for the data warehouse needed to change.

Because no available data model met my expectations, I looked in an unusual direction to break out of the box: nature, and how things there evolve over time. Think about trees with their roots, trunk, branches, and leaves. Think about the human brain with its dendrites, neurons, and synapses.

The initial set included more than 50 potential patterns that had to be researched. It took me more than ten years of experimentation with these initial patterns: I created a small model based on some of them, loaded terabytes of data into it as quickly as possible, and tried to extract useful information from it just as fast. Over the course of those ten years, I arrived at the final model, consisting of only three entity types, called hubs, links, and satellites. This data model is called Data Vault.
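
The three entity types can be sketched as relational tables. The following is a minimal illustration in Python using SQLite, with a hypothetical Customer/Order example; the table and column names are my own for this sketch, not taken from any actual client solution:

```python
import sqlite3

# Minimal sketch of the three Data Vault entity types, using a
# hypothetical Customer/Order example. Table and column names are
# illustrative; hash keys and load metadata follow common practice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: the unique business key, plus load metadata. Nothing else.
CREATE TABLE hub_customer (
    customer_hkey TEXT PRIMARY KEY,     -- hash of the business key
    customer_id   TEXT NOT NULL UNIQUE, -- the business key itself
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hkey    TEXT PRIMARY KEY,
    order_id      TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Link: a relationship between hubs. No descriptive attributes.
CREATE TABLE link_customer_order (
    link_hkey     TEXT PRIMARY KEY,
    customer_hkey TEXT NOT NULL REFERENCES hub_customer(customer_hkey),
    order_hkey    TEXT NOT NULL REFERENCES hub_order(order_hkey),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Satellite: descriptive attributes, historized by load date, so the
-- warehouse keeps every version of the facts (full auditability).
CREATE TABLE sat_customer (
    customer_hkey TEXT NOT NULL REFERENCES hub_customer(customer_hkey),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hkey, load_date)
);
""")
```

The separation is what makes the model flexible: new descriptive attributes go into new satellites, and new relationships into new links, so source-system changes extend the model rather than forcing re-engineering.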

The model turned out to be the most flexible one in my research, and we had great success with it at the initial client, eventually sourcing more than 3 petabytes of data.

Sourcing such volumes of data requires more than massively parallel processing (MPP) systems, which are the foundation of today’s cloud platforms such as Amazon AWS and Microsoft Azure. In addition, I developed Data Vault further into a System of Business Intelligence with multiple components, known as the pillars of Data Vault 2.0:

The flexible data model itself is the foundation for the other pillars and is used to integrate data from multiple source systems.

An agile methodology based on Scrum, CMMI level 5, Total Quality Management (TQM), Six Sigma, and other industry best practices. Instead of waiting for months or even years to see the first reports from the data warehouse, this methodology helps us deliver new functionality on a regular basis, e.g. every three to four weeks, even from the outset. We didn’t source the 3 PB of data overnight; it was a process that evolved over time.

Implementation best practices for loading all data from all source systems, at any time, in parallel. All Data Vault 2.0 patterns are standardized, fully recoverable, and can be fully generated. This generation speeds up the implementation of the data warehouse, thus improving the agility of the project.
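
As a rough sketch of what such a standardized, recoverable pattern might look like (the function names and the in-memory "hub" are illustrative only, not the actual generated code):

```python
import hashlib

def hash_key(*business_key_parts: str) -> str:
    """Derive a deterministic surrogate key by hashing the business key.

    Hash keys let hubs, links, and satellites load in parallel with no
    sequence generators or cross-table key lookups. (MD5 is shown for
    brevity; the choice of hash function is an implementation decision.)
    """
    normalized = "||".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def load_hub(hub: dict, business_keys, load_date: str, record_source: str) -> dict:
    # Insert-only, idempotent hub load: re-running the same batch after
    # a failure inserts nothing twice, which is what makes the pattern
    # recoverable, and it lends itself naturally to code generation.
    for bk in business_keys:
        hkey = hash_key(bk)
        if hkey not in hub:
            hub[hkey] = {"business_key": bk,
                         "load_date": load_date,
                         "record_source": record_source}
    return hub
```

Because every hub, link, and satellite load follows the same template, the loading code can be generated from metadata rather than hand-written for each table.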

A reference architecture that is capable of absorbing data from all source systems of the enterprise: relational source systems (often loaded overnight or on another regular schedule), semi-structured and unstructured data sources (for example, emails and documents), and real-time data, even at high volumes such as in the Internet of Things (IoT). The architecture can span multiple platforms today: some data might be stored in an on-premises data warehouse, other data on a Hadoop cluster, while still other data is stored and processed in the cloud (e.g. real-time data from IoT).

These pillars have enabled our clients to solve business challenges unthinkable in the past. For example, a financial institution in Australia is using a hybrid architecture with Hadoop and Teradata to source all of its enterprise data, including all customer records, financial data, and real-time weblog data. The data is combined in the Data Vault and made available to business analysts for use in standard reports and dashboards, and to data scientists for sophisticated analysis.

Another project, by my co-author and European business partner Michael Olschimke, analyzed the call volume in a cloud-based application for call centers to reduce reaction time when errors occur. This real-time dashboard aggregates all data and applies all required business logic in the real-time data stream, providing instant insights on the operational dashboard. In addition, all data is collected in the Data Vault, fully integrated with all other data sources available in the data warehouse, making the real-time data available for strategic reporting as well.

Yet another case by Michael integrates unstructured and structured data: a dashboard for monitoring open fires (e.g., forest fires) integrates satellite imagery with near-real-time data from social networks, which is linked to the satellite images by textual analysis.

What all these solutions have in common is that they provide timely access to useful information. This is due to the agile methodology and the generation of loading procedures for the data warehouse: the methodology ensures that a constant stream of new functionality becomes available to business analysts and managers.

With our proven implementation patterns, we bring more than 25 years of experience to our clients worldwide. All these patterns have been optimized for parallelization, auditability, recoverability, and performance, among other requirements.

And the enhancements of Data Vault 2.0 enable the enterprise data warehouse to span multiple environments, integrating on-premises systems, databases in the cloud, and NoSQL environments.

To ensure the highest implementation quality, we also make sure that all our consultants have access to our internal knowledge base, which provides best practices, implementation patterns, and use cases for various technology platforms, all in the service of our clients’ success.