THE DATA VAULT HANDBOOK: CORE CONCEPTS AND MODERN APPLICATIONS
Build Your Path to a Scalable and Resilient Data Platform
LEVERAGE YOUR DATA
ABOUT US

Scalefree International GmbH is a specialized business intelligence consultancy offering advisory, implementation services, and training. Our offerings are designed to empower companies to build efficient data platforms and leverage their data for long-term success. Founded in 2016 by Daniel Linstedt, the inventor of Data Vault, and Michael Olschimke, Scalefree has become a prominent name in the Data Vault community. Both founders also co-authored the influential book “Building a Scalable Data Warehouse with Data Vault 2.0”, published in 2015, which remains a key reference for data professionals worldwide. Scalefree is dedicated to delivering practical and sustainable solutions for modern data infrastructures, providing services in areas such as data architecture, modeling, integration, automation, and DevOps. As the leading Data Vault consultancy in Europe, we commit to the standards set by Data Vault to ensure quality and sustainability.
The Data Vault Handbook: Core Concepts and Modern Applications
Version 1.0: January 2025
Copyright © 2025 Scalefree International GmbH. All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of Scalefree, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law.
For permission requests, reach out to Scalefree using the contact information below.
Scalefree International GmbH Schützenallee 3 30519 Hannover Germany
+49 (511) 879 89341 info@scalefree.com www.scalefree.com
Company Headquarters: Hannover, Germany Registration: Amtsgericht Hannover, HRB 213578
CEOs: Michael Olschimke & Christof Wenzeritt
INTRODUCTION TO THE HANDBOOK

This handbook serves as an introductory guide for those looking to understand Data Vault. Often, the available material on Data Vault tends to fall into two extremes—either comprehensive books or fragmented blog posts. Our goal here is to find a middle ground, offering a broad yet cohesive overview of Data Vault concepts.

The guide is designed for data practitioners—data engineers, architects, analysts, and others—who are experienced in building data platforms but new to Data Vault. We assume a foundational understanding of data warehousing concepts, yet we introduce all Data Vault principles from the ground up.

Throughout the guide, we have carefully balanced the trade-off between the depth of coverage on each topic and the need for conciseness, so as not to produce yet another exhaustive volume. The priority has been to keep the writing approachable and straightforward, catering especially to readers who are new to Data Vault. In crafting examples, we’ve emphasized clarity and simplicity over technical complexity, using accessible scenarios and clear tables and diagrams. Remember, each of the sections presented here can be further explored through the recommended resources at the end of the guide.
TABLE OF CONTENTS

1. INTRODUCTION TO DATA VAULT
   1.1. Why Data Vault
   1.2. Data Vault Pillars
   1.3. General Definitions
        1.3.1. Business Rules
        1.3.2. Business Keys
        1.3.3. Hash Keys
2. DATA VAULT ARCHITECTURE
   2.1. Staging Area
        2.1.1. Transient and Persistent Staging Area
        2.1.2. Load Date Timestamp, Record Source & Hash Diff
   2.2. Enterprise Data Warehouse
        2.2.1. Raw Data Vault
        2.2.2. Business Vault
   2.3. Information Delivery
   2.4. Modern Architectures: Data Fabric & Data Mesh
3. DATA VAULT MODELING
   3.1. Core Standard Entities
        3.1.1. Hubs
        3.1.2. Links
        3.1.3. Satellites
        3.1.4. Core Entities Logical Model
   3.2. Derived Entities
        3.2.1. Non-Historized Links and Satellites
        3.2.2. Multi-Active Satellites
        3.2.3. Reference Data Entities
        3.2.4. Other Entities
   3.3. Query Assistant Tables
        3.3.1. Point-in-Time Tables
        3.3.2. Bridge Tables
4. DATA VAULT METHODOLOGY
   4.1. Disciplined Agile
   4.2. Project vs. Product Mindset
   4.3. Roles
5. RESOURCES
1. INTRODUCTION TO DATA VAULT
Data Vault is a system of business intelligence that includes the required components to realize the enterprise vision of data integration and information delivery. Unlike its predecessor, Data Vault 1.0, which focused primarily on modeling, the more modern versions of Data Vault (2.0 and later) present a fully renovated framework that also encompasses architecture, methodology, and best practices.

Data Vault is termed a “system of business intelligence” because its ultimate goal is to add value to the organization implementing it. This value is realized through the creation of dashboards, reports, KPIs, AI solutions, and other data-driven applications. At Scalefree, we consistently emphasize that the objective is to provide the right data, at the right time, to the right people, enabling informed decision-making within the organization.

However, handling multiple data sources that update frequently and involve millions of rows presents significant challenges, especially with the volume, velocity, and variety of big data. These challenges include cumbersome system integration, dirty data, lack of a single version of truth for key business definitions, delayed data team deliveries, lengthy data loads, inadequate documentation and standards, and difficulties in collaboration and deployment.

Traditional modeling methods sought to address many of these challenges by organizing data into structured, reliable schemas. For instance, some approaches emphasized a top-down, normalized data warehouse, aiming for a single, integrated version of the truth, while other methods favored a bottom-up, star schema approach that prioritized ease of reporting and analysis.

However, these methods encountered limitations, particularly when facing the growing complexity and rapid changes of modern data environments.
LIMITATIONS OF TRADITIONAL MODELING METHODS
— Handling Big Data and High Velocity: Can struggle to keep pace with the volume and speed of big data sources, leading to performance issues and delays
— Managing Data Source Changes: Require substantial rework when data sources evolve, making them less adaptable to the dynamic nature of today’s data landscapes
— Maintaining Flexibility in Business Rules: Embed business rules within the model, meaning any rule change can require extensive restructuring, which is costly and time-consuming
— Ensuring Data Lineage and Auditability: With an increased need for data governance, traceability, and regulatory compliance, traditional models may lack built-in mechanisms for maintaining full history and audit trails of all changes in the data
1.1. WHY DATA VAULT

Data Vault offers a structured approach to addressing these challenges. Every set of best practices, frameworks, and methodologies that you will find in Data Vault is crafted specifically to create our data solutions in the best possible way, integrating people, processes, and technology.

One of the key principles in Data Vault is that we follow standards. These strict standards are not a whim; they are answers to problems and deviations that you might encounter in your project and that would otherwise hinder your ability to scale along with your data. By following a clear set of guidelines, you can enhance collaboration within the data team and automate processes efficiently.

Data Vault has been in existence for decades, and its effectiveness is proven by the growing number of organizations implementing it, the increasing number of Data Vault conferences worldwide, and the professionals adding Data Vault certifications to their credentials.

This guide details the application of Data Vault to specific problems, illustrating how each implementation addresses the dynamic and exponentially increasing data challenges faced by data teams in companies around the globe.
DATA VAULT ADVANTAGES
— Adaptability: Handles evolving data structures and sources with minimal rework
— Modularity: Supports parallel development for faster delivery
— Auditability: Preserves complete history and ensures traceability of all data changes
— Data Separation: Distinguishes between raw and business-ready data for better integration and analytics
— Scalability: Scales horizontally to accommodate growing data volumes
— Real-Time Compatibility: Works seamlessly with both real-time and batch data processing
1.2. DATA VAULT PILLARS

Data Vault organizes its knowledge base around three main pillars, namely architecture, modeling, and methodology. Each pillar plays a crucial role in the overall framework, interacting with and complementing the others to provide a comprehensive data management solution.
ARCHITECTURE
A data architecture describes how the data pipelines are organized from the source system to the final destination, such as a report. It outlines each step in the pipeline, answering questions like: Should we transform the data immediately upon retrieval or later? Where do we integrate different data sources and formats? Where do we store historical data? How do we ensure data quality and consistency? And more.

The Data Vault architecture provides a structured approach to defining clear steps (or layers) in our pipelines, each serving a distinct purpose. With a clearly specified blueprint of your project’s architecture, you can organize the work more efficiently and minimize deviations from the plan.

Additionally, modern approaches increasingly incorporate a domain-driven enterprise architecture, which structures data around specific business domains. This strategy aligns data pipelines closely with business functions, and Data Vault architecture is inherently adaptable to this approach, allowing data to be organized by different domains while maintaining consistency, scalability, and flexibility across the organization.
MODELING
Data modeling is the process of analyzing and defining the various types of data your business collects and produces, along with the relationships between these data elements. A well-specified data model will help your company integrate and retrieve the right data at the right moment for the right user. As the number of source systems increases, it’s crucial to ensure that your data model scales accordingly. Failing to do so can result in unreliable data, making it difficult to deliver value, such as accurately identifying your last quarter’s profit.
“ The goal is to create a model that clearly identifies the core business objects and integrates them without redundancies [...]”
Data Vault modeling techniques offer a flexible and scalable approach to integrating your data. The goal is to create a model that clearly identifies the core business objects and integrates them without redundancies, generating a single version of the truth and facilitating swift business decisions.
METHODOLOGY
The methodology in Data Vault provides us with clear ways of working. Here we get best practices, frameworks, and guidelines on how to coordinate the work and integrate people, processes, and technology. While these standards are rooted in software engineering, they have been updated to meet the needs of modern data projects.

Additionally, the methodology extends beyond best practices and coordination strategies to include implementation standards. These standards define how to load data structures and derive information from them, like facts and dimensions. Leveraging the pattern-based nature of the Data Vault model—which relies on metadata—the implementation process itself follows a pattern-based approach. This ensures consistency, repeatability, and scalability across the entire data platform.

1.3. GENERAL DEFINITIONS

In this guide, we will frequently reference several key concepts related to Data Vault. This section provides brief descriptions of these fundamental terms.

1.3.1. BUSINESS RULES

Business rules (also known as soft rules) are logical transformations that we can apply to our data. We call them business rules because these transformations have the goal of adapting the data in a way that serves the business. This could imply cleaning the data, aggregating it, calculating, normalizing or denormalizing, etc. Basically, anything that transforms the data from how it looked in the original source, including its granularity, will be a business rule.
On the other hand, Data Vault also makes use of so-called “hard” rules. Unlike (soft) business rules, hard rules don’t change the data itself; they merely adapt the data structure so it can be loaded into the data platform. For instance, a hard rule could be a change of an attribute’s data type, or the addition of technical fields to a table.
— Hard Rules: Technical adjustments without altering the data’s content
— Soft (Business) Rules: Transform the data by adapting it from its original form to serve end-user requirements
In summary, hard rules help us integrate and work with our data without transforming it in a fundamental way, while (soft) business rules transform the data to meet the end user’s specifications.

1.3.2. BUSINESS KEYS

Business keys are unique identifiers of relevant business objects, such as a customer or an order, enabling their identification, tracking, and location within a data system. Furthermore, they serve as the foundation for data integration, allowing businesses to establish a common language for organizing and understanding their data.

Identifying business keys involves a thorough understanding of how business users navigate and retrieve information. Analyzing business processes, source system data, and data models also aids in pinpointing business keys. Keep in mind that a good business key should have a low propensity to change and be unique.
What is the difference between business keys and surrogate keys?
Surrogate keys are synthetic keys, meaning they are artificially generated by the operational system and should not carry any business meaning. One of the biggest problems with using surrogate keys as our business keys to identify business objects is that they only make sense inside a single source system; when integrating multiple systems in our Data Vault, they might therefore generate collisions.
CHARACTERISTICS OF A GOOD BUSINESS KEY
— Low Propensity to Change: Should remain stable over time to avoid tracking issues
— Uniqueness: Must uniquely identify records to ensure data integrity
— Globally Recognized: Prefer universal identifiers over system-specific ones for consistency

— Known by the Business Users: Ideally, they should be meaningful and familiar to stakeholders

— Natural Key over Surrogate Key: When possible, use inherent business identifiers instead of generated ones
1.3.3. HASH KEYS

Hashing is the process of converting an input into a fixed-size output using a deterministic algorithm known as a hash function. In this sense, a hash key is a unique identifier generated by applying a hash function to specific input data, like a business key (BK). For instance, consider a product’s BK, such as “123”, which upon hashing might yield a result like MD5(123) = 202cb962ac59075b964b07152d234b70, a 32-character string. This process is deterministic, meaning that given the same input (123), the resulting output (202cb...) will consistently remain the same.

In Data Vault, business keys are typically hashed to create fixed-length representations that enhance join performance. Given that keys may originate from diverse sources and be depicted in various formats, applying a hashing function standardizes them across the board. Hash keys offer fixed-size representations, which can streamline indexing and retrieval operations, particularly in traditional database environments. By ensuring a consistent length for join conditions and minimizing the computational overhead associated with variable-length keys, hash keys may significantly optimize query execution and overall system efficiency.
— Source 1: Customer Business Key CUST-987654 → Customer Hash Key 720d115e73ef907...
— Source 2: Customer Business Key XYZ123456 → Customer Hash Key 12e0b9eb873ac01...
— Source 3: Customer Business Key 2024-0001-ABC → Customer Hash Key 41e11a27864066f2...
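The sketch below, in Python, shows how such a hash key might be derived. The normalization steps (trimming, upper-casing) and the delimiter for composite keys are illustrative conventions rather than fixed Data Vault standards, and the helper name is ours.

```python
import hashlib

def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Compute an MD5 hash key from one or more business key parts.

    Trimming and upper-casing are example conventions; a real project
    fixes its own normalization rules and applies them everywhere.
    """
    normalized = [part.strip().upper() for part in business_key_parts]
    return hashlib.md5(delimiter.join(normalized).encode("utf-8")).hexdigest()

# Deterministic: the same input always yields the same 32-character key.
print(hash_key("123"))                       # 202cb962ac59075b964b07152d234b70
print(hash_key("CUST-987654"))               # fixed-length key for source 1
print(hash_key("2024-0001-ABC", "STORE-7"))  # composite business key (hypothetical parts)
```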
Additionally, hashing is particularly advantageous when dealing with composite business keys, which can become lengthy and cumbersome. By reducing their size to a fixed length, hashing optimizes storage and retrieval operations, making these keys more manageable.
Lastly, unlike the use of sequences in Data Vault 1.0, hash keys enable us to circumvent dependencies during the loading process. Since the hashing of each business key is independent of others and consistently yields the same hash key for the same BK, the loading process can occur concurrently. This parallel loading capability enhances scalability and performance by optimizing the utilization of computational resources.
ADVANTAGES OF HASHING
— Consistency: Ensures uniformity across diverse source formats
— Performance: Enhances efficiency with fixed-length representations
— Optimization: Streamlines indexing and retrieval operations
— Parallelism: Enables concurrent, dependency-free loading processes
While there are multiple reasons to implement hash keys, the decision ultimately depends on whether the anticipated benefits align with your modeling strategy and whether they complement other factors, including your tool stack. There may be valid situations where using hash keys is significantly less performant, making it preferable to work directly with the business keys instead.
In this chapter
• Learn the three-layer Data Vault architecture and how it ensures data integration, historization, and flexible delivery
• Discover the purpose of the Staging, Raw Data Vault, and Business Vault layers and their role in a scalable data platform
• Explore how the Information Delivery layer connects the Data Vault to end users by enabling analytics and reporting
2. DATA VAULT ARCHITECTURE
The Data Vault architecture is built upon a three-layer data platform structure, each layer purposefully designed to address specific demands and challenges encountered in developing an Enterprise Business Intelligence solution. In this section, we will first present a visualization of the Data Vault architecture, which will serve as a guide as we explore each layer in detail.

An Enterprise Data Platform is a comprehensive framework that centralizes, integrates, and manages an organization’s data assets, enabling seamless access, analysis, and utilization of data across the enterprise. It encompasses the tools, technologies, and processes necessary for collecting, storing, processing, and delivering data to meet business needs. An effective Enterprise Data Platform supports a wide range of data-driven applications, from reporting and analytics to machine learning and artificial intelligence.
In this case, the Enterprise Data Platform will be composed of three layers: the Staging Area, where raw data is ingested and prepared; the Enterprise Data Warehouse, where data is integrated, standardized, and stored; and the Information Delivery layer, where data is presented and made accessible to users for decision-making and other business processes.

Furthermore, a Metadata Management layer spans the entire Enterprise Data Platform in order to facilitate data discovery and automation, improve data governance, and enable efficient management of the data lifecycle across the various stages of the platform, from ingestion to delivery.

One of the key features of this architecture is its ability to distinctly address the demands of both data integration and information delivery. By structuring the Enterprise Data Platform into the three layers, it allows for a separation of concerns—ensuring that the complexities of data integration, storage, and historical tracking are managed independently from the processes involved in delivering actionable insights to end users. This separation enhances the system’s flexibility, scalability, and maintainability, enabling each layer to evolve and optimize according to its specific requirements without disrupting the overall data flow.

Additionally, Data Vault can be adapted to a domain-driven enterprise architecture (Section 2.4) and can incorporate real-time capabilities using an Enterprise Service Bus (ESB); however, the discussion of real-time capabilities falls outside the scope of this guide.

“One of the key features of this architecture is its ability to distinctly address the demands of both data integration and information delivery.”
2.1. STAGING AREA

The Staging Area is the initial layer in the Data Vault architecture, responsible for collecting and storing raw, unmodified data from various source systems. Its primary purpose is to offload the operational burden from these systems and take control of the data on our side, serving as a centralized repository where data is gathered before any significant transformations occur. This layer ensures that the data remains as close to its original form as possible, preserving its integrity for auditability, historization, and subsequent processing.

In the Staging Area, only hard rules are applied (Section 1.3.1), so that the data’s fundamental meaning is not modified. Instead, these rules focus on addressing technical requirements to facilitate storage and processing. Examples of hard rules include aligning data types, computing hashes (Hash Keys & Hash Diffs), and adding technical fields such as the Record Source and the Load Date Timestamp. By focusing solely on these hard rules, the Staging Area prepares the data for further processing in the subsequent layers of the Data Vault architecture.

2.1.1. TRANSIENT AND PERSISTENT STAGING AREA

When designing a Staging Area, a crucial decision involves determining how long data should be retained in this layer. Generally, you can choose to keep the complete history of your data sources (Persistent Staging Area) or retain data only for a limited period, such as a single load cycle or a specific number of days (Transient Staging Area). Each approach comes with its own set of advantages and disadvantages.
A Transient Staging Area is designed to hold data temporarily. Once the data has been processed and moved to subsequent layers, it might be discarded from the Staging Area. The main advantage of this approach is that it reduces storage requirements, as data is not retained long-term. This can be particularly beneficial in environments with limited storage capacity or where data sources are extremely large. Additionally, a Transient Staging Area simplifies the management and maintenance of the staging layer, as outdated data is automatically purged.

However, the downside of a Transient Staging Area is that if an error occurs or data needs to be reprocessed in our Enterprise Data Warehouse, it cannot be retrieved from the Staging Area and must be re-extracted from the original source systems. This might not always be possible, posing a risk of losing historical data.

In contrast, a Persistent Staging Area can store the full history of data sources. This approach offers significant benefits, particularly in terms of data recovery and reprocessing. If errors are detected or data needs to be reloaded, the Persistent Staging Area allows for easy retrieval of previous data loads without requiring a fresh extraction from the source systems. However, the trade-off is that it requires more storage space and can lead to increased management complexity. The Persistent Staging Area also demands robust governance to ensure data integrity over time.

“[A Persistent Staging Area] offers significant benefits, particularly in terms of data recovery and reprocessing.”
Most companies tend to begin their staging process with a landing zone in the shape of a data lake—a centralized repository that enables the storage of large volumes of structured, semi-structured, and unstructured data at scale. This data lake could be persisted in order to host the history of the data sources. Following this, they design a relational Transient Staging Area that adapts to changes in the source data structures and applies the necessary hard rules for further processing in the next layers.

Finally, the choice between these designs is especially important in the European Union, where laws like the General Data Protection Regulation (GDPR) require companies to address scenarios where users request the deletion of their personal data. A Transient Staging Area can simplify compliance by ensuring that data is only temporarily stored, reducing the complexity of data deletion requests. However, this approach also introduces the aforementioned challenges related to reprocessing and data retrieval. On the other hand, companies opting for a Persistent Staging Area will need to implement additional measures, such as data encryption protocols.
2.1.2. LOAD DATE TIMESTAMP, RECORD SOURCE & HASH DIFF
We will provide a brief overview of the technical attributes included in the Staging Area and their significance for Data Vault. It’s important to note that Hash Keys are not covered here, as they were already discussed in Section 1.3.3.
RECORD SOURCE
The Record Source attribute is essential for tracking the origin of data, providing a clear and detailed reference to the source system from which the data was extracted. Therefore, its goal is to maintain traceability and allow for troubleshooting, ensuring that every piece of data can be traced back to its originating source.
The Record Source is a key technical attribute that supports debugging processes by providing a direct reference to the origin of the data. While it helps ensure data traceability back to the original source systems, its primary value lies in assisting technical teams in identifying and resolving issues during data loading and processing. To make the most of the Record Source attribute, it is advisable to use the most detailed level of granularity possible:
— Less detailed Record Source: SALESFORCE
— More detailed Record Source (recommended): SALESFORCE/Partners/2024-01-01T07:00:00
LOAD DATE TIMESTAMP
The Load Date Timestamp is a system-generated and maintained attribute that records the exact timestamp when a business key first enters the data platform. This technical timestamp has no business meaning and is used purely for technical purposes. In this sense, the field is crucial for tracking the timing of data ingestion and is consistent across all data loaded within the same batch. By ensuring that all records in a batch share the same Load Date Timestamp, the system enables traceability, making it easier to identify and resolve any technical issues or errors that may have occurred during the loading process.
Once set, the Load Date Timestamp should never be altered, preserving a reliable, time-stamped view of the data as it appeared at the moment of loading into the Staging Area. This is also important for implementing the incremental loading of our entities through the different layers.
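To make these technical fields concrete, here is a minimal Python sketch of how a staging process might stamp an incoming batch; the column names and the record source format are assumptions for illustration, not prescribed standards.

```python
from datetime import datetime, timezone

def stage_batch(rows: list[dict], record_source: str) -> list[dict]:
    """Attach technical attributes to a batch of raw source rows.

    Every row of one batch shares the same Load Date Timestamp; it is
    set once on arrival and never altered afterwards.
    """
    load_dts = datetime.now(timezone.utc)  # one timestamp per batch
    return [{**row, "load_dts": load_dts, "record_source": record_source} for row in rows]

# Example usage with a hypothetical Salesforce extract:
staged = stage_batch(
    [{"customer_bk": "CUST-987654", "name": "Jane Doe"}],
    record_source="SALESFORCE/Partners/2024-01-01T07:00:00",
)
```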
HASH DIFF
The Hash Diff is an attribute used in satellites to optimize the detection of changes in data rows. It speeds up lookups by providing a quick way to compare rows and identify modifications. The Hash Diff is generated similarly to Hash Keys, but instead of just hashing business keys, the descriptive attributes from the source system are concatenated using a delimiter before the hash calculation. Prior to concatenation, leading and trailing spaces are removed, and the data is formatted in a standardized way to ensure consistency.

In practice, the Hash Diff is used to quickly determine whether a change has occurred for a given business key. As new data arrives, a Hash Diff is calculated by hashing the current set of descriptive attributes. This value is then compared to the most recent Hash Diff stored in the satellite for the same business key. If the Hash Diff remains unchanged, no new row is inserted, as there is no detected delta. If the value differs, it indicates that at least one descriptive attribute has changed, and a new record is added to the satellite.
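The following Python sketch captures this delta-detection logic; the delimiter, the alphabetical attribute ordering, and the helper names are assumptions for the example. The worked rows that follow trace the same behavior.

```python
import hashlib

def hash_diff(attributes: dict) -> str:
    """Hash the concatenated, normalized descriptive attributes of a row.

    Sorting by attribute name keeps the result independent of column
    order; trimming and a fixed delimiter keep it consistent.
    """
    parts = [str(attributes[name]).strip() for name in sorted(attributes)]
    return hashlib.md5("||".join(parts).encode("utf-8")).hexdigest()

def is_delta(new_attributes: dict, last_hash_diff: str | None) -> bool:
    """True if the incoming row differs from the latest satellite entry."""
    return hash_diff(new_attributes) != last_hash_diff

# Business key "AAA": the first load is always a delta (no stored Hash Diff yet).
stored = hash_diff({"color": "Blue", "price": 100, "size": "Small"})
print(is_delta({"color": "Blue", "price": 100, "size": "Small"}, stored))   # False: skip row
print(is_delta({"color": "Green", "price": 100, "size": "Small"}, stored))  # True: insert row
```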
— Row 1: Business key “AAA” appears with new attributes (“Blue”, 100, “Small”). The Hash Diff is new, so the row is inserted into the satellite.
— Row 2: The same business key is loaded again, but the attributes haven’t changed. Since the Hash Diff is the same, the row is not inserted.
— Row 3: One attribute changes to “Green”, resulting in a different Hash Diff. This counts as a delta, so the row is inserted.
— Row 4: The data is identical to the previous day. The Hash Diff doesn’t change, so the row is skipped.
In conclusion, rather than comparing each attribute individually, the loading process for our Satellite entities (Section 3.1.3) from the Staging Area to the Raw Data Vault will utilize the Hash Diff column to efficiently detect changes in attributes for each business key. While the example provided only includes three attributes, the performance benefits become even more significant when handling tables with dozens or even hundreds of attributes. This approach streamlines the detection process, reducing the computational overhead and enhancing the overall efficiency of the data pipeline.
2.2. ENTERPRISE DATA WAREHOUSE

The second layer of the Data Vault architecture is the Enterprise Data Warehouse, where source data is integrated to create a single point of the facts—providing all the data, all the time—using Data Vault modeling standards. The Enterprise Data Warehouse is divided into two components: the Raw Data Vault and the Business Vault, each of which will be explained in detail in the following sections.

2.2.1. RAW DATA VAULT

The Raw Data Vault is the foundational component of the Enterprise Data Warehouse, designed to store and manage all the raw, unmodified data coming from the Staging Area. Here, only hard rules can be applied (Section 1.3.1). This layer is where the data from various source systems is integrated and modeled using Data Vault techniques, ensuring that the data’s original granularity is preserved while establishing a single version of the facts.

In the Raw Data Vault, data is organized into Hubs, Links, and Satellites, which are the core entities of the Data Vault model (Section 3.1), and other derived entities (Section 3.2). Hubs, Links, and Satellites represent core business objects, their relationships, and their descriptive attributes, respectively.

One of the key advantages of the Raw Data Vault is its ability to retain the full historical context of the data. Unlike traditional data warehouses, where data might be aggregated or summarized on the way to the warehouse, the Raw Data Vault stores data in its original raw format, ensuring that no data is lost over time. This approach is crucial for auditing, traceability, and compliance, as it allows organizations to maintain a complete record of their data, with every change and update preserved in its historical context.

“The Raw Data Vault stores data in its original raw format [...] crucial for auditing, traceability, and compliance.”

Moreover, the Raw Data Vault’s architecture is highly scalable and flexible. By decoupling the business logic from the data storage, it allows for easier integration of new data sources and the ability to adapt to changing business requirements without significant reengineering. This flexibility makes the Raw Data Vault an ideal choice for organizations dealing with complex data environments or changing data structures.

2.2.2. BUSINESS VAULT

The Business Vault serves as an intermediary layer between the Raw Data Vault and the Information Delivery layer. Situated within the Enterprise Data Warehouse (EDW) layer, the Business Vault is designed to bridge the gap between the raw, unmodified data stored in the Raw Data Vault and the end-user structures required for reporting and analysis. Unlike the Raw Data Vault, where data is stored in its original, unaltered form, the Business Vault allows for the application of soft (business) rules (Section 1.3.1) that modify the data, including its granularity, to better align with business needs.
The primary purpose of the Business Vault is to facilitate the efficient creation of end-user structures in the Information Delivery layer. By leveraging the Business Vault, organizations can generate query assistance entities, precompute calculated fields, and apply reusable business logic, all of which streamline the population of Information Marts in the Information Delivery layer. The Business Vault is optional and can be virtualized when possible, depending on the specific needs and architecture of the organization.
A Business Vault is not a copy of your Raw Data Vault: build only the Business Vault entities that ease the creation of end-user structures.
A key feature of the Business Vault is its flexibility compared to the Raw Data Vault. While the Raw Data Vault is designed to maintain strict auditability and historical accuracy, the Business Vault does not have to adhere to these stringent requirements. This allows Business Vault entities to be created, modified, and deleted as needed, since they can be recreated from the Raw Data Vault when necessary. Entities within the Business Vault are typically created only when necessary for business operations, reducing unnecessary complexity and storage overhead.

Common entities found within the Business Vault include Point-in-Time (PIT) tables, Bridge tables (Section 3.3), Computed Satellites, and Exploration Links. These entities play a crucial role in optimizing query performance, storing computed data, and establishing connections between Hubs that were not linked in the Raw Data Vault. Additionally, the Business Vault can be used to host business logic that filters out outliers or applies other data quality criteria before the data is consumed by the Information Delivery layer.
2.3. INFORMATION DELIVERY

The Information Delivery layer is the final, user-facing layer in the Data Vault architecture, designed to cater to the specific needs of business users. This layer transforms the data stored in the Raw and Business Vaults into formats that are easily accessible and actionable for end users. Since this layer is closely aligned with business requirements, its design is highly flexible, accommodating various data models, such as star schemas, OLAP cubes, and flat or wide tables.

The primary role of the Information Delivery layer is to ensure that data is delivered in a form that supports effective decision-making. This involves transforming the granular, historical data stored in the Raw Data Vault and the business logic applied in the Business Vault into intuitive, user-friendly formats. These formats are tailored to specific business use cases, allowing users to perform analysis, reporting, and data exploration with minimal complexity. By doing so, the Information Delivery layer acts as the bridge between the technical underpinnings of the data warehouse and the practical needs of the business.

Entities created within the Information Delivery layer can query both the Raw Data Vault and the Business Vault, depending on the nature of the data required. For instance, if a report demands raw, unmodified historical data, it might query the Raw Data Vault directly. Conversely, if the report needs data that has undergone business rule transformations, it would query the Business Vault. This dual capability ensures that the Information Delivery layer can provide the most relevant and accurate data for any given business scenario.
An important aspect of the Information Delivery layer is the creation of data marts—specialized subsets of the data platform that serve distinct business needs. These marts are designed to optimize performance and usability for specific types of analysis.
EXAMPLES OF MARTS
— Information Mart: Focuses on presenting business metrics, KPIs, and performance indicators, offering insights for decision-making and reporting
— Error Mart: Tracks and manages inconsistencies that violate hard rules
— Meta Mart: Centralized repository containing metadata about data structures, lineage, transformations, and applied business rules
— AI Mart: Feature store derived from multiple data sources used by data scientists to train AI models
The Information Delivery layer empowers business users with accurate, timely, and relevant data. This layer is not just about data presentation; it’s about enabling better business outcomes by providing the right data in the right format at the right time. Whether through dashboards, reports, or more complex analytical models, the Information Delivery layer is crucial in transforming raw data into actionable insights.
2.4. MODERN ARCHITECTURES: DATA FABRIC & DATA MESH

Data Fabric and Data Mesh are two complementary approaches to modern data architecture, both designed to solve challenges related to scaling, managing, and delivering data across complex organizations. Data Fabric is a centralized architecture that integrates and governs data from a variety of sources, such as legacy systems, cloud applications, data lakes, and warehouses. It provides a unified, secure, and consistent data layer that serves both operational and analytical needs, leveraging technologies like data cataloging, orchestration, and governance to ensure data is delivered efficiently across the enterprise.

While Data Fabric offers a global view of enterprise data, Data Mesh decentralizes data management by giving domain-oriented teams the ownership of and responsibility for their own data products. The two architectures work in harmony, as Data Fabric can provide the underlying connective tissue that ensures data remains integrated and governed, even as Data Mesh empowers different domains to manage and scale their data products independently. This blend of centralized governance (via Data Fabric) and decentralized ownership (via Data Mesh) offers flexibility, scalability, and efficiency in modern data ecosystems.

In particular, when we implement a Data Mesh architecture with Data Vault, we should aim for a balance between decentralization and centralization to maximize efficiency and scalability. While it is possible to completely decentralize the entire data pipeline, there are significant advantages to keeping certain layers centralized, at least up to the Raw Vault.
CORE PRINCIPLES OF DATA MESH
— Domain Ownership: Data is managed by the teams who understand the domain best. These teams are responsible for ensuring the quality, governance, and availability of their data, making them more accountable and responsive to business needs
— Data as a Product: Data is treated as a first-class product. Each domain ensures its data is discoverable, accessible, and trustworthy for other teams to use. The aim is to deliver value continuously, maintaining high standards for data quality
— Self-Serve Data Infrastructure: To enable decentralization, a self-service infrastructure is necessary. This allows domain teams to independently access the tools, pipelines, and platforms needed to create, manage, and share data products without reliance on centralized data engineering teams
— Federated Governance: Even though teams manage their own data, governance is still crucial. Federated governance ensures that global standards such as data security, privacy, and compliance are adhered to while allowing teams the freedom to operate independently
The Raw Vault, being purely focused on capturing hard facts without applying any business logic, is an ideal candidate to remain centralized. It serves as a clean, shared foundation of data, which all domains can access without having to manage redundant staging processes. By centralizing the Raw Data Vault, we can leverage high levels of automation, ensuring that all raw data is available for various domains to build upon while preventing unnecessary duplication of effort or data.
Once we move to the Business Vault and the Information Mart, decentralization becomes more advantageous. These layers apply business rules, which are best understood by the specific domains. By decentralizing at these layers, we allow domain teams to take full ownership of their data products, applying their expertise to transform raw data into meaningful business insights. This approach also empowers the domains to work independently, reducing bottlenecks typically found in centralized IT departments and accelerating the delivery of new data products.
While the platform team can continue to provide technical support and ensure the infrastructure operates smoothly, the responsibility for the data from the Business Vault onwards lies with the domain teams. This helps to ensure that business value is generated more efficiently across the organization.

Finally, in very large organizations, full decentralization, including the Staging and Raw Vault layers, may be necessary—especially if they manage multiple data platforms or distinct business units. In such cases, each domain might manage its own end-to-end pipeline, but this should only be considered where it provides clear benefits in terms of efficiency and scalability.
DATA VAULT TRAININGS
Learn from our experienced trainers and get certified: https://scalefr.ee/training
In this chapter
• Learn the roles of Hubs, Links, and Satellites in capturing business keys, their relationships, and descriptive attributes
• Discover how derived entities like Non-Historized Links, Multi-Active Satellites, and Reference Data Tables address specific data challenges and business needs
• Explore how Point-in-Time (PIT) and Bridge tables optimize performance by simplifying joins and enhancing query efficiency in the Business Vault
3. DATA VAULT MODELING
Data Vault modeling is one of the most defining aspects of Data Vault, distinguishing it from other approaches. At Scalefree, we frequently observe that this area is a focal point where our clients seek the most support and guidance. While it can be challenging to grasp initially, mastering the Data Vault model patterns is crucial for effectively implementing a robust and scalable Enterprise Data Platform. Each entity within the model is designed with a specific purpose, providing not only a structured approach to data loading but also a framework for efficient data storage and retrieval, ultimately optimizing the workflow across the entire data ecosystem.

In this section, we will begin by exploring the three foundational entities of the Data Vault model: Hubs, Links, and Satellites. These core components serve as the building blocks upon which more complex entities are derived. A solid understanding of these basics is essential before delving into the more advanced derived entities. Additionally, we will introduce Point-in-Time (PIT) tables and Bridge tables, which are special entities designed to address specific challenges within the data warehousing process.

Keep in mind that these modeling standards will be applied throughout our Enterprise Data Warehouse layer (Section 2.2), encompassing both the Raw Data Vault and the Business Vault. The placement of entities within either the Raw or Business Vault will depend on whether we are applying hard or soft rules (Section 1.3.1). Finally, to enhance understanding, practical examples will accompany each entity, illustrating their role and how they interact within the overall model.

3.1. CORE STANDARD ENTITIES

The selection of Hubs, Links, and Satellites as the core entities in Data Vault modeling is rooted in their ability to represent a business from its fundamental components: business keys, relationships, and descriptive data, respectively. Business keys (Section 1.3.2) serve as the backbone, uniquely identifying and connecting these business objects within our Enterprise Data Warehouse.
To ensure a comprehensive understanding of these core entities, we will explore them in detail. For each entity, we will describe its purpose, outline its structure, and illustrate its application with a practical example.

3.1.1. HUBS
Hubs are the foundational entities in the Data Vault model, representing core business concepts through a distinct list of business keys. These keys are crucial for tracking business objects both in time—answering questions like “Is this client still active?”—and in space across the entire data pipeline, such as “Is this lead defined by marketing the same as the lead presented in this report?”

Hubs play a critical role in consolidating diverse data sources, ensuring a unified and consistent granularity for each business concept. The separation of business keys into Hubs emphasizes their central importance in identifying and managing business objects. A Hub entity is designed to store these keys alongside essential metadata (Section 2.1.2), including the source system (Record Source) and the timestamp when the business key was first introduced to the data platform (Load Date). This structure not only simplifies the integration of new data sources—where new business keys can easily be added to existing Hubs—but also provides the necessary granularity for downstream objects, such as dimension tables. This granularity is vital for efficient querying and minimizes computational overhead, making Hubs indispensable for maintaining the integrity and performance of the Enterprise Data Warehouse. The structure of a Hub is outlined below, followed by a sketch of the corresponding loading pattern.
HUB STRUCTURE
— Hash Key: The primary key for the Hub, generated by hashing the business key, ensuring uniqueness and consistency across all data sources
— Load Date TS: A timestamp marking when the business key was loaded into the data platform, crucial for tracking and batch management
— Record Source: Identifies the originating system or application, providing traceability and ensuring data lineage
— Business Key: The original value from the source system representing a core business concept
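Putting the structure to work, the sketch below shows the standard hub-loading idea in Python: only business keys not yet present are inserted, keeping the hub a distinct list of keys. The in-memory dictionary stands in for the hub table, and the column names and record sources are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timezone

hub_customer: dict[str, dict] = {}  # hash key -> hub row (stand-in for the table)

def load_hub_customer(business_keys: list[str], record_source: str) -> None:
    """Insert only business keys that are not yet present in the hub."""
    load_dts = datetime.now(timezone.utc)
    for bk in business_keys:
        # Normalization (trimming, upper-casing) is an example convention.
        hk = hashlib.md5(bk.strip().upper().encode("utf-8")).hexdigest()
        if hk not in hub_customer:  # a hub holds each business key exactly once
            hub_customer[hk] = {
                "hub_customer_hk": hk,
                "load_dts": load_dts,
                "record_source": record_source,
                "customer_bk": bk,
            }

load_hub_customer(["CUST-987654", "XYZ123456"], "SALESFORCE/Partners/2024-01-01T07:00:00")
load_hub_customer(["CUST-987654"], "CRM/Customers/2024-01-02T07:00:00")  # known key: no new row
print(len(hub_customer))  # 2
```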