DataVault4dbt - A Comprehensive Guide

BI, Guide

DATAVAULT4DBT: A COMPREHENSIVE GUIDE

— Intro to DataVault4dbt — Setup Instructions — Detailed Examples — Best Practices — Useful Resources

TABLE OF CONTENTS

1. INTRODUCTION

Overview of Datavault4dbt

Purpose and Benefits of Using Datavault4dbt

Target Audience

Key Features

Why Choose Datavault4dbt?

2. EXPLANATIONS

What is Data Vault 2.0?

Key Concepts and Terminology

Why Use Data Vault?

Integration with dbt (Data Build Tool)

Advantages of Datavault4dbt

Handling Heterogeneous Data Landscapes

Future-Proof Implementation

3. SETUP INSTRUCTIONS

Requirements

Installation Instructions

Global Variables

4. DETAILED EXAMPLES

General Overview

Special Specs

Supported Entities in datavault4dbt

Staging

Hubs

Links

Satellites

Reference Data

Point-in-Time (PIT) Tables

5. MOVING FORWARD

Next Steps

DBT TALK - HUB & SPEAK EVERY FIRST THURSDAY OF THE MONTH

CLICK HERE TO REGISTER FOR THE NEXT SESSION https://scalefr.ee/dbt-talk

1. INTRODUCTION OVERVIEW OF DATAVAULT4DBT DataVault4dbt is the official Data Vault 2.0 dbt package developed by the team led by Michael Olschimke. This open-source tool is designed to streamline and automate the implementation of Data Vault solutions within the dbt environment. By leveraging DataVault4dbt, data engineers can significantly reduce the time and effort required to build and maintain robust data warehouse models that adhere to Data Vault 2.0 standards.

PURPOSE AND BENEFITS OF USING DATAVAULT4DBT

The primary purpose of DataVault4dbt is to provide a quick, easy-to-use, and highly customizable solution for implementing performant Data Vault models. This package is based on years of practical experience and expertise in loading Data Vault entities across various environments. DataVault4dbt offers several key benefits: — Adherence to Data Vault 2.0 Standards: Ensures compliance with the recommended and future-proof methodologies of Data Vault 2.0, making it compatible with a wide variety of reference information — Support for Heterogeneous Data Landscapes: Capable of handling Change Data Capture (CDC), transient sources, and persistent sources, DataVault4dbt adapts to diverse data environments seamlessly

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

3/18

— High Customizability: Provides numerous global variables and parameters that allow users to tailor the package to fit their specific data environments and requirements — Efficiency and Performance: Designed to optimize incremental loads and support a modern insert- only approach, avoiding the need to update data and ensuring high performance TARGET AUDIENCE This guide is intended for data engineers, data architects, and other data professionals who are involved in building and maintaining data warehouse solutions using dbt. Whether you are new to Data Vault methodologies or an experienced practitioner looking to streamline your processes, this guide will provide you with the knowledge and tools to effectively utilize DataVault4dbt in your projects. KEY FEATURES

DataVault4dbt offers a comprehensive set of features to support Data Vault 2.0 implementation:

— Extensive Macros: Includes macros for Stages, Hubs, Links, Satellites, Non-Historized Links and Satellites, Multi-Active Satellites, Record-Tracking Satellites, Reference Tables and Point-in-Time Tables (PITs)

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

4/18

— Staging Area Support: Facilitates hashing, prejoins, and ghost records in the staging area

— Incremental Loading Optimization: Implements a high-water-mark technique that works across multiple sources

— Auditability: Ensures a fully auditable data warehouse solution

— Centralized Business Interface: Supports the creation of a snapshot-based centralized business interface WHY CHOOSE DATAVAULT4DBT? The package is designed to be user-friendly, efficient, and adaptable to a wide range of data environments. By sticking to the standards of Data Vault 2.0 and providing high integrated customizability, DataVault4dbt stands out as a reliable and powerful solution for data warehouse automation. Visit the GitHub repository for more information, to access the source code, or to contribute to the project. Join the growing community of DataVault4dbt users and start building your performant Data Vault solutions today!

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

5/18

2. EXPLANATIONS WHAT IS DATA VAULT 2.0?

Data Vault 2.0 is a methodology designed to provide long-term historical storage of data from multiple operational systems. It aims to offer a flexible, scalable, and auditable approach to data warehousing. Data Vault 2.0 extends the original Data Vault methodology by incorporating best practices, techniques, and methodologies for achieving a comprehensive data management system, including both the technical and the business aspects. KEY CONCEPTS AND TERMINOLOGY — Hubs: Represent unique business keys. Each hub table contains a list of unique business keys, which are the main entities in the data model — Links: Capture the relationships between hubs. They represent associations or transactions between business keys — Satellites: Store the descriptive attributes and context for the business keys and links. Satellites contain historical data and are associated with either hubs or links

— Business Keys: Unique identifiers that represent the core business concepts or entities

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

6/18

WHY USE DATA VAULT?

— Scalability: Data Vault models are highly scalable and can handle large volumes of data efficiently — Flexibility: The methodology allows for easy adaptation to changes in business requirements and source systems without significant rework — Auditability: Data Vault provides a complete audit trail, enabling traceability and compliance with regulatory requirements — Separation of Concerns: By separating business keys, relationships, and descriptive attributes, Data Vault models ensure clear and manageable data structures INTEGRATION WITH DBT (DATA BUILD TOOL) dbt (Data Build Tool) is a command-line tool that enables data analysts and engineers to transform data within their data warehouse more effectively. By integrating Data Vault methodology with dbt, you can leverage the power of dbt’s transformation capabilities along with the robust modeling techniques of Data Vault.

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

7/18

— Automated Transformations: dbt allows you to define your data transformations using SQL, and it takes care of running these transformations in the correct order — Version Control: Using dbt with version control systems like Git helps manage changes to your data models and transformations — Documentation and Testing: dbt provides built-in documentation and testing frameworks to ensure data quality and maintainability ADVANTAGES OF DATAVAULT4DBT — Open-Source and Community-Driven: DataVault4dbt is an open-source package, allowing you to benefit from community contributions and continuous improvements

— Performance Optimization: Designed to handle large datasets efficiently, with features like incremental loading and high-water-mark techniques

— Compliance with Data Vault 2.0 Standards: Ensures that your data models adhere to industry best practices and methodologies

— Customizability: High degree of customization through global variables and configurable macros to fit your specific data environment

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

8/18

HANDLING HETEROGENEOUS DATA LANDSCAPES

DataVault4dbt is designed to adapt to various data sources and types:

— CDC (Change Data Capture): Efficiently captures and processes changes in source data, ensuring your data warehouse stays up-to-date

— Transient Sources: Handles data from sources that may not retain historical data, ensuring that no data is lost

— Persistent Sources: Works seamlessly with sources that maintain historical records, integrating them into your Data Vault model FUTURE-PROOF IMPLEMENTATION By adhering to Data Vault 2.0 standards and providing a highly customizable framework, DataVault4dbt ensures that your data warehouse solutions are not only performant but also future-proof. The package’s flexibility allows for easy adaptation to new business requirements and technological advancements.

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

9/18

3. SETUP INSTRUCTIONS REQUIREMENTS

To use the macros efficiently, there are a few prerequisites you need to provide:

— Flat & Wide Source Data: Ensure that your source data is flat and wide, and available within your target database

— Load Date Column: This column should represent the arrival time of the data in the source data storage. Can be generated with DataVault4dbt, if not present — Record Source Column: This column should give information about where the source data is coming from (e.g., the file location inside a Data Lake). Can be generated with DataVault4dbt, if not present INSTALLATION INSTRUCTIONS 1. Include the Package in your packages.yml: • Open your packages.yml file in your dbt project directory • Add the following lines to include the DataVault4dbt package:

packages: - package: ScalefreeCOM/datavault4dbt version: [Insert latest version here]

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

10/18

• Check the dbt Hub for the latest version and installation instructions: dbt Hub - DataVault4dbt

2. Run dbt deps: • Open your command line and navigate to your dbt project directory • Run the following command to install the package: dbt deps 3. For Further Information: • For more detailed instructions on how to install packages in dbt, please visit the following link: dbt Package Management GLOBAL VARIABLES DataVault4dbt is highly customizable by using many global variables. These variables ensure a high rate of standardization across your Data Vault 2.0 solution as they are applied on multiple levels. 1. Default Values: • The default values of these global variables are set inside the package’s dbt_project.yml • You should copy these values to your own project’s dbt_project.yml to customize them according to your needs 2. Copy Global Variables: • Open the dbt_project.yml file in your DataVault4dbt package • Copy the global variables section to your own project’s dbt_project.yml 3. Explanation of Global Variables: • For a detailed explanation of all global variables and how to customize them, please refer to the DataVault4dbt wiki

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

11/18

4. DETAILED EXAMPLES GENERAL OVERVIEW

The DataVault4dbt package provides a comprehensive set of macros for staging and creating all the Data Vault entities necessary to build a robust Data Vault 2.0 solution. By leveraging these macros, users can automate the creation of hubs, links, satellites, and other essential components of a Data Vault model without writing SQL manually. This approach ensures consistency, scalability, and adherence to Data Vault 2.0 standards.

The typical structure of models created with DataVault4dbt includes three main parts:

1. Metadata Definition: A YAML-formatted section where metadata for the Data Vault entities is defined. This includes for example the definition of hash keys, business keys, and source models

2. Parsing of the metadata into a dictionary

3. Macro Invocation: The defined metadata are passed as parameters to the DataVault4dbt macros, which then generate the necessary SQL code based on the provided metadata

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

12/18

Below is an example model to illustrate this structure:

{%- set yaml_metadata -%} hashkey: ‘hk_account_h’ business_keys: - account_key - account_number source_models: stage_account {%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.hub(hashkey=metadata_dict.get(‘hashkey’), business_keys=metadata_dict.get(‘business_keys’), source_models=metadata_dict.get(‘source_models’) ) }}

This example consists of the following parts:

1. DBT Configuration: • {{ config(materialized=’incremental’) }} : Sets the materialization type to incremental, meaning that the table will be updated incrementally rather than being rebuilt from scratch each time 2. Metadata Definition: • The yaml_metadata block contains the definition of the hash key, business keys, and source models in YAML syntax. This metadata must be written correctly to ensure accurate parsing and SQL generation

DATAVAULT4DBT: A COMPREHENSIVE GUIDE © SCALEFREE INTERNATIONAL GMBH 2025

13/18

3. Macro Invocation: • The metadata_dict variable is created by parsing the YAML metadata

• The DataVault4dbt.hub macro is then called with the parsed metadata as parameters, generating the SQL code needed to create the hub table

In this example:

— The hashkey is defined as hk_account_h

— The business_keys are account_key and account_number and will also be used for the hashing

— The source_models is stage_account

— The DataVault4dbt.hub macro generates the SQL to create the account hub including the hash key, business key(s), load date and record source as well as the automatic creation of ghost records SPECIAL ASPECTS — No SQL Writing Required: One of the significant advantages of using DataVault4dbt is that you don’t have to write SQL for your raw vault entities manually. The macros generate all necessary SQL based on logical rules and the metadata defined — Pattern-Based Implementation: Since Data Vault 2.0 is pattern-based, the use of macros ensures that all entities are created consistently and adhere to best practices

14/18

— Automation and Efficiency: By automating the creation of Data Vault entities, DataVault4dbt significantly reduces development time and increases the efficiency of data warehouse projects These examples and explanations provide a foundational understanding of how to use DataVault4dbt to build your Data Vault 2.0 solution. For more detailed instructions and advanced configurations, please refer to the DataVault4dbt documentation. SUPPORTED ENTITIES IN DATAVAULT4DBT STAGING: — Purpose: Prepare raw data for loading into the Data Vault model by applying necessary transformations such as hashing and pre-joins HUBS:

— Available Macros: Standard Hubs

— Purpose: Store unique business keys and their associated metadata

— Components: Hash key (a unique identifier for each business key), business keys (natural keys from the source systems), load date and record source

15/18

LINKS:

— Available Macros:

Standard Links, Non-Historized Links

— Purpose: Capture relationships between business keys stored in hubs

— Components: Hash key (a unique identifier for each relationship), foreign keys (references to the business keys in the hubs), load date and record source SATELLITES:

— Available Macros:

— Standard Satellites (v0, v1) — Multi-Active Satellites (v0, v1) — Record-Tracking Satellites — Non-Historized Satellites — Effectivity Satellites

— Purpose: Store descriptive attributes and historical data related to hubs or links

— Components: Hash key (a unique identifier), attributes (descriptive data), record source, load dates, and end dates for tracking changes over time

16/18

REFERENCE DATA:

— Available Macros: — Reference Hub

— Reference Satellite (v0, v1) — Reference Tables

— Purpose: Store static or semi-static data that doesn’t change frequently and can be used to enrich the Data Vault model POINT-IN-TIME (PIT) TABLES:

— Available Macros:

Standard PITs (+ PIT Cleanup Hook)

— Purpose: Provide a snapshot of data at specific points in time for easier querying and reporting — Components: Combination of hash keys from hubs and links to capture the state of data at a given time

17/18

5. MOVING FORWARD

As the data landscape continues to evolve, the need for adaptable, scalable, and auditable data warehousing solutions remains paramount. DataVault4dbt rises to this challenge, offering a streamlined path to implementing Data Vault 2.0 within the dbt ecosystem. By embracing the principles and practices outlined in this guide, data professionals can unlock the full potential of their data, ensuring that their organizations are well-equipped to navigate the complexities of the modern data-driven world. NEXT STEPS Interested in exploring the potential of DataVault4dbt for your data warehousing needs? Visit the GitHub repository for detailed documentation and examples. The DataVault4dbt community is readily available to provide support and assistance as you embark on your Data Vault journey. CONTACT US If you want to know more about our package or have specific questions, please visit https://scalefr.ee/dbt-talk or contact us via datavault4dbt@scalefree.com.

18/18

Scalefree International GmbH Schützenallee 3 30519 Hannover Germany

P: +49 (511) 879 893 41 M:+49 (175) 811 0336 F: +49 (511) 879 893 49

info@scalefree.com www.scalefree.com

CEOs: Michael Olschimke & Christof Wenzeritt

Page 1 Page 2 Page 3 Page 4 Page 5 Page 6 Page 7 Page 8 Page 9 Page 10 Page 11 Page 12 Page 13 Page 14 Page 15 Page 16 Page 17 Page 18 Page 19 Page 20