
Building a Data Platform: Part 5 - Data Transformation (loading the warehouse)

Introduction 

Once the data has been ingested into the data lake, it's time to make it available for reporting by transforming it into structures that are simple for a BI tool to query. This part of the series describes the data modelling approach, the transformation tool (DBT) and how to transform the data ready for reporting. The example architecture diagram is shown below.




Data Model

There are a variety of data modelling approaches that can be used to model a data warehouse, popular approaches include:
  1. Dimensional
  2. Data Vault
  3. Third-Normal Form (3NF)
A comparison of the different approaches is beyond the scope of this article; for our example architecture we will be using dimensional modelling. Dimensional modelling has been the dominant data warehouse modelling approach in my personal experience and there are a good number of reasons to use it, the most important being ease of use for analysts and business users, and performance.

Dimensional modelling has two main entities: facts and dimensions. Facts record measurements and events while dimensions record the contextual information for the fact. 

Consider the simple data model below representing orders for an online shop. Note that I've prefixed the tables with fct_ for fact tables and dim_ for dimension tables to ease understanding.
 



In the diagram there is a central fact table fct_order_item that is used to represent each product included in the order. The quantity and unit_price_gbp fields are measures and are included on the fact table. Other than the measures, a fact table should be made up of foreign keys pointing to the dimensions that provide the order context. In this case, there are user, product and date dimensions that are used to describe who placed the order, what products are included in the order and when the order was placed. Note that the product dimension (dim_product) contains both the product information and the associated category details for the product. This denormalised approach is normal for data warehouses but inadvisable for operational transactional databases.
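
To make the shape of the model concrete, here is a minimal sketch of how these tables might be declared in Redshift. The column names and types are illustrative assumptions rather than a prescribed design, and remember that Redshift treats the key constraints as informational only.

```sql
-- Illustrative DDL for the order model described above; names and types are assumptions.
CREATE TABLE dim_user (
    user_id   VARCHAR(32) PRIMARY KEY,
    user_name VARCHAR(255),
    email     VARCHAR(255)
);

CREATE TABLE dim_product (
    product_id    VARCHAR(32) PRIMARY KEY,
    product_name  VARCHAR(255),
    category_name VARCHAR(255)          -- denormalised category details
);

CREATE TABLE dim_date (
    date_id           INTEGER PRIMARY KEY,  -- e.g. 20240131
    calendar_date     DATE,
    financial_quarter VARCHAR(7)
);

-- Each row is one product on one order; quantity and unit_price_gbp are the measures.
CREATE TABLE fct_order_item (
    order_id       VARCHAR(32),
    user_id        VARCHAR(32) REFERENCES dim_user (user_id),
    product_id     VARCHAR(32) REFERENCES dim_product (product_id),
    order_date_id  INTEGER     REFERENCES dim_date (date_id),
    quantity       INTEGER,
    unit_price_gbp DECIMAL(10, 2)
);
```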

For a detailed introduction to Dimensional Modelling I highly recommend "The Data Warehouse Toolkit" by Ralph Kimball & Margy Ross.

In addition to fact and dimension tables, it's common for other tables to be present, predominantly aggregate tables, which are used to "roll up" facts to a higher level of granularity (usually for performance reasons). For example, an aggregate table might provide a summary view of the value of all orders across all categories by financial quarter in order to improve the performance of dashboards for senior management.
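
As a sketch of what such a roll-up could look like over the illustrative tables above (table and column names are assumptions):

```sql
-- Quarterly order value by category, summarised from the fact and dimension tables.
CREATE TABLE agg_order_value_by_quarter AS
SELECT
    d.financial_quarter,
    p.category_name,
    SUM(f.quantity * f.unit_price_gbp) AS total_order_value_gbp
FROM fct_order_item f
JOIN dim_date    d ON d.date_id    = f.order_date_id
JOIN dim_product p ON p.product_id = f.product_id
GROUP BY d.financial_quarter, p.category_name;
```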

Transformation Tool - DBT

There are a plethora of tools on the market that can be used to transform data, from ETL tools such as Informatica and Talend to ELT tools such as Fivetran and Matillion. AWS also provides Glue for ETL, however I personally think this is an expensive option and more complicated than is required for our transformation requirements.

I've chosen DBT for our transformation tool for several reasons:
  1. The data is already available to Redshift so only the "T" of ETL is required.
  2. Data structures can be specified using fairly simple SQL Select statements.
  3. Orchestration of SQL statement execution and dependency management is powerful yet simple.
  4. DBT can generate a documentation website that includes lineage.

Transforming the data

Process

The basic process of transforming the data is shown below.



The first step is to fetch all the new records from the data lake (if you're using a technology that performs in-place updates, such as Apache Iceberg, this would be new or updated records). If you are not expecting to handle in-place updates, then it should be efficient to pull new records based on a date/time taken from the target fact tables (or an ETL audit table) and use it for partition pruning to reduce data lake query costs and execution time.
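
As a rough sketch of the "new records only" query (the schema, table and column names are hypothetical), assuming load_date is the partition column of the lake table:

```sql
-- Fetch only new order items from the lake. Filtering on the partition column
-- (load_date here) limits how much lake data the external query has to scan.
SELECT *
FROM lake_sales.order_items              -- external schema pointing at the data lake
WHERE load_date > DATE '2024-01-31';     -- watermark taken from the target fact table
                                         -- or an ETL audit table at run time
```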

Once the new records have been obtained, they should be used to determine what inserts or updates need to be made to the dimension tables. This should be done before the fact tables are loaded because of the foreign key relationship: the appropriate dimension IDs need to be generated first. Note that Redshift doesn't enforce foreign key constraints, and when using hashing to generate IDs with DBT (discussed later) you don't strictly have to load the dimensions first, but it still makes sense logically to do so.

Once the dimension tables have been loaded, the fact tables can then be loaded.

Finally, any aggregate tables that are being used can be recreated from the fact and dimension tables.

Transformation

Before the data can be transformed, the data lake Glue databases need to be exposed to Redshift as external schemas (AWS provides instructions on how to do this here). I'd recommend following a naming convention that prefixes the external schemas, such as lake_, lk_dl_ or ext_, to identify data held in the lake. The reason for this is that we want to make it clear which schemas are stored within Redshift (and are therefore cheap and fast to query if using provisioned clusters) and which are stored in the lake and subject to per-query charging. I highly recommend setting up permissions so that only engineers and certain trusted analysts have access to the external schemas, rather than exposing them to everyone.
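
For reference, exposing a Glue database to Redshift looks roughly like this; the database, schema and IAM role names below are placeholders:

```sql
-- Expose the Glue database 'sales_lake' to Redshift as the external schema lake_sales.
CREATE EXTERNAL SCHEMA lake_sales
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';
```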

DBT uses SQL Select statements to create what it refers to as models (tables and views in the database). The DBT materialisation determines whether a view or a table is created. For the purposes of our example architecture, I'm going to assume we are interested in creating three types of model:
  1. Staging models
  2. Target Models
  3. Aggregate Models

Staging Models

Staging models are used to prepare data prior to loading the target fact and dimension models. They should standardise field names and perform any necessary type conversions. For example, if there are multiple sources that refer to the same value (e.g. customer_number) but each source in the lake gives it a different name (e.g. custnum, custkey, etc.), then the staging model for each source should rename the field to customer_number and convert the value to the appropriate type. I suggest using the ephemeral materialisation for staging models so that partition pruning works correctly when the staging model is used by the target model, reducing the amount of data queried in the lake. Make sure you set up sources and use the source macro to enable the docs generator to create lineage diagrams.
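
As an illustration, a staging model for one of the lake sources might look like the following; the source, table and column names are assumptions:

```sql
-- models/staging/stg_shop_orders.sql (illustrative): standardise names and types
-- for one lake source. Ephemeral, so it is inlined into the models that use it.
{{ config(materialized='ephemeral') }}

SELECT
    order_id,
    CAST(custnum AS VARCHAR(32))   AS customer_number,   -- standardised field name
    prod_code                      AS product_code,
    CAST(order_ts AS TIMESTAMP)    AS order_timestamp,
    CAST(qty AS INTEGER)           AS quantity,
    CAST(price AS DECIMAL(10, 2))  AS unit_price_gbp,
    load_date                                             -- keep the lake partition column
FROM {{ source('lake_sales', 'order_items') }}
```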

Target Models

Target models are the fact and dimension tables we want to load the transformed data into. These models should use the incremental materialisation as they need to be persistent, using the latest date/time from the target table (or the run time from an audit table) to filter new records via the staging model. It is essential to filter on the partition key of the data lake tables to avoid fetching all of the data from the lake every time the process runs. Make sure you refer to the staging models using the ref macro to enable the docs generator to create lineage diagrams.
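
Continuing the sketch above, a fact model built from the hypothetical staging model might look like this:

```sql
-- fct_order_item.sql (illustrative): incremental load of new rows only, filtered
-- via the staging model so the lake partition column can prune partitions.
{{ config(materialized='incremental') }}

SELECT
    order_id,
    customer_number,   -- in practice hashed into a dimension key (see Surrogate keys)
    product_code,
    CAST(TO_CHAR(order_timestamp, 'YYYYMMDD') AS INTEGER) AS order_date_id,
    quantity,
    unit_price_gbp,
    load_date
FROM {{ ref('stg_shop_orders') }}
{% if is_incremental() %}
  -- On incremental runs, fetch only rows newer than the latest already loaded.
  WHERE load_date > (SELECT MAX(load_date) FROM {{ this }})
{% endif %}
```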

Aggregate Models

Aggregate models are the summary aggregate tables we want to create for improving dashboard performance and are based on the contents of the target models. These models should generally use the table materialisation as they are likely to need to be re-created every time the process runs. Make sure you refer to the target models using the ref macro to enable the docs generator to create lineage diagrams.
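
As a final piece of the sketch, the quarterly roll-up from the Data Model section could be expressed as a dbt model like this (dim_date and its columns are assumed):

```sql
-- agg_order_value_by_quarter.sql (illustrative): rebuilt in full on every run.
{{ config(materialized='table') }}

SELECT
    d.financial_quarter,
    SUM(f.quantity * f.unit_price_gbp) AS total_order_value_gbp
FROM {{ ref('fct_order_item') }} f
JOIN {{ ref('dim_date') }} d
  ON d.date_id = f.order_date_id
GROUP BY d.financial_quarter
```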

Surrogate keys

DBT provides a macro called generate_surrogate_key in the dbt-utils package to generate surrogate keys. This macro generates an MD5 hash of the fields you specify as parameters, and DBT Labs recommend this approach as it is portable across multiple data platforms. This goes against the "traditional" data warehousing approach of using automatically incrementing integers. The performance impact of storing the MD5 hash in a varchar field isn't generally too bad; however, if you want to use numeric primary keys for performance, you can override this macro and use the fnv_hash function that Redshift provides (at the cost of platform portability).
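
As a hedged illustration, a dimension model might call the macro like this (stg_shop_products and its columns are hypothetical):

```sql
-- dim_product.sql (illustrative): hash the natural key into a surrogate key.
SELECT
    {{ dbt_utils.generate_surrogate_key(['product_code']) }} AS product_key,
    product_code,
    product_name,
    category_name
FROM {{ ref('stg_shop_products') }}
```

If you do decide to switch to fnv_hash, one route is to define your own version of the macro in your project and use dbt's dispatch configuration so your implementation takes precedence over the dbt-utils one; the dbt documentation on macro dispatch covers the details.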

Execution

In our example architecture, we are using ECS to execute the DBT process by creating a docker image that contains the DBT code to execute and an entry point that runs the DBT process when the container starts (using dbt run and dbt test). The ECS job can then be triggered using an EventBridge rule with a Cron schedule expression that will run the DBT process at the same time every day (for example at 01:00 UTC). 

A CI/CD pipeline should be used to build the docker image and upload it to AWS ECR.

Documentation

DBT can generate a documentation website that includes the descriptions from DBT's yaml files for views, tables & columns, a lineage diagram showing which models rely on data from others and the SQL used to create the data models.

DBT includes a simple web server to host the documentation website, but this is not recommended for production use. In our example architecture, we would run the DBT docs website as an ECS service, meaning it would be a docker image that includes a web server (such as Nginx). It's highly recommended to secure the docs website in some way (such as by only exposing it to internal networks and/or protecting it with your organisation's Single Sign-On mechanism).

The CI/CD pipeline that creates the docker image to execute the DBT process should also be used to build the docker image for the DBT docs website and publish it to ECR.

Up Next

The first five parts of the series have covered the majority of what is required to set up a simple modern architecture in AWS. The next and final part of the series will wrap up by briefly discussing reporting and a few other small topics.
