
Building a Data Platform: Part 1 - Concepts

Introduction

This is the first part of a series about how to build a basic modern data platform that should be sufficient for most small-to-medium-sized businesses. The series will focus on the key architecture elements, why they are needed and the potential technologies that can be used. I'll only discuss batch processing initially, but I may cover real-time/streaming ingestion in a future post.

This post covers the conceptual elements of a modern data platform, which you can use as a guide if you're starting out and as a reference when considering which services or tools will meet your requirements. There isn't a "best" technology stack when building a data platform; it's all trade-offs. For example, ETL tools are easier to use than writing code but are often harder to integrate into CI/CD pipelines.

Conceptual Data Platform

Conceptually, let's define a data platform as a collection of technologies that enables an organisation to use its data to the fullest. Generally the platform will be expected to:

  1. Ingest data from multiple sources
  2. Transform data so it can be used by consumers
  3. Serve transformed data to consumers

The diagram below shows the conceptual elements of a basic batch data platform. There are a variety of elements and we'll discuss each of them in this section. Note that some technologies and products can support multiple conceptual elements at once, and we'll see this in the example data platform later.



Data Sources

These are systems and processes that generate data that needs to be loaded into the data platform. These are usually operational systems that run business processes such as customer management and order processing. However, these can also include the outputs of data science models (e.g. a new customer segment, a sales forecast, etc) and other analytics systems (e.g. click-stream data or the results of an attribution model).

The way data can be obtained from a source varies depending on the source but common examples are:

  • Periodic export of a file containing data from the business system (CSV, JSON, etc)
  • Directly extracting data from the source system's database or via a replica of the database
  • Calling an API that provides methods for extracting the data held in the source system

Batch Ingestion

The batch ingestion element is responsible for acquiring data and loading it into the raw storage area. There will often be a variety of ingestion processes, as different sources provide different ways to obtain their data.

A batch ingestion process will:
  1. Extract data from a source system (DB, API, file, etc)
  2. Add metadata fields to the data set
  3. Store the dataset in raw storage

Adding metadata fields to a dataset when storing it is useful for understanding data lineage (a minimal ingestion sketch follows the list below). Example metadata fields include:
  • Source system the data was extracted from
  • The date and time the data was ingested
  • The name/id of the process that ingested the data
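
To make those steps concrete, here is a minimal sketch in Python using pandas. The bucket, paths, column names and the use of s3fs for writing to S3 are all assumptions for illustration rather than a prescribed implementation.

```python
# Minimal batch ingestion sketch: extract a CSV export, add lineage metadata
# and store it in raw storage as Parquet. Paths and names are hypothetical;
# writing directly to s3:// paths assumes pandas + pyarrow + s3fs.
from datetime import datetime, timezone

import pandas as pd

SOURCE_SYSTEM = "crm"                                # source system identifier
PROCESS_NAME = "crm_customers_daily"                 # name/id of this ingestion process
RAW_PATH = "s3://example-raw-bucket/crm/customers"   # hypothetical raw storage location


def ingest_customers(export_file: str) -> str:
    # 1. Extract data from the source system (a file export in this case)
    df = pd.read_csv(export_file)

    # 2. Add metadata fields for data lineage
    ingested_at = datetime.now(timezone.utc)
    df["_source_system"] = SOURCE_SYSTEM
    df["_ingested_at"] = ingested_at.isoformat()
    df["_ingestion_process"] = PROCESS_NAME

    # 3. Store the dataset in raw storage, partitioned by ingestion date
    target = f"{RAW_PATH}/ingest_date={ingested_at:%Y-%m-%d}/customers.parquet"
    df.to_parquet(target, index=False)
    return target


if __name__ == "__main__":
    print(ingest_customers("customers_export.csv"))
```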

Batch ingestion usually involves a variety of technologies due to the variety of source systems and the need for cost efficiency; for large datasets, services such as AWS Batch, ECS or EMR are likely required. However, the cost and start-up time of an ECS cluster may be hard to justify when ingesting a few hundred rows from a CSV file; in that scenario, a Lambda function is likely to be cheaper to run and faster to develop.

Note that data validation doesn't take place during ingestion, as the raw data should be stored in a format as close to the original as possible. Data quality checks can be run once the data has been loaded into raw data storage, as desired.
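
As a rough illustration of such a post-load check (the path and column names are hypothetical), a simple script could verify that the newly landed dataset is non-empty and has sane keys:

```python
# A simple post-load quality check over a freshly ingested raw dataset.
# The path and column names are hypothetical.
import pandas as pd

df = pd.read_parquet("s3://example-raw-bucket/crm/customers/ingest_date=2024-01-01/")

checks = {
    "has_rows": len(df) > 0,
    "customer_id_not_null": df["customer_id"].notna().all(),
    "customer_id_unique": df["customer_id"].is_unique,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"Raw data quality checks failed: {failed}")
```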

Raw Data Storage

The raw data storage area is where data is stored in a format as close as possible to the source system's, with minimal transformation. Raw data storage is usually provided by an object store such as AWS S3 or Azure Data Lake Storage.

When data is written to raw storage, it is usually converted into a format that is compatible with a query engine and efficient to query. This is because raw storage is often made available to a subset of data users in the organisation in the form of a data lake, so making the files easier to query is beneficial. When processing data that was provided as a file, it's often useful to copy the original file into an archive area so it can be referred back to or reprocessed if there are any issues.

Making raw data available for querying provides a number of benefits (a small query example follows the list):
  • It enables engineers to view the data and test their ingestion processes
  • Data consumers can use the data for proof-of-concepts
  • Easy integration for data quality tooling
  • A SQL-oriented data transformation process can consume the data easily
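
For example, a lightweight engine such as DuckDB can query Parquet files in the raw store directly; the path and column names below are hypothetical, and in a cloud platform this role is more likely to be played by a service such as Athena, BigQuery or a Trino/Spark cluster.

```python
# Querying raw Parquet files directly with DuckDB (hypothetical path/columns).
import duckdb

summary = duckdb.sql(
    """
    SELECT _source_system, COUNT(*) AS row_count
    FROM read_parquet('raw/crm/customers/*/*.parquet')
    GROUP BY _source_system
    """
).df()
print(summary)
```
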
The key requirements for raw data storage are:
  • Scalability
  • The ability to store a variety of data (e.g. structured data, image, text, audio and video files)
  • Low cost
  • Resilience

Scalability and low cost are key (albeit relative) requirements due to the sheer volume of data that will need to be stored - ideally all useful data from all source systems.

Data Transformation

Data transformation typically involves combining the various data sources to load a conformed data model that is as high-quality as possible. Data models will be discussed in the Transformed Data Storage section.

Data transformation often begins as one large process made up of various sub-elements chained together, but it can be split out into independently scheduled processes (due to time constraints on when source data are received).

Data transformation is a crucial element of a data platform as it has the greatest impact on the user base - from the data structures and transformations themselves up to when the transformed data will be available to use each day.

Generally it's better to use a library or tool for data transformation than to roll your own: this is where most engineering effort will take place, so it's important to make it as accessible to new members of the team as possible without compromising your platform.

The key requirements for data transformation are:
  • Performance
  • Flexibility

High performance is essential as the transformed data cannot be consumed each day until the batch process is complete, and although the batch window won't be able to grow, the volumes and sources of data definitely will.

Flexibility is important as the raw data can be stored in a variety of ways and the target model is likely to be very different, so the ability to create any transformation you want is highly desirable. This can be alleviated somewhat by using a query engine pointing at the raw data store as the input for the data transformation tool.
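
As a rough sketch of the kind of work involved (the paths, table and column names are hypothetical, and at scale a dedicated tool such as dbt or Spark would normally be used rather than single-machine pandas), a transformation step might join raw datasets into an analysis-ready table:

```python
# A simplified transformation step: join raw extracts into an analysis-ready
# orders table. Paths and column names are hypothetical.
import pandas as pd

customers = pd.read_parquet("raw/crm/customers/")
orders = pd.read_parquet("raw/orders/order_lines/")

fct_orders = (
    orders
    .merge(
        customers[["customer_id", "customer_name", "segment"]],
        on="customer_id",
        how="left",
    )
    .assign(order_value=lambda d: d["quantity"] * d["unit_price"])
    [[
        "order_id", "order_date", "customer_id", "customer_name",
        "segment", "quantity", "unit_price", "order_value",
    ]]
)

# In practice this would be loaded into the transformed data storage layer
# (a data warehouse); a local Parquet file stands in for it here.
fct_orders.to_parquet("warehouse/fct_orders.parquet", index=False)
```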

Transformed Data Storage

The transformed data must be stored in a technology that provides fast access, high throughput and high concurrency. Normally this is some form of data warehouse technology such as Amazon Redshift or Snowflake. Although using a standard RDBMS such as PostgreSQL is possible, it's generally inadvisable given the different workload types between an operational DB and an analytical DB. Specialist data warehousing products and platforms have lots of clever tricks that make them fast at querying millions and billions of rows, such as columnar storage and compression encodings. The reverse is also true: using a data warehouse platform for transactional workloads will be very slow and expensive.

There are various data modelling approaches that can be used in transformed storage, including data vault, third normal form (3NF) and dimensional, with the latter being far and away the most popular. Dimensional models are designed to be simple for end users to use and fast to query, at the cost of some data duplication. To learn more about dimensional modelling I highly recommend The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling by Ralph Kimball and Margy Ross.

In order to handle online analytical processing (OLAP) by a large number of users, the transformed data storage should be:
  • Scalable
  • Fast
  • Resilient

Scalability is important as the volume of data will probably always grow, and speed is important to ensure users are able to obtain the results they need in a reasonable time. As organisations mature, they build a lot of important processes on top of transformed data (e.g. feeds to CRM systems, data science models for fraud detection, etc), meaning resilient (and ideally self-healing) storage is required to avoid an outage having a material impact on the business.

Query Engine

The query engine is the entry point for data consumers. As such it needs to support a variety of access methods (ODBC & JDBC as a minimum) and should be supported by the data transformation and BI tools that you select.

The key requirements for the query engine are:
  • Scalability (concurrency)
  • Speed
  • Resilience
  • Workload management

As the query engine will have myriad consumers performing varied work, from standard repeatable reports and extracts to ad hoc analysis by data scientists, it will need to support a high level of concurrency. The ability to elastically scale up and down is useful for cost efficiency - for example, running at 3x normal capacity on Monday morning for weekly report generation and at minimum capacity over weekends. Resiliency is important as key business processes often use the transformed data. Workload management is the ability to limit how much of the available capacity a user can consume; this helps protect the platform by ensuring poorly written queries can't take up all the available resources and prevent other processes from running. For services that operate on a pay-per-query basis (such as AWS Athena or Google BigQuery), the ability to restrict spend is useful to avoid bill shock.
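
As one illustrative sketch of spend control (the workgroup name, output bucket and limit below are assumptions, not a recommendation), AWS Athena allows a per-query scan limit to be set on a workgroup:

```python
# Creating an Athena workgroup that caps the data scanned per query.
# Workgroup name, result bucket and limits are illustrative.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="analytics-adhoc",
    Description="Ad hoc analysis with a per-query scan limit",
    Configuration={
        # Fail any query that would scan more than ~1 TB of data
        "BytesScannedCutoffPerQuery": 1024 ** 4,
        # Prevent individual queries from overriding the workgroup settings
        "EnforceWorkGroupConfiguration": True,
        "ResultConfiguration": {
            "OutputLocation": "s3://example-athena-results/adhoc/"
        },
    },
)
```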

Usually, the query engine is provided by the same technology as the transformed data storage - for example Amazon Redshift or Snowflake. However, there are products such as Starburst that fully separate compute and storage.

Data Consumers

A data consumer can be any individual or process that uses the transformed data. Generally their work falls into one of these categories:
  • Reporting
  • Ad Hoc Analysis
  • Data Science
  • Outbound ETL to other systems

Data Catalogue

A Data Catalogue is a central repository of metadata that provides details of the data held by an organisation and can be searched by users. A data catalogue is extremely useful for analysts and other data consumers as it enables them to find out what data is held by the organisation as well as understand its meaning, structure and lineage.

A data catalogue usually contains the following (a simplified sketch of a catalogue entry follows the list):
  • The collection of datasets the organisation holds
  • How to gain access to the datasets
  • The owners/stewards of each dataset
  • Descriptions of tables, columns, etc in the various data stores as well as details about the meaning of each attribute.
  • Lineage information so that users can know where data came from (a very common question from analysts).
  • Links to any additional documentation
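
As a simplified illustration (the fields and values below are one possible shape for a catalogue record, not a standard or any specific product's schema), an individual entry could be represented as:

```python
# A simplified, illustrative shape for a single data catalogue entry.
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    dataset: str                       # e.g. "warehouse.fct_orders"
    description: str
    owner: str                         # data owner or steward
    access_instructions: str           # how to gain access to the dataset
    columns: dict = field(default_factory=dict)            # column -> meaning
    upstream_sources: list = field(default_factory=list)   # lineage
    documentation_links: list = field(default_factory=list)


entry = CatalogueEntry(
    dataset="warehouse.fct_orders",
    description="One row per order line, conformed across source systems",
    owner="sales-data-team@example.com",
    access_instructions="Request the 'analytics-reader' role via the IT portal",
    columns={"order_value": "quantity multiplied by unit price at order time"},
    upstream_sources=["crm.customers", "orders.order_lines"],
    documentation_links=["https://wiki.example.com/data/fct_orders"],
)
```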

Documentation

Documentation is one of the most critical and most often overlooked parts of a data platform. Documentation isn't just useful for end users but also for the engineers themselves. Data platforms generally have a long lifetime, so when a key ingestion process breaks five years after it was developed and everyone who worked on it has left the organisation, good documentation about how it works is essential for fixing the problem and reducing the impact to the business.

End User Documentation should include:
  • Where to find the data catalogue and how to gain access to it
  • Details of any maintenance windows
  • Who to contact if there is an issue
  • Best practices for using the data
Engineer documentation should include:
  • List of ingestion and transformation processes including:
    • The process' schedule (e.g. Daily at 06:00 UTC)
    • Detailed description of the process
    • Where the source code is located
    • Known errors and associated recovery actions
    • Dependencies
  • Details of any maintenance windows
  • List of systems the platform depends on
  • List of systems that depend on the platform
  • "How to" guides for development and maintenance

Conclusion

In this post I've presented my conceptual view of a basic data platform for a small-to-medium enterprise, so this is just my opinion (although it is based on many successful deployments). A huge amount of research and effort has gone into tackling the problems of managing data and making it available for analysis, and there are many great books written by incredibly clever people, so please do read up as much as possible before building a data platform.

Always remember that the various technologies you use to build your data platform will need to fit your circumstances and that there isn't a "best" technology. There are lots of great products and services out there and each major cloud vendor has a very comprehensive set of data services.


Up Next

In part 2 we'll walk through a simple example architecture and how it relates to the conceptual architecture.
