
Building a Data Platform: Part 2 - Example Data Platform

Introduction

In Part 1 we walked through a conceptual model of a basic data platform, covering the key capabilities and the requirements for each. In this part we'll discuss an example architecture that provides the minimum capabilities for a modern data platform.

Conceptual to Logical Architecture

The diagram below shows a potential logical architecture for building our conceptual data platform. Before we dive into discussing the elements of it, note that the logical architecture is always going to depend on your situation and decisions will be affected by:

  • The skills you have within your organisation
  • Which cloud provider you use (assuming you're even in the cloud)
  • Existing technology stack
  • The appetite for change to the technical estate

The key things to note are:

  • All conceptual capabilities have a technology to provide them.
  • Some technologies provide multiple conceptual capabilities (e.g. Redshift providing the transformed data storage and query engine)
  • This architecture supports ingestion from APIs, files and databases.
  • I've recommended serverless and managed services wherever possible to make it easier to get started with implementing the architecture.

Logical Architecture

The diagram below shows the logical architecture without the conceptual overlay and makes it easier to see how data flows through the platform.


Ingestion

There are two mechanisms for ingestion: Lambda functions and ECS tasks. Lambda functions are useful for ingesting small datasets in a fast and inexpensive fashion. I'd recommend writing ingestion processes using Python and the AWS Data Wrangler library, which makes storing data in S3 and using Glue as a data catalogue very easy.
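As a rough sketch (the API endpoint, bucket, database, and table names are placeholders I've made up), a Lambda ingestion function using AWS Data Wrangler might look something like this:

```python
import awswrangler as wr
import pandas as pd
import requests


def handler(event, context):
    """Lambda entry point: pull a small dataset from an API and land it in the raw bucket."""
    # Hypothetical source API - replace with your own source system
    response = requests.get("https://api.example.com/orders")
    response.raise_for_status()
    df = pd.DataFrame(response.json())

    # Write Parquet to the raw bucket and register the table in the Glue catalogue
    wr.s3.to_parquet(
        df=df,
        path="s3://my-raw-bucket/orders/",
        dataset=True,
        database="raw",
        table="orders",
        mode="append",
    )
    return {"rows_ingested": len(df)}
```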

For larger datasets, or processes that will exceed the limits of Lambda (e.g. take more than 15 minutes to run), Python jobs running as ECS tasks are an excellent choice. There is a wide variety of libraries and frameworks that can help with processing data, from AWS Data Wrangler to PySpark. Note that AWS Batch provides a wrapper around ECS for batch jobs that you may prefer to work with; I haven't included it in the architecture as I've tried to minimise the number of technologies and services involved.
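For illustration only (the cluster, task definition, subnet, and security group identifiers are invented placeholders), a scheduler or orchestrator could launch one of these ingestion containers on Fargate using boto3:

```python
import boto3

ecs = boto3.client("ecs")

# Launch a one-off ingestion task on Fargate; names and network settings are placeholders
response = ecs.run_task(
    cluster="data-platform",
    taskDefinition="ingest-large-dataset",
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {"name": "ingest", "environment": [{"name": "DATASET", "value": "orders"}]}
        ]
    },
)
print(response["tasks"][0]["taskArn"])
```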

Raw Storage

Raw storage is provided by AWS S3 buckets. The primary bucket is the "raw" bucket, which is where all of the ingested data should be stored. At this point only minimal transformation should have occurred, which would include:
  • Transforming the data into Parquet files
  • Adding metadata fields such as ingestion date and time

Note that there are plenty of file formats to choose from, but Parquet is popular and easy to work with.
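A minimal sketch of that metadata enrichment step (the column, bucket, and table names are placeholders I've chosen for illustration):

```python
import datetime

import awswrangler as wr
import pandas as pd


def write_raw(df: pd.DataFrame, source_file: str) -> None:
    """Add ingestion metadata and land the data as Parquet in the raw bucket."""
    now = datetime.datetime.now(datetime.timezone.utc)
    df["_ingested_at"] = now
    df["_source_file"] = source_file
    df["_ingestion_date"] = now.date().isoformat()

    # Partition by ingestion date so reprocessing and incremental loads are straightforward
    wr.s3.to_parquet(
        df=df,
        path="s3://my-raw-bucket/customers/",
        dataset=True,
        database="raw",
        table="customers",
        partition_cols=["_ingestion_date"],
        mode="append",
    )
```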

The archive bucket is used to store copies of the original files from source systems that provide file-based extracts. This is useful should data need to be reprocessed due to a bug in the transformation code or some other issue.
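As an illustrative sketch (the bucket names and object keys are placeholders), archiving an original extract is a simple S3 copy:

```python
import boto3

s3 = boto3.client("s3")

# Keep an untouched copy of the original extract so it can be reprocessed later
s3.copy_object(
    Bucket="my-archive-bucket",
    Key="source-system/2024-01-01/extract.csv",
    CopySource={"Bucket": "my-landing-bucket", "Key": "extract.csv"},
)
```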

AWS Glue provides a simple catalogue that, once populated, allows Redshift and Athena to query the data held in files in S3. Ingestion processes should ensure that whenever they produce new datasets, or write new data to existing datasets, the Glue catalogue is updated with the new tables and partitions.
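Once the catalogue is populated the raw data can be queried directly, for example via Athena using AWS Data Wrangler (the database and table names follow the earlier sketches):

```python
import awswrangler as wr

# Query the raw Parquet files in S3 through Athena using the Glue catalogue
df = wr.athena.read_sql_query(
    "SELECT _ingestion_date, COUNT(*) AS row_count "
    "FROM customers GROUP BY _ingestion_date",
    database="raw",
)
print(df.head())
```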

You should ensure that all S3 buckets are secured using appropriate access controls (e.g. IAM roles) and have public access blocked.
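Bucket policies and IAM roles are best managed through your IaC tool, but as an illustrative boto3 sketch (the bucket name is a placeholder), blocking public access looks like this:

```python
import boto3

s3 = boto3.client("s3")

# Block all forms of public access on the raw bucket
s3.put_public_access_block(
    Bucket="my-raw-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```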

Data Transformation, Storage & Query Engine

The transformed data storage and query engine are provided by Redshift. Redshift provides a fairly low-cost data warehousing platform with features that make it very useful as the core of our data platform:

  • Columnar storage for faster queries
  • Elastic scalability (especially with Redshift serverless)
  • The option to separate compute and storage depending on cluster type
  • Ability to query the raw data storage directly using Redshift Spectrum, which integrates with the Glue catalogue (sketched below)
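As a rough sketch of that last point (the connection details, IAM role ARN, and names are placeholders - in practice pull credentials from Secrets Manager), exposing the Glue catalogue to Redshift via Spectrum looks something like this:

```python
import redshift_connector

# Connection details are placeholders - don't hard-code credentials in real code
con = redshift_connector.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    database="analytics",
    user="admin",
    password="change-me",
)
cursor = con.cursor()

# Expose the Glue "raw" database as an external schema via Redshift Spectrum
cursor.execute(
    "CREATE EXTERNAL SCHEMA IF NOT EXISTS raw "
    "FROM DATA CATALOG DATABASE 'raw' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'"
)

# Query the raw files in S3 directly from Redshift
cursor.execute("SELECT COUNT(*) FROM raw.customers")
print(cursor.fetchone())
```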

Because we are able to query data held in both raw and transformed storage, it is possible to use a SQL-based tool for the data transformation capability. I've chosen Data Build Tool (DBT) for this example architecture as it's available for free and easy to learn if you know SQL and basic Python. DBT is a tool for writing data transformations as SQL select statements (which it wraps as Create Table As Select - CTAS - statements). DBT provides a number of features to help with transformations, including incremental table loads, reusable macros, and simple customisation.

The SQL statements that DBT generates are executed against the Redshift cluster, and it's possible to scale both the number of concurrent threads that DBT uses and the size of the Redshift cluster to keep transformations fast. Additionally, DBT is able to generate a documentation website, which is discussed later in the Data Catalogue & Documentation section.
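If you want to trigger DBT from Python (for example from an ECS task), recent versions of dbt-core expose a programmatic runner. This is a rough sketch with placeholder project and model names; most teams simply call the dbt CLI from their scheduler instead:

```python
from dbt.cli.main import dbtRunner

# Programmatically invoke "dbt run" against the Redshift target defined in profiles.yml
dbt = dbtRunner()
result = dbt.invoke([
    "run",
    "--project-dir", "/opt/dbt/my_project",  # placeholder project location
    "--select", "marts.finance",             # placeholder model selection
    "--threads", "8",                        # scale concurrent model builds
])

if not result.success:
    raise RuntimeError("dbt run failed")
```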

Data Consumers

We're not going to spend much time on consumers as it's likely your organisation already has many data consumers in place.

AWS QuickSight provides a cost-effective reporting platform with a lot of excellent features. Each cloud provider has its own offering, such as Power BI on Azure and Looker on GCP.

For data science I've recommended SageMaker as it's the AWS data science platform, so it will integrate well with the other elements of this architecture. Different clouds have their own equivalents, and there are also vendors such as Databricks that offer their platform across several cloud providers.

For outbound batch data integration I've recommended ECS tasks again for simplicity and consistency; however, Lambda functions could also be used for smaller datasets.
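As a sketch of what such an outbound ECS task might do (the connection details, bucket, table, and IAM role are placeholders), Redshift's UNLOAD command can export a result set to S3 for a downstream consumer:

```python
import redshift_connector

# Connection details are placeholders - use Secrets Manager in practice
con = redshift_connector.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    database="analytics",
    user="batch_export",
    password="change-me",
)
cursor = con.cursor()

# Export a reporting table to S3 as Parquet for a downstream consumer
cursor.execute(
    "UNLOAD ('SELECT * FROM marts.daily_sales') "
    "TO 's3://my-outbound-bucket/daily_sales/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role' "
    "FORMAT AS PARQUET ALLOWOVERWRITE"
)
con.commit()
```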

Data Catalogue & Documentation

A basic data catalogue should provide a searchable list of the contents of a dataset and document the various tables and columns. One of the most powerful features of DBT is the ability to generate a documentation website for a project that contains this information, so DBT provides an MVP (minimum viable product) data catalogue. I'd recommend hosting the DBT documentation website using a dedicated web server (e.g. Nginx) in ECS rather than relying on the built-in server.

As discussed in Part 1, a documentation site is essential for providing useful information for users and engineers such as maintenance windows, schedules, how-to guides, etc. For this I've included Confluence as it's popular with many organisations, but any wiki software will suffice.

Not Pictured

There is a lot of supporting infrastructure that hasn't been included in the diagrams, such as IAM roles, VPCs, Security Groups, etc. I've left those out as they would make the diagrams too complicated for a blog post, but be aware that you will need them.

It is also essential to have good CI/CD pipelines to automate the build and release of as much of this platform as possible; there are various tools available both within AWS and from other vendors.

Finally, ensure you are using an Infrastructure-as-Code (IaC) tool as much as possible to limit drift between environments and to make releasing fast and easy. I'd recommend using Terraform for this purpose.

Potential improvements

Although this example architecture provides a complete batch data architecture, there is always room for improvement.

The ingestion processes will be fine for small-to-medium sized datasets but may struggle with very large ones. In that case, I'd recommend using something like AWS Elastic MapReduce (EMR) to run Spark jobs for those datasets that are too large to be handled by ECS tasks.
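For illustration, a minimal PySpark job of the kind you might submit to EMR for those very large datasets (the bucket paths and column names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

# Spark job for datasets too large for Lambda or a single ECS task
spark = SparkSession.builder.appName("large-dataset-ingest").getOrCreate()

events = spark.read.parquet("s3://my-landing-bucket/events/")

cleaned = (
    events
    .dropDuplicates(["event_id"])
    .withColumn("_ingested_at", F.current_timestamp())
)

# Write partitioned Parquet back to the raw bucket
cleaned.write.mode("append").partitionBy("event_date").parquet(
    "s3://my-raw-bucket/events/"
)
```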

The data catalogue provided by DBT is great for an MVP; however, it won't be enough for sophisticated data management and governance practices. There are a lot of good data catalogues on the market, such as Alation and Collibra.

Summary

This post introduced a basic example architecture based on AWS services to illustrate which technologies could be used to build out our conceptual architecture. It is important to note that technology choices depend upon the context of your organisation - e.g. an organisation with a heavy investment in Azure is less likely to use AWS for its data architecture.

Up Next

In Part 3 we'll talk about the types of individuals you will need to build this example architecture and how a team could be structured.
