Introduction
In Part 1 we walked through a conceptual model of a basic data platform that covered the key capabilities and the requirements for each. In this part we'll discuss an example architecture that provides the minimum capabilities for a modern data platform.
Conceptual to Logical architecture
The diagram below shows a potential logical architecture for building our conceptual data platform. Before we dive into discussing the elements of it, note that the logical architecture is always going to depend on your situation and decisions will be affected by:
- The skills you have within your organisation
- Which cloud provider you use (assuming you're even in the cloud)
- Existing technology stack
- The appetite for change to the technical estate
A few points to note about this architecture:
- All conceptual capabilities have a technology to provide them.
- Some technologies provide multiple conceptual capabilities (e.g. Redshift providing the transformed data storage and query engine)
- This architecture supports ingestion from APIs, files and databases.
- I've recommended serverless and managed services wherever possible to make it easier to get started with implementing the architecture.
Logical Architecture
Ingestion
There are two mechanisms for ingestion: Lambda functions and ECS tasks. Lambda functions are useful for ingesting small datasets quickly and inexpensively. I'd recommend writing ingestion processes in Python using the AWS Data Wrangler library, which makes storing data in S3 and using Glue as a data catalogue very easy.
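As a rough sketch (the API endpoint, bucket, and Glue database/table names below are hypothetical), a Lambda ingestion function using AWS Data Wrangler might look something like this:

```python
import awswrangler as wr
import pandas as pd
import requests


def handler(event, context):
    # Pull a small dataset from a (hypothetical) source API.
    response = requests.get("https://api.example.com/orders", timeout=30)
    response.raise_for_status()
    df = pd.DataFrame(response.json())

    # Write Parquet to the raw bucket and register/update the table in the Glue catalogue.
    wr.s3.to_parquet(
        df=df,
        path="s3://my-raw-bucket/orders/",
        dataset=True,
        database="raw",   # Glue database
        table="orders",   # Glue table
        mode="append",
    )
    return {"rows_ingested": len(df)}
```

Note that awswrangler and requests would need to be packaged with the function (e.g. via a Lambda layer).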
For larger datasets, or processes that will exceed the limits of Lambda (e.g. take more than 15 minutes to run), Python jobs running as ECS tasks are an excellent choice. There is a wide variety of libraries and frameworks that can help with processing data, from AWS Data Wrangler to PySpark. Note that AWS Batch provides a wrapper around ECS for batch jobs that you may prefer to work with; I haven't included it in the architecture as I've tried to minimise the number of technologies and services involved.
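A sketch of the ECS-task flavour of the same idea, pulling a large table from a source database in chunks so memory stays bounded (the connection string, table, and bucket names are placeholders):

```python
import os

import awswrangler as wr
import pandas as pd
from sqlalchemy import create_engine


def main() -> None:
    # e.g. postgresql+psycopg2://user:pass@host:5432/db, injected via the task definition
    engine = create_engine(os.environ["SOURCE_DB_URL"])

    # Stream the source table in chunks, writing each chunk to the raw bucket
    # as Parquet and keeping the Glue catalogue up to date.
    for chunk in pd.read_sql("SELECT * FROM transactions", engine, chunksize=100_000):
        wr.s3.to_parquet(
            df=chunk,
            path="s3://my-raw-bucket/transactions/",
            dataset=True,
            database="raw",
            table="transactions",
            mode="append",
        )


if __name__ == "__main__":
    main()
```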
Raw Storage
When writing data to the raw storage bucket, ingestion processes should apply some light standardisation (see the sketch after this list):
- Transforming the data into Parquet files
- Adding metadata fields such as ingestion date and time
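For illustration, a small helper along these lines could be applied before each write to raw storage (the column and parameter names are just examples):

```python
from datetime import datetime, timezone

import pandas as pd


def add_ingestion_metadata(df: pd.DataFrame, source_system: str) -> pd.DataFrame:
    """Add standard metadata columns before the data is written to raw storage as Parquet."""
    now = datetime.now(timezone.utc)
    df = df.copy()
    df["source_system"] = source_system
    df["ingestion_date"] = now.date().isoformat()  # handy as a partition column
    df["ingestion_timestamp"] = now.isoformat()
    return df
```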
The archive bucket is used to store copies of the original files from source systems that provide file-based extracts. This is useful should data need to be reprocessed due to a bug in the transformation code or some other issue.
AWS Glue provides a simple catalogue that, once populated, allows data held in files in S3 to be queried by Redshift and Athena. Ingestion processes should ensure that when they produce new datasets, or write new data to existing datasets, the Glue catalogue is updated with the new tables and partitions.
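When data is written with AWS Data Wrangler's dataset mode (as in the earlier sketches), the catalogue is updated automatically and the data becomes immediately queryable via Athena. A quick example of checking that a table is registered and querying it (database and table names are hypothetical):

```python
import awswrangler as wr

# List what the Glue catalogue knows about the table's columns.
print(wr.catalog.table(database="raw", table="transactions"))

# Query the catalogued data directly through Athena.
df = wr.athena.read_sql_query(
    "SELECT ingestion_date, COUNT(*) AS row_count FROM transactions GROUP BY ingestion_date",
    database="raw",
)
print(df.head())
```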
You should ensure that all S3 buckets are secured using appropriate access controls (e.g. IAM roles) and have public access blocked.
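In practice this is best enforced through your infrastructure-as-code, but as a minimal illustration, the equivalent call with boto3 (bucket name hypothetical) is:

```python
import boto3

s3 = boto3.client("s3")

# Block every form of public access on the raw bucket.
s3.put_public_access_block(
    Bucket="my-raw-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```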
Data Transformation, Storage & Query Engine
The transformed data storage and query engine are provided by Redshift. Redshift provides a fairly low-cost data warehousing platform with features that make it very useful as the core of our data platform:
- Columnar storage for faster queries
- Elastic scalability (especially with Redshift serverless)
- The option to separate compute and storage depending on cluster type
- Ability to query the raw data storage directly using Redshift Spectrum, which integrates with the Glue catalogue (see the sketch below)
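As a sketch of that last point (the Glue connection name and IAM role ARN below are placeholders), the raw data can be exposed to Redshift via a Spectrum external schema and then queried like any other table:

```python
import awswrangler as wr

# Connect to Redshift via a Glue connection (name is a placeholder).
con = wr.redshift.connect(connection="my-redshift-glue-connection")

# One-off setup: expose the Glue database "raw" as a Spectrum external schema.
cursor = con.cursor()
cursor.execute(
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS raw_spectrum
    FROM DATA CATALOG
    DATABASE 'raw'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
    """
)
con.commit()

# Raw files in S3 can now be queried alongside warehouse tables.
df = wr.redshift.read_sql_query(
    "SELECT COUNT(*) AS row_count FROM raw_spectrum.transactions",
    con=con,
)
con.close()
```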
Data Consumers
We're not going to spend much time on consumers as it's likely your organisation already has many data consumers in place.
AWS QuickSight provides a cost-effective reporting platform with a lot of excellent features. Each cloud provider has its own offering, such as Power BI on Azure and Looker on GCP.
For data science I've recommended SageMaker as it's the AWS data science platform, so it will integrate well with the other elements of this architecture. Different clouds have their own equivalents, and there are also vendors such as Databricks that offer their platform across several cloud providers.
For outbound batch data integration I've recommended ECS tasks again for simplicity and consistency; however, Lambda functions could also be used for smaller datasets.
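A sketch of such an outbound job, extracting a curated table from Redshift and publishing it as CSV to an outbound bucket (the connection, schema, and bucket names are placeholders):

```python
import awswrangler as wr

# Extract a curated dataset from Redshift for a downstream consumer.
con = wr.redshift.connect(connection="my-redshift-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM marts.daily_sales_summary", con=con)
con.close()

# Publish the extract as CSV to the outbound integration bucket.
wr.s3.to_csv(df, path="s3://my-outbound-bucket/daily_sales_summary.csv", index=False)
```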
Data Catalogue & Documentation
A basic data catalogue should provide a searchable list of the contents of a dataset and document the various tables and columns. One of the most powerful features of DBT is the ability to generate a documentation website for a project that contains this information, so DBT provides an MVP (minimum viable product) data catalogue. I'd recommend hosting the DBT documentation website using a dedicated web server (e.g. Nginx) in ECS rather than relying on the built-in server.
As discussed in Part 1, a documentation site is essential for providing useful information for users and engineers such as maintenance windows, schedules, how-to guides, etc. For this I've included Confluence as it's popular with many organisations, but any wiki software will suffice.
Not Pictured
There is a lot of supporting infrastructure that hasn't been included in the diagrams, such as IAM roles, VPCs, Security Groups, etc. I've left these out as they would be too complicated to cover in a blog post, but be aware that you will need them.
It is also essential to have good CI/CD pipelines to automate the build and release of as much of this platform as possible, and there are various tools available both within AWS and from other vendors.
Finally, ensure you are using an Infrastructure-as-Code (IaC) tool as much as possible to limit drift between environments and make releasing fast and easy. I'd recommend using Terraform for this purpose.
Potential improvements
Although this example architecture provides a complete batch data architecture, there is always room for improvement.
The ingestion processes will be fine for small-to-medium sized datasets but may struggle with very large ones. In that case, I'd recommend using something like AWS Elastic MapReduce (EMR) to run Spark jobs for those datasets that are too large to be handled by ECS tasks.
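If you do go down that route, the job itself can stay in Python as a PySpark script submitted to an EMR cluster; a rough sketch (bucket paths and dataset names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("large-dataset-ingestion").getOrCreate()

# Read a large raw extract (e.g. CSVs landed in the archive bucket),
# add ingestion metadata, and write partitioned Parquet to the raw bucket.
df = (
    spark.read.option("header", "true").csv("s3://my-archive-bucket/clickstream/")
    .withColumn("ingestion_date", F.current_date())
)

(
    df.write.mode("append")
    .partitionBy("ingestion_date")
    .parquet("s3://my-raw-bucket/clickstream/")
)

spark.stop()
```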
The data catalogue provided by DBT is great for an MVP; however, it won't be enough for sophisticated data management and governance practices. There are a lot of good data catalogues on the market, such as Alation and Collibra.
Summary
This post introduced a basic example architecture based on AWS services to illustrate which technologies could be used to build out our conceptual architecture. It is important to note that technology choices depend upon the context of your organisation - e.g. an organisation with a heavy investment in Azure is less likely to use AWS for their data architecture.
Up Next
In Part 3 we'll talk about the types of individual you will need to build this example architecture and how a team could be structured.