Building a Data Platform - Part 3 - The Team

Introduction

In Part 2 we walked through an example modern data architecture (shown below).

Example Architecture


In this third part we will discuss the people required to build it and the types of skills they need. This discussion will focus on three key areas of the platform:

  1. Ingestion of data into the lake
  2. Transformation of the data so it is ready for consumers
  3. The infrastructure to support these two processes
This means that, with the exception of analysts, roles focused on data sources (e.g. DBAs) or data consumers (e.g. Data Scientists) will not be discussed in this part.

Required Skills

The skills required for the example architecture can be broken down roughly into the following:
  • Infrastructure - Terraform, IAM, Networking & Security, ECS, S3, Glue
  • Data Ingestion - Python, S3, Glue, Lambda, ECS
  • Data Transformation - Redshift, SQL, DBT, Python, ECS
Some areas will overlap, and the level of skill required will vary across the team. For example, the Python knowledge needed to run DBT is significantly less than that required to build ingestion pipelines, as the sketch below shows.
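
To make the gap concrete, the snippet below is a minimal sketch of the Python typically needed to trigger DBT from a pipeline - little more than a call to its command line. The project path is a hypothetical placeholder, and it assumes DBT is already installed and configured; ingestion code is considerably more involved, as the sketch in the Data Engineer section later shows.

```python
# Minimal sketch: triggering DBT from Python is essentially just
# invoking its CLI. The project directory below is hypothetical.
import subprocess

def run_dbt() -> None:
    # DBT itself resolves model dependencies, compiles the SQL and
    # executes it against the warehouse (Redshift in our example).
    subprocess.run(
        ["dbt", "run", "--project-dir", "/opt/dbt/warehouse_project"],
        check=True,
    )

if __name__ == "__main__":
    run_dbt()
```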

Roles

This section discusses the roles required to build the example architecture. Note that role definitions can vary widely across the industry and are subjective to an extent, so these are only my definitions of the roles.

The table below shows a mapping of roles to the skills required. Note that this is not exhaustive and it is very common for individuals to have a mix of these skills rather than fitting neatly into one profile or another.


Roles

On the surface, it might appear that Data Engineers are all you need to be successful, but that isn't necessarily the case, as should become clear when we look at the roles in more detail.

Platform Engineer

Platform Engineers are cloud and infrastructure specialists who focus on making an organisation's cloud platforms available for use - this encompasses accounts, networking, security and other infrastructure-related tasks. Platform Engineers are often responsible for any shared infrastructure (e.g. networking that links accounts together, CI/CD platforms, etc.) as well as provisioning new cloud accounts. They have a deep knowledge of cloud providers and how their services interconnect, and they are essential within an organisation.

Data Engineer

Data Engineers are a specialised form of Software Engineer who focus on building the infrastructure and processes that integrate and store data. As such, Data Engineers have a very broad skillset across data technologies, frameworks and programming languages. Most data engineers will also have some platform engineering skills, including Infrastructure as Code (IaC). However, data engineers are data specialists, not cloud specialists, so while they can build out a lot of infrastructure, it's rare for them to set up a completely new cloud environment (i.e. an entirely new AWS organisation and accounts) without support from a platform engineer.
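
As an illustration of the ingestion side of the role, here is a hedged sketch of a small AWS Lambda handler that pulls records from a source API and lands them in the raw area of the lake. The endpoint, bucket name and key layout are hypothetical, and a production pipeline would add retries, pagination, schema validation and alerting on top of this.

```python
# Illustrative ingestion Lambda. The API endpoint, bucket name and
# key layout are hypothetical; real pipelines also need retries,
# pagination, schema validation and alerting.
import json
from datetime import datetime, timezone

import boto3
import requests

s3 = boto3.client("s3")

def handler(event, context):
    # Pull raw records from the (hypothetical) source system API.
    response = requests.get("https://api.example.com/orders", timeout=30)
    response.raise_for_status()
    records = response.json()

    # Land the raw data in the lake, partitioned by load date so
    # downstream Glue jobs can pick it up incrementally.
    load_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    s3.put_object(
        Bucket="example-data-lake-raw",
        Key=f"orders/load_date={load_date}/orders.json",
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"records_ingested": len(records)}
```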

A data engineer may not know DBT specifically, but their skills will allow them to pick it up very quickly, which is why it has been included in their role profile (DBT being a data transformation tool).

Analytics Engineer

The analytics engineer is a relatively new role profile, but one growing in popularity thanks to tools such as DBT. The analytics engineer is focused on transforming data so that it can be used by data consumers. Analytics engineers sit between the analyst and data engineer skillsets: they have enough engineering knowledge to build and deploy automated data transformation processes, and enough analytical knowledge to know how the data is likely to be used and therefore how to model it effectively.

Analyst

Analysts are focused on consuming data for reports and analyses. However, they have the core skills that make them great candidates for cross-training towards analytics engineering: DBT uses SQL to define transformations, so an analyst with some programming knowledge can pick up the analytics engineer skillset relatively easily.

Additional Key Roles

There are several additional roles that are key to a data team that don’t directly build the infrastructure or datasets:
  • Delivery Manager - Accountable for the team’s output and overall performance. Runs the various ceremonies, helps with blockers and assists with collaboration across teams.
  • Product Owner/Manager - Defines the roadmap for the data products and works with stakeholders to build a prioritised backlog for the engineers to implement. Sometimes combined with the Business Analyst role.
  • Business Analyst - This role gathers the detailed requirements from the data consumers and presents them as user stories, process maps and other documentation to guide the engineers in implementation. Sometimes combined with the Product Owner role.
  • Quality Assurance Engineer - Responsible for defining the testing strategy and approach. Often responsible for the creation and execution of test scripts and automated testing processes (a minimal sketch of such a test follows this list).
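
As a flavour of the QA Engineer's output, the snippet below is a minimal sketch of an automated data quality test using pytest. The order rows are stubbed with a fixture and the column names are hypothetical; a real suite would query the warehouse itself.

```python
# Minimal sketch of automated data quality tests. The sample rows
# stand in for a query against a (hypothetical) orders table.
import pytest

@pytest.fixture
def orders_sample():
    # In a real suite this fixture would fetch rows from the warehouse.
    return [
        {"order_id": 1, "customer_id": 10, "total": 25.00},
        {"order_id": 2, "customer_id": 11, "total": 40.50},
    ]

def test_order_ids_are_unique(orders_sample):
    order_ids = [row["order_id"] for row in orders_sample]
    assert len(order_ids) == len(set(order_ids))

def test_totals_are_non_negative(orders_sample):
    assert all(row["total"] >= 0 for row in orders_sample)
```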

Minimum Squad

Now that all the key roles have been described, a minimum squad can be defined.

Squad


Multiple Squads

There are two main approaches to having multiple squads working on the example data architecture:
  1. Vertical
  2. Horizontal
In the vertical approach, teams are defined based on slicing the architecture vertically into two sides: data ingestion and data transformation. With this approach there is a team responsible for ingesting data into the lake and running the underlying infrastructure (data engineering) and a team responsible for transforming data for delivery to consumers (analytics engineering).

By contrast, the horizontal approach has cross-discipline teams that build processes across all parts of the architecture but are defined by some form of domain or product set. For example, there could be one team responsible for ingesting and transforming customer and order data, and another responsible for warehousing and shipping data.

Which approach to use depends on the context of your organisation and a variety of factors, including:
  • The number of domains/data products
  • The role mix the organisation has / can hire
  • Where bottlenecks are (it could be far quicker and easier to ingest data than transform it or vice versa)
Although having multiple teams does increase the amount of data delivered to consumers, it also increases complexity. At this point it is common for additional roles, such as architects, to become involved to provide direction and governance.

Summary

This article outlined the skills required to build the example data architecture and mapped them to roles. It then defined a sensible minimum squad structure in terms of roles and the number of individuals required for each role. Finally, we discussed ways to scale the engineering function beyond a single team.

The key takeaway is that Data Engineers are essential for building a modern data architecture, but other roles are equally important, and which ones you need will depend on the skills your organisation already has available.

Up Next

The next part will look at data ingestion in more detail.
