
Posts

Building a Data Platform: Part 5 - Data Transformation (loading the warehouse)

Introduction

Once the data has been ingested into the data lake, it's time to make it available for reporting by transforming it into structures that are simple for a BI tool to query. This part of the series describes the data modelling approach, the transformation tool (DBT), and how to transform the data ready for reporting. The example architecture diagram is shown below.

Data Model

There are a variety of data modelling approaches that can be used to model a data warehouse. Popular approaches include:

- Dimensional
- Data Vault
- Third-Normal Form (3NF)

A comparison of the different approaches is beyond the scope of this article; for our example architecture we will use dimensional modelling. In my personal experience, dimensional modelling has been the dominant data warehouse modelling approach, and there are good reasons to use it, the most important being performance and ease of use for analysts and business users. Dimensional modelling has two main...
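To illustrate why dimensional models are easy to query, here is a minimal sketch of a star schema in plain Python: a hypothetical fact table of sales joined to a hypothetical product dimension, rolled up by category. The table and column names are invented for illustration only.

```python
# Hypothetical star schema: one fact table and one dimension table.
# A dimension holds descriptive attributes, keyed by a surrogate key.
dim_product = {
    1: {"product_name": "Widget", "category": "Hardware"},
    2: {"product_name": "Gadget", "category": "Electronics"},
}

# A fact table holds measures plus foreign keys into the dimensions.
fact_sales = [
    {"product_key": 1, "amount": 10.0},
    {"product_key": 2, "amount": 25.0},
    {"product_key": 1, "amount": 5.0},
]

# A typical BI query: join fact to dimension, group by an attribute.
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_key"]]["category"]
    revenue_by_category[category] = (
        revenue_by_category.get(category, 0.0) + row["amount"]
    )

print(revenue_by_category)  # {'Hardware': 15.0, 'Electronics': 25.0}
```

In SQL this is a single join and GROUP BY, which is exactly the shape of query BI tools generate well.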
Recent posts

Building a Data Platform: Part 4 - Ingesting data

Introduction

Having been through the concepts, architecture, and team required to build a modern data platform, it's time to focus on the stages of the data flow, beginning with ingestion. As in previous posts, the focus will be on building the example architecture from part 2.

Ingestion Workflow

The aim of the ingestion stage is to load data from a source into the data lake so that it can be queried. The amount of transformation should be kept to the bare minimum and should consist of:

- Translating the data into a storage format that is efficient for data lake usage (e.g. Parquet files)
- Standardising field names (e.g. converting all field names to lowercase and using underscores to separate words)
- Type conversion (e.g. if loading data from a CSV, converting any numeric fields to an appropriate numeric data type)
- Adding metadata to the records for lineage

The metadata to include on each record can include:

- The date and time the data was ingested
- The sou...
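The minimal transformations listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the post's actual pipeline code: the function names, the sample record, and the `numeric_fields` parameter are all hypothetical.

```python
import re
from datetime import datetime, timezone


def standardise_name(field: str) -> str:
    """Lowercase a field name and replace spaces/hyphens with underscores."""
    return re.sub(r"[\s\-]+", "_", field.strip()).lower()


def normalise_record(record: dict, source: str, numeric_fields: set) -> dict:
    """Apply the minimal ingestion transformations: name standardisation,
    type conversion, and lineage metadata."""
    out = {}
    for name, value in record.items():
        key = standardise_name(name)
        # Type conversion: values read from a CSV arrive as strings.
        out[key] = float(value) if key in numeric_fields else value
    # Lineage metadata added to every record.
    out["_ingested_at"] = datetime.now(timezone.utc).isoformat()
    out["_source"] = source
    return out


row = normalise_record(
    {"Order ID": "1001", "Total Amount": "19.99"},
    source="orders_csv",
    numeric_fields={"total_amount"},
)
# row["order_id"] == "1001", row["total_amount"] == 19.99
```

In practice the normalised records would then be written out as Parquet before landing in the lake.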

Building a Data Platform - Part 3 - The Team

Introduction

In Part 2 we walked through an example modern data architecture (shown below). In this third part we will discuss the people required to build it and the types of skills they need. The discussion focuses on three key areas of the platform:

- Ingestion of data into the lake
- Transformation of the data so it is ready for consumers
- The infrastructure to support these two processes

This means that, with the exception of analysts, roles focused on data sources (e.g. DBAs) or data consumers (e.g. Data Scientists) will not be discussed in this part.

Required Skills

The skills required for the example architecture can be broken down roughly as follows:

- Infrastructure - Terraform, IAM, Networking & Security, ECS, S3, Glue
- Data Ingestion - Python, S3, Glue, Lambda, ECS
- Data Transformation - Redshift, SQL, DBT, Python, ECS

Some areas will overlap, and the level of skill required will vary across the team. For example, the Python knowledge to run DBT is signifi...

Building a Data Platform: Part 2 - Example Data Platform

Introduction

In part 1 we walked through a conceptual model of a basic data platform that covered the key capabilities and the requirements for each. In this part we'll discuss an example architecture that provides the minimum capabilities for a modern data platform.

Conceptual to Logical Architecture

The diagram below shows a potential logical architecture for building our conceptual data platform. Before we dive into its elements, note that the logical architecture will always depend on your situation, and decisions will be affected by:

- The skills you have within your organisation
- Which cloud provider you use (assuming you're even in the cloud)
- Your existing technology stack
- The appetite for change to the technical estate

The key things to note are:

- All conceptual capabilities have a technology to provide them.
- Some technologies provide multiple conceptual capabilities (e.g. Redshift providing both transformed data storage and the query engine).

This...

Building a Data Platform: Part 1 - Concepts

Introduction

This is the first part of a series about how to build a basic modern data platform that should be sufficient for most small-to-medium-sized businesses. The series will focus on the key architectural elements, why they are needed, and potential technologies that can provide them. I'll only discuss batch processing initially, but I may cover real-time/streaming ingestion in a future post. This post covers the conceptual elements of a modern data platform, which can serve as a guide if you're starting out and help when considering which services or tools will meet your requirements. There isn't a "best" technology stack for a data platform; it's all trade-offs. For example, ETL tools are easier to use than writing code but are often harder to integrate into CI/CD pipelines. 

Conceptual Data Platform

Conceptually, let's define a data platform as a collection of technologies that enables an organisation to use its data to the fullest. Generally the...