Master the skills to become a highly effective data engineer with the modern data stack in 16 weeks

Topic 01
This section is designed not only to introduce you to the basics of ETL but also to equip you with hands-on experience using Python, one of the most versatile and widely used programming languages in the data engineering field. You’ll learn through practical examples, using tools and libraries that are vital for any aspiring data engineer. This foundation is crucial, as it supports advanced topics and tools such as Airflow for data orchestration and Airbyte for data integration, which you will encounter later in your data engineering career.
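As a taste of what’s ahead, here is a minimal ETL sketch in plain Python; the file names and field names are hypothetical, but the extract-transform-load shape is exactly what you’ll practice in this topic:

```python
import csv

def extract(path: str) -> list[dict]:
    """Extract: read raw rows from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: clean names and parse amounts, dropping blank values."""
    return [
        {"name": row["name"].strip().title(), "amount": float(row["amount"])}
        for row in rows
        if row["amount"]  # skip rows with a missing amount
    ]

def load(rows: list[dict], path: str) -> None:
    """Load: write the cleaned rows to a destination file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "amount"])
        writer.writeheader()
        writer.writerows(rows)

# Chain the three stages together into a pipeline run.
load(transform(extract("raw_orders.csv")), "clean_orders.csv")
```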
Topic 1 serves as the bedrock upon which the art and science of data engineering are built. By mastering ETL processes, you gain the ability to handle data efficiently, from extraction through transformation to loading it into a usable format. This foundational knowledge is not only critical for tackling more advanced data engineering challenges but also immensely valuable in a data-driven world.
The skills you acquire here, from managing Python environments to implementing sophisticated data transformations and automation, will prepare you for a successful career in data engineering. You’ll be able to design robust, scalable ETL pipelines that can handle the complexities of modern data ecosystems, making you an asset to any organization and propelling you to the forefront of the industry.
Topic 02
In this section of our data engineering bootcamp, we explore the Extract, Load, Transform (ELT) process, a methodology that has gained popularity with the rise of cloud technologies. This topic will not only broaden your understanding of data engineering concepts but also equip you with the practical skills needed to excel in this dynamic field.
In the realm of data engineering, mastering the ELT process represents a crucial competency, particularly in an era dominated by cloud computing and big data. This curriculum section not only equips you with the theoretical knowledge needed to understand the ELT framework but also provides hands-on experience with the tools and technologies currently shaping the industry. From learning how to efficiently extract and load data to performing complex transformations within databases, this topic ensures a comprehensive understanding of modern data engineering practices.
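To make the ELT pattern concrete before you touch cloud tooling, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a cloud warehouse (the table and column names are illustrative): raw data is loaded first, untouched, and the transformation happens inside the database with SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the raw data as-is, with no upfront transformation.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("alice", "10.50"), ("bob", "7.25"), ("alice", "3.00")],
)

# Transform: reshape the data inside the database using SQL.
conn.execute("""
    CREATE TABLE orders_by_customer AS
    SELECT customer, SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_orders
    GROUP BY customer
""")

print(conn.execute("SELECT * FROM orders_by_customer").fetchall())
```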
As you progress through this course, you’ll gain invaluable skills that are highly sought after in the job market. The practical knowledge of Python, SQL, and other tools you’ll acquire here is directly applicable to real-world scenarios, preparing you for a successful career in data engineering. By understanding the intricacies of ELT, you’ll be well-positioned to design and implement efficient data pipelines that can handle the volume, velocity, and variety of today’s data ecosystems. This knowledge not only makes you a valuable asset to any organization but also opens up a pathway to innovation and problem-solving within the vast landscape of data.
Topic 03
In Topic 3 of our data engineering bootcamp, we shift our focus towards the critical phase of productionizing pipelines. This section is designed to equip you with the expertise needed to containerize, build, and deploy ETL pipelines into a production environment, particularly within the cloud. Additionally, we delve into the essentials of code versioning and fostering team collaboration through Git.
As data engineering projects grow in complexity and scale, these skills become indispensable for ensuring that pipelines are not only functional but also maintainable, scalable, and seamlessly integrated into production workflows. By mastering these concepts, you’ll be well-prepared to navigate the challenges of deploying data pipelines in real-world scenarios, making you a valuable asset in the field of data engineering.
Mastering the deployment and management of ETL pipelines in a production environment is a significant milestone in a data engineer’s career. This topic not only introduces you to the technicalities of containerization with Docker and cloud services with AWS but also emphasizes the importance of code versioning and collaboration using Git.
These skills are fundamental in today’s data-driven landscape, where the ability to efficiently deploy, manage, and scale data pipelines is as crucial as the insights derived from the data itself.
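To give a flavor of what containerized deployment looks like in code, the sketch below drives Docker from Python via the docker SDK; it assumes a Dockerfile already exists in the project directory, and the image tag is a placeholder:

```python
import docker  # the Docker SDK for Python (pip install docker)

client = docker.from_env()  # connect to the local Docker daemon

# Build an image for the ETL project from its (assumed) Dockerfile.
image, build_logs = client.images.build(path=".", tag="etl-pipeline:latest")

# Run the pipeline as a container, much as a scheduler or cloud
# service would, and capture its log output.
output = client.containers.run("etl-pipeline:latest", remove=True)
print(output.decode())
```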
By the end of this topic, you’ll have a comprehensive understanding of the tools and practices needed to bring data engineering projects from development to production. This knowledge not only prepares you for the technical aspects of data engineering but also equips you with the collaborative and management skills necessary for working within modern data teams. The ability to productionize pipelines effectively ensures that your data projects are robust, scalable, and aligned with the evolving needs of businesses, positioning you as a key player in the field of data engineering.
Topic 04
In the modern data landscape, businesses are inundated with data from a myriad of sources: Customer Relationship Management (CRM) systems, Order Management Systems (OMS), accounting platforms, marketing tools, and much more. The task of crafting custom Extract and Load (E&L) logic for each of these data sources is not only time-consuming but also prone to inefficiency and errors.
Topic 4 of our data engineering bootcamp introduces a powerful solution to this challenge: Airbyte, an open-source data integration platform that automates the E&L processes, making data integration seamless and scalable.
This section is meticulously designed to provide a deep dive into Airbyte’s capabilities, from understanding its sources, destinations, and connections to mastering data extraction and loading patterns. By the end of this topic, you’ll be equipped with the knowledge to deploy Airbyte in real-world scenarios, significantly enhancing your skills in building efficient, reliable data integration pipelines.
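As one small illustration of that automation, a self-hosted Airbyte instance exposes an HTTP API that pipelines can call; the sketch below is a hedged example that assumes a local deployment exposing the configuration API on port 8000 and a hypothetical connection ID:

```python
import requests

AIRBYTE_URL = "http://localhost:8000/api/v1"  # assumed local Airbyte deployment
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical connection

# Trigger a manual sync of an existing source-to-destination connection.
response = requests.post(
    f"{AIRBYTE_URL}/connections/sync",
    json={"connectionId": CONNECTION_ID},
)
response.raise_for_status()
print(response.json())  # job metadata for the sync that was just kicked off
```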
The advent of tools like Airbyte represents a significant leap forward in the field of data engineering, democratizing data integration by providing a uniform platform to connect disparate data sources with minimal manual coding. Topic 4 not only equips you with the practical skills to implement Airbyte for automating data pipelines but also deepens your understanding of modern ELT processes, preparing you for the challenges of handling data in a multi-system environment.
Upon completing this topic, you’ll possess a robust set of skills that are highly sought after in the data engineering domain. The ability to seamlessly integrate data from various sources into coherent, analysis-ready datasets opens up new avenues for insights and decision-making. Your expertise in deploying and managing Airbyte pipelines, especially in cloud environments like AWS, will make you a pivotal asset in any data-driven organization, ready to tackle the complexities of today’s data ecosystem and drive meaningful business outcomes.
Topic 05
As businesses grow and their data volumes expand, the challenge of processing vast amounts of information efficiently becomes paramount. Traditional methods of data processing often hit a bottleneck, unable to cope with the scale and agility required in today’s fast-paced environment.
Topic 5 of our data engineering bootcamp addresses this challenge head-on by introducing students to the world of Analytics Engineering, focusing on two groundbreaking technologies: Snowflake for data storage and analytics, and dbt (data build tool) for transforming data in a more modular and version-controlled manner. This topic is designed to equip you with the advanced skills needed to tackle large-scale data projects, streamlining the transformation process and ensuring that data analytics can be conducted with precision at scale.
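To preview the in-warehouse style of transformation this topic teaches, here is a sketch using the snowflake-connector-python package; the account, credentials, and table names are placeholders, and a dbt model ultimately compiles to SQL run inside Snowflake in much the same way:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Connection details are placeholders; real projects read them from env/config.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="STAGING",
)

# The transformation runs inside Snowflake itself, close to the data,
# rather than on the machine issuing the query.
conn.cursor().execute("""
    CREATE OR REPLACE TABLE orders_by_customer AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY customer_id
""")
```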
The convergence of Snowflake and dbt in the analytics engineering landscape represents a significant evolution in how data teams approach large-scale data transformation and analysis. Through this topic, you’ll gain not only the technical acumen to leverage these powerful tools but also a deeper understanding of their role in modern data engineering practices. Analytics engineering with Snowflake and dbt enables data teams to build more efficient, scalable, and manageable data pipelines, fundamentally changing the speed and efficacy with which businesses can derive insights from their data.
By mastering the concepts and practices taught in this topic, you will be well-equipped to navigate the complexities of large-scale data analytics projects. Your ability to efficiently process and transform data with Snowflake and dbt will make you an invaluable asset to any data-driven organization, ready to tackle the challenges of analytics at scale and drive forward the strategic goals of your business. This expertise not only enhances your career prospects but also positions you at the forefront of data engineering innovation.
Topic 06
In Topic 6 of our data engineering bootcamp, we delve into the crucial concepts of data modelling and semantic modelling, which stand at the heart of making data comprehensible and useful for end-user consumption. This topic is designed to bridge the gap between raw data processing and the delivery of insightful, actionable information suitable for applications in machine learning, business intelligence, and analytics. By applying software engineering principles such as modularity and reusability to data modelling, you will learn to create structured, efficient data models that serve as the foundation for robust analytics.
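As a small illustration of that modularity, the pandas sketch below splits a flat orders extract into a reusable customer dimension and a fact table; the column names are illustrative:

```python
import pandas as pd

# A flat, denormalized export such as one pulled from an operational system.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_name": ["Alice", "Bob", "Alice"],
    "customer_country": ["AU", "NZ", "AU"],
    "amount": [10.5, 7.25, 3.0],
})

# Dimension table: one row per customer, with a surrogate key.
dim_customer = (
    orders[["customer_name", "customer_country"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
dim_customer["customer_key"] = dim_customer.index

# Fact table: measures plus a foreign key into the dimension.
fact_orders = orders.merge(dim_customer, on=["customer_name", "customer_country"])
fact_orders = fact_orders[["order_id", "customer_key", "amount"]]
```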
Furthermore, the introduction of a semantic layer atop the data warehouse facilitates intuitive data exploration, enabling users to easily interact with the underlying models. This comprehensive overview will not only enhance your technical skills but also deepen your understanding of how data engineering supports and enhances data-driven decision-making processes.
Data modelling and semantic modelling are pivotal in translating complex data into formats that are readily understandable and usable by end-users. This topic equips you with the methodologies and tools needed to construct effective data models and semantic layers, ensuring that the data processed and stored within your systems can be efficiently analyzed and interpreted.
By the end of this topic, you’ll have a solid grasp of both traditional and modern data modelling techniques, as well as the ability to implement a semantic layer that enhances data accessibility and usability. These skills are indispensable in today’s data-centric world, enabling you to support a wide range of analytics applications and empower decision-makers with the insights needed to drive business success. Your expertise in these areas will not only elevate your value as a data engineer but also contribute significantly to the strategic use of data within any organization.
Topic 07
Topic 7 of our data engineering bootcamp brings you to the cutting edge of big data processing by exploring the Data Lakehouse architecture, utilizing Databricks and Apache Spark. This segment is meticulously crafted to offer a deep dive into the world of scalable data processing, streamlining workflows for data engineering, stream processing, and machine learning. The advent of the Data Lakehouse, supported by technologies like Spark and Databricks, represents a significant leap forward, merging the flexibility of data lakes with the management features of data warehouses.
Through this topic, you’ll learn how Spark’s distributed data processing capabilities, combined with Databricks’ comprehensive ecosystem, enable the handling of vast data volumes efficiently and effectively. This knowledge is crucial for modern data engineers tasked with building scalable, robust data pipelines that can accommodate the exploding volume, velocity, and variety of data in today’s digital landscape.
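Here is a minimal PySpark sketch of that distributed style; the input path and column names are hypothetical, and on Databricks a SparkSession is already provided as spark:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession already exists as `spark`; locally we build one.
spark = SparkSession.builder.appName("events-demo").getOrCreate()

# Read a (hypothetical) Parquet dataset; Spark distributes the work
# across the cluster instead of loading everything on one machine.
events = spark.read.parquet("s3://my-bucket/events/")

daily_counts = (
    events
    .groupBy(F.to_date("event_time").alias("event_date"), "event_type")
    .count()
    .orderBy("event_date")
)
daily_counts.show()
```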
Through the exploration of Databricks and Spark within the Data Lakehouse paradigm, this topic equips you with the skills and knowledge to tackle big data challenges head-on. You’ll learn not only about the technical aspects of data processing at scale but also about ensuring data quality and optimizing performance, which are crucial for delivering actionable insights.
Upon completion of this topic, you’ll have a solid understanding of how to leverage Databricks and Spark in a Data Lakehouse architecture to build scalable, efficient, and reliable data pipelines. This expertise is invaluable in a world where data is continuously growing in importance, enabling you to drive innovation and make data-driven decisions that can significantly impact the success of any organization. Your ability to apply these advanced data engineering techniques will set you apart in the field, preparing you for a rewarding career in data engineering and beyond.
Topic 08
Topic 8 of our data engineering bootcamp transitions focus towards data orchestration with Dagster, an innovative tool that reimagines the orchestration and observability of data pipelines. Dagster is designed to address the complexities of modern data applications, offering a more integrated approach to constructing, executing, and monitoring data workflows. Unlike traditional orchestrators, Dagster emphasizes the development experience and operational robustness, making it an attractive choice for data engineers seeking to streamline their data processes.
This topic aims to equip you with comprehensive knowledge of Dagster’s capabilities, from its intuitive programming model to its operational features, enabling you to build sophisticated, maintainable, and scalable data pipelines that are tightly integrated with your data stack, including tools like Airbyte, dbt, Snowflake, and Databricks.
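As a taste of that programming model, here is a minimal pair of Dagster software-defined assets; the asset logic is a toy placeholder, but the dependency wiring follows Dagster’s asset model:

```python
from dagster import Definitions, asset

@asset
def raw_orders() -> list[dict]:
    """Upstream asset: in a real pipeline this might be landed by Airbyte."""
    return [{"customer": "alice", "amount": 10.5}, {"customer": "bob", "amount": 7.25}]

@asset
def order_totals(raw_orders: list[dict]) -> dict:
    """Downstream asset: Dagster infers the dependency from the argument name."""
    totals: dict[str, float] = {}
    for row in raw_orders:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

# Definitions make the assets visible to Dagster's UI and schedulers.
defs = Definitions(assets=[raw_orders, order_totals])
```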
Through this exploration of Dagster, you’ll discover a holistic approach to data pipeline orchestration that not only simplifies the development and management of complex workflows but also provides superior visibility and control over data operations. Dagster’s emphasis on type safety, asset tracking, and comprehensive observability addresses many of the challenges faced in modern data engineering practices, offering a path to more reliable, maintainable, and scalable data ecosystems.
Upon completing this topic, you’ll possess a solid foundation in orchestrating data workflows with Dagster, prepared to tackle the intricacies of data engineering with confidence. Your ability to leverage Dagster’s advanced features for pipeline construction, execution, and monitoring will make you an invaluable asset in any data-driven organization. Armed with these skills, you’re well-positioned to contribute significantly to the efficiency, reliability, and success of data projects, driving forward the strategic objectives of your organization through effective data orchestration.
Topic 09
Topic 9 of our data engineering bootcamp delves into the dynamic world of streaming analytics, focusing on leveraging Kafka, Confluent, and ClickHouse to harness real-time insights from rapidly moving data. As businesses increasingly rely on timely data for decision-making, understanding and implementing streaming data architectures becomes crucial.
This topic is designed to provide you with a solid foundation in the principles of stream processing, enabling you to deploy Kafka topics on Confluent Cloud and integrate real-time events into ClickHouse for analysis. Additionally, you’ll learn how to transform data within ClickHouse and utilize dbt for defining and testing materialized views, equipping you with the skills needed to build scalable, real-time analytics solutions.
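As a first step into that workflow, here is a minimal event producer using the confluent-kafka Python client; the broker address and topic name are placeholders, and a real Confluent Cloud setup would add API-key authentication settings:

```python
import json

from confluent_kafka import Producer  # pip install confluent-kafka

# Placeholder config; Confluent Cloud also requires API-key auth settings.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    """Called once the broker confirms (or rejects) each message."""
    if err is not None:
        print(f"Delivery failed: {err}")

# Produce a small stream of page-view events to a (hypothetical) topic.
for user_id in ["alice", "bob"]:
    event = {"user_id": user_id, "page": "/home"}
    producer.produce("page_views", value=json.dumps(event), callback=on_delivery)

producer.flush()  # block until all outstanding messages are delivered
```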
Through this comprehensive exploration of streaming analytics with Kafka, Confluent, and ClickHouse, you’ll acquire the capability to build and maintain robust, scalable systems that provide real-time insights into data. This topic not only covers the technical aspects of stream processing technologies but also emphasizes practical applications and best practices for deploying these solutions in real-world scenarios.
Upon completing this topic, you’ll be adept at navigating the complexities of streaming data, from ingestion with Kafka and Confluent to analysis and visualization with ClickHouse and Preset. Your newfound skills will enable you to deliver valuable, timely insights that can drive strategic decisions and operational efficiencies in any organization. Embracing streaming analytics will position you at the forefront of data engineering innovation, ready to tackle the challenges and opportunities presented by real-time data processing.
Topic 10
Topic 10 of our data engineering bootcamp focuses on the critical practice of Continuous Integration (CI) and Continuous Deployment (CD) within the realm of data engineering. As data teams expand and projects become more complex, ensuring code quality and seamless deployment becomes increasingly challenging.
This topic is designed to equip you with the knowledge and skills to implement automated CI/CD pipelines, fostering a culture of DataOps that emphasizes rapid, reliable, and automated data pipeline development. By integrating these practices, you’ll learn how to enhance team collaboration, streamline code integration, and ensure consistent deployments to various environments, including staging and production.
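At the heart of such a CI pipeline is a suite of automated tests that runs on every change; below is a minimal, hypothetical pytest example of the kind of check a data engineering CI job would execute:

```python
# test_transform.py -- run automatically by the CI pipeline on every push.

def clean_amounts(rows: list[dict]) -> list[float]:
    """Hypothetical pipeline step: parse amounts, dropping blank values."""
    return [float(row["amount"]) for row in rows if row["amount"]]

def test_clean_amounts_parses_and_filters():
    rows = [{"amount": "10.5"}, {"amount": ""}, {"amount": "3"}]
    assert clean_amounts(rows) == [10.5, 3.0]
```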
By the end of this topic, you’ll have a comprehensive understanding of how to implement CI/CD pipelines in data engineering projects, aligning with the DataOps principles for enhanced efficiency and collaboration. These practices not only facilitate quicker iterations and improvements of data pipelines but also significantly reduce the risk of errors and downtime in production environments.
Upon completing this topic, you’ll be well-prepared to contribute to a culture of continuous improvement within your data team, employing CI/CD pipelines to automate testing, integration, and deployment processes. Your ability to implement these methodologies will ensure that your data pipelines are robust, reliable, and ready for the demands of modern data-driven organizations. This expertise is crucial for any data engineer looking to excel in today’s fast-paced, quality-oriented industry, making you a valuable asset to any team focused on delivering high-quality data solutions efficiently.