Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

11/19/25

From Raw Data to Governance: Refining Data with the Medallion Architecture Nov 2025

Overview

Build upon your existing data engineering expertise and discover how Medallion Architecture can transform your data strategy. This session provides a hands-on approach to implementing Medallion principles, empowering you to create a robust, scalable, and governed data platform.

We'll explore how to align data engineering processes with Medallion Architecture, identifying opportunities for optimization and improvement. By understanding the core principles and practical implementation steps, you'll learn how to optimize data pipelines, enhance data quality, and unlock valuable insights through a structured, layered approach to drive business success.

From Raw Data to Governance: Refining Data with the Medallion Architecture

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

  • Introduction to Medallion Architecture

    • Defining Medallion Architecture
    • Core Principles
    • Benefits of Medallion Architecture
  • The Raw Zone

    • Understanding the purpose of the Raw Zone
    • Best practices for data ingestion and storage
  • The Bronze Zone

    • Data transformation and cleansing
    • Creating a foundation for analysis
  • The Silver Zone

    • Data optimization and summarization
    • Preparing data for consumption
  • The Gold Zone

    • Curated data for insights and action
    • Enabling self-service analytics
  • Empowering Insights

    • Data-driven decision-making
    • Accelerated Insights
  • Data Governance

    • Importance of data governance in Medallion Architecture
    • Implementing data ownership and stewardship
    • Ensuring data quality and security

Why Attend:

Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.

  • Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
  • Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
  • Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
  • Scalability: The layered approach can accommodate growing data volumes and complexity.
  • Cost Efficiency: Optimized data storage and processing can reduce costs.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.

  • Key Characteristics:
    • Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
    • Data is ingested as-is, without any cleaning or transformation
    • High volume and velocity
    • Data retention policies are crucial
  • Benefits:
    • Preserves original data for potential future analysis
    • Enables data reprocessing
    • Supports data lineage and auditability

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Raw Zone Diagram

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.

  • Key Characteristics:
    • Data is cleansed and standardized
    • Basic transformations are applied (e.g., data type conversions, null handling)
    • Data is structured into tables or views
    • Data quality checks are implemented
    • Data retention policies may be shorter than the Raw Zone
  • Benefits:
    • Improves data quality and consistency
    • Provides a foundation for further analysis
    • Enables data exploration and discovery

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Bronze Zone Diagram

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

  • Key Characteristics:
    • Data is cleansed, standardized, and enriched
    • Data is structured for analytical purposes (e.g., normalized, de-normalized)
    • Data is optimized for query performance (e.g., partitioning, indexing)
    • Data is aggregated and summarized for specific use cases
  • Benefits:
    • Improved query performance
    • Supports self-service analytics
    • Enables advanced analytics and machine learning
    • Reduces query costs

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Silver Zone Diagram

The Gold Zone: Your Data's Final Destination

  • Definition: The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
  • Key Characteristics:
    • Data is highly refined, aggregated, and optimized for specific use cases
    • Data is often materialized for performance
    • Data is subject to rigorous quality checks and validation
    • Data is secured and governed
  • Benefits:
    • Enables rapid insights and decision-making
    • Supports self-service analytics and reporting
    • Provides a foundation for advanced analytics and machine learning
    • Reduces query latency

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Gold Zone Diagram

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

  • Key Characteristics:
    • Data is accessible and easily consumable
    • Supports various analytical tools and platforms (BI, ML, data science)
    • Enables self-service analytics
    • Drives business decisions and actions
  • Examples of Consumption Tools:
    • Business Intelligence (BI) tools (Looker, Tableau, Power BI)
    • Data science platforms (Python, R, SQL)
    • Machine learning platforms (TensorFlow, PyTorch)
    • Advanced analytics tools

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. Data Governance is essential for ensuring data quality, consistency, and security.

Key components of data governance include:

  • Data Lineage: Tracking data's journey from source to consumption.
  • Data Ownership: Defining who is responsible for data accuracy and usage.
  • Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
  • Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Data Governance : Metadata

Assigns the owner, steward and responsibilities of the data.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Governance Metadata

Summary: Leverage Medallion Architecture for Success

  • Key Benefits:
    • Improved data quality
    • Enhanced governance
    • Accelerated insights
    • Scalability
    • Cost Efficiency.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia Data Engineering Process Fundamentals - Book by Oscar Garcia

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, Data Engineering Process Fundamentals, which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository Introduction to Data Engineering Process Fundamentals.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com

10/29/25

From Raw Data to Analytics: The Modern Data Layer Architecture



Overview

This presentation is part of the Data Engineering Process Fundamentals series, focusing on the essential architectural components—the Data Lake and the Data Warehouse—and defining their respective roles in a modern analytics ecosystem.

From Raw Data to Analytics: The Modern Data Layer Architecture

  • Follow this GitHub repo during the presentation: (Star the project to follow and get updates)

👉 GitHub Repo

  • Data engineering Series:

👉 Blog Series

YouTube Video

Video Agenda

Agenda:

  1. Introduction to Data Engineering:

  2. Brief overview of the data engineering landscape and its critical role in modern data-driven organizations.

  3. Operational Data

  4. Understanding Data Lakes:

  5. Explanation of what a data lake is and its purpose in storing vast amounts of raw and unstructured data.

  6. Exploring Data Warehouses:

  7. Definition of data warehouses and their role in storing structured, processed, and business-ready data.

  8. Comparing Data Lakes and Data Warehouses:

  9. Comparative analysis of data lakes and data warehouses, highlighting their strengths and weaknesses.

  10. Discussing when to use each based on specific use cases and business needs.

  11. Integration and Data Pipelines:

  12. Insight into the seamless integration of data lakes and data warehouses within a data engineering pipeline.

  13. Code walkthrough showcasing data movement and transformation between these two crucial components.

  14. Real-world Use Cases:

  15. Presentation of real-world use cases where effective use of data lakes and data warehouses led to actionable insights and business success.

  16. Hands-on demonstration using Python, Jupyter Notebook and SQL to solidify the concepts discussed, providing attendees with practical insights and skills.

  17. Q&A and Hands-on Session:

  18. An interactive Q&A session to address any queries.

Conclusion:

This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data lakes and data warehouses. By the end of this presentation, participants will grasp how to effectively utilize these tools, enabling them to design efficient data solutions and drive informed business decisions.

This presentation will be accompanied by live code demonstrations and interactive discussions, ensuring attendees gain practical knowledge and valuable insights into the dynamic world of data engineering.

Supporting Materials Reminder

Subsequent Sessions: Join us for future sessions in our Data Engineering Process Fundamentals series, where we will build a data pipeline and delve deeper into topics like orchestration and governance.

Resources: This presentation is based on the book, Data Engineering Process Fundamentals, and all supporting code and examples are available on our popular GitHub repository.

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Topics

  • Data Lake and Data Warehouse
  • Discovery and Data Analysis
  • Design and Infrastructure Planning
  • Data Lake - Pipeline and Orchestration
  • Data Warehouse - Design and Implementation
  • Analysis and Visualization

Follow this project: Give a star

👉 Data Engineering Process Fundamentals

Operational Data

Operational data is often generated by applications, and it is stored in transactional relational databases like SQL Server, Oracle and NoSQL (JSON) databases like MongoDB, Firebase. This is the data that is created after an application saves a user transaction like contact information, a purchase or other activities that are available from the application.

Features:

  • Application support and transactions
  • Relational data structure and SQL or document structure NoSQL
  • Small queries for case analysis

Not Best For:

  • Reporting system
  • Large queries
  • Centralized Big Data system

Data Engineering Process Fundamentals - Operational Data

Data Lake - Analytical Data Staging

A Data Lake is an optimized storage system for Big Data scenarios. The primary function is to store the data in its raw format without any transformation. Analytical data is the transaction data that has been extracted from a source system via a data pipeline as part of the staging data process.

Features:

  • Store the data in its raw format without any transformation
  • This can include structure data like CSV files, unstructured data like JSON and XML documents, or column-base data like parquet files
  • Low Cost for massive storage power
  • Not Designed for querying or data analysis
  • It is used as external tables by most systems

Data Engineering Process Fundamentals - Analytical Data staging

Data Warehouse - Analytical Data

A Data Warehouse is a centralized storage system that stores integrated data from multiple sources. The system is designed to host and serve Big Data scenarios with lower operational cost than transaction databases, but higher costs than a Data Lake. This system host the Analytical Data that has been processed and is ready for analytical purposes.

Data Warehouse Features:

  • Stores historical data in relational tables with an optimized schema, which enables the data analysis process
  • Provides SQL support to query the data
  • It can integrate external resources like CSV and parquet files that are stored on Data Lakes as external tables
  • The system is designed to host and serve Big Data scenarios. It is not meant to be used as a transactional system
  • Storage is more expensive
  • Offloads archived data to Data Lakes

Data Engineering Process Fundamentals - Analytical Data Store

Discovery - Data Analysis

During the discovery phase of a Data Engineering Process, we look to identify and clearly document a problem statement, which helps us have an understanding of what we are trying to solve. We also look at our analytical approach to make observations about at the data, its structure and source. This leads us into defining the requirements for the project, so we can define the scope, design and architecture of the solution.

  • Download sample data files
  • Run experiments to make observations
  • Write Python scripts using VS Code or Jupyter Notebooks
  • Transform the data with Pandas
  • Make charts with Plotly
  • Document the requirements

Data Engineering Process Fundamentals - Data Analysis and discovery

Design and Planning

The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful system. It involves defining the system architecture, designing data pipelines, implementing source control practices, ensuring continuous integration and deployment (CI/CD), and leveraging tools like Docker and Terraform for infrastructure automation.

  • Use GitHub for code repo and for CI/CD actions
  • Use Terraform is an Infrastructure as Code (IaC) tool that enables us to manage cloud resources across multiple cloud providers
  • Use Docker containers to run the code and manage its dependencies

Data Engineering Process Fundamentals - Design and Planning

Data Lake - Pipeline and Orchestration

A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake, and monitor cloud resources.

  • This can be code-centric, leveraging languages like Python
  • Or a low-code approach, utilizing tools such as Azure Data Factory, which provides a turn-key solution
  • Monitor services enable us to track telemetry data
  • Docker Hub, GitHub can be used for the CI/CD process

Data Engineering Process Fundamentals - Data Lake - Data Pipeline and Orchestration

Data Warehouse - Design and Implementation

In the design phase, we lay the groundwork by defining the database system, schema model, and technology stack required to support the data warehouse’s implementation and operations. In the implementation phase, we focus on converting conceptual data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis. Create a repeatable and extendable process.

Data Engineering Process Fundamentals - Data Warehouse Design and Implementation

Data Warehouse - Data Analysis

Data analysis is the practice of exploring data and understanding its meaning. It involves activities that can help us achieve a specific goal, such as identifying data dimensions and measures, as well as data analysis to identify outliers, trends, and distributions.

  • We can accomplish these activities by writing code using Python and Pandas, SQL, Visual Studio Code or Jupyter Notebooks.
  • What's more, we can use libraries, such as Plotly, to generate some visuals to further analyze data and create prototypes.

Data Engineering Process Fundamentals - Data Analysis

Data Analysis and Visualization

Data visualization is a powerful tool that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance.

  • Dashboards, in particular, bring together various visual components like charts, graphs, and scorecards into a unified interface that can help us tell a story
  • Use tools like PowerBI, Looker, Tableau to model the data and create enterprise level visualizations

Data Engineering Process Fundamentals - Data Visualization

Conclusion

Both data lakes and data warehouses are essential components of a data engineering project. The primary function of a data lake is to store large amounts of operational data in its raw format, serving as a staging area for analytical processes. In contrast, a data warehouse acts as a centralized repository for information, enabling engineers to transform, process, and store extensive data. This allows the analytical team to utilize coding languages like Python and tools such as Jupyter Notebooks, as well as low-code platforms like Looker Studio and Power BI, to create enterprise-quality dashboards for the organization.

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, Data Engineering Process Fundamentals, which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository Introduction to Data Engineering Process Fundamentals.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com

8/27/25

From Raw Data to Roadmap: The Discovery Phase in Data Engineering Process Fundamentals

Overview

The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.

In this session, we will delve into the essential building blocks of data engineering, placing a spotlight on the discovery process. From framing the problem statement to navigating the intricacies of exploratory data analysis (EDA) using Python, VSCode, Jupyter Notebooks, and GitHub, you'll gain a solid understanding of the fundamental aspects that drive effective data engineering projects.

DevFest Series Data Engineering Process Fundamentals Series

From Raw Data to Roadmap: The Discovery Phase in Data Engineering - Data Engineering Process Fundamentals

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 GitHub Repo

Jupyter Notebook

👉 Jupyter Notebook

  • Data engineering Series:

👉 Blog Series

👉 Data Engineering Book on Amazon

YouTube Video

Video Agenda

In this session, we will delve into the essential building blocks of data engineering, placing a spotlight on the discovery process. From framing the problem statement to navigating the intricacies of exploratory data analysis (EDA), data modeling using Python, VS Code, Jupyter Notebooks, SQL, and GitHub, you'll gain a solid understanding of the fundamental aspects that drive effective data engineering projects.

  1. Introduction:

    • The "Why": We'll discuss why understanding your data upfront is crucial for success.
    • The Problem: We'll introduce a real-world problem that will guide our exploration.
  2. Data Loading and Preparation:

    • Loading: We'll demonstrate how to efficiently load data from an online source directly into our workspace.
    • Structuring: We'll prepare the loaded data for analysis, making it easy to work with.
  3. Exploratory Data Analysis (EDA):

    • First Look: We'll learn how to quickly generate and interpret summary statistics for our data.
    • The Story: We'll use these statistics to understand the data's characteristics and identify any red flags or anomalies.
  4. Data Cleaning and Modeling:

    • Cleaning: We'll identify and handle common data issues like missing values and inconsistencies.
    • Modeling: We'll organize our data into separate tables for dimensions (descriptive attributes) and facts (measurable values).
  5. Visualization and Real-World Application:

    • Bringing it to Life: We'll create charts to visualize the data and find patterns.
    • Solving the Problem: We'll apply the insights gained to address our original problem and discuss practical solutions.

Key Takeaways:

  • Mastery of the foundational aspects of data engineering.
  • Hands-on experience with EDA techniques, emphasizing the discovery phase.
  • Appreciation for the value of a code-centric approach in the data engineering discovery process.

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, "Data Engineering Process Fundamentals," which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository.

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Topics

  • Importance of the Discovery Process
  • Setting the Stage - Technologies
  • Exploratory Data Analysis (EDA)
  • Code-Centric Approach
  • Version Control
  • Real-World Use Case

Follow this project: Give a star

👉 Data Engineering Process Fundamentals

Importance of the Discovery Process

The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.

  • Clearly document the problem statement to understand the challenges the project aims to address.
  • Make observations about the data, its structure, and sources during the discovery process.
  • Define project requirements based on the observations, enabling the team to understand the scope and goals.
  • Clearly outline the scope of the project, ensuring a focused and well-defined set of objectives.
  • Use insights from the discovery phase to inform the design of the solution, including data architecture.
  • Develop a robust project architecture that aligns with the defined requirements and scope.

Data Engineering Process Fundamentals - Discovery Process

Setting the Stage - Technologies

To set the stage, we need to identify and select the tools that can facilitate the analysis and documentation of the data. Here are key technologies that play a crucial role in this stage:

  • Python: A versatile programming language with rich libraries for data manipulation, analysis, and scripting.

Use Cases: Data download, cleaning, exploration, and scripting for automation.

  • Jupyter Notebooks: An interactive tool for creating and sharing documents containing live code, visualizations, and narrative text.

Use Cases: Exploratory data analysis, documentation, and code collaboration.

  • Visual Studio Code: A lightweight, extensible code editor with powerful features for source code editing and debugging.

Use Cases: Writing and debugging code, integrating with version control systems like GitHub.

  • SQL (Structured Query Language): A domain-specific language for managing and manipulating relational databases.

Use Cases: Querying databases, data extraction, and transformation.

Data Engineering Process Fundamentals - Discovery Tools

Exploratory Data Analysis (EDA)

EDA is our go-to method for downloading, analyzing, understanding and documenting the intricacies of the datasets. It's like peeling back the layers of information to reveal the stories hidden within the data. Here's what EDA is all about:

  • EDA is the process of analyzing data to identify patterns, relationships, and anomalies, guiding the project's direction.

  • Python and Jupyter Notebook collaboratively empower us to download, describe, and transform data through live queries.

  • Insights gained from EDA set the foundation for informed decision-making in subsequent data engineering steps.

  • Code written on Jupyter Notebook can be exported and used as the starting point for components for the data pipeline and transformation services.

Data Engineering Process Fundamentals - Discovery Pie Chart

Code-Centric Approach

A code-centric approach, using programming languages and tools in EDA, helps us understand the coding methodology for building data structures, defining schemas, and establishing relationships. This robust understanding seamlessly guides project implementation.

  • Code delves deep into data intricacies, revealing integration and transformation challenges often unclear with visual tools.

  • Using code taps into Pandas and Numpy libraries, empowering robust manipulation of data frames, establishment of loading schemas, and addressing transformation needs.

  • Code-centricity enables sophisticated analyses, covering aggregation, distribution, and in-depth examinations of the data.

  • While visual tools have their merits, a code-centric approach excels in hands-on, detailed data exploration, uncovering subtle nuances and potential challenges.

Data Engineering Process Fundamentals - Discovery Pie Chart

Version Control

Using a tool like GitHub is essential for effective version control and collaboration in our discovery process. GitHub enables us to track our exploratory code and Jupyter Notebooks, fostering collaboration, documentation, and comprehensive project management. Here's how GitHub enhances our process:

  • Centralized Tracking: GitHub centralizes tracking and managing our exploratory code and Jupyter Notebooks, ensuring a transparent and organized record of our data exploration.

  • Sharing: Easily share code and Notebooks with team members on GitHub, fostering seamless collaboration and knowledge sharing.

  • Documentation: GitHub supports Markdown, enabling comprehensive documentation of processes, findings, and insights within the same repository.

  • Project Management: GitHub acts as a project management hub, facilitating CI/CD pipeline integration for smooth and automated delivery of data engineering projects.

Data Engineering Process Fundamentals - Discovery Problem Statement

Summary: The Power of Discovery

By mastering the discovery phase, you lay a strong foundation for successful data engineering projects. A thorough understanding of your data is essential for extracting meaningful insights.

  • Understanding Your Data: The discovery phase is crucial for understanding your data's characteristics, quality, and potential.
  • Exploratory Data Analysis (EDA): Use techniques to uncover patterns, trends, and anomalies.
  • Data Profiling: Assess data quality, identify missing values, and understand data distributions.
  • Data Cleaning: Address data inconsistencies and errors to ensure data accuracy.
  • Domain Knowledge: Leverage domain expertise to guide data exploration and interpretation.
  • Setting the Stage: Choose the right language and tools for efficient data exploration and analysis.

The data engineering discovery process involves defining the problem statement, gathering requirements, and determining the scope of work. It also includes a data analysis exercise utilizing Python and Jupyter Notebooks or other tools to extract valuable insights from the data. These steps collectively lay the foundation for successful data engineering endeavors.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com

2/26/25

Discovering Machine Language a Primer Guide

Overview

Machine Learning can seem like a complex and mysterious field. This presentation aims to discover the core concepts of Machine Learning, providing a primer guide to key ideas like supervised and unsupervised learning, along with practical examples to illustrate their real-world applications. We'll also explore a GitHub repository with code examples to help you further your understanding and experimentation.

#BuildwithAI Series

Discovering Machine Language a Primer Guide

  • Follow this GitHub repo during the presentation: (Please star and follow the project for updates.)

👉 https://github.com/ozkary/machine-learning-engineering

YouTube Video

Video Agenda

Agenda:

  1. What is Machine Learning?

    • Definition and core concepts
  2. Why is Machine Learning Important?

    • Key applications and benefits
  3. Types of Machine Learning

    • Supervised Learning
    • Examples: Classification & Regression
    • Unsupervised Learning
    • Examples: Clustering & Dimensionality Reduction
  4. Problem Types

    • Regression: Predicting continuous values
    • Classification: Predicting categorical outcomes
  5. Model Development Process

    • Understand the Problem
    • Exploratory Data Analysis (EDA)
    • Data Preprocessing
    • Feature Engineering
    • Data Splitting
    • Model Selection
    • Training & Evaluation

Presentation

What is Machine Learning

ML is a subset of AI that focuses on enabling computers to learn and improve performance on a specific task without being explicitly programmed. In essence, it's about learning from data patterns to make predictions or decisions based on it.

Core Concepts

  • Learn from data
  • Improve performance with more training data
  • Main goal is to make predictions and decisions on new data
  • Learn the relation of data + outcome to define the model
  • The new data + model predicts an outcome

Discovering Machine Language a Primer Guide: What are LLMs?

Why is Machine Learning Important?

ML impacts how computers solve problems. Traditional systems rely on pre-defined rules programmed by humans. This approach struggles with complexity and doesn't adapt to new information. In contrast, ML enables computers to learn directly from data, similar to how humans learn.

  • Coding Rules
def heart_disease_risk_rule_based(age, overweight, diabetic):
     """
     Assesses heart disease risk based on a set of predefined rules.

     Args:
         age: Age of the individual (int).
         overweight: True if overweight, False otherwise (bool).
         diabetic: True if diabetic, False otherwise (bool).

     Returns:
         "High Risk" or "Low Risk" (str).
     """
     if age > 50 and overweight and diabetic:
         return "High Risk"
     elif age > 60 and (overweight or diabetic):
         return "High Risk"
     elif age > 40 and overweight and not diabetic:
        return "Moderate Risk"
     else:
         return "Low Risk"
  • Learning from data
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


df = pd.DataFrame(data)
# Prepare the data
X = df[['Age', 'Overweight', 'Diabetic']]  # Features
y = df['Heart Disease']  # Target

# Split data into training and testing sets
# X has the categories/features
# y has the target value
# train data is for training
# test data is for testing
# .2 means 20% of the data is used for testing 80% for training
# 42 is the seed for random shuffling

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the model : {accuracy}")

# 70% - 80%: Often considered a reasonable starting point for many classification problems.
# 80% - 90%: Good performance for many applications.
# 90% - 95%: Very good performance. Often challenging to achieve, but possible for well-behaved problems with good data.
# > 95%: Excellent performance, potentially approaching the limits of what's possible for the problem. Be careful of overfitting if you're achieving very high accuracy.
# 100%: Usually a sign of overfitting.

👉 Jupyter Notebook

Types of ML Models - Supervised Learning

Examples

  • Regression: Predicting a continuous value (e.g., house prices, stock prices).
  • Classification: Predicting a category or class label (e.g., cat/dog/bird, disease/no disease).
  • Model Examples: Linear Regression, Logistic Regression, Decision Trees, Random Forest.

Discovering Machine Language a Primer Guide: Supervised Learning

Types of ML Models - Unsupervised Learning

Examples

  • Clustering: Grouping similar data points together (e.g., group patients by symptoms, age groups)

  • Association: Discovering relationships or associations between items (e.g., symptom association)

"Patients who report 'Fever' and 'Cough' are also frequently reporting 'Headache' or 'Muscle Aches'."

  • Model Examples: Clustering (k-means), association (Frequent Pattern Growth)

Discovering Machine Language a Primer Guide: Unsupervised Learning

Supervise Learning - Common Problem Types

Regression and Classification are two main problem types to solve. With Regression, we look to predict a continuous target variable like price, cost. With Classification, we look to predict a discrete target like a group or y/n class.

Problem Types

  1. Regression:

    • In regression, the target variable is continuous and represents a quantity or a number.
    • Example: Predicting house prices, temperature predictions, stock prices.
  2. Classification:

    • In classification, the target variable is discrete and represents a category or a class.
    • Example: spam vs. non-spam emails, predicting heart disease Y/N.

Discovering Machine Language a Primer Guide: Regression and Classification

ML Model Development Process - MLOps

Developing a new ML model involves understanding the core problem, then using a data engineering process to gather, explore, and prepare the data. We then move to the ML process to split the data, select the algorithm, train and evaluate the model.

  • Development Process
    • Understand the problem
    • Exploratory Data Analysis (EDA)
    • Data Preprocessing
    • Feature Engineering
    • Data Splitting
    • Model Selection
    • Train TRaining
    • Model Evaluate & Tuning
    • Deployment

MLOps is the operation process to manage the training, evaluation and deployment of your models

Discovering Machine Language a Primer Guide: MLOps process

👉 Vehicle MSRP Regression

Machine Learning Summary

Machine learning (ML) enables computers to learn patterns from data and make predictions or decisions without explicit programming, unlike rule-based systems. ML models improve as they process more data.

  • Supervised Learning:

    • Learns from labeled data (input-output pairs)
    • Regression: Predicts continuous values (e.g., house prices)
    • Classification: Predicts categories (e.g., heart disease y/n)
  • Unsupervised Learning:

    • Learns from unlabeled data (only inputs) and discovers patterns and structures.
    • Clustering: Grouping similar data points (e.g., group patients by symptoms, age groups)
    • Association: Discovering relationships between items (e.g., symptoms association)

While we've explored foundational areas, numerous other exciting topics exist, such as neural networks, natural language processing, computer vision, large language models (LLMs). Visit the repository for more exploration.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

Leave comments on this post or contact me at:

👍 Originally published by ozkary.com

1/22/25

Smart Charts: Powered by AI to enhance Data Understanding

Overview

This presentation explores how Generative AI, particularly Large Language Models (LLMs), can empower engineers with deeper data understanding. We'll delve into creating complex charts using Python and demonstrate how LLMs can analyze these visualizations, identify trends, and suggest actionable insights. Learn how to effectively utilize LLMs through prompt engineering with natural language and discover how this technology can save you valuable time and effort.

#BuildwithAI Series

Smart Charts: Powered by AI to enhance Data Understanding

  • Follow this GitHub repo during the presentation: (Give it a star and follow the project)

👉 https://github.com/ozkary/ai-engineering

YouTube Video

Video Agenda

Agenda:

  1. Introduction to LLMs and their Role in Data Analysis and Training

    • What are LLMs, and how do they work?
    • LLMs in the context of data analysis and visualization.
  2. Prompt Engineering - Guiding the LLM

    • Crafting effective prompts for chart analysis.
    • Providing context within the prompt (chart type, data).
  3. Tokens - The Building Blocks

    • Understanding the concept of tokens in LLMs.
    • How token limits impact prompt design and model performance.
  4. Let AI Help with Data Insights - Real Use Case

    • Creating complex charts using Python libraries.
    • Write Prompts for Chart Analysis
    • Utilizing an LLM to analyze the generated charts.
    • Demonstrating how LLMs can identify trends, anomalies, and potential areas for improvement.
  5. Live Demo - Create complex charts using python and ask AI to help you with the analysis

    • Live coding demonstration of creating a complex chart and using an LLM to analyze it.

Why Attend?

  • Discover how to leverage LLMs to gain deeper insights from your data visualizations.
  • Learn practical techniques for crafting effective prompts to guide LLM analysis.
  • Enhance your data analysis skills with the power of AI.

Presentation

What are LLM Models - Not Skynet

Large Language Model (LLM) refers to a class of Generative AI models that are designed to understand prompts and questions and generate human-like text based on large amounts of training data. LLMs are built upon Foundation Models which have a focus on language understanding.

Common Tasks

  • Text and Code Generation: LLMs can generate code and data analysis text based on specific prompts

  • Natural Language Processing (NLP): Understand and generate human language, sentiment analysis, translation

  • Text Summarization: LLMs can condense lengthy pieces of text into concise summaries

  • Question Answering: LLMs can access and process information from various sources to answer questions, making a great fit for chatbots

Smart Charts with AI: What are LLMs?

Training LLM Models - Secret Sauce

Models are trained using a combination of machine learning and deep learning. Massive datasets of text are collected, cleaned, and fed into complex neural networks with multiple layers. These networks iteratively learn by analyzing patterns in the data, allowing them to map inputs like chart data to desired outputs, such as chart analysis.

Training Process:

  • Data Collection: Sources from books, articles, code repositories, and online conversations

  • Preprocessing: Data cleaning and formatting for the ML algorithms to understand it effectively

  • Model Training: The neural network architecture is trained on the data. The network adjusts its internal parameters to learn how to map input data (user stories) to desired outputs (code snippets)

  • Fine-tuning: Fine-tune models for specific tasks like code generation, by training the model on relevant data (e.g., specific programming languages, coding conventions).

Smart Charts with AI: Neural-Network

Transformer Architecture - Not Autobots

Transformer is a neural network architecture that excels at processing long sequences of text by analyzing relationships between words, no matter how far apart they are. This allows LLMs to understand complex language patterns and generate human-like text.

Components

  • Encoder: Process the input (use story) by using multiple encoder layers with self-attention Mechanism to analyze the relationship between words

  • Decoder: Uses the encoded information and its own attention mechanism to generate the output text (like code), ensuring it aligns with the text.

  • Attention Mechanism: Enables the model to effectively focus on the most important information for the task at hand, leading to improved NLP and generation capabilities.

Smart Charts with AI: Transformers encoder decoder attention mechanism

👉 Read: Attention is all you need by Google, 2017

Fine-Tuning for Specific Domain

Fine-tuning LLMs is a process to specialize a pre-trained model into a specific domain like data analysis with your process information.

Process:

  • Use the knowledge and parameters gained from a large pretrained dataset, source model
  • Enhance the model performance by retraining the source model with a domain specific and smaller dataset
  • Use the target model for the final integration

Smart Charts with AI: Fine-tuning a model

Tokens - The Building Blocks of Language Models

Large language models work by dissecting text into a sequence of tokens. These tokens act as the building blocks, allowing it to grasp the essence, structure, and connections within the text.

Details

  • Tokens can be individual words, punctuation marks, or even smaller sub-word units, depending on the specific LLM architecture.
  • The length of a word can influence the number of tokens it generates.
  • Similar to how Lego bricks come in various shapes and sizes, tokens can vary depending on the model's design.
  • Measure cost.

👉 Think of tokens as Lego blocks

Prompt Engineering - What is it?

Prompt engineering is the process of designing and optimizing prompts to better utilize LLMs. Well described prompts can help the AI models better understand the context and generate more accurate responses.

Features

  • Clarity and Specificity: Effective prompts are clear, concise, and specific about the task or desired response

  • Task Framing: Provide background information, specifying the desired output format (e.g., code, email, poem), or outlining specific requirements

  • Examples and Counter-Examples: Including relevant examples and counterexamples within the prompt can further guide the LLM

  • Instructional Language: Use clear and concise instructions to improve the LLM's understanding of what information to generate

Smart Charts with AI: Data Analysis Prompt

Chart Analysis with AI

By combining existing charts with AI-driven analysis, we can unlock deeper insights, automate interpretation, and empower users to make more informed decisions.

Data Analysis Flow:

  • Chart Data: Identify the key data points from the chart (e.g., x-axis values, y-axis values, data labels, limits).

  • Chart Prompt: Format this information in a concise and human-readable format, such as:

"We are looking at a control chart measuring Curvature data points…"

  • Analysis Prompt: Provide details about what would you like to learn from the data:

"Interpret this chart, data series, limits and action items to take?"

👉 LLM generated analysis is not perfect, if the prompt is not detail enough, hallucinations may occurred

Smart Charts with AI: Data Analysis Chart

Smart Charts Powered by AI - Summary

LLMs empower developers, data engineers/analysts, and scientists by enhancing data understanding through AI-driven chart analysis. To ensure accurate and insightful analysis, crafting detailed prompts is crucial.

  • Provide Chart Context:

    • Chart Type: (e.g., line chart, bar chart, pie chart, scatter plot)
    • Chart Title
    • Data Series
    • Data Ranges/Limits: (e.g., Time period, Upper/Lower Limits)
  • Provide Guiding Questions:

    • What is the overall trend of the data?
    • Are there any significant peaks or dips?
    • Are there any outliers or anomalies?
    • What are the key takeaways from this chart?
    • What actions, if any, should be considered?

👉 By framing the prompt with contextual and guiding questions, you effectively "train" the model to analyze the chart in a more human-like and insightful manner.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

Leave comments on this post or contact me at:

👍 Originally published by ozkary.com

10/31/24

A Hands-On Exploration into the discovery phase - Data Engineering Process Fundamentals

Overview

The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.

In this session, we will delve into the essential building blocks of data engineering, placing a spotlight on the discovery process. From framing the problem statement to navigating the intricacies of exploratory data analysis (EDA) using Python, VSCode, Jupyter Notebooks, and GitHub, you'll gain a solid understanding of the fundamental aspects that drive effective data engineering projects.

A Hands-On Exploration into the discovery phase - Data Engineering Process Fundamentals

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

Jupyter Notebook

👉 https://github.com/ozkary/data-engineering-mta-turnstile/blob/main/Step1-Discovery/mta_discovery.ipynb

  • Data engineering Series:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

Jupyter Notebook Preview

# Standard library imports
from time import time
from pathlib import Path
import requests
from io import StringIO
# Load pandas support for data analysis tasks, dataframe (two-dimensional data structure with rows and columns) management
import pandas as pd    
import numpy as np 

# URL of the file you want to download. Note: It should be a Saturday date
url = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_241026.txt'

# Download the file in memory
response = requests.get(url)
response.raise_for_status()  # Check if the request was successful

# Create a DataFrame from the downloaded content
data = StringIO(response.text)
df = pd.read_csv(data)

# Display the DataFrame first 10 rows
df.head(10)

# use info to get the column names, data type and null values
df.info()

# remove spaces and type case the columns
df.columns = [column.strip() for column in df.columns]
print(df.columns)
df["ENTRIES"] = df["ENTRIES"].astype(int)
df["EXITS"] = df["EXITS"].astype(int)

# Define the set of special characters you want to check for
special_characters_set = set('@#$%/')


def has_special_characters(col, special_characters):
    # Check if any character in the column name is not alphanumeric or in the specified set
    return any(char in special_characters for char in col)

def rename_columns(df, special_characters_set):
    # Create a mapping of old column names to new column names
    mapping = {col: ''.join(char for char in col if char.isalnum() or char not in special_characters_set) for col in df.columns}

    print(mapping)
    # Rename columns using the mapping
    df_renamed = df.rename(columns=mapping)

    return df_renamed


# Identify columns with special characters using list comprehension syntax
columns_with_special_characters = [col for col in df.columns if has_special_characters(col, special_characters_set)]

# Print the result
print("Columns with special characters:", columns_with_special_characters)

# Identify columns with special characters and rename them
df = rename_columns(df, special_characters_set)

# Display the data frame again. there should be no column name with special characters
print(df.info())

YouTube Video

Video Agenda

  1. Introduction:

    • Unveiling the importance of the discovery process in data engineering.

    • Setting the stage with a real-world problem statement that will guide our exploration.

  2. Setting the Stage:

    • Downloading and comprehending sample data to kickstart our discovery journey.

    • Configuring the development environment with VSCode and Jupyter Notebooks.

  3. Exploratory Data Analysis (EDA):

    • Delving deep into EDA techniques with a focus on the discovery phase.

    • Demonstrating practical approaches using Python to uncover insights within the data.

  4. Code-Centric Approach:

    • Advocating the significance of a code-centric approach during the discovery process.

    • Showcasing how a code-centric mindset enhances collaboration, repeatability, and efficiency.

  5. Version Control with GitHub:

    • Integrating GitHub seamlessly into our workflow for version control and collaboration.

    • Managing changes effectively to ensure a streamlined data engineering discovery process.

  6. Real-World Application:

    • Applying insights gained from EDA to address the initial problem statement.

    • Discussing practical solutions and strategies derived from the discovery process.

Key Takeaways:

  • Mastery of the foundational aspects of data engineering.

  • Hands-on experience with EDA techniques, emphasizing the discovery phase.

  • Appreciation for the value of a code-centric approach in the data engineering discovery process.

Some of the technologies that we will be covering:

  • Python
  • Data Analysis and Visualization
  • Jupyter Notebook
  • Visual Studio Code

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Topics

  • Importance of the Discovery Process
  • Setting the Stage - Technologies
  • Exploratory Data Analysis (EDA)
  • Code-Centric Approach
  • Version Control
  • Real-World Use Case

Follow this project: Give a star

👉 Data Engineering Process Fundamentals

Importance of the Discovery Process

The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.

  • Clearly document the problem statement to understand the challenges the project aims to address.
  • Make observations about the data, its structure, and sources during the discovery process.
  • Define project requirements based on the observations, enabling the team to understand the scope and goals.
  • Clearly outline the scope of the project, ensuring a focused and well-defined set of objectives.
  • Use insights from the discovery phase to inform the design of the solution, including data architecture.
  • Develop a robust project architecture that aligns with the defined requirements and scope.

Data Engineering Process Fundamentals - Discovery Process

Setting the Stage - Technologies

To set the stage, we need to identify and select the tools that can facilitate the analysis and documentation of the data. Here are key technologies that play a crucial role in this stage:

  • Python: A versatile programming language with rich libraries for data manipulation, analysis, and scripting.

Use Cases: Data download, cleaning, exploration, and scripting for automation.

  • Jupyter Notebooks: An interactive tool for creating and sharing documents containing live code, visualizations, and narrative text.

Use Cases: Exploratory data analysis, documentation, and code collaboration.

  • Visual Studio Code: A lightweight, extensible code editor with powerful features for source code editing and debugging.

Use Cases: Writing and debugging code, integrating with version control systems like GitHub.

  • SQL (Structured Query Language): A domain-specific language for managing and manipulating relational databases.

Use Cases: Querying databases, data extraction, and transformation.

Data Engineering Process Fundamentals - Discovery Tools

Exploratory Data Analysis (EDA)

EDA is our go-to method for downloading, analyzing, understanding and documenting the intricacies of the datasets. It's like peeling back the layers of information to reveal the stories hidden within the data. Here's what EDA is all about:

  • EDA is the process of analyzing data to identify patterns, relationships, and anomalies, guiding the project's direction.

  • Python and Jupyter Notebook collaboratively empower us to download, describe, and transform data through live queries.

  • Insights gained from EDA set the foundation for informed decision-making in subsequent data engineering steps.

  • Code written on Jupyter Notebook can be exported and used as the starting point for components for the data pipeline and transformation services.

Data Engineering Process Fundamentals - Discovery Pie Chart

Code-Centric Approach

A code-centric approach, using programming languages and tools in EDA, helps us understand the coding methodology for building data structures, defining schemas, and establishing relationships. This robust understanding seamlessly guides project implementation.

  • Code delves deep into data intricacies, revealing integration and transformation challenges often unclear with visual tools.

  • Using code taps into Pandas and Numpy libraries, empowering robust manipulation of data frames, establishment of loading schemas, and addressing transformation needs.

  • Code-centricity enables sophisticated analyses, covering aggregation, distribution, and in-depth examinations of the data.

  • While visual tools have their merits, a code-centric approach excels in hands-on, detailed data exploration, uncovering subtle nuances and potential challenges.

Data Engineering Process Fundamentals - Discovery Pie Chart

Version Control

Using a tool like GitHub is essential for effective version control and collaboration in our discovery process. GitHub enables us to track our exploratory code and Jupyter Notebooks, fostering collaboration, documentation, and comprehensive project management. Here's how GitHub enhances our process:

  • Centralized Tracking: GitHub centralizes tracking and managing our exploratory code and Jupyter Notebooks, ensuring a transparent and organized record of our data exploration.

  • Sharing: Easily share code and Notebooks with team members on GitHub, fostering seamless collaboration and knowledge sharing.

  • Documentation: GitHub supports Markdown, enabling comprehensive documentation of processes, findings, and insights within the same repository.

  • Project Management: GitHub acts as a project management hub, facilitating CI/CD pipeline integration for smooth and automated delivery of data engineering projects.

Data Engineering Process Fundamentals - Discovery Problem Statement

Summary: The Power of Discovery

By mastering the discovery phase, you lay a strong foundation for successful data engineering projects. A thorough understanding of your data is essential for extracting meaningful insights.

  • Understanding Your Data: The discovery phase is crucial for understanding your data's characteristics, quality, and potential.
  • Exploratory Data Analysis (EDA): Use techniques to uncover patterns, trends, and anomalies.
  • Data Profiling: Assess data quality, identify missing values, and understand data distributions.
  • Data Cleaning: Address data inconsistencies and errors to ensure data accuracy.
  • Domain Knowledge: Leverage domain expertise to guide data exploration and interpretation.
  • Setting the Stage: Choose the right language and tools for efficient data exploration and analysis.

The data engineering discovery process involves defining the problem statement, gathering requirements, and determining the scope of work. It also includes a data analysis exercise utilizing Python and Jupyter Notebooks or other tools to extract valuable insights from the data. These steps collectively lay the foundation for successful data engineering endeavors.

Thanks for reading.

Send question or comment at Twitter @ozkary 👍 Originally published by ozkary.com

8/21/24

Medallion Architecture: A Blueprint for Data Insights and Governance - Data Engineering Process Fundamentals

Overview

Gain understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Data Engineering Process Fundamentals - Medallion Architecture

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

  • Introduction to Medallion Architecture

    • Defining Medallion Architecture
    • Core Principles
    • Benefits of Medallion Architecture
  • The Raw Zone

    • Understanding the purpose of the Raw Zone
    • Best practices for data ingestion and storage
  • The Bronze Zone

    • Data transformation and cleansing
    • Creating a foundation for analysis
  • The Silver Zone

    • Data optimization and summarization
    • Preparing data for consumption
  • The Gold Zone

    • Curated data for insights and action
    • Enabling self-service analytics
  • Empowering Insights

    • Data-driven decision-making
    • Accelerated Insights
  • Data Governance

    • Importance of data governance in Medallion Architecture
    • Implementing data ownership and stewardship
    • Ensuring data quality and security

Why Attend:

Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.

  • Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
  • Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
  • Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
  • Scalability: The layered approach can accommodate growing data volumes and complexity.
  • Cost Efficiency: Optimized data storage and processing can reduce costs.

Data Engineering Process Fundamentals - Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.

  • Key Characteristics:
    • Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
    • Data is ingested as-is, without any cleaning or transformation
    • High volume and velocity
    • Data retention policies are crucial
  • Benefits:
    • Preserves original data for potential future analysis
    • Enables data reprocessing
    • Supports data lineage and auditability

Data Engineering Process Fundamentals - Medallion Architecture Raw Zone Diagram

Use case Background

The Metropolitan Transportation Authority (MTA) subway system in New York has stations around the city. All the stations are equipped with turnstiles or gates which tracks as each person enters (departure) or exits (arrival) the station.

  • The MTA subway system has stations around the city.
  • All the stations are equipped with turnstiles or gates which tracks as each person enters or leaves the station.
  • CSV files provide information about the amount of commuters per stations at different time slots.

Data Engineering Process Fundamentals - Data streaming MTA Gates

Problem Statement

In the city of New York, commuters use the Metropolitan Transportation Authority (MTA) subway system for transportation. There are millions of people that use this system every day; therefore, businesses around the subway stations would like to be able to use Geofencing advertisement to target those commuters or possible consumers and attract them to their business locations at peak hours of the day.

  • Geofencing is a location based technology service in which mobile devices’ electronic signal is tracked as it enters or leaves a virtual boundary (geo-fence) on a geographical location. Businesses around those locations would like to use this technology to increase their sales.
  • Businesses around those locations would like to use this technology to increase their sales by pushing ads to potential customers at specific times.

ozkary-data-engineering-mta-geo-fence

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.

  • Key Characteristics:
    • Data is cleansed and standardized
    • Basic transformations are applied (e.g., data type conversions, null handling)
    • Data is structured into tables or views
    • Data quality checks are implemented
    • Data retention policies may be shorter than the Raw Zone
  • Benefits:
    • Improves data quality and consistency
    • Provides a foundation for further analysis
    • Enables data exploration and discovery

Data Engineering Process Fundamentals - Medallion Architecture Bronze Zone Diagram

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

  • Key Characteristics:
    • Data is cleansed, standardized, and enriched
    • Data is structured for analytical purposes (e.g., normalized, de-normalized)
    • Data is optimized for query performance (e.g., partitioning, indexing)
    • Data is aggregated and summarized for specific use cases
  • Benefits:
    • Improved query performance
    • Supports self-service analytics
    • Enables advanced analytics and machine learning
    • Reduces query costs

Data Engineering Process Fundamentals - Medallion Architecture Silver Zone Diagram

The Gold Zone: Your Data's Final Destination

  • Definition: The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
  • Key Characteristics:
    • Data is highly refined, aggregated, and optimized for specific use cases
    • Data is often materialized for performance
    • Data is subject to rigorous quality checks and validation
    • Data is secured and governed
  • Benefits:
    • Enables rapid insights and decision-making
    • Supports self-service analytics and reporting
    • Provides a foundation for advanced analytics and machine learning
    • Reduces query latency

Data Engineering Process Fundamentals - Medallion Architecture Gold Zone Diagram

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

  • Key Characteristics:
    • Data is accessible and easily consumable
    • Supports various analytical tools and platforms (BI, ML, data science)
    • Enables self-service analytics
    • Drives business decisions and actions
  • Examples of Consumption Tools:
    • Business Intelligence (BI) tools (Looker, Tableau, Power BI)
    • Data science platforms (Python, R, SQL)
    • Machine learning platforms (TensorFlow, PyTorch)
    • Advanced analytics tools

Data Engineering Process Fundamentals - Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. Data Governance is essential for ensuring data quality, consistency, and security.

Key components of data governance include:

  • Data Lineage: Tracking data's journey from source to consumption.
  • Data Ownership: Defining who is responsible for data accuracy and usage.
  • Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
  • Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.

Data Engineering Process Fundamentals - Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.

Data Engineering Process Fundamentals - Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Data Governance : Metadata

Assigns the owner, steward and responsibilities of the data.

Data Engineering Process Fundamentals - Medallion Architecture Governance Metadata

Summary: Leverage Medallion Architecture for Success

  • Key Benefits:
    • Improved data quality
    • Enhanced governance
    • Accelerated insights
    • Scalability
    • Cost Efficiency.

Data Engineering Process Fundamentals - Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia Data Engineering Process Fundamentals - Book by Oscar Garcia

Thanks for reading.

Send question or comment at Twitter @ozkary 👍 Originally published by ozkary.com