Lab 1: Data Ingestion & Storage
Task 1: Download a Real-world Dataset
Dataset: New York City Taxi Trip Data
Download: NYC Taxi Data (Parquet format)
Alternative: Kaggle Datasets (CSV format)
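A minimal download sketch in Python, assuming the dataset is fetched as a single monthly Parquet file; the URL below is an example and should be verified against the current NYC TLC trip record page before running:

```python
import urllib.request

# Example URL for one month of yellow-taxi trip records (Parquet).
# NOTE: this URL is an assumption -- check the NYC TLC trip record
# data page for the current link before running.
URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"

urllib.request.urlretrieve(URL, "yellow_tripdata_2023-01.parquet")
print("Saved yellow_tripdata_2023-01.parquet")
```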
Task 2: Load Data into a Local Database
• Install and use PostgreSQL (or SQLite) as the database.
• Write a Python script to load data into the database.
Resources:
• PostgreSQL Installation Guide
• Pandas to PostgreSQL (Tutorial)
• SQLite Quickstart
Practice Steps:
1. Install PostgreSQL or SQLite.
2. Use Pandas to read the dataset.
3. Write a Python script to insert the data into the database (see the sketch below).
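A minimal loading sketch, assuming the Parquet file from Task 1 and a table named trips (both names are illustrative). SQLite is used here because it needs no server; the commented lines show the equivalent PostgreSQL path via SQLAlchemy:

```python
import sqlite3
import pandas as pd

# Read the dataset downloaded in Task 1 (requires pyarrow or fastparquet).
df = pd.read_parquet("yellow_tripdata_2023-01.parquet")

# Option A: SQLite -- a single local file, no server needed.
# pandas creates the table automatically.
conn = sqlite3.connect("taxi.db")
df.to_sql("trips", conn, if_exists="replace", index=False)
conn.close()

# Option B: PostgreSQL -- the same to_sql call through a SQLAlchemy engine
# (connection string is a placeholder; adjust user/password/database).
# from sqlalchemy import create_engine
# engine = create_engine("postgresql://user:password@localhost:5432/taxi")
# df.to_sql("trips", engine, if_exists="replace", index=False)
```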
Lab 2: Data Processing & Transformation
Task 3: Transform Data Using Pandas & SQL
• Filter out invalid data (e.g., negative trip distances).
• Convert datetime columns into proper formats.
• Aggregate data (e.g., average fare per trip).
Resources:
• SQL Basics (W3Schools)
• Pandas Data Transformations
Practice Steps:
1. Write SQL queries to clean the data.
2. Perform aggregations using Pandas (see the sketch below).
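A minimal transformation sketch, assuming the trips table loaded in Task 2; the column names (trip_distance, fare_amount, tpep_pickup_datetime) follow the yellow-taxi schema and may differ for other datasets:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("taxi.db")

# Clean with SQL: drop rows with non-positive distances or fares.
query = """
    SELECT tpep_pickup_datetime, trip_distance, fare_amount
    FROM trips
    WHERE trip_distance > 0
      AND fare_amount > 0
"""
df = pd.read_sql_query(query, conn)
conn.close()

# Convert the pickup column to proper datetimes.
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])

# Aggregate with Pandas: one example is the average fare per pickup day.
daily_avg_fare = (
    df.set_index("tpep_pickup_datetime")["fare_amount"]
      .resample("D")
      .mean()
)
print(daily_avg_fare.head())
```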
Lab 3: Data Orchestration with Apache Airflow
Task 4: Automate Data Processing with Airflow
• Install Apache Airflow (pip install apache-airflow).
• Create an Airflow DAG (Directed Acyclic Graph) to automate:
  • Ingesting the raw dataset.
  • Transforming the data using SQL.
  • Storing the results in the database.
Resources:
• Airflow Quickstart Guide
• Airflow DAGs Tutorial
Practice Steps:
1. Install and configure Airflow.
2. Write a DAG to automate data ingestion & transformation.
3. Schedule the DAG to run at a fixed interval (e.g., every 5 minutes or every hour), as in the sketch below.
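A minimal DAG sketch, assuming Airflow 2.x; the three task functions are placeholders for the ingest/transform/store logic from the labs above, and the hourly schedule can be swapped for any interval:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    pass  # placeholder: download the raw dataset (Task 1)

def transform():
    pass  # placeholder: run the SQL cleaning and aggregation (Task 3)

def store():
    pass  # placeholder: write results back to the database (Task 2)

with DAG(
    dag_id="taxi_pipeline",
    start_date=datetime(2024, 1, 1),
    # Airflow 2.4+ uses "schedule"; older 2.x versions use "schedule_interval".
    schedule=timedelta(hours=1),  # or timedelta(minutes=5)
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    store_task = PythonOperator(task_id="store", python_callable=store)

    # Run ingestion, then transformation, then storage, in order.
    ingest_task >> transform_task >> store_task
```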
Additional Resources for Downloading Notebooks & Datasets
Open Datasets
1. Kaggle – [Link]
2. Google Dataset Search – [Link]
3. AWS Open Data – [Link]
4. NYC Taxi Data – [Link]
Jupyter Notebooks & Tutorials
1. DataTalksClub Data Engineering Zoomcamp – [Link]
2. Data Engineering Notebooks (GitHub) – [Link]
3. Pandas & SQL Practice Notebooks – [Link]
4. Apache Airflow Examples – [Link]
What You Will Have Built in the 3 Labs Above:
Ingested a real dataset into a database (PostgreSQL or SQLite).
Transformed & cleaned data using Pandas & SQL.
Automated data processing with Apache Airflow.
Created a reproducible data pipeline for ML.
📌 What's Next?
If you have more time, try these:
Deploy your pipeline on the cloud (AWS/GCP/Azure).
Use Kafka for real-time data ingestion.
Implement a Feature Store with Feast.