Data Engineering Essentials
Modified Complete Course Content (Merged) + Prioritized Learning Path
-----------------------------------------------------------------------
CONTENTS
1. SQL Fundamentals & Data Warehousing (Core)
- Introduction to SQL for Data Engineering
- Overview of Application Architecture and RDBMS
- Overview of Database Technologies and relevance of SQL
- Overview of Purpose Built Databases
- Overview of Data Warehouse and Data Lake
- (INSERTED) Section: Data Modeling & Schema Design (High Priority)
* Data modeling fundamentals: OLTP vs OLAP
* Dimensional modeling: star and snowflake
* Slowly Changing Dimensions (SCD) types
* Fact tables, grain, surrogate keys
* Partitioning strategies, schema evolution
* Workshops & SCD Type-2 implementation using Delta/MERGE (see the MERGE sketch at the end of this section)
- Usage of RDBMS and Data Warehouse technologies
- Differences and Similarities between RDBMS and Data Warehouse Technologies
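Illustrative sketch for the SCD Type-2 workshop above, written as PySpark running Delta SQL. The table names (dim_customer, customer_stg) and columns (address, valid_from, valid_to, is_current) are assumptions for the example, not the course's actual data set; treat this as one possible shape of the exercise rather than the official solution.

# scd2_merge.py -- close out changed "current" rows, then insert fresh versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd2-demo").getOrCreate()

# Step 1: expire current dimension rows whose tracked attribute changed.
spark.sql("""
    MERGE INTO dim_customer AS t
    USING customer_stg AS s
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN UPDATE SET
      is_current = false,
      valid_to   = current_date()
""")

# Step 2: insert a new "current" version for customers with no current row
# (brand-new customers, plus the ones just expired in step 1).
spark.sql("""
    INSERT INTO dim_customer
    SELECT s.customer_id, s.name, s.address,
           current_date()    AS valid_from,
           DATE '9999-12-31' AS valid_to,
           true              AS is_current
    FROM customer_stg s
    LEFT JOIN dim_customer t
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHERE t.customer_id IS NULL
""")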
2. Postgres & Hands-On SQL (Existing)
- Overview of Postgres Database Server and pgAdmin
- Overview of Database Connection Details
- Overview of Connecting to External Databases using pgAdmin
- Create Application Database and User in Postgres Database Server
- Clone Data Sets from Git Repository for Database Scripts
- Register Server in pgAdmin using Application Database and User
- Setup Application Tables and Data in Postgres Database
- Overview of pgAdmin to write SQL Queries
- Review Data Model Diagram
- Define Problem Statement for SQL Queries
- Filtering Data using SQL Queries
- Total Aggregations using SQL Queries
- Group By Aggregations using SQL Queries
- Order of Execution of SQL Queries
- Rules and Restrictions to Group and Filter Data in SQL queries
- Filter Data based on Aggregated Results using Group By and Having
- Joins (Inner and Outer) and advanced filtering on join results
- Views, CTEs, CTAS, OVER / PARTITION BY, Ranking, and Exercises
- SQL Troubleshooting, Explain Plans, Indexing, Performance Tuning modules
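Illustrative sketch of the Group By / Having pattern from this section, run against Postgres from Python with psycopg2. The connection details and the orders table and columns are placeholders for whatever the course data set provides.

# group_by_having.py -- aggregate, then filter on the aggregate (HAVING),
# matching the order-of-execution topic in this section.
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="retail_db",
    user="retail_user", password="changeme",   # placeholder connection details
)

query = """
    SELECT   order_status,
             count(*) AS order_count
    FROM     orders
    WHERE    order_date >= DATE '2014-01-01'    -- row-level filter (WHERE)
    GROUP BY order_status
    HAVING   count(*) > 1000                    -- filter on the aggregated result
    ORDER BY order_count DESC
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for status, cnt in cur.fetchall():
        print(f"{status}: {cnt}")
conn.close()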
3. Data Modeling Workshop (detail)
- (continued Data Modeling exercises, SCD implementation, schema evolution)
4. Python Fundamentals & Development Practices
- Setup Visual Studio Workspace for Python Application Development
- VS Code Notebooks, Cells, Functions, Running by line
- Python basics: data types, lists, strings, loops, functions, file I/O
- JSON handling, Pandas basics, reading/writing CSV/JSON
- Pandas advanced: joins, aggregations, sorting, writing files
- Projects: File Format Converter, File→DB Loader (existing)
- Exception handling, environment variables, runtime args, logging
- (INSERTED) Section: Testing, Debugging & TDD for Data Pipelines
* Unit tests for Spark transformations (pytest)
* Integration tests & mocking external systems
* End-to-end test harness, test data management, CI gating
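Illustrative sketch of a pytest unit test for a Spark transformation, in the spirit of the testing section above. The transformation (add_order_revenue) and its columns are invented for the example.

# test_transformations.py -- pytest for a small PySpark transformation (sketch).
import pytest
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


def add_order_revenue(df: DataFrame) -> DataFrame:
    """Hypothetical transformation under test: revenue = quantity * unit_price."""
    return df.withColumn("revenue", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # Local Spark session shared across the test session; stopped when done.
    spark = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    yield spark
    spark.stop()


def test_add_order_revenue(spark):
    df = spark.createDataFrame(
        [(1, 2, 10.0), (2, 3, 5.0)],
        ["order_id", "quantity", "unit_price"],
    )
    result = {r["order_id"]: r["revenue"] for r in add_order_revenue(df).collect()}
    assert result == {1: 20.0, 2: 15.0}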
5. File Format Converter & File→DB Loader Projects (existing)
- Project 1: File Format Converter (setup, glob, regex, dynamic schema)
- Project 2: Files To Database Loader (chunked loads, multiprocessing)
- Deploy and troubleshoot file loader, performance tuning with chunksize
- Refactor for multiprocessing and validation
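Illustrative sketch of the chunked load in Project 2: pandas reads the CSV in chunks and appends each chunk to Postgres through SQLAlchemy. The connection string, file path, table name and chunk size are assumptions.

# files_to_db_loader.py -- chunked CSV -> Postgres load (sketch).
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and paths.
engine = create_engine("postgresql://retail_user:changeme@localhost:5432/retail_db")

CHUNK_SIZE = 10_000   # tune against available memory and network throughput

for i, chunk in enumerate(pd.read_csv("data/orders.csv", chunksize=CHUNK_SIZE)):
    chunk.to_sql(
        "orders",
        engine,
        if_exists="append" if i else "replace",  # recreate the table on the first chunk
        index=False,
        method="multi",                          # batched INSERTs instead of row-by-row
    )
    print(f"loaded chunk {i} ({len(chunk)} rows)")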
6. CI/CD, Containerization & Version Control (INSERTED)
- Git best practices, branching and PR workflows
- Unit testing patterns, pytest for Python + Spark
- GitHub Actions for CI to run tests and deploy artifacts
- Docker basics and containerizing Spark jobs for local testing
- Terraform basics for provisioning cloud resources (GCP examples)
- Demo: CI pipeline that builds, tests and deploys a Databricks job / Airflow DAG
- Notebook versioning best practices and converting notebooks to scheduled jobs
- Code review checklist, collaboration patterns and runbooks
7. Getting Started with GCP & Databricks (existing)
- Pre-requisite Skills, signing up, GCP credits, Cloud Shell, gcloud SDK
- Analytics Services on GCP, Databricks on GCP setup, workspaces, clusters
- Databricks CLI, DBFS, creating tables, temp views, Spark SQL examples
- (INSERTED) Section: Cloud Managed Data Warehouses & Managed Services
* Overview: BigQuery, Snowflake, Redshift
* Loading patterns: batch vs streaming ingestion
* Cost & performance considerations: partitioning, clustering
* Integrating Databricks/Spark with BigQuery or Snowflake (connectors)
* Hands-on: load Delta/Parquet data to BigQuery and run queries
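Illustrative sketch for the hands-on item above, assuming the spark-bigquery connector is attached to the cluster; the project, dataset, table and staging bucket names are placeholders.

# load_to_bigquery.py -- push curated data from the lake into BigQuery (sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-bigquery").getOrCreate()

df = spark.read.format("delta").load("/mnt/datalake/curated/orders")   # or .parquet(...)

(df.write
   .format("bigquery")
   .option("table", "my_project.analytics.orders")    # project.dataset.table (placeholder)
   .option("temporaryGcsBucket", "my-temp-bucket")    # staging bucket for the load
   .mode("overwrite")
   .save())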
8. Spark SQL & DataFrames (existing)
- Spark SQL functions: string/date/numeric/null handling, case/when, aggregation
- Basic transformations, filtering, GROUP BY, ORDER BY, joins, ranking, JSON processing
- Copying results into metastore tables (CTAS/INSERT/MERGE)
- Spark DataFrame API: select, withColumn, joins, aggregations, sorting, nulls handling
- Integration of Spark SQL and DataFrame APIs; manage metastore objects
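Illustrative sketch of the same aggregation expressed through the DataFrame API and through Spark SQL on a temp view; the input path and column names are assumptions.

# spark_sql_vs_df.py -- the same daily-revenue aggregation in both APIs (sketch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-vs-df").getOrCreate()

orders = spark.read.parquet("/data/retail/order_items")   # assumed path and columns

# DataFrame API
daily_df = (orders
            .filter(F.col("order_status") == "COMPLETE")
            .groupBy("order_date")
            .agg(F.round(F.sum("order_item_subtotal"), 2).alias("revenue"))
            .orderBy("order_date"))

# Same result through Spark SQL on a temp view
orders.createOrReplaceTempView("order_items")
daily_sql = spark.sql("""
    SELECT order_date, round(sum(order_item_subtotal), 2) AS revenue
    FROM order_items
    WHERE order_status = 'COMPLETE'
    GROUP BY order_date
    ORDER BY order_date
""")

daily_df.show(5)
daily_sql.show(5)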
9. (INSERTED) Streaming & Real-time Processing (High Priority)
- Stream processing concepts & use cases
- Kafka fundamentals & cloud Pub/Sub overview
- Schema design for streaming: Avro/Protobuf + registry
- Spark Structured Streaming: micro-batch vs continuous
- Hands-on: Kafka → Spark Structured Streaming → Delta
- Windowed aggregations, stateful processing, checkpointing
- Exactly-once semantics, watermarking, late data handling
- Streaming performance tuning, monitoring and troubleshooting
- Exercise: end-to-end streaming pipeline and solution walkthrough
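Illustrative skeleton for the end-to-end streaming exercise above: JSON events from Kafka, a watermarked windowed count, and an append to Delta with checkpointing. It assumes the Kafka and Delta connectors are available on the cluster; broker address, topic name, schema and output paths are placeholders.

# streaming_pipeline.py -- Kafka -> Structured Streaming -> Delta (sketch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("events-stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .option("startingOffsets", "latest")
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# 10-minute tumbling-window counts; accept events arriving up to 15 minutes late.
counts = (events
          .withWatermark("event_ts", "15 minutes")
          .groupBy(F.window("event_ts", "10 minutes"), "event_type")
          .count())

query = (counts.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/event_counts")
         .start("/tmp/delta/event_counts"))

query.awaitTermination()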
10. Databricks Workflows, ELT Pipelines & Jobs (existing)
- Pass arguments to notebooks (Python & SQL), create & run first Databricks job
- Run jobs & tasks with parameters, orchestrated pipelines using Databricks Jobs
- Import ELT apps into Databricks, build workflows, review execution details
- (INSERTED) Orchestration & Workflow Management (Airflow / Prefect)
* DAGs, Airflow fundamentals, operators/sensors/XCom, scheduling
* Airflow integrations: DatabricksOperator, KubernetesPodOperator
* Prefect basics and comparison with Airflow
* Deploying Airflow (managed vs self-hosted), observability, CI for DAGs
* Exercise: orchestrate an ELT pipeline (schedule + backfill + alerts)
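Illustrative sketch of a minimal Airflow DAG for the orchestration exercise above: a daily ingest → transform → load chain with retries. The task callables are stand-ins; in this course's context the transform step would more likely call a Databricks job through the Databricks provider operators.

# elt_dag.py -- minimal Airflow DAG for a daily ELT run (sketch).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest(**context):       # placeholder callables
    print("pull raw files into the landing bucket")


def transform(**context):
    print("run the Spark/Databricks transformation job")


def load(**context):
    print("load curated data into BigQuery/Snowflake")


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # Airflow 2.4+; older versions use schedule_interval
    catchup=False,            # set True to allow backfills
    default_args=default_args,
) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3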
11. Spark Performance Tuning & Explain Plans (existing)
- Catalyst optimizer overview, explain plans for DataFrames and SQL
- Interpret explain plans, Spark architecture, broadcasting, filter pushdown
- Cluster config: all-purpose vs jobs clusters, autoscaling, executors and executor memory
- Partitioning, columnar formats (Parquet/Delta), schema inference, partition pruning
- Performance assessment for Parquet, shuffling, adaptive query execution, dynamic allocation
- Review Spark UI, job details, YARN logs, and performance tuning scenarios
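Illustrative sketch of reading an explain plan and forcing a broadcast join, as covered above; the input paths are assumptions.

# explain_and_broadcast.py -- inspect the physical plan and broadcast a small dimension.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

orders = spark.read.parquet("/data/retail/orders")       # large fact table (assumed path)
statuses = spark.read.parquet("/data/retail/statuses")   # small dimension (assumed path)

joined = orders.join(F.broadcast(statuses), "order_status")

# "formatted" prints the physical plan with node details; look for
# BroadcastHashJoin (vs SortMergeJoin) and PushedFilters on the Parquet scan.
joined.filter(F.col("order_date") >= "2014-01-01").explain(mode="formatted")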
12. Data Storage & Lakehouse (existing)
- Delta Lake: managed vs external tables, CRUD, MERGE semantics
- Creating Delta tables, copy data to metastore, insert, validate
- Columnar file formats, folder structure for partitioned data, parquet best practices
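Illustrative sketch of the partitioned folder layout described above: write a Delta table partitioned by month and register it as an external table so partition-column filters prune files. Paths, database name and columns are assumptions.

# write_partitioned_delta.py -- partitioned Delta write as an external table (sketch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

orders = (spark.read.parquet("/data/retail/orders")                          # assumed source
          .withColumn("order_month", F.date_format("order_date", "yyyy-MM")))

(orders.write
   .format("delta")                 # or "parquet" for a plain columnar layout
   .partitionBy("order_month")      # creates order_month=YYYY-MM/ folders
   .mode("overwrite")
   .save("/mnt/datalake/silver/orders"))

# Register the location as an external table in the metastore.
spark.sql("CREATE DATABASE IF NOT EXISTS silver")
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.orders
    USING DELTA
    LOCATION '/mnt/datalake/silver/orders'
""")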
13. Hadoop, HDFS & Hive (existing)
- Dataproc clusters, single node & multinode setups, HDFS commands, file blocks, replication
- Hive applications and scripts, partitioned parquet tables, staging in HDFS
- Scheduling Hive jobs using crontab, developing shell wrappers
14. Advanced Spark & Deployment (existing)
- Spark submit modes, dependencies as packages/jars, submit scripts
- Logging in Spark apps (python logging), validating logs client/cluster mode
- Running spark applications with/without AQE, dynamic allocation, partitions
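Illustrative sketch of the python logging setup mentioned above: in client mode these lines appear on the console, in cluster mode they end up in the YARN/driver logs.

# app_logging.py -- basic python logging inside a PySpark driver (sketch).
import logging
from pyspark.sql import SparkSession

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("file_converter")

spark = SparkSession.builder.appName("logging-demo").getOrCreate()

logger.info("starting conversion job")
df = spark.range(1_000)                     # stand-in for the real input
logger.info("row count = %d", df.count())   # driver-side log line
logger.info("job finished")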
15. Performance & Cluster Operations (existing)
- Compute capacity, YARN capacity, Spark History Server, Spark UI deep dives
- Generate test data, WordCount app, disable/override dynamic allocation, shuffling
16. Observability, Data Quality & Lineage (INSERTED)
- Why data quality matters, Great Expectations overview
- Add validations into pipelines (batch & streaming)
- Observability: metrics, logs, tracing, and integrating with Prometheus/Datadog
- Lineage & metadata: OpenLineage, Amundsen, data catalog basics
- Demo: add GE validations to File→DB loader and dashboarding
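Illustrative sketch of adding Great Expectations checks in front of the File→DB loader, using the legacy pandas-flavoured GE API (pre-1.0 releases); newer GE versions restructure this around validation definitions. The file path, columns and expectations are assumptions.

# validate_before_load.py -- data quality gate for the File->DB loader (sketch).
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.read_csv("data/orders.csv"))   # legacy pandas-style GE dataset

checks = [
    df.expect_column_values_to_not_be_null("order_id"),
    df.expect_column_values_to_be_unique("order_id"),
    df.expect_column_values_to_be_in_set(
        "order_status", ["COMPLETE", "PENDING", "CLOSED", "CANCELED"]
    ),
]

if not all(c.success for c in checks):
    raise ValueError("data quality checks failed; aborting load")

print("all expectations passed; safe to load")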
17. Security, Governance & Compliance (INSERTED)
- IAM and access controls (GCP/AWS concepts), encryption at rest/in transit
- Row/column-level security, masking, PII handling, GDPR basics
- Auditing, lineage, ownership and governance operational checks
18. Linux, Shell Scripts & Automation (existing)
- SSH, PATH, mkdir/cp/mv/rm, find, grep, shell scripts, debug scripts with args
- Hadoop/Spark executables, start/stop clusters, VS Code remote setups
19. Final Projects & Capstone Integration
- Project 1: Batch ELT pipeline (CSV→Parquet→Delta→Databricks job): implement partitioning, MERGE, performance tuning
- Project 2: Streaming analytics demo (Simulated events→Kafka→Spark Structured Streaming→Delta + dashboard)
- Project 3: Orchestrated pipeline with Airflow (Ingest→Transform→Load to BigQuery or Snowflake; DAGs, monitoring, retries)
- Each project includes: README, architecture diagram, run instructions, sample data, tests, CI config
20. Wrap-up: best practices, interview prep checklist, next steps
-----------------------------------------------------------------------
PRIORITIZED LEARNING PATH (Mapped to the 3 Portfolio Projects)
Goal: produce 3 interview-ready projects. Below is a prioritized, time-boxed plan with milestones.
Project A — Batch ELT pipeline (Priority: High)
Target skills: SQL, Postgres, Pandas, Spark, Databricks, Delta, Partitioning, Explain Plans, Indexing, CI.
Milestones:
Week 1: Review SQL fundamentals & Data Modeling (sections 1–3). Create a star schema for an e-commerce data set.
Week 2: Implement File Format Converter (Project 1) using Pandas. Add unit tests.
Week 3: Build Spark job to convert CSV→Parquet→Delta, create partitioned target table.
Week 4: Create Databricks job, add Explain Plan analysis, performance tuning (partition pruning).
Week 5: Add CI (GitHub Actions), containerize small test harness, add Great Expectations checks.
Deliverables: repo with README, DAG (if used), Databricks job config, CI, tests, screenshots of Spark UI/explain plan.
Project B — Streaming analytics (Priority: High)
Target skills: Kafka / Pub/Sub, Spark Structured Streaming, schema registry, windowing, exactly-once semantics, monitoring.
Milestones:
Week 1: Study Streaming fundamentals & Kafka (INSERTED streaming section).
Week 2: Build a local Kafka producer and topic; set up schema registry (Avro).
Week 3: Develop Spark Structured Streaming job to consume, aggregate (windowed counts), write to Delta.
Week 4: Add checkpointing, watermarking; test late data handling; scale locally (multiple partitions).
Week 5: Add monitoring (expose metrics), logging, and end-to-end validation with Great Expectations.
Deliverables: repo, sample event generator, Spark Structured Streaming code, README, dashboard screenshots.
Project C — Orchestrated ELT + Warehouse (Priority: Medium)
Target skills: Airflow/Prefect, BigQuery/Snowflake, CI/CD, Terraform, lineage.
Milestones:
Week 1: Learn Airflow basics & set up local Airflow or managed Composer.
Week 2: Create DAG to run Project A Spark job + Project B streaming validation steps.
Week 3: Add operator to load processed data into BigQuery or Snowflake; add tests.
Week 4: Add Terraform basics for provisioning minimal infra (bucket, service account).
Week 5: Add lineage metadata (OpenLineage) and data catalog entries.
Deliverables: DAG repo, Terraform config, documentation, screenshots of DAG UI and BigQuery/Snowflake queries.
Recommended order of study (fastest path to hireable skillset):
1) SQL & Data Modeling + Postgres hands-on (2 weeks)
2) Python & Pandas + File→DB loader (1 week)
3) Spark fundamentals + Databricks (2 weeks)
4) Streaming (Spark Structured Streaming + Kafka) (2 weeks)
5) Orchestration (Airflow) + CI/CD + Testing (2 weeks)
6) Cloud Warehouse (BigQuery/Snowflake) + governance & security (1–2 weeks)
7) Final polish: tests, CI, documentation, public repos (1 week)
-----------------------------------------------------------------------
HOW TO USE THIS PDF
- Follow the prioritized path for fastest hiring readiness.
- Build each project in a public GitHub repo with clear READMEs and run instructions.
- Keep each project small but production aware (tests, CI, basic infra).
- During interviews, show architecture diagrams and explain tradeoffs, costs, and scaling decisions.
-----------------------------------------------------------------------
END OF DOCUMENT