
Data Engineering Essentials

Modified Complete Course Content (Merged) + Prioritized Learning Path


-----------------------------------------------------------------------

CONTENTS

1. SQL Fundamentals & Data Warehousing (Core)


- Introduction to SQL for Data Engineering
- Overview of Application Architecture and RDBMS
- Overview of Database Technologies and relevance of SQL
- Overview of Purpose Built Databases
- Overview of Data Warehouse and Data Lake
- (INSERTED) Section: Data Modeling & Schema Design (High Priority)
* Data modeling fundamentals: OLTP vs OLAP
* Dimensional modeling: star and snowflake
* Slowly Changing Dimensions (SCD) types
* Fact tables, grain, surrogate keys
* Partitioning strategies, schema evolution
* Workshops & SCD Type-2 implementation using Delta MERGE (a MERGE sketch follows at the end of this section)
- Usage of RDBMS and Data Warehouse technologies
- Differences and Similarities between RDBMS and Data Warehouse Technologies
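
To make the SCD Type-2 item above concrete, here is a minimal PySpark sketch of the common two-step MERGE pattern against a Delta dimension table. The table and column names (dim_customer, stg_customer, name, city, is_current, ...) are illustrative assumptions, not part of the course material.

```python
# Simplified SCD Type-2 upsert into a Delta dimension table using Spark SQL MERGE.
# Assumes hypothetical objects: a Delta table dim_customer (customer_id, name, city,
# effective_date, end_date, is_current) and a staging table/view stg_customer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd2-merge").getOrCreate()

# Step 1: close out current rows whose tracked attributes changed.
spark.sql("""
    MERGE INTO dim_customer AS d
    USING stg_customer AS s
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHEN MATCHED AND (d.name <> s.name OR d.city <> s.city) THEN
      UPDATE SET d.is_current = false, d.end_date = current_date()
""")

# Step 2: insert a fresh "current" row for brand-new keys and for keys whose
# current row was just expired in step 1.
spark.sql("""
    INSERT INTO dim_customer
    SELECT s.customer_id, s.name, s.city,
           current_date() AS effective_date,
           NULL AS end_date,
           true AS is_current
    FROM stg_customer s
    LEFT JOIN dim_customer d
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHERE d.customer_id IS NULL
""")
```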

2. Postgres & Hands-On SQL (Existing)


- Overview of Postgres Database Server and pgAdmin
- Overview of Database Connection Details
- Overview of Connecting to External Databases using pgAdmin
- Create Application Database and User in Postgres Database Server
- Clone Data Sets from Git Repository for Database Scripts
- Register Server in pgAdmin using Application Database and User
- Setup Application Tables and Data in Postgres Database
- Overview of pgAdmin to write SQL Queries
- Review Data Model Diagram
- Define Problem Statement for SQL Queries
- Filtering Data using SQL Queries
- Total Aggregations using SQL Queries
- Group By Aggregations using SQL Queries
- Order of Execution of SQL Queries
- Rules and Restrictions to Group and Filter Data in SQL queries
- Filter Data based on Aggregated Results using Group By and Having
- Joins (Inner and Outer) and advanced filtering on join results
- Views, CTEs, CTAS, OVER / PARTITION BY, Ranking, and Exercises
- SQL Troubleshooting, Explain Plans, Indexing, Performance Tuning modules
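
As a taste of the querying topics above (GROUP BY/HAVING, CTEs and window ranking), here is a hedged sketch that runs one such query against Postgres from Python. The connection string and the orders/order_items schema are hypothetical placeholders.

```python
# Run a GROUP BY / HAVING query with a ranking window function against Postgres.
# Connection details and table/column names are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://retail_user:change-me@localhost:5432/retail_db")

query = """
    WITH customer_revenue AS (
        SELECT o.order_customer_id,
               SUM(oi.order_item_subtotal) AS revenue
        FROM orders o
        JOIN order_items oi
          ON oi.order_item_order_id = o.order_id
        WHERE o.order_status IN ('COMPLETE', 'CLOSED')
        GROUP BY o.order_customer_id
        HAVING SUM(oi.order_item_subtotal) >= 1000   -- filter on the aggregate
    )
    SELECT order_customer_id,
           revenue,
           RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
    FROM customer_revenue
    ORDER BY revenue_rank
"""

df = pd.read_sql(query, engine)
print(df.head(10))
```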

3. Data Modeling Workshop (detail)


- (continued Data Modeling exercises, SCD implementation, schema evolution)

4. Python Fundamentals & Development Practices


- Setup Visual Studio Workspace for Python Application Development
- VS Code Notebooks, Cells, Functions, Running by line
- Python basics: data types, lists, strings, loops, functions, file I/O
- JSON handling, Pandas basics, reading/writing CSV/JSON
- Pandas advanced: joins, aggregations, sorting, writing files
- Projects: File Format Converter, File→DB Loader (existing)
- Exception handling, environment variables, runtime args, logging
- (INSERTED) Section: Testing, Debugging & TDD for Data Pipelines
* Unit tests for Spark transformations (pytest)
* Integration tests & mocking external systems
* End-to-end test harness, test data management, CI gating
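
The testing topics above translate into fairly small pytest code. Below is a minimal sketch of a unit test for a Spark transformation; the function under test (filter_completed_orders) and the local-session fixture are assumptions for illustration.

```python
# Unit-testing a Spark transformation with pytest and a small local SparkSession.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def filter_completed_orders(df):
    """Hypothetical transformation under test: keep only COMPLETE/CLOSED orders."""
    return df.filter(F.col("order_status").isin("COMPLETE", "CLOSED"))


@pytest.fixture(scope="session")
def spark():
    # Small local session shared across the test session.
    return (SparkSession.builder
            .master("local[2]")
            .appName("pytest-spark")
            .getOrCreate())


def test_filter_completed_orders(spark):
    data = [(1, "COMPLETE"), (2, "PENDING"), (3, "CLOSED")]
    df = spark.createDataFrame(data, ["order_id", "order_status"])

    result = filter_completed_orders(df)

    assert result.count() == 2
    assert {row.order_id for row in result.collect()} == {1, 3}
```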

5. File Format Converter & File→DB Loader Projects (existing)


- Project 1: File Format Converter (setup, glob, regex, dynamic schema)
- Project 2: Files To Database Loader (chunked loads, multiprocessing)
- Deploy and troubleshoot the file loader, performance tuning with chunksize (a chunked-load sketch follows this section)
- Refactor for multiprocessing and validation
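
A hedged sketch of the chunked file-to-database load pattern from Project 2, assuming a local CSV and a Postgres target reachable through a SQLAlchemy URL; the file path, column names and credentials are placeholders.

```python
# Load a CSV into Postgres in chunks with pandas; all names are illustrative.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://retail_user:change-me@localhost:5432/retail_db")

chunks = pd.read_csv(
    "data/orders.csv",
    names=["order_id", "order_date", "order_customer_id", "order_status"],
    chunksize=10_000,          # tune against memory and database round-trips
)

for i, chunk in enumerate(chunks):
    chunk.to_sql(
        "orders",
        engine,
        if_exists="replace" if i == 0 else "append",  # recreate table on first chunk
        index=False,
    )
    print(f"loaded chunk {i} with {len(chunk)} rows")
```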

6. CI/CD, Containerization & Version Control (INSERTED)


- Git best practices, branching and PR workflows
- Unit testing patterns, pytest for Python + Spark
- GitHub Actions for CI to run tests and deploy artifacts
- Docker basics and containerizing Spark jobs for local testing
- Terraform basics for provisioning cloud resources (GCP examples)
- Demo: CI pipeline that builds, tests and deploys a Databricks job / Airflow DAG
- Notebook versioning best practices and converting notebooks to scheduled jobs
- Code review checklist, collaboration patterns and runbooks

7. Getting Started with GCP & Databricks (existing)


- Pre-requisite Skills, signing up, GCP credits, Cloud Shell, gcloud SDK
- Analytics Services on GCP, Databricks on GCP setup, workspaces, clusters
- Databricks CLI, DBFS, creating tables, temp views, Spark SQL examples
- (INSERTED) Section: Cloud Managed Data Warehouses & Managed Services
* Overview: BigQuery, Snowflake, Redshift
* Loading patterns: batch vs streaming ingestion
* Cost & performance considerations: partitioning, clustering
* Integrating Databricks/Spark with BigQuery or Snowflake (connectors)
* Hands-on: load Delta/Parquet data to BigQuery and run queries (see the load sketch below)
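
The BigQuery hands-on above can be approximated with the google-cloud-bigquery client. The sketch below loads Parquet files from a GCS path into a table; the project, dataset, bucket and table names are assumptions.

```python
# Load Parquet files from Cloud Storage into BigQuery; identifiers are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials / gcloud auth

table_id = "my-project.retail.orders"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/retail/orders/*.parquet",
    table_id,
    job_config=job_config,
)
load_job.result()  # block until the load job finishes

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```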

8. Spark SQL & DataFrames (existing)


- Spark SQL functions: string/date/numeric/null handling, case/when, aggregation
- Basic transformations, filtering, GROUP BY, ORDER BY, joins, ranking, JSON processing
- Copying results into metastore tables (CTAS/INSERT/MERGE)
- Spark DataFrame API: select, withColumn, joins, aggregations, sorting, null handling
- Integration of Spark SQL and DataFrame APIs; managing metastore objects
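
A short sketch tying together the DataFrame operations listed above (withColumn, join, filter, aggregation, null handling, sorting); the orders/order_items metastore tables and their columns are hypothetical.

```python
# Daily revenue from a hypothetical orders/order_items pair of metastore tables.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-basics").getOrCreate()

orders = spark.read.table("orders")
order_items = spark.read.table("order_items")

daily_revenue = (
    orders
    .withColumn("order_date", F.to_date("order_date"))
    .join(order_items, orders.order_id == order_items.order_item_order_id)
    .filter(F.col("order_status").isin("COMPLETE", "CLOSED"))
    .groupBy("order_date")
    .agg(F.round(F.sum("order_item_subtotal"), 2).alias("revenue"))
    .na.fill({"revenue": 0.0})                 # null handling on the aggregate
    .orderBy(F.col("order_date").desc())
)

daily_revenue.show(10)
```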

9. (INSERTED) Streaming & Real-time Processing (High Priority)


- Stream processing concepts & use cases
- Kafka fundamentals & cloud Pub/Sub overview
- Schema design for streaming: Avro/Protobuf + registry
- Spark Structured Streaming: micro-batch vs continuous
- Hands-on: Kafka → Spark Structured Streaming → Delta (sketched after this list)
- Windowed aggregations, stateful processing, checkpointing
- Exactly-once semantics, watermarking, late data handling
- Streaming performance tuning, monitoring and troubleshooting
- Exercise: end-to-end streaming pipeline and solution walkthrough
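
A minimal sketch of the Kafka → Spark Structured Streaming → Delta flow with watermarking and a windowed aggregation. The broker address, topic, event schema and output paths are assumptions, and the cluster needs the Kafka source and Delta packages available.

```python
# Consume JSON events from Kafka, aggregate per 5-minute window, write to Delta.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("events-stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .option("startingOffsets", "latest")
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Tolerate 10 minutes of late data, then count and sum per 5-minute window.
windowed = (events
            .withWatermark("event_ts", "10 minutes")
            .groupBy(F.window("event_ts", "5 minutes"), "event_type")
            .agg(F.count("*").alias("event_count"),
                 F.sum("amount").alias("total_amount")))

query = (windowed.writeStream
         .format("delta")
         .outputMode("append")                        # rows emitted once windows close
         .option("checkpointLocation", "/tmp/checkpoints/events_agg")
         .start("/tmp/delta/events_agg"))

query.awaitTermination()   # streaming job runs until stopped
```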

10. Databricks Workflows, ELT Pipelines & Jobs (existing)


- Pass arguments to notebooks (Python & SQL), create & run first Databricks job
- Run jobs & tasks with parameters, orchestrated pipelines using Databricks Jobs
- Import ELT apps into Databricks, build workflows, review execution details
- (INSERTED) Orchestration & Workflow Management (Airflow / Prefect)
* DAGs, Airflow fundamentals, operators/sensors/XCom, scheduling
* Airflow integrations: DatabricksOperator, KubernetesPodOperator
* Prefect basics and comparison with Airflow
* Deploying Airflow (managed vs self-hosted), observability, CI for DAGs
* Exercise: orchestrate an ELT pipeline (schedule + backfill + alerts); a DAG sketch follows
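
A hedged sketch of a small Airflow 2.x DAG that triggers a Databricks job and then runs a validation step, roughly the shape of the orchestration exercise above. The job id, connection id and callable are illustrative, and the Databricks operator comes from the apache-airflow-providers-databricks package.

```python
# Minimal daily DAG: run a Databricks job, then a Python validation task.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator


def validate_load(**context):
    # Placeholder for row-count / freshness checks against the target table.
    print("validating load for", context["ds"])


with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",          # 02:00 daily
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_elt = DatabricksRunNowOperator(
        task_id="run_databricks_elt_job",
        databricks_conn_id="databricks_default",
        job_id=12345,              # hypothetical Databricks job id
    )

    validate = PythonOperator(
        task_id="validate_load",
        python_callable=validate_load,
    )

    run_elt >> validate
```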

11. Spark Performance Tuning & Explain Plans (existing)


- Catalyst optimizer overview, explain plans for DataFrames and SQL
- Interpret explain plans, Spark architecture, broadcasting, filter pushdown
- Cluster config: all-purpose vs jobs clusters, autoscaling, executors and executor memory
- Partitioning, columnar formats (Parquet/Delta), schema inference, partition pruning
- Performance assessment for Parquet, shuffling, adaptive query execution, dynamic allocation
- Review Spark UI, job details, YARN logs, and performance tuning scenarios
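
A short sketch of the tuning workflow above: enable adaptive query execution, hint a broadcast join, and inspect the formatted explain plan. The table names are illustrative.

```python
# Inspect a plan with AQE enabled and a broadcast join hint.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")                    # AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

orders = spark.read.table("orders")          # large fact table (hypothetical)
customers = spark.read.table("customers")    # small dimension (hypothetical)

joined = orders.join(F.broadcast(customers), "customer_id")

# The formatted plan shows the BroadcastHashJoin and any pushed-down filters.
joined.filter(F.col("order_status") == "COMPLETE").explain("formatted")
```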

12. Data Storage & Lakehouse (existing)


- Delta Lake: managed vs external tables, CRUD, MERGE semantics
- Creating Delta tables, copy data to metastore, insert, validate
- Columnar file formats, folder structure for partitioned data, parquet best practices
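
A sketch of writing and reading a partitioned Delta table, including time travel; the paths, source table and partition column are assumptions, and the cluster needs the delta-spark package or a Databricks runtime.

```python
# Write a Delta table partitioned by month, then read with pruning and time travel.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

orders = (spark.read.table("orders")
          .withColumn("order_month", F.date_format("order_date", "yyyy-MM")))

(orders.write
 .format("delta")
 .mode("overwrite")
 .partitionBy("order_month")          # one folder per month under the table path
 .save("/tmp/delta/orders"))

# Partition pruning: only the matching month folder is scanned.
jan = (spark.read.format("delta").load("/tmp/delta/orders")
       .filter(F.col("order_month") == "2024-01"))
print(jan.count())

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/orders")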

13. Hadoop, HDFS & Hive (existing)


- Dataproc clusters, single node & multinode setups, HDFS commands, file blocks, replication
- Hive applications and scripts, partitioned Parquet tables, staging in HDFS
- Scheduling Hive jobs using crontab, developing shell wrappers

14. Advanced Spark & Deployment (existing)


- Spark submit modes, dependencies as packages/jars, submit scripts
- Logging in Spark apps (python logging), validating logs client/cluster mode
- Running spark applications with/without AQE, dynamic allocation, partitions

15. Performance & Cluster Operations (existing)


- Compute capacity, YARN capacity, Spark History Server, Spark UI deep dives
- Generate test data, WordCount app, disable/override dynamic allocation, shuffling

16. Observability, Data Quality & Lineage (INSERTED)


- Why data quality matters, Great Expectations overview
- Add validations into pipelines (batch & streaming)
- Observability: metrics, logs, tracing, and integrating with Prometheus/Datadog
- Lineage & metadata: OpenLineage, Amundsen, data catalog basics
- Demo: add GE validations to the File→DB loader and dashboarding (a validation sketch follows)
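
A library-agnostic sketch of the kind of batch validations this section adds to the File→DB loader; Great Expectations formalizes the same checks as declarative expectation suites. The file path, column names and allowed status values are assumptions.

```python
# Simple data quality gate for a batch before it is loaded to the database.
import pandas as pd


def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []

    if df["order_id"].isnull().any():
        failures.append("order_id contains nulls")

    if df["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")

    allowed = {"COMPLETE", "CLOSED", "PENDING", "CANCELED"}   # assumed domain
    bad_status = set(df["order_status"].dropna().unique()) - allowed
    if bad_status:
        failures.append(f"unexpected order_status values: {sorted(bad_status)}")

    return failures


batch = pd.read_csv("data/orders.csv",
                    names=["order_id", "order_date", "order_customer_id", "order_status"])

problems = validate_orders(batch)
if problems:
    raise ValueError("data quality check failed: " + "; ".join(problems))
```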

17. Security, Governance & Compliance (INSERTED)


- IAM and access controls (GCP/AWS concepts), encryption at rest/in transit
- Row/column-level security, masking, PII handling, GDPR basics
- Auditing, lineage, ownership and governance operational checks

18. Linux, Shell Scripts & Automation (existing)


- SSH, PATH, mkdir/cp/mv/rm, find, grep, shell scripts, debug scripts with args
- Hadoop/Spark executables, start/stop clusters, VS Code remote setups

19. Final Projects & Capstone Integration


- Project 1: Batch ELT pipeline (CSV→Parquet→Delta→Databricks job): implement partitioning, MERGE, performance tuning
- Project 2: Streaming analytics demo (simulated events→Kafka→Spark Structured Streaming→Delta + dashboard)
- Project 3: Orchestrated pipeline with Airflow (ingest→transform→load to BigQuery or Snowflake; DAGs, monitoring, retries)
- Each project includes: README, architecture diagram, run instructions, sample data, tests, CI config

20. Wrap-up: best practices, interview prep checklist, next steps

-----------------------------------------------------------------------
PRIORITIZED LEARNING PATH (Mapped to the 3 Portfolio Projects)

Goal: produce 3 interview-ready projects. Below is a prioritized, timeboxed plan with milestones.

Project A — Batch ELT pipeline (Priority: High)


Target skills: SQL, Postgres, Pandas, Spark, Databricks, Delta, Partitioning, Explain Plans, Indexing, CI.
Milestones:
Week 1: Review SQL fundamentals & Data Modeling (sections 2–4). Create star schema for e-commerce.
Week 2: Implement File Format Converter (Project 1) using Pandas. Add unit tests.
Week 3: Build Spark job to convert CSV→Parquet→Delta, create partitioned target table.
Week 4: Create Databricks job, add Explain Plan analysis, performance tuning (partition pruning).
Week 5: Add CI (GitHub Actions), containerize small test harness, add Great Expectations checks.
Deliverables: repo with README, DAG (if used), Databricks job config, CI, tests, screenshots of Spark UI/explain plan.

Project B — Streaming analytics (Priority: High)


Target skills: Kafka / PubSub, Spark Structured Streaming, schema registry, windowing, exactly-once semantics, monitoring.
Milestones:
Week 1: Study Streaming fundamentals & Kafka (INSERTED streaming section).
Week 2: Build a local Kafka producer and topic; set up schema registry (Avro).
Week 3: Develop Spark Structured Streaming job to consume, aggregate (windowed counts), write to Delta.
Week 4: Add checkpointing, watermarking; test late data handling; scale locally (multiple partitions).
Week 5: Add monitoring (expose metrics), logging, and end-to-end validation with Great Expectations.
Deliverables: repo, sample event generator, Spark Structured Streaming code, README, dashboard screenshots.

Project C — Orchestrated ELT + Warehouse (Priority: Medium)


Target skills: Airflow/Prefect, BigQuery/Snowflake, CI/CD, Terraform, lineage.
Milestones:
Week 1: Learn Airflow basics & set up local Airflow or managed Composer.
Week 2: Create DAG to run Project A Spark job + Project B streaming validation steps.
Week 3: Add operator to load processed data into BigQuery or Snowflake; add tests.
Week 4: Add Terraform basics for provisioning minimal infra (bucket, service account).
Week 5: Add lineage metadata (OpenLineage) and data catalog entries.
Deliverables: DAG repo, Terraform config, documentation, screenshots of DAG UI and BigQuery/Snowflake queries.

Recommended order of study (fastest path to hireable skillset):


1) SQL & Data Modeling + Postgres hands-on (2 weeks)
2) Python & Pandas + File→DB loader (1 week)
3) Spark fundamentals + Databricks (2 weeks)
4) Streaming (Spark Structured Streaming + Kafka) (2 weeks)
5) Orchestration (Airflow) + CI/CD + Testing (2 weeks)
6) Cloud Warehouse (BigQuery/Snowflake) + governance & security (1–2 weeks)
7) Final polish: tests, CI, documentation, public repos (1 week)

-----------------------------------------------------------------------
HOW TO USE THIS PDF
- Follow the prioritized path for fastest hiring readiness.
- Build each project in a public GitHub repo with clear READMEs and run instructions.
- Keep each project small but production aware (tests, CI, basic infra).
- During interviews, show architecture diagrams and explain tradeoffs, costs, and scaling decisions.

-----------------------------------------------------------------------
END OF DOCUMENT
