Data Engineering Essentials
Modified Complete Course Content (Merged) + Prioritized Learning Path
-----------------------------------------------------------------------
CONTENTS
1. SQL Fundamentals & Data Warehousing (Core)
- Introduction to SQL for Data Engineering
- Overview of Application Architecture and RDBMS
- Overview of Database Technologies and relevance of SQL
- Overview of Purpose Built Databases
- Overview of Data Warehouse and Data Lake
- (INSERTED) Section: Data Modeling & Schema Design (High Priority)
* Data modeling fundamentals: OLTP vs OLAP
* Dimensional modeling: star and snowflake
* Slowly Changing Dimensions (SCD) types
* Fact tables, grain, surrogate keys
* Partitioning strategies, schema evolution
* Workshops & SCD Type-2 implementation using Delta/MERGE (see the MERGE sketch at the end of this section)
- Usage of RDBMS and Data Warehouse technologies
- Differences and Similarities between RDBMS and Data Warehouse Technologies
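Illustrative sketch for the SCD Type-2 workshop above, written as PySpark running Delta SQL. The table names (dim_customer, customer_stg) and columns (address, valid_from, valid_to, is_current) are assumptions for the example, not the course's actual data set; treat this as one possible shape of the exercise rather than the official solution.

# scd2_merge.py -- close out changed "current" rows, then insert fresh versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd2-demo").getOrCreate()

# Step 1: expire current dimension rows whose tracked attribute changed.
spark.sql("""
    MERGE INTO dim_customer AS t
    USING customer_stg AS s
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN UPDATE SET
      is_current = false,
      valid_to   = current_date()
""")

# Step 2: insert a new "current" version for customers with no current row
# (brand-new customers, plus the ones just expired in step 1).
spark.sql("""
    INSERT INTO dim_customer
    SELECT s.customer_id, s.name, s.address,
           current_date()    AS valid_from,
           DATE '9999-12-31' AS valid_to,
           true              AS is_current
    FROM customer_stg s
    LEFT JOIN dim_customer t
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHERE t.customer_id IS NULL
""")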
2. Postgres & Hands-On SQL (Existing)
- Overview of Postgres Database Server and pgAdmin
- Overview of Database Connection Details
- Overview of Connecting to External Databases using pgAdmin
- Create Application Database and User in Postgres Database Server
- Clone Data Sets from Git Repository for Database Scripts
- Register Server in pgAdmin using Application Database and User
- Setup Application Tables and Data in Postgres Database
- Overview of pgAdmin to write SQL Queries
- Review Data Model Diagram
- Define Problem Statement for SQL Queries
- Filtering Data using SQL Queries
- Total Aggregations using SQL Queries
- Group By Aggregations using SQL Queries
- Order of Execution of SQL Queries
- Rules and Restrictions to Group and Filter Data in SQL queries
- Filter Data based on Aggregated Results using Group By and Having
- Joins (Inner and Outer) and advanced filtering on join results
- Views, CTEs, CTAS, OVER / PARTITION BY, Ranking, and Exercises
- SQL Troubleshooting, Explain Plans, Indexing, Performance Tuning modules
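Illustrative sketch of the Group By / Having pattern from this section, run against Postgres from Python with psycopg2. The connection details and the orders table and columns are placeholders for whatever the course data set provides.

# group_by_having.py -- aggregate, then filter on the aggregate (HAVING),
# matching the order-of-execution topic in this section.
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="retail_db",
    user="retail_user", password="changeme",   # placeholder connection details
)

query = """
    SELECT   order_status,
             count(*) AS order_count
    FROM     orders
    WHERE    order_date >= DATE '2014-01-01'    -- row-level filter (WHERE)
    GROUP BY order_status
    HAVING   count(*) > 1000                    -- filter on the aggregated result
    ORDER BY order_count DESC
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for status, cnt in cur.fetchall():
        print(f"{status}: {cnt}")
conn.close()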
3. Data Modeling Workshop (detail)
- (continued Data Modeling exercises, SCD implementation, schema evolution)
4. Python Fundamentals & Development Practices
- Setup Visual Studio Workspace for Python Application Development
- VS Code Notebooks, Cells, Functions, Running by line
- Python basics: data types, lists, strings, loops, functions, file I/O
- JSON handling, Pandas basics, reading/writing CSV/JSON
- Pandas advanced: joins, aggregations, sorting, writing files
- Projects: File Format Converter, File→DB Loader (existing)
- Exception handling, environment variables, runtime args, logging
- (INSERTED) Section: Testing, Debugging & TDD for Data Pipelines
* Unit tests for Spark transformations (pytest)
* Integration tests & mocking external systems
* End-to-end test harness, test data management, CI gating
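Illustrative sketch of a pytest unit test for a Spark transformation, in the spirit of the testing section above. The transformation (add_order_revenue) and its columns are invented for the example.

# test_transformations.py -- pytest for a small PySpark transformation (sketch).
import pytest
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


def add_order_revenue(df: DataFrame) -> DataFrame:
    """Hypothetical transformation under test: revenue = quantity * unit_price."""
    return df.withColumn("revenue", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # Local Spark session shared across the test session; stopped when done.
    spark = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    yield spark
    spark.stop()


def test_add_order_revenue(spark):
    df = spark.createDataFrame(
        [(1, 2, 10.0), (2, 3, 5.0)],
        ["order_id", "quantity", "unit_price"],
    )
    result = {r["order_id"]: r["revenue"] for r in add_order_revenue(df).collect()}
    assert result == {1: 20.0, 2: 15.0}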
5. File Format Converter & File→DB Loader Projects (existing)
- Project 1: File Format Converter (setup, glob, regex, dynamic schema)
- Project 2: Files To Database Loader (chunked loads, multiprocessing)
- Deploy and troubleshoot file loader, performance tuning with chunksize
- Refactor for multiprocessing and validation
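Illustrative sketch of the chunked load in Project 2: pandas reads the CSV in chunks and appends each chunk to Postgres through SQLAlchemy. The connection string, file path, table name and chunk size are assumptions.

# files_to_db_loader.py -- chunked CSV -> Postgres load (sketch).
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and paths.
engine = create_engine("postgresql://retail_user:changeme@localhost:5432/retail_db")

CHUNK_SIZE = 10_000   # tune against available memory and network throughput

for i, chunk in enumerate(pd.read_csv("data/orders.csv", chunksize=CHUNK_SIZE)):
    chunk.to_sql(
        "orders",
        engine,
        if_exists="append" if i else "replace",  # recreate the table on the first chunk
        index=False,
        method="multi",                          # batched INSERTs instead of row-by-row
    )
    print(f"loaded chunk {i} ({len(chunk)} rows)")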
6. CI/CD, Containerization & Version Control (INSERTED)
- Git best practices, branching and PR workflows
- Unit testing patterns, pytest for Python + Spark
- GitHub Actions for CI to run tests and deploy artifacts
- Docker basics and containerizing Spark jobs for local testing
- Terraform basics for provisioning cloud resources (GCP examples)
- Demo: CI pipeline that builds, tests and deploys a Databricks job / Airflow DAG
- Notebook versioning best practices and converting notebooks to scheduled jobs
- Code review checklist, collaboration patterns and runbooks
7. Getting Started with GCP & Databricks (existing)
- Pre-requisite Skills, signing up, GCP credits, Cloud Shell, gcloud SDK
- Analytics Services on GCP, Databricks on GCP setup, workspaces, clusters
- Databricks CLI, DBFS, creating tables, temp views, Spark SQL examples
- (INSERTED) Section: Cloud Managed Data Warehouses & Managed Services
* Overview: BigQuery, Snowflake, Redshift
* Loading patterns: batch vs streaming ingestion
* Cost & performance considerations: partitioning, clustering
* Integrating Databricks/Spark with BigQuery or Snowflake (connectors)
* Hands-on: load Delta/Parquet data to BigQuery and run queries
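Illustrative sketch for the hands-on item above, assuming the spark-bigquery connector is attached to the cluster; the project, dataset, table and staging bucket names are placeholders.

# load_to_bigquery.py -- push curated data from the lake into BigQuery (sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-bigquery").getOrCreate()

df = spark.read.format("delta").load("/mnt/datalake/curated/orders")   # or .parquet(...)

(df.write
   .format("bigquery")
   .option("table", "my_project.analytics.orders")    # project.dataset.table (placeholder)
   .option("temporaryGcsBucket", "my-temp-bucket")    # staging bucket for the load
   .mode("overwrite")
   .save())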
8. Spark SQL & DataFrames (existing)
- Spark SQL functions: string/date/numeric/null handling, case/when, aggregation
- Basic transformations, filtering, GROUP BY, ORDER BY, joins, ranking, JSON processing
- Copying results into metastore tables (CTAS/INSERT/MERGE)
- Spark DataFrame API: select, withColumn, joins, aggregations, sorting, nulls handling
- Integration of Spark SQL and DataFrame APIs; manage metastore objects
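Illustrative sketch of the same aggregation expressed through the DataFrame API and through Spark SQL on a temp view; the input path and column names are assumptions.

# spark_sql_vs_df.py -- the same daily-revenue aggregation in both APIs (sketch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-vs-df").getOrCreate()

orders = spark.read.parquet("/data/retail/order_items")   # assumed path and columns

# DataFrame API
daily_df = (orders
            .filter(F.col("order_status") == "COMPLETE")
            .groupBy("order_date")
            .agg(F.round(F.sum("order_item_subtotal"), 2).alias("revenue"))
            .orderBy("order_date"))

# Same result through Spark SQL on a temp view
orders.createOrReplaceTempView("order_items")
daily_sql = spark.sql("""
    SELECT order_date, round(sum(order_item_subtotal), 2) AS revenue
    FROM order_items
    WHERE order_status = 'COMPLETE'
    GROUP BY order_date
    ORDER BY order_date
""")

daily_df.show(5)
daily_sql.show(5)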
9. (INSERTED) Streaming & Real-time Processing (High Priority)
- Stream processing concepts & use cases
- Kafka fundamentals & cloud Pub/Sub overview
- Schema design for streaming: Avro/Protobuf + registry
- Spark Structured Streaming: micro-batch vs continuous
- Hands-on: Kafka → Spark Structured Streaming → Delta
- Windowed aggregations, stateful processing, checkpointing
- Exactly-once semantics, watermarking, late data handling
- Streaming performance tuning, monitoring and troubleshooting
- Exercise: end-to-end streaming pipeline and solution walkthrough
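Illustrative skeleton for the end-to-end streaming exercise above: JSON events from Kafka, a watermarked windowed count, and an append to Delta with checkpointing. It assumes the Kafka and Delta connectors are available on the cluster; broker address, topic name, schema and output paths are placeholders.

# streaming_pipeline.py -- Kafka -> Structured Streaming -> Delta (sketch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("events-stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .option("startingOffsets", "latest")
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# 10-minute tumbling-window counts; accept events arriving up to 15 minutes late.
counts = (events
          .withWatermark("event_ts", "15 minutes")
          .groupBy(F.window("event_ts", "10 minutes"), "event_type")
          .count())

query = (counts.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/event_counts")
         .start("/tmp/delta/event_counts"))

query.awaitTermination()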
10. Databricks Workflows, ELT Pipelines & Jobs (existing)
- Pass arguments to notebooks (Python & SQL), create & run first Databricks job
- Run jobs & tasks with parameters, orchestrated pipelines using Databricks Jobs
- Import ELT apps into Databricks, build workflows, review execution details
- (INSERTED) Orchestration & Workflow Management (Airflow / Prefect)
* DAGs, Airflow fundamentals, operators/sensors/XCom, scheduling
* Airflow integrations: DatabricksOperator, KubernetesPodOperator
* Prefect basics and comparison with Airflow
* Deploying Airflow (managed vs self-hosted), observability, CI for DAGs
* Exercise: orchestrate an ELT pipeline (schedule + backfill + alerts)
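Illustrative sketch of a minimal Airflow DAG for the orchestration exercise above: a daily ingest → transform → load chain with retries. The task callables are stand-ins; in this course's context the transform step would more likely call a Databricks job through the Databricks provider operators.

# elt_dag.py -- minimal Airflow DAG for a daily ELT run (sketch).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest(**context):       # placeholder callables
    print("pull raw files into the landing bucket")


def transform(**context):
    print("run the Spark/Databricks transformation job")


def load(**context):
    print("load curated data into BigQuery/Snowflake")


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # Airflow 2.4+; older versions use schedule_interval
    catchup=False,            # set True to allow backfills
    default_args=default_args,
) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3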
11. Spark Performance Tuning & Explain Plans (existing)
- Catalyst optimizer overview, explain plans for DataFrames and SQL
- Interpret explain plans, Spark architecture, broadcasting, filter pushdown
- Cluster config: all-purpose vs jobs clusters, autoscaling, executors and executor memory
- Partitioning, columnar formats (Parquet/Delta), schema inference, partition pruning
- Performance assessment for Parquet, shuffling, adaptive query execution, dynamic allocation
- Review Spark UI, job details, YARN logs, and performance tuning scenarios
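Illustrative sketch of reading an explain plan and forcing a broadcast join, as covered above; the input paths are assumptions.

# explain_and_broadcast.py -- inspect the physical plan and broadcast a small dimension.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

orders = spark.read.parquet("/data/retail/orders")       # large fact table (assumed path)
statuses = spark.read.parquet("/data/retail/statuses")   # small dimension (assumed path)

joined = orders.join(F.broadcast(statuses), "order_status")

# "formatted" prints the physical plan with node details; look for
# BroadcastHashJoin (vs SortMergeJoin) and PushedFilters on the Parquet scan.
joined.filter(F.col("order_date") >= "2014-01-01").explain(mode="formatted")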
12. Data Storage & Lakehouse (existing)
- Delta Lake: managed vs external tables, CRUD, MERGE semantics
- Creating Delta tables, copy data to metastore, insert, validate
- Columnar file formats, folder structure for partitioned data, parquet best practices
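Illustrative sketch of the partitioned folder layout described above: write a Delta table partitioned by month and register it as an external table so partition-column filters prune files. Paths, database name and columns are assumptions.

# write_partitioned_delta.py -- partitioned Delta write as an external table (sketch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

orders = (spark.read.parquet("/data/retail/orders")                          # assumed source
          .withColumn("order_month", F.date_format("order_date", "yyyy-MM")))

(orders.write
   .format("delta")                 # or "parquet" for a plain columnar layout
   .partitionBy("order_month")      # creates order_month=YYYY-MM/ folders
   .mode("overwrite")
   .save("/mnt/datalake/silver/orders"))

# Register the location as an external table in the metastore.
spark.sql("CREATE DATABASE IF NOT EXISTS silver")
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.orders
    USING DELTA
    LOCATION '/mnt/datalake/silver/orders'
""")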
13. Hadoop, HDFS & Hive (existing)
- Dataproc clusters, single node & multinode setups, HDFS commands, file blocks, replication
- Hive applications and scripts, partitioned parquet tables, staging in HDFS
- Scheduling Hive jobs using crontab, developing shell wrappers
14. Advanced Spark & Deployment (existing)
- Spark submit modes, dependencies as packages/jars, submit scripts
- Logging in Spark apps (python logging), validating logs client/cluster mode
- Running spark applications with/without AQE, dynamic allocation, partitions
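Illustrative sketch of the python logging setup mentioned above: in client mode these lines appear on the console, in cluster mode they end up in the YARN/driver logs.

# app_logging.py -- basic python logging inside a PySpark driver (sketch).
import logging
from pyspark.sql import SparkSession

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("file_converter")

spark = SparkSession.builder.appName("logging-demo").getOrCreate()

logger.info("starting conversion job")
df = spark.range(1_000)                     # stand-in for the real input
logger.info("row count = %d", df.count())   # driver-side log line
logger.info("job finished")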
15. Performance & Cluster Operations (existing)
- Compute capacity, YARN capacity, Spark History Server, Spark UI deep dives
- Generate test data, WordCount app, disable/override dynamic allocation, shuffling
16. Observability, Data Quality & Lineage (INSERTED)
- Why data quality matters, Great Expectations overview
- Add validations into pipelines (batch & streaming)
- Observability: metrics, logs, tracing, and integrating with Prometheus/Datadog
- Lineage & metadata: OpenLineage, Amundsen, data catalog basics
- Demo: add GE validations to File→DB loader and dashboarding
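Illustrative sketch of adding Great Expectations checks in front of the File→DB loader, using the legacy pandas-flavoured GE API (pre-1.0 releases); newer GE versions restructure this around validation definitions. The file path, columns and expectations are assumptions.

# validate_before_load.py -- data quality gate for the File->DB loader (sketch).
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.read_csv("data/orders.csv"))   # legacy pandas-style GE dataset

checks = [
    df.expect_column_values_to_not_be_null("order_id"),
    df.expect_column_values_to_be_unique("order_id"),
    df.expect_column_values_to_be_in_set(
        "order_status", ["COMPLETE", "PENDING", "CLOSED", "CANCELED"]
    ),
]

if not all(c.success for c in checks):
    raise ValueError("data quality checks failed; aborting load")

print("all expectations passed; safe to load")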
17. Security, Governance & Compliance (INSERTED)
- IAM and access controls (GCP/AWS concepts), encryption at rest/in transit
- Row/column-level security, masking, PII handling, GDPR basics
- Auditing, lineage, ownership and governance operational checks
18. Linux, Shell Scripts & Automation (existing)
- SSH, PATH, mkdir/cp/mv/rm, find, grep, shell scripts, debug scripts with args
- Hadoop/Spark executables, start/stop clusters, VS Code remote setups
19. Final Projects & Capstone Integration
- Project 1: Batch ELT pipeline (CSV→Parquet→Delta→Databricks job): implement partitioning, MERGE, performance tuning
- Project 2: Streaming analytics demo (Simulated events→Kafka→Spark Structured Streaming→Delta + dashboard)
- Project 3: Orchestrated pipeline with Airflow (Ingest→Transform→Load to BigQuery or Snowflake; DAGs, monitoring, retries)
- Each project includes: README, architecture diagram, run instructions, sample data, tests, CI config
20. Wrap-up: best practices, interview prep checklist, next steps
-----------------------------------------------------------------------
PRIORITIZED LEARNING PATH (Mapped to the 3 Portfolio Projects)
Goal: produce 3 interview-ready projects. Below is a prioritized, time-boxed plan with milestones.
Project A — Batch ELT pipeline (Priority: High)
Target skills: SQL, Postgres, Pandas, Spark, Databricks, Delta, Partitioning, Explain Plans, Indexing, CI.
Milestones:
Week 1: Review SQL fundamentals & Data Modeling (sections 1–3). Create a star schema for an e-commerce data set.
Week 2: Implement File Format Converter (Project 1) using Pandas. Add unit tests.
Week 3: Build Spark job to convert CSV→Parquet→Delta, create partitioned target table.
Week 4: Create Databricks job, add Explain Plan analysis, performance tuning (partition pruning).
Week 5: Add CI (GitHub Actions), containerize small test harness, add Great Expectations checks.
Deliverables: repo with README, DAG (if used), Databricks job config, CI, tests, screenshots of Spark UI/explain plan.
Project B — Streaming analytics (Priority: High)
Target skills: Kafka / Pub/Sub, Spark Structured Streaming, schema registry, windowing, exactly-once semantics, monitoring.
Milestones:
Week 1: Study Streaming fundamentals & Kafka (INSERTED streaming section).
Week 2: Build a local Kafka producer and topic; set up schema registry (Avro).
Week 3: Develop Spark Structured Streaming job to consume, aggregate (windowed counts), write to Delta.
Week 4: Add checkpointing, watermarking; test late data handling; scale locally (multiple partitions).
Week 5: Add monitoring (expose metrics), logging, and end-to-end validation with Great Expectations.
Deliverables: repo, sample event generator, Spark Structured Streaming code, README, dashboard screenshots.
Project C — Orchestrated ELT + Warehouse (Priority: Medium)
Target skills: Airflow/Prefect, BigQuery/Snowflake, CI/CD, Terraform, lineage.
Milestones:
Week 1: Learn Airflow basics & set up local Airflow or managed Composer.
Week 2: Create DAG to run Project A Spark job + Project B streaming validation steps.
Week 3: Add operator to load processed data into BigQuery or Snowflake; add tests.
Week 4: Add Terraform basics for provisioning minimal infra (bucket, service account).
Week 5: Add lineage metadata (OpenLineage) and data catalog entries.
Deliverables: DAG repo, Terraform config, documentation, screenshots of DAG UI and BigQuery/Snowflake queries.
Recommended order of study (fastest path to hireable skillset):
1) SQL & Data Modeling + Postgres hands-on (2 weeks)
2) Python & Pandas + File→DB loader (1 week)
3) Spark fundamentals + Databricks (2 weeks)
4) Streaming (Spark Structured Streaming + Kafka) (2 weeks)
5) Orchestration (Airflow) + CI/CD + Testing (2 weeks)
6) Cloud Warehouse (BigQuery/Snowflake) + governance & security (1–2 weeks)
7) Final polish: tests, CI, documentation, public repos (1 week)
-----------------------------------------------------------------------
HOW TO USE THIS PDF
- Follow the prioritized path for fastest hiring readiness.
- Build each project in a public GitHub repo with clear READMEs and run instructions.
- Keep each project small but production aware (tests, CI, basic infra).
- During interviews, show architecture diagrams and explain tradeoffs, costs, and scaling decisions.
-----------------------------------------------------------------------
END OF DOCUMENT