PUML4PC03 Data Engineering L T P C
(Common to CSE, CSBS, AIDS) 4 003
COURSE OBJECTIVES
To introduce the fundamentals and evolution of data engineering in modern organizations.
To understand operating system roles in managing resources for data pipelines.
To explore database principles and optimization techniques for large-scale data.
To design and optimize data models and transformations for batch and streaming workflows.
To evaluate replication, consistency, and fault-tolerant architectures in distributed systems.
UNIT I INTRODUCTION TO DATA ENGINEERING 6
Data engineering: Concept of data engineering – The data engineering life cycle – Evolution of the data engineer –
Data engineering and data science – Data engineering skills and activities: Data maturity and the data engineer, Data
engineers inside an organization.
UNIT II DATA ENGINEERING AND OPERATING SYSTEMS 12
Role of Operating Systems in Data Engineering – OS-level support for data pipelines: Process scheduling, memory
management, and I/O handling – File systems and data storage mechanisms (HDFS, ext4, NTFS overview) –
Concurrency and parallel processing: Threads, multiprocessing, synchronization, and deadlocks in data workflows –
Resource management for data-intensive applications: CPU, memory, and disk utilization – Virtualization and
containerization in Data Engineering (Docker, Kubernetes overview) – Performance optimization and fault tolerance
at the OS level.
UNIT III DATABASE SYSTEMS IN DATA ENGINEERING 12
Database Systems and Data Engineering – Role of Databases in Data Pipelines – Data Storage and Retrieval –
Database Architecture: Components and Layers – Relational Model: Tables, Attributes, and Keys – Schema Design
and Normalization – SQL for Data Definition and Manipulation – Indexing and Query Optimization – Transactions
and Concurrency Control – Distributed and NoSQL Databases in Modern Data Platforms – Integration of Databases
with Data Pipelines.
UNIT IV QUERIES, MODELING AND TRANSFORMATION 8
Queries: Life of a query, the query optimizer: improving query performance, queries on streaming data – Data
modeling – Transformations: Batch transformations, Materialized views, federation, and query virtualization,
Streaming transformations and processing – Upstream stakeholders, Downstream stakeholders.
UNIT V SYSTEM OF RECORD 5
Replication: Shared-nothing architectures – Leaders and followers – Problems with replication lag – Multi-leader
replication, topologies – Leaderless replication.
TOTAL: 45 PERIODS
At the end of the course, students will be able to:
COs | COURSE OUTCOMES | COGNITIVE LEVEL
CO1 | Understand the fundamentals and role of data engineering in data-driven organizations. | Understand
CO2 | Analyze OS concepts that enhance performance and reliability in data pipelines. | Analyze
CO3 | Apply database design and optimization techniques in data workflows. | Apply
CO4 | Design and optimize data models and transformations for batch and streaming processing. | Evaluate
CO5 | Evaluate replication, consistency, and fault-tolerant architectures in distributed systems. | Evaluate
CO – PO Mapping
CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1 1 - - - 2 1
3 2 1 1 1 2 1 3
CO2 1 - - - 3 2
2 3 2 2 2 2 1 2
CO3 1 - - - 3 2
3 2 3 2 3 2 1 3
CO4 1 - - - 3 3
2 2 3 3 3 3 1 3
CO5 1 - - - 3 3
2 3 2 3 3 3 1 2
1: low, 2: medium, 3: high, ‘-’: no correlation
TEXTBOOKS:
1. Fundamentals of Data Engineering. Reis, J., & Housley, M. O'Reilly Media, Inc. (2022).
2. Designing Data-Intensive Applications. Kleppmann, M. (2023).
3. Modern Operating Systems, 5th Edition. Tanenbaum, A. S., & Bos, H. (2024).
REFERENCES:
1. Data Engineering with Python: Work with massive datasets to design data models and automate data pipelines using
Python. Crickard, P. Packt Publishing Ltd. (2020).
2. Database Systems: The Complete Book. Garcia-Molina, H., Ullman, J. D., & Widom, J. (2023).
3. Operating Systems: Three Easy Pieces. Arpaci-Dusseau, R., & Arpaci-Dusseau, A. (2024).
4. Data Engineering on Azure. Riscutia, V. (2023).
NPTEL/SWAYAM/MOOC REFERENCES:
1. https://atlan.com/automation-for-data-engineering-teams/
2. https://www.coursera.org/specializations/gcp-data-machine-learning
3. https://swayam.gov.in/nd1_noc19_cs60/preview?utm_source
4. https://odsc.medium.com/10-data-engineering-topics-and-trends-you-need-to-know-in-2024-bd2af52d95f4
5. https://learning.dell.com/content/dell/en-us/home/training/data-engineering.html