
Lecture 1: Introduction & Data Ecosystem
CS5481 Data Engineering
Instructor: Linqi Song
• Big data is changing the way we do business and creating a need for data
engineers who can collect and manage large quantities of data.
• Data engineering is the practice of designing and building systems for
collecting, storing, and analyzing data at scale.

2
Outline

• 1. Course organization

• 2. Overview of the data engineering ecosystem

• 3. Types of data

• 4. Languages and tools for data professionals

3
Bird’s-eye view of this course
• Data pipeline
• Data acquisition, data processing, and data storage
• Data management
• Data processing techniques for data-driven applications
• Information retrieval, recommendations, social network analysis, anomaly
detection

4
Instruction pattern
• 3-hour in-class learning
• 2-hour lecture, Tuesdays, 18:30-20:20, LI-2505
• Covers the main topics of the course
• 1-hour tutorial
• In-class hands-on exercises, discussions, presentations, etc.
• T01, Tuesdays, 20:30-21:20, LI-4412
• T02, Tuesdays, 21:30-22:20, LI-4412
• After class
• Homework assignments and projects
• Reading on recent advances in related fields and implementation details
• Papers, technical blogs, GitHub, etc.
• Q&A
• Canvas -> Discussions is the preferred channel
• Email for more personalized questions
• TAs for hands-on issues

5
Teaching team
• Prof. Linqi Song
• Yeung Y6425, [email protected]

• TAs
Mr. Wei Shao ([email protected]),
Mr. Zengyan Liu ([email protected]),
Ms. Yuxuan Yao ([email protected]),
Mr. Guangfeng Yan ([email protected]),
Mr. Haochen Tan ([email protected]),
Ms. Shuqi Liu ([email protected]).

6
Assessment
• Continuous assessment (60%)
• 2 individual homework assignments (each 15%)
• Answer questions and/or write programs to implement simple data engineering tasks
• 1 group project with presentations (30%)
• Form a group of 2-3 students (before Week 5)
• Select one topic among several given topics
• Do experiments and show your innovation and novelty
• Reports + codes + others (datasets, proofs, figures, etc.)
• Presentation (Week 13)
• Final exam (40%)

7
Proper use of Large Language Models (e.g.,
ChatGPT)
• We will specify whether LLMs may or may not be used in each HW/project,
following the department policy.

8
Schedule

9
Resources
• Computing resources
• JupyterHub for tutorials, homework assignments, and group projects
• CS department: https://mljh.cs.cityu.edu.hk/
• Google Colab: https://colab.research.google.com/
• Other resources: Kaggle (kaggle.com/notebooks), other online resources, CS Lab MMW 2462
• Other online courses

10
How to learn this course well?
• Higher-level postgraduate courses
• What problems to solve?
• How to approach the problem?
• Systematic ideas instead of details
• This data engineering course
• Get your hands dirty, as it is mainly about how to process data and implement
systems for domain applications
• Follow recent academic and industrial advances
• Discuss with others

11
Outline

• 1. Course organization

• 2. Overview of the data engineering ecosystem

• 3. Types of data

• 4. Languages and tools for data professionals

12
Venn diagram of the three skill dimensions for data-related work:
coding + maths + domain knowledge

13
Data engineering pipeline
• The goal of data engineering is to provide an organized, standardized data
flow that enables data-driven applications such as machine learning models
and data analysis.

14
Data engineering ecosystem
A Data Engineer’s ecosystem includes the infrastructure, tools,
frameworks, and processes for:
• Extracting data from disparate sources
• Architecting and managing data pipelines for transformation, integration, and
storage of data
• Architecting and managing data repositories
• Automating and optimizing workflows and flow of data between systems
• Developing applications needed through the data engineering flow

It is a diverse, rich, and challenging ecosystem.

15
Data engineering ecosystem: data sources
Data comes in a wide variety of file formats and is collected from a variety
of data sources:

16
Data engineering ecosystem: data storage

Online Analytical Processing (OLAP) Systems
• Optimized for conducting complex data analytics
• Include relational and non-relational databases, data warehouses, data
marts, data lakes, and big data stores

Online Transaction Processing (OLTP) Systems
• Designed to store high-volume, day-to-day operational data
• Typically relational, but can also be non-relational
17
Data engineering ecosystem: data integration

Data integration combines data from disparate sources into a unified view
that users can access to query and manipulate; a minimal sketch of this
pattern follows below.
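Here is a minimal extract-transform-load sketch of that idea in Python; the
file names, field names, and schema mapping (crm_export.csv,
web_orders.json, "id" vs. "cust") are hypothetical, invented purely for
illustration:

```python
# Minimal ETL sketch: extract from two hypothetical sources, transform
# records into one common schema, and load them into a unified view.
import csv
import json

def extract_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_json(path):
    with open(path) as f:
        return json.load(f)

def transform(record, source):
    # Normalize the two (assumed) source schemas into one unified view
    return {
        "customer_id": record["id"] if source == "crm" else record["cust"],
        "amount": float(record["amount"]),
        "source": source,
    }

def run_pipeline():
    unified = []
    unified += [transform(r, "crm") for r in extract_csv("crm_export.csv")]
    unified += [transform(r, "web") for r in extract_json("web_orders.json")]
    return unified  # the "unified view" users would query
```

Real pipelines delegate this work to dedicated integration tools, but the
extract-normalize-load pattern is the same.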

18
Data engineering ecosystem: data analysis
Data analysis is the process of discovering useful information, informing
conclusions, and supporting decision-making from data, across business,
science, and social science domains. In today's business world, data
analysis plays a role in making decisions more scientific and helping
businesses operate more effectively.

It often involves data mining and machine learning techniques to model and
process the data.
19
Data engineering ecosystem: data visualization
Business Intelligence (BI) and Reporting Tools
• Collect data from multiple data sources and present them in a visual format,
such as interactive dashboards
• Visualize data in real time or on a predefined schedule
• Drag-and-drop products that do not require programming knowledge (a
programmatic sketch follows below)
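In contrast to drag-and-drop products, visualization can also be done
programmatically. A minimal sketch using the open-source matplotlib library;
the monthly figures are made up:

```python
# Minimal programmatic visualization sketch with invented data
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 180]  # hypothetical monthly sales figures

plt.bar(months, sales)
plt.title("Monthly sales (made-up data)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```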

20
Data engineering ecosystem: data-driven
applications (1)
• Information retrieval (IR) is the process of obtaining information system
resources that are relevant to an information need from a collection of those
resources (texts, images, or sounds). Searches can be based on full-text or
other content-based indexing (a minimal indexing sketch follows below).

• A recommender system is a subclass of information filtering system that
seeks to predict the “rating” or “preference” a user would give to an item.
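To make the full-text indexing idea concrete, here is a minimal
inverted-index sketch in Python; the document collection and the query are
invented examples, not part of the lecture material:

```python
# Build an inverted index mapping each term to the set of documents
# containing it, then answer a query by intersecting posting lists.
from collections import defaultdict

docs = {
    1: "data engineering builds data pipelines",
    2: "machine learning models need clean data",
    3: "search engines use inverted indexes",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    # Boolean AND: keep documents that contain every query term
    terms = query.lower().split()
    result = index[terms[0]].copy() if terms else set()
    for term in terms[1:]:
        result &= index[term]
    return result

print(search("data pipelines"))  # -> {1}
```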

21
Data engineering ecosystem: data-driven
applications (2)
• Social network analysis investigates social structures through the use of
networks and graph theory. It characterizes networked structures in terms of
nodes (individual actors, people, or things within the network) and the ties,
edges, or links (relationships or interactions) that connect them.

• Anomaly detection is the identification of rare events, items, or
observations which are suspicious because they differ significantly from
standard behaviors or patterns. Anomalies in data are also called outliers,
noise, novelties, deviations, and exceptions (a sketch combining both ideas
follows below).
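A small sketch tying the two ideas together: represent a network as an
adjacency list, compute each node's degree, and flag nodes whose degree
deviates strongly from the mean. The edge list and the 1.5-standard-deviation
threshold are illustrative assumptions, not fixed conventions:

```python
# Degree computation over an adjacency list, plus a simple z-score
# rule to flag anomalous (hub-like) nodes.
from collections import defaultdict
from statistics import mean, pstdev

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
         ("b", "c"), ("d", "e")]

graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

degrees = {node: len(neighbors) for node, neighbors in graph.items()}
mu, sigma = mean(degrees.values()), pstdev(degrees.values())

anomalies = [n for n, d in degrees.items()
             if sigma > 0 and abs(d - mu) / sigma > 1.5]
print(degrees)    # {'a': 4, 'b': 2, 'c': 2, 'd': 2, 'e': 2}
print(anomalies)  # ['a'] -- node 'a' stands out as a hub
```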

22
Data engineering ecosystem: data governance
• Data quality: how well suited a data set is to serve its specific purpose,
covering accuracy, completeness, consistency, validity, uniqueness, bias, and
timeliness (a sketch of simple quality checks follows below).
• Data security: safeguarding data throughout its entire life cycle to protect
it from corruption, theft, or unauthorized access. It covers everything:
hardware, software, storage devices, and user devices; access and
administrative controls; and organizations' policies and procedures.
• Data privacy: proper handling of sensitive data, including personal data and
other confidential data such as certain financial data and intellectual
property data, to meet regulatory requirements as well as protecting the
confidentiality and immutability of the data.
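As a sketch of what checking a few of these quality dimensions can look like
in code, here is a minimal example over made-up records (real pipelines would
rely on dedicated data-quality tooling):

```python
# Toy checks for completeness, validity, and uniqueness
records = [
    {"id": 1, "email": "[email protected]", "age": 34},
    {"id": 2, "email": None, "age": 29},                # incomplete
    {"id": 2, "email": "[email protected]", "age": -5},   # duplicate id, bad age
]

complete = [r for r in records if all(v is not None for v in r.values())]
valid = [r for r in records if r["age"] is not None and 0 <= r["age"] <= 120]
ids = [r["id"] for r in records]

print(f"completeness: {len(complete)}/{len(records)} records are complete")
print(f"validity:     {len(valid)}/{len(records)} records have a plausible age")
print(f"uniqueness:   ids unique = {len(ids) == len(set(ids))}")
```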
23
Outline

• 1. Course organization

• 2. Overview of the data engineering ecosystem

• 3. Types of data

• 4. Languages and tools for data professionals

24
Types of data

• Structured: data that follows a rigid format and can be organized into rows
and columns.
• Semi-structured: a mix of data that has consistent characteristics and data
that does not conform to a rigid structure.
• Unstructured: data that is complex and mostly qualitative information that
cannot be structured into rows and columns.
25
Structured data (1)

• Has a well-defined structure.
• Can be stored in well-defined schemas.
• Can be represented in a tabular manner with rows and columns.

Structured data is objective facts and numbers that can be collected,
exported, stored, and organized in typical databases.

26
Structured data (2)
Sources of structured data include:

27
Semi-structured data (1)

• Has some organizational properties but lacks a fixed or rigid schema.
• Cannot be stored in the form of rows and columns as in databases.
• Contains tags and elements, or metadata, which are used to group data and
organize it in a hierarchy.

28
Semi-structured data (2)
Sources of semi-structured data

XML and JSON allow users to define tags and attributes to store data in a
hierarchical form and are widely used to store and exchange semi-structured
data, as the sketch below illustrates.
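A minimal sketch with Python's standard json module (the records are
invented): note how the optional "phones" field appears in one record but
not the other, which a rigid relational schema would not allow without
nulls or extra tables:

```python
# Parsing semi-structured JSON: records share fields but need not
# conform to one rigid schema.
import json

payload = """
[
  {"name": "Alice", "email": "[email protected]"},
  {"name": "Bob", "email": "[email protected]",
   "phones": ["555-0101", "555-0102"]}
]
"""

for person in json.loads(payload):
    phones = person.get("phones", [])  # optional field, absent for Alice
    print(person["name"], "has", len(phones), "phone number(s)")
```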
29
Unstructured data (1)

• Does not have an easily identifiable structure.
• Cannot be organized in a mainstream relational database in the form of rows
and columns.
• Does not follow any particular format, sequence, semantics, or rules.

30
Unstructured data (2)
Sources of unstructured data

31
Examples of different types of data

32
Outline

• 1. Course organization

• 2. Overview of the data engineering ecosystem

• 3. Types of data

• 4. Languages and tools for data professionals

33
Languages for data professionals

• Query languages: for example, SQL for querying and manipulating data
• Programming languages: for example, Python for developing data applications
• Shell and scripting languages: for repetitive and time-consuming
operational tasks

34
Query languages
Advantages of using SQL:
• SQL is portable and platform independent.
• Can be used for querying data in a wide variety of databases and data
repositories.
• Has a simple syntax that is similar to the English language.
• Can retrieve large amounts of data quickly and efficiently.
• Runs on an interpreter system (see the example below).
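A short example of SQL's portability: the query below uses standard syntax
that runs on many database systems; here it executes against SQLite through
Python's built-in sqlite3 module, with an illustrative table:

```python
# Create a small in-memory table, then run an aggregate SQL query
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

# Total sales per region -- simple, English-like syntax
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)  # east 175.0 / west 250.0

conn.close()
```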

35
General programming languages – Python
Python is one of the fastest-growing programming languages in the world.

Advantages of using Python:
• Easy to learn and open-source.
• Can be ported to multiple platforms and has widespread community support.
• Provides open-source libraries for data manipulation, data visualization,
statistics, and mathematics (see the pandas sketch below).
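A minimal sketch using pandas, one such open-source data-manipulation
library; the data frame contents are invented:

```python
# Group-by aggregation with pandas: mean temperature per city
import pandas as pd

df = pd.DataFrame({
    "city": ["HK", "HK", "SZ", "SZ"],
    "temp": [31.2, 30.8, 33.1, 32.5],
})

print(df.groupby("city")["temp"].mean())
```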

36
Statistical programming languages – R
Advantages of using R:
• Open-source and platform-independent.
• Can be paired with many programming languages and
highly extensible.
• Facilitates the handling of structured and unstructured
data.
• Can be used for developing statistical tools.

37
General programming languages – Java
Java is an object-oriented, class-based, and platform-independent programming language.

Advantages of using Java:
• One of the top-ranked programming languages used today.
• Used in a number of data analytics processes: cleaning data, importing and
exporting data, statistical analysis, and data visualization.
• Used in the development of big data frameworks and tools such as Hadoop,
Hive, and Spark.
• Well-suited for speed-critical projects.

38
Shell and scripting languages
Typical operations performed by shell scripts include:
• File manipulation
• Program execution
• System administration tasks such as disk backups and evaluating system logs
• Installation scripts for complex programs
• Executing routine backups (see the sketch below)
• Running batches
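Such operations are classically written in a shell such as Bash; as a sketch
in this course's main language, here is a routine-backup script in Python.
The source and destination paths are illustrative assumptions:

```python
# Copy a data directory into a dated backup folder
import shutil
import datetime
from pathlib import Path

src = Path("data")                             # directory to back up
dst = Path("backups") / datetime.date.today().isoformat()

dst.parent.mkdir(exist_ok=True)                # ensure backups/ exists
shutil.copytree(src, dst, dirs_exist_ok=True)  # copy the whole tree
print(f"backed up {src} -> {dst}")
```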

39
Shell and scripting languages - PowerShell
PowerShell is a cross-platform automation tool and configuration framework by Microsoft
that is optimized for working with structured data formats.

• Consists of a command-line shell and a scripting language
• Object-based; can be used to filter, sort, measure, group, and compare
objects as they pass through a data pipeline
• Used for data mining, building GUIs, and creating charts, dashboards, and
interactive reports

40
41
References
1. https://www.coursera.org/learn/introduction-to-data-engineering
2. https://macxima.medium.com/data-engineering-572733412d54
3. https://www.analyticsvidhya.com/blog/2021/06/data-engineering-concepts-and-importance/

42
