
Lecture 1: Introduction & Data Ecosystem
CS5481 Data Engineering
Instructor: Linqi Song
• Big data is changing the way we do business and creating a need for data
engineers who can collect and manage large quantities of data.
• Data engineering is the practice of designing and building systems for
collecting, storing, and analyzing data at scale.

2
Outline

• 1. Course organization

• 2. Overview of the data engineering ecosystem

• 3. Types of data

• 4. Languages and tools for data professionals

3
Bird’s-eye view of this course
• Data pipeline
• Data acquisition, data processing, and data storage
• Data management
• Data processing techniques for data-driven applications
• Information retrieval, recommendations, social network analysis, anomaly
detection

4
Instruction pattern
• 3-hour in-class learning
• 2-hour lecture, Tuesdays, 18:30-20:20, LI-2505
• Covers the main topics of the course
• 1-hour tutorial
• In-class hands-on exercises, discussions, presentations, etc.
• T01, Tuesdays, 20:30-21:20, LI-4412
• T02, Tuesdays, 21:30-22:20, LI-4412
• After class
• Homework assignments and projects
• Reading on recent advances in related fields and implementation details
• Papers, technical blogs, GitHub, etc.
• Q&A
• Canvas -> Discussions is the preferred channel
• Email for more personalized questions
• TAs for hands-on issues

5
Teaching team
• Prof. Linqi Song
• Yeung Y6425, [email protected]

• TAs
Mr. Wei Shao ([email protected]),
Mr. Zengyan Liu ([email protected]),
Ms. Yuxuan Yao ([email protected]),
Mr. Guangfeng Yan ([email protected]),
Mr. Haochen Tan ([email protected]),
Ms. Shuqi Liu ([email protected]).

6
Assessment
• Continuous assessment (60%)
• 2 individual homework assignments (each 15%)
• Answer questions and/or write programs to implement simple data engineering tasks
• 1 group project with presentations (30%)
• Form a group of 2-3 students (before Week 5)
• Select one topic among several given topics
• Do experiments and show your innovation and novelty
• Reports + codes + others (datasets, proofs, figures, etc.)
• Presentation (Week 13)
• Final exam (40%)

7
Proper use of Large Language Models (e.g.,
ChatGPT)
• We will specify whether LLMs may or may not be used in each HW/project,
following the department policy.

8
Schedule

9
Resources
• Computing resources
• JupyterHub for tutorials, homework assignments, and group projects
• CS department: https://mljh.cs.cityu.edu.hk/
• Google Colab: https://colab.research.google.com/
• Other resources: Kaggle (kaggle.com/notebooks), other online resources, CS Lab MMW 2462
• Other online courses

10
How to learn this course well?
• Higher-level postgraduate courses
• What problems to solve?
• How to approach the problem?
• Systematic ideas instead of details
• This data engineering course
• Get your hands dirty, as it is mainly about how to process data and implement
systems for domain applications
• Follow recent academic and industrial advances
• Discuss with others

11
Outline

• 1. Course organization

• 2. Overview of the data engineering ecosystem

• 3. Types of data

• 4. Languages and tools for data professionals

12
Venn diagram of the three skill dimensions for data-related work:
coding + maths + domain knowledge

13
Data engineering pipeline
• The goal of data engineering is to provide an organized, standardized data
flow that enables data-driven applications such as machine learning models
and data analysis.

14
Data engineering ecosystem
A Data Engineer’s ecosystem includes the infrastructure, tools,
frameworks, and processes for:
• Extracting data from disparate sources
• Architecting and managing data pipelines for transformation, integration, and
storage of data
• Architecting and managing data repositories
• Automating and optimizing workflows and flow of data between systems
• Developing applications needed through the data engineering flow

It is a diverse, rich, and challenging ecosystem.

15
Data engineering ecosystem: data sources
Data comes in a wide variety of file formats and is collected from a variety
of data sources:

16
Data engineering ecosystem: data storage

Online Analytical Processing (OLAP) Systems
• Optimized for conducting complex data analytics
• Include relational and non-relational databases, data warehouses, data
marts, data lakes, and big data stores

Online Transaction Processing (OLTP) Systems
• Designed to store high-volume, day-to-day operational data
• Typically relational, but can also be non-relational
17
Data engineering ecosystem: data integration

Data integration combines data from disparate sources into a unified view
that users can access to query and manipulate; a minimal sketch of this
pattern follows below.
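Here is a minimal extract-transform-load sketch of that idea in Python; the
file names, field names, and schema mapping (crm_export.csv,
web_orders.json, "id" vs. "cust") are hypothetical, invented purely for
illustration:

```python
# Minimal ETL sketch: extract from two hypothetical sources, transform
# records into one common schema, and load them into a unified view.
import csv
import json

def extract_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_json(path):
    with open(path) as f:
        return json.load(f)

def transform(record, source):
    # Normalize the two (assumed) source schemas into one unified view
    return {
        "customer_id": record["id"] if source == "crm" else record["cust"],
        "amount": float(record["amount"]),
        "source": source,
    }

def run_pipeline():
    unified = []
    unified += [transform(r, "crm") for r in extract_csv("crm_export.csv")]
    unified += [transform(r, "web") for r in extract_json("web_orders.json")]
    return unified  # the "unified view" users would query
```

Real pipelines delegate this work to dedicated integration tools, but the
extract-normalize-load pattern is the same.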

18
Data engineering ecosystem: data analysis
Data analysis is the process of discovering useful information, informing
conclusions, and supporting decision-making from data, across business,
science, and social science domains. In today's business world, data
analysis plays a role in making decisions more scientific and helping
businesses operate more effectively.

It often involves data mining and machine learning techniques to model and
process the data.
19
Data engineering ecosystem: data visualization
Business Intelligence (BI) and Reporting Tools
• Collect data from multiple data sources and present them in a visual format,
such as interactive dashboards
• Visualize data in real time or on a predefined schedule
• Drag-and-drop products that do not require programming knowledge (a
programmatic sketch follows below)
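In contrast to drag-and-drop products, visualization can also be done
programmatically. A minimal sketch using the open-source matplotlib library;
the monthly figures are made up:

```python
# Minimal programmatic visualization sketch with invented data
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 180]  # hypothetical monthly sales figures

plt.bar(months, sales)
plt.title("Monthly sales (made-up data)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```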

20
Data engineering ecosystem: data-driven
applications (1)
• Information retrieval (IR) is the process of obtaining information system
resources that are relevant to an information need from a collection of those
resources (texts, images, or sounds). Searches can be based on full-text or
other content-based indexing (a minimal indexing sketch follows below).

• A recommender system is a subclass of information filtering system that
seeks to predict the “rating” or “preference” a user would give to an item.
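To make the full-text indexing idea concrete, here is a minimal
inverted-index sketch in Python; the document collection and the query are
invented examples, not part of the lecture material:

```python
# Build an inverted index mapping each term to the set of documents
# containing it, then answer a query by intersecting posting lists.
from collections import defaultdict

docs = {
    1: "data engineering builds data pipelines",
    2: "machine learning models need clean data",
    3: "search engines use inverted indexes",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    # Boolean AND: keep documents that contain every query term
    terms = query.lower().split()
    result = index[terms[0]].copy() if terms else set()
    for term in terms[1:]:
        result &= index[term]
    return result

print(search("data pipelines"))  # -> {1}
```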

21
Data engineering ecosystem: data-driven
applications (2)
• Social network analysis investigates social structures through the use of
networks and graph theory. It characterizes networked structures in terms of
nodes (individual actors, people, or things within the network) and the ties,
edges, or links (relationships or interactions) that connect them.

• Anomaly detection is the identification of rare events, items, or
observations which are suspicious because they differ significantly from
standard behaviors or patterns. Anomalies in data are also called outliers,
noise, novelties, deviations, and exceptions (a sketch combining both ideas
follows below).
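A small sketch tying the two ideas together: represent a network as an
adjacency list, compute each node's degree, and flag nodes whose degree
deviates strongly from the mean. The edge list and the 1.5-standard-deviation
threshold are illustrative assumptions, not fixed conventions:

```python
# Degree computation over an adjacency list, plus a simple z-score
# rule to flag anomalous (hub-like) nodes.
from collections import defaultdict
from statistics import mean, pstdev

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
         ("b", "c"), ("d", "e")]

graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

degrees = {node: len(neighbors) for node, neighbors in graph.items()}
mu, sigma = mean(degrees.values()), pstdev(degrees.values())

anomalies = [n for n, d in degrees.items()
             if sigma > 0 and abs(d - mu) / sigma > 1.5]
print(degrees)    # {'a': 4, 'b': 2, 'c': 2, 'd': 2, 'e': 2}
print(anomalies)  # ['a'] -- node 'a' stands out as a hub
```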

22
Data engineering ecosystem: data governance
• Data quality: how well suited a data set is to serve its specific purpose,
covering accuracy, completeness, consistency, validity, uniqueness, bias, and
timeliness (a sketch of simple quality checks follows below).
• Data security: safeguarding data throughout its entire life cycle to protect
it from corruption, theft, or unauthorized access. It covers everything:
hardware, software, storage devices, and user devices; access and
administrative controls; and organizations' policies and procedures.
• Data privacy: proper handling of sensitive data, including personal data and
other confidential data such as certain financial data and intellectual
property data, to meet regulatory requirements as well as protecting the
confidentiality and immutability of the data.
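As a sketch of what checking a few of these quality dimensions can look like
in code, here is a minimal example over made-up records (real pipelines would
rely on dedicated data-quality tooling):

```python
# Toy checks for completeness, validity, and uniqueness
records = [
    {"id": 1, "email": "[email protected]", "age": 34},
    {"id": 2, "email": None, "age": 29},                # incomplete
    {"id": 2, "email": "[email protected]", "age": -5},   # duplicate id, bad age
]

complete = [r for r in records if all(v is not None for v in r.values())]
valid = [r for r in records if r["age"] is not None and 0 <= r["age"] <= 120]
ids = [r["id"] for r in records]

print(f"completeness: {len(complete)}/{len(records)} records are complete")
print(f"validity:     {len(valid)}/{len(records)} records have a plausible age")
print(f"uniqueness:   ids unique = {len(ids) == len(set(ids))}")
```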
23
Outline

• 1. Course organization

• 2. Overview of the data engineering ecosystem

• 3. Types of data

• 4. Languages and tools for data professionals

24
Types of data

• Structured: data that follows a rigid format and can be organized into rows
and columns.
• Semi-structured: a mix of data that has consistent characteristics and data
that does not conform to a rigid structure.
• Unstructured: data that is complex and mostly qualitative information that
cannot be structured into rows and columns.
25
Structured data (1)

• Has a well-defined structure.
• Can be stored in well-defined schemas.
• Can be represented in a tabular manner with rows and columns.

Structured data is objective facts and numbers that can be collected,
exported, stored, and organized in typical databases.

26
Structured data (2)
Sources of structured data include:

27
Semi-structured data (1)

• Has some organizational properties but lacks a fixed or rigid schema.
• Cannot be stored in the form of rows and columns as in databases.
• Contains tags and elements, or metadata, which are used to group data and
organize it in a hierarchy.

28
Semi-structured data (2)
Sources of semi-structured data

XML and JSON allow users to define tags and attributes to store data in a
hierarchical form and are widely used to store and exchange semi-structured
data, as the sketch below illustrates.
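A minimal sketch with Python's standard json module (the records are
invented): note how the optional "phones" field appears in one record but
not the other, which a rigid relational schema would not allow without
nulls or extra tables:

```python
# Parsing semi-structured JSON: records share fields but need not
# conform to one rigid schema.
import json

payload = """
[
  {"name": "Alice", "email": "[email protected]"},
  {"name": "Bob", "email": "[email protected]",
   "phones": ["555-0101", "555-0102"]}
]
"""

for person in json.loads(payload):
    phones = person.get("phones", [])  # optional field, absent for Alice
    print(person["name"], "has", len(phones), "phone number(s)")
```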
29
Unstructured data (1)

• Does not have an easily identifiable structure.
• Cannot be organized in a mainstream relational database in the form of rows
and columns.
• Does not follow any particular format, sequence, semantics, or rules.

30
Unstructured data (2)
Sources of unstructured data

31
Examples of different types of data

32
Outline

• 1. Course organization

• 2. Overview of the data engineering ecosystem

• 3. Types of data

• 4. Languages and tools for data professionals

33
Languages for data professionals

• Query languages: for example, SQL for querying and manipulating data
• Programming languages: for example, Python for developing data applications
• Shell and scripting languages: for repetitive and time-consuming
operational tasks

34
Query languages
Advantages of using SQL:
• SQL is portable and platform independent.
• Can be used for querying data in a wide variety of databases and data
repositories.
• Has a simple syntax that is similar to the English language.
• Can retrieve large amounts of data quickly and efficiently.
• Runs on an interpreter system (see the example below).
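A short example of SQL's portability: the query below uses standard syntax
that runs on many database systems; here it executes against SQLite through
Python's built-in sqlite3 module, with an illustrative table:

```python
# Create a small in-memory table, then run an aggregate SQL query
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

# Total sales per region -- simple, English-like syntax
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)  # east 175.0 / west 250.0

conn.close()
```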

35
General programming languages – Python
Python is one of the fastest-growing programming languages in the world.

Advantages of using Python:
• Easy to learn and open-source.
• Can be ported to multiple platforms and has widespread community support.
• Provides open-source libraries for data manipulation, data visualization,
statistics, and mathematics (see the pandas sketch below).
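A minimal sketch using pandas, one such open-source data-manipulation
library; the data frame contents are invented:

```python
# Group-by aggregation with pandas: mean temperature per city
import pandas as pd

df = pd.DataFrame({
    "city": ["HK", "HK", "SZ", "SZ"],
    "temp": [31.2, 30.8, 33.1, 32.5],
})

print(df.groupby("city")["temp"].mean())
```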

36
Statistical programming languages – R
Advantages of using R:
• Open-source and platform-independent.
• Can be paired with many programming languages and
highly extensible.
• Facilitates the handling of structured and unstructured
data.
• Can be used for developing statistical tools.

37
General programming languages – Java
Java is an object-oriented, class-based, and platform-independent programming language.

Advantages of using Java:
• One of the top-ranked programming languages used today.
• Used in a number of data analytics processes: cleaning data, importing and
exporting data, statistical analysis, and data visualization.
• Used in the development of big data frameworks and tools such as Hadoop,
Hive, and Spark.
• Well-suited for speed-critical projects.

38
Shell and scripting languages
Typical operations performed by shell scripts include:
• File manipulation
• Program execution
• System administration tasks such as disk backups and evaluating system logs
• Installation scripts for complex programs
• Executing routine backups (see the sketch below)
• Running batches
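Such operations are classically written in a shell such as Bash; as a sketch
in this course's main language, here is a routine-backup script in Python.
The source and destination paths are illustrative assumptions:

```python
# Copy a data directory into a dated backup folder
import shutil
import datetime
from pathlib import Path

src = Path("data")                             # directory to back up
dst = Path("backups") / datetime.date.today().isoformat()

dst.parent.mkdir(exist_ok=True)                # ensure backups/ exists
shutil.copytree(src, dst, dirs_exist_ok=True)  # copy the whole tree
print(f"backed up {src} -> {dst}")
```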

39
Shell and scripting languages - PowerShell
PowerShell is a cross-platform automation tool and configuration framework by Microsoft
that is optimized for working with structured data formats.

• Consists of a command-line shell and a scripting language
• Object-based; can be used to filter, sort, measure, group, and compare
objects as they pass through a data pipeline
• Used for data mining, building GUIs, and creating charts, dashboards, and
interactive reports

40
41
References
1. https://www.coursera.org/learn/introduction-to-data-engineering
2. https://macxima.medium.com/data-engineering-572733412d54
3. https://www.analyticsvidhya.com/blog/2021/06/data-engineering-concepts-and-importance/

42
