🔥
Week 2 - The Data Engineering Ecosystem
Overview of the Data Engineering Ecosystem
Conclusion
Automated tools, frameworks, and processes for all stages of the data analytics process are part of the Data Engineer’s
ecosystem.
It's a diverse, rich, and challenging ecosystem.
Types of Data
Structured data
Has a well-defined structure
Can be stored in well-defined schemas
Can be represented in a tabular manner with rows and columns
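As a minimal sketch of these three properties, the snippet below uses Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and chosen only for illustration; the course does not prescribe this database or schema.

```python
import sqlite3

# In-memory database; the "customers" table and its columns are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        country TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO customers (id, name, country) VALUES (?, ?, ?)",
    [(1, "Ada", "UK"), (2, "Linus", "FI")],
)

# The schema fixes the columns in advance.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]

# Every row has the same shape, so the data reads as a table.
rows = conn.execute(
    "SELECT id, name, country FROM customers ORDER BY id"
).fetchall()

# A row that violates the schema (no name) is rejected.
try:
    conn.execute("INSERT INTO customers (id, country) VALUES (3, 'US')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(columns)
for row in rows:
    print(row)
print("row without a name rejected:", rejected)
```

The rejected insert is the point of the example: a well-defined schema decides what a valid row looks like before any data arrives.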
Semi-Structured data
Has some organizational properties but lacks a fixed or rigid schema
Cannot be stored in the form of rows and columns as in databases
Contains tags and elements, or metadata, which are used to group data and organize it in a hierarchy
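To illustrate how tags can group data into a hierarchy without a fixed schema, here is a small sketch using Python's standard xml.etree.ElementTree module. The XML document and its tag names are hypothetical examples, not taken from the course.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: both contacts share a parent tag,
# but they do not carry the same set of fields.
xml_text = """
<contacts>
  <contact id="1">
    <name>Ada</name>
    <email>ada@example.com</email>
  </contact>
  <contact id="2">
    <name>Linus</name>
    <phone>555-0100</phone>
  </contact>
</contacts>
""".strip()

root = ET.fromstring(xml_text)

records = []
for contact in root.findall("contact"):
    # Attributes and child tags act as metadata describing each value.
    record = {"id": contact.get("id")}
    for child in contact:
        record[child.tag] = child.text
    records.append(record)

for record in records:
    print(record)
```

The two records end up with different keys, which is why data like this does not drop cleanly into a single table of rows and columns.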
Unstructured data
Does not have an easily identifiable structure
Cannot be organized in a mainstream relational database in the form of rows and columns
Does not follow any particular format, sequence, semantics, or rules
Conclusion
Structured data is data that is well organized in formats that can be stored in databases and lends itself to standard data
analysis methods and tools.
Semi-structured data is data that is somewhat organized and relies on meta tags for grouping and hierarchy.
Unstructured data is data that is not conventionally organized in the form of rows and columns in a particular format. In the
next video, we will learn about the different types of file structures.
Understanding Different Types of File Formats
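As a rough sketch of how two of the formats named in these notes differ in practice, the snippet below reads a delimited text sample and a JSON sample with Python's standard csv and json modules. The sample contents are hypothetical.

```python
import csv
import io
import json

# Hypothetical sample data, made up for illustration.
csv_text = "id,name,city\n1,Ada,London\n2,Linus,Helsinki\n"
json_text = (
    '[{"id": 1, "name": "Ada", "tags": ["math", "code"]},'
    ' {"id": 2, "name": "Linus"}]'
)

# Delimited text: one record per line, every value arrives as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: values keep their types, can nest, and fields can be absent.
json_rows = json.loads(json_text)

print(csv_rows[0])
print(json_rows[0])
```

One practical difference shows up immediately: the CSV reader returns the id as text, while the JSON parser returns it as a number and preserves the nested list.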
Sources of Data
Languages for Data Professionals
Reading: Metadata and Metadata Management
Summary and Highlights
A Data Engineer’s ecosystem includes the infrastructure, tools, frameworks, and processes for extracting data,
architecting and managing data pipelines and data repositories, managing workflows, developing applications, and
managing BI and Reporting tools.
Based on how well-defined the structure of the data is, data can be categorized as:
Structured data, that is, data which is well organized in formats that can be stored in databases.
Semi-structured data, that is, data which is partially organized and partially free-form.
Unstructured data, that is, data which cannot be organized conventionally into rows and columns.
Data comes in a wide-ranging variety of file formats, such as delimited text files, spreadsheets, XML, PDF, and JSON,
each with its own list of benefits and limitations of use.
Data is extracted from multiple data sources, ranging from relational and non-relational databases, to APIs, web services,
data streams, social platforms, and sensor devices.
Once the data is identified and gathered from different sources, it needs to be staged in a data repository so that it can be
prepared for analysis. The type, format, and sources of data influence the type of data repository that can be used.
Data professionals need a host of languages that can help them extract, prepare, and analyse data. These can be
classified as:
Querying languages, such as SQL, used for accessing and manipulating data from databases.
Programming languages, such as Python, R, and Java, for developing applications and controlling application
behavior.
Shell and Scripting languages, such as Unix/Linux Shell and PowerShell, for automating repetitive operational tasks.
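A brief sketch of how a querying language and a programming language typically work together: SQL expresses what data to fetch, and Python controls what the application does with the result. The snippet uses Python's built-in sqlite3 module; the "orders" table and its contents are hypothetical.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ada", 10.0), ("linus", 5.0), ("ada", 2.5)],
)

# Querying language: SQL accesses and aggregates the data.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()

# Programming language: Python decides how to present the result.
report = [f"{customer}: {total:.2f}" for customer, total in totals]
print("\n".join(report))
```

In an operational setting, a shell script or scheduler would usually be the piece that runs a program like this on a recurring basis.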
Quiz
Practice Quiz
Question 1
Automated tools, frameworks, and processes for all stages of the data analytics process are part of the Data Engineer’s
ecosystem. What role do data integration tools play in this ecosystem?
Store high-volume day-to-day operational data in data repositories
Cover the entire journey of data from source to destination
Combine data from multiple sources into a unified view that is accessed by data consumers to query and
manipulate data
Conduct complex data analytics
Question 2
Which of these data sources is an example of semi-structured data?
Documents
Social media feeds
Emails
Network and web logs
Question 3
Which one of the provided file formats is commonly used by APIs and Web Services to return data?
XML
Delimited file
JSON
XLS
Question 4
What is one example of the relational databases discussed in the video?
Spreadsheet
XML
Flat files
SQL Server
Question 5
Which of the following languages is one of the most popular querying languages in use today?
SQL
Java
Python
Graded Quiz
Question 1
There are two main types of data repositories – Transactional and Analytical. For high-volume day-to-day operational data
such as banking transactions, Transactional, or OLTP, systems are the ideal choice.
True
False
Transactional, or OLTP, systems are designed and optimized for handling high-volume transactions.
Question 2
Which of the following is an example of unstructured data?
Zipped files
Video and Audio files
XML
Spreadsheets
Question 3
Which one of these file formats is independent of software, hardware, and operating systems, and can be viewed the
same way on any device?
XML
XLSX
PDF
Delimited text file
PDF format is independent of software, hardware, and operating systems, and can be viewed the same way on any
device.
Question 4
Which data source can return data in plain text, XML, HTML, or JSON among others?
APIs
Delimited text file
XML
PDF
APIs can return data in a wide variety of formats such as plain text, XML, HTML, or JSON among others.
Question 5
In the data engineer’s ecosystem, languages are classified by type. What are shell and scripting languages most
commonly used for?
Manipulating data
Building apps
Automating repetitive operational tasks
Querying data