Emerging Chapter 2

Chapter 2 of the course 'Introduction to Emerging Technology' focuses on data science, covering its definition, the distinction between data and information, data types, and the data value chain. It outlines the objectives for students, including understanding the data processing life cycle and the basics of big data, as well as the Hadoop ecosystem. The chapter also discusses the characteristics of big data and the importance of clustered computing in managing large datasets.

Uploaded by

ambachewm27

Jigjiga University

Course Name:
Introduction to Emerging Technology
Chapter 2

Data Science
Introduction

• In the previous chapter, the role of data in emerging technologies was discussed.
• In this chapter, you are going to learn more about:
   • Data science,
   • Data vs. information,
   • Data types and representation,
   • Data value chain, and
   • Basic concepts of big data.
Objectives
After completing this chapter, the students will be able to:
• Describe what data science is and the role of data scientists.
• Differentiate data and information.
• Describe the data processing life cycle.
• Understand different data types from diverse perspectives.
• Describe the data value chain in the emerging era of big data.
• Understand the basics of big data.
• Describe the purpose of the Hadoop ecosystem components.
Activity 2.1

• What is data science? Can you describe the role of data in emerging technology?
• What are data and information?
• What is big data?
An Overview of Data Science

• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Data science is much more than simply analyzing data. It offers a range of roles and requires a range of skills.

What are data and information?

• Data can be described as unprocessed facts and figures.
Information, by contrast, is:
• The processed data on which decisions and actions are based.
• Data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the current or prospective action or decision of the recipient.
• Interpreted data, created from organized, structured, and processed data in a particular context.
The Difference between Data and Information
What is Data?
• Data is defined as the symbols that represent people, events, things, and ideas.
• Data becomes information when it is presented in a format that people can understand and use.
• Data is the raw material for information; data alone tells no story.
What is Information?
• Information is a collection of facts and figures organized in a meaningful manner so that it can serve as a basis for guidance and decision making.
• Information is data that has been processed and has meaning to the user.
• Information is the processed data we get as output.
Data vs Information

Data                               Information
• Meaningless                      • Meaningful
• Not used for decision making     • Used for decision making
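
The contrast above can be illustrated with a minimal Python sketch. The exam scores and the `summarize` helper are made up for illustration and are not part of the course material; the point is only that raw figures become usable for a decision once they are processed.

```python
# Raw data: unprocessed figures with no context (hypothetical exam scores).
raw_scores = [72, 85, 90, 64, 85]


def summarize(scores):
    """Process raw figures into information someone can act on."""
    return {
        "count": len(scores),
        "average": sum(scores) / len(scores),
        "highest": max(scores),
        "lowest": min(scores),
    }


info = summarize(raw_scores)
print(f"{info['count']} students, average score {info['average']:.1f}")
```

The list on its own tells no story; the summary is something a teacher could use, for example to decide whether a topic needs to be revisited.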

Data Processing Cycle

• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Data processing consists of three basic steps: input, processing, and output. These three steps constitute the data processing cycle.
   • Input: in this step, the input data is prepared in some convenient form for processing.
   • Processing: in this step, the input data is changed to produce data in a more useful form.
   • Output: at this stage, the result of the processing step is collected.
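
The three steps can be sketched as three small Python functions. This is an illustrative sketch only: the function names and the sample lines are assumptions, not something the chapter prescribes.

```python
def read_input(lines):
    """Input: put raw text lines into a convenient form (a list of numbers)."""
    return [float(line.strip()) for line in lines if line.strip()]


def process(values):
    """Processing: change the input into a more useful form."""
    return {"sorted": sorted(values), "total": sum(values)}


def write_output(result):
    """Output: collect the result of the processing step as a report."""
    return f"total={result['total']}, values={result['sorted']}"


# Hypothetical raw input, including stray whitespace and an empty line.
raw_lines = ["12.5\n", " 3\n", "\n", "7.25\n"]
report = write_output(process(read_input(raw_lines)))
print(report)  # total=22.75, values=[3.0, 7.25, 12.5]
```

Each function hands its result to the next one, which mirrors the input, processing, output order of the cycle.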
Data Types and their Representation

• A data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Data types from a computer programming perspective
Common data types include:
• Integers (int): used to store whole numbers
• Booleans (bool): used to represent true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers
• Alphanumeric strings (string): used to store a combination of characters and numbers
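
As a rough illustration, the same types look like this in Python. The variable names and values are made up. Note that Python has no separate `char` type, so a single character is simply a string of length 1; languages such as C or Java do have a distinct `char`.

```python
age = 21                 # int: a whole number
is_enrolled = True       # bool: True or False
grade = "A"              # single character (a str of length 1 in Python)
gpa = 3.75               # float: a real number
student_id = "JJU2024"   # str: letters and digits combined (made-up identifier)

# Ask the interpreter which type it assigned to each value.
for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)
```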
Data types from a data analytics perspective

• From a data analytics point of view, it is important to understand that there are three common data types or structures:
1. Structured,
2. Semi-structured, and
3. Unstructured data.
Structured Data
• Structured data is highly organized and is stored in a predefined format.
• Structured data is stored in rows and columns, that is, in tabular format.
• Common examples of structured data are Excel files and SQL databases, which have structured rows and columns that can be sorted.
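
A small Python sketch of why tabular data is easy to work with: because every row has the same columns, sorting by a column is a one-line operation. The table contents are hypothetical and the standard `csv` module stands in for a spreadsheet or database.

```python
import csv
import io

# Hypothetical table: every row has the same three columns.
table_text = """name,department,age
Hana,Computer Science,20
Abdi,Economics,23
Sara,Biology,19
"""

rows = list(csv.DictReader(io.StringIO(table_text)))

# Sort the rows by the "age" column, as one would in a spreadsheet.
by_age = sorted(rows, key=lambda row: int(row["age"]))
for row in by_age:
    print(row["name"], row["age"])
```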
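Semi-structured Data
• This type of data is difficult to categorize, because it sometimes looks like structured data and sometimes looks like unstructured data.
• It does not fit a fixed table of rows and columns, but it carries tags or keys that separate its elements. Examples: JSON and XML.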
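
To make this concrete, here is a short Python sketch using the standard `json` module. The two records are invented for illustration: both carry keys that label their contents, but they do not share the same fields, so they would not fit one fixed table without gaps.

```python
import json

# Hypothetical JSON document: labeled fields, but no fixed set of columns.
document = """
[
  {"name": "Hana", "phones": ["0911", "0922"]},
  {"name": "Abdi", "address": {"city": "Jigjiga"}}
]
"""

records = json.loads(document)
for record in records:
    # A field may be missing, so fall back to a default value.
    city = record.get("address", {}).get("city", "unknown")
    print(record["name"], sorted(record.keys()), city)
```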
Unstructured Data
• Unstructured data is information that either does not have a predefined data model or is not organized in a predefined manner.
• Unstructured data cannot be stored in rows and columns.
• Examples: audio files, video files, images, or NoSQL databases.
Structured data vs Unstructured data

Structured data                           Unstructured data
• Can be displayed in rows, columns,      • Cannot be displayed in rows, columns,
  and relational databases                  and relational databases
• Only 20% of the world's data            • 80% of the world's data
• Requires less storage                   • Requires more storage
• Easy to manage                          • Difficult to manage
Metadata: Data about Data

• Metadata is data about data.
• Metadata is defined as data that provides information about one or more aspects of other data.
• It provides additional information about a specific set of data.
• In a set of photographs, for example, metadata could describe when and where the photos were taken.
• The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
• For this reason, metadata is frequently used by big data solutions for initial analysis.
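
The photograph example can be sketched in Python as follows. The file names, dates, and locations are invented, and `taken_in` is a hypothetical helper. The image content itself is unstructured, but the metadata fields are structured, so the photos can be filtered without opening a single image.

```python
from datetime import date

# Hypothetical metadata records for three photos.
photos = [
    {"file": "IMG_001.jpg", "taken_on": date(2023, 5, 14), "location": "Jigjiga"},
    {"file": "IMG_002.jpg", "taken_on": date(2023, 9, 2), "location": "Harar"},
    {"file": "IMG_003.jpg", "taken_on": date(2024, 1, 20), "location": "Jigjiga"},
]


def taken_in(photo_list, location, year):
    """Return the files whose metadata matches a location and a year."""
    return [
        p["file"]
        for p in photo_list
        if p["location"] == location and p["taken_on"].year == year
    ]


matches = taken_in(photos, "Jigjiga", 2023)
print(matches)  # ['IMG_001.jpg']
```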
Data Value Chain

• The data value chain describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse.
• The Big Data Value Chain identifies five high-level activities:
Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
• Data acquisition is one of the major big data challenges in terms of infrastructure requirements.
Data Analysis
• It is concerned with making the acquired raw data amenable to use in decision making as well as domain-specific usage.
• Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data.
Data Curation
• It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
• The professional who ensures data quality is called a data curator.
• Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
• Relational (SQL) databases guarantee the ACID (Atomicity, Consistency, Isolation, and Durability) properties, but those guarantees make them less flexible with schema changes and harder to scale as data volume and complexity grow.
• For this reason, NoSQL technologies are now widely used for big data storage.
Data Usage
• It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
What Is Big Data?

• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• Big data refers to large sets of complex data, both structured and unstructured, which traditional processing techniques and/or algorithms are unable to operate on.
• Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
Big Data

• Big data is characterized by the 4 Vs and more:
   • Volume: large amounts of data (zettabytes, massive datasets).
   • Velocity: data is live streaming or in motion; it is the speed at which new data is generated.
   • Variety: data comes in many different forms from diverse sources.
   • Veracity: can we trust the data? How accurate is it?
Clustered Computing and Hadoop Ecosystem

Clustered Computing
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
   • Resource pooling: combining the available storage space to hold data is a clear benefit.
   • High availability: clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.
   • Easy scalability: clusters make it easy to scale horizontally by adding machines to the group, without expanding the physical resources of any single machine.
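
Horizontal scaling can be illustrated with a toy Python sketch. This is a simplification under stated assumptions: `assign_node` and `load_per_node` are hypothetical helpers, and the modulo-of-a-hash placement rule is chosen only for brevity. Real cluster software uses more sophisticated placement and also replicates data for availability.

```python
import zlib
from collections import Counter


def assign_node(key, node_count):
    """Pick a node for a record by hashing its key (toy placement rule)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def load_per_node(keys, node_count):
    """Count how many records each node would hold."""
    return Counter(assign_node(key, node_count) for key in keys)


# 1000 made-up record keys spread over clusters of different sizes.
keys = [f"record-{i}" for i in range(1000)]
for node_count in (2, 4, 8):
    load = load_per_node(keys, node_count)
    print(node_count, "nodes: busiest node holds", max(load.values()), "records")
```

Adding machines spreads the same records over more nodes, so each node holds a smaller share without any single machine being upgraded.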
Hadoop and its Ecosystem

• Hadoop is an open-source framework intended to make interaction with big data easier.
• It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
The four key characteristics of Hadoop are:
• Economical: its systems are highly economical.
• Reliable: it stores copies of the data on different machines and is resistant to hardware failure.
• Scalable: it is easily scalable both horizontally and vertically. A few extra nodes help in scaling up the framework.
• Flexible: you can store as much structured and unstructured data as you need and decide how to use it later.
Big Data Life Cycle with Hadoop

Ingesting data into the system
• The first stage of big data processing is Ingest.
• The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
Processing the data in storage
• The second stage is Processing.
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
• Spark and MapReduce perform the data processing.
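
The MapReduce programming model mentioned above can be sketched on a single machine in plain Python. This is not Hadoop code and does not use any Hadoop API; the function names and sample lines are assumptions, and the sketch only shows the map, shuffle, and reduce steps that a cluster would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data needs big clusters", "data moves fast"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across machines, which is what makes the model suitable for clusters.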
Computing and analyzing data
• The third stage is Analyze.
• Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then analyzes it.
• Hive is also based on the map and reduce programming model and is most suitable for structured data.
Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
Review Questions
1. What is data science?
2. What is the difference between structured and unstructured data?
3. What is the difference between data and information?
4. What is the data processing life cycle? List its activities.
Quiz 5%

Part I: True or False
1. The first industrial revolution started in America……………………
2. Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements…………………….
3. The industrial revolution was a time when the manufacturing of goods moved from small shops and homes to large factories…………….
Part II: Choose the best answer
4. Which data type is used to store integers?
A. float B. int C. bool D. char
5. Human-Computer Interaction consists of:
A. The user B. The computer itself C. The way they work together D. All
Part III: Short Answer
6. What is metadata?
End of Chapter Two
Any Questions?