Introduction to Emerging Technology
(EmTe 1012)
Chapter Two:
Data Science
Outline
Overview of data science
Data vs. Information
Data Processing Cycle
Data types
Data value chain
Basic concepts of big data
Hadoop Ecosystem
Big Data Life Cycle with Hadoop
Overview of Data Science
Data science is a field
that uses scientific methods, processes,
algorithms, and systems
to extract knowledge and insights
from structured, semi-structured and
unstructured data.
Data science is much more than simply
analyzing data.
It offers a range of roles and requires
a range of skills.
Data vs. Information
Data
It is defined as a representation of facts,
concepts, or instructions in a formalized
manner, which should be suitable for
communication, interpretation, or
processing by humans or electronic
machines.
It is described as unprocessed facts and
figures.
It is represented with the help of
characters like alphabets (A-Z, a-z), digits
(0-9), or special characters (+, -, /, *, <, >, =, etc.).
Cont.…
Information
It is the processed data on which
decisions and actions are based.
It is data that has been processed into
a form that is meaningful and valuable
to the recipient for action or decision-making.
It is also defined as interpreted data,
created from organized,
structured, and processed data in a
particular context.
Data Processing Cycle
Data processing is the restructuring or
re-ordering of data by people or machines
in order to increase its usefulness and add
value for a particular purpose.
The data processing cycle consists of 3
basic steps:
Input
Processing
Output
Data Processing Cycle Cont.…
Input: The input data is prepared in some
suitable form for processing.
The form will depend on the processing
machine.
Processing: The input data is changed into
a more useful form.
For example, interest can be calculated on a
deposit to the bank.
Output: The result of the processing step
is collected.
The form will depend on the use of the output.
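To make the three steps concrete, here is a minimal Python sketch of the cycle using the interest-on-a-deposit example above; the deposit amount and interest rate are illustrative assumptions, not values from the slides.

def read_input():
    # Input: prepare the data in a form suitable for processing.
    return {"deposit": 10_000.0, "annual_rate": 0.07}

def process(data):
    # Processing: change the input into a more useful form
    # (here, one year of simple interest on the deposit).
    return data["deposit"] * data["annual_rate"]

def write_output(interest):
    # Output: collect the result in a form suited to its use.
    print(f"Interest earned this year: {interest:.2f}")

write_output(process(read_input()))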
Data types from Computer programming
perspective
A data type is an attribute of data that
tells the compiler how the programmer
intends to use the data.
Common data types include:
Integers (int)- used to store whole
numbers,
Booleans (bool)- used to represent values
restricted to one of two options: true or
false
Characters (char)- used to store a single
character
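As a small illustration, the sketch below expresses these data types in Python; note that Python has no separate char type, so a single character is shown as a one-character string, and the variable names are only illustrative.

age: int = 25               # integer (int): stores a whole number
is_enrolled: bool = True    # boolean (bool): restricted to True or False
grade: str = "A"            # character (char): a single character, a length-1 string in Python

print(type(age), type(is_enrolled), type(grade))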
Data types from Data Analytics
perspective
From a data analytics point of view, it is important to
understand that there are three common data types:
Structured,
Semi-structured, and
Unstructured,
along with Metadata (data about data).
Cont.…
1. Structured Data:-
is data that conforms to a pre-defined
data model.
It is straightforward to analyze.
It fits into a tabular format (rows and
columns).
Examples: Excel files or SQL databases.
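As a small sketch of structured data, the Python snippet below stores rows that follow a fixed schema in SQLite (an SQL database held in memory); the table and column names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, gpa REAL)")
conn.execute("INSERT INTO students VALUES (1, 'Abebe', 3.8)")
conn.execute("INSERT INTO students VALUES (2, 'Sara', 3.5)")

# Because the data follows a pre-defined tabular model, it is straightforward to query.
for row in conn.execute("SELECT name, gpa FROM students WHERE gpa > 3.6"):
    print(row)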
Cont.…
2. Semi-structured data
It is a form of structured data that does
not conform to the formal structure of the
data models associated with relational
databases or other data tables,
but it contains tags or markers to separate
elements and enforce hierarchies of records
and fields within the data.
Therefore, it is known as self-describing data.
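A common example of semi-structured data is a JSON document; the minimal Python sketch below shows how its tags (keys) separate elements and describe a hierarchy without a rigid table schema. The record contents are illustrative assumptions.

import json

record = '''
{
  "student": {
    "name": "Abebe",
    "courses": [
      {"code": "EmTe 1012", "grade": "A"},
      {"code": "Math 101"}
    ]
  }
}
'''

data = json.loads(record)
# The data is self-describing: the keys say what each value means,
# even though the two course entries do not share identical fields.
for course in data["student"]["courses"]:
    print(course.get("code"), course.get("grade", "not graded"))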
Cont.…
3. Unstructured Data
It is data that either
does not have a predefined data model or
is not organized in a pre-defined manner.
It is typically text-heavy but may contain data
such as dates, numbers, and facts as well.
This results in irregularities and ambiguities
that make it difficult to understand using
programs.
Examples: audio or video files.
Cont.…
Metadata
Metadata is data about data.
It provides additional information about a
specific set of data.
Example: In a set of photographs,
metadata could describe when and where
the photos were taken.
The metadata then provides fields for dates and
locations which, by themselves, can be considered
structured data.
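As a small sketch, the Python dictionary below models the photo example: the image pixels are the data, while fields such as the capture date and location are metadata about it. The field names and values are illustrative assumptions.

photo_metadata = {
    "file_name": "IMG_0042.jpg",
    "date_taken": "2023-05-14T09:30:00",
    "location": {"lat": 9.03, "lon": 38.74},   # e.g. GPS coordinates
    "camera": "Phone camera",
}

# Because the metadata fields are consistently named, the metadata itself
# can be treated as structured data.
print(photo_metadata["date_taken"], photo_metadata["location"])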
Data Value Chain
It describes the information flow within a
big data system
as a series of steps needed to generate
value and useful insights from data.
The Big Data Value Chain identifies the
following key high-level activities:
1. Data Acquisition
2. Data Analysis
3. Data Curation
4. Data Storage and
5. Data Usage
1. Data Acquisition:
It is the process of gathering, filtering, and
cleaning data before it is put in any storage
solution on which data analysis can be carried
out.
It is one of the major big data challenges in
terms of infrastructure requirements.
The infrastructure required to support data
acquisition must:
deliver low, predictable latency in both
capturing data and in executing queries;
be able to handle very high transaction
volumes; and
support flexible and dynamic data structures.
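To illustrate the gathering, filtering, and cleaning described above, here is a minimal Python sketch that drops incomplete records and normalizes the rest before they are put into storage; the record format and the validity rule are illustrative assumptions.

raw_records = [
    {"sensor": "s1", "value": "23.5"},
    {"sensor": "s2", "value": ""},        # missing reading: will be filtered out
    {"sensor": "s1", "value": " 24.1"},   # stray whitespace: will be cleaned
]

def acquire(records):
    for rec in records:
        value = rec["value"].strip()
        if not value:                     # filtering: drop incomplete records
            continue
        yield {"sensor": rec["sensor"], "value": float(value)}  # cleaning: normalize the type

print(list(acquire(raw_records)))         # ready to be stored for later analysis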
2. Data Analysis:
It is concerned with making the raw data
acquired amenable to use in
decision-making.
It involves exploring, transforming, and
modeling data with the goal of
highlighting relevant data,
synthesizing and extracting useful hidden
information
Related areas include data mining, business
intelligence, and machine learning.
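As a minimal sketch of the exploring and transforming described above, the snippet below uses pandas to clean a small table and summarize it; the sales records are illustrative assumptions.

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East"],
    "amount": [120.0, 80.0, 200.0, None, 150.0],
})

# Transform: fill the missing value so later steps see clean data.
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())

# Explore / highlight relevant data: total amount per region.
print(sales.groupby("region")["amount"].sum())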
3. Data Curation:
It is the active management of data to
ensure it meets the necessary data
quality requirements for its effective
usage.
It can be categorized into different
activities like
content creation, selection, classification,
transformation, validation, and
preservation.
It is performed by expert curators
They are responsible for ensuring that
data are trustworthy, discoverable,
accessible, reusable and fit their purpose.
Data curators are also known as scientific curators or data annotators.
4. Data Storage:
It is the persistence and management of
data in a scalable way that provides fast
access to the data.
RDBMSs have been the main solution to
the storage paradigm for nearly 40 years.
However, as data volumes and
complexity grow, the ACID (Atomicity,
Consistency, Isolation, and Durability)
properties lack flexibility,
making them unsuitable for big data
scenarios.
NoSQL technologies present solutions
based on alternative data models.
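As a rough sketch of an alternative (document-style) data model of the kind NoSQL stores use, the snippet below keeps records as Python dictionaries rather than rows in a fixed schema; the field names are illustrative assumptions.

documents = [
    {"id": 1, "name": "Abebe", "courses": ["EmTe 1012"]},
    {"id": 2, "name": "Sara", "email": "sara@example.com"},  # different fields: no fixed schema
]

# Queries work over whatever fields each document happens to have.
enrolled = [d["name"] for d in documents if "EmTe 1012" in d.get("courses", [])]
print(enrolled)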
5. Data Usage:
It covers the data-driven business
activities that need access to data, its
analysis, and the tools needed to
integrate the data analysis within the
business activity.
Data usage in business decision-
making can enhance competitiveness
through
the reduction of costs or
increased added value.
Basic concepts of big data
Big data is a large amount of data that
consists of structured and unstructured data.
A large dataset means one that is too large to process or
store with traditional tools or on a single
computer.
Big data is a collection of large and
complex data sets
that becomes difficult to process using on-
hand DB management tools or traditional data
processing applications.
Characteristics of Big Data
Big data is characterized by the 3 Vs and more:
Volume: large amounts of data (up to zettabytes)
Velocity: data is live, streaming, or in motion
Variety: data comes in different forms from
diverse sources
Veracity: how accurate is the data? Can we trust
it?
Clustered Computing
Because of the qualities of big data, individual
computers are often inadequate for
handling the data at most stages.
Computer clusters are a better fit in order
to address the high storage and
computational needs of big data
Big data clustering software combines the
resources of many smaller machines to provide a
number of benefits:
Cont.…
Resource Pooling:
Combining the available storage space, CPU,
and memory is extremely important,
since processing large datasets requires
large amounts of these three resources.
High Availability:
It provides availability guarantees to prevent
hardware or software failures from affecting
access to data and processing.
Easy Scalability:
Clusters make it easy to scale horizontally by
adding additional machines to the group.
Cont.…
Using clusters requires a solution for
managing cluster membership,
coordinating resource sharing, and
scheduling work on individual nodes.
Cluster membership and resource allocation can be
handled by software like Hadoop’s YARN
(which stands for Yet Another Resource
Negotiator).
Hadoop (High Availability Distributed Object Oriented
Platform)
Hadoop is a tool that is used to handle big
data.
Hadoop is an open-source framework
that is designed to make interaction with
big data easier.
It is a framework that allows for the
distributed processing of large
datasets across clusters of computers
using simple programming models.
It is inspired by a technical document published by Google.
Characteristics Of Hadoop
Economical:
Since regular computers can be used for data
processing
Reliable:
Since it stores copies of the data on different
machines and is resistant to hardware failure.
Scalable:
Since it is easily scalable, both horizontally
and vertically.
Flexible:
Since you can store both structured and
unstructured data.
Components of Hadoop’s Ecosystem
Hadoop has an ecosystem that has
evolved from its four core components:
1. Data management,
2. Access,
3. Processing and
4. Storage
It is continuously growing to meet the needs of Big
Data.
The Hadoop ecosystem comprises the following
components:
Oozie: job scheduling
Zookeeper: managing the cluster
Pig, Hive: query-based processing of data services
MapReduce: programming-based data processing
HDFS: Hadoop Distributed File System
Spark: in-memory data processing
Solr, Lucene: searching and indexing
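To give a feel for the MapReduce component, here is a minimal word-count sketch in Python in the map-and-reduce style: the mapper emits (word, 1) pairs and the reducer sums the counts per word. Both phases are simulated locally on a small in-memory sample rather than run on a real cluster.

from itertools import groupby

def mapper(lines):
    # Map phase: emit a (key, value) pair for every word.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs are grouped (sorted) by key; sum each group.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

sample = ["big data needs big clusters", "hadoop processes big data"]
for word, count in reducer(mapper(sample)):
    print(word, count)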
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The data is ingested or transferred to
Hadoop from various sources such as
databases, systems, or local files.
Sqoop transfers data from RDBMS to HDFS, whereas Flume
transfers event data.
2. Processing the data in storage
The data is stored and processed
It is performed by tools such as HDFS and
HBase (to store the data) and Spark and
MapReduce (to perform the data processing).
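As a concrete illustration of the processing step, the minimal PySpark sketch below reads a file that was previously ingested into HDFS and aggregates it with Spark; the HDFS path and the column name are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emte-lifecycle-demo").getOrCreate()

# Read a CSV file previously ingested into HDFS (e.g. by Sqoop or Flume).
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# Process: count the number of records per region.
orders.groupBy("region").count().show()

spark.stop()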
Cont.…
3. Analysing the data
The data is analysed by processing
frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce
operations and then analyzes it.
Hive is also based on MapReduce programming and is
most suitable for structured data.
4. Visualizing the results
The analyzed data can be accessed by users.
It is performed by using tools such as
Cloudera Search and Hue.