
Introduction to Emerging Technology


(EmTe 1012)

Chapter Two:

Data Science

Outline
Overview of Data Science
Data vs. Information
Data Processing Cycle
Data Types
Data Value Chain
Basic Concepts of Big Data
Hadoop Ecosystem
Big Data Life Cycle with Hadoop
Overview of Data Science
Data science is a field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
Data science is much more than simply analyzing data; it offers a range of roles and requires a range of skills.
Data Vs. Information
Data
Data is defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
It can be described as unprocessed facts and figures.
Data is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, etc.).
Cont.…
Information
Information is the processed data on which decisions and actions are based.
It is data that has been processed into a form that is meaningful and valuable for the recipient's actions or decisions.
It can also be defined as interpreted data, i.e., data created from organized, structured, and processed data in a particular context.
Data Processing Cycle
Data processing is the restructuring or reordering of data by people or machines in order to increase its usefulness and add value for a particular purpose.
The data processing cycle consists of three basic steps:
Input
Processing
Output

Data Processing Cycle Cont.…
Input: The input data is prepared in some suitable form for processing; the form depends on the processing machine.
Processing: The input data is changed into a more useful form. For example, interest can be calculated on a bank deposit.
Output: The result of the processing step is collected; the form of the output depends on its intended use.
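As a minimal sketch of the cycle (the deposit amount and interest rate below are illustrative), the following Java fragment takes an input value, processes it, and produces an output:

public class InterestDemo {
    public static void main(String[] args) {
        double deposit = 1000.0;                 // input: deposit amount
        double annualRate = 0.07;                // input: assumed 7% annual interest rate
        double interest = deposit * annualRate;  // processing: simple interest for one year
        System.out.println("Interest earned: " + interest); // output: collected result
    }
}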
Data Types from a Computer Programming Perspective
A data type is an attribute of data that tells the compiler how the programmer intends to use the data.
Common data types include:
Integers (int): used to store whole numbers
Booleans (bool): restricted to one of two values, true or false
Characters (char): used to store a single character
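A minimal illustration in Java (note that Java spells the Boolean type "boolean"; the values below are arbitrary):

public class DataTypesDemo {
    public static void main(String[] args) {
        int count = 42;          // integer: a whole number
        boolean isValid = true;  // boolean: true or false only
        char grade = 'A';        // character: a single character
        System.out.println(count + " " + isValid + " " + grade);
    }
}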
Data Types from a Data Analytics Perspective
From a data analytics point of view, there are three common data types:
Structured,
Semi-structured, and
Unstructured.
In addition, metadata (data about data) is often discussed alongside these.
Cont.…

1. Structured Data
It is data that adheres to a pre-defined data model.
It is straightforward to analyze.
It fits into a tabular format (rows and columns).
Examples: Excel files or SQL databases.
Cont.…
2. Semi-structured Data
It is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other forms of data tables.
It contains tags or markers to separate elements and enforce hierarchies of records and fields within the data.
Therefore, it is known as self-describing data.
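For instance, a short JSON document (a common semi-structured format; the field names and values here are made up) shows how tags label elements and nesting expresses hierarchy without a fixed table schema:

{
  "name": "Abebe",
  "age": 30,
  "courses": ["EmTe 1012", "Data Science"]
}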
Cont.…
3. Unstructured Data
It is data that either does not have a predefined data model or is not organized in a pre-defined manner.
It is typically text-heavy but may also contain data such as dates, numbers, and facts.
This results in irregularities and ambiguities that make it difficult to understand using traditional programs.
Examples: audio or video files.
Cont.…
Metadata
Metadata is data about data.
It provides additional information about a specific set of data.
Example: in a set of photographs, metadata could describe when and where the photos were taken.
The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
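As a small illustration (the field names and values are hypothetical), such photo metadata can be modeled as key-value pairs:

import java.util.Map;

public class PhotoMetadataDemo {
    public static void main(String[] args) {
        // Hypothetical metadata for one photograph; each field is itself structured data.
        Map<String, String> photoMeta = Map.of(
                "dateTaken", "2023-11-05",
                "location", "Addis Ababa");
        photoMeta.forEach((field, value) -> System.out.println(field + ": " + value));
    }
}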
Data Value Chain
It describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
The Big Data Value Chain identifies the following key high-level activities:
1. Data Acquisition
2. Data Analysis
3. Data Curation
4. Data Storage
5. Data Usage
1. Data Acquisition:
It is the process of gathering, filtering, and cleaning data before it is put in any storage solution on which data analysis can be carried out.
It is one of the major big data challenges in terms of infrastructure requirements.
The infrastructure required to support data acquisition must:
deliver low, predictable latency in both capturing data and executing queries;
be able to handle very high transaction volumes; and
support flexible and dynamic data structures.
2. Data Analysis:
It is concerned with making the raw data acquired amenable to use in decision-making.
It involves exploring, transforming, and modeling data with the goal of highlighting relevant data and synthesizing and extracting useful hidden information.
Related areas include data mining, business intelligence, and machine learning.
3. Data Curation:
It is the active management of data to ensure it meets the necessary data quality requirements for its effective usage.
It can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
It is performed by expert curators, who are responsible for ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
Data curators are also known as scientific curators or data annotators.
4. Data Storage:
It is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
RDBMSs have been the main solution to the storage paradigm for nearly 40 years.
However, as data volumes and complexity grow, the ACID (Atomicity, Consistency, Isolation, and Durability) properties lack flexibility, making them unsuitable for big data scenarios.
NoSQL technologies present solutions based on alternative data models.
5. Data Usage:
It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
Data usage in business decision-making can enhance competitiveness through the reduction of costs or increased added value.
Basic concepts of big data
Big data is a large amount of data consisting of structured and unstructured data.
A large dataset here means one too large to process or store with traditional tools or on a single computer.
Big data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Characteristics of Big Data
Big data is characterized by the 3Vs and more:
Volume: large amounts of data (up to zettabytes)
Velocity: data is live streaming or in motion
Variety: data comes in different forms from diverse sources
Veracity: how accurate is the data? Can we trust it?
Clustered Computing
Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages.
Computer clusters are a better fit to address the high storage and computational needs of big data.
Big data clustering software combines the resources of many smaller machines to provide a number of benefits:
Cont.…
Resource Pooling:
Combining the available storage space, CPU, and memory is extremely important, since processing large datasets requires large amounts of all three resources.
High Availability:
Clusters can provide availability guarantees to prevent hardware or software failures from affecting access to data and processing.
Easy Scalability:
Clusters make it easy to scale horizontally by adding additional machines to the group.
Cont.…
Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling work on individual nodes.
Cluster membership and resource allocation can be handled by software like Hadoop's YARN (which stands for Yet Another Resource Negotiator).
Hadoop (High Availability Distributed Object Oriented
Platform)
Hadoop is a tool that is used to handle big data.
It is an open-source framework designed to make interaction with big data easier.
It is a framework that allows the distributed processing of large datasets across clusters of computers using simple programming models.
It was inspired by a technical document published by Google.
Characteristics Of Hadoop
Economical:
Regular (commodity) computers can be used for data processing.
Reliable:
It stores copies of the data on different machines and is resistant to hardware failure.
Scalable:
It is easily scalable, both horizontally and vertically.
Flexible:
You can store both structured and unstructured data.
Components of Hadoop’s Ecosystem
Hadoop has an ecosystem that has evolved from its four core components:
1. Data management,
2. Access,
3. Processing, and
4. Storage.
It is continuously growing to meet the needs of big data.
Hadoop comprises the following components (a sketch of the classic MapReduce word-count job follows the list):
Oozie: job scheduling
ZooKeeper: cluster management
Pig, Hive: query-based processing of data services
MapReduce: programming-based data processing
HDFS: Hadoop Distributed File System
Spark: in-memory data processing
Solr, Lucene: searching and indexing
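As a concrete illustration of the MapReduce component, below is a minimal sketch of the classic word-count job in Java, following the structure of the standard Hadoop tutorial (input and output paths are supplied as command-line arguments):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner reuses the reducer locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}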
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The data is ingested, or transferred, to Hadoop from various sources such as databases, other systems, and local files.
Sqoop transfers data from RDBMSs to HDFS, whereas Flume transfers event data; a representative Sqoop command is shown below.
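For example, a typical Sqoop import (the connection string, username, table name, and target directory below are placeholders) copies a relational table into HDFS:

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username analyst \
  --table customers \
  --target-dir /user/hadoop/customers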

2. Processing the data in storage
The data is then stored and processed.
This is performed by tools such as HDFS and HBase (to store data) and Spark and MapReduce (to perform data processing).
Cont.…
3. Analyzing the data
The data is analyzed by processing frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce steps and then analyzes it.
Hive is also based on map and reduce programming and is most suitable for structured data.
4. Visualizing the results
The analyzed data can be accessed by users.
This is performed using tools such as Cloudera Search and Hue.