Introduction to Emerging Technology
(EmTe 1012)
Chapter Two:
Data Science
Outline
Overview of data science
Data vs. Information
Data Processing Cycle
Data types
Data value chain
Basic concepts of big data
Hadoop Ecosystem
Big Data Life Cycle with Hadoop
Overview of Data Science
Data science is a field
that uses scientific methods, processes,
algorithms, and systems
to extract knowledge and insights
from structured, semi-structured and
unstructured data.
Data science is much more than simply
analyzing data.
It offers a range of roles and requires
a range of skills.
Data vs. Information
Data
It is defined as a representation of facts,
concepts, or instructions in a formalized
manner, which should be suitable for
communication, interpretation, or
processing by humans or electronic
machines.
It is described as unprocessed facts and
figures.
It is represented with the help of
characters like alphabets (A-Z, a-z), digits
(0-9), or special characters (+, -, /, *, <, >, =, etc.).
Cont.…
Information
It is the processed data on which
decisions and actions are based.
It is data that has been processed into
a form that is meaningful and valuable
to the recipient for action or decision-making.
It is also defined as interpreted data,
created from organized,
structured, and processed data in a
particular context.
Data Processing Cycle
Data processing is the restructuring or
re-ordering of data by people or machines
in order to increase its usefulness and add
value for a particular purpose.
The data processing cycle consists of 3
basic steps:
Input
Processing
Output
Data Processing Cycle Cont.…
Input: The input data is prepared in some
suitable form for processing.
The form will depend on the processing
machine.
Processing: The input data is changed into
a more useful form.
For example, interest can be calculated on a
deposit to the bank.
Output: The result of the processing step
is collected.
The form will depend on the use of the output.
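To make the three steps concrete, here is a minimal Python sketch of the cycle using the interest-on-a-deposit example above; the deposit amount and interest rate are illustrative assumptions, not values from the slides.

def read_input():
    # Input: prepare the data in a form suitable for processing.
    return {"deposit": 10_000.0, "annual_rate": 0.07}

def process(data):
    # Processing: change the input into a more useful form
    # (here, one year of simple interest on the deposit).
    return data["deposit"] * data["annual_rate"]

def write_output(interest):
    # Output: collect the result in a form suited to its use.
    print(f"Interest earned this year: {interest:.2f}")

write_output(process(read_input()))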
Data types from Computer programming
perspective
A data type is an attribute of data that
tells the compiler how the programmer
intends to use the data.
Common data types include:
Integers (int)- used to store whole
numbers,
Booleans (bool)- used to represent values
restricted to one of two options: true or
false
Characters (char)- used to store a single
character
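As a small illustration, the sketch below expresses these data types in Python; note that Python has no separate char type, so a single character is shown as a one-character string, and the variable names are only illustrative.

age: int = 25               # integer (int): stores a whole number
is_enrolled: bool = True    # boolean (bool): restricted to True or False
grade: str = "A"            # character (char): a single character, a length-1 string in Python

print(type(age), type(is_enrolled), type(grade))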
Data types from Data Analytics
perspective
From a data analytics point of view, it is important to
understand that there are three common data types:
Structured,
Semi-structured, and
Unstructured,
along with Metadata (data about data).
Cont.…
1. Structured Data:-
is data that conforms to a pre-defined
data model.
It is straightforward to analyze.
It fits into a tabular format (rows and
columns).
Examples: Excel files or SQL databases.
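As a small sketch of structured data, the Python snippet below stores rows that follow a fixed schema in SQLite (an SQL database held in memory); the table and column names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, gpa REAL)")
conn.execute("INSERT INTO students VALUES (1, 'Abebe', 3.8)")
conn.execute("INSERT INTO students VALUES (2, 'Sara', 3.5)")

# Because the data follows a pre-defined tabular model, it is straightforward to query.
for row in conn.execute("SELECT name, gpa FROM students WHERE gpa > 3.6"):
    print(row)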
Cont.…
2. Semi-structured data
It is a form of structured data that does
not conform to the formal structure of the
data models associated with relational
databases or other data tables,
but it contains tags or markers to separate
elements and enforce hierarchies of records
and fields within the data.
Therefore, it is known as self-describing data.
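A common example of semi-structured data is a JSON document; the minimal Python sketch below shows how its tags (keys) separate elements and describe a hierarchy without a rigid table schema. The record contents are illustrative assumptions.

import json

record = '''
{
  "student": {
    "name": "Abebe",
    "courses": [
      {"code": "EmTe 1012", "grade": "A"},
      {"code": "Math 101"}
    ]
  }
}
'''

data = json.loads(record)
# The data is self-describing: the keys say what each value means,
# even though the two course entries do not share identical fields.
for course in data["student"]["courses"]:
    print(course.get("code"), course.get("grade", "not graded"))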
Cont.…
3. Unstructured Data
It is data that either
does not have a predefined data model or
is not organized in a pre-defined manner.
It is typically text-heavy but may contain data
such as dates, numbers, and facts as well.
This results in irregularities and ambiguities
that make it difficult to understand using
programs.
Examples: audio or video files.
Cont.…
Metadata
Metadata is data about data.
It provides additional information about a
specific set of data.
Example: In a set of photographs,
metadata could describe when and where
the photos were taken.
The metadata then provides fields for dates and
locations which, by themselves, can be considered
structured data.
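As a small sketch, the Python dictionary below models the photo example: the image pixels are the data, while fields such as the capture date and location are metadata about it. The field names and values are illustrative assumptions.

photo_metadata = {
    "file_name": "IMG_0042.jpg",
    "date_taken": "2023-05-14T09:30:00",
    "location": {"lat": 9.03, "lon": 38.74},   # e.g. GPS coordinates
    "camera": "Phone camera",
}

# Because the metadata fields are consistently named, the metadata itself
# can be treated as structured data.
print(photo_metadata["date_taken"], photo_metadata["location"])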
Data Value Chain
It describes the information flow within a
big data system
as a series of steps needed to generate
value and useful insights from data.
The Big Data Value Chain identifies the
following key high-level activities:
1. Data Acquisition
2. Data Analysis
3. Data Curation
4. Data Storage and
5. Data Usage
1. Data Acquisition:
It is the process of gathering, filtering, and
cleaning data before it is put in any storage
solution on which data analysis can be carried
out.
It is one of the major big data challenges in
terms of infrastructure requirements.
The infrastructure required to support data
acquisition must:
deliver low, predictable latency in both
capturing data and in executing queries;
be able to handle very high transaction
volumes; and
support flexible and dynamic data structures.
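To illustrate the gathering, filtering, and cleaning described above, here is a minimal Python sketch that drops incomplete records and normalizes the rest before they are put into storage; the record format and the validity rule are illustrative assumptions.

raw_records = [
    {"sensor": "s1", "value": "23.5"},
    {"sensor": "s2", "value": ""},        # missing reading: will be filtered out
    {"sensor": "s1", "value": " 24.1"},   # stray whitespace: will be cleaned
]

def acquire(records):
    for rec in records:
        value = rec["value"].strip()
        if not value:                     # filtering: drop incomplete records
            continue
        yield {"sensor": rec["sensor"], "value": float(value)}  # cleaning: normalize the type

print(list(acquire(raw_records)))         # ready to be stored for later analysis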
2. Data Analysis:
It is concerned with making the raw data
acquired amenable to use in
decision-making.
It involves exploring, transforming, and
modeling data with the goal of
highlighting relevant data,
synthesizing and extracting useful hidden
information
Related areas include data mining, business
intelligence, and machine learning.
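As a minimal sketch of the exploring and transforming described above, the snippet below uses pandas to clean a small table and summarize it; the sales records are illustrative assumptions.

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East"],
    "amount": [120.0, 80.0, 200.0, None, 150.0],
})

# Transform: fill the missing value so later steps see clean data.
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())

# Explore / highlight relevant data: total amount per region.
print(sales.groupby("region")["amount"].sum())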
3. Data Curation:
It is the active management of data to
ensure it meets the necessary data
quality requirements for its effective
usage.
It can be categorized into different
activities like
content creation, selection, classification,
transformation, validation, and
preservation.
It is performed by expert curators
They are responsible for ensuring that
data are trustworthy, discoverable,
accessible, reusable and fit their purpose.
Data curators are also known as scientific curators or data annotators.
4. Data Storage:
It is the persistence and management of
data in a scalable way that provides fast
access to the data.
RDBMSs have been the main solution to
the storage paradigm for nearly 40 years.
However, as data volumes and
complexity grow, the ACID (Atomicity,
Consistency, Isolation, and Durability)
properties lack flexibility,
making them unsuitable for big data
scenarios.
NoSQL technologies present solutions
based on alternative data models.
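As a rough sketch of an alternative (document-style) data model of the kind NoSQL stores use, the snippet below keeps records as Python dictionaries rather than rows in a fixed schema; the field names are illustrative assumptions.

documents = [
    {"id": 1, "name": "Abebe", "courses": ["EmTe 1012"]},
    {"id": 2, "name": "Sara", "email": "sara@example.com"},  # different fields: no fixed schema
]

# Queries work over whatever fields each document happens to have.
enrolled = [d["name"] for d in documents if "EmTe 1012" in d.get("courses", [])]
print(enrolled)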
5. Data Usage:
It covers the data-driven business
activities that need access to data, its
analysis, and the tools needed to
integrate the data analysis within the
business activity.
Data usage in business decision-
making can enhance competitiveness
through
the reduction of costs or
increased added value.
Basic concepts of big data
Big data is a large amount of data that
consists of structured and unstructured data.
A large dataset means one that is too large to process or
store with traditional tools or on a single
computer.
Big data is a collection of large and
complex data sets
that becomes difficult to process using on-
hand DB management tools or traditional data
processing applications.
Characteristics of Big Data
Big data is characterized by the 3 Vs and more:
Volume: large amounts of data (up to zettabytes)
Velocity: data is live, streaming, or in motion
Variety: data comes in different forms from
diverse sources
Veracity: how accurate is the data? Can we trust
it?
Clustered Computing
Because of the qualities of big data, individual
computers are often inadequate for
handling the data at most stages.
Computer clusters are a better fit in order
to address the high storage and
computational needs of big data
Big data clustering software combines the
resources of many smaller machines to provide a
number of benefits:
Cont.…
Resource Pooling:
Combining the available storage space, CPU,
and memory is extremely important,
since processing large datasets requires
large amounts of these three resources.
High Availability:
It provides availability guarantees to prevent
hardware or software failures from affecting
access to data and processing.
Easy Scalability:
Clusters make it easy to scale horizontally by
adding additional machines to the group.
Cont.…
Using clusters requires a solution for
managing cluster membership,
coordinating resource sharing, and
scheduling work on individual nodes.
Cluster membership and resource allocation can be
handled by software like Hadoop’s YARN
(which stands for Yet Another Resource
Negotiator).
Hadoop (High Availability Distributed Object Oriented
Platform)
Hadoop is a tool that is used to handle big
data.
Hadoop is an open-source framework
that is designed to make interaction with
big data easier.
It is a framework that allows for the
distributed processing of large
datasets across clusters of computers
using simple programming models.
It is inspired by a technical document published by Google.
Characteristics Of Hadoop
Economical:
Since regular computers can be used for data
processing
Reliable:
Since it stores copies of the data on different
machines and is resistant to hardware failure.
Scalable:
Since it is easily scalable, both horizontally
and vertically.
Flexible:
Since you can store both structured and
unstructured data.
Components of Hadoop’s Ecosystem
Hadoop has an ecosystem that has
evolved from its four core components:
1. Data management,
2. Access,
3. Processing and
4. Storage
It is continuously growing to meet the needs of Big
Data.
The Hadoop ecosystem comprises the following
components:
Oozie: job scheduling
Zookeeper: managing the cluster
Pig, Hive: query-based processing of data services
MapReduce: programming-based data processing
HDFS: Hadoop Distributed File System
Spark: in-memory data processing
Solr, Lucene: searching and indexing
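To give a feel for the MapReduce component, here is a minimal word-count sketch in Python in the map-and-reduce style: the mapper emits (word, 1) pairs and the reducer sums the counts per word. Both phases are simulated locally on a small in-memory sample rather than run on a real cluster.

from itertools import groupby

def mapper(lines):
    # Map phase: emit a (key, value) pair for every word.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs are grouped (sorted) by key; sum each group.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

sample = ["big data needs big clusters", "hadoop processes big data"]
for word, count in reducer(mapper(sample)):
    print(word, count)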
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The data is ingested or transferred to
Hadoop from various sources such as
databases, systems, or local files.
Sqoop transfers data from RDBMS to HDFS, whereas Flume
transfers event data.
2. Processing the data in storage
The data is stored and processed
It is performed by tools such as HDFS and
HBase (to store the data) and Spark and
MapReduce (to perform the data processing).
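As a concrete illustration of the processing step, the minimal PySpark sketch below reads a file that was previously ingested into HDFS and aggregates it with Spark; the HDFS path and the column name are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emte-lifecycle-demo").getOrCreate()

# Read a CSV file previously ingested into HDFS (e.g. by Sqoop or Flume).
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# Process: count the number of records per region.
orders.groupBy("region").count().show()

spark.stop()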
Cont.…
3. Analysing the data
The data is analysed by processing
frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce
operations and then analyzes it.
Hive is also based on MapReduce programming and is
most suitable for structured data.
4. Visualizing the results
The analyzed data can be accessed by users.
It is performed by using tools such as
Cloudera Search and Hue.