Chapter Two
Data Science
2.1. Overview of Data Science
Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and insights from structured, semi-structured, and unstructured
data.
In other words, data science is the area of study that involves
extracting insights from vast amounts of data through various
scientific methods, algorithms, and processes, helping you to
discover hidden patterns in raw data.
Data Science is an interdisciplinary field that allows you to extract
knowledge from structured or unstructured data.
Data science enables you to translate a business problem into a
research project and then translate it back into a practical solution.
Significant advantages of using Data Science
Data is the oil of today's world. With the right tools, technologies,
and algorithms, we can use data and convert it into a distinctive
business advantage.
Data science can help you detect fraud using advanced machine
learning algorithms.
It helps you prevent significant monetary losses.
Allows you to build intelligent abilities in machines
You can perform sentiment analysis to gauge customer
brand loyalty
It enables you to make better and faster decisions
Helps you recommend the right product to the right
customer to enhance your business
Challenges of Data Science
A high variety of information and data is required for accurate analysis
An inadequate data science talent pool is available
Management may not provide financial support for a data science
team
Unavailability of, or difficult access to, data
Data science results are not effectively used by business decision
makers
Explaining data science to others is difficult
Privacy issues
Lack of significant domain experts
If an organization is very small, it cannot have a data science team
What are data and information?
Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner, which should be suitable for
communication, interpretation, or processing, by human or
electronic machines.
It can be described as unprocessed facts and figures.
It is represented with the help of characters such as letters (A-Z,
a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Information is the processed data on which decisions and actions
are based.
Information is data that has been processed into a form that is
meaningful to the recipient and is of real or perceived value in the
recipient's current or prospective actions or decisions.
Furthermore, information is interpreted data, created from
organized, structured, and processed data in a particular context.
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by
people or machines to increase its usefulness and add value for
a particular purpose.
Data processing consists of the following basic steps: Input,
Processing and Output. These three steps constitute the data
processing cycle.
Fig. 1. Data Processing Cycle
Input: in this step, the input data is prepared in some convenient form for
processing.
The form depends on the processing machine.
For example, when electronic computers are used, the input data can be recorded on
any one of several types of storage media, such as a hard disk, CD, flash disk, and
so on.
Processing: in this step, the input data is changed to produce data in a more
useful form.
For example, interest can be calculated on a deposit to a bank, or a summary of
sales for the month can be calculated from the sales orders.
Output: at this stage, the result of the preceding processing step is
collected.
The particular form of the output data depends on the use of the
data.
For example, output data may be payroll for employees.
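The Input, Processing, and Output steps above can be sketched in a few lines of Python, using the bank-interest example from the text. The interest rate and the deposit records are made-up illustration values.

```python
# Processing step: compute interest for each input deposit record.
def process_deposits(deposits, annual_rate=0.05):
    return [(name, amount, round(amount * annual_rate, 2))
            for name, amount in deposits]

# Input step: raw deposit records, here as (name, balance) pairs.
raw_data = [("Alice", 1000.0), ("Bob", 250.0)]

# Output step: the processed results are collected for use,
# e.g. printed as a simple statement.
results = process_deposits(raw_data)
for name, amount, interest in results:
    print(f"{name}: balance {amount}, interest {interest}")
```

In a real system the input might come from a file or database and the output might feed a report, but the three-step cycle is the same.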
Data types and their representation
Data types can be described from diverse perspectives.
In computer science and computer programming, for instance, a
data type is simply an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.
Data types from Computer programming perspective
Almost all programming languages explicitly include the notion of
data type, though different languages may use different
terminology. Common data types include:
Integers (int): used to represent whole numbers, mathematically
known as integers
Booleans (bool): used to represent a value restricted to one of two
values: true or false
Characters (char): used to represent a single character
Floating-point numbers (float): used to represent real numbers
Alphanumeric strings (string): used to represent a combination of
characters and numbers
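The data types listed above can be illustrated in Python. Note that Python has no separate character type; a string of length one stands in for char, and the example values are arbitrary.

```python
whole_number = 42        # integer (int)
flag = True              # boolean (bool): true or false
letter = "A"             # character: a one-character string
price = 3.14             # floating-point number (float)
label = "Item42"         # alphanumeric string (str)

# The interpreter uses each value's type to decide how to handle it:
for value in (whole_number, flag, letter, price, label):
    print(type(value).__name__, value)
```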
Data types from Data Analytics perspective
From a data analytics point of view, it is important to
understand that there are three common data types or
structures:
Structured
Semi-structured and
Unstructured data types.
Structured Data
Structured data is data that adheres to a pre-defined data
model and is therefore straightforward to analyze.
Structured data conforms to a tabular format with a
relationship between the different rows and columns.
Common examples of structured data are Excel files or SQL
databases.
Each of these has structured rows and columns that can be
sorted.
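A small sketch of structured data using Python's built-in sqlite3 module: rows and columns in an SQL table, queried and sorted. The table name and values are invented for illustration.

```python
import sqlite3

# An in-memory SQL database with a pre-defined data model:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("pens", 120), ("books", 45), ("bags", 80)])

# Because every row conforms to the same schema, analysis such as
# sorting by a column is straightforward:
rows = list(conn.execute(
    "SELECT product, units FROM sales ORDER BY units DESC"))
for product, units in rows:
    print(product, units)
conn.close()
```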
Semi-structured Data
Semi-structured data is a form of structured data that does not
conform with the formal structure of data models associated with
relational databases or other forms of data tables, but nonetheless,
contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data.
Therefore, it is also known as a self-describing structure.
JSON and XML are common examples of semi-structured data.
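A JSON document illustrates the self-describing structure mentioned above: the keys and nesting act as the tags that separate semantic elements, with no fixed table schema. The fields shown are invented for illustration.

```python
import json

doc = """
{
  "name": "Alice",
  "contacts": {"email": "alice@example.com"},
  "skills": ["Python", "SQL"]
}
"""

record = json.loads(doc)

# The markers (keys, nesting) make the hierarchy explicit:
print(record["name"])
print(record["contacts"]["email"])
print(record["skills"][0])
```

A different record in the same file could add or omit fields; unlike a relational table, nothing forces every record to share the same columns.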
Unstructured Data
Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined manner.
Unstructured information is typically text-heavy but may contain
data such as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make such data
difficult to understand using traditional programs, compared with
data stored in structured databases.
Common examples of unstructured data include audio and video
files, or free-form content held in NoSQL stores.
Metadata – Data about Data
The last category of data type is metadata.
From a technical point of view, this is not a separate data structure,
but it is one of the most important elements for Big Data analysis
and big data solutions.
Metadata is data about data.
It provides additional information about a specific set of data.
In a set of photographs, for example, metadata could describe
when and where the photos were taken.
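The photo example above can be sketched as data plus data about it: the raw bytes are the data, and a dictionary describes when and where the photo was taken. All values here (filename, timestamp, location) are invented for illustration.

```python
# The data itself: raw image bytes (truncated placeholder).
photo = b"\x89PNG..."

# The metadata: data about that data.
metadata = {
    "filename": "beach.png",
    "taken_at": "2023-07-01T14:32:00",
    "location": "Lake Tana",
    "size_bytes": len(photo),
}

for key, value in metadata.items():
    print(f"{key}: {value}")
```

In big data solutions this is what makes a dataset searchable: you can find "photos taken in July" without opening a single image.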
Data value Chain
The Data Value Chain is introduced to describe the information
flow within a big data system as a series of steps needed to
generate value and useful insights from data. The Big Data Value
Chain identifies the following key high-level activities:
Fig. 2. Data Value Chain
1. Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried
out.
Data acquisition is one of the major big data challenges in terms of infrastructure
requirements.
The infrastructure required to support the acquisition of big data must
deliver low, predictable latency both in capturing data and in executing
queries; be able to handle very high transaction volumes, often in a
distributed environment; and support flexible and dynamic data structures.
2. Data Analysis
It is concerned with making the raw data acquired amenable to use
in decision-making as well as domain-specific usage.
Data analysis involves exploring, transforming, and modeling data
with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential from a
business point of view.
Related areas include data mining, business intelligence, and
machine learning.
3. Data Curation
It is the active management of data over its life cycle to ensure it meets
the necessary data quality requirements for its effective usage.
Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
Data curation is performed by expert curators who are responsible for
improving the accessibility and quality of data.
Data curators (also known as scientific curators or data annotators)
hold the responsibility of ensuring that data are trustworthy,
discoverable, accessible, reusable, and fit for their purpose.
A key trend for the curation of big data utilizes community and
crowdsourcing approaches.
4. Data Storage
It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
Relational Database Management Systems (RDBMS) have been the
main, and almost only, solution to the storage paradigm for nearly
40 years.
However, the ACID (Atomicity, Consistency, Isolation, and Durability)
properties that guarantee database transactions lack flexibility with
regard to schema changes, and their performance and fault tolerance
suffer when data volumes and complexity grow, making them unsuitable
for big data scenarios.
NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
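A toy contrast between the two models: in a relational table, adding a field means a schema change, while in a document (NoSQL-style) collection, records in the same collection may carry different fields. A Python list of dictionaries stands in for the document store; the sensor data is invented.

```python
# A list of dicts as a stand-in for a NoSQL document collection.
document_store = []

document_store.append({"id": 1, "name": "sensor-a", "temp": 21.5})

# A later record adds a "humidity" field; no ALTER TABLE needed,
# because documents are not bound to one fixed schema:
document_store.append({"id": 2, "name": "sensor-b", "temp": 19.0,
                       "humidity": 0.44})

# Queries must tolerate missing fields:
for doc in document_store:
    print(doc["name"], doc.get("humidity", "n/a"))
```

This flexibility is what real NoSQL systems trade some of the ACID guarantees for, in exchange for scalability.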
5. Data Usage
It covers the data-driven business activities that need
access to data, its analysis, and the tools needed to
integrate the data analysis within the business activity.
Data usage in business decision making can enhance
competitiveness through the reduction of costs, increased
added value, or any other parameter that can be
measured against existing performance criteria.
Basic concepts of big data
What Is Big Data?
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
In this context, a “large dataset” means a dataset too large
to reasonably process or store with traditional tooling or on a
single computer.
This means that the common scale of big datasets is
constantly shifting and may vary significantly from
organization to organization.
Big data is characterized by the 4 Vs and more:
Volume: large amounts of data (zettabytes / massive datasets)
Velocity: Data is live streaming or in motion
Variety: data comes in many different forms from diverse sources
Veracity: can we trust the data? How accurate is it? etc.
Fig. 3. Characteristics of Big Data
Sources of Big Data
Mobile devices (tracking all objects all the time)
Areas of Applications of Big Data
Health and well-being
Policy making and public opinion
Smart cities and a more efficient society
New online educational models: MOOCs and
student-teacher modeling
Robotics and human-robot interaction
Other application areas include smarter healthcare, multi-channel
sales, telecom, homeland security, trading analytics, traffic control,
search quality, and manufacturing.
Big Data vs. Data Science

Factors         Big Data                           Data Science
Concept         Handling large data                Analyzing data
Responsibility  Processing huge volumes of data    Understanding patterns within
                and generating insights            data and making decisions
Industry        E-commerce, security services,     Sales, image recognition,
                telecommunications                 advertisement, risk analytics
Tools           Hadoop                             Python, R
THANK YOU