DATA SCIENCE
Lesson 2
Instructor: Ellen M. Guiñares
TOPICS COVERED
01 Overview of Data Science
Definition of Data and Information
Data types and representation
02 Data Value Chain
Data Acquisition
Data Analysis
Data Curation
Data Storage
Data Usage
03 Basic concepts of Big Data
What is Data Science?
Data Science
It is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to
extract knowledge and insights from data to drive
decision-making and solve complex problems.
KEY STEPS OF DATA SCIENCE
● Data Collection – gather relevant data.
● Data Preparation – put the data into a format suitable for analysis.
● Data Analysis – identify patterns, relationships, and insights.
● Data Visualization – communicate the findings.
● Implement the findings.
WHAT IS EXPECTED OF A DATA SCIENTIST?
• Data scientists must master the full spectrum of
the data science life cycle and possess a level of
flexibility and understanding to maximize returns
at each phase of the process.
• Data scientists need to be curious and result-
oriented.
• Data scientists need a strong quantitative
background in statistics and linear algebra, as
well as programming knowledge.
DATA SCIENCE LIFE CYCLE
Data and Information
● Data – a representation of facts, concepts, or instructions.
● Information – organized or classified data, which has some meaningful value for the receiver.
Data
● Data can be defined as a representation of facts, concepts,
or instructions in a formalized manner, which should be
suitable for communication, interpretation, or processing
by humans or electronic machines.
● Data is represented with the help of characters such as
alphabets (A-Z, a-z), digits (0-9), or special characters
(+, -, /, *, <, >, =, etc.).
Information
● Information is organized or classified data, which has some
meaningful values for the receiver. Information is the
processed data on which decisions and actions are based.
● Information is data that has been processed into a form
that is meaningful to the recipient and is of real or perceived
value in the current or prospective actions or decisions of
the recipient.
Information
● For the resulting decisions to be meaningful, the processed
data must have the following characteristics:
✓ Timely − Information should be available when required.
✓ Accuracy − Information should be accurate.
✓ Completeness − Information should be complete.
Data Processing Cycle
● Data processing is the re-structuring or re-ordering of data
by people or machine to increase their usefulness and add
values for a particular purpose.
● Data processing consists of the following basic steps - input,
processing, and output. These three steps constitute the
data processing cycle.
Data Processing Cycle
● Input Step – the input data is prepared in some convenient form for processing. The form depends on the processing machine.
● Processing Step – the input data is changed to produce data in a more useful form.
● Output Step – the result of the preceding processing step is collected.
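A minimal Python sketch of the three steps, using made-up exam scores purely for illustration:

# Input step: data gathered in a convenient form (here, strings).
raw_scores = ["75", "82", "68", "90"]

# Processing step: change the data into a more useful form.
numeric_scores = [int(s) for s in raw_scores]
average = sum(numeric_scores) / len(numeric_scores)

# Output step: collect the result of the preceding processing step.
print(f"Average score: {average:.1f}")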
Data Types and its representation
● Data type or simply type is an attribute of data which tells
the compiler or interpreter how the programmer intends to
use the data.
● Almost all programming languages explicitly include the
notion of data type. Common data types include:
✓ Integers
✓ Booleans
✓ Characters
✓ Floating-point numbers
✓ Alphanumeric strings
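The common data types above can be illustrated with Python literals; the variable names are made up for this sketch:

count = 42                  # integer
is_valid = True             # Boolean
grade = "A"                 # character (a one-character string in Python)
temperature = 36.6          # floating-point number
student_id = "2024-00123"   # alphanumeric string

for value in (count, is_valid, grade, temperature, student_id):
    print(value, type(value).__name__)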
01
Data types / structure
Based on analysis of data
Data Types / structure
● Based on analysis of data:
✓ Structured
✓ Unstructured
✓ Semi-structured
✓ Metadata
Data Types / structure
What is structured data?
● Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
● It conforms to a tabular format with relationships between the different rows and columns. Common examples are Excel files or SQL databases.
● Structured data is considered the most ‘traditional’ form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process, and access structured data.
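A minimal sketch of structured (tabular) data, assuming pandas is installed; the column names and rows are invented examples:

import pandas as pd

# Every row follows the same pre-defined model, so analysis is straightforward.
df = pd.DataFrame(
    {
        "student_id": [1, 2, 3],
        "name": ["Ana", "Ben", "Cara"],
        "score": [88, 92, 79],
    }
)

print(df[df["score"] > 80])  # filter rows, much like a SQL query would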
Data Types / structure
What is unstructured data?
● Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner.
● It is without proper formatting and alignment.
● Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well (see the extraction sketch after this list).
● The ability to extract value from unstructured data is one of the main drivers behind the quick growth of Big Data.
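A minimal sketch of pulling dates and numbers out of text-heavy, unstructured data with regular expressions; the review text is a made-up example:

import re

review = "Ordered on 2024-03-15, paid 1499.00, and the package arrived 3 days late."

dates = re.findall(r"\d{4}-\d{2}-\d{2}", review)   # ISO-style dates
numbers = re.findall(r"\d+(?:\.\d+)?", review)     # any numeric tokens

print(dates)    # ['2024-03-15']
print(numbers)  # includes the date parts as well: ['2024', '03', '15', '1499.00', '3']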
Data Types / structure
What is semi-structured data?
● Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables.
● For example, JSON and XML are forms of semi-structured data.
● The reason this third category exists (between structured and unstructured data) is that semi-structured data is considerably easier to analyze than unstructured data.
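A minimal sketch of semi-structured data: the JSON document below is invented for illustration, with nested fields and an optional key, so it does not fit one rigid table, yet Python's standard json module can still parse it:

import json

payload = """
{
  "order_id": 1001,
  "customer": {"name": "Ana", "email": "ana@example.com"},
  "items": [
    {"sku": "A-1", "qty": 2},
    {"sku": "B-7", "qty": 1, "gift_wrap": true}
  ]
}
"""

order = json.loads(payload)                          # parse the semi-structured document
print(order["customer"]["name"])                     # navigate the nested structure
print(sum(item["qty"] for item in order["items"]))   # aggregate across the list of items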
Data Types / structure
What is metadata?
● A last category of data type is metadata. From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and Big Data solutions.
● Metadata is data about data.
● It provides additional information about a specific set of data.
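One way to picture metadata is a small record that describes a dataset without containing it; every field below is a made-up example:

dataset_metadata = {
    "title": "Student Grades 2024",
    "source": "registrar_export.csv",   # hypothetical file name
    "created": "2024-06-01",
    "row_count": 1250,
    "columns": ["student_id", "name", "score"],
}

# The metadata tells us how to interpret the dataset without opening it.
for key, value in dataset_metadata.items():
    print(f"{key}: {value}")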
02
Data Value Chain
Information flow within a big data system
Data Value Chain
● It refers to the entire process of transforming raw data into
valuable insights, information, and knowledge that can be
used for decision-making, innovation, and business growth.
● It involves a series of interconnected activities that add
value to data at each stage of the process, from data
collection and processing to analysis and dissemination.
Data Value Chain
● The Data Value Chain is introduced to describe the
information flow within a big data system as a series of
steps needed to generate value and useful insights from
data.
● The Big Data Value Chain identifies the following key high-
level activities:
Data Value Chain Stages
● The Data Value Chain typically includes the following stages:
✓ Data Acquisition
✓ Data Analysis
✓ Data Curation
✓ Data Storage
✓ Data Usage
Data Value Chain Stages
1. Data Acquisition/Collection
✓ It is the process of gathering, filtering, and cleaning data
before it is put in a data warehouse or any other storage
solution on which data analysis can be carried out.
✓ Data acquisition is one of the major big data challenges in
terms of infrastructure requirements.
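A minimal acquisition sketch, assuming pandas is installed and a hypothetical raw file named raw_sales.csv: gather the data, then filter and clean it before handing it to storage or analysis:

import pandas as pd

raw = pd.read_csv("raw_sales.csv")      # gather
raw = raw.dropna(subset=["amount"])     # clean: drop rows with no amount
raw = raw[raw["amount"] > 0]            # filter: keep only valid transactions

raw.to_csv("clean_sales.csv", index=False)  # hand off to the storage layer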
Data Value Chain Stages
2. Data Analysis
✓ It is concerned with making the raw data acquired amenable
to use in decision-making as well as domain-specific usage.
✓ Data analysis involves exploring, transforming, and modelling
data with the goal of highlighting relevant data, synthesizing
and extracting useful hidden information with high potential
from a business point of view.
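A minimal analysis sketch, continuing the hypothetical clean_sales.csv file and column names from the acquisition example:

import pandas as pd

sales = pd.read_csv("clean_sales.csv")

# Explore and transform: totals per region surface patterns that are
# relevant from a business point of view.
summary = sales.groupby("region")["amount"].agg(["count", "sum", "mean"])
print(summary.sort_values("sum", ascending=False))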
Data Value Chain Stages
3. Data Curation
✓ It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for
its effective usage.
✓ Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation.
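A minimal curation sketch: a few validation rules applied to the same hypothetical sales data so it keeps meeting quality requirements over its life cycle; the rules and column names are illustrative assumptions:

import pandas as pd

sales = pd.read_csv("clean_sales.csv")

problems = []
if sales["order_id"].duplicated().any():
    problems.append("duplicate order ids")
if (sales["amount"] <= 0).any():
    problems.append("non-positive amounts")
if sales["region"].isna().any():
    problems.append("missing regions")

print("validation passed" if not problems else f"issues found: {problems}")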
Data Value Chain Stages
4. Data Storage
✓ It is the persistence and management of data in a scalable
way that satisfies the needs of applications that require fast
access to the data.
✓ Relational Database Management Systems (RDBMS) have
been the main, and almost unique, solution to the storage
paradigm for nearly 40 years.
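A minimal storage sketch using SQLite, a small relational database (RDBMS) that ships with Python; the table and rows are invented examples:

import sqlite3

conn = sqlite3.connect("sales.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT OR REPLACE INTO sales VALUES (?, ?, ?)",
    [(1, "North", 120.0), (2, "South", 75.5)],
)
conn.commit()

# Fast, structured access back out of storage.
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
conn.close()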
Data Value Chain Stages
5. Data Usage
✓ It covers the data-driven business activities that need access
to data, its analysis, and the tools needed to integrate the
data analysis within the business activity.
✓ Data usage in business decision-making can enhance
competitiveness through reduction of costs, increased added
value, or any other parameter that can be measured against
existing performance criteria.
03
Basic Concepts of Big Data
Information flow within a big data system
Basic Concepts of Big Data
• Big data is a blanket term for the non-traditional strategies
and technologies needed to gather, organize, process, and
gain insights from large datasets.
Basic Concepts of Big Data
• An exact definition of “big data” is difficult to nail down
because projects, vendors, practitioners, and business
professionals use it quite differently. With that in mind,
generally speaking, big data is:
✓ large datasets
✓ the category of computing strategies and technologies
that are used to handle large datasets
Basic Concepts of Big Data
• It refers to the vast and diverse sets of data that are generated
at a high velocity, volume, and variety from various sources.
The data may be structured, semi-structured, or unstructured
and cannot be easily processed or analyzed using traditional
data processing techniques.
Key Components of Big Data
1. Volume - Big Data refers to data that is too large to be
processed using traditional data processing tools and
techniques. The volume of data can range from terabytes to
petabytes and beyond.
2. Velocity: Big Data is generated at an unprecedented speed
and needs to be processed in real time or near real time to
derive meaningful insights. This velocity can be measured in
microseconds to seconds or minutes.
3. Variety: Big Data comes from many different sources and in
many formats; it may be structured, semi-structured, or
unstructured.
Big Data Characteristics
Other Characteristics of Big Data – 6V’s
1. Veracity: The variety of sources and the complexity of the
processing can lead to challenges in evaluating the quality of
the data (and consequently, the quality of the resulting
analysis).
2. Variability: Variation in the data leads to wide variation in
quality. Additional resources may be needed to identify, process,
or filter low quality data to make it more useful.
Other Characteristics of Big Data – 6V’s
3. Value: The ultimate challenge of big data is delivering value.
Sometimes, the systems and processes in place are complex
enough that using the data and extracting actual value can
become difficult.
Where does big data come from?
Sources of Big Data
1. Social Media: Social media platforms such as Facebook,
Twitter, LinkedIn, and Instagram generate vast amounts of
data in the form of user interactions, posts, comments, likes,
and shares.
2. Internet of Things (IoT) Devices: IoT devices such as sensors,
smart appliances, and wearable technology generate huge
volumes of data in real-time.
Sources of Big Data
3. E-commerce Transactions: E-commerce platforms generate a
significant amount of data related to customer behavior,
purchase history, preferences, and trends.
4. Machine-generated Data: Machines and applications
generate a massive amount of data, including log files,
clickstream data, system-generated data, and more.
Sources of Big Data
5. Mobile Devices: Mobile devices generate a significant amount
of data, including location data, usage data, and user behavior
data.
6. Customer Feedback: Customer feedback in the form of
surveys, reviews, and support tickets generates large volumes of
data that can be analyzed to improve customer experience.
Sources of Big Data
7. Business Applications: Business applications such as CRM,
ERP, and HRM generate a vast amount of data that can be
analyzed to improve business operations.
8. Public Data Sources: Public data sources such as government
data, weather data, and census data can be combined with other
data sources to create more significant insights.