CSE2500 - Data Analytics
Module 1 - Introduction to Data Analysis
Introducing Data, overview of data analysis: Data in the Real World,
Data vs. Information, Many “Vs” of Data, Structured Data and
Unstructured Data, Types of Data, Data Analysis Defined, Types of
Variables, Central Tendency of Data, Scales of Data, Sources of Data,
Data preparation: Cleaning the data, Removing variables, Data
Transformations.
R Studio: Base R-R Studio IDE-Introduction to R Projects and R
Markdown. Basic R: R as a calculator-Scripts and Comments-R
Variables. Data I/O: Working Directories-Importing Data-Exporting
Data-More ways to save-Data I/O in Base R.
Introducing Data
• Facts and statistics collected together for reference or analysis
• Data has to be transformed into a form that is efficient for
movement or processing.
Data Analysis Definition
• Data analysis is defined as a process of cleaning,
transforming, and modeling data to discover useful
information for business decision-making.
• Purpose: Extract useful information for business decision-
making.
• Example: Day-to-day decisions are made based on past
experiences or future expectations; we gather memories (past
data) or dreams (future data) to inform those decisions.
• Business Application: Analysts apply data analysis
techniques for business purposes, and data analysis informs
business decision-making.
Overview of Data Analysis
Data Collection and Importance
• Various disciplines collect and store data digitally
• Retail, insurance, and meteorological organizations use data for
informed decisions
• Timely decisions maximize sales, improve R&D, and reduce costs
Data Analysis Challenges
• Fast-growing data production due to internet and operational
systems
• Increasing volume, complexity, and reliability concerns
Data Analysis Process
• Define project and problem
• Prepare data for analysis
• Select and optimize data analysis approaches
• Deploy and measure results for expected benefits.
Key Objectives:
• Focus on converting raw data to meaningful information
• Outline the major steps in a data analysis project, from
defining the problem to deploying the results.
Data in the Real World
• Surveys or polls, interviews, and experiments are valuable
approaches for gathering data to answer specific questions.
Data is collected from various sources, including:
• Surveys and polls to understand opinions, preferences, and
behavior
Example: An opinion poll conducted before an election.
• Interviews to elicit information on people's opinions,
preferences, and behavior.
Example: Conducted over the phone.
• Experiments to measure and collect data in a highly controlled
manner
Example: Double-blind drug study (one group gets the drug,
the other a placebo).
• Operational databases containing ongoing business transactions
-Sensors monitoring operational processes.
-Stored in databases such as CRM, ERP, supply chain systems.
• Data warehouses used for making decisions
• Databases of historical polls, surveys, and experiments
• External sources such as the web or literature
• Data is used to answer specific questions, understand opinions
and needs, and make informed decisions
Data vs. Information
• Data refers to the raw, unprocessed facts and figures collected from
various sources.
• Information refers to the processed and analyzed data that provides
meaning and insight.
• Data becomes information when it is:
- Collected and stored
- Processed and analyzed
- Interpreted and understood
- Used to make informed decisions
• In other words, data is the raw material
• Information is the result of processing and analyzing that data to
extract meaning and value.
Many “Vs” of Data
A. Volume
• Volume refers to the magnitude or scale of data.
Data Sources and Repositories:
• Websites
Data Generation and Sharing:
• User clickstreams are recorded and stored
• Users of social media applications (Facebook, Twitter, etc.) become
prosumers (producers and consumers) of data
• Increased data sharing and larger data elements
• High-definition videos increase the volume of shared data
Autonomous Data Streams:
• Video, audio, text
• Data from social media sites, websites, RFID applications
B. Velocity
• Velocity refers to the speed at which the gigantic amount of data is being
generated, collected, and analyzed.
Enhanced Data Movement:
• Faster data movement through the Internet
• Quick transfer of e-mails, social media posts, video files
Cloud-Based Storage:
• Instantaneous sharing
• Easy accessibility from anywhere
Social Media and Data Sharing:
• Instant data sharing among people
• Mobile access for faster data generation and access
C. Variety
Types of Data:
• Structured data (numeric, text fields)
• Unstructured data (images, video, audio, etc.)
Sources of Data:
• Structured data: ERPs, operational systems
• Unstructured data: social media, web, RFID, machine data, etc.
Characteristics of Unstructured Data:
• Varying sizes and resolutions
• Subject to different types of analysis
Examples: Video: tagging, playback (not computable)
Audio: playback (not computable)
Graph data: network distance analysis
Text (Facebook posts, tweets): sentiment analysis (not directly comparable)
D. Value
• Value refers to converting the collected data into something of worth.
• Value in Big Data:
- An important characteristic of Big Data
- Involves collecting and analyzing data to boost organizational
performance and enhance customer understanding
• Having access to useful data is not enough; it must be analyzed to
extract real value and obtain its benefits.
E. Variability
• Variability refers to unpredictable changes in the data.
• It may arise from the multiple data types involved and the speed
at which data is generated and loaded into the database.
F. Veracity
• Veracity refers to the trustworthiness and accuracy of data.
• Only if the data is accurate can the analysis be meaningful.
• For example, consider a dataset of thirty students, for which we must
analyze why they earned a distinction.
• As an analyst, you can ask questions like:
• What methodology did you adopt to get good marks in all subjects?
• How much time do you devote to each subject?
• Do you learn some subjects through daily-life activities such as sports?
• Have you ever held a scholarship?
• From answers like these, it becomes easier to determine the accuracy of
the information, which can easily be maintained in statistical form.
G. Validity
• The big data terms veracity and validity seem alike but
are quite different.
• Validity refers to the correctness of the analysis, performed
in order to obtain optimized results.
H. Vulnerability
• Vulnerability is one of the major challenges in big data: data
generated from multiple sources at such an erratic speed has a high
chance of being compromised by an intruder.
• For example, in a recent case a Belgian court threatened Facebook
with a heavy fine for breaching privacy.
I. Volatility
• Volatility refers to how long the collected data remains
useful to us and how it should be kept.
• To analyze this, it is necessary to develop new rules and
techniques that make rapid access to the information possible.
J. Visualization
• Data visualization is one of the most complex challenges in big data.
• In this information age, data is not only growing beyond limits but is
also composed of many different data types.
• So there is a need to communicate the information by visualizing it in
specialized ways, with functionalities such as web-based approaches and
statistical analysis.
• Traditional data visualization tools face severe challenges
such as slow response times, complex scalability methods, and
imprecise reporting times.
• So it is a challenge to decide which way of communicating
the data is most suitable to make visualization effective.
Typical human-generated unstructured
data includes
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer to
it as semi-structured. However, its message field is unstructured and traditional analytics
tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Websites: YouTube, Instagram, photo-sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications
Typical machine-generated unstructured
data includes:
• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration,
seismic imagery, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
Types of Digital Data (diagram on the original slides)
Data Analysis - Types
Statistical Analysis
• Statistical Analysis shows "What happened?" by
using past data in the form of dashboards.
• Statistical Analysis includes the collection, analysis,
interpretation, presentation, and modeling of
data.
• Analyses a sample of data.
• There are two categories of this type of Analysis:
Descriptive Analysis and
Inferential Analysis.
Descriptive Analysis
• Analyses complete data or a sample of summarized
numerical data.
• It shows the mean and deviation for continuous data,
and the percentage and frequency for categorical
data.
Inferential Analysis
• Analyses a sample of the complete data. In this type of
Analysis, different conclusions may be drawn from the
same data by selecting different samples.
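To make the distinction concrete, here is a minimal base R sketch (the weights and industries are made-up values, not from the slides):

weight <- c(153.2, 98.2, 120.5, 175.0, 142.8)        # continuous (lb)
industry <- factor(c("retail", "telecom", "retail",
                     "financial", "retail"))          # categorical

# Descriptive analysis: mean and deviation for continuous data,
# percentage and frequency for categorical data
mean(weight); sd(weight)
table(industry)
prop.table(table(industry)) * 100

# Inferential analysis: conclusions drawn from a sample of the data;
# a different sample could lead to a different conclusion
set.seed(1)
sample_weights <- sample(weight, size = 3)
t.test(sample_weights)$conf.int    # interval estimate for the population mean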
Diagnostic Analysis
• Diagnostic Analysis shows "Why did it happen?" by
finding the cause from the insights produced by Statistical
Analysis.
• It is useful for identifying behavior patterns in the data.
• If a new problem arises in your business process,
you can look into this Analysis to find similar
patterns of that problem.
• Similar prescriptions may then be applied to the
new problem.
Predictive Analysis
• Predictive Analysis shows "what is likely to happen" by using
previous data.
• A simple example: if last year I bought two dresses based on
my savings, and this year my salary doubles, then I might
predict I can buy four dresses. Of course it is not that simple,
because other circumstances must be considered: clothing prices
may rise this year, or instead of dresses you may want to buy a
new bike, or need to buy a house!
• Predictive Analysis makes predictions about future outcomes
based on current or past data.
• Forecasting is just an estimate; its accuracy depends on how much
detailed information you have and how deeply you dig into it.
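As a sketch of the dresses example above, a simple linear model in R can be used to forecast (the salary and dress figures are hypothetical):

# Past data: dresses bought each year as a function of salary
salary  <- c(20000, 25000, 30000, 35000, 40000)
dresses <- c(2, 2, 3, 3, 4)

model <- lm(dresses ~ salary)    # fit a simple predictive model

# Predict purchases if the salary doubles to 80,000; this is only an
# estimate and ignores other circumstances (price rises, a bike, a house)
predict(model, newdata = data.frame(salary = 80000))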
Prescriptive Analysis
• Prescriptive Analysis combines the insights from all the
previous Analyses to determine which action to take
on a current problem or decision.
• Data-driven companies utilize Prescriptive
Analysis because predictive and descriptive
Analysis alone are not enough to improve
performance.
• Based on current situations and problems, they
analyze the data and make decisions.
Types of Variables
Variables are categorized based on the type of values they take.
Discrete Variables:
• Contain a fixed number of distinct values
• Finite number of possible values
• Example: An industrial sector variable takes values such as the
telecommunications industry and the retail industry, a finite
number of possible values.
Continuous Variables:
• Can take any numeric value within a range
• Infinite number of possible values
• Example: A patient's weight (e.g., 153.2 lb, 98.2 lb)
Ratio Scale:
• Intervals and ratios of values can be compared
• Natural zero point
• Example: Bank account balance ($5, $10, $15)
Special Types of Variables:
Dichotomous Variable:
• Only two possible values
• Example: Gender (male, female)
Binary Variable:
• A dichotomous variable with values 0 or 1
• Example: Purchase (0 = no, 1 = yes), Fuel Efficiency (0 = low, 1 = high)
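These variable types map directly onto R data types; a small sketch (all values hypothetical):

sector <- factor(c("telecom", "retail", "telecom"))   # discrete: finite set of values
weight <- c(153.2, 98.2, 120.5)                       # continuous: any numeric value in a range
gender <- factor(c("male", "female", "male"))         # dichotomous: exactly two values
purchase <- c(0, 1, 1)                                # binary: dichotomous coded as 0/1
levels(sector)                                        # the distinct values of a discrete variable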
Scales of Data
Variables are classified according to the scale on which they are measured.
Nominal Scale:
• Variable with a limited number of values
• Values cannot be ordered
• Example: Industry (financial, engineering, retail)
Ordinal Scale:
• Variable whose values can be ordered or ranked
• Values are assigned to fixed categories
• Example: Low, Medium, High
Interval Scale:
• Intervals between values can be compared
• Values share the same unit of measurement
• Example: Fahrenheit scale (5°F, 10°F, 15°F)
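In R, nominal and ordinal scales correspond to unordered and ordered factors; a short sketch (values hypothetical):

industry <- factor(c("financial", "retail", "engineering"))   # nominal: unordered categories
risk <- factor(c("Low", "High", "Medium"),
               levels = c("Low", "Medium", "High"),
               ordered = TRUE)                                 # ordinal: ranked categories
risk[1] < risk[2]         # TRUE: ordered comparisons are meaningful
temp_f  <- c(5, 10, 15)   # interval: differences comparable, no natural zero
balance <- c(5, 10, 15)   # ratio: natural zero, so ratios such as 10/5 are meaningful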
Central Tendency of Data
• Definition: A value that characterizes the center of a set of
values
• Purpose: Quantify the middle or central location of a
variable, such as an average
• Many of the observed values lie around the central value
• Approaches to calculating central tendency: mode,
median, and mean
Mode:
• The mode is the most commonly occurring value for a
particular variable.
• It is illustrated using the following variable whose
values are: 3, 4, 5, 6, 7, 7, 7, 8, 8, 9
• The mode is 7, since there are three occurrences of 7.
• In the following values, both 7 and 8 are reported three
times: 3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9
• The mode may be reported as {7, 8} or 7.5.
Median
• The median is the middle value of a variable once its
values have been sorted from low to high.
• For variables with an even number of values, the mean of
the two values closest to the middle is taken.
• The following set of values will be used to illustrate: 3, 4,
7, 2, 3, 7, 4, 2, 4, 7, 4.
• Before identifying the median, the values must be sorted: 2,
2, 3, 3, 4, 4, 4, 4, 7, 7, 7
• Since there are 11 values, the median is the middle (sixth)
value: 4.
Mean:
• Referred to as the average.
• The commonly used measure of central tendency for variables
measured on the interval or ratio scales.
• Sum of all the values divided by the number of values.
• For example, for the following set of values: 3, 4, 5,
7, 7, 8, 9, 9, 9
• mean = (3 + 4 + 5 + 7 + 7 + 8 + 9 + 9 + 9) / 9 = 61 / 9 ≈ 6.78
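The three measures can be computed in R with the values from the slides; R has no built-in statistical mode, so a small helper is sketched:

x <- c(3, 4, 5, 7, 7, 8, 9, 9, 9)
mean(x)      # 61 / 9 = 6.78
median(x)    # middle value after sorting: 7

stat_mode <- function(v) {                 # most frequently occurring value(s)
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(c(3, 4, 5, 6, 7, 7, 7, 8, 8, 9))      # 7
stat_mode(c(3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9))   # 7 8 (bimodal)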
Sources of Data
• External data: may be incomplete, of varying quality and accuracy
• Internal data: generally higher quality, from within the organization
Main Sources of Data:
• Social Media: Web and social media activity generates data about
people: e-mail, Google searches, Facebook posts, tweets, YouTube
videos, blogs.
• Organizations: Major sources are business and government data:
ERP systems, e-commerce systems, user-generated content, web
access logs.
• Machines: The Internet of Things (IoT) is evolving; autonomous data
comes from connected machines such as RFID tags, telematics, phones,
refrigerators.
• Metadata: Enormous amounts of data about data itself
- Web crawlers and web-bots scan the web for new
webpages, HTML structure, and metadata
- Used by applications such as web search engines
Data Quality:
• Varies depending on purpose and collection methods
• Internal data is generally of higher quality
• Publicly available data includes trustworthy sources,
e.g., government data.
Data Preparation
• Preparing data is a time-consuming step in data analysis.
• Data preparation involves merging, characterizing, cleaning, and
transforming data
Required Steps
• Merge data into a table from multiple sources
• Characterize data
• Clean data by:
- Resolving ambiguities and errors
- Removing redundant and problematic data
- Eliminating irrelevant columns
• Calculate new columns of data (if necessary)
• Divide data into subsets (if appropriate)
Important Considerations
• Record details of data preparation steps and rationale
• Provide documentation for future reference and validation of
results
• Ensure consistency in data preparation methodology
Data Preparation Tasks
• Identify and clean up errors
• Remove certain variables or observations
• Generate consistent scales across observations
• Generate new frequency distributions
• Convert text to numbers and vice versa
• Combine variables
• Generate groups
• Prepare unstructured data
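A few of the tasks above, sketched in base R (the file name and column names are hypothetical):

raw <- read.csv("customers.csv", stringsAsFactors = FALSE)   # hypothetical input file

raw$age <- as.numeric(raw$age)                        # convert text to numbers
raw$grade <- ifelse(raw$score >= 60, "pass", "fail")  # convert numbers to text
raw$bmi <- raw$weight_kg / (raw$height_m ^ 2)         # combine variables
raw$age_group <- cut(raw$age,                         # generate groups
                     breaks = c(0, 18, 40, 65, Inf),
                     labels = c("minor", "young adult", "middle age", "senior"))
table(raw$age_group)                                  # generate a frequency distribution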
Cleaning the Data
• Since the data available for analysis may not have been
originally collected with this project’s goal in mind, it is
important to spend time cleaning the data.
• It is also beneficial to understand the accuracy with
which the data was collected, as well as to correct any
errors.
• For variables measured on a nominal or ordinal scale
(where there are a fixed number of possible values), it is
useful to inspect all possible values to uncover mistakes
and/or inconsistencies.
• Any assumptions made concerning possible values that
the variable can take should be tested.
• For example, a variable Company may include a
number of different spellings for the same company
such as:
• General Electric Company
• General Elec. Co
• GE
• Gen. Electric Company
• General electric company
• G.E. Company
• These different terms, where they refer to the
same company, should be consolidated into one
for analysis.
• In addition, subject matter expertise may be
needed in cleaning these variables.
• For example, a company name may include one
of the divisions of the General Electric
Company and for the purpose of this specific
project it should be included as the ‘‘General
Electric Company.’’
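One way to consolidate such spellings in R is a lookup table that maps every observed variant to a single canonical name (a sketch using the variants listed above):

company <- c("General Elec. Co", "GE", "G.E. Company",
             "General electric company", "General Electric Company")

canonical <- c("General Electric Company" = "General Electric Company",
               "General Elec. Co"         = "General Electric Company",
               "GE"                       = "General Electric Company",
               "Gen. Electric Company"    = "General Electric Company",
               "General electric company" = "General Electric Company",
               "G.E. Company"             = "General Electric Company")

company_clean <- unname(canonical[company])
table(company_clean)      # all variants now counted as one company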
Removing Variables
• On the basis of an initial categorization of the variables,
it may be possible to remove variables from
consideration at this point.
• For example, constants and variables with too many
missing data points should be considered for removal.
• Further analysis of the correlations between multiple
variables may identify variables that provide no
additional information to the analysis and hence could
be removed.
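Both removal criteria can be sketched in base R (the data frame and the missing-data threshold are illustrative assumptions):

df <- data.frame(id     = 1:6,
                 region = rep("north", 6),                 # constant column
                 income = c(NA, NA, NA, NA, 52000, 61000), # mostly missing
                 age    = c(23, 35, 41, 29, 52, 61))

is_constant <- sapply(df, function(col) length(unique(col)) <= 1)
too_missing <- sapply(df, function(col) mean(is.na(col)) > 0.5)

df_clean <- df[, !(is_constant | too_missing)]
names(df_clean)    # "id" "age": constant and sparse variables removed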
Data Transformation
Normalization
• Normalization is a process where numeric columns are transformed
using a mathematical function to a new range. It is important for two
reasons.
• First, analysis of the data should treat all variables equally so that one
column does not have more influence over another because the ranges
are different.
• For example, when analyzing customer credit card data, the credit
limit value should not be given more weight in the analysis than the
customer's age.
• Second, certain data analysis and data mining methods require the data
to be normalized prior to analysis, such as neural networks or k-nearest
neighbors
Min-Max Normalization
• A linear transformation is performed on the original data.
• The minimum and maximum values are fetched from the data, and each
value is replaced according to the following formula:

v' = ((v - min(A)) / (max(A) - min(A))) * (new_max(A) - new_min(A)) + new_min(A)

where A is the attribute data,
min(A), max(A) are the minimum and maximum values of A, respectively,
v' is the new value of each entry in the data,
v is the old value of each entry in the data,
new_max(A), new_min(A) are the maximum and minimum values of the required
range (i.e., its boundary values), respectively.
Problem and Solution: a worked min-max normalization example (shown on the original slides).
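Since the original worked problem is not reproduced here, a minimal R sketch of min-max normalization following the formula above (the input values are hypothetical):

min_max <- function(v, new_min = 0, new_max = 1) {
  (v - min(v)) / (max(v) - min(v)) * (new_max - new_min) + new_min
}

marks <- c(8, 10, 15, 20)
min_max(marks)            # 0.000 0.167 0.583 1.000 in the range [0, 1]
min_max(marks, 0, 100)    # the same values rescaled to [0, 100]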