Bachelor of Science (Honours) in Data Science and Artificial Intelligence
DA 102 Data Analysis Basics
Introduction
Learning Objectives
01 Know various types of data
02 Issues associated with data
03 Understand broad group of analytics
04 Understand what is descriptive analytics
05 Understand what is predictive analytics
06 Tools used to perform analytics
About Data
Data
● Definition: collection of information, facts, or values
● Data is used as the basis for computation, analysis, and decision making.
● Data can be in the form of numbers, text, images, audio, video.
● Data is the raw material that is processed, organized, and interpreted to extract meaning
and generate insights.
● Data is a fundamental concept that plays a crucial role in various fields such as
science, business, and research.
● Data is considered key in the digital age.
Examples of data
Data 01
● Numbers: Quantity: 10; Price: 347.73
● Qualitative Data: {"Un-married", "Married", "Divorced"}; {"Student", "Faculty", "Staff"}
● Quantitative Data: {1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, 5 = Excellent}; {AA = 10, AB = 9, BB = 8, …, DD = 4, F = 0}
● Boolean Values: {True, False}
Data 02
● Temporal Data: Date: 15-Aug-2023; Time: 11:30 AM; Time Stamp: 11:30 AM, 15-Aug-2023
● Text Data: Sequence of characters such as words, sentences or paragraphs
● Mixed data: Combination of any of these data types
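As a minimal sketch of how temporal data can be handled in a program, the Python snippet below parses the time stamp shown above with the standard library. The variable names are illustrative, and the format string assumes English month abbreviations and AM/PM markers.

```python
from datetime import datetime

# Time stamp from the slide, kept as plain text
raw_stamp = "11:30 AM, 15-Aug-2023"

# %I = 12-hour clock, %M = minutes, %p = AM/PM,
# %d = day, %b = abbreviated month name, %Y = four-digit year
stamp = datetime.strptime(raw_stamp, "%I:%M %p, %d-%b-%Y")

print(stamp.date())  # the date part
print(stamp.time())  # the time part
```

Once parsed, the date and the time can be used separately or together, which is what distinguishes a time stamp from plain text.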
Data 03
● Audio: An encoded file corresponding to a song
● Image Data: Data captured through cameras; data captured through screenshots in computers
● Video data: Combination of several images and associated audio data
Data – stored in a spreadsheet
Data – stored in a notepad
Real valued data
Analysis - example 01
• To buy the book titled "The Linux Programming Interface" by Michael Kerrisk from an
online website.
• The objective is to buy the book from the website that offers the best price.
• To achieve this objective, collect prices from various e-commerce websites.
When data is real valued
Website Price Additional information
www.amazon.in 5594 In stock
www.flipkart.com 5599 In stock
books.rediff.com 5378 Out of stock
bookswagon.com 7738 Out of stock
www.snapdeal.com Not available Not available
• Prices are real-valued numbers.
• The price at snapdeal.com is not available because the product is unavailable.
• Computing the lowest price from the second column of the table is not straightforward,
because price information is not available for every website.
• Suppose the price offered by snapdeal is set to 0 to make every price a real-valued number:
www.snapdeal.com 0 Not available
• However, this change leads to the factually incorrect conclusion that the lowest price is
offered by snapdeal.
• Handling "not available" values is very important in data analysis; otherwise, decisions
swing significantly, resulting in errors.
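The effect of replacing a missing price with 0 can be sketched in a few lines of Python. The prices are the ones in the table; representing "Not available" as `None` and the variable names are assumptions made for this illustration.

```python
# Prices collected from the five websites; None marks "Not available"
prices = {
    "www.amazon.in": 5594,
    "www.flipkart.com": 5599,
    "books.rediff.com": 5378,
    "bookswagon.com": 7738,
    "www.snapdeal.com": None,
}

# Misleading: replacing the missing price with 0 makes snapdeal look cheapest
filled_with_zero = {site: (p if p is not None else 0) for site, p in prices.items()}
wrong_site = min(filled_with_zero, key=filled_with_zero.get)

# Safer: leave out websites whose price is not available
available = {site: p for site, p in prices.items() if p is not None}
lowest_site = min(available, key=available.get)

print("Missing price set to 0:", wrong_site, filled_with_zero[wrong_site])
print("Missing price left out:", lowest_site, available[lowest_site])
```

The first result names snapdeal, which has no price at all; the second names books.rediff.com, the genuine lowest price among the websites that reported one.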
• The lowest price is offered by books.rediff.com.
• The additional information tells us that although this is the best price offered, the book
is out of stock.
• We therefore must search for the second lowest price.
• Sort the prices in ascending order and pick the second element in the sorted list.
• We find that Amazon offers the second lowest price.
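The search for the best usable price can be sketched as follows. This is an illustration, not the only way to do it: instead of always taking the second element, it sorts the prices in ascending order and takes the first offer that is in stock, which gives the same answer for this table. The tuple layout and names are assumptions.

```python
# (website, price, in_stock) rows from the table; snapdeal is left out
# because its price is not available
offers = [
    ("www.amazon.in", 5594, True),
    ("www.flipkart.com", 5599, True),
    ("books.rediff.com", 5378, False),
    ("bookswagon.com", 7738, False),
]

# Sort by price, lowest first
by_price = sorted(offers, key=lambda row: row[1])

# Walk the sorted list and stop at the first offer that is in stock
best_offer = next(row for row in by_price if row[2])

print("Lowest price overall:", by_price[0])
print("Second lowest price: ", by_price[1])
print("Lowest price in stock:", best_offer)
```

Here the second element of the sorted list and the lowest in-stock offer are both Amazon, which matches the conclusion above.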
• Sampling: only 5 of the many e-commerce websites are visited.
• Computing the minimum price leads to a decision.
• Is this example too easy? Well, the fact is:
• 9 out of 10 consumers price check a product on Amazon.
• https://www.bigcommerce.com/blog/amazon-statistics/
Quantitative data
Analysis – example 02
• Visit www.amazon.com
• Search for any product and access the associated product description page.
• Apart from the product description, we get to see ratings given by users.
• The ratings are quantitative values, such as star ratings.
When data is Quantitative
https://www.vecteezy.com/
• A total of 1043 users gave ratings.
• 75% of users gave 5 stars.
• 17% of users gave 4 stars, and so on.
• The ordered values are presented as a histogram visualization.
• The average rating is computed and presented as 4.6 out of 5 stars.
User 1 5 stars
User 2 5 stars
User 3 4 stars
User 4 1 star
… …
User 1043 5 stars
• Analysis: the following are computed
• Number of users who gave 5 stars
• Number of users who gave 4 stars
• ...
• Number of users who gave 1 star
• Average rating across all users
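The counts and the average described above can be computed with a short Python sketch. The ten ratings below are a made-up sample, not the 1043 ratings from the product page, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical sample of star ratings, one entry per user
ratings = [5, 5, 4, 1, 5, 3, 4, 5, 2, 5]

counts = Counter(ratings)  # how many users gave each star value
total = len(ratings)

# One histogram row per star value, highest first
for stars in range(5, 0, -1):
    share = 100 * counts[stars] / total
    print(f"{stars} stars: {counts[stars]} users ({share:.0f}%)")

average = sum(ratings) / total
print(f"Average rating: {average:.1f} out of 5")
```

The same two steps, counting per star value and averaging, produce the histogram and the headline rating shown on a product page.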
• Product ratings influence conversions from viewing to buying.
• A correlation between ratings and conversion has been observed: a product with a 3.7
rating has a 15% higher click-through rate (CTR).
• Analysis involving only ratings
• From the data perspective
• The rating data may be visualized as being stored as shown:
User 1 5 stars
User 2 5 stars
User 3 4 stars
User 4 1 star
… …
User 1043 5 stars
• When a user visits a product page, the average rating and histogram are generated from
this kind of data.
• Analytics plays a pivotal role in transforming data and influencing decisions.
Qualitative data
Analysis – example 03
• Visit www.youtube.com
• Search for any video
• Apart from the video, we get
to see how many users
{liked, disliked} the video
• The number of likes on a video has a
strong correlation with an
increase in revenue
https://www.vecteezy.com/
When data is Qualitative
Text data
Analysis – example 04
• Visit www.amazon.com
• Search for any product and
access associated product
description page.
• Apart from the product description, we get to read detailed product reviews written by
customers who bought the product.
• This information is in text form.
• Analyze the contents to find out the sentiment of the customer.
• In analytics terms, perform sentiment analysis of customer reviews.
• If the complete review has a positive sentiment, it suggests that the customer is
happy with the product.
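To make the idea of sentiment analysis concrete, here is a deliberately simple word-counting sketch in Python. The word lists and the two reviews are invented for illustration; practical sentiment analysis uses far larger vocabularies or trained models, so this should be read as a toy example only.

```python
# Tiny hand-made word lists; illustrative only
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "poor", "terrible", "unhappy", "broken"}

def sentiment(review: str) -> str:
    """Label a review by counting positive and negative words."""
    # Lower-case each word and drop surrounding punctuation
    words = [w.strip(".,!?").lower() for w in review.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical customer reviews
print(sentiment("Great book, excellent explanations. I love it!"))
print(sentiment("Poor binding and the cover arrived broken."))
```

A review with more positive than negative words is labelled positive, which corresponds to a customer who is happy with the product.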
A computer program
• A computer program is data
• Collection of such programs
• Collection of executables
Issues in Data Collection
Bias and Sampling Issues
• Sampling Bias - When the sample collected does not represent the entire
population
• Selection Bias - When participants are not selected randomly, leading to skewed or
inaccurate results
• Nonresponse Bias - When a significant portion of the selected participants does not
respond, leading to a biased sample
Measurement Issues
• Measurement Error: Inaccuracies or inconsistencies in measurement
instruments, leading to incorrect or imprecise data.
• Subjective Measurements: When data is collected through subjective methods like
surveys, individuals might respond based on personal biases or interpretations.
• Social Desirability Bias: Participants may provide responses they believe are
socially acceptable rather than their true opinions or behaviors
Data Quality and Integrity Issues
• Data Integrity Issues
• Missing values
• A data record that is not complete.
• Data entry errors.
• Duplications in data.
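Two of these issues, incomplete records and duplicates, are easy to check for in code. The sketch below uses a small invented list of customer records, with `None` standing for a missing value; the field names are assumptions made for the example.

```python
# Hypothetical customer records; None marks a missing value
records = [
    {"id": 1, "name": "Asha", "age": 31},
    {"id": 2, "name": "Ravi", "age": None},   # missing value
    {"id": 3, "name": "Meena", "age": 27},
    {"id": 3, "name": "Meena", "age": 27},    # duplicate record
]

# A record with at least one missing field is incomplete
incomplete = [r for r in records if any(v is None for v in r.values())]

# Keep track of records already seen and collect the repeats
seen = set()
duplicates = []
for r in records:
    key = tuple(sorted(r.items()))  # hashable form of the record
    if key in seen:
        duplicates.append(r)
    else:
        seen.add(key)

print(len(incomplete), "incomplete record(s)")
print(len(duplicates), "duplicate record(s)")
```

Data entry errors are harder to detect automatically, because a wrong value can still look valid.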
Ethical and Privacy Concerns
• Informed Consent: Obtaining proper informed consent from participants, especially
in sensitive or intrusive research.
• Privacy Protection: Ensuring that personally identifiable information is not leaked
or misused.
• Anonymity: Striking a balance between collecting meaningful data and
protecting participants' identities
Technological Challenges
• Data Security: Protecting data from breaches, leaks, or unauthorized access.
• Data Storage: Ensuring proper and secure storage of collected data.
• Technical Glitches: Issues with data collection tools, software, or hardware can
affect data quality.
Cultural and Language Related Challenges
• Cultural Bias: Data collection instruments might be biased towards a particular
culture or group.
• Translation Issues: Translating surveys or questionnaires into different languages
can lead to discrepancies in meanings.
Resource Limitations
• Financial Constraints: Adequate data collection might require funding for tools,
personnel, and infrastructure.
• Time Constraints: Rushed data collection might lead to errors or incomplete data.
Types of analytics
Analytics components
• A business context is needed in order to take up data analytics.
• Two examples to support the business context.
First example:
• Pregnant women are likely to be price-insensitive.
• Their willingness to spend more is the potential for business.
• The market for baby products is worth 38 billion dollars.
• Given a female customer, predict whether the customer is pregnant or not.
• In the business context, if the prediction is accurate, offer pregnant women
customized products and services.
Second example:
• In competitive markets such as the telecommunication industry, customers abruptly
leave the services of the service provider.
• Predicting beforehand which customers will leave the service helps address the core
concerns of those customers.
• In the business context, if the prediction is accurate, offer discounts to retain
such customers.
Analytics types
• Descriptive analytics
• Predictive analytics
Descriptive analytics
• In descriptive analytics the focus is on summarizing and interpreting historical data
• It gives insights into past events and trends
• Strives to provide a clear and concise representation of data in order to understand
the data and aid in decision making
• Summarizing large volumes of data into manageable and interpretable forms
• Identify patterns and trends present in the historical data
• Visualize data, since visualizations make it easier to communicate complex data
Descriptive analytics - example
• An e-commerce company wants to analyze the sales data for a particular product
over the past year.
• It has data describing each sale such as: date of the sale, the product sold, the
quantity, and the revenue generated.
• Descriptive analytics include
• Summarization: Compute the average revenue for each product
• Summarization: Compute the total revenue of each product
• Visualization: Plot sales of product A over a timeline
• Visualization: Plot the quantity sold every month for product A
• Determine the top-selling products based on the total revenue generated by each product
• Customer segmentation: group customers based on purchase frequency and
total spending, and identify high-value and low-value customers
• Peak sales times: Identify the days of the week when sales are highest
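Several of the summaries listed above can be sketched in Python using only the standard library. The five sales records are invented for illustration, and the field order (date, product, quantity, revenue) follows the description of the sales data; all names are assumptions.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical sales records: (date of sale, product, quantity, revenue)
sales = [
    (date(2023, 1, 2), "A", 2, 200.0),
    (date(2023, 1, 7), "B", 1, 150.0),
    (date(2023, 1, 9), "A", 1, 100.0),
    (date(2023, 2, 4), "A", 3, 300.0),
    (date(2023, 2, 6), "B", 2, 300.0),
]

total_revenue = defaultdict(float)    # product -> total revenue
sale_count = defaultdict(int)         # product -> number of sales
weekday_revenue = defaultdict(float)  # day of week -> total revenue

for day, product, quantity, revenue in sales:
    total_revenue[product] += revenue
    sale_count[product] += 1
    weekday_revenue[WEEKDAYS[day.weekday()]] += revenue

# Summarization: average revenue per sale for each product
average_revenue = {p: total_revenue[p] / sale_count[p] for p in total_revenue}

# Top-selling product by total revenue, and the day of the week with most revenue
top_product = max(total_revenue, key=total_revenue.get)
peak_day = max(weekday_revenue, key=weekday_revenue.get)

print("Total revenue:", dict(total_revenue))
print("Average revenue per sale:", average_revenue)
print("Top product:", top_product)
print("Peak day:", peak_day)
```

Every number here only describes what has already happened; nothing is predicted, which is what separates descriptive analytics from predictive analytics.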
Predictive analytics
• In predictive analytics the focus is on predicting future trends from historical data and
current data
• The goal is to build models that can learn from historical data.
• Use the learning to make predictions about future events (data).
• The models are trained on known outcomes and patterns.
• The trained models are employed to predict new or unseen data.
Example - customer churn
• Churn rate: the percentage of customers who stopped using the product or service
• In a competitive marketplace, identifying potential customers who would discontinue
their services is key.
• If the customer's intent to leave is known beforehand, the company may initiate a
retention plan, which may include offering a coupon or providing a discount on the
previous three months of service.
• The main challenge is that no customer will explicitly state the reason for leaving the
service.
• The reasons are to be understood from multiple data sources. Here is an example.
Data from billing department
Data from service department
• In this example a correlation exists between the number of complaints raised, resolved,
and unresolved, and the rating.
• Obtain detailed data for all active customers for the past four months; a snapshot
of the same is given below.
Customer churn
• When data on a large number of customers is presented, it is hard to establish relations
between churn and the attributes of the data visually.
• When a relationship is established, it must be validated. In the
previous example, customer ID 1235 is identified as likely to churn. This must be
validated.
• The outline of predictive analytics is
• Perform data analysis and modeling: Analyze historical data and discover
relationships, correlations, and patterns that can be used to make predictions. In
the customer churn example, complaints raised, number of unresolved
complaints, and ratings on resolved complaints are the informative attributes.
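The idea of learning from known outcomes and then predicting for unseen customers can be sketched with a very small model: a single threshold on the number of unresolved complaints (sometimes called a decision stump). The history below is invented, the choice of this one attribute is an assumption, and real churn models use many attributes and far more data.

```python
# Hypothetical history: (unresolved complaints, churned?) for past customers
history = [(0, False), (1, False), (1, False), (2, False), (4, False),
           (3, True), (4, True), (3, True), (5, True)]

def accuracy(threshold: int) -> float:
    """Share of past customers classified correctly by the rule
    'churns if unresolved complaints >= threshold'."""
    hits = sum((n >= threshold) == churned for n, churned in history)
    return hits / len(history)

# "Training": pick the threshold that best matches the known outcomes
candidates = sorted({n for n, _ in history})
best_threshold = max(candidates, key=accuracy)

def predict_churn(unresolved: int) -> bool:
    """Predict churn for a new, unseen customer."""
    return unresolved >= best_threshold

print("Learned threshold:", best_threshold)
print("Accuracy on history:", round(accuracy(best_threshold), 2))
print("Customer with 4 unresolved complaints churns?", predict_churn(4))
print("Customer with 0 unresolved complaints churns?", predict_churn(0))
```

The learned rule is not perfect even on the history it was trained on, which is one reason a discovered relationship must be validated before it is acted upon.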
Data analytics tools
• Several tools are used for data analytics. Some of them are listed below.
• Microsoft Excel: Widely used for basic data analysis, calculations, and charting.
• Tableau: Known for its powerful data visualization capabilities and interactive
dashboards.
• Power BI: Microsoft's business analytics service for interactive visualizations and
business intelligence.
• QlikView/Qlik Sense: Tools for data visualization, reporting, and business
intelligence.
• R: A programming language and software environment for statistical computing
and graphics.
• Python: A versatile programming language used for data analysis and manipulation
with libraries like pandas and NumPy.
• SAS (Statistical Analysis System): Offers a suite of software for advanced analytics,
business intelligence, and data management
• SPSS (Statistical Package for Social Sciences): Software for statistical analysis,
used in social science research and data mining.
• MATLAB: A programming language and environment for numerical computing,
often used in engineering and scientific research.
• Stata: A software package used for data analysis and statistical purposes
• JMP: Statistical discovery software often used for exploratory data analysis.
DA102 – Microsoft Excel
• This course focuses on the basics of data analysis and business modeling.
• The Microsoft Excel software tool is used.
• The choice of this tool stems from the fact that students of this online programme
are in their first semester and have yet to be introduced to programming languages.
• For students who have already been introduced to programming languages, this course
should enrich them with an orientation to data manipulation through software
systems.
DA102 – Detailed contents
• Basic spreadsheet modeling
• Understanding range names
• LOOKUP functions
• INDEX function
• MATCH function
• Text manipulation functions
• Time – Dates and date functions
• Conditional statements (IF statement)
• Three-dimensional formulas
• Sensitivity analysis
• COUNT family
• SUM family
• OFFSET function
• INDIRECT function
• Data validation
• Filtering and removing duplicates
• Consolidating data
• Pivot tables