Course code : CSE1006
Course title : Foundations of Data Analytics
Module-1
Introduction to Data Analytics
31-12-2024 Dr. V. Srilakshmi 1
Approaches of Data Analytics:
• Data analytics helps a business optimize its performance, operate more
efficiently, maximize profit, or make more strategically guided decisions.
Various approaches to data analytics include looking at
• What happened (Descriptive analytics)
• Why something happened (Diagnostic analytics)
• What is going to happen (Predictive analytics), or
• What should be done next (Prescriptive analytics).
Data:
What is Data?
• Data is a collection of raw, unorganised facts and details such as text,
observations, figures, symbols, and descriptions of things.
• Data does not carry any specific purpose and has no significance
by itself.
• Data is measured in terms of bits and bytes – which are basic units
of information in the context of computer storage and processing.
Information:
What is Information?
• Information is processed, organised and structured data. It
provides context for data and enables decision making.
• For example, a single customer’s sale at a restaurant is data
– this becomes information when the business is able to
identify the most popular or least popular dish.
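The restaurant example can be sketched in a few lines of Python. The dish names and the list of sales below are hypothetical, chosen only to show how counting raw records turns data into information.

```python
from collections import Counter

# Hypothetical raw data: one dish name per individual sale.
sales = ["dosa", "idli", "dosa", "biryani", "dosa", "idli"]

# Processing the raw records (counting sales per dish) yields information.
counts = Counter(sales)
ranked = counts.most_common()   # ordered from most sold to least sold
most_popular = ranked[0][0]
least_popular = ranked[-1][0]

print(most_popular, least_popular)
```

Each entry in `sales` says little on its own; the ranking derived from all entries is what supports a decision.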
Data Vs information
Data Vs information
• Example:
Data Munging:
• In data analysis, data munging (also called data wrangling) refers to
the process of cleaning and transforming raw data into
a desired format, usually to facilitate further analysis or
visualization.
• Data munging can be done in languages such as Python or R.
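As a minimal sketch of munging in Python, the snippet below turns loosely formatted text lines into typed records. The raw lines, the field names and the helper `munge` are all hypothetical and exist only for illustration.

```python
# Hypothetical raw records, as they might arrive from a text export.
raw_rows = ["  Alice , 23 ,  4500.50", "BOB,31,3900", " carol,27, 5100.00 "]

def munge(row):
    """Split one raw line and convert each field to a clean, typed value."""
    name, age, salary = (field.strip() for field in row.split(","))
    return {"name": name.title(), "age": int(age), "salary": float(salary)}

# The desired format here is a list of dictionaries with consistent types.
records = [munge(row) for row in raw_rows]
print(records[0])
```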
Data Cleaning:
• This involves identifying and
correcting errors or inconsistencies
in the data, such as missing values,
noisy data, outliers, and duplicates.
Various techniques can be used for
data cleaning, such as imputation
(process of replacing missing values
with estimated values), removal,
and transformation.
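Two of these techniques, removal of duplicates and imputation of missing values, might look as follows in plain Python. The readings are hypothetical, and mean imputation is only one of several possible choices of estimate.

```python
from statistics import mean

# Hypothetical readings: None marks a missing value; one row is duplicated.
rows = [("a", 10.0), ("b", None), ("c", 14.0), ("a", 10.0), ("d", 12.0)]

# Removal: drop exact duplicate rows, keeping the first occurrence of each.
unique_rows = list(dict.fromkeys(rows))

# Imputation: replace each missing value with the mean of the observed values.
observed = [value for _, value in unique_rows if value is not None]
fill = mean(observed)
cleaned = [(key, fill if value is None else value) for key, value in unique_rows]

print(cleaned)
```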
Data Cleaning methods:
Example
Partition into equal frequency bins
Smoothing by bin means
In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1
is 9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median.
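Equal-frequency partitioning followed by smoothing by bin means or medians can be sketched as below. Only the Bin 1 values (4, 8, 15) come from the text; the other six values and the function names are assumptions made for illustration.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth(bins, summary):
    """Replace every value in each bin by summary(bin), e.g. mean or median."""
    return [[summary(b)] * len(b) for b in bins]

# Bin 1 (4, 8, 15) is from the text; the remaining values are assumed.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(data, 3)
by_means = smooth(bins, mean)
by_medians = smooth(bins, median)

print(by_means[0])    # Bin 1 smoothed by its mean: [9, 9, 9]
print(by_medians[0])  # Bin 1 smoothed by its median: [8, 8, 8]
```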
Smoothing by bin boundaries
In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
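A minimal sketch of smoothing by bin boundaries, applied to the Bin 1 values 4, 8 and 15 used earlier. The function name is hypothetical, and sending ties to the lower boundary is an assumption, since the text does not say how ties are resolved.

```python
def smooth_by_boundaries(bin_values):
    """Replace each value by whichever bin boundary (min or max) is closer."""
    low, high = min(bin_values), max(bin_values)
    # Assumption: a value equally far from both boundaries goes to the lower one.
    return [low if v - low <= high - v else high for v in bin_values]

bin_1 = [4, 8, 15]
smoothed = smooth_by_boundaries(bin_1)
print(smoothed)  # [4, 4, 15]
```

Here 8 is 4 away from the lower boundary and 7 away from the upper boundary, so it is replaced by 4.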
• Data Scraping:
• Data scraping, also known as web scraping, is the process of importing information
from a website into a spreadsheet or local file saved on your computer.
• It’s one of the most efficient ways to get data from the web, and in some cases to
channel that data to another website.
• You might use data scraping for:
• Website upgrades
• Competitor analysis
• In-depth reporting
• Some people use the technique to harm others. For example, some people set up
scraping tools to gather email addresses or social media profiles. Then they bundle
up that data and sell it to email spammers.
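• Four ways to protect your data from scraping:
• Limit requests: cap how many requests a single visitor can make in a given time.
• Apply CAPTCHA: require visitors to prove they are human.
• Use images: publish sensitive details such as email addresses as images rather than text.
• Shake up your text: vary the page markup so scrapers cannot rely on a consistent format.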
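The import step of scraping, from page markup to spreadsheet-style rows, can be sketched with Python's standard library alone. The HTML string and the class `TableScraper` are hypothetical; a real scraper would first download the page, which is omitted here.

```python
import csv
import io
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical page content standing in for a downloaded web page.
page = ("<table><tr><th>Item</th><th>Price</th></tr>"
        "<tr><td>Pen</td><td>10</td></tr></table>")

scraper = TableScraper()
scraper.feed(page)

# Write the scraped rows as CSV text, the form a spreadsheet can open.
buffer = io.StringIO()
csv.writer(buffer).writerows(scraper.rows)
csv_text = buffer.getvalue()
print(csv_text)
```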
Data Sampling:
• Data sampling is the practice of selecting a subset (a sample) of a
population in order to study the whole population.
• Every sampling type comes under two broad categories:
• Probability sampling - Random selection techniques are used
to select the sample.
• Non-probability sampling - Non-random selection
techniques based on certain criteria are used to select the
sample.
• Probability Sampling Techniques:
➢Simple Random Sampling
➢In simple random sampling, the researcher selects the
participants randomly. Tools such as random number generators
and random number tables are used so that the selection is
based entirely on chance. There are two types:
1. Simple random sampling with replacement
2. Simple random sampling without replacement
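The two types can be illustrated with Python's `random` module. The population of ten numbered members and the seed are assumptions made so that the sketch is repeatable.

```python
import random

population = list(range(1, 11))   # hypothetical population numbered 1..10
rng = random.Random(42)           # fixed seed so the run is repeatable

# With replacement: the same member may be drawn more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: every drawn member is distinct.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```

`choices` puts each drawn member back before the next draw, while `sample` never returns the same member twice.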
➢Systematic Sampling
➢In systematic sampling, every member of the population is assigned a
number, as in simple random sampling. However, instead of randomly
generating numbers, the sample members are chosen at regular intervals
(for example, every 10th member).
➢Stratified Sampling
➢In stratified sampling, the population is subdivided into subgroups,
called strata, based on some characteristics (age, gender, income,
etc.). After forming the subgroups, you can then use random or
systematic sampling to select a sample from each subgroup.
➢Cluster Sampling
➢In cluster sampling, the population is divided into subgroups (clusters), and
each subgroup has characteristics similar to those of the whole population.
Instead of selecting a sample from each subgroup, you randomly select entire
subgroups. This method is helpful when dealing with large and diverse
populations.
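The three techniques above can be compared side by side in a short sketch. The population of 20 numbered members, the strata and cluster names, the seed and the function names are all hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability

def systematic_sample(population, step):
    """Pick a random start in the first interval, then every step-th member."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, per_stratum):
    """Draw a simple random sample of per_stratum members from every stratum."""
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters):
    """Randomly choose whole clusters and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population of 20 numbered members.
population = list(range(1, 21))
strata = {"under_30": population[:12], "30_and_over": population[12:]}
clusters = {"north": population[:5], "south": population[5:10],
            "east": population[10:15], "west": population[15:]}

sys_sample = systematic_sample(population, step=4)
strat_sample = stratified_sample(strata, per_stratum=3)
clus_sample = cluster_sample(clusters, n_clusters=2)

print(sys_sample)
print(strat_sample)
print(clus_sample)
```

Stratified sampling takes a few members from every subgroup, whereas cluster sampling takes every member from a few subgroups.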
• Non-Probability Sampling Techniques:
➢Convenience Sampling
➢In this sampling method, the researcher simply selects the
individuals who are most easily accessible to them.
➢This is an easy way to gather data, but there is no way to
tell if the sample is representative of the entire population.
Measures of central tendency
• Measures of central tendency measure the location of
the middle or center of a data distribution.
• That is, given an attribute, where do most of its values fall?
1. Mean
2. Median
3. Mode
Mean
• The "average" number; found by adding all data
points and dividing by the number of data points.
• Example: The mean of 4, 1, and 7 is
(4 + 1 + 7)/3 = 12/3 = 4
Median
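• The middle number; found by ordering all data points and
picking out the one in the middle (or the mean of the two middle
values when the number of data points is even).
• Example: 4, 8, 2, 3, 6 sorted is 2, 3, 4, 6, 8, so the median is 4.
• The median of 4, 1, and 7 is 4 because when the numbers
are put in order (1, 4, 7), the number 4 is in the middle.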
Mode
• The most frequent number—that is, the number that
occurs the highest number of times.
• Example: The mode of {4,2,4,3,2,2} is 2 because it
occurs three times, which is more than any other
number.
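The worked examples for mean, median and mode can be checked with Python's built-in `statistics` module. This is one convenient way to verify the arithmetic, not the only way to compute these measures.

```python
from statistics import mean, median, mode

print(mean([4, 1, 7]))            # (4 + 1 + 7) / 3 = 4
print(median([4, 8, 2, 3, 6]))    # sorted: 2, 3, 4, 6, 8 -> 4
print(median([4, 1, 7]))          # sorted: 1, 4, 7 -> 4
print(mode([4, 2, 4, 3, 2, 2]))   # 2 occurs three times -> 2
```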
Calculation of Mean, Median and Mode in
Continuous Series
• In a continuous series (grouped frequency distribution), the
values of a variable are grouped into several class intervals
(such as 0-5, 5-10, 10-15) along with the corresponding
frequencies.
Calculate the mean of the following data
using Direct Method and Short-Cut Method:
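The table for this exercise is not reproduced here, so the sketch below uses the marks distribution from the continuous series example later in this module (frequencies 4, 15, 24, 16, 13 for classes 0-10 to 40-50). It assumes the usual definitions: the direct method computes sum(f × m) / sum(f) with m the class midpoint, and the short-cut method computes A + sum(f × d) / sum(f) with d = m - A for an assumed mean A.

```python
# Class intervals and frequencies from the marks example in this module.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [4, 15, 24, 16, 13]

midpoints = [(lo + hi) / 2 for lo, hi in classes]
n = sum(freqs)

# Direct method: mean = sum(f * m) / sum(f)
direct_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Short-cut method: pick an assumed mean A, use deviations d = m - A,
# then mean = A + sum(f * d) / sum(f).
A = 25  # assumed mean; any value works, a central midpoint keeps numbers small
shortcut_mean = A + sum(f * (m - A) for f, m in zip(freqs, midpoints)) / n

print(round(direct_mean, 2), round(shortcut_mean, 2))  # 27.64 27.64
```

Both methods give the same mean; the short-cut method only reduces the size of the numbers handled by hand.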
Example:
• The weekly expenditures of 100 families are listed in the
following table. Calculate the median weekly expenditure.
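Solution:
Find n = sum of frequencies (fi) = 100
Median (M) = Size of [n/2]th item
= Size of [100/2]th item
= Size of 50th item (the median class is the first class whose
cumulative frequency is equal to or greater than 50)
Hence, the median lies in the class 1500-3000.
l = 1500 (lower limit of the median class),
f = 25 (frequency of the median class),
h = 3000 - 1500 = 1500 (class width),
c.f. = 30 (cumulative frequency of the class preceding the median class)
Now apply the following formula:
Median = l + ((n/2 - c.f.) / f) × h
= 1500 + ((50 - 30) / 25) × 1500
= 1500 + 1200
Median = 2700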
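The result can be checked with a short function. It assumes the standard grouped-data formula Median = l + ((n/2 - c.f.) / f) × h, which reproduces the stated answer from the values given in the solution; the function name is hypothetical.

```python
def grouped_median(l, h, f, cf, n):
    """Median of grouped data, given the median class parameters.

    l: lower limit of the median class, h: class width,
    f: frequency of the median class,
    cf: cumulative frequency of the preceding class, n: total frequency.
    """
    return l + (n / 2 - cf) * h / f

print(grouped_median(l=1500, h=1500, f=25, cf=30, n=100))  # 2700.0
```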
Example of Continuous Series
• If 4 students of a class score marks between 0-10, 15 students score
marks between 10-20, 24 students score marks between 20-30, 16 students
score marks between 30-40, and 13 students score marks between 40-50,
then this information will be shown as:
Marks       Number of students
0-10        4
10-20       15
20-30       24
30-40       16
40-50       13
• Median class = 20-30
• l = 20
• cf = 19
• f = 24
• h = 10
• n = 72
• Median = l + ((n/2 - cf) / f) × h = 20 + ((36 - 19) / 24) × 10 ≈ 27.08
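The values listed above can be derived directly from the marks distribution (frequencies 4, 15, 24, 16, 13 for classes 0-10 to 40-50). The sketch below locates the median class by cumulative frequency and then applies the standard grouped median formula; the function name is hypothetical.

```python
from itertools import accumulate

def median_from_table(classes, freqs):
    """Locate the median class and apply l + ((n/2 - cf) / f) * h."""
    n = sum(freqs)
    cumulative = list(accumulate(freqs))
    # Median class: the first class whose cumulative frequency reaches n/2.
    idx = next(i for i, c in enumerate(cumulative) if c >= n / 2)
    l, upper = classes[idx]
    cf = cumulative[idx - 1] if idx > 0 else 0  # cumulative freq. before it
    return l + (n / 2 - cf) * (upper - l) / freqs[idx]

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [4, 15, 24, 16, 13]
marks_median = median_from_table(classes, freqs)
print(round(marks_median, 2))  # 27.08
```

The cumulative frequencies are 4, 19, 43, 59, 72, so the 36th item falls in the class 20-30, which matches the median class stated above.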
• Example: In a class of 30 students, the marks (out of 50) obtained by
students in mathematics are tabulated below. Calculate the mode of
the given data.
Solution:
The maximum class frequency is 12 and the class interval corresponding to this
frequency is 20 – 30. Thus, the modal class is 20 – 30.
Lower limit of the modal class (l) = 20
Size of the class interval (h) = 10
Frequency of the modal class (f1) = 12
Frequency of the class preceding the modal class (f0) = 5
Frequency of the class succeeding the modal class (f2)= 8
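Substituting these values in the formula
Mode = l + ((f1 - f0) / (2f1 - f0 - f2)) × h, we get:
Mode = 20 + ((12 - 5) / (2 × 12 - 5 - 8)) × 10 = 20 + 70/11 ≈ 26.36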
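As a quick check of this substitution, the sketch below assumes the standard grouped-data formula Mode = l + ((f1 - f0) / (2f1 - f0 - f2)) × h and uses the values listed in the solution; the function name is hypothetical.

```python
def grouped_mode(l, h, f1, f0, f2):
    """Mode of grouped data, given the modal class parameters.

    l: lower limit of the modal class, h: class width,
    f1: modal class frequency, f0: preceding class frequency,
    f2: succeeding class frequency.
    """
    return l + (f1 - f0) * h / (2 * f1 - f0 - f2)

print(round(grouped_mode(l=20, h=10, f1=12, f0=5, f2=8), 2))  # 26.36
```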
• Find mean, median and mode for the following data