Unit 2: Data Collection
• Getting to know your data
• Types of Data
• Data collection strategies
• Data Pre-engineering with IoT data
• Exploratory Data Analytics
• Descriptive Statistics, Mean, Deviation, Skewness and Kurtosis
Getting to know your data
• The first step in any analytics, especially IoT analytics, is getting to know your
data.
• You need to know everything you can about it.
• Know its faults and its strong points.
• Know which attributes can be annoying and make sure you are able to live with
them.
• Get to know its future earning potential. And find all this out before it is too late
and you are bound to it.
Getting to know your data
• Data completeness: Due to the nature of low-power sensors and unreliable
connectivity, IoT data often has missing and incomplete values.
• Data validity: After finishing your completeness checks, it is important to check
the validity of the data in the records that you do have.
• Assessing Information Lag: Data from IoT devices does not always arrive in
your dataset at regular intervals. Some records may be included quickly
because the devices are located closer to the network or have better connectivity.
• Assessing the variation in lag times, which we will call Information Lag, is
important before applying more advanced analytics to the data.
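As a rough illustration of assessing Information Lag, the sketch below compares the time a reading was taken with the time it arrived. The record layout, device names, and timestamps are hypothetical and only meant to show the idea.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (device id, time measured on the device, time received)
records = [
    ("dev-1", "2024-01-01T10:00:00", "2024-01-01T10:00:02"),
    ("dev-2", "2024-01-01T10:00:00", "2024-01-01T10:00:45"),
    ("dev-3", "2024-01-01T10:00:00", "2024-01-01T10:05:00"),
]

def lag_seconds(measured_at: str, received_at: str) -> float:
    """Return the delay between measurement and arrival, in seconds."""
    measured = datetime.fromisoformat(measured_at)
    received = datetime.fromisoformat(received_at)
    return (received - measured).total_seconds()

lags = [lag_seconds(measured, received) for _, measured, received in records]
print("lags (s):", lags)
print("median lag (s):", median(lags))
print("max lag (s):", max(lags))
```

A large gap between the median and the maximum lag suggests that some devices report much later than others, which is worth knowing before analysing recent data.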
Getting to know your data
• Representativeness: Do the observed data values represent the real world?
• IoT devices should record values that are not only within the range of the sensors
but also line up with what they are measuring in the real world.
• For example, a healthcare IoT device that measures pulse rate should report a
distribution similar to what is observed at physician's offices for the same patient
population.
Types of Data
What is data?
Data is information, facts, or statistics that you can analyze and interpret. You can
use data across various fields and industries to guide decision-making, optimize
processes, and uncover valuable insights.
Types Of Data
1. Quantitative data
Also called numerical data, this data can be measured or counted
Examples include age, temperature, and number of purchases
Quantitative data can be used to perform calculations like averages and
correlation.
Types of Data (Cont…)
2. Qualitative data
This data represents information and concepts that are not represented by
numbers
Examples include responses from focus groups, interviews, and observations
Qualitative data can provide deeper insight into a topic
3. Discrete data
This data involves values that are distinct and separate
Examples include the number of pets someone has or the number of wins
someone's favorite team gets
Types of Data (Cont…)
4. Continuous data
This data can take any value within a given range
Examples include height, weight, temperature, and length
Continuous data can change over time
5. Ordinal data
Ordinal data represents information with a clear order or ranking.
When you handle ordinal data, you might see examples like customer
satisfaction ratings, educational levels, or survey responses.
Analyzing ordinal data typically involves calculating measures of
central tendency, such as the median, and using graphs like bar charts
or pie charts to display the data distribution.
Types of Data (Cont…)
6. Primary Data
• Primary data refers to information collected directly from first-hand
sources.
• This type of data is gathered through various methods, including surveys,
interviews, experiments, observations, and focus groups.
• One of the main advantages of primary data is that it provides current, relevant,
and specific information.
Types of Data (Cont…)
7. Secondary Data
• Secondary data refers to information that has already been collected and
published by others.
• This type of data can be sourced from existing research papers, government
reports, books, and company records.
• The advantage of secondary data is that it is readily available and often free or
less expensive to obtain compared to primary data.
• It saves time and resources since the data collection phase has already been
completed.
Types of Data (Cont…)
Types of data in data science for the IoT
In data science for the Internet of Things (IoT), the main types of data include:
• Status data (basic device information): Simple, raw data that indicates the
current state of a device or system, like whether a machine is on or off.
• Automation data (generated by automated systems): Data created by
automated systems like smart thermostats or lighting.
• Location data (geographical position): Geographical coordinates of an IoT
device, crucial for tracking assets or monitoring movements.
Types of Data (Cont…)
Types of data in data science for the IoT (Cont…)
• Sensor data (physical measurements like temperature, pressure): Continuous
streams of physical measurements captured by sensors like temperature,
humidity, pressure, vibration, etc.
• Event data (occurrences like alarms or changes in state): Discrete occurrences
like alarms, errors, or status changes that trigger specific actions.
Data Collection Strategies in Data Science
• Data Collection is the process of collecting information from relevant sources
to find a solution.
• Collection of Data is the first and foremost step in a statistical investigation.
• Data collection is a critical step in any research or data-driven decision-making
process, ensuring the accuracy and reliability of the results obtained.
Data Collection Strategies in Data Science (Cont…)
Different methods of collecting data include
• Interviews
• Questionnaires
• Observations
• Experiments
• Published Sources and Unpublished Sources
Methods of Collecting Primary Data
• There are a number of methods of collecting primary data. Some of the common
methods are as follows:
• 1. Interviews: Collect data through direct, one-on-one conversations with
individuals. The investigator asks questions either directly from the source or from its
indirect links.
Direct Personal Investigation: The method of direct personal investigation involves
collecting data personally from the source of origin. In simple words, the investigator
makes direct contact with the person from whom he wants to obtain information.
Methods of Collecting Primary Data (Cont…)
Indirect Oral Investigation: In the indirect oral investigation method of collecting
primary data, the investigator does not make direct contact with the person from
whom information is needed. Instead, the data is collected orally from some
other person who has the required information. For
example, collecting data of employees from their managers.
Advantage: Provides real-time, natural data.
Disadvantage: Observer bias, may influence subjects’ behavior.
Suitable Use Case: Behavioral studies, user experience research.
Methods of Collecting Primary Data (Cont…)
2. Questionnaires: Collect data by asking people a set of questions, either online, on
paper, or face-to-face. In this method the investigator prepares questionnaires and
schedules to collect information, keeping in mind the purpose of the study.
(A) Mailing Method: This method involves mailing the questionnaires to the informants
for the collection of data. The investigator attaches a letter with the questionnaire in
the mail to define the purpose of the study or research.
Advantage: Can reach a large audience quickly and cost-effectively.
Disadvantage: Responses may be inaccurate.
Suitable Use Case: Customer satisfaction surveys, market research.
3. Observations: The observation method involves collecting data by watching
and recording behaviors, events, or conditions as they naturally occur. The
observer systematically watches and notes specific aspects of a subject’s behavior
or the environment, either secretly or openly.
Advantage: Provides real-time, authentic data without dependence on self-reported
information.
Disadvantage: Observer bias can influence the results, and the presence of an
observer might alter subjects’ behavior.
Suitable Use Case: Assessing classroom dynamics.
4. Experiments: The experiment method involves manipulating one or more variables
to determine their effect on another variable, within a controlled environment.
• Advantage: Allows for the establishment of cause-and-effect relationships with high
precision.
Disadvantage: Experiments can be artificial, limiting the ability to generalize
findings to real-world settings.
Suitable Use Case: Assessing the impact of a new teaching method, or evaluating
the effect of a marketing campaign.
Methods of Collecting Secondary Data
• Secondary data can be collected through different published and unpublished
sources. Some of them are as follows:
• 1. Published Sources
Government Publications: The government publishes different documents which
consist of different varieties of information or data published by the Ministries,
Central and State Governments.
As the government publishes these statistics, they are considered fairly reliable by the
investigator.
Examples of Government publications on Statistics are the Annual Survey of
Industries etc.
Journals and Papers: Different newspapers and magazines provide a variety of
data in their writings, which are used by different investigators for their studies.
2. Unpublished Sources
• Unpublished sources are another source of collecting secondary data.
• The data in unpublished sources is collected by different government
organizations and other organizations.
• These organizations usually collect data for their own use, and the data is not
published anywhere.
• For example, research work done by professors, professionals, teachers and
records maintained by businesses.
Key Steps in the Data Collection Process
1. Decide What Data You Want to Gather
2. Establish a Deadline for Data Collection
3. Select a Data Collection Approach: Data Collection Technique
4. Gather Information
5. Examine the Information and Apply Your Findings
Data Pre-engineering with IoT data
• Data pre-engineering is a crucial step in processing IoT data before applying
analytics, machine learning, or decision-making algorithms.
• Since IoT devices generate vast amounts of real-time, heterogeneous, and
sometimes noisy data, effective pre-processing is essential for ensuring data
quality and efficiency.
• Definition: Data pre-engineering in the context of IoT data refers to the “process
of cleaning, transforming, and preparing raw sensor data collected from IoT
devices before it can be used for analysis, machine learning, or other
applications, ensuring the data is accurate, consistent, and suitable for
further processing by addressing issues like missing values, outliers, and
incompatible formats”.
Key aspects of IoT data pre-engineering
1. Data Collection
• IoT devices collect raw data from sensors (temperature, humidity, motion, etc.).
2. Data Cleaning:
o Handling missing values: Identifying missing data points and filling them with
appropriate techniques like mean/median replacement.
o Outlier detection and handling: Identifying and addressing abnormal data
points that differ significantly from expected values, potentially by removing
them or applying outlier correction methods.
o Data type conversion: Ensuring data is in the correct format (e.g.,
converting string values to numeric).
o Data validation: Checking for inconsistencies based on domain knowledge.
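The cleaning steps above can be sketched in a few lines. This is a minimal illustration, not a full pipeline: the readings, the use of an empty string for a missing value, and the valid range of -40 to 60 degrees are all assumptions chosen for the example.

```python
from statistics import median

# Hypothetical raw temperature readings as strings; "" marks a missing value
raw = ["21.5", "22.0", "", "21.8", "85.0", "22.1"]

def to_float(value):
    """Convert a string reading to float; return None if missing or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def clean(readings, low=-40.0, high=60.0):
    # Data type conversion: strings to numbers
    values = [to_float(r) for r in readings]
    # Data validation: treat values outside the assumed sensor range as missing
    values = [v if v is not None and low <= v <= high else None for v in values]
    # Handling missing values: replace with the median of the valid readings
    valid = [v for v in values if v is not None]
    fill = median(valid)
    return [fill if v is None else v for v in values]

cleaned = clean(raw)
print(cleaned)
```

Here the out-of-range reading 85.0 is handled by a domain rule (the assumed sensor range) and then filled in the same way as the missing value.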
Key aspects of IoT data pre-engineering (Cont…)
3. Data Transformation:
o Normalization and scaling: Adjusting the range of numerical features to a
common scale (e.g., scaling to values between 0 and 1) to improve the
performance of machine learning algorithms.
o Feature engineering: Creating new features by combining existing data points
or applying mathematical operations to extract meaningful patterns.
4. Data Integration:
o Data merging: Combining data from multiple IoT devices or sensors to
create a comprehensive view.
o Data alignment: Ensuring identifiers are consistent across different data
sources.
o Schema mapping: Dealing with inconsistencies in data structures and
formats from different devices.
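Scaling to values between 0 and 1, as mentioned under normalization, is often done with min-max scaling. The sketch below assumes a small list of hypothetical humidity readings.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to scale, so map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

humidity = [30.0, 45.0, 60.0, 90.0]  # hypothetical humidity readings in %
scaled = min_max_scale(humidity)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Min-max scaling is sensitive to outliers, because a single extreme reading sets the top or bottom of the range, so it is usually applied after the cleaning step.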
Exploratory Data Analytics
• What is Exploratory Data Analysis?
• Exploratory Data Analysis (EDA) is an important first step in data science
projects.
• It involves looking at and visualizing data to understand its main features,
find patterns, and discover how different parts of the data are connected.
• EDA helps to spot any unusual data or outliers and is usually done before starting
more detailed analysis or building models.
Exploratory Data Analytics (Cont…)
Why is Exploratory Data Analysis Important?
Here are some of the key reasons why EDA is a critical step in the data
analysis process:
• Helps to understand the dataset, showing how many features there are,
the type of data in each feature, and how the data is spread out, which
helps in choosing the right methods for analysis.
• EDA helps to identify hidden patterns and relationships between different
data points, which help us in model building.
• Allows you to spot errors or unusual data points that could affect your results.
• Insights that you obtain from EDA help you decide which features are
most important for building models and how to prepare them to improve
performance.
• By understanding the data, EDA helps us in choosing the best modeling
techniques and adjusting them for better results.
Exploratory Data Analytics (Cont…)
Types of Exploratory Data Analysis: we can divide EDA into three types:
• Univariate
• Bivariate
• Multivariate
1. Univariate Analysis
• Univariate analysis focuses on studying one variable to understand its
characteristics.
• It helps describe the data and find patterns within a single feature.
• Common methods include histograms to show data distribution, box plots to
detect outliers, and bar charts for categorical data.
• Summary statistics like mean, median, mode, variance, and standard
deviation help describe the central tendency and spread of the data.
Exploratory Data Analytics (Cont…)
Types of Exploratory Data Analysis (Cont…)
2. Bivariate Analysis
• Bivariate analysis focuses on exploring the relationship between two
variables to find connections, correlations, and dependencies.
• It’s an important part of exploratory data analysis that helps understand how
two variables interact.
• Some key techniques used in bivariate analysis include:
Scatter plots, which visualize the relationship between two continuous
variables.
Correlation coefficient, a number that measures the strength and direction of a
linear relationship between two variables.
Line graphs, which are useful for comparing two variables over time, especially in time
series data, to identify trends or patterns.
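The correlation coefficient described above can be computed directly from its definition. The sketch below uses the Pearson coefficient; the variable names and readings are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(ss_x * ss_y)

temperature = [20, 22, 24, 26, 28]
fan_speed = [1.0, 1.2, 1.4, 1.6, 1.8]   # rises with temperature
battery = [95, 90, 85, 80, 75]          # falls as temperature rises

print(round(pearson_r(temperature, fan_speed), 3))
print(round(pearson_r(temperature, battery), 3))
```

Values near +1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.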
Exploratory Data Analytics (Cont…)
Types of Exploratory Data Analysis (Cont…)
3. Multivariate Analysis
• Multivariate analysis examines the relationships between three or more variables in
the dataset.
• It aims to understand how variables interact with one another, which is crucial for
most statistical modeling techniques.
• It includes techniques like pair plots, which show the relationships between
multiple variables at once, helping to see how they interact.
Steps for Performing Exploratory Data Analysis
Steps for Performing Exploratory Data Analysis (Cont…)
Step 1: Understand the Problem and the Data
• The first step in any data analysis project is to clearly understand the problem
you’re trying to solve and the data you have. This involves asking key questions
such as:
• What is the business goal or research question?
• What are the variables in the data and what do they represent?
• What types of data (numerical, categorical, text, etc.) do you have?
• Are there any known data quality issues or limitations?
• Are there any domain-specific concerns or restrictions?
• By thoroughly understanding the problem and the data, you can better plan your
analysis, avoid wrong assumptions, and ensure accurate conclusions
Steps for Performing Exploratory Data Analysis (Cont…)
Step 2: Import and Inspect the Data
• After clearly understanding the problem and the data, the next step is to import the data into
your analysis environment (like Python). At this stage, it’s crucial to examine the data to get
an initial understanding of its structure, variable types, and potential issues.
• Here’s what you can do:
Load the data into your environment carefully to avoid errors.
Examine the size of the data to understand its complexity.
Check for missing values, since missing data can impact the quality of your analysis.
Identify data types for each variable (like numerical, categorical, etc.), which will help in data
manipulation and analysis.
Look for errors, such as invalid values, mismatched units, or outliers, which could signal deeper
issues with the data.
By completing these tasks, you’ll be prepared to clean and analyze the data more effectively.
Steps for Performing Exploratory Data Analysis (Cont…)
Step 3: Handle Missing Data
• Missing data is common in many datasets and can significantly affect the quality of your
analysis.
• During Exploratory Data Analysis (EDA), it’s important to identify and handle missing
data properly to avoid misleading results.
Here’s how to handle it:
• Understand the patterns and possible reasons for missing data. Is it missing completely at
random (MCAR), missing at random (MAR), or missing not at random (MNAR)? Knowing this
helps decide how to handle the missing data.
• Use appropriate imputation methods like mean/median imputation or machine learning
techniques like decision trees based on the data’s characteristics. Imputation methods
are techniques for replacing missing data with estimated values. The goal is to create a
complete data set that can be analyzed using standard statistical methods.
Steps for Performing Exploratory Data Analysis (Cont…)
Step 4: Explore Data Characteristics
• After addressing missing data, the next step in EDA is to explore the characteristics of your
data by examining the distribution, central tendency, and variability of your variables, as
well as identifying any outliers.
• This helps in selecting appropriate analysis methods and spotting potential data issues.
• You should calculate summary statistics like mean, median, mode, standard
deviation, skewness, and kurtosis for numerical variables.
• These provide an overview of the data’s distribution and help identify any irregular patterns
or issues.
Steps for Performing Exploratory Data Analysis (Cont…)
Step 5: Perform Data Transformation
Data transformation is an essential step in EDA because it prepares your data for accurate
analysis and modeling. Depending on your data’s characteristics and analysis needs, you may
need to transform it to ensure it’s in the right format.
Common transformation techniques include:
Scaling or normalizing numerical variables (e.g., min-max scaling or standardization).
Applying mathematical transformations (e.g., logarithmic or square root) to correct non-
linearity.
Creating new variables from existing ones (e.g., calculating ratios or combining variables).
Aggregating or grouping data based on specific variables or conditions
Steps for Performing Exploratory Data Analysis (Cont…)
Step 6: Visualize Data Relationship
• Data visualization is the graphical representation of information and data. It uses visual
elements like charts, graphs, and maps to make complex data more understandable,
accessible, and actionable.
• Why is Data Visualization Important?
• Helps identify patterns, trends, and outliers in data.
• Makes large datasets easier to analyze.
• Enhances decision-making by presenting data in an intuitive format.
• Common Types of Data Visualization
1. Bar Charts – Compare values across different categories.
2. Line Graphs – Show trends over time.
3. Pie Charts – Display proportions of a whole.
Steps for Performing Exploratory Data Analysis (Cont…)
Step 7: Handling Outliers
• Outliers are data points that significantly differ from the rest of the data, often caused
by errors in measurement or data entry.
• Detecting and handling outliers is important because they can skew your analysis and affect
model performance.
• You can identify outliers using methods like Z-scores or domain-specific rules.
“Z-Score in statistics is a measurement of how many standard deviations away a data point is
from the mean of a distribution.” For a value x from data with mean μ and standard
deviation σ, z = (x – μ)/σ.
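A minimal sketch of Z-score based outlier detection follows. The readings and the threshold of 2.5 are assumptions for illustration; the right threshold depends on the data and the domain.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the Z-score of each value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical, Z-scores are undefined")
    return [(v - mu) / sigma for v in values]

def find_outliers(values, threshold=2.5):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical temperature readings with one suspicious value
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 35.0]
print(find_outliers(readings))  # [35.0]
```

With very small samples a single extreme value also inflates the standard deviation, so its Z-score can stay below a commonly quoted cut-off such as 3. In this example the reading 35.0 has a Z-score just under 3, which is why a lower threshold is used.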
Steps for Performing Exploratory Data Analysis (Cont…)
Step 8: Communicate Findings and Insights
• The final step in EDA is to communicate your findings clearly. This involves summarizing
your analysis, pointing out key discoveries, and presenting your results in a clear way.
• Clearly state the goals and scope of your analysis.
• Provide background to help others understand your approach.
• Use visualizations to support your findings and make them easier to understand.
• Highlight key insights, patterns, or anomalies discovered.
• Mention any limitations or challenges faced during the analysis.
• Suggest next steps or areas that need further investigation.
Descriptive Statistics
• Descriptive statistics is a subfield of statistics that deals with characterizing the
features of known data.
• Descriptive statistics give summaries of either population or sample data.
• Descriptive statistics are broken down into measures of central tendency,
measures of variability (spread), and measures of shape.
• Measures of central tendency include the mean, median, and mode.
• Measures of variability include standard deviation and variance.
• Measures of shape include skewness and kurtosis.
Measures of Central Tendency
• The central tendency is defined as a statistical measure that may be used to
describe a complete distribution or dataset with a single value, known as
a measure of central tendency.
• Measures of central tendency describe the center of the data set (mean, median,
mode).
• The mean, median and mode are the three commonly used measures of central
tendency.
Measures of Central Tendency (Cont…)
Mean
• Mean is nothing but the average of the given set of values. It denotes the
equal distribution of values for a given data set.
• Mean is the sum of all the components in a group or collection divided by the
number of items in that group or collection.
• Mean of a data collection is typically represented as x̄ (pronounced "x bar").
Measures of Central Tendency (Cont…)
• Mean for Ungrouped Data
• The formula for calculating the mean of ungrouped data is given as follows:
For a series of observations:
x̄ = Σx / n
where,
x̄ = Mean Value of Provided Dataset
Σx = Sum of All Terms
n = Number of Terms
Measures of Central Tendency (Cont…)
Example: Weights of 7 girls in kg are 54, 32, 45, 61, 20, 66 and 50. Determine
the mean weight for the provided collection of data.
• Mean = Σx/n
= (54 + 32 + 45 + 61 + 20 + 66 + 50)/7
= 328 / 7
≈ 46.86
Thus, the group's mean weight is approximately 46.86 kg.
Measures of Central Tendency (Cont…)
Example: In a class there are 20 students and they have secured a percentage of 88, 82, 88, 85,
84, 80, 81, 82, 83, 85, 84, 74, 75, 76, 89, 90, 89, 80, 82, and 83.
Find the mean percentage obtained by the class.
Solution:
Mean = Total of percentage obtained by 20 students in class/Total number of students
= [88 + 82 + 88 + 85 + 84 + 80 + 81 + 82 + 83 + 85 + 84 + 74 + 75 + 76 + 89 + 90 + 89 + 80 +
82 + 83]/20
= 1660/20
= 83
Hence, the mean percentage obtained by the class is 83%.
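The two ungrouped examples above can be checked with a few lines of code, applying x̄ = Σx / n directly.

```python
def mean(values):
    """Arithmetic mean: sum of all terms divided by the number of terms."""
    return sum(values) / len(values)

weights = [54, 32, 45, 61, 20, 66, 50]
percentages = [88, 82, 88, 85, 84, 80, 81, 82, 83, 85,
               84, 74, 75, 76, 89, 90, 89, 80, 82, 83]

mean_weight = mean(weights)          # 328 / 7
mean_percentage = mean(percentages)  # 1660 / 20

print(round(mean_weight, 2))  # 46.86
print(mean_percentage)        # 83.0
```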
Measures of Central Tendency (Cont…)
• Mean for grouped Data
• The mean of grouped data is the average of all the data
points in a group. It's calculated by adding the products
of each class interval's midpoint and its frequency, and
then dividing that sum by the total number of values.
• If x1, x2, x3, …, xn are the observations with
respective frequencies f1, f2, f3, …, fn, then
x̄ = (f1x1 + f2x2 + … + fnxn) / (f1 + f2 + … + fn) = Σfixi / Σfi
Measures of Central Tendency (Cont…)
Example: The marks scored by 30 students of class 10 of a certain school in the
Maths paper consisting of 100 marks is given below in the tabular form. Find the
mean of the marks obtained by the class 10 students.
Measures of Central Tendency (Cont…)
• Solution: To find the mean of the marks obtained by the students in the Mathematics paper, we
need to find the product of each xi and their corresponding frequency fi.
Marks Obtained (xi)    Number of students (fi)    fixi
10                     1                          10
20                     1                          20
36                     3                          108
40                     4                          160
50                     3                          150
56                     2                          112
60                     4                          240
70                     4                          280
72                     1                          72
80                     1                          80
88                     2                          176
92                     3                          276
95                     1                          95
Total                  Σfi = 30                   Σfixi = 1779
Thus, by using the formula x̄ = Σfixi / Σfi, we get
x̄ = 1779/30
x̄ = 59.3
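The grouped-mean calculation in the table can be reproduced as follows, using the marks and frequencies from the example.

```python
def grouped_mean(values, freqs):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    total = sum(f * x for x, f in zip(values, freqs))
    return total / sum(freqs)

marks = [10, 20, 36, 40, 50, 56, 60, 70, 72, 80, 88, 92, 95]
students = [1, 1, 3, 4, 3, 2, 4, 4, 1, 1, 2, 3, 1]

print(sum(students))                  # 30
print(grouped_mean(marks, students))  # 59.3
```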
Measures of Central Tendency (Cont…)
What is a Continuous Series: In continuous series, the value of a variable is grouped into
several class intervals (such as 0-5,5-10,10-15) along with the corresponding frequencies.
Measures of Central Tendency (Cont…)
Solution: The first step is to create the table with the midpoint or marks and the
product of the frequency and midpoint. To calculate the midpoint, we find the
average of the lower and upper limits of each class interval:
Midpoint = (lower limit + upper limit)/2
For example, (0+10)/2 = 5 and (10+20)/2 = 15.
Measures of Central Tendency (Cont…)
• Median: Median of a data set is the value of the middle-most observation
obtained after organizing the data in ascending order.
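• Median formula may be used to compute the median for many types of data, such
as grouped and ungrouped data.
If the total number of observations n is odd, then
Median = value of the ((n + 1)/2)th observation
If n is even, then
Median = average of the (n/2)th and ((n/2) + 1)th observations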
Calculation of Median in Continuous Series
• Step 1: Arrange the given data in either descending or ascending
order.
• Step 2: Determine the cumulative frequency; i.e., cf.
• Step 3: Calculate the median item using the following formula:
Median item = (N/2)th item
• Where, N = Total of Frequency
• Step 4: Now inspect the cumulative frequencies and find out the cf which is either equal to or just greater than the value determined in the previous
step.
• Step 5: Now, find the class corresponding to the cumulative frequency equal to or just greater than the value determined in the third step. This class
is known as the median class.
• Step 6: Now, apply the following formula for the median:
Median = l1 + ((N/2 – c.f.) / f) × i
• Where,
• l1 = lower limit of the median class
• c.f. = cumulative frequency of the class preceding the median class
• f = simple frequency of the median class
• i = class size of the median group or class
• Note: While calculating the median of a given distribution, we have to assume that every class of the distribution is uniformly distributed in the
class interval.
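The steps above can be sketched as a small function. It applies the standard continuous-series formula Median = l1 + ((N/2 - c.f.) / f) × i, using the symbols defined in the list above. The class intervals and frequencies are hypothetical and are not taken from the slides.

```python
def grouped_median(classes, freqs):
    """Median of a continuous series.

    classes: list of (lower, upper) class limits in ascending order
    freqs:   frequency of each class
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    median_item = n / 2  # Step 3: N/2
    cf = 0               # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, freqs):
        # Steps 4-5: the first class whose cumulative frequency reaches N/2
        if f > 0 and cf + f >= median_item:
            # Step 6: l1 + ((N/2 - c.f.) / f) * i
            return lower + (median_item - cf) / f * (upper - lower)
        cf += f
    raise ValueError("classes and freqs must have the same length")

# Hypothetical distribution: N = 40, so the median item is the 20th
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 12, 10, 5]
print(round(grouped_median(classes, freqs), 2))  # 25.83
```

Here the cumulative frequencies are 5, 13, 25, 35, 40, so the median class is 20-30, with l1 = 20, c.f. = 13, f = 12 and i = 10.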
Measures of Central Tendency (Cont…)
Mode: Mode is defined as the value that appears the most frequently in the
provided data, i.e. the observation with the highest frequency is known as the mode
of data.
Standard Deviation
• Standard deviation is a statistical measure of the amount of variation or
dispersion in a set of values around the mean.
• It tells us how much the data points in a sample typically vary from the mean
value of the data.
• Low, or small, standard deviation indicates data are clustered tightly around the
mean, and high, or large, standard deviation indicates data are more spread out.
• Standard Deviation of a given sample of data set is also defined as the square root
of the variance of the data set.
• Standard deviation is used in business to measure the variability of financial data
such as sales, revenue, and profits. It helps to identify trends, patterns, and
outliers, as well as determine critical values, which in turn enable businesses to
make informed decisions.
Sample and Population in Statistics
Population:
• Represents the entire group of interest, including all possible
individuals or items that could be studied.
Sample:
• A smaller group selected from the population to collect data.
• Example:
• Population: All students enrolled at a university.
• Sample: A group of 100 students randomly selected from that
university to participate in a survey.
Final Rule of Thumb
If the data represents the whole group → Use population standard deviation
If the data represents only a part of the whole group → Use sample standard
deviation
Look for Keywords in the Problem Statement
•If the problem mentions "entire population," "all," "whole," "complete dataset," use
population standard deviation.
•If the problem mentions "sample," "subset," "selected data," or states that data was
taken from a larger group, use sample standard deviation.
Example Questions and Identification
1."A researcher collected the height of all students in a school and calculated the
standard deviation."
1. Since the data includes all students in the school, this is a population standard
deviation problem.
2."A scientist recorded the weight of 50 randomly selected dogs from a city to estimate
the standard deviation of all dogs' weights."
1. Since only a sample (50 dogs) was taken, this is a sample standard deviation
problem.
3."The average income and standard deviation of all employees in a company were
calculated."
1. Since it includes all employees, it's a population standard deviation problem.
4."A medical researcher took blood pressure readings from 200 patients in a hospital to
estimate the variation in blood pressure among all patients."
1. Since only a subset (sample) of patients is used, it's a sample standard deviation
problem.
Example: Calculating Population Standard Deviation
Suppose we have a small population of exam scores: 50,60,70,80,90
Example: Calculating Sample Standard Deviation
Suppose we take a sample of 5 students' test scores: 50,60,70,80,90
Standard Deviation
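Population Standard Deviation Formula: when you have data for the entire population
σ = √( Σ(xi – μ)² / N )
Sample Standard Deviation Formula: when you only have data for a subset of the population
s = √( Σ(xi – x̄)² / (n – 1) )
where μ is the population mean, N is the population size, x̄ is the sample mean, and n is
the sample size.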
If the sample standard deviation formula used n in its denominator, it would tend to
underestimate the variability of the population, because the deviations are measured from
the sample mean rather than the true population mean. To correct this, n is replaced
with n – 1. This adjustment is called Bessel's correction.
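The difference between the two formulas can be seen on the exam-score data used in the examples above (50, 60, 70, 80, 90). The sketch computes both by hand and checks them against Python's statistics module.

```python
from math import sqrt
from statistics import pstdev, stdev

scores = [50, 60, 70, 80, 90]
mean = sum(scores) / len(scores)                     # 70.0
squared_dev = sum((x - mean) ** 2 for x in scores)   # 1000.0

population_sd = sqrt(squared_dev / len(scores))      # divide by N
sample_sd = sqrt(squared_dev / (len(scores) - 1))    # divide by n - 1

# The standard library gives the same results
assert abs(population_sd - pstdev(scores)) < 1e-9
assert abs(sample_sd - stdev(scores)) < 1e-9

print(round(population_sd, 3))  # 14.142
print(round(sample_sd, 3))      # 15.811
```

The sample value is larger because dividing by n - 1 instead of n compensates for measuring deviations from the sample mean.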
Example: Find the SD of data set: X = 56, 65, 74, 75,
76, 77, 80, 81, 91
Standard Deviation Formula Based on Discrete Frequency Distribution
• For a given data set with values (x1, x2, x3, …, xn) and corresponding
frequencies (f1, f2, f3, …, fn), the standard deviation is calculated
using the formula
σ = √( ∑ fi(xi – x̄)² / n ), where n = ∑(fi)
Example: Calculate the standard deviation for the given data.
xi    fi
10    1
4     3
6     5
8     1
Solution:
n = ∑(fi) = 1+3+5+1 = 10
Mean (x̄) = ∑(fi xi)/∑(fi) = (10×1 + 4×3 + 6×5 + 8×1)/(1+3+5+1) = 60/10 = 6
xi    fi    fixi    (xi – x̄)       (xi – x̄)²    fi(xi – x̄)²
10    1     10      10 – 6 = 4     16           16
4     3     12      4 – 6 = –2     4            12
6     5     30      6 – 6 = 0      0            0
8     1     8       8 – 6 = 2      4            4
Now, using the standard deviation formula
σ = √(∑ fi(xi – x̄)²/n)
⇒ σ = √[(16 + 12 + 0 + 4)/10]
⇒ σ = √(3.2) ≈ 1.789
Standard Deviation (σ) ≈ 1.789
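The discrete frequency distribution example can be checked in code. Note that for xi = 8 the term fi(xi - x̄)² is 1 × 4 = 4, so the sum of the last column is 16 + 12 + 0 + 4 = 32 and the variance is 3.2.

```python
from math import sqrt

def frequency_std(values, freqs):
    """Population standard deviation of a discrete frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    return mean, sqrt(variance)

values = [10, 4, 6, 8]
freqs = [1, 3, 5, 1]

mean, sd = frequency_std(values, freqs)
print(mean)          # 6.0
print(round(sd, 3))  # 1.789
```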
Standard Deviation of Continuous Grouped Data
• For the continuous grouped data we can easily calculate the standard
deviation using the Discrete data formulas by replacing each class with its
midpoint (as xi) and then normally calculating the formulas.
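• Midpoint of each class is calculated using the formula,
Midpoint (xi) = (lower limit + upper limit)/2
For example, calculate the standard deviation of the
continuous grouped data given in the table:
Class             5-15    15-25    25-35    35-45
Frequency (fi)    2       4        2        2
Midpoint (xi)     10      20       30       40
n = ∑(fi) = 2+4+2+2 = 10
Mean (x̄) = ∑(fi xi)/∑(fi) = (10×2 + 20×4 + 30×2 + 40×2)/(2+4+2+2) = 240/10 = 24
∑ fi(xi – x̄)² = 2×196 + 4×16 + 2×36 + 2×256 = 1040
σ = √(1040/10) = √(104) ≈ 10.198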
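For continuous grouped data, the same calculation can be sketched by first replacing each class with its midpoint. The function below uses the class table from the example and the population form of the formula (dividing by n).

```python
from math import sqrt

def grouped_std(classes, freqs):
    """Mean and population standard deviation of continuous grouped data.

    Each class (lower, upper) is represented by its midpoint.
    """
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
    return mean, sqrt(variance)

classes = [(5, 15), (15, 25), (25, 35), (35, 45)]
freqs = [2, 4, 2, 2]

mean, sd = grouped_std(classes, freqs)
print(mean)          # 24.0
print(round(sd, 3))  # 10.198
```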
Skewness
• Skewness is a statistical measure that describes the asymmetry of the distribution of
values in a dataset.
• It indicates whether the data points are skewed to the left (negative skew) or the right
(positive skew) relative to the mean.
• It quantifies the degree to which the data deviates from a perfectly symmetrical
distribution, such as a normal (bell-shaped) distribution.
Types of Skewness:
1. Positive Skewness (Right Skew)
2. Negative Skewness (Left Skew)
3. Zero Skewness (Symmetrical Distribution)
Tests of Skewness
Here are some common tests and techniques used to assess skewness:
1. Visual Inspection: The simplest way to assess skewness is by creating
a histogram or a density plot of the given data. If the longer tail of the plot extends
to the left, the data is negatively skewed; if the longer tail extends to the right, it is
positively skewed. If the plot is roughly symmetrical, it has little or no skewness.
Measurement of Skewness
1. Karl Pearson's Measure
This is a numerical measure of skewness, which determines the skewness when mean and
mode are not equal.
It is calculated as:
Skewness = Mean - Mode
Skewness of Karl Pearson's Measure
• If mean > mode, the skewness is positive.
• If mean < mode, the skewness is negative.
• If mean = mode, the skewness is zero.
Coefficient of Skewness as per Karl Pearson's Measure
Sk = 3(X̄ – M) / σ
Here,
Sk is Karl Pearson's skewness coefficient.
X̄ is the arithmetic mean or average of the data.
M is the middle value (Median) of the data when it is arranged in ascending order.
σ is the standard deviation of the data.
Coefficient of Karl Pearson's Measure
If Sk = 0, it indicates a perfectly symmetric distribution where the data is evenly
balanced on both sides of the mean.
If Sk > 0, it suggests a positively skewed distribution where the tail on the right side is
longer or fatter, and the majority of data points are concentrated on the left side of the
mean.
If Sk < 0, it indicates a negatively skewed distribution where the tail on the left side is
longer or fatter, and the majority of data points are concentrated on the right side of
the mean.
Interpretation: The skewness coefficient (Sk)
is negative, indicating a slight negative
skewness in the distribution of exam scores.
This means that the tail of the distribution is
slightly longer on the left side, and most of the
scores are concentrated on the right side of the
mean.
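As a rough illustration, the median-based coefficient Sk = 3(mean – median)/σ can be computed as below. The exam-score data behind the interpretation above is not shown in these slides, so this sketch reuses the data set from the standard deviation example instead, and it assumes the population standard deviation.

```python
import statistics

# Data set reused from the standard deviation example (not the exam scores)
data = [56, 65, 74, 75, 76, 77, 80, 81, 91]

mean = statistics.mean(data)
median = statistics.median(data)
sigma = statistics.pstdev(data)   # population SD (assumption)

sk = 3 * (mean - median) / sigma  # Karl Pearson's coefficient, median form
print(f"mean = {mean}, median = {median}, Sk = {sk:.3f}")
```

Here the mean is below the median, so Sk comes out negative, which points to a mild left skew in this data set.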
2. Bowley's Measure
• Bowley's Skewness Coefficient is a statistical measure used to assess the
skewness or asymmetry in a probability distribution.
• Bowley's Skewness Coefficient is based on quartiles. This coefficient provides a
simple way to understand the direction and magnitude of skewness in a dataset.
• Bowley's Skewness Coefficient is especially useful when dealing with data that
may not follow a normal distribution.
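Bowley's coefficient (B or Sq) = (Q3 + Q1 – 2Q2) / (Q3 – Q1)
where Q1, Q2 and Q3 are the first quartile, the second quartile (median) and the third quartile.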
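A small Python sketch of the quartile formula above. Quartile values depend on the interpolation convention; this sketch assumes the default "exclusive" method of statistics.quantiles, so other tools may give slightly different numbers. The data set is reused from the standard deviation example.

```python
import statistics

data = [56, 65, 74, 75, 76, 77, 80, 81, 91]

# n=4 returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4)

bowley = (q3 + q1 - 2 * q2) / (q3 - q1)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, Bowley = {bowley:.3f}")
```

The coefficient always lies between –1 and +1, which makes it easy to compare across data sets with different units.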
3. Kelly's Measure
• Kelly's measure of skewness is a way to quantify the degree of skewness in a
distribution by comparing the values of certain percentiles (typically the
10th, 50th, and 90th percentiles).
• Specifically, it involves comparing the difference between the median
(50th percentile) and the average of the 10th and 90th percentiles (or deciles) to
assess the skewness of the data.
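The slides do not state the formula, so the sketch below assumes the commonly given form, Kelly's coefficient = (P90 + P10 – 2·P50) / (P90 – P10). Percentiles are taken from statistics.quantiles with its default "exclusive" method, which is also an assumption; the data set is reused from the standard deviation example.

```python
import statistics

data = [56, 65, 74, 75, 76, 77, 80, 81, 91]

# n=10 returns the nine deciles: index 0 is P10, index 4 is P50, index 8 is P90
deciles = statistics.quantiles(data, n=10)
p10, p50, p90 = deciles[0], deciles[4], deciles[8]

kelly = (p90 + p10 - 2 * p50) / (p90 - p10)
print(f"P10 = {p10}, P50 = {p50}, P90 = {p90}, Kelly = {kelly:.3f}")
```

Compared with Bowley's measure, this uses more of the tails (the outer 10% on each side rather than the outer 25%), so it is more sensitive to values far from the median.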
Kurtosis
• Kurtosis is a measure of the “tailedness” of the probability
distribution of a random variable. In other words, kurtosis identifies
whether the tails of a given distribution contain extreme values.
Types of Kurtosis
• There are three types of kurtosis:
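Mesokurtic: Mesokurtic distributions have a kurtosis value similar to
that of the normal distribution (kurtosis ≈ 3, excess kurtosis ≈ 0).
Leptokurtic: Leptokurtic distributions have positive excess kurtosis
(kurtosis > 3), meaning heavier tails than the normal distribution.
Platykurtic: Platykurtic distributions have negative excess kurtosis
(kurtosis < 3), meaning lighter tails than the normal distribution.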
How to Calculate Kurtosis?
It can be calculated as the ratio of the fourth moment to the square of the variance.
To calculate kurtosis in statistics, you can follow these steps:
1. Compute the Mean (μ): Calculate the arithmetic mean of the dataset.
2. Compute the Variance (σ²): Calculate the variance of the dataset, which is the average of the squared differences
from the mean.
3. Compute the Standard Deviation (σ): Take the square root of the variance to find the standard deviation.
4. Compute the Fourth Moment (μ₄): Calculate the fourth moment of the dataset, which is the average of the
fourth power of the differences from the mean.
5. Compute Kurtosis: The formula for calculating kurtosis is:
Kurtosis = μ₄/σ⁴
Sometimes, you might also see a version of kurtosis that subtracts 3 from this calculation. This is called excess
kurtosis, and it subtracts 3 because the kurtosis of a normal distribution is 3.
So the formula becomes:
Excess Kurtosis = (μ₄/σ⁴) – 3
This version is often used because it allows for easier comparison to the normal distribution, where excess
kurtosis is 0.
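The steps above can be sketched in plain Python as follows. The data set is reused from the standard deviation example, and population moments (divide by n) are assumed throughout; libraries that apply a sample bias correction will return slightly different values.

```python
# Data set reused from the standard deviation example
data = [56, 65, 74, 75, 76, 77, 80, 81, 91]
n = len(data)

mean = sum(data) / n                                # mean
variance = sum((x - mean) ** 2 for x in data) / n   # second central moment
sigma = variance ** 0.5                             # standard deviation
m4 = sum((x - mean) ** 4 for x in data) / n         # fourth central moment

kurtosis = m4 / variance ** 2    # same as m4 / sigma**4
excess_kurtosis = kurtosis - 3   # 0 for a normal distribution

print(f"kurtosis = {kurtosis:.4f}, excess kurtosis = {excess_kurtosis:.4f}")
```

For this data set the kurtosis comes out at roughly 3.04, so the excess kurtosis is close to zero and the tails are about as heavy as those of a normal distribution.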
Here's a more detailed explanation:
• First Moment (around the mean):
• This is always zero because it represents the
average deviation from the mean, which is
inherently zero.
• Second Moment (around the mean):
• This is the variance, a measure of how spread out
the data is around the mean.
• Third Moment (around the mean):
• This is used to calculate skewness, which
describes the asymmetry of a distribution. A
positive skewness indicates a longer tail to the
right, while a negative skewness indicates a longer
tail to the left.
• Fourth Moment (around the mean):
• The fourth moment around the mean, often
denoted as μ₄, is a measure of kurtosis, which
describes the "peakedness" or "tailedness" of a