
Unit 2 Descriptive Analytics

The document provides an overview of data types, including constant and variable data, and further classifies variable data into qualitative and quantitative categories. It explains data analysis methods, including univariate, bivariate, and multivariate analysis, as well as the scales of measurement: nominal, ordinal, interval, and ratio. Additionally, it covers descriptive statistics, including measures of central tendency and dispersion, to summarize and describe datasets.

Uploaded by

theiconicps

Collection, Presentation of Data, and Descriptive Statistics
UNIT 2
Data

Data is a collection of facts such as values, observations, or measurements.
It can be numbers, words, measurements, observations, or even just descriptions of things.
Basically, data are of two types: constant and variable.
Constant data

A constant is a situation or value that does not change.
Examples: days of the week, months of the year, days in a month, the product of two given numbers, the number of letters in the alphabet, hours in a day, atomic numbers of elements in the periodic table, the spelling of numbers or digits, the multiplication table, the chemical formula of water, the equation for the average of n numbers, the sine series, the cosine series.
Variable data

A variable is a characteristic or an
attribute that can assume different values
in different situations.
Example: height, family size,
temperature, result, Profit/loss of an
entity, job satisfaction, reading habits,
attendance in classroom, GDP, petrol and
gold price.
Based on the values that variables
assume, variables can be classified as
qualitative and quantitative.
Qualitative and Quantitative Variable Data

Qualitative variables are variables that do not assume numeric values. For example, gender, season, degree, and facial expression are qualitative variables. They are also known as categorical data.
Quantitative variables, on the other hand, are variables that assume numeric values. Height, temperature, blood pressure, and family size are examples of quantitative variables. They are also known as numeric data.
Quantitative variables are further classified into discrete and continuous variables.
Discrete Variable Data

A discrete variable assumes whole-number values and consists of distinct, recognizable individual elements that can be counted.
Examples: family size, total admissions at a college, number of cars at a traffic signal, events organized or attended by faculty members, total likes for a tweet, followers on Instagram, mails in a mailbox.
The values of these variables are obtained by counting (0, 1, 2, ...).
Continuous Variable Data

A continuous variable can take any value within a range, including decimals.
These variables can theoretically assume an infinite number of possible values.
Their values are obtained by measuring.
Examples of continuous variables are height, weight, time, temperature, and the distance between objects.
Exercise

Classify each of the following variables as qualitative or quantitative. If it is quantitative, classify it as discrete or continuous.
Color of automobiles in a dealer's showroom.
Number of seats in a movie theater.
Classification of patients based on nursing care needed (complete, partial or safers).
Number of tomatoes on each plant in a field.
Weight of newborn babies.
Distance between two tree leaves.
Data Analytics vs Data Analysis

Data analysis is a process involving the collection, manipulation, and examination of data to gain deep insight.
Data analytics takes the analyzed data and works on it in a meaningful and useful way to make well-informed business decisions.
Analytics is defined as a process of transforming data into actions through analysis and statistical tools.
Data Analysis

Data analysis is a special form of data analytics.
It is an approach to analyzing or exploring a data set in order to maximize insight into the data, discover hidden patterns, uncover underlying structure, understand the distribution of the data and the relationships between features, select an appropriate model, and detect outliers or anomalies for further analysis.
Applications include retail, health care, education, electronic media, election polls, entertainment, the economy, and stocks.
Data Presentation

Two ways to represent the results of exploratory data analysis:
Statistics (non-graphical)
Visualization (graphical)
Data analysis: Univariate, Bivariate and Multivariate
1. Univariate data
This type of data consists of only one variable. Univariate analysis is therefore the simplest form of analysis, since the information deals with only one quantity that changes. It does not deal with causes or relationships; its main purpose is to describe the data and find patterns that exist within it. An example of univariate data is height.
Patterns in this type of data can be described using measures of central tendency (mean, median and mode), measures of dispersion or spread (range, minimum, maximum, quartiles, variance and standard deviation), and frequency distribution tables, histograms and pie charts.

2. Bivariate data
This type of data involves two different variables. The analysis deals with causes and relationships, and is done to find the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.
Suppose temperature and ice cream sales are the two variables of a bivariate data set (figure 2). The table shows that sales rise as the temperature rises, so the two variables are positively related.
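The strength of such a relationship can be summarized with a correlation coefficient. The sketch below computes the Pearson correlation coefficient by hand; the temperature and sales values are hypothetical numbers invented for illustration and are not the table from figure 2, and the function name `pearson_r` is likewise only an illustrative choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, invented for illustration only.
temperature_c = [20, 25, 30, 35, 40]
ice_cream_sales = [200, 250, 310, 400, 520]

r = pearson_r(temperature_c, ice_cream_sales)
print(round(r, 3))  # close to +1: sales rise with temperature
```

A value near +1 indicates that the two variables increase together; it does not by itself show that one causes the other.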
3. Multivariate data
When the data involves three or more variables, it is categorized as multivariate. For example, suppose an advertiser wants to compare the popularity of four advertisements on a website. Their click rates could be measured for both men and women, and the relationships between the variables could then be examined. It is similar to bivariate analysis but contains more than one dependent variable.
| Univariate | Bivariate | Multivariate |
|---|---|---|
| It summarizes only a single variable at a time. | It summarizes only two variables. | It summarizes more than two variables. |
| It does not deal with causes and relationships. | It does deal with causes and relationships, and analysis is done. | It does not deal with causes and relationships, and analysis is done. |
| It does not contain any dependent variable. | It contains only one dependent variable. | It is similar to bivariate but contains more than two variables. |
| The main purpose is to describe. | The main purpose is to explain. | The main purpose is to study the relationship among them. |
Data Measurement Scales

A scale of measurement shows the information contained in the value of a variable, and which mathematical operations and statistical analyses are permissible on the values of the variable.
There are four levels of measurement.
From the weakest to the strongest, these levels are: nominal scale, ordinal scale, interval scale and ratio scale.
Example : Scales of Measurement

Let us take four different situations for a class of 30 students:
1. Assign them roll numbers from 1 to 30 in no particular order, or at random, so long as no student has more than one number.
2. Ask the students to stand in a queue in order of height and assign them position numbers in the queue from 1 to 30.
3. Conduct a test of 50 marks for all students and award marks from 0 to 50 according to their performance.
4. Measure the height and weight of the students and make a student-wise record.
Situation 1 : Nominal variables

Any student could be assigned No. 1, and any student could be assigned No. 30.
No two students can be compared in any respect on the basis of the numbers allotted.
The students have been labeled from 1 to 30 only to give each an identity.
Nominal variables

Nominal variables are variables that show the category of individuals.
Numbers can be assigned to objects for identification.
They reflect classification into mutually exclusive (non-overlapping) and exhaustive categories (names of groups) without any associated ranking.
This scale is the weakest form of measurement.
The only mathematical operation permissible on these variables is counting.
Nominal variables

A nominal variable is another name for a categorical variable.
Nominal variables have two or more categories without any kind of natural order.
They are variables with no numeric value.
Numbers may be assigned to the categories simply for coding purposes. It is not possible to compare individuals based on the numbers assigned to categories.
The categories cannot be ordered or quantified.
In other words, you cannot perform arithmetic operations on them, such as addition or subtraction, or order comparisons such as “greater than”. You can only check whether two values belong to the same category.
Examples: Nominal variables

Qualitative:
Gender (Male, Female, Transgender).
Eye color (Blue, Green, Brown, Hazel).
Type of house (Bungalow, Duplex, Ranch).
Type of pet (Dog, Cat, Rodent, Fish, Bird).
Religious preference: Buddhist, Mormon, Muslim, Jewish,
Christian, Other.
Political party: BJP, Congress, AAP
Numeric labels (numbers used only as identifiers):
A person's phone number, national identification number, postal code, Aadhaar card number, etc.
Coded categories such as 1 = male, 2 = female.
Situation 2 : Ordinal variables

Students have been assigned position numbers in a queue from 1 to 30.
Here the numbering is not arbitrary: the numbers have been assigned according to the heights of the students.
The students are therefore comparable on the basis of their heights, as there is a sequence in this regard. Every subsequent student is taller than the previous one.
Here the object or event has an identity as well as an order.
As the difference in height between any two students is not known, the property of addition of numbers is not applicable to the ordinal scale.
Ordinal variables

An ordinal scale assigns numbers to objects, and the numbers also have a meaningful order.
Ordinal variables are variables whose values can be ordered and ranked.
Ranking and counting are the mathematical operations permissible on the values of these variables.
However, the ranks only indicate which category is greater or better; the difference between values (categories) of the variable is not precisely defined.
Examples: Ordinal variables

Grade scores (A, B, C, D, F).
Academic qualifications (BE, ME, Ph.D.).
Strength (very weak, weak, strong, very strong).
Health status (very sick, sick, cured).
Satisfaction level (very satisfied, satisfied, dissatisfied).
Situation 3 : Interval variables
Students have been awarded marks from 0 to 50 on the basis of their performance in the test administered to them. Consider the marks obtained by 3 students: 30, 20 and 40 respectively.
Here, it may be interpreted that the difference in performance between the 1st and 2nd students (10 marks) is the same as that between the 1st and 3rd students.
A student getting 0 marks cannot be described as having zero achievement level.
Similarly, the 2nd student cannot be said to have half the intelligence of the 3rd student simply because the 2nd has 20 and the 3rd has 40.
Interval variables

Interval variables are quantitative variables that identify not only which category is greater or better but also by how much.
The numbers have order as well as equal intervals between adjacent categories.
The interval scale is a stronger form of measurement than the ordinal scale, but it has no true zero.
Zero indicates a low value rather than the absence of the characteristic.
Example: for temperature, 0 °C does not mean that there is no temperature; rather, it means that it is very cold.
Similarly, if a student scores 0 in a certain course, it does not mean that the student has no knowledge of the course at all.
Situation 4 : Ratio variables

The exact physical values of the heights and weights of all students have been obtained.
Here the values are comparable in all respects.
If two students have heights of 120 cm and 140 cm, then the difference in their heights is 20 cm and the heights are in the ratio 6:7.
This is the ratio scale.
Ratio variables

The ratio scale is the highest form of measurement.
Ratio variables are quantitative variables in which, unlike interval variables, zero indicates the absence of the characteristic.
All mathematical operations are permissible on the values of these variables.
Examples: height, weight, income, amount of yield, expenditure, consumption.
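The difference between the interval and ratio scales can be checked with a little arithmetic. The sketch below reuses the 120 cm and 140 cm heights from Situation 4; the two Celsius readings are illustrative values that do not come from the slides.

```python
from fractions import Fraction

# Ratio scale: zero height means "no height", so both the difference and the
# ratio of two heights are meaningful.
height_a_cm, height_b_cm = 120, 140
difference_cm = height_b_cm - height_a_cm
ratio = Fraction(height_a_cm, height_b_cm)
print(difference_cm, ratio)  # 20 6/7


# Interval scale: the zero of the Celsius scale is arbitrary, so the ratio of
# two readings changes when the same temperatures are expressed in Fahrenheit.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


t1_c, t2_c = 20, 40  # illustrative readings, not taken from the slides
ratio_c = t2_c / t1_c
ratio_f = celsius_to_fahrenheit(t2_c) / celsius_to_fahrenheit(t1_c)
print(ratio_c, round(ratio_f, 3))  # the two ratios disagree
```

Because the ratio depends on the unit chosen, a statement such as "40 °C is twice as hot as 20 °C" is not meaningful, whereas "140 cm is 7/6 times 120 cm" holds in any unit of length.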
Exercise

Identify the scale of measurement for each of the following:
Jersey number assigned to a cricket player
Aadhaar card number
Rank order of runners in a race
GTU top 10 students
Top 10 highest taxpayers
Temperature in centigrade
Elevation from sea level
Height
Age
Scale of Measurement
Population and Sample

A population is the set of all possible observations for a given problem context.
The size of a population is usually very large.
For example, consider the eligible voters in an election. The population of voters may run into millions.
During every election, media and other organizations collect data through opinion polls to predict the likely winner.
Population and Sample

It is practically impossible to collect data from millions of eligible voters about their choice of candidate, so opinion polls are based on the opinions expressed by a subset of voters, called a sample.
The population (also known as the universal set) is the set of all possible data for a given context.
A sample is a subset taken from a population.
A poorly chosen sample may result in bias and incorrect inferences about the population.
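The idea of estimating a population quantity from a sample can be sketched in a few lines. Everything in this example is hypothetical: the population of 100,000 voters, the 55/45 split between candidates "A" and "B", and the sample size of 1,000 are invented purely for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: each entry is one voter's preferred candidate.
population = ["A"] * 55_000 + ["B"] * 45_000

# Simple random sample of 1,000 voters, drawn without replacement.
sample = random.sample(population, k=1_000)

population_share_a = population.count("A") / len(population)
sample_share_a = sample.count("A") / len(sample)
print(population_share_a, sample_share_a)
```

With a simple random sample the sample share is usually close to the population share. A sample drawn from only one region or one age group would not offer that assurance, which is the bias the text warns about.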
Descriptive Statistics

Descriptive statistics refers to a set of methods used to summarize and describe the main features of a dataset, such as its central tendency, variability, and distribution. These methods provide an overview of the data and help identify patterns and relationships.

Descriptive statistics include the following details about the data:
Central tendency
  Mean: also known as the average.
  Median: the centermost value of the given dataset.
  Mode: the value that appears most frequently in the given dataset.
  The measure of central tendency you use depends on what you are trying to describe. The mean and median can be used only for numerical data; the mode can be used with both numerical and nominal data.
Statistical dispersion
  Range: shows how spread out the given data is.
  Variance: shows how far the measurements are from the mean.
  Standard deviation: the square root of the variance; it also measures how far the data deviate from the mean.
Measures of shape and symmetry
  The bell curve: the graph of a normal distribution of a variable, so called because of its shape.
  Skewness: a measure of the asymmetry of the distribution of a variable about its mean.
  Kurtosis: a measure of the “tailedness” of the distribution of a variable, that is, how much of the data lies in the tails.
Measures of Central Tendency

Measures of central tendency describe the data using a single value.
They help users summarize and comprehend the data.
Computed averages: mean and mode, which are frequently used to compare different data sets.
Positional averages: median and quantiles (quartiles, deciles and percentiles).
Mean : Arithmetic Mean

Simple arithmetic mean:
The arithmetic mean is the simplest but most useful measure of central tendency.
It is nothing but the 'average'. It is defined as the sum of all observations divided by the total number of observations.
The sample mean is denoted by x̅ (read as "X bar"), while the population mean is represented by the Greek letter μ.
Individual observations

For a sample of n raw (individual) observations X1, X2, …, Xn:
$\bar{x} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Example: Find the arithmetic mean of 2, 4 and 8.
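The example can be worked directly from the definition (sum of the observations divided by their number). The sketch below keeps the exact fraction as well as the decimal value.

```python
from fractions import Fraction

data = [2, 4, 8]

# Sum of all observations divided by the number of observations.
x_bar = sum(data) / len(data)
x_bar_exact = Fraction(sum(data), len(data))

print(x_bar_exact)      # 14/3
print(round(x_bar, 2))  # 4.67
```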
Mode

Mode is another measure of central tendency.
It is the value that occurs most frequently in a data set.
For instance, if shoe size 7 has the maximum demand, size 7 is the modal value of shoe sizes. Mode is denoted by X̂.
A data set may have one mode (uni-modal), two modes (bi-modal), more than two modes (multi-modal) or no mode at all (i.e. when all observations are equally frequent).
Mode

In ungrouped (individual series) cases, the mode can be found by inspection. After arranging the data in ascending or descending order, the value appearing most frequently is taken as the modal value.
Mode

Find the mode of the following data sets.
a. 110, 113, 116, 116, 118, 118, 118, 121 and 123.
Since 118 occurs more often than any other value, the mode is 118.
b. 2, 3, 5, 7 and 8.
Each value occurs once (all are equally frequent), so the data has no mode.
c. 15, 18, 18, 18, 20, 22, 24, 24, 24, 26 and 26.
18 and 24 each occur three times, hence the modal values are 18 and 24 (bi-modal).
d. 5, 6, 6, 7, 9, 9, 10, 12 and 12.
6, 9 and 12 each occur twice, so the data is tri-modal (multi-modal): 6, 9 and 12.
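The four data sets above can be checked with a short frequency count. This is a minimal sketch; the helper name `modes` is an illustrative choice, and it follows the convention used here that a data set whose values are all equally frequent has no mode.

```python
from collections import Counter


def modes(data):
    """Return the modal values, or [] when every value is equally frequent."""
    counts = Counter(data)
    highest = max(counts.values())
    # All distinct values tie for the highest count: treat as "no mode".
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)


print(modes([110, 113, 116, 116, 118, 118, 118, 121, 123]))       # [118]
print(modes([2, 3, 5, 7, 8]))                                     # []
print(modes([15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]))        # [18, 24]
print(modes([5, 6, 6, 7, 9, 9, 10, 12, 12]))                      # [6, 9, 12]
```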
Median

Median is the halfway point in a data set.
It divides a data set into two equal parts such that half of the values are less than the median and half are greater than the median.
Graphically, the median is located at the intersection point of the less-than and more-than cumulative frequency curves.
Median

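The median (denoted by X̃) of a set of n observations X1, X2, …, Xn, arranged in ascending or descending order of magnitude, is the middle value if n is odd, or the arithmetic mean of the two middle values if n is even.
That is, writing $X_{(k)}$ for the kth value of the ordered data:
$\tilde{X} = X_{\left(\frac{n+1}{2}\right)}$ if n is odd, and $\tilde{X} = \frac{1}{2}\left(X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right)$ if n is even.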
Median

Find the median of the following two data sets:
a. 180, 201, 220, 191, 219, 209 and 220.
b. 62, 63, 64, 65, 66, 66, 68 and 78.
Using the formula for raw data, after arranging the data in ascending order:
a. 180, 191, 201, 209, 219, 220, 220 (n = 7), so the median is the 4th value = 209.
b. n = 8, so the median is (4th value + 5th value)/2 = (65 + 66)/2 = 65.5.
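The odd and even cases of the median rule can be sketched as a small function and applied to the two data sets above. The function name `median_of` is an illustrative choice; the standard library function `statistics.median` applies the same rule.

```python
def median_of(data):
    """Middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # the ((n + 1) / 2)th value, 1-based
    return (ordered[middle - 1] + ordered[middle]) / 2


set_a = [180, 201, 220, 191, 219, 209, 220]
set_b = [62, 63, 64, 65, 66, 66, 68, 78]

median_a = median_of(set_a)
median_b = median_of(set_b)
print(median_a, median_b)  # 209 65.5
```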
Other Measures of Location: Quantiles

The median divides a given data set into two equal parts.
There are also other positional measures that divide a given data set into more than two equal parts.
These measures are collectively known as quantiles.
Quantiles include quartiles, deciles and percentiles.
For all of these measures, the data should first be arranged in ascending order.
Quartiles

Quartiles are values that divide a data set into four equal parts.
These values are denoted by Q1, Q2 and Q3 such that 25% of the data fall below Q1, 50% below Q2 and 75% below Q3.
Let Qi be the ith quartile (i = 1, 2, 3), then
Quartiles

Given the data: 420, 430, 435, 438, 441, 449, 490, 500, 510 and 515. Find all the quartiles.
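The quartile formula itself did not survive in this copy of the slides, so the sketch below assumes one common textbook rule: the ith quartile sits at position i(n + 1)/4 in the ordered data, with linear interpolation between neighbouring values. Other conventions exist and give slightly different answers. The helper name `quantile` and its parameters are illustrative choices.

```python
def quantile(sorted_data, i, parts):
    """i-th cut point when the ordered data are split into `parts` equal parts.

    Assumed rule: position i * (n + 1) / parts (1-based), with linear
    interpolation between the two neighbouring observations.
    """
    n = len(sorted_data)
    position = i * (n + 1) / parts
    lower = int(position)          # whole part of the position
    fraction = position - lower    # how far to move towards the next value
    if lower < 1:
        return sorted_data[0]
    if lower >= n:
        return sorted_data[-1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)


data = sorted([420, 430, 435, 438, 441, 449, 490, 500, 510, 515])
q1, q2, q3 = (quantile(data, i, 4) for i in (1, 2, 3))
print(q1, q2, q3)  # 433.75 445.0 502.5
```

Under this rule Q2 equals the median of the data, (441 + 449)/2 = 445.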
Deciles

Deciles are values that divide the data into ten equal parts.
These values are denoted by D1, D2, …, D9 such that 10% of the data fall below D1, 20% below D2, …, 90% below D9.
Let Di be the ith decile (i = 1, 2, …, 9), then
Deciles

Given the data: 420, 430, 435, 438, 441, 449, 490, 500, 510 and 515. Find the 1st and 7th deciles.
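One way to check this exercise is with `statistics.quantiles` from the Python standard library. Its default `method="exclusive"` places the ith decile at position i(n + 1)/10 with linear interpolation; whether that matches the formula on the original slide is an assumption, since the formula is not shown here.

```python
from statistics import quantiles

data = [420, 430, 435, 438, 441, 449, 490, 500, 510, 515]

# n=10 asks for the nine cut points D1 .. D9.
deciles = quantiles(data, n=10)
d1, d7 = deciles[0], deciles[6]
print(d1, d7)  # 421.0 497.0
```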
Percentiles

Percentiles are values that divide a data set into 100 equal parts. These values are denoted by P1, P2, …, P99.
Let Pi be the ith percentile (i = 1, 2, …, 99), then
Percentiles

Given the data: 420, 430, 435, 438, 441, 449, 490, 500, 510 and 515. Find the 40th and 75th percentiles.
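The same standard-library function can be used for percentiles. As before, the default "exclusive" method (position i(n + 1)/100 with linear interpolation) is an assumed convention, not necessarily the one on the original slide. The sketch also checks that the 50th percentile coincides with the median.

```python
from statistics import median, quantiles

data = [420, 430, 435, 438, 441, 449, 490, 500, 510, 515]

# n=100 asks for the 99 cut points P1 .. P99.
percentiles = quantiles(data, n=100)
p40 = percentiles[39]
p50 = percentiles[49]
p75 = percentiles[74]

print(round(p40, 2), round(p75, 2))  # 439.2 502.5
print(round(p50, 2), median(data))   # both 445.0
```

Under this convention P75 equals the third quartile Q3, and P50 equals both the median and Q2.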
Relationship between Median, Quartiles, Deciles and
Percentiles
Measures of Variation

In the previous topic, we concentrated on central values (measures of central tendency), which give an idea of the whole mass, that is, a complete set of values.
However, the information so obtained is neither exhaustive nor comprehensive: these measures do not reveal how the values are spread (dispersed or scattered) on each side of the central value.
The mean does not tell us whether the observations are close to each other or far apart.
The median is a positional average and has nothing to do with the variability of the observations in a data set.
This leads us to conclude that a measure of central tendency is not enough to give a clear idea about the data unless all observations are the same.
Moreover, two or more data sets may have
the same mean and/or median but they may
measures of central tendency

The following table displays the price of a certain commodity in four cities. Find the mean and median prices for the four cities and interpret them.
measures of central tendency

All four data sets have a mean of 30 and a median of 30.
But by inspection it is apparent that the four data sets differ remarkably from one another.
So measures of central tendency alone do not provide enough information about the nature of the data.
Thus, to have a clear picture of the data, one needs a measure of dispersion or variability among the observations in the data set.
measure of dispersion

Variation or dispersion may be defined as the extent to which values are scattered around a measure of central tendency.
Thus, a measure of dispersion tells us the extent to which the values of a variable vary about the measure of central tendency.
Dispersion is the amount of variation, scatter or spread in data. A measure of dispersion is a value that indicates the degree of variability of the data.
Measures of Variation or Deviation

Range: The range of a set of data is the difference between the largest and smallest observed values in the data set, or the interval between these values.
It is a measure of the spread of the data.
For example, the range of 2, 3, 3, 5, 7, and 10 is 10 - 2 = 8.
Variance: It is a measure of variability based on the squared deviations of the observed values in the data set from their mean.
Since variance is the average of the squared deviations from the mean, it is also called the mean squared deviation.
Variance and Standard Deviation

Variance and standard deviation are the most widely used measures of dispersion, and both measure the average dispersion of the observations around the mean.
The variance of a data set is the sum of the squared deviations of each observation from the mean, divided by the total number of observations in the data set.
The positive square root of the variance is called the standard deviation.
Variance and Standard Deviation

For a population containing N elements, the population standard deviation is denoted by the Greek letter σ (sigma), and hence the population variance is denoted by σ²:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}$
Variance and Standard Deviation

For a sample of n elements, the sample variance and standard deviation, denoted by S² and S respectively, are calculated using the formulae:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{x})^2, \qquad S = \sqrt{S^2}$
Variance and Standard Deviation

Find the variance and standard deviation of: 20, 28, 40, 12, 30, 15 and 50.
a. Take the data as a population.
b. Consider it as a sample.
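Both parts of this exercise can be checked with the Python standard library. `pvariance` and `pstdev` divide by N (population), while `variance` and `stdev` divide by n - 1 (sample); the sketch assumes the sample formulas on the original slides use the usual n - 1 divisor.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [20, 28, 40, 12, 30, 15, 50]

# a. Treat the data as a population (divide by N).
pop_var = pvariance(data)
pop_sd = pstdev(data)

# b. Treat the data as a sample (divide by n - 1).
sample_var = variance(data)
sample_sd = stdev(data)

print(round(pop_var, 2), round(pop_sd, 2))        # 160.12 12.65
print(round(sample_var, 2), round(sample_sd, 2))  # 186.81 13.67
```

The sample figures are larger because the same sum of squared deviations (about 1120.86) is divided by 6 instead of 7.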
MEASURES OF SYMMETRY AND SHAPE

Measures of central tendency and dispersion can describe a distribution, but they are not sufficient to describe its nature.
For this purpose, we use two other concepts known as skewness and kurtosis.
Skewness

Skewness is based on the third moment of a distribution.
It is a measure of the asymmetry of a distribution.
It represents the shape of the distribution.
Skewness

Skewness means lack of symmetry.
A distribution is said to be symmetrical when its values are distributed evenly on both sides of the mean; the normal distribution is the standard example.
In a symmetrical distribution the mean, median and mode coincide, that is, mean = median = mode.
Several measures are used to express the direction and extent of the skewness of a distribution.
The first one is the Coefficient of Skewness:
Normal Skewness

For a symmetric distribution Sk = 0.


The range for Sk is from -3 to 3.
The mode, the mean and the median have the
same value in a normal curve.
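The coefficient formula is not reproduced in this copy of the slides. The stated range of -3 to 3 is consistent with Pearson's median-based coefficient, Sk = 3(mean - median) / standard deviation, so the sketch below assumes that definition. The function name and the three small data sets are illustrative only.

```python
from statistics import mean, median, pstdev


def pearson_median_skewness(data):
    """Pearson's median-based coefficient: 3 * (mean - median) / std. dev."""
    return 3 * (mean(data) - median(data)) / pstdev(data)


# Illustrative data sets, not taken from the slides.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one extremely large value
left_skewed = [-x for x in right_skewed]    # mirror image

sk_symmetric = pearson_median_skewness(symmetric)
sk_right = pearson_median_skewness(right_skewed)
sk_left = pearson_median_skewness(left_skewed)
print(sk_symmetric, round(sk_right, 3), round(sk_left, 3))
```

A symmetric data set gives Sk = 0, an extremely large value pulls the mean above the median and makes Sk positive, and the mirror-image data set gives the same magnitude with a negative sign.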
Frequency Curves: Symmetric/Normal curve

A frequency curve is symmetric when it looks the same to the left and right of the central point.
The distribution spreads around a central value in a symmetrical, bell-shaped pattern.
The lengths of both tails are the same.
The mean, median and modal values are approximately equal.
The corresponding pairs of quartiles, deciles and percentiles are equidistant from the median. For example, the first quartile and the third quartile are the same distance from the median.
Positive Skewness

When the value of the skewness is positive, the tail of the distribution is longer towards the right-hand side of the curve.
The mass of the distribution is concentrated on the left.
It is a right-tailed (right-skewed) curve.
If the distribution is positively skewed, then Sk is positive.
In a positively skewed curve the median and the mean lie to the right of the mode, in the direction of the skewness.
In other words, the mean and the median are greater than the mode.
Frequency Curves: Positively skewed curve

If some observations are extremely large, the mean of the distribution becomes greater than the median or mode.
In such a case, the distribution is said to be positively skewed. In a positively skewed distribution:
The right tail of the frequency curve is more elongated; the longest tail lies to the right of the central point.
More values are on the left of the mean.
The extreme variation is towards large values (to the right). Smaller values are more frequent.
Mean > Median > Mode
Negative skewness

When the value of the skewness is negative, the tail of the distribution is longer towards the left-hand side of the curve.
The mass of the distribution is concentrated on the right.
It is a left-tailed (left-skewed) curve.
If the distribution is negatively skewed, then Sk is negative.
In a negatively skewed curve the median and the mean lie to the left of the mode.
In other words, the mean and the median are less than the mode.
Frequency Curves: Negatively skewed curve

If some extremely small observations are present, the mean is smaller than the median and the mode, and the distribution is said to be negatively skewed.
The left tail is more elongated.
More observations are concentrated on the right of the mean.
The extreme variation is towards lower values (to the left).
Larger values are more frequent than small values.
Mean < Median < Mode
Kurtosis

Kurtosis is a measure of the peakedness or convexity of a curve.
The term is derived from the Greek word kyrtos, meaning curved.
It is the degree of flatness or peakedness in the region around the mode of the frequency curve.
A positive value of excess kurtosis tells you that the distribution has heavy tails (a lot of data in the tails), and a negative value means that it has light tails (little data in the tails).
Kurtosis

(i) A distribution in which the values of the observations cluster heavily in the centre is peaked, or leptokurtic. Lepto means thin, referring to the narrow peak; its tails are heavier than those of the normal distribution.
A leptokurtic distribution has kurtosis > 3 (excess kurtosis > 0).
(ii) A flat distribution, with the values of the observations more evenly distributed and tails lighter than those of the normal distribution, is called platykurtic. Platy means flat.
A platykurtic distribution has kurtosis < 3 (excess kurtosis < 0).
(iii) A distribution that is almost normal, neither too peaked nor too flat, is called mesokurtic.
A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0).
Kurtosis

The kurtosis of a distribution can be quantified using the semi-interquartile range and the ninetieth and tenth percentiles.
This index is symbolized by the Greek letter κ (kappa) and is given by
$\kappa = \frac{Q}{P_{90} - P_{10}}, \qquad Q = \frac{Q_3 - Q_1}{2}$
[Figure: leptokurtic, platykurtic and mesokurtic curves]
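The percentile-based index described above (semi-interquartile range divided by the distance between the 90th and 10th percentiles) can be sketched with the standard library. The data set is the one used in the quantile exercises, and the quantile positions follow the default "exclusive" method of `statistics.quantiles`, which is an assumed convention.

```python
from statistics import quantiles


def percentile_kurtosis(data):
    """Semi-interquartile range divided by (P90 - P10)."""
    q1, _, q3 = quantiles(data, n=4)
    deciles = quantiles(data, n=10)
    p10, p90 = deciles[0], deciles[8]  # D1 = P10 and D9 = P90
    semi_interquartile_range = (q3 - q1) / 2
    return semi_interquartile_range / (p90 - p10)


data = [420, 430, 435, 438, 441, 449, 490, 500, 510, 515]
kappa = percentile_kurtosis(data)
print(round(kappa, 4))  # 0.3676
```

For a normal curve this index is about 0.263, so a value well above that suggests a flatter (platykurtic) shape and a value well below it a more peaked (leptokurtic) shape. Ten observations are far too few to draw a firm conclusion.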
Different Measurement of data

Scale of measurement of data: nominal, ordinal, interval, ratio
Measurement of central tendency (computational): mean (arithmetic, geometric, progressive, harmonic)
Measurement of central tendency (positional), or measure of center of location: quantiles (median, quartiles, deciles, percentiles), mode
Measure of deviation or variation: range, variance and standard deviation
Measure of symmetry and shape: skewness and kurtosis
Moments [of Statistical Distribution]

The shape of any distribution can be described by its various ‘moments’. The first four are:
1) The first moment is the mean, which indicates the central tendency of a distribution.
2) The second moment is the variance, which indicates the width or deviation.
3) The third moment is the coefficient of skewness, which indicates any asymmetric ‘leaning’ to either left or right.
4) The fourth moment is the coefficient of kurtosis, which indicates the degree of central ‘peakedness’ or, equivalently, the ‘fatness’ of the outer tails.
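The four moments listed above can be computed directly from their definitions. The sketch assumes the population convention (dividing by n) and the standardized forms m3 / m2^1.5 for skewness and m4 / m2^2 for kurtosis; the function names and the small data set are illustrative.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean) ** k over the data."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** k for x in data) / len(data)


def first_four_moments(data):
    """Mean, variance, moment-based skewness and kurtosis (divide-by-n form)."""
    m2 = central_moment(data, 2)
    m3 = central_moment(data, 3)
    m4 = central_moment(data, 4)
    return {
        "mean": sum(data) / len(data),
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


summary = first_four_moments([1, 2, 3, 4, 5])  # illustrative data
print(summary)
```

For this evenly spaced data set the skewness is 0 (symmetric) and the kurtosis is 1.7, which is below 3 and therefore platykurtic.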
