Lecture Notes

The document consists of lecture notes for a Statistics 1 course, covering key topics in descriptive statistics, including vocabulary, data collection methods, data presentation, and numerical descriptive measures. It aims to provide foundational knowledge for interpreting, summarizing, and communicating data insights effectively. The course emphasizes the importance of mastering descriptive statistics to make informed decisions across various fields.


Statistics 1 USTHB Page 1

Statistics 1: Lecture Notes

Course staff
M. A. MEZIANI — Lecturer
N. BOUGUERRA — TD + TP
Contents

1 Introduction

2 Vocabulary of descriptive statistics
    2.1 Statistics Vocabulary
    2.2 Collecting Data
    2.3 Operators
    2.4 Organisation of Data
        2.4.1 Objectives of Classification of Data
        2.4.2 Characteristics of a Good Classification
        2.4.3 Basis of Classification
    2.5 Univariate statistical series
        2.5.1 Types of Frequency Distribution
        2.5.2 Statistical Series
    2.6 Indexes
        2.6.1 Uses of Index Numbers
        2.6.2 Limitations of Index numbers
        2.6.3 Growth rate

3 Presentation of Data
    3.1 Tabular Presentation of Data
        3.1.1 Objectives of Tabulation
    3.2 Classification and Tabulation
        3.2.1 Discrete Data tabular representation
        3.2.2 Continuous Data Tabular representation
    3.3 Graphical Presentation of data
        3.3.1 Qualitative Data
        3.3.2 Quantitative Data

4 Numerical Descriptive Measures
    4.1 Measures of Ungrouped data
        4.1.1 Measures of Central Tendency
        4.1.2 Measures of position
        4.1.3 Numerical measures of variability
    4.2 Measures of grouped data
        4.2.1 Form measures
    4.3 Change of variable
    4.4 Measures of inequality (concentration measures)
        4.4.1 The Median – Medial difference
        4.4.2 The concentration ratio for groups
        4.4.3 The Lorenz curve (Concentration curve)
        4.4.4 GINI coefficient
        4.4.5 Herfindahl–Hirschman index

5 Conclusion
1 Introduction

Descriptive statistics serve as the cornerstone of statistical analysis, offering
vital tools and techniques to comprehend and present datasets effectively.
This comprehensive exploration into descriptive statistics aims to equip you
with the foundational knowledge necessary to interpret, summarize, and
communicate data insights. The journey begins with establishing a robust
vocabulary specific to descriptive statistics, ensuring clarity and precision in
discussing various statistical concepts. Subsequently, we delve into the art
of presenting data through tables and graphical representations, elucidat-
ing the significance of visual representation in conveying complex informa-
tion. Moreover, this course navigates through numerical measures, spanning
central tendency, position, variability, and concentration measures. Each
measure provides unique perspectives on dataset characteristics, uncover-
ing essential insights into the distribution, dispersion, and concentration of
values. Ultimately, mastering descriptive statistics empowers us to discern
patterns, derive meaningful conclusions, and make informed decisions across
diverse fields. Join us on this enlightening expedition into descriptive statis-
tics, unlocking the key to unraveling the intricacies of data analysis and
interpretation.

2 Vocabulary of descriptive statistics

2.1 Statistics Vocabulary

Definition 2.1.1. The Random House College Dictionary defines


statistics as “the science that deals with the collection, classification,
analysis, and interpretation of information or data.”

Often, we encounter random phenomena, i.e., phenomena whose out-


comes cannot be known or predicted with certainty. This is generally due
to the complexity of the studied phenomenon and/or a lack of information.
That is why Statistics and Probability theory are used in various fields.

Why learn statistics:

• Statistics is a fundamental research tool for many scientific disciplines,
yet even in research, statistics are very often misused.

• An article published in 2005, "Why Most Published Research Findings Are
False", has been the most downloaded paper in the PLoS Medicine journal.

• Over half of psychology studies fail reproducibility tests.

• Statistics is used throughout machine learning, data mining, computer
networks, information visualization, human-computer interaction, etc.


Statistics areas
The applications of statistics can be divided into two broad areas: descrip-
tive statistics and inferential statistics:

Definition 2.1.2 (Descriptive statistics). Descriptive statistics utilizes
numerical and graphical methods to look for patterns in a data set, to
summarize the information revealed in a data set, and to present that
information in a convenient form.

Definition 2.1.3 (Inferential statistics). Inferential statistics utilizes
sample data to make estimates, decisions, predictions, or other
generalizations about a larger set of data.

In this course, we are only interested in descriptive statistics.

Definition 2.1.4. An experimental (or observational) unit is an


object (e.g., person, thing, transaction, or event) about which we collect
data.

Definition 2.1.5. A population is a set of all units (usually people,


objects, transactions, or events) that we are interested in studying.

Definition 2.1.6. A variable is a characteristic or property of an in-


dividual experimental (or observational) unit in the population.

Definition 2.1.7 (Modalities). Each variable (character) studied can
present two or more modalities. The modalities are the different
situations in which individuals (units) can find themselves with regard
to the character. Each individual in the population presents one and only
one of the modalities of the character considered.

Definition 2.1.8. A sample is a subset of the units of a population.

Remark 2.1.1

• A sample, being a subset of the whole population, won't necessarily
resemble it.

• Thus, the information the sample provides about the population is
uncertain.

• The role of statistics is to find ways to deal with this uncertainty.


Types of data
All data (and hence the variables we measure) can be classified as one of
two general types: quantitative data and qualitative data.

Definition 2.1.9 (Quantitative variable). Quantitative data are mea-


surements that are recorded on a naturally occurring numerical scale.
It could be discrete (finite or countable number of values: age, etc.) or
continuous (sizes, etc.).

Remark 2.1.2
Quantitative data include interval and ratio levels of measurement. In-
terval and ratio levels of measurement refer to data obtained from numerical
variables, and meaning is given to the difference between measurements. An
interval scale indicates rank and distance from an arbitrary zero measured
in unit intervals. Ratio data indicate both rank and distance from a natural
zero, with ratios of two measures having meaning.

Definition 2.1.10 (Qualitative data). Qualitative (or categorical)


data are measurements that cannot be measured on a natural numerical
scale; they can only be classified into one of a group of categories. It
can be nominal (eye color, political affiliation, etc.) or ordinal when
all categories have a total order (very resistant, fairly resistant, not
very resistant). The different levels of a qualitative variable are called
modalities (or categories).

Figure 2.1: Methods of Collecting Data

Remark 2.1.3

• Quantitative data can be subclassified as either interval data or ratio
data. For ratio data, the origin (i.e., the value 0) is a meaningful
number. But the origin has no meaning with interval data. Consequently,
we can add and subtract interval data, but we can't multiply and divide
them.

• Often, we assign arbitrary numerical values to qualitative data for ease
of computer entry and analysis. But these assigned numerical values
are simply codes: they cannot be meaningfully added, subtracted,
multiplied, or divided.

2.2 Collecting Data

Definition 2.2.1. Data Collection is the process of collecting infor-


mation from relevant sources in order to find a solution to the given
statistical enquiry. Collection of Data is the first and foremost step in a
statistical investigation.

Generally, you can obtain data in three different ways:


1. From a published source

2. From a designed experiment

3. From an observational study (e.g., a survey)

Definition 2.2.2. A designed experiment is a data collection method


where the researcher exerts full control over the characteristics of the ex-
perimental units sampled. These experiments typically involve a group
of experimental units that are assigned the treatment and an untreated
(or control) group.

Definition 2.2.3. An observational study is a data collection method


where the experimental units sampled are observed in their natural set-
ting. No attempt is made to control the characteristics of the experi-
mental units sampled. (Examples include opinion polls and surveys.)

Where to find data


1. Published sources (websites, books, journals, etc.)

2. The second method of collecting data involves conducting a designed


experiment, in which the researcher exerts strict control over the
units (people, objects, or things) in the study. For example, an often-
cited medical study investigated the potential of aspirin in preventing
heart attacks.

3. In an observational study, the researcher observes the experimental


units in their natural setting and records the variable(s) of interest.
For example, a child psychologist might observe and record the level
of aggressive behaviour of a sample of fifth graders playing on a school
playground.

Regardless of which data collection method is employed, it is likely that


the data will be a sample from some population. And if we wish to apply
inferential statistics, we must obtain a representative sample.
Definition 2.2.4. A representative sample exhibits characteristics


typical of those possessed by the target population.

The most common way to satisfy the representative sample requirement


is to select a random sample. A simple random sample ensures that every
subset of fixed size in the population has the same chance of being included
in the sample.

Definition 2.2.5. A simple random sample of n experimental units


is a sample selected from the population in such a way that every dif-
ferent sample of size n has an equal chance of selection.
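
As a minimal sketch of Definition 2.2.5, the following Python snippet draws a simple random sample without replacement. The population of 100 numbered units, the sample size of 10, and the fixed seed are illustrative assumptions that do not come from the notes.

```python
import random

# Hypothetical population: 100 units identified by the numbers 1..100.
population = list(range(1, 101))

# A seeded generator is used only so that the sketch is reproducible.
rng = random.Random(42)

n = 10
# random.sample draws n distinct units without replacement, so every
# subset of size n has the same chance of being selected.
sample = rng.sample(population, n)

print(sorted(sample))
```

Because the draw is made without replacement, no unit can appear twice in the sample.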

Definition 2.2.6. Selection bias results when a subset of experimental


units in the population has little or no chance of being selected for the
sample.

Definition 2.2.7. Nonresponse bias is a type of selection bias that


results when data on all experimental units in a sample are not obtained.

Definition 2.2.8. Measurement error refers to inaccuracies in the


values of the data collected. In surveys, the error may be due to ambigu-
ous or leading questions and the interviewer’s effect on the respondent.

Ethics in Statistics
Intentionally selecting a biased sample in order to produce misleading statis-
tics is considered unethical statistical practice.
Most of the problems with the surveys result from the use of non-random
samples. These samples are subject to errors such as selection bias, non-
response bias, and measurement error. Researchers who are aware of these
problems yet continue to use the sample data to make inferences are prac-
tising unethical statistics.
2.3 Operators
Definition 2.3.1. In mathematics, an operator is generally a mapping
or function that acts on elements of a space to produce elements of another
space (possibly and sometimes required to be the same space).
Definition 2.3.2. In mathematics, summation is the addition of a
sequence of any kind of numbers, called addends or summands; the result is
their sum or total. Besides numbers, other types of values can be summed
as well: functions, vectors, matrices, polynomials and, in general, elements
of any type of mathematical object on which an operation denoted "+" is
defined.
Summations of infinite sequences are called series.
Definition 2.3.3. The sum Σ can be defined as:
\[ \sum_{i=m}^{n} a_i = a_m + a_{m+1} + \dots + a_{n-1} + a_n \]

where i is the index of summation; a_i is an indexed variable representing
each term of the sum; m is the lower bound of summation, and n is the
upper bound of summation.
Example:
\[ \sum_{i=3}^{6} i^2 = 3^2 + 4^2 + 5^2 + 6^2 = 86 \]

Properties of sums
• \( \sum_{i=1}^{n} (a_i + b_i) = \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i \)

• \( \sum_{i=1}^{n} c\,a_i = c \sum_{i=1}^{n} a_i \)

• \( \sum_{i=1}^{n} c = n\,c \)
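
The summation operator and its three properties can be checked numerically. The sketch below is only an illustration: the helper name `summation` and the sample sequences `a`, `b` and constant `c` are assumptions introduced here, not part of the notes.

```python
def summation(term, m, n):
    """Return term(m) + term(m+1) + ... + term(n), both bounds included."""
    return sum(term(i) for i in range(m, n + 1))

# The example of the notes: sum of i^2 for i = 3..6.
example = summation(lambda i: i ** 2, 3, 6)  # 9 + 16 + 25 + 36

# Arbitrary illustrative sequences and constant.
a = [2, 5, 7, 1]
b = [4, 0, 3, 6]
c = 3
n = len(a)

# Property 1: the sum of (a_i + b_i) equals sum(a_i) + sum(b_i).
additivity_ok = sum(x + y for x, y in zip(a, b)) == sum(a) + sum(b)
# Property 2: a constant factor can be taken out of the sum.
scaling_ok = sum(c * x for x in a) == c * sum(a)
# Property 3: summing the constant c over n terms gives n * c.
constant_ok = sum(c for _ in range(n)) == n * c

print(example, additivity_ok, scaling_ok, constant_ok)
```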

Definition 2.3.4. A product is the result of multiplication, or an
expression that identifies objects (numbers or variables) to be multiplied, called
factors. The product operator for the product of a sequence is denoted by
the capital Greek letter pi, Π, and can be defined as:
\[ \prod_{i=m}^{n} a_i = a_m \cdot a_{m+1} \cdots a_{n-1} \cdot a_n \]
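Properties of products
• \( \prod_{i=1}^{n} a_i\,b_i = \left( \prod_{i=1}^{n} a_i \right) \left( \prod_{i=1}^{n} b_i \right) \)

• \( \prod_{i=1}^{n} a_i^{k} = \left( \prod_{i=1}^{n} a_i \right)^{k} \)

• \( \prod_{i=1}^{n} c = c^{n} \)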
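
The product properties can be verified in the same spirit. This is a small sketch using `math.prod` from the Python standard library (available from Python 3.8); the sequences `a`, `b`, the exponent `k` and the constant `c` are arbitrary illustrative values.

```python
import math

# Arbitrary illustrative values.
a = [2, 3, 4]
b = [1, 5, 2]
k = 3
c = 2
n = len(a)

# Property 1: the product of a_i * b_i factorises.
factorisation_ok = math.prod(x * y for x, y in zip(a, b)) == math.prod(a) * math.prod(b)
# Property 2: the product of a_i^k is the k-th power of the product of a_i.
power_ok = math.prod(x ** k for x in a) == math.prod(a) ** k
# Property 3: multiplying the constant c by itself n times gives c^n.
constant_ok = math.prod(c for _ in range(n)) == c ** n

print(factorisation_ok, power_ok, constant_ok)
```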

2.4 Organisation of Data


The data collected by an investigator is in raw form and cannot, by itself,
support any meaningful conclusion; hence, it needs to be organized properly.
The process of systematically arranging the collected raw data so that it is
easy to understand is known as organization of data.
Definition 2.4.1. According to Conner, “Classification is the process of
arranging things (either actually or notionally) in groups or classes according
to their resemblances and affinities, and gives expression to the unity of
attributes that may exist amongst a diversity of individuals.”
Based on the definition of classification of data by Conner, the two basic
features of this process are:
• The raw data is divided into different groups. For example, on the
basis of marital status, people can be classified as married, unmarried,
divorced and engaged.
• The raw data is classified based on class similarities. All similar units
of the raw data are put together in one class. For example, every
educated person can be put together in one class and uneducated in
another.
Each group or division of the raw data classified on the basis of their
similarities is known as Class.

2.4.1 Objectives of Classification of Data


The major objectives of the classification of data are as follows:
• Brief and Simple: The main objective of the classification of data
is the presentation of the raw data in a systematic, brief and simple form.
This helps the investigator understand the data easily and efficiently
and draw meaningful conclusions from it.
Figure 2.2: Methods of Collecting Data

• Distinctiveness: Through classification of data, one can render the
obvious differences within the collected raw data more distinct.
• Utility: Classification of data brings out the similarities within the
diverse raw data of the study, which enhances its utility.
• Comparability: With the classification of data, one can easily
compare data and can also estimate it for various purposes.
• Scientific Arrangement: The process of classification of data
facilitates proper arrangement of raw data in a scientific manner. In this
way, one can increase the reliability of the collected data.

2.4.2 Characteristics of a Good Classification


• Clarity: Classification of the raw data is beneficial for an investigator
only when it provides a clear and simple form of information. Clarity
here means that there should not be any kind of confusion regarding
any element or part of a class.
• Comprehensiveness: There should be comprehensiveness in the
classification of the raw data so that each of its items gets a place in some
class. In other words, a classification is good if no item is left out of
the classes.
Figure 2.3: Methods of Collecting Data

• Homogeneity: All items of a class must be similar to one another.
Homogeneity among the items of a class ensures the best results in
further investigations.

• Stability: Stability in the same set of classification of data for a specific
kind of investigation is essential, as it does not confuse the investigator.
Therefore, the base of classification of data should not change with
every investigation.

• Suitability: The classes in the data classification process must suit the
motive of enquiry. For example, classifying children of a city based on
their weight, age, and sex for the investigation of literacy rate makes
no sense. The data for a literacy rate investigation must be divided into
classes such as educated and uneducated.

• Elasticity: Data classification can provide better results only if it is elastic
and hence has scope for change if there is any change in the scope or
objective of the investigation.

2.4.3 Basis of Classification


Statistical information can be classified into four different categories de-
scribed below:
1. Geographical or Spatial Classification


Under this category, the data is classified on the basis of location or
geographical differences in the data. In other words, geographical classification
involves classifying data according to the geographical region. For
example, to perform a study on the production of cotton in Algeria, we can take
four major regions and classify the data based on this geographical
classification as:

Table 2.1: Production of Cotton in Different Regions (in kg)


Region Production of Cotton
North Algeria 2893
South Algeria 898
East Algeria 2198
West Algeria 1570

2. Chronological Classification
Under this category, the data is classified on the basis of time of existence,
like months, weeks, days, years, quarters, etc. In chronological data classi-
fication, the given data is arranged either in descending order or ascending
order with reference to the time as years, months, days, weeks, quarters, etc.
Another name for chronological classification is temporal classification. For
example, profits of a company in three years 2010, 2011 and 2012.

Table 2.2: Profits in Different Years (in millions)


Year Profits (millions)
2010 20
2011 50
2012 90

3. Qualitative Classification
Under this category, the given data is classified based on its attributes or
qualities. The attributes or qualities of data include hair colour, gender, in-
telligence, religion, honesty, etc. In the qualitative classification of data, one
cannot measure the attributes of the study; instead, one can only discover
whether the attribute is present or not.
4. Quantitative or Numerical Classification


As the name suggests, under the quantitative classification of data, the col-
lected data is classified on the basis of numerical values. The variables of
quantities under the quantitative classification of data can be either oper-
ated on or estimated for further analysis. These measurable characteristics
include age, income, weight, height, etc. For example, classification of 50
students in a class based on their weight.

Table 2.3: Distribution of Students by Weight (in kg)


Weight (in kg) Number of Students
30-40 10
40-50 22
50-60 8
60-70 7
70-80 3

2.5 Univariate statistical series


Definition 2.5.1. A univariate statistical series is a mapping, denoted X,
from a finite set Ω called the population to a set θ called the set
of character values:

\[ X : \Omega \to \theta = X(\Omega), \qquad \omega \mapsto X(\omega) \]

To each individual ω, we associate the value X(ω) taken by the measured
character.

Example 2.5.1. Ω: Students admitted to the baccalaureate.
X: The baccalaureate grade (mention).
X(Ω): {Fair, Fairly Good, Good, Very Good, Excellent}.

Ω: University teachers.
X: Family situation.
X(Ω): {Single, Married, Divorced, Widowed}.
• Discrete case:
\[ X(\Omega) = \{x_1, x_2, \dots, x_k\} \]

• Continuous case: X(Ω) = [a, b], where this interval is subdivided
into k classes:
\[ [a_0, a_1[,\ [a_1, a_2[,\ \dots,\ [a_{k-1}, a_k[ \]
with \( a_0 = a \) and \( a_k = b \).

Definition 2.5.2. A frequency distribution is a table used to orga-


nize data. The left column (called classes or groups) includes all possible
responses on a variable being studied. The right column is a list of the
frequencies, or number of observations, for each class. A relative fre-
quency distribution is obtained by dividing each frequency by the
number of observations.

Definition 2.5.3. A cumulative frequency distribution contains


the total number of observations whose values are less than the upper
limit for each class. We construct a cumulative frequency distribution
by adding the frequencies of all frequency distribution classes up to
and including the present class. In a relative cumulative frequency
distribution, cumulative frequencies can be expressed as cumulative pro-
portions or percents.

Definition 2.5.4. The class percentage is the class relative frequency
multiplied by 100; that is,
class percentage = (class relative frequency) × 100
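
Definitions 2.5.2 to 2.5.4 can be illustrated with the counts of Table 2.3 (distribution of 50 students by weight). The following Python sketch computes the relative, cumulative, and cumulative relative frequencies and the class percentages; the variable names are choices made here for illustration only.

```python
from itertools import accumulate

# Classes and frequencies taken from Table 2.3 (weights in kg).
classes = ["30-40", "40-50", "50-60", "60-70", "70-80"]
freq = [10, 22, 8, 7, 3]

n = sum(freq)                      # total number of observations
cum = list(accumulate(freq))       # cumulative frequencies
rel = [f / n for f in freq]        # relative frequencies
pct = [100 * f / n for f in freq]  # class percentages
cum_rel = [c / n for c in cum]     # cumulative relative frequencies

for row in zip(classes, freq, cum, rel, pct, cum_rel):
    print(*row)
```

The last cumulative frequency equals the total number of observations, and the last cumulative relative frequency equals 1, which is a quick check that no class was forgotten.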

2.5.1 Types of Frequency Distribution


It is not always possible for an investigator to easily measure the items of a
series or set of data. To make the data simple and easy to read and analyse,
the items of the series are placed within a range of values or limits. In other
words, the given raw set of data is categorized into different classes with a
range, known as Class Intervals. Every item of the given series is put against
a class interval with the help of tally bars. The number of items occurring
in the specific range or class interval is shown under Frequency against that
particular class range to which the item belongs.

1. Exclusive Series
The series with class intervals, in which all the items having the range from
the lower limit to the value just below its upper limit are included, is known
as the Exclusive Series. This kind of frequency distribution is known as
exclusive series because the frequencies corresponding to the specific class
interval do not include the value of its upper limit. For example, if a class
interval is 0-10, and the values of the given series are 4, 10, 2, 15, 8, and 9,
then only 4, 2, 8, and 9 will be included in the 0-10 class interval. 10 and 15
will be included in the next class interval, i.e., 10-20. Also, the upper limit
of a class interval is the lower limit of the next class interval.
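
The rule of the exclusive series can be sketched in Python with the values of the example above. The helper `exclusive_class` is a hypothetical name, and the sketch assumes non-negative values and classes of equal width 10 starting at 0.

```python
from collections import defaultdict

def exclusive_class(value, width=10):
    """Return the class (lower, upper) with lower <= value < upper."""
    lower = (value // width) * width
    return (lower, lower + width)

values = [4, 10, 2, 15, 8, 9]

# Group each value under its class interval; the upper limit is excluded,
# so 10 falls in the class 10-20 and not in 0-10.
groups = defaultdict(list)
for v in values:
    groups[exclusive_class(v)].append(v)

for (lower, upper), members in sorted(groups.items()):
    print(f"{lower}-{upper}: {members} (frequency {len(members)})")
```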

Table 2.4: Age Distribution

Age Class Frequency
0-10 25
10-20 42
20-30 38
30-40 17
40-50 12
50-60 8

2. Inclusive Series
The series with class intervals, in which all the items ranging from
the lower limit up to and including the upper limit are included, is known as
an Inclusive Series. Unlike in an exclusive series, the upper limit of one class
interval does not repeat itself as the lower limit of the next class interval.
Therefore, there is a gap (between 0.1 and 1) between the upper limit of one
class interval and the lower limit of the next class interval. For example, the
class intervals of an inclusive series can be 0-9, 10-19, 20-29, 30-39, and so on.
In this case, the gap between the upper limit of one class interval and the lower
limit of the next class interval is 1, and, unlike in an exclusive series,
consecutive class intervals do not share a boundary value.
Sometimes it gets difficult to perform statistical analysis with inclusive


series. In those cases, the inclusive series is converted into exclusive series.

Table 2.5: Student Marks Distribution


Marks Interval Frequency
0-9 25
10-19 42
20-29 38
30-39 17
40-49 12
50-59 8

Conversion of Inclusive Series into Exclusive Series


For statistical calculation, it sometimes becomes necessary to convert an
inclusive series into an exclusive series. Suppose that, in the above example, some
students have obtained marks such as 9.5 or 19.5, which fall between two
classes. In this case, the series is converted into an exclusive series.
The steps for converting an inclusive series into an exclusive series are:
• First, calculate the difference between the upper limit of one class
interval and the lower limit of the next class interval.
• Next, divide the difference by two, then add the resulting value to the
upper limit of every class interval and subtract it from the lower limit
of every class interval.
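
The two conversion steps can be sketched as follows. The function name `to_exclusive` is hypothetical, and the sketch assumes that the gap between consecutive classes is the same throughout the series, so it is measured once on the first two classes.

```python
def to_exclusive(intervals):
    """Convert inclusive class intervals into exclusive ones.

    intervals: list of (lower, upper) pairs with a constant gap
    between the upper limit of a class and the lower limit of the next.
    """
    # Step 1: gap between one upper limit and the next lower limit.
    gap = intervals[1][0] - intervals[0][1]
    # Step 2: half the gap is subtracted from every lower limit and
    # added to every upper limit.
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in intervals]

inclusive = [(0, 9), (10, 19), (20, 29), (30, 39)]
exclusive = to_exclusive(inclusive)
print(exclusive)
```

After the conversion, marks such as 9.5 or 19.5 sit exactly on a class boundary, and each upper limit coincides with the next lower limit as in an exclusive series. Applying the rule mechanically also moves the first lower limit to -0.5.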

3. Open End Series


Sometimes the lower limit of the first class interval and the upper limit
of the last class interval are not available; instead, Less than or Below is
mentioned in the former case (in place of the lower limit of the first class
interval), and More than or Above is mentioned in the latter case (in place of
the upper limit of the last class interval). These types of series are known as
Open End Series.

4. Equal and Unequal Class Interval Series


Equal Class Interval Series:
When the classes of a series are of the same interval, it is known as Equal
Class Interval Series.
Table 2.6: Age Distribution


Age Class Frequency
Below 10 25
11-20 42
21-30 38
31-40 17
41-50 12
51 and Above 8

Unequal Class Interval Series:
When the classes of a series are of unequal interval, it is known as Unequal
Class Interval Series.

• Discrete case: Let X be a univariate statistical series, and suppose that
\[ X(\Omega) = \{x_1, x_2, \dots, x_k\} \]

• Frequency of the value \( x_i \): the number of times this value is
repeated in the population, denoted \( n_i \).

• Cumulative frequency of \( x_i \): the sum of the frequencies of all values
up to and including \( x_i \), denoted
\[ \tilde{n}_i = n_1 + n_2 + \dots + n_i \]

• The relative frequency of \( x_i \) is defined as:
\[ f_i = \frac{n_i}{n} \]

• n being the sample size:
\[ n = \sum_{i=1}^{k} n_i \]

• The cumulative relative frequency at \( x_i \):
\[ \tilde{f}_i = \sum_{j=1}^{i} f_j = \frac{\tilde{n}_i}{n} \]
n
• Continuous case: Let X be a univariate statistical series, and suppose that
X(Ω) = [a, b], where this interval is subdivided into k classes:
\[ [a_0, a_1[,\ [a_1, a_2[,\ \dots,\ [a_{k-1}, a_k[ \]
with \( a_0 = a \) and \( a_k = b \).

• Frequency of the i-th class \( [a_{i-1}, a_i[ \): the number of character values
that fall in this class, denoted \( n_i \).

• Cumulative frequency at \( a_i \): the quantity
\[ \tilde{n}_i = \sum_{j=1}^{i} n_j \]

• Relative frequency of the i-th class \( [a_{i-1}, a_i[ \): the quantity
\[ f_i = \frac{n_i}{n} \]

• Cumulative relative frequency at \( a_i \):
\[ \tilde{f}_i = \sum_{j=1}^{i} f_j = \frac{\tilde{n}_i}{n} \]

2.5.2 Statistical Series


The family of pairs \( (x_i, n_i)_{i=1,\dots,k} \) is called a discrete statistical series.
The family of pairs \( ([a_{i-1}, a_i[,\ n_i)_{i=1,\dots,k} \) is called a continuous statistical series.

2.6 Indexes
We are a part of a fast-paced economy. Numerous changes in the size of
the population, output, money supply, income, and price of commodities
are taking place continuously in an ever-changing environment. Economic
changes have their effects on the volume of economic activity, income and
employment, the general price level, and other factors. Since most such values
evolve, people are interested in learning how much and in what way they
have changed over time. The index number helps in studying
changes in consumption, production, imports, exports, cost of living, crimes,
accidents, national income, business failures, and other phenomena.
Definition 2.6.1. An index shows the development of a number over


time.
An illustration can help explain the concept of index numbers. Suppose that
prices rise in 2023. In this situation, three basic questions arise.
• First, in comparison to which year did prices rise in 2023?
• Second, how should we respond when certain products' prices increase more
rapidly than others?
• Third, is there a standard unit in which the prices of different
goods and services can be expressed? The price of milk is
expressed in DA per liter, of cloth in DA per meter, and of sweets
in DA per kilogram.
All three questions can be answered by studying index numbers.
• First, the rise in prices in 2023 can be studied using a previous
year, such as 2020 or 2015, as the reference (base) year.
• Second, the index number suggests considering an average change. For
example, if the price of potatoes rises from 100 DA to 200 DA and the
price of onions rises from 100 DA to 300 DA, then the average price is
considered to be 250 DA, i.e. \( \frac{200+300}{2} = \frac{500}{2} = 250 \).
• Third, the index number suggests considering only the percentage
change. Thus, the unit of the commodity loses its relevance.
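
The potato and onion example can be reproduced with a short Python sketch. It assumes, as one common convention that the notes do not state explicitly, that each price is expressed relative to a base value of 100 and that the overall index is the unweighted mean of these price relatives; the function name `price_relative` is introduced here for illustration.

```python
def price_relative(base_price, current_price):
    """Current price expressed as a percentage of the base price (base = 100)."""
    return 100 * current_price / base_price

# (base price, current price) in DA, from the example in the notes.
prices = {"potato": (100, 200), "onion": (100, 300)}

relatives = {item: price_relative(base, current)
             for item, (base, current) in prices.items()}

# Unweighted average of the price relatives (an assumed aggregation rule).
simple_index = sum(relatives.values()) / len(relatives)

print(relatives, simple_index)
```

Because both base prices equal 100 DA, the price relatives coincide with the new prices, which is why the notes obtain the same value of 250 by averaging 200 and 300 directly.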
An index number is used to display changes in a variable or group of
related variables through time, space, or other factors. Comparisons can be
made between elements like people, hospitals, schools, etc. It tracks changes
in the value of variables such as the cost of living, the volume of production
in various industries, the prices of a list of defined commodities, and so on.
Also, as the index numbers are used to feel the pulse of the economy, they
are also known as the Barometer of Economic Activity.

2.6.1 Uses of Index Numbers


In practically all areas of economic activity, changes are measured using
index numbers. They help in recording changes in output, income, employment,
business activities, productivity, etc.
1. Measurement and Comparison of Changes in the Price Level:


It is not possible to measure changes in the price level of different variables in
absolute terms; index numbers therefore provide a relative measure of the
changes in the magnitude of a group of variables. They can be used to assess the
influence of changes in the value of money on different sections of society,
which helps in tackling inflation or deflation in the system.

2. Helps in Policy Formulation:


An index number is an important tool for government or non-government
organizations in the following ways:

• In policy formulation, there is a need for a base or reference. With
index numbers, the trends of different phenomena can be studied.

• They can also be utilized in the formulation and planning of government
and business policies.

3. Acts as an Economic Barometer:


A barometer is an instrument that is used to measure atmospheric pressure.
Index numbers play the same role for the economy: they indicate fluctuations
in the general conditions of a country and measure the pulse of the economy.

4. Helps in Studying Trends:


Index numbers are useful in studying the trend of a series over a period of
time. They help in forecasting future trends, which is crucial for the future
operations of any business or production activity. They also aid in
determining patterns in exports, imports, prices, and several other
phenomena. For example, if someone is planning to open a new business,
studying the trend of prices, wages, income, etc., in different industries
will help in planning the future course of action for the business.

5. Measure Purchasing Power:


Money's worth is determined by purchasing power, and purchasing power
is determined by commodity prices. Index numbers help find the intrinsic
value of money as contrasted with its nominal value, which helps in
establishing the nation's wage policy. The value of money moves inversely
with commodity prices: when the price level of a commodity rises, the
purchasing power or value of money falls.

6. Helps in Deflating Various Values:

With price index numbers, it becomes easy to adjust the monetary figures
of different periods for changes in prices. For example, a country's
national income is calculated on the basis of the prices of the year in
question. However, the real change in the production level of goods
and services cannot be revealed through the national income computed at
the current year's prices. To know the real change, we first have to adjust
the figures for the price changes in different years. This adjustment is
possible only by using price index numbers. When prices have risen,
the process of adjustment is known as deflating.
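As an illustration of deflating, the sketch below divides each current-price figure by the price index of its year and rescales by the base value 100. The income figures, the index values and the function name `deflate` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def deflate(nominal_value, price_index, base=100.0):
    """Express a current-price figure in base-year prices."""
    return nominal_value * base / price_index


# Hypothetical national income at current prices and price index (base year = 100).
nominal_income = {2021: 500.0, 2022: 560.0, 2023: 630.0}
price_index = {2021: 100.0, 2022: 112.0, 2023: 120.0}

real_income = {
    year: deflate(nominal_income[year], price_index[year])
    for year in nominal_income
}

for year, value in real_income.items():
    print(year, round(value, 1))
```

With these assumed numbers, nominal income rises every year, while real income is unchanged between 2021 and 2022: the whole nominal rise in 2022 is a price effect.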

2.6.2 Limitations of Index numbers


The limitations of Index Numbers are as follows:

1. Provides Relative Changes only:

Index numbers estimate relative changes only; they are approximate
indicators rather than exact measurements. Moreover, they represent a
generalized picture based on the overall average of the items, and therefore
do not apply to specific units.

2. Lack of Perfect Accuracy:

Index numbers do not consider every item and are based on the sample
items. Thus, in case of an inadequate sample or a sample selected through
a faulty process, there will be inaccurate results.

3. Ignores Qualitative Changes:

Index numbers do not pay any attention to qualitative changes in the
product when a price or production index is constructed. An increase in
price may result from an improvement in the quality of the product, but this
is neglected in the index numbers.

4. Possibility of Manipulations:
Index numbers can be created in a way that allows for manipulation. This
manipulation can be made in a selection of a particular base year, a partic-
ular group of commodities, a specific set of prices, etc.

5. Difference between purpose and method of Construction:


When the index numbers are created for a specific purpose using a specific
methodology, they are not suitable for all situations and uses. If they are
employed for other reasons, they are bound to produce incorrect conclusions.

2.6.3 Growth rate

Definition 2.6.2. Growth rates refer to the percentage change of a


specific variable within a specific time period. Growth rates can be
positive or negative, depending on whether the size of the variable is
increasing or decreasing over time. Growth rates were first used by
biologists studying population sizes, but they have since been brought
into use in studying economic activity, corporate management, or
investment returns.

Growth rates can be calculated in several ways, depending on what the
figure is intended to convey. A simple growth rate divides the difference
between the ending value (EV) and the beginning value (BV) by the beginning
value, or (EV − BV)/BV. The economic growth rate for a country's GDP can thus
be computed as:
\[ \text{Economic Growth} = \frac{GDP_2 - GDP_1}{GDP_1} \]
where GDP = gross domestic product of the nation, with GDP_1 the beginning
value and GDP_2 the ending value.
This approach, however, may be overly simplistic.
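As a quick numerical check of the simple growth rate (EV − BV)/BV, here is a minimal sketch. The GDP figures and the function name `growth_rate` are hypothetical.

```python
def growth_rate(beginning_value, ending_value):
    """Simple growth rate: (EV - BV) / BV."""
    if beginning_value == 0:
        raise ValueError("the beginning value must be non-zero")
    return (ending_value - beginning_value) / beginning_value


# Hypothetical GDP figures for two consecutive years.
gdp_1, gdp_2 = 200.0, 210.0
rate = growth_rate(gdp_1, gdp_2)
print(f"{rate:.1%}")

# A fall in the variable gives a negative growth rate.
decline = growth_rate(200.0, 190.0)
print(f"{decline:.1%}")
```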
3 Presentation of Data

3.1 Tabular Presentation of Data


The systematic presentation of numerical data in rows and columns is known
as Tabulation. It is designed to make presentation simpler and analysis
easier. This type of presentation facilitates comparison by putting relevant
information close to one another, and it helps in further statistical analysis
and interpretation. One of the most important devices for presenting the
data in a condensed and readily comprehensible form is tabulation. It aims
to provide as much information as possible in the minimum possible space
while maintaining the quality and usefulness of the data.

3.1.1 Objectives of Tabulation


The aim of tabulation is to summarise a large amount of numerical infor-
mation into the simplest form. The following are the main objectives of
tabulation:

• To make complex data simpler: The main aim of tabulation is to
present the classified data in a systematic way. The purpose is to
condense the bulk of information (data) under investigation into a
simple and meaningful form.

• To save space: Tabulation tries to save space by condensing data in
a meaningful form while maintaining the quality and quantity of the
data.

• To facilitate comparison: It also aims to facilitate quick comparison of
various observations by providing the data in a tabular form.

• To facilitate statistical analysis: Tabulation aims to facilitate
statistical analysis because it is the stage between data classification
and data presentation. Various statistical measures, including averages,
dispersion, correlation, and others, are easily calculated from data that
has been systematically tabulated.

• To provide a reference: Since data can be easily identified and used
when organised in tables with titles and table numbers, tabulation
aims to provide a reference for future studies.

3.2 Classification and Tabulation


For performing statistical analysis, various kinds of data are gathered by
the investigator or analyst. The information gathered is usually in raw form
which is difficult to analyse. To make the analysis meaningful and easy, the
raw data is converted or classified into different categories based on their
characteristics. This grouping of data into different categories or classes
with similar or homogeneous characteristics is known as the Classification
of Data. Each division or class of the gathered data is known as a Class.

3.2.1 Discrete Data tabular representation

Table 3.1: Frequency Distribution

Class   Frequency   Cumulative Frequency   Relative Frequency   Cumulative Relative Frequency
x1      n1          n1                     f1                   f1
x2      n2          n1 + n2                f2                   f1 + f2
...     ...         ...                    ...                  ...
xk      nk          n                      fk                   1

Example: We noted the number of children X in 20 families. The
results are shown in Figure 3.1.
Solution. Before constructing the statistical table, we must first sort the
values in ascending order (Figure 3.2), in order to facilitate the counting
of the data. We then represent the data in a statistical table (Figure 3.3):

Figure 3.1: Number of children from 20 families

Figure 3.2: Number of children from 20 families ordered.

Figure 3.3: Tabular presentation of example’s data.
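The construction of such a table can be sketched in Python. The 20 values below are hypothetical (the data of the example appear only in the figures), and the function name `frequency_table` is an illustrative choice; each row holds the value, its frequency, cumulative frequency, relative frequency and cumulative relative frequency.

```python
from collections import Counter


def frequency_table(values):
    """Return rows (x_i, n_i, cumulative n_i, f_i, cumulative f_i), sorted by x_i."""
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0
    for x in sorted(counts):
        cumulative += counts[x]
        rows.append((x, counts[x], cumulative, counts[x] / n, cumulative / n))
    return rows


# Hypothetical number of children observed in 20 families.
children = [2, 0, 1, 3, 2, 1, 4, 2, 0, 1, 2, 3, 1, 2, 5, 2, 1, 3, 0, 2]

table = frequency_table(children)
for row in table:
    print(row)
```

The last row always ends with a cumulative frequency of n and a cumulative relative frequency of 1, as in Table 3.1.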



Figure 3.4: Frequency distribution for continuous data.

3.2.2 Continuous Data Tabular representation

Table 3.2: Frequency Distribution

Class           Frequency   Cumulative Frequency   Relative Frequency   Cumulative Relative Frequency
[a0, a1[        n1          n1                     f1                   f1
[a1, a2[        n2          n1 + n2                f2                   f1 + f2
...             ...         ...                    ...                  ...
[ak−1, ak[      nk          n                      fk                   1

Example: Consider the following data collected on a sample of 100
Mathematics students. Complete the following statistical table:

3.3 Graphical Presentation of data


The technique of presenting statistical data in the form of diagrams such
as bar diagrams, cartograms, pie diagrams, and pictograms is known as the
Diagrammatic or Graphical Presentation of Data.
Statistics serves a crucial role in simplifying intricate datasets to
enhance comprehension. Classification and tabulation are effective techniques
for presenting data in an intelligible manner. However, as data volumes
grow, comprehending them solely through classification and tabulation
becomes progressively challenging. Consequently, data is often depicted using
diagrams and graphs, facilitating quick comparisons and a more intuitive
grasp of data patterns across various scenarios.

3.3.1 Qualitative Data


Two of the most widely used graphical methods for describing qualitative
data are bar graphs and pie charts.

Definition 3.3.1. Bar Graph: The categories (classes) of the quali-


tative variable are represented by bars, where the height of each bar is
either the class frequency, class relative frequency, or class percentage.

Definition 3.3.2. Pie Chart: A circle divided into sectors, one for each
category (modality) of the studied characteristic, in which the
i-th modality is represented by a sector of angle θi = 2πfi radians,
that is, 360° × fi (where fi is the relative frequency of the i-th
modality).

Definition 3.3.3. Pareto Diagram: A bar graph with the categories


(classes) of the qualitative variable (i.e., the bars) arranged by height in
descending order from left to right.

One goal of a Pareto diagram (named for the Italian economist Vilfredo
Pareto) is to make it easy to locate the “most important” categories—those
with the largest frequencies.
Example: The following table provides the distribution of blood groups
for 1000 randomly selected individuals.
Figure 3.5: Pie chart for blood type groups

Figure 3.6: Bar graph for blood type groups

Figure 3.7: Pareto Diagram for causes of Customers’ complaints


Table 3.3: Blood Group Distribution

Blood Group   Frequency   Relative Frequency   Angle θi = 360° × fi (degrees)
A             350         0.35                 126
B             50          0.05                 18
AB            100         0.10                 36
O             500         0.50                 180
Total         1000        1.00                 360

Figure 3.8: Frequency distribution table
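The angles of the pie chart for the blood group data can be recomputed from the frequencies, as in the short sketch below. The angles are expressed in degrees (360° × fi), which is the same quantity as 2πfi in radians; the variable names are illustrative.

```python
# Frequencies from Table 3.3 (blood groups of 1000 individuals).
blood_groups = {"A": 350, "B": 50, "AB": 100, "O": 500}

total = sum(blood_groups.values())

# Sector angle in degrees: 360 * relative frequency.
angles = {group: 360 * count / total for group, count in blood_groups.items()}

for group, angle in angles.items():
    print(group, angle)
```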

3.3.2 Quantitative Data


Discrete data

Definition 3.3.4 (Bar Chart). Let (xi , ni )i=1,..,p be a discrete statistical


series. A bar chart, or bar graph, is a figure obtained by placing the
values xi on the x-axis and the frequencies or relative frequencies on the
y-axis in a Cartesian coordinate system, and drawing segments parallel
to the y-axis.
The polygon connecting the tops of the bars is called the polygon
of frequencies (or polygon of relative frequencies).

Example: A survey conducted in a village focuses on the number of


dependent children per family. The following are the results:
The bar chart below represents the distribution of the frequency:

Figure 3.9: Bar Chart example

Continuous data
Histogram: Let $([a_{i-1}, a_i[, n_i)_{i=1,..,p}$ or $([a_{i-1}, a_i[, f_i)_{i=1,..,p}$
be a continuous statistical series.
The histogram of this series is obtained by placing the classes on the
x-axis and the frequencies $n_i$ (or relative frequencies $f_i$) on the
y-axis, resulting in adjacent rectangles.
The polygonal line connecting the midpoints of the upper edges of the
rectangles is called the polygon of frequencies (or polygon of relative
frequencies) of the series.
Example: We have observations on the diameter of a mechanical piece
in millimeters.
The distribution of this variable is shown in the graph (3.11).
Comment. The use of class boundaries in histograms assures us that the
bars of the histogram touch and that no data fall on the boundaries. Both
of these features are important.
Figure 3.10: Frequency distribution table for continuous data.

Figure 3.11: Example of a Histogram and Polygon.

Figure 3.12: Grades obtained from 30 students.

Characterisation parameters
- Number of Classes
Given the series $([a_{i-1}, a_i[, n_i)_{i=1,..,p}$ with n observations, we
determine the number of classes, k, using the formula:
\[ k = \lfloor \sqrt{n} \rfloor \]
where $\lfloor x \rfloor = \max\{m \in \mathbb{Z} : m \le x\}$ is the integer part of x.
- Range
The range of the series represents the difference between the largest and
smallest values in a series (largest and smallest observations), and is
denoted as e:
\[ e = x_{\max} - x_{\min} \]
- Amplitude (Class width)
The range of the series is then divided by the desired number of classes to
obtain an approximation of the amplitude that each class should have:
\[ a = \frac{e}{k} \]

Example 3.3.1. The scores (X) obtained by 30 students have been
recorded. The results are shown in Figure 3.12:

• The studied population: All students.
• The studied characteristic/variable: the obtained score.
• Its nature: continuous quantitative.
• Number of classes: $k = \lfloor \sqrt{30} \rfloor = 5$
• Range of the series: e = 17 − 2 = 15
• Amplitude: a = 15/5 = 3
• Figure 3.13 contains the various classes, frequencies and cumulative
frequencies, relative frequencies, and cumulative relative frequencies.

Figure 3.13: Frequency distribution table.

These results can be represented by the frequency histogram and polygon
shown in Figure (3.14).
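The three characterisation parameters of this example can be reproduced with a few lines of Python. This is only a sketch: the function name `class_parameters` is illustrative, and it assumes equal-width classes starting at the smallest observation.

```python
import math


def class_parameters(n, x_min, x_max):
    """Number of classes k = floor(sqrt(n)), range e, amplitude a, class boundaries."""
    k = math.isqrt(n)           # integer part of sqrt(n)
    e = x_max - x_min           # range of the series
    a = e / k                   # amplitude (class width)
    boundaries = [x_min + i * a for i in range(k + 1)]
    return k, e, a, boundaries


# Scores of 30 students, smallest value 2 and largest value 17.
k, e, a, boundaries = class_parameters(30, 2, 17)
print(k, e, a)
print(boundaries)
```

The boundaries give the five classes [2, 5[, [5, 8[, [8, 11[, [11, 14[ and [14, 17].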

Figure 3.14: Histogram and Polygon for the Grades’ example.

Distribution Shapes

Histograms are valuable and useful tools. If the raw data came from a
random sample of population values, the histogram constructed from the
sample values should have a distribution shape that is reasonably similar to
that of the population.
We can visually determine whether data are evenly spread from its mid-
dle or center. Sometimes the center of the data divides a graph of the
distribution into two “mirror images,” so that the portion on one side of the
middle is nearly identical to the portion on the other side. Graphs that have
this shape are symmetric; those without this shape are asymmetric, or
skewed.

Definition 3.3.5 (Symmetry). The shape of a distribution is said to be


symmetric if the observations are balanced, or approximately evenly
distributed, about its center.

Figure 3.15: Distribution Shapes

Definition 3.3.6 (Skewness). A distribution is skewed, or asymmet-


ric, if the observations are not symmetrically distributed on either side
of the center. A skewed-right distribution (sometimes called positively
skewed) has a tail that extends farther to the right. A skewed-left dis-
tribution (sometimes called negatively skewed) has a tail that extends
farther to the left.

Several terms are commonly used to describe histograms and their asso-
ciated population distributions.

1. Mound-shaped symmetrical (Bell-shaped or Normal): This


term refers to a histogram in which both sides are (more or less) the
same when the graph is folded vertically down the middle. Figure
3.15(a) shows a typical mound-shaped symmetrical histogram.

2. Uniform or rectangular: These terms refer to a histogram in which


every class has equal frequency. From one point of view, a uniform
distribution is symmetrical with the added property that the bars are
of the same height. Figure 3.15(b) illustrates a typical histogram with
a uniform shape.

3. Skewed left or skewed right: These terms refer to a histogram in
which one tail is stretched out longer than the other. The direction of
skewness is on the side of the longer tail. So, if the longer tail is on the
left, we say the histogram is skewed to the left. Figure 3.15(c) shows a
typical histogram skewed to the left and another skewed to the right.

4. Bimodal: This term refers to a histogram in which the two classes


with the largest frequencies are separated by at least one class. The top
two frequencies of these classes may have slightly different values. This
type of situation sometimes indicates that we are sampling from two
different populations. Figure 3.15(d) illustrates a typical histogram
with a bimodal shape.

CRITICAL THINKING:
A bimodal distribution shape might indicate that the data are from two
different populations. For instance, a histogram showing the heights of a
random sample of adults is likely to be bimodal because two populations,
male and female, were combined.

Remark 3.3.1
If there are gaps in the histogram between bars at either end of the graph,
the data set might include outliers.

Definition 3.3.7. Outliers in a data set are data values that are very
different from other measurements in the data set.

Misleading Histograms
We know that the width of all intervals should be the same. Suppose a
data set contains many observations that fall into a relatively narrow part
of the range, whereas others are widely dispersed. We might be tempted to
construct a frequency distribution with narrow intervals where the bulk of
the observations are and broader ones elsewhere. Even if we remember that
it is the areas, rather than the heights, of the rectangles of the histogram
that must be proportional to the frequencies, it is still never a desirable
option to construct such a histogram with different widths because it may
easily deceive or distort the findings. We include this section simply to point
out potential errors that we might find in histograms.

Figure 3.16: Misleading Histogram of Grocery Receipts

Cumulative distribution function


Discrete variable:

Definition 3.3.8. The cumulative distribution function is denoted by F
and defined as:
\[ F : \mathbb{R} \to [0, 1], \qquad x \mapsto F(x) = \sum_{x_i \le x} f_i \]
Here, F is a step function that is constant on each interval between two
consecutive values, and the $x_i$ are its points of discontinuity.

Example 3.3.3. Let's revisit the table (3.1) (X: the number of dependent
children per family). For the cumulative distribution function F(x),
we find the following values:
\[
F(x) =
\begin{cases}
0    & x < 0 \\
0.09 & 0 \le x < 1 \\
0.25 & 1 \le x < 2 \\
0.58 & 2 \le x < 3 \\
0.78 & 3 \le x < 4 \\
0.94 & 4 \le x < 5 \\
0.99 & 5 \le x < 6 \\
1    & x \ge 6
\end{cases}
\]
This is represented graphically in Figure (3.17).

Figure 3.17: Cumulative distribution function for the example number of children.
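The step function of this example can be evaluated with a short sketch. The relative frequencies below are the successive differences of the cumulative values stated in the example (0.09, 0.25 − 0.09, and so on); the function name `cdf` is illustrative.

```python
from bisect import bisect_right
from itertools import accumulate

# Values of X and relative frequencies implied by the cumulative values above.
values = [0, 1, 2, 3, 4, 5, 6]
rel_freq = [0.09, 0.16, 0.33, 0.20, 0.16, 0.05, 0.01]
cumulative = list(accumulate(rel_freq))


def cdf(x):
    """F(x) = sum of the f_i over all x_i <= x (a step function)."""
    count = bisect_right(values, x)   # number of values x_i with x_i <= x
    return 0.0 if count == 0 else cumulative[count - 1]


for x in (-1, 0, 2.5, 5, 10):
    print(x, round(cdf(x), 2))
```

Between two consecutive values the function stays flat, and it jumps by fi at each xi.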

Continuous variable:

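Definition 3.3.9. The cumulative distribution function is called F and
is defined as follows:
\[ F : \mathbb{R} \to [0, 1], \qquad
x \mapsto F(x) = \sum_{j=1}^{i-1} f_j + \frac{x - a_{i-1}}{a_i - a_{i-1}} \, f_i,
\quad \forall x \in [a_{i-1}, a_i] \]
with F(x) = 0 for x below the first class and F(x) = 1 for x above the last
class.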

Example 3.3.5. Let's revisit the table (Figure 3.13). For the cumulative
distribution function F(x), we find the following values:
\[
F(x) =
\begin{cases}
0 & x < 2 \\
0 + \dfrac{x-2}{5-2} \cdot 0.13 & x \in [2, 5] \\
0.13 + \dfrac{x-5}{8-5} \cdot 0.17 & x \in [5, 8] \\
0.30 + \dfrac{x-8}{11-8} \cdot 0.37 & x \in [8, 11] \\
0.67 + \dfrac{x-11}{14-11} \cdot 0.27 & x \in [11, 14] \\
0.94 + \dfrac{x-14}{17-14} \cdot 0.06 & x \in [14, 17] \\
1 & x \ge 17
\end{cases}
\]

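The graph representing the cumulative distribution function, often referred
to as the cumulative curve, can be constructed directly from the cumulative
frequencies by joining, with straight line segments, the points whose
abscissa is the upper boundary of each class and whose ordinate is the
corresponding cumulative relative frequency, starting from zero at the lower
boundary of the first class.
This function is continuous on the entire real line R.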
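The piecewise linear cumulative distribution function of the scores example can be sketched as follows, using the class boundaries and relative frequencies that appear in the formula of the example. The function name `cdf` is illustrative.

```python
# Class boundaries and relative frequencies of the scores example.
bounds = [2, 5, 8, 11, 14, 17]
rel_freq = [0.13, 0.17, 0.37, 0.27, 0.06]


def cdf(x):
    """Linear interpolation of the cumulative relative frequency inside each class."""
    if x < bounds[0]:
        return 0.0
    if x >= bounds[-1]:
        return 1.0
    cumulative = 0.0
    for lower, upper, f in zip(bounds, bounds[1:], rel_freq):
        if x < upper:
            return cumulative + (x - lower) / (upper - lower) * f
        cumulative += f
    return 1.0


for x in (1, 5, 9.5, 14, 20):
    print(x, round(cdf(x), 3))
```

At each class boundary the function takes the cumulative relative frequency of the classes below it, so the curve has no jumps.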

Time Series data

Definition 3.3.10. A time series is a set of measurements, ordered


over time, on a particular quantity of interest. In a time series the
sequence of the observations is important. A line chart, also called
a time-series plot, is a series of data plotted at various time intervals.
Measuring time along the horizontal axis and the numerical quantity
of interest along the vertical axis yields a point on the graph for each
observation. Joining points adjacent in time by straight lines produces
a time-series plot.

There are two types of time series graphs: (I) Graphs with one variable,
and (II) Graphs with two variables or more than two variables.
Figure 3.18:

[Line chart "Yearly Exports (in crores)": Year (2017 to 2021) on the horizontal axis, Exports (in crores) on the vertical axis]

Figure 3.19: One variable Time series plot

Example 3.3.6. Create a time series graph with the annual export data
given in Table (3.4).

Table 3.4: Years

Year   Exports (×10^6)
2017   10
2018   14
2019   8
2020   20
2021   25

Example 3.3.7. The data in Table (3.5) shows the exports and imports
in different years. Use this information to draw a time series graph.

Table 3.5: Yearly Exports and Imports (in crores)


Year Exports (in crores) Imports (in crores)
2017 5 9
2018 7 12
2019 10 15
2020 6 9
2021 13 16

[Line chart "Yearly Exports and Imports (in crores)": Year (2017 to 2021) on the horizontal axis, Amount (in crores) on the vertical axis, one coloured line each for Exports and Imports]

Figure 3.20: Two variables Time series plot


Figure 3.21: SAT Math Scores: First-Year Students (Time-Series Plot)

Figure 3.22: SAT Math Scores: First-Year Students (Revised Time-Series Plot)

Time series graphs are important tools in various applications of
statistics. When recording values of the same variable over an extended
period of time, it is sometimes difficult to discern any trend or pattern.
However, once the same data points are displayed graphically, some features
jump out. Time series graphs make trends easy to spot.
More details about time series will be discussed in Statistics 2.

Misleading Time-Series Plots


By selecting a particular scale of measurement, we can, in a time-series plot,
create an impression either of relative stability or of substantial fluctuation
over time.
4 Numerical Descriptive Measures

In Chapter 3, we discussed how to summarize data using tables and to
display data using graphs. Graphs are one important component of statistics;
however, it is also important to numerically describe the main
characteristics of a data set. Numerical summary measures, such as the ones
that identify the center and spread of a distribution, capture many important
features of a distribution.
As you'll see, a large number of numerical methods are available to
describe quantitative data sets. Most of these methods measure one of four
data characteristics:

1. The central tendency of the set of measurements, that is, the tendency
of the data to cluster, or center, about certain numerical values.

2. The position of the set of measurements.

3. The variability of the set of measurements, that is, the spread of the
data.

4. The concentration of the data (or inequality).


4.1 Measures of Ungrouped data


Numerical measures for ungrouped data involve calculations and statistical
techniques applied directly to individual, raw data points. These mea-
sures offer insights into the central tendency, dispersion, and shape of the
dataset without categorizing the data into intervals or groups.

4.1.1 Measures of Central Tendency


We often represent a data set by numerical summary measures, usually
called the typical values. A measure of central tendency gives the center of
a histogram or a frequency distribution curve.
The most popular and best understood measure of central tendency for a
quantitative data set is the arithmetic mean (or simply the mean or average)
of the data set.

Definition 4.1.1. Mean. The arithmetic mean (or simply mean) of
a set of data is the sum of the data values divided by the number of
observations. If the data set is the entire population of data, then the
population mean, µ, is a parameter given by
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N} \tag{4.1} \]
where N = population size.
If the data set is from a sample, then the sample mean, x̄, is a
statistic given by
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{4.2} \]
where n = sample size. The mean is appropriate for numerical data.

Another important measure of central tendency is the median.

Definition 4.1.2. The median is the middle observation of a set of
observations that are arranged in increasing (or decreasing) order.

• If the sample size, n, is an odd number, the median is the middle
observation.

• If the sample size, n, is an even number, the median is the average
of the two middle observations.

In both cases, the median is the number located in the 0.5(n + 1)th
ordered position, where a fractional position such as 4.5 is read as the
average of the 4th and 5th ordered observations.

Example 4.1.1. The following data give the prices (in thousands of
dollars) of seven houses selected from all houses sold last month in a
city.
312 257 421 289 526 374 497

• First, we rank the given data in increasing order as follows:
257 289 312 374 421 497 526

• Since there are seven homes in this data set, the middle term
is the fourth term, and the median is given by the value of the fourth
term in the ranked data, that is, 374 (thousand dollars).

Definition 4.1.3. The mode is the value that occurs with the highest
frequency in a data set. A distribution with one mode is called unimodal;
with two modes, it is called bimodal; and with more than two modes,
the distribution is said to be multimodal. The mode is most commonly
used with categorical data.
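The three measures of central tendency can be checked with Python's standard `statistics` module. The house prices are those of Example 4.1.1; the list of blood groups used for the mode is hypothetical.

```python
import statistics

# Prices (in thousands of dollars) of the seven houses of Example 4.1.1.
prices = [312, 257, 421, 289, 526, 374, 497]

mean_price = statistics.mean(prices)       # sum of the values divided by n
median_price = statistics.median(prices)   # middle value of the ranked data

# Hypothetical categorical data: the mode is the most frequent category.
blood_groups = ["O", "A", "O", "B", "A", "O", "AB"]
most_common = statistics.mode(blood_groups)

print(round(mean_price, 2), median_price, most_common)
```

Here the mean (about 382.29) is larger than the median (374), because the two most expensive houses pull the mean upward.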

Shape of a Distribution
Earlier, we described the shape of a distribution graphically as symmetric
or skewed by examining a histogram. For continuous numerical unimodal data,
the mean is usually less than the median in a skewed-left distribution and the
mean is usually greater than the median in a skewed-right distribution. In
a symmetric distribution the mean and median are equal. This relationship
between the mean and the median may not hold for discrete numerical
variables or for some continuous numerical variables.

Figure 4.1: Detecting Skewness by Comparing the Mean and the Median

Definition 4.1.4. The geometric mean, $\bar{x}_g$, is the nth root of the
product of n numbers:
\[ \bar{x}_g = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n} \tag{4.3} \]

Business analysts and economists who are interested in growth over a number
of time periods use the geometric mean.
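To illustrate the use of the geometric mean for growth, the sketch below averages three yearly growth factors. The factors are hypothetical (a rise of 10%, a rise of 20%, then a fall of 5%), and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

# Hypothetical growth factors over three years: +10%, +20%, -5%.
growth_factors = [1.10, 1.20, 0.95]

average_factor = statistics.geometric_mean(growth_factors)
average_rate = average_factor - 1

# Applying the average factor three times reproduces the total growth.
total_growth = 1.10 * 1.20 * 0.95

print(f"average yearly growth: {average_rate:.2%}")
print(round(average_factor ** 3, 4), round(total_growth, 4))
```

The arithmetic mean of the three factors would not have this property, which is why the geometric mean is preferred for compounded growth.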

Figure 4.2: Quartiles

4.1.2 Measures of position


A measure of position determines the position of a single value in relation
to other values in a sample or a population data set.

Definition 4.1.5. Quartiles are three summary measures that divide a
ranked data set into four equal parts. The second quartile is the same as
the median of a data set. The first quartile is the value of the middle
term among the observations that are less than the median, and the
third quartile is the value of the middle term among the observations
that are greater than the median.

• Q1 = the value in the 0.25(n + 1)th ordered position

• Q2 = the value in the 0.5(n + 1)th ordered position

• Q3 = the value in the 0.75(n + 1)th ordered position

Approximately 25% of the values in a ranked data set are less than Q1
and about 75% are greater than Q1. The second quartile, Q2, divides a
ranked data set into two equal parts; hence, the second quartile and the
median are the same. Approximately 75% of the data values are less than
Q3 and about 25% are greater than Q3.
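The position rule for the quartiles can be sketched as follows, using the seven house prices of Example 4.1.1. The text does not say how to handle a fractional position; the linear interpolation between neighbouring observations used here is an assumption, and the function names are illustrative.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based ordered position; fractional positions are
    interpolated linearly between neighbours (an assumption of this sketch)."""
    n = len(sorted_values)
    position = min(max(position, 1), n)   # keep the position inside 1..n
    lower = int(position)
    fraction = position - lower
    if lower >= n:
        return float(sorted_values[-1])
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """Q1, Q2, Q3 at the 0.25(n+1)th, 0.5(n+1)th and 0.75(n+1)th positions."""
    data = sorted(values)
    n = len(data)
    return tuple(value_at_position(data, p * (n + 1)) for p in (0.25, 0.5, 0.75))


prices = [312, 257, 421, 289, 526, 374, 497]
q1, q2, q3 = quartiles(prices)
print(q1, q2, q3)
```

With n = 7 the positions are 2, 4 and 6, so the quartiles are simply the 2nd, 4th and 6th ranked prices.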
In describing numerical data, we often refer to the five-number summary.

Figure 4.3: Percentiles

Definition 4.1.6. The five-number summary refers to the five de-


scriptive measures: minimum, first quartile, median, third quartile, and
maximum.

minimum < Q1 < median < Q3 < maximum

Definition 4.1.7. Percentiles are the summary measures that divide
a ranked data set into 100 equal parts. Each (ranked) data set has 99
percentiles that divide it into 100 equal parts.
The (approximate) value of the kth percentile, denoted by $P_k$, is
\[ P_k = \text{value of the } \left( \frac{kn}{100} \right)\text{th term in a ranked data set} \]
where k denotes the number of the percentile and n represents the
sample size.

Figure 4.4: Box plot

4.1.3 Numerical measures of variability


Measures of central tendency provide only a partial description of a quan-
titative data set. The description is incomplete without a measure of the
variability, or spread, of the data set. Knowledge of the data set’s variability,
along with knowledge of its center, can help us visualize the shape of the
data set as well as its extreme values.

Definition 4.1.8. The difference between the third and the first quar-
tiles gives the interquartile range; that is,

IQR = Interquartile range = Q3 − Q1

Definition 4.1.9. A box-and-whisker plot gives a graphic presenta-


tion of data using five measures: the median, the first quartile, the third
quartile, and the smallest and the largest values in the data set between
the lower and the upper inner fences. A box-and-whisker plot can help
us visualize the center, the spread, and the skewness of a data set. It
also helps detect outliers.

Definition 4.1.10. With respect to variance, the population variance,
$\sigma^2$, is the sum of the squared differences between each observation and
the population mean divided by the population size, N:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \tag{4.4} \]
The sample variance, $s^2$, is the sum of the squared differences between
each observation and the sample mean divided by the sample size, n,
minus 1:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \tag{4.5} \]

To compute the variance requires squaring the distances, which then


changes the unit of measurement to square units. The standard deviation,
which is the square root of variance, restores the data to their original mea-
surement unit. If the original measurements were in feet, the variance would
be in feet squared, but the standard deviation would be in feet. The stan-
dard deviation measures the average spread around the mean.

Definition 4.1.11. With respect to standard deviation, the population
standard deviation, σ, is the (positive) square root of the population
variance and is defined as follows:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \tag{4.6} \]
The sample standard deviation, s, is as follows:
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \tag{4.7} \]
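The difference between the population formulas (division by N) and the sample formulas (division by n − 1) can be seen on a small data set. The eight values below are hypothetical and chosen so that the population results are whole numbers; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical data set with mean 5 and sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(data)   # 32 / 8
population_sd = statistics.pstdev(data)            # square root of the above
sample_variance = statistics.variance(data)        # 32 / 7
sample_sd = statistics.stdev(data)

print(population_variance, population_sd)
print(round(sample_variance, 3), round(sample_sd, 3))
```

The sample variance is slightly larger than the population variance computed on the same values, because the divisor is n − 1 instead of n.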

Definition 4.1.12. The coefficient of variation, CV, is a measure of
relative dispersion that expresses the standard deviation as a percentage
of the mean (provided the mean is positive). The population coefficient
of variation is
\[ CV = \frac{\sigma}{\mu} \times 100\% \quad \text{if } \mu > 0 \]
The sample coefficient of variation is
\[ CV = \frac{s}{\bar{x}} \times 100\% \quad \text{if } \bar{x} > 0 \]

Figure 4.5: Graphical representation for small and large Standard Deviation

If the standard deviations in sales for large and small stores selling sim-
ilar goods are compared, the standard deviation for large stores will almost
always be greater. A simple explanation is that a large store could be mod-
eled as a number of small stores. Comparing variation using the standard
deviation would be misleading. The coefficient of variation overcomes this
problem by adjusting for the scale of units in the population.
The coefficient of variation provides a very good idea of the degree of
homogeneity of a distribution. The lower the coefficient of variation, the
more homogeneous the series is. A coefficient of variation below 15% appears
to be, in many cases, an indication of good homogeneity in the distribution
of the data.
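The store comparison can be made concrete with a small sketch. The sales figures are hypothetical, with the large store selling exactly ten times as much as the small one, and the function name `coefficient_of_variation` is illustrative.

```python
import statistics


def coefficient_of_variation(sample):
    """Sample CV in percent: s / mean * 100, defined for a positive mean."""
    mean = statistics.mean(sample)
    if mean <= 0:
        raise ValueError("the coefficient of variation requires a positive mean")
    return statistics.stdev(sample) / mean * 100


# Hypothetical weekly sales of a small store and of a store ten times larger.
small_store = [8, 10, 12]
large_store = [80, 100, 120]

sd_small = statistics.stdev(small_store)
sd_large = statistics.stdev(large_store)
cv_small = coefficient_of_variation(small_store)
cv_large = coefficient_of_variation(large_store)

print(sd_small, sd_large)    # standard deviations differ by a factor of 10
print(cv_small, cv_large)    # relative dispersion is the same
```

The standard deviation of the large store is ten times larger, yet both stores have the same relative dispersion once the scale is removed.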

4.2 Measures of grouped data


Numerical measures for grouped data are employed when the dataset is
organized into intervals or classes. Grouped data involves summarizing
raw data into intervals, allowing for a more concise representation, especially
with large datasets.

The mean
Suppose that the data are grouped into k classes, with frequencies
$n_1, n_2, ..., n_k$ and modalities $x_1, x_2, ..., x_k$. Then, the sample
mean is given by
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i x_i \tag{4.8} \]
where $n_i$ is the frequency of the modality $x_i$. For the continuous case,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i c_i \tag{4.9} \]
where $c_i$ is the center of the ith class.
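A minimal sketch of the grouped mean for the continuous case follows. The class boundaries are those of the scores example; the frequencies 4, 5, 11, 8, 2 are an assumption, chosen because they are consistent with the rounded relative frequencies 0.13, 0.17, 0.37, 0.27, 0.06 for n = 30, and they are not taken from the original table.

```python
# Class boundaries of the scores example and ASSUMED class frequencies (n = 30).
bounds = [2, 5, 8, 11, 14, 17]
freqs = [4, 5, 11, 8, 2]


def grouped_mean(bounds, freqs):
    """Mean of grouped data: (1/n) * sum of n_i * c_i, with c_i the class center."""
    centers = [(lower + upper) / 2 for lower, upper in zip(bounds, bounds[1:])]
    n = sum(freqs)
    return sum(n_i * c_i for n_i, c_i in zip(freqs, centers)) / n


mean_score = grouped_mean(bounds, freqs)
print(mean_score)
```

Every observation in a class is replaced by the class center, so the result is an approximation of the mean of the raw scores.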

Mode in the continuous case

Definition 4.2.1. The modal class is the class that corresponds
to the highest frequency. If the modal class is unique, then the mode is
calculated as follows:
\[ M_0 = L_{M_0} + \frac{d_1}{d_1 + d_2} \, a_m \]
where,
$L_{M_0}$: the lower bound of the modal class.
$d_1$: the frequency of the modal class - the frequency of the previous
class.
$d_2$: the frequency of the modal class - the frequency of the next class.
$a_m$: the amplitude of the modal class.
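The mode formula can be sketched as below. The class frequencies 4, 5, 11, 8, 2 for the scores example are an assumption (they match the rounded relative frequencies of that example for n = 30), and treating a missing neighbouring class as having frequency 0 is a choice of this sketch, not something stated in the definition.

```python
# Class boundaries of the scores example and ASSUMED class frequencies.
bounds = [2, 5, 8, 11, 14, 17]
freqs = [4, 5, 11, 8, 2]


def grouped_mode(bounds, freqs):
    """Mode = L + d1 / (d1 + d2) * a, computed on the (unique) modal class."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])    # index of the modal class
    previous = freqs[m - 1] if m > 0 else 0               # assumption: 0 if no class
    following = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = freqs[m] - previous
    d2 = freqs[m] - following
    amplitude = bounds[m + 1] - bounds[m]
    return bounds[m] + d1 / (d1 + d2) * amplitude


mode_score = grouped_mode(bounds, freqs)
print(mode_score)
```

With these frequencies the modal class is [8, 11[, d1 = 11 − 5 = 6 and d2 = 11 − 8 = 3, so the mode is 8 + (6/9) × 3 = 10.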

Remark 4.2.1
Mode by Graphical Method

The mode for a given set of series can be located through a histogram.
The steps to locate the mode graphically are as follows:

ˆ Step 1
Draw a histogram of the given set of series.

ˆ Step 2
The rectangle of the histogram with the greatest height will be the modal
class of the series.

ˆ Step 3
Draw a line joining the top left corner/point of the rectangle of the
modal class to the top left corner/point of the rectangle of the succeed-
ing class.

ˆ Step 4
Similarly, draw a line joining the top right corner/point of the rectangle
of the modal class to the top right corner/point of the rectangle of the
preceding class.

ˆ Step 5
Draw a line perpendicular to the X-axis from the intersection point of
the two lines drawn in the previous two steps.

ˆ Step 6
The point at which the perpendicular line cuts the X-axis is the modal
value of the given series.
An example is given in Figure (4.6)

Example: We revisit the table of students’ scores.

Figure 4.6: Calculating the mode graphically

Figure 4.7: Mode calculated in the continuous case

The Median
- Discrete Case
Let $x_1 \le x_2 \le \dots \le x_n$ be a discrete sample.
The median of this series, denoted $Med$, is defined as the value such that half of the values are less than or equal to it and half of the values are greater than or equal to it.
$$Med = \begin{cases} \dfrac{x_{n/2} + x_{n/2+1}}{2} & \text{if } n \text{ is even} \\[2mm] x_{(n+1)/2} & \text{if } n \text{ is odd} \end{cases}$$

Figure 4.8:

Example: We revisit the table of students’ scores.


- Continuous Case
The median class is defined as the first class for which the cumulative frequency is greater than or equal to n/2. The median is calculated using the following formula:
$$Med = L_{Med} + \frac{\frac{n}{2} - n^{cum}_{med}}{n_{med}} \cdot a_{med}$$
where,
$L_{Med}$: the lower bound of the median class.
$n^{cum}_{med}$: the cumulative frequency of the previous class.
$n_{med}$: the frequency of the median class.
$a_{med}$: the amplitude of the median class.
Example: We revisit the table of students’ scores.
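
The continuous-case median formula can be sketched in a few lines. The table below is hypothetical (the students' scores table is not reproduced here), and the function name is an assumption made for the illustration.

```python
def grouped_median(bounds, freqs):
    """Interpolated median; bounds has one more entry than freqs."""
    n = sum(freqs)
    cum_prev = 0  # cumulative frequency of the classes before the current one
    for i, f in enumerate(freqs):
        if cum_prev + f >= n / 2:  # first class whose cumulative frequency reaches n/2
            amplitude = bounds[i + 1] - bounds[i]
            return bounds[i] + (n / 2 - cum_prev) / f * amplitude
        cum_prev += f

# Hypothetical classes [0,10[, [10,20[, [20,30[ with frequencies 3, 4, 3.
median = grouped_median([0, 10, 20, 30], [3, 4, 3])  # 10 + (5 - 3)/4 * 10
print(median)
```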

Quantiles
- Discrete Case
We use the formula:
$$Q_\alpha = \begin{cases} \dfrac{x_{n\alpha} + x_{n\alpha+1}}{2} & \text{if } n\alpha \text{ is an integer} \\[2mm] x_{\mathrm{ENT}(n\alpha)+1} & \text{if } n\alpha \text{ is not an integer} \end{cases}$$

Figure 4.9:

- Continuous Case
First, we determine the interval $[l_1, l_2[$, which corresponds to the first class for which the cumulative frequency is greater than or equal to $n\alpha$. Then, we apply the following formula:

$$Q_\alpha = l_1 + \frac{n\alpha - n^{cum}_{l_1}}{n_{[l_1,l_2[}} \cdot a_m$$

where,
$n^{cum}_{l_1}$: the cumulative frequency of the previous class.
$n_{[l_1,l_2[}$: the frequency of the interval $[l_1, l_2[$.
$a_m$: the amplitude of the interval $[l_1, l_2[$.
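
Both quantile rules can be sketched as below. The function names and the data are assumptions for illustration; the discrete rule uses 1-based ranks as in the formula, shifted to Python's 0-based indexing, and it is meant for 0 < α < 1.

```python
import math

def discrete_quantile(sorted_values, alpha):
    """Q_alpha for a sorted sample, following the two-case rule above."""
    n = len(sorted_values)
    pos = n * alpha
    if pos == int(pos):  # n*alpha is an integer (beware of floating-point rounding)
        k = int(pos)
        return (sorted_values[k - 1] + sorted_values[k]) / 2  # (x_k + x_{k+1}) / 2
    return sorted_values[math.floor(pos)]  # x_{ENT(n*alpha)+1}

def grouped_quantile(bounds, freqs, alpha):
    """Interpolated Q_alpha for classed data."""
    target = sum(freqs) * alpha
    cum_prev = 0
    for i, f in enumerate(freqs):
        if cum_prev + f >= target:
            return bounds[i] + (target - cum_prev) / f * (bounds[i + 1] - bounds[i])
        cum_prev += f

q1_discrete = discrete_quantile([1, 3, 5, 7, 9, 11, 13, 15], 0.25)  # n*alpha = 2, so (3 + 5) / 2
q3_grouped = grouped_quantile([0, 10, 20, 30], [3, 4, 3], 0.75)      # 20 + (7.5 - 7)/3 * 10
print(q1_discrete, q3_grouped)
```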

The variance
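For discrete data, the sample variance is

$$s^2 = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^2}{N} \tag{4.10}$$

and for continuous data,

$$s^2 = \frac{\sum_{i=1}^{k} n_i (c_i - \bar{x})^2}{N} \tag{4.11}$$

where $N = \sum_{i=1}^{k} n_i$ is the total frequency (denoted n in the formulas for the mean).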
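
A minimal sketch of formulas (4.10) and (4.11) on a hypothetical table follows (centers and frequencies are assumptions, not course data).

```python
def grouped_variance(values, freqs):
    """Variance of grouped data; values are modalities or class centers."""
    total = sum(freqs)
    mean = sum(f * v for v, f in zip(values, freqs)) / total
    return sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / total

# Hypothetical class centers 5, 15, 25 with frequencies 2, 5, 3 (mean 16).
variance = grouped_variance([5, 15, 25], [2, 5, 3])  # (2*121 + 5*1 + 3*81) / 10
std_dev = variance ** 0.5
print(variance, std_dev)
```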

Moments
The moment of order r is defined as the number
$$m_r = \frac{1}{n}\sum_{i=1}^{k} n_i x_i^r \quad \text{(discrete case)}$$

$$m_r = \frac{1}{n}\sum_{i=1}^{k} n_i c_i^r \quad \text{(continuous case)}$$
Central Moments
The central moment of order r is defined as the number
$$\mu_r = \frac{1}{n}\sum_{i=1}^{k} n_i (x_i - \bar{X})^r \quad \text{(discrete case)}$$

$$\mu_r = \frac{1}{n}\sum_{i=1}^{k} n_i (c_i - \bar{X})^r \quad \text{(continuous case)}$$

Remark 4.2.2

$$m_1 = \bar{X}$$
$$\mu_1 = 0$$
$$\mu_2 = \sigma^2$$
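
The identities of Remark 4.2.2 can be checked numerically. The sketch below uses a single hypothetical helper, `moment`, with an optional center (0 for ordinary moments, the mean for central moments), and made-up data.

```python
def moment(values, freqs, r, center=0.0):
    """Moment of order r about `center` for grouped data."""
    n = sum(freqs)
    return sum(f * (v - center) ** r for v, f in zip(values, freqs)) / n

values, freqs = [5, 15, 25], [2, 5, 3]  # hypothetical table
m1 = moment(values, freqs, 1)            # ordinary moment of order 1: the mean
m2 = moment(values, freqs, 2)
mu1 = moment(values, freqs, 1, center=m1)
mu2 = moment(values, freqs, 2, center=m1)
print(m1, mu1, mu2, m2 - m1 ** 2)        # mu2 also equals m2 - m1^2
```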

Figure 4.10:

4.2.1 Form measures

Definition 4.2.2. Skewness Coefficient. It is a descriptive measure that characterizes the degree of symmetry or asymmetry, and it is calculated using the following formula:
$$\delta = \frac{\mu_3}{\sigma^3}$$
If:
• δ > 0: right-skewed.
• δ < 0: left-skewed.
• δ = 0: symmetrical.
Example: We return to the table on the number of dependent children, with
$$\bar{X} = 2.36 \quad \text{and} \quad \sigma(X) = 1.3454.$$
We find:
$$\mu_3 = \frac{1}{n}\sum_{i=1}^{k} n_i (x_i - \bar{X})^3 = 0.4689$$
$$\delta = \frac{\mu_3}{\sigma^3} = 0.1925 > 0 \quad \text{(skewed to the right)}$$

Figure 4.11: Skewness

Definition 4.2.3. Kurtosis Coefficient. A distribution is more or less peaked depending on whether the frequencies of values near the central values differ little or much from each other. The kurtosis coefficient is calculated using the following formula:
$$\beta = \frac{\mu_4}{\sigma^4}$$
If:
• β > 3: the curve is leptokurtic (sharply peaked).

• β < 3: the curve is platykurtic (flattened).

• β = 3: the curve is mesokurtic (normal).
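
The two shape coefficients can be computed from central moments as sketched below. The frequency table is hypothetical (a symmetric distribution chosen so the results are easy to verify by hand); the last line only re-does the division from the skewness example, using the values 0.4689 and 1.3454 quoted there.

```python
def central_moment(values, freqs, r):
    """Central moment of order r for a frequency table."""
    n = sum(freqs)
    mean = sum(f * v for v, f in zip(values, freqs)) / n
    return sum(f * (v - mean) ** r for v, f in zip(values, freqs)) / n

# Hypothetical symmetric table: values 0..4 with frequencies 1, 4, 6, 4, 1.
values, freqs = [0, 1, 2, 3, 4], [1, 4, 6, 4, 1]
sigma = central_moment(values, freqs, 2) ** 0.5
delta = central_moment(values, freqs, 3) / sigma ** 3  # skewness
beta = central_moment(values, freqs, 4) / sigma ** 4   # kurtosis
print(delta, beta)  # symmetric, and beta < 3 (platykurtic)

delta_children = 0.4689 / 1.3454 ** 3  # arithmetic of the dependent-children example
print(round(delta_children, 4))
```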

Figure 4.12: Kurtosis

Figure 4.13:

4.3 Change of variable


• If we let:
$$Z_i = aX_i + b$$
• Then:
$$\bar{Z} = a\bar{X} + b$$
$$V(Z) = a^2 V(X)$$
Example: Between the years 1974 and 1983, wheat production (X, in quintals) was recorded, and it is presented in the table below.
• Divide the variable X into 3 classes with the same amplitude. Provide the statistical table.
• By performing the variable transformation $Y = \frac{X - 217}{22}$, calculate the average annual wheat production and the variance of X.

Figure 4.14:

• We have:
$$k = \mathrm{ENT}(\sqrt{n}) = \mathrm{ENT}(\sqrt{10}) = 3$$
$$e = x_{max} - x_{min} = 66$$
$$a_m = \frac{e}{k} = 22$$
• Calculation of the arithmetic mean and variance of the transformed variable Y, where $y_i = \frac{c_i - 217}{22}$ is the transformed center of class i:
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{k} n_i y_i = 0.1$$

$$V(Y) = \frac{1}{n}\sum_{i=1}^{k} n_i y_i^2 - \bar{Y}^2 = 0.69$$
• Variable change: since
$$Y = \frac{X - 217}{22} = \frac{1}{22}X - 9.86,$$
we have $X = 22Y + 217$, and therefore
$$\bar{X} = 22\,\bar{Y} + 217 = 219.2$$
$$V(X) = 22^2\, V(Y) = 333.96 \implies \sigma(X) \approx 18.3$$
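
The change-of-variable rules can be checked on the wheat example. The source table is not reproduced in the text, so the class centers (195, 217, 239) and frequencies (3, 3, 4) below are assumptions: they were chosen only because they are compatible with the stated range 66, amplitude 22, and the transformed mean 0.1 and variance 0.69.

```python
# Assumed classes [184,206[, [206,228[, [228,250] (hypothetical reconstruction).
centers = [195, 217, 239]
freqs = [3, 3, 4]
n = sum(freqs)

# Transformed centers y_i = (c_i - 217) / 22.
y = [(c - 217) / 22 for c in centers]
y_bar = sum(f * v for v, f in zip(y, freqs)) / n
var_y = sum(f * v ** 2 for v, f in zip(y, freqs)) / n - y_bar ** 2

# Back to X = 22*Y + 217, using mean(aY + b) = a*mean(Y) + b and V(aY + b) = a^2 V(Y).
x_bar = 22 * y_bar + 217
var_x = 22 ** 2 * var_y

# Direct computation on the original centers, for comparison.
x_bar_direct = sum(f * c for c, f in zip(centers, freqs)) / n
print(y_bar, var_y, x_bar, var_x, var_x ** 0.5)
```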

4.4 Measures of inequality (concentration measures)

Studying concentration helps to define various economic measures (such as
salaries, income, taxes, revenue, land area, etc.). Concentration is observed
whenever there exists inequality in the distribution among individuals within
a specific population. It’s a statistical concept relevant to specific datasets
where the characteristic can be aggregated. Hence, we refer to income con-
centration, land concentration, capital concentration, and similar measures.
We have already looked at a number of measures of inequality, but let’s
be explicit about them. All of the following have claims to be ways of
measuring inequality:

• The difference (or ratio) between mean and median income.

• The ratio (or difference) between any two percentiles of income, say “P50/P10” (median to 10th percentile) or “P90/P10” (90th to 10th percentile).

In these contexts, ratios are more often used than differences. This is simply because high income values are multiples of lower ones. In 2019 in the US, the difference between P90 and P10 was $1.3 × 10^5, but the ratio was 18, and people usually find it easier to grasp and compare such ratios than the absolute differences.
In addition, anything which we use as a measure of dispersion can also
be used, in this context, as a measure of inequality. Thus the variance or
standard deviation of incomes is, in itself, a measure of inequality. The
inter-quartile range, the difference between the 75th and 25th percentile, is
sometimes used as a measure of dispersion (it’s “robust”, in the sense of not
being much affected by a few outliers), and so could be used as a measure
of inequality. For the reason just explained, however, we’d be more likely to
employ the ratio P75/P25.

4.4.1 The Median – Medial difference


Medial (French: médiale)
The median should not be confused with the medial.
The medial of a statistical variable is the value that divides the mass of values into two equal parts. In other words, the medial divides the total mass $S = \sum n_i x_i$ into two equal parts: 50% of this mass is carried by values less than the medial and 50% by values greater than or equal to it.
The medial class is defined as the first class for which the cumulative mass $\tilde{s}_i = \sum_{j \le i} n_j c_j$ is greater than or equal to $S/2$. The medial is calculated using the following formula:
$$M_L = L_{ML} + \frac{\frac{S}{2} - \tilde{s}_{ml-1}}{s_{ml}} \cdot a_{ml} \tag{4.12}$$
where,
$L_{ML}$: the lower bound of the medial class.
$\tilde{s}_{ml-1}$: the cumulative mass of the previous class.
$s_{ml} = n_{ml} \cdot c_{ml}$: the mass of the medial class.
$a_{ml}$: the amplitude of the medial class.
Example:
The employees of a company are distributed according to their monthly salary (in units of 10,000 Algerian dinars, DA):

Table 4.1: Years

Salaries     c_i    n_i    s_i = n_i·c_i    s̃_i    g_i = s̃_i/S
[2 − 4[       3     60         180          180       0.375
[4 − 6[       5     20         100          280       0.58
[6 − 10[      8     10          80          360       0.75
[10 − 14]    12     10         120          480       1
TOTAL               100        480

where s̃_i is the cumulative mass and S is the overall mass of the characteristic.


In order to calculate the medial $M_L$, we note that

0.375 < 0.5 < 0.58

So, $M_L \in [4, 6[$.
Therefore, with linear interpolation, we can write

$$\frac{M_L - 4}{6 - 4} = \frac{0.5 - 0.375}{0.58 - 0.375}$$
So, $M_L \approx 5.2$, that is, 52,000 DA.
Note that we can also use the medial formula (4.12).
In this company, where the medial salary is equal to 52,000 DA, between 60 and 80 out of every 100 employees share the first half of the salary mass, while fewer than 40 out of every 100 employees share the second half.
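
The medial of the salary table can be recomputed with a short sketch. It interpolates on the cumulative mass n_i·c_i (salaries in units of 10,000 DA); the function name is an assumption made for this illustration.

```python
def grouped_medial(bounds, centers, freqs):
    """Value that splits the total mass sum(n_i * c_i) into two equal halves."""
    masses = [f * c for c, f in zip(centers, freqs)]
    half = sum(masses) / 2
    cum_prev = 0  # cumulative mass of the classes before the current one
    for i, s in enumerate(masses):
        if cum_prev + s >= half:  # first class whose cumulative mass reaches S/2
            return bounds[i] + (half - cum_prev) / s * (bounds[i + 1] - bounds[i])
        cum_prev += s

# Salary table from the example.
bounds = [2, 4, 6, 10, 14]
centers = [3, 5, 8, 12]
freqs = [60, 20, 10, 10]
medial = grouped_medial(bounds, centers, freqs)  # 4 + (240 - 180)/100 * 2
print(medial)  # in units of 10,000 DA
```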

The Median – Medial difference: ∆M


Comparing the values of the medial and the median constitutes a measure
of concentration. The measure of the difference between the medial and the
median informs about the degree of concentration. The larger the difference,
the stronger the concentration, and it is zero in the case of perfect equality.

$$\Delta M = M_L - Med$$
To assess concentration, we relate the difference to the range $e$ of the series:
$$\frac{\Delta M}{e} \times 100 \tag{4.13}$$
• If ∆M is large relative to the range: the concentration is strong.
• If ∆M is small relative to the range: the concentration is weak.
• If ∆M is zero: the concentration is zero ($M_L = Med$, perfect equality).

In economics, this concentration ratio can indicate the size of firms in relation to their industry as a whole. A low concentration ratio in an industry would indicate greater competition among the firms in that industry, compared to one with a ratio nearing 100%, which would be evident in an industry characterized by a true monopoly.
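Example:
Going back to the previous example (salaries in units of 10,000 DA):

• The median class is [2 − 4[, so $Med = 2 + \frac{50 - 0}{60} \times 2 \approx 3.667$, that is, about 36,667 DA.

• Therefore, $\Delta M = M_L - Med = 52{,}000 - 36{,}667 = 15{,}333$ DA.

• The range is $e = 14 - 2 = 12$, that is, 120,000 DA, so $(\Delta M / e) \times 100 = (15{,}333 / 120{,}000) \times 100 \approx 12.78\%$.

• The concentration is relatively low, as the difference represents about 12.78% of the range.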
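
The quantities of this example can be recomputed directly from the salary table, working in units of 10,000 DA. In this sketch the medial value 5.2 is taken from the text rather than recomputed, and the median class [2, 4[ is hard-coded because its cumulative frequency (60) already exceeds n/2 = 50.

```python
bounds = [2, 4, 6, 10, 14]
freqs = [60, 20, 10, 10]
n = sum(freqs)

# Median: first class [2, 4[, no class before it, frequency 60, amplitude 2.
median = 2 + (n / 2 - 0) / 60 * (4 - 2)
medial = 5.2                          # value obtained in the text
delta_m = medial - median
value_range = bounds[-1] - bounds[0]  # e = 14 - 2
ratio_percent = delta_m / value_range * 100
print(round(median, 4), round(delta_m, 4), round(ratio_percent, 2))
```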

4.4.2 The concentration ratio for groups


When we talk about the concentration of income, we mean the way the rich receive a disproportionate share of the total income.
The mean income is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where n is the number of individuals in the population, so the total income of the population is $n\bar{x}$.
When we pick out any part of the population, say C, we can ask what share of the total income goes to members of the group C:
$$s(C) = \frac{\sum_{j \in C} x_j}{n\bar{x}}$$
Notice that if we write $\bar{x}_C$ for the mean income of members of the group C, and $n_C$ for the number of individuals in that group, we get
$$s(C) = \frac{n_C}{n} \cdot \frac{\bar{x}_C}{\bar{x}}$$
So a group C will tend to get a big share if it is a big part of the total population, or if its average income is large compared to the population average.
We can look at income shares by education, race, etc.
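
The two expressions for s(C) can be compared on a small example. The incomes and the choice of group below are hypothetical; the sketch only illustrates that the direct share and the product (population share) × (relative mean income) agree.

```python
incomes = [10, 20, 30, 40, 100]  # hypothetical individual incomes
group = [3, 4]                   # indices of the individuals in group C (assumption)

n = len(incomes)
mean = sum(incomes) / n

# Direct definition: income of the group over total income.
share_direct = sum(incomes[j] for j in group) / (n * mean)

# Decomposition: (n_C / n) * (mean of C / overall mean).
n_c = len(group)
mean_c = sum(incomes[j] for j in group) / n_c
share_product = (n_c / n) * (mean_c / mean)
print(share_direct, share_product)
```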

4.4.3 The Lorenz curve (Concentration curve)


Lorenz Curves are generally used to measure the variability of the distribution of income and wealth. The Lorenz Curve shows how far the actual distribution of a statistical series deviates from the line of equal distribution; the extent of this deviation is known as the Lorenz Coefficient. The greater the distance between the Lorenz Curve and the line of equal distribution, the greater the inequality or variability in the series, and vice versa.

Construction of Lorenz Curve


The steps involved in the construction of a Lorenz Curve are as follows:
Step 1: Convert the given series into a cumulative frequency series and express each cumulative frequency as a percentage of the total frequency, which gives $\tilde{f}_i$. Similarly, cumulate the items (the values, or the mid-values of class intervals, weighted by their frequencies) and express each cumulative sum as a percentage of the overall total, which gives $\tilde{g}_i$.
Step 2: Plot the cumulative frequency percentages on the X-axis of a graph and the cumulative item percentages on the Y-axis. Both axes run from 0 to 100.
Step 3: Draw a diagonal line joining the origin (0, 0) to the point (100, 100). This diagonal line represents equal distribution, which is why it is known as the Equality Line or Line of Equal Distribution.
Step 4: Plot the points $(\tilde{f}_i, \tilde{g}_i)$ on the graph and join them to obtain a curve. This curve shows the actual distribution of the given statistical series.
Remark 4.4.1
The actual distribution curve is known as the Lorenz Curve. If the Lorenz Curve is close to the Equal Distribution Line, there is less variation in the distribution. If the gap between the Lorenz Curve and the Equal Distribution Line is larger, there is greater variation in the distribution.
Moreover, if two Lorenz Curves are drawn on the same graph, the one that is further away from the Equal Distribution Line shows greater variation.
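Example:
Draw a Lorenz Curve of the data given below:

Income    n_i    f̃_i     n_i·x_i    g̃_i
100       80     32%     8000       11%
200       70     60%     14000      30%
400       50     80%     20000      58%
500       30     92%     15000      78%
800       20     100%    16000      100%
Total     250            73000

Here f̃_i is the cumulative share of the population and g̃_i is the cumulative share of total income (cumulative n_i·x_i divided by 73000, rounded to the nearest percent).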

[Figure: Lorenz Curve and Equality Line. X-axis: Cumulative Share of Population (0 to 100); Y-axis: Cumulative Share of Income (0 to 100).]
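
The points of the Lorenz curve for this table can be computed as sketched below. The variable names are assumptions; the shares are kept as fractions of 1 rather than percentages, and they are not rounded.

```python
from itertools import accumulate

incomes = [100, 200, 400, 500, 800]  # x_i, sorted in increasing order
freqs = [80, 70, 50, 30, 20]         # n_i
masses = [x * f for x, f in zip(incomes, freqs)]  # n_i * x_i

# Cumulative shares of population and of income.
pop_share = [c / sum(freqs) for c in accumulate(freqs)]
income_share = [c / sum(masses) for c in accumulate(masses)]

# The curve starts at the origin and ends at (1, 1).
lorenz_points = [(0.0, 0.0)] + list(zip(pop_share, income_share))
for f, g in lorenz_points:
    print(f"{f:.2%}  {g:.2%}")
```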

Interpretation of the Lorenz curve


Interpreting the Lorenz curve involves comparing its shape with the diagonal line of perfect equality. If the Lorenz curve lies below the diagonal line, it indicates income inequality, implying that a smaller portion of the population holds a larger share of the total income. Conversely, if the curve lies closer to the diagonal line, it suggests a more equal distribution of income.
The further the Lorenz curve bows away from the diagonal line towards the bottom-right corner, the more pronounced the income inequality. When the Lorenz curve coincides with the diagonal line, there is perfect income equality: every segment of the population earns an equal share of the total income (example Figure (4.15)).

Figure 4.15: Lorenz curve for comparison



Figure 4.16: Income distribution & Concentration

A normally distributed histogram, with values spread out relatively evenly around the mean, implies that there isn’t a significant concentration of data points at any particular extreme. This characteristic aligns with weaker concentration or inequality, as opposed to distributions that exhibit significant skewness or heavy-tailed shapes which often indicate stronger concentration or inequality (Figure 4.16).

4.4.4 GINI coefficient


The Gini coefficient is a statistical measure used to quantify the degree of
economic inequality within a population. It represents the extent of in-
equality among values within a distribution, commonly used to assess in-
come inequality but applicable to various other distributions like wealth,
consumption, or social status.
The Gini coefficient is expressed as a value between 0 and 1, where:

• 0 represents perfect equality, implying that everyone in the population has an identical income or wealth.

• 1 represents maximum inequality, indicating that a single individual or group possesses all the income or wealth, while the rest have none.

Mathematically, the Gini coefficient is calculated based on the Lorenz curve, a graphical representation of income or wealth distribution. It measures the area between the Lorenz curve and the line of perfect equality (the diagonal line) relative to the total area under the line of perfect equality (Figure (4.17)).

Figure 4.17: Lorenz curve and the GINI coefficient

$$I_G = \frac{\text{Area of concentration}}{\text{Area under the line of perfect equality}} = \frac{\text{Area of concentration}}{\text{Area of the triangle}}$$

Calculation of GINI index


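The area under the Lorenz curve is decomposed into a triangle and trapezoids, whose areas $S_i$ are calculated and summed ($\sum S_i$).
Area of triangle OAB $= \frac{1 \times 1}{2} = 0.5$.
Concentration area = Area of triangle OAB $- \sum S_i = 0.5 - \sum S_i$,
where
$$S_i = \frac{(\text{small base} + \text{large base}) \times \text{height}}{2}$$
$$I_G = \frac{0.5 - \sum S_i}{0.5} = 1 - \frac{\sum S_i}{0.5}$$
$$I_G = 1 - \sum_{i=1}^{k} f_i \left(\tilde{g}_{i-1} + \tilde{g}_i\right)$$
where $f_i$ is the relative frequency of class i (the height of the i-th trapezoid), $\tilde{g}_i$ is the cumulative share of the mass, and $\tilde{g}_0 = 0$.

If $I_G < 0.5$ the concentration is weak.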
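
The trapezoid formula can be applied to the income table of the Lorenz curve example. The sketch below assumes the values are sorted in increasing order and works with unrounded shares; the function name is hypothetical.

```python
def gini_from_groups(values, freqs):
    """Gini index I_G = 1 - sum f_i * (g_{i-1} + g_i) for a sorted frequency table."""
    n = sum(freqs)
    total_mass = sum(v * f for v, f in zip(values, freqs))
    g_prev = 0.0      # cumulative mass share before the current class (g_0 = 0)
    cum_mass = 0
    trapezoids = 0.0  # running sum of f_i * (g_{i-1} + g_i)
    for v, f in zip(values, freqs):
        cum_mass += v * f
        g = cum_mass / total_mass
        trapezoids += (f / n) * (g_prev + g)
        g_prev = g
    return 1 - trapezoids

gini = gini_from_groups([100, 200, 400, 500, 800], [80, 70, 50, 30, 20])
print(round(gini, 4))
```

With these data the index is below 0.5, which the rule above classifies as weak concentration.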

Remark 4.4.2
• Gini coefficients can vary widely across different countries, regions, or demographics. Hence, use the Gini index as a comparative tool.

• Compare Gini coefficients over time or between different populations to analyze changes in income or wealth inequality.

• Higher Gini coefficients often correlate with social issues like poverty, reduced social mobility, and economic instability.

• A lower Gini coefficient suggests a fairer and more equitable distribution of income or wealth.

4.4.5 Herfindahl–Hirschman index


The Herfindahl index (also known as Herfindahl–Hirschman Index,
HHI, or sometimes HHI-score) is a measure of the size of firms in relation
to the industry they are in and is an indicator of the amount of competition
among them. Named after economists Orris C. Herfindahl and Albert O.
Hirschman, it is an economic concept widely applied in competition law,
antitrust regulation, and technology management.
HHI has continued to be used by antitrust authorities, primarily to eval-
uate and understand how mergers will affect their associated markets.
HHI is calculated by squaring the market share of each competing firm
in the industry and then summing the resulting numbers (sometimes limited
to the 50 largest firms).
If we consider each share as:
$$P_i = \frac{x_i}{\sum_{j=1}^{n} x_j}$$

The HHI is defined by

$$HHI = \sum_{i=1}^{n} P_i^2 \tag{4.14}$$

Increases in the HHI generally indicate a decrease in competition and an increase of market power, whereas decreases indicate the opposite.
The major benefit of the Herfindahl index in relation to measures such
as the concentration ratio is that the HHI gives more weight to larger firms.
Other advantages of the HHI include its simple calculation method and the
small amount of often easily obtainable data required for the calculation.

Interpretation of the HH index


The closer a market is to a monopoly, the higher the market’s concentration
(and the lower its competition). If, for example, there were only one firm in
an industry, that firm would have 100% market share, and the HHI would
equal 10,000, indicating a monopoly. If there were thousands of firms com-
peting, each would have roughly 0% market share, and the HHI would be
close to 0, indicating nearly perfect competition.
If HHI < 1000, the European competition authority considers the sector to be weakly concentrated. Potential company mergers will be permitted as they do not pose a risk related to potential abuse of market power.

If 1000 < HHI < 2000, the sector poses a risk of market power. Competition authorities will verify that the post-merger HHI does not increase by more than 250 (approximately equivalent to a 16% market share increase). Otherwise, company mergers will not be permitted.
If HHI > 2000, the sector is highly concentrated, and competition authorities will verify that the post-merger HHI does not increase by more than 150 (approximately 12.2% market share increase). Otherwise, company mergers will not be permitted.

Examples
The banking sector in France
Here are the market shares (in %) of the major French banks:

Crédit Agricole + Banque Populaire + Crédit Mutuel + Société Générale + BNP Paribas + Banque Postale + HSBC + Others
$$= 29^2 + 21^2 + 13^2 + 13^2 + 10^2 + 8^2 + 2^2 + 4^2 = 1804 \quad (2011)$$

The HHI (Herfindahl-Hirschman Index) in the banking sector satisfies 1000 < 1804 < 2000. There is a risk of market power and anti-competitive strategies. It is a sector to monitor.
The global market for sports equipment

Nike + Adidas + Puma + New Balance + Under Armour
$$= 33.5^2 + 27.3^2 + 18.9^2 + 13.3^2 + 7^2 \approx 2451 \quad (2015)$$

The HHI (Herfindahl-Hirschman Index) in the global sports equipment market is 2451 > 2000. It’s a concentrated sector with companies that may have anti-competitive strategies, ultimately resulting in consumers paying higher prices for their sports equipment!
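
The two HHI values above can be reproduced with a few lines. Market shares are entered in percent, as in the examples, so that a monopoly gives 10,000; the function name is an assumption.

```python
def hhi(shares_percent):
    """Herfindahl-Hirschman index from market shares expressed in percent."""
    return sum(s ** 2 for s in shares_percent)

french_banks_2011 = [29, 21, 13, 13, 10, 8, 2, 4]
sports_equipment_2015 = [33.5, 27.3, 18.9, 13.3, 7]

hhi_banks = hhi(french_banks_2011)
hhi_sports = hhi(sports_equipment_2015)
print(hhi_banks, round(hhi_sports))
```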
5 Conclusion

In conclusion, descriptive statistics form the foundation of understanding and summarizing datasets. Through this lecture, we’ve delved into the fundamental vocabulary essential for effective statistical analysis, emphasizing
the significance of clear data presentation using tables and graphical repre-
sentations. Moreover, we’ve explored various numerical measures, ranging
from measures of central tendency, which highlight the typical or central
values within a dataset, to measures of variability, showcasing the spread
or dispersion of data points. Additionally, concentration measures provide
insights into inequality or concentration within distributions. Mastering
descriptive statistics equips us with the tools to describe, analyze, and com-
prehend data, serving as a crucial step in further statistical exploration and
decision-making processes.
