Basic Statistics Overview for Students
HARAMAYA UNIVERSITY
COLLEGE OF COMPUTING AND INFORMATICS
DEPARTMENT OF STATISTICS
_____________________________________________
Basic Statistics- Stat 2131
For Department of Accounting and Finance
Set by:
Kindu Kebede Gebre(Assistant Professor )
©November, 2024
Basic Statistics Email:[email protected] 2024
CHAPTER ONE
1. Introduction to Statistics
1.1. Definition of Statistics
Before getting into the subject matter in detail, let us define some of the terms used extensively
in the field of statistics.
Data: figures or facts from which conclusions can be drawn. Data are the numerical results of
any scientific measurement. Any value that is expressed in numbers is called data.
Population: the totality of all elements under study.
Sample: a portion or part of the population taken so that some generalization about the
population can be made. It is a subset of the population that is assumed to be
representative of the population.
Statistics can be defined in two senses: plural (as Statistical Data) and singular (as Statistical
Methods).
Plural sense: Statistics are collections of facts (figures). This meaning of the word is widely used
when reference is made to facts and figures on sales, employment or unemployment, accidents,
weather, deaths, education, etc. E.g.: Sales Statistics, Labor Statistics, Employment Statistics, etc.
In this sense the word Statistics serves simply as data. But not all data are statistics.
For numerical data to be identified as statistics, they must possess certain
identifiable characteristics, as follows:
1. Statistics are aggregates of facts: a single or isolated fact or figure is not statistics.
Example 1: "I earn birr 30000 per year." This is not a statistical statement.
Example 2: "The average salary of a professor at our university is 30000 per year." This is a
statistical statement, because the average is computed from many related figures, namely the
yearly salaries of many professors.
2. Statistics are numerical expressions: all statistics are stated in numerical figures only.
Example 3: Comparing the CGPA in the statistics and probability course of Accounting and Finance
students with that of Statistics students is a statistical statement.
3. Statistics must be placed in relation to each other. Comparisons must relate to the same
subject: oranges cannot be compared with apples.
Singular sense: Statistics is the science that deals with the methods of data collection,
organization, presentation, analysis and interpretation of data. It refers to the subject area that is
concerned with extracting relevant information from available data with the aim to make sound
decisions. According to this meaning, statistics is concerned with the development and
application of methods and techniques for collecting, organizing, presenting, analyzing and
interpreting statistical data.
Based on the scope of the decision, statistics can be classified into two: Descriptive and
Inferential Statistics.
Descriptive Statistics refers to the procedures used to organize and summarize masses of data.
It is concerned with describing or summarizing the most important features of the data. It deals
only with the characteristics of the collected data, without attempting to infer (conclude)
anything that goes beyond the data themselves.
Inferential Statistics includes the methods used to find out something about a population, based
on a sample. It is concerned with drawing statistically valid conclusions about the
characteristics of the population based on information obtained from a sample. In this form of
statistical analysis, inferential statistics is linked with probability theory in order to generalize the
results of the sample to the population. Performing hypothesis tests, determining relationships
between variables and making predictions are also part of inferential statistics.
Example: Classify the following statements as Descriptive or Inferential Statistics
a. The average income of staff at the commercial bank of America this year is $5000.
b. There is a strong association between income and expenditure level.
c. Of the students enrolled in Haramaya University this year, 74% are male and 26% are
female.
d. The price of wheat will increase by 5% in the coming year.
e. The chance of winning the Ethiopian National Lottery in any day is 1 out of 167000.
Uses of Statistics
To reduce and summarize masses of data and to present facts in numerical and
definite form. Statistics condenses and summarizes a large mass of data and presents
facts into a few presentable, understandable and precise numerical figures. The raw data,
as is usually available, is voluminous and haphazard. It is generally not possible to draw
any conclusions from the raw data as collected. Hence it is necessary and desirable to
express these data in a few numerical values.
To facilitate comparison. Statistical devices such as averages, percentages, ratios, etc
are used for this purpose.
For determining functional relationships between two or more phenomena.
Statistical techniques such as correlation analysis assist in establishing the degree of
association between two or more variables.
For formulating and testing hypotheses. For instance, hypotheses such as whether a new
medicine is effective in curing a disease, or whether there is an association between
variables, can be tested using statistical tools.
For forecasting. Statistical methods help in studying past data and predicting future
trends.
1.3. Types of Variables and Measurement Scales
1.3.1. Variable
A variable is a characteristic or an attribute that can assume different values.
For example: income, Family size, Gender, etc.
Based on the values that variables assume, variables can be classified as
1. Qualitative variables are those variables that do not assume numeric values.
For example: Gender, marital status, religion, etc.
2. Quantitative variables are variables that assume numeric values. These variables are numeric in
nature.
For example: Expenditure, Family size, etc.
Quantitative variables are again classified into two: discrete and continuous variables.
Discrete variable takes whole number values and consists of distinct recognizable
individual elements that can be counted. It is a variable that assumes a finite or countable
number of possible values. These values are obtained by counting (0, 1, 2. . .).
For example: Family size, Number of children in a family, number of cars at the traffic
light.
Continuous variable takes any value including decimals. Such a variable can
theoretically assume an infinite number of possible values. These values are obtained by
measuring.
Example: Height, Weight, Net- income, and Age
Generally the values of a variable can be obtained either by counting for discrete variables, by
measuring for continuous variables or by making categories for qualitative variables.
Exercise: Classify each of the following as Qualitative or Quantitative and, if it is quantitative,
classify it as Discrete or Continuous.
a. Sales of automobiles in a dealer's showroom.
b. The number of customers who come in each day.
c. Classification of wealth index based on income status (very poor, poor, rich, very rich)
d. Weight of newly born babies.
Based on the numbers on the shirts it is not possible to judge whether Mr B plays better. But by
using the test score, it is possible to judge that Mr B did better in the exam. Also, it is not possible
to find the average shirt number (the average shirt number is meaningless) because the numbers
on the shirts are simply codes, but it is possible to obtain the average test score.
2. Ordinal variables: are qualitative variables whose values can be ordered and ranked.
Ranking and counting are the only mathematical operations that can be done on the values of the
variables. But there is no precise difference between the values (categories) of the variable.
Examples: Academic qualifications (B.Sc., M.Sc., Ph.D.), Grade scores (A, B, C, D, F), Wealth
index (very poor, poor, rich, very rich)
3. Interval variables: are quantitative variables for which a value of zero does not show
absence of the characteristic, i.e. there is no true zero. Zero indicates a low value rather than
emptiness. There is a precise difference between the units of measurement (levels).
Example: temperature. 0 °C does not mean there is no temperature; it only means that it is cold.
4. Ratio variables: are quantitative variables for which a value of zero shows absence of the
characteristic.
Examples: Income, Amount of yield, Expenditure, Consumption.
All mathematical operations are allowed on the values of the variables.
Descriptive statistics are used to describe datasets. Businesses in almost every field
use descriptive statistics to gain a better understanding of how their consumers
behave. For example, a grocery store might calculate the following descriptive
statistics:
On the other hand, a bank might calculate the following descriptive statistics:
Using these metrics, the bank can get an idea of how their customers behave and
how they handle their money. Not all businesses build statistical models or perform
complex calculations, but just about every business uses descriptive statistics to
gain a better understanding of their customers.
Using this simple chart, the business can quickly see that both their sales and
number of new clients tend to increase the most in the final quarter of the year.
This can allow the business to be prepared with more staff, later hours, more
inventory, etc. during this time of year.
Another way that statistics is used in business settings is in the form of linear
regression models. These are models that allow a business to understand the
relationship between one or more predictor variables and a response variable. For
example, a grocery store might track their total amount spent on TV advertising,
their total amount spent on online advertising, and their total revenue. They might
then build the following multiple linear regression model:
For each additional dollar spent on online advertising, the total revenue
increases by $4.87 (assuming TV advertising is held constant).
Using this model, the grocery store can quickly see that their money is better spent
on online advertising as opposed to TV advertising.
Note: In this example, we only used two predictor variables (TV advertising and
online advertising), but in practice businesses often build regression models with
far more predictor variables.
Another way that statistics is used in business settings is in the form of cluster
analysis. This is a machine learning technique that allows a business to group
together similar people based on different attributes. Retail companies often use
clustering to identify groups of households that are similar to each other. For example, a
company might collect the following information on each household:
Household income
Household size
Head of household Occupation
Distance from nearest urban area
They can then feed these variables into a clustering algorithm to perhaps identify
the following clusters:
The company can then send personalized advertisements or sales letters to each
household based on how likely they are to respond to specific types of
advertisements.
CHAPTER TWO
In order to describe situations, draw conclusions or make inferences about the population, or even to
describe the sample, the collected data must be organized in some meaningful way. The most
convenient way of organizing data is to construct a frequency distribution. A frequency
distribution is the organization of raw data in table form, using classes and frequencies.
Definition of some terms
Class: is a description of a group of similar numbers in a data set.
Frequency: is the number of times a variable value is repeated.
Class frequency: the number of observations belonging to a certain class.
There are three types of frequency distributions; categorical, ungrouped (discrete or frequency
array) and grouped (continuous) frequency distributions.
Categorical FD:-a FD in which the data is qualitative i.e. either nominal or ordinal. Each
category of the variable represents a single class and the number of times each category repeats
represents the frequency of that class (category).
Example:-The blood type of 25 students is given below
A B B AB O A
O O B AB B A B
B B O A O AB
A O O O AB O
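The tally for this example can be sketched in a few lines of Python. This is only an illustration of how a categorical frequency distribution is built from the raw data above; the variable names are arbitrary, and the percentage column is an optional extra.

```python
from collections import Counter

# Blood types of the 25 students, copied row by row from the example above.
blood_types = (
    "A B B AB O A "
    "O O B AB B A B "
    "B B O A O AB "
    "A O O O AB O"
).split()

frequency = Counter(blood_types)   # category -> number of times it occurs
total = sum(frequency.values())    # total number of observations

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for blood_type in ("A", "B", "AB", "O"):
    count = frequency[blood_type]
    print(f"{blood_type:<6}{count:>10}{100 * count / total:>8.0f}%")
print(f"{'Total':<6}{total:>10}")
```

Each category forms one class, and the class frequencies add up to the number of students.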
Class Limits:-The lowest and highest values that can be included in a class are called Class
Limits. The lowest values are called Lower Class Limits and the highest values are called Upper
Class Limits.
Example: for a first class of 1-25, the lower class limit is 1 and the upper class limit is 25.
Class Boundaries:-are class limits adjusted so that there is no gap between the upper class limit
(UCL) of one class and the lower class limit (LCL) of the next class. The lowest values are called
Lower Class Boundaries (LCB) and the highest values are called Upper Class Boundaries (UCB).
Class Width (Class Size):-the difference between the UCB and LCB of a class. It is also the
difference between the lower limits of two consecutive classes or the difference between the
upper limits of two consecutive classes.
Class Mark (Class Midpoint):-is the half way between the class limits or the class boundaries.
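The definitions above can be checked with a short sketch. Only the first class, 1-25, comes from the notes; the second and third classes (26-50 and 51-75) are assumed here purely for illustration, and the boundaries are obtained by splitting the gap between consecutive classes in half.

```python
# Class limits as (lower, upper); only 1-25 is from the notes, the rest are assumed.
class_limits = [(1, 25), (26, 50), (51, 75)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = class_limits[1][0] - class_limits[0][1]

rows = []
for lower, upper in class_limits:
    lcb = lower - gap / 2        # lower class boundary
    ucb = upper + gap / 2        # upper class boundary
    width = ucb - lcb            # class width
    mark = (lower + upper) / 2   # class mark (midpoint)
    rows.append((lcb, ucb, width, mark))
    print(f"{lower}-{upper}: boundaries {lcb}-{ucb}, width {width}, mark {mark}")
```

The width obtained from the boundaries (25) equals the difference between the lower limits of two consecutive classes (26 - 1), as stated in the definition.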
Relative frequency: - is the ratio of class frequency to the total frequency (total number of
observations).
Percentage frequency: - Relative frequency ×100
Cumulative frequency: is the sum of frequencies (total number of observations) below or above
a certain value.
Less than Cumulative Frequency: is the total number of values of a variable below a certain
UCB.
More than Cumulative Frequency: - is the total number of values of a variable above certain
LCB.
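As a small sketch of these four quantities, the snippet below uses a hypothetical list of class frequencies (not taken from the notes), ordered from the lowest class to the highest.

```python
from itertools import accumulate

# Hypothetical class frequencies, lowest class first (illustrative values only).
frequencies = [4, 6, 10, 5]
total = sum(frequencies)

relative = [f / total for f in frequencies]    # relative frequency
percentage = [100 * r for r in relative]       # percentage frequency

# Less-than cumulative frequency: observations below each class's UCB.
less_than = list(accumulate(frequencies))
# More-than cumulative frequency: observations above each class's LCB.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for row in zip(frequencies, relative, percentage, less_than, more_than):
    print(row)
```

The last less-than value and the first more-than value both equal the total number of observations.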
16 21 26 24 11 17 25 26 13 27 24 26 3 27 23 24 15 22 22 12 22 29 18 22 28 25 7
17 22 28 19 23 23 22 3 19 13 31 23 28 24 9 20 33 30 23 20 8 21 24
Solution:
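A minimal sketch of one possible grouping of the 50 values above follows. It assumes Sturges' rule (k = 1 + 3.322 log n) for the number of classes and starts the first class at the smallest observation; both choices are assumptions made for illustration, and other reasonable choices lead to a different table.

```python
import math

data = [
    16, 21, 26, 24, 11, 17, 25, 26, 13, 27, 24, 26, 3, 27, 23, 24, 15, 22,
    22, 12, 22, 29, 18, 22, 28, 25, 7, 17, 22, 28, 19, 23, 23, 22, 3, 19,
    13, 31, 23, 28, 24, 9, 20, 33, 30, 23, 20, 8, 21, 24,
]

n = len(data)
k = math.ceil(1 + 3.322 * math.log10(n))         # number of classes (Sturges' rule, assumed)
width = math.ceil((max(data) - min(data)) / k)   # class width, rounded up
start = min(data)                                # lower limit of the first class (assumed)

# Class limits (lower, upper) for whole-number data.
classes = [(start + i * width, start + (i + 1) * width - 1) for i in range(k)]
# Count the observations that fall inside each class.
frequencies = [sum(lo <= x <= hi for x in data) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo:>2}-{hi:<2}  {f}")
print("Total", sum(frequencies))
```

Whatever grouping is chosen, the class frequencies must add up to the number of observations.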
Exercise: In a survey the age of 44 women at marriage was reported as follows. Construct the
appropriate FD for this data.
24 25 27 26 22 23 24 25 24 23 26 28 24 25 23 24 25 25 25 22 27 28
27 24 25 24 25 28 26 25 24 28 24 25 25 24 25 24 26 27 27 25 28 26
Because the selection of the class width and the lower class limit of the first class are to a
certain extent arbitrary, different frequency distributions may be constructed for the same
data and hence may give contradictory impressions.
1. Histogram: A graph in which the classes are marked on the X axis (horizontal axis) and the
frequencies are marked along the Y axis (vertical axis).
The height of each bar represents the class frequencies and the width of the bar represents
the class width.
The bars are drawn adjacent to each other.
A stem and leaf display is a graphical method of displaying data. It is particularly useful when
your data are not too numerous. In this section, we will explain how to construct and interpret
this kind of graph. As usual, an example will get us started. Consider Table 1 that shows the
number of touchdown passes (TD passes) thrown by each of the 31 teams in the National
Football League in the 2000 season.
A stem and leaf display of the data is shown in Figure 1. The left portion of Figure 1
contains the stems. They are the numbers 3, 2, 1, and 0, arranged as a column to the left
of the bars. Think of these numbers as 10's digits. A stem of 3, for example, can be used
to represent the 10's digit in any of the numbers from 30 to 39. The numbers to the right
of the bar are leaves, and they represent the 1's digits. Every leaf in the graph therefore
stands for the result of adding the leaf to 10 times its stem.
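The construction just described can be sketched in a few lines. The values below are illustrative only (they are not the data of Table 1), and the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} with 10's digits as stems and 1's digits as leaves."""
    display = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 37 -> stem 3, leaf 7
        display[stem].append(leaf)
    return dict(display)

# Illustrative two-digit values only; not the touchdown data of Table 1.
values = [37, 33, 32, 29, 28, 21, 18, 14, 12, 9, 6]
display = stem_and_leaf(values)

# Print the largest stem first, matching the 3, 2, 1, 0 ordering in the text.
for stem in sorted(display, reverse=True):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Reading a row back, stem 3 with leaf 7 stands for 10 × 3 + 7 = 37.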
A dot plot, also known as a strip plot or dot chart, is a simple form of data visualization
that consists of data points plotted as dots on a graph with an x- and y-axis. These types
of charts are used to graphically depict certain data trends or groupings. A dot plot is
similar to a histogram in that it displays the number of data points that fall into each
category or value on the axis, thus showing the distribution of a set of data.
A dot plot is used to represent any data in the form of dots or small circles. It is similar to a
simplified histogram or a bar graph as the height of the bar formed with dots represents the
numerical value of each variable. Dot plots are used to represent small amounts of data. For
example, a dot plot can be used to display the vaccination report of newborns in an area, which is
represented in the following table.
Now let's see the number of newborn babies who got a vaccine in each colony. Colony A has a
total of 7 dots, which means that seven babies have been vaccinated. Similarly, colony B has
three babies, colony C has five babies, and colony D has one baby who has been vaccinated. The
other way to represent it through a dot plot is given below:-
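As a rough text-only sketch, the colony counts quoted above (A = 7, B = 3, C = 5, D = 1) can be stacked as columns of dots. The layout is an illustrative choice and is not meant to reproduce the figure in the notes.

```python
# Number of vaccinated newborns per colony, as stated in the text.
vaccinated = {"A": 7, "B": 3, "C": 5, "D": 1}

rows = []
# Build the plot from the top level down; a colony gets a dot on a level
# only if its count reaches that level.
for level in range(max(vaccinated.values()), 0, -1):
    row = "  ".join("*" if count >= level else " " for count in vaccinated.values())
    rows.append(row)
    print(row)
print("  ".join(vaccinated))   # colony labels along the horizontal axis
```

The height of each column of dots equals the number of vaccinated babies in that colony.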
3. Frequency Polygon: A graph that consists of line segments connecting the intersection
of the class marks and the frequencies.
Can be constructed from Histogram by joining the mid-points of each bar.
Example: Construct frequency polygon for the following Grouped frequency Distribution.
4. Cumulative Frequency (Ogive) curve: a smooth freehand curve drawn through the cumulative frequencies plotted against the class boundaries.
Example: Construct Ogive curve for the following Grouped frequency Distribution.
Class boundaries Frequency
99.5–104.5 2
104.5–109.5 8
109.5–114.5 18
114.5–119.5 13
119.5–124.5 7
124.5–129.5 1
129.5–134.5 1
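The points needed for a less-than ogive of this table can be computed as sketched below: each upper class boundary is paired with the cumulative frequency up to it, starting from zero at the first lower boundary. Plotting itself is left out; the variable names are arbitrary.

```python
from itertools import accumulate

# Class boundaries and frequencies from the table above.
boundaries = [99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Less-than cumulative frequencies, with 0 at the lowest class boundary.
cumulative = [0] + list(accumulate(frequencies))

# (boundary, cumulative frequency) pairs to plot and join with a curve.
ogive_points = list(zip(boundaries, cumulative))
for boundary, cf in ogive_points:
    print(f"less than {boundary}: {cf}")
```

The curve starts at zero and ends at the total frequency.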
A line graph, also known as a line plot or a line chart, is a graph that uses lines to connect
individual data points. A line graph displays quantitative values over a specified time interval.
In finance, line graphs are commonly used to depict the historical price action of an asset or
security.
Line graphs use data point "markers" connected by straight lines, which aid in
visualization. While line graphs are used across many different fields for different
purposes, they are especially helpful when it is necessary to create a graphical
depiction of changes in values over time.
changes of that variable over time. This graph cannot be used to compare the variable to another
variable because only one variable is charted.
In the example below, the x-axis is time and the y-axis is the year-over-year change in price for
all consumer goods in the United States. This graph of the Consumer Price Index shows the
annual rate of inflation and, since it is analyzing just one set of data (all items), there is only one
line.
In a multiple line graph, more than one dependent variable is charted on the graph and
compared over a single independent variable (often time). Different dependent variables are
often given different colored lines to distinguish between each data set. Each line relates to only
the points in its given data set; lines do not cross between dependent variables.
For example, the line graph below shows the Consumer Price Index again. However, this graph
shows the change in price for three different categories: medical care
(red), commodities (green), and shelter (blue). In this graph, we can see the growth in price for
commodities was higher than the other two categories in July 2022. However, shelter or medical
expenses were typically the groups that experienced higher inflation over the past decade.
b. Pictograms
A pictogram is one of the simplest and most popular forms of data visualization.
Besides making your data look nice, pictograms can make your data more memorable. Visually
stacking icons to represent simple data can improve a reader's recall of that data and even their
level of engagement with that data. Pictograms can also be a fun addition to any info-graphic.
Pictograms are types of charts and graphs that use icons and images to represent data.
Also known as “pictographs”, “icon charts”, “picture charts”, and “pictorial unit charts”,
pictograms use a series of repeated icons to visualize simple data. The icons are arranged in a
single line or a grid, with each icon representing a certain number of units (usually 1, 10, or 100).
A feature of many great info-graphics, they're often used to make otherwise boring facts or data
points more compelling, as seen in the statistical info-graphic below.
When to use a pictogram
Pictograms can come in handy quite often when visualizing data in info-graphics,
reports, presentations, and even resumes!
You can use a pictogram whenever you want to make simple data more visually interesting,
more memorable, or more engaging.
Whether you want to show the magnitude of an important stat or visualize a fraction or
percentage, you can use pictograms to add visual impact to simple data.
Use a pictogram to show ratings or changes
We know that pictograms are great for showing simple proportions or percentages.
A Scatter Diagram is also called a Scatter Plot or an x-y graph. This type of chart is designed
to express the relationship between two variables. Each observation is plotted as a dot: the
y-axis displays the dependent variable of your data, while the x-axis displays the independent
variable. When you carefully examine this Scatter Diagram type, you will see that the dots follow
a linear pattern, so that a straight line can be drawn through them. Below is an example of a Scatter
The dots' straight-line alignment shows a strong relationship between your data points.
Experts describe this Scatter Plot type as having a low degree of correlation. The data points are
somewhat non-linear, and it can be challenging to fit a straight line. Your data points appear as dots and
are usually close to each other. A Scatter Diagram with moderate correlation will appear as
shown below.
This Scatter Diagram type has no degree of alignment or correlation. In most instances, your data
points scatter all over the diagram, which makes it difficult to draw a straight line. It becomes
impossible for you to establish a relationship between your variables. A Scatter Diagram with no
A contingency table displays frequencies for combinations of two categorical variables. Analysts
also refer to contingency tables as cross-tabulation and two-way tables. Contingency tables
classify outcomes for one variable in rows and the other in columns. The values at the row and
column intersections are frequencies for each unique combination of the two variables. Use
contingency tables to understand the relationship between categorical variables. For example, is
there a relationship between gender (male/female) and type of computer (Mac/PC)?
The contingency table example below displays computer sales at our fictional store. Specifically,
it describes sales frequencies by the customer's gender and the type of computer purchased. It is
a two-way table (2 × 2).
In this contingency table, columns represent computer types and rows represent genders. Cell
values are frequencies for each combination of gender and computer type. Totals are in the
margins. Notice the grand total in the bottom-right margin. At a glance, it's easy to see how
two-way tables both organize your data and paint a picture of the results. You can easily see the
frequencies for all possible subset combinations along with totals for males, females, PCs, and
Macs. For example, 66 males bought PCs while 87 females bought Macs. Furthermore, there are
117 females, 106 males, 96 PC sales, 127 Mac sales, and a grand total of 223 observations in the
study.
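The table described above can be rebuilt from its cell counts, as in the sketch below. Only two cells are quoted directly in the text (66 males with PCs, 87 females with Macs); the other two (40 and 30) are derived here from the stated margins and should be read as a reconstruction.

```python
# Cell frequencies: (gender, computer type) -> count.
counts = {
    ("Male", "PC"): 66,      # stated in the text
    ("Male", "Mac"): 40,     # derived: 106 males - 66
    ("Female", "PC"): 30,    # derived: 96 PC sales - 66
    ("Female", "Mac"): 87,   # stated in the text
}

row_totals = {}   # totals per gender (row margins)
col_totals = {}   # totals per computer type (column margins)
for (gender, computer), n in counts.items():
    row_totals[gender] = row_totals.get(gender, 0) + n
    col_totals[computer] = col_totals.get(computer, 0) + n
grand_total = sum(counts.values())

print(f"{'':<8}{'PC':>6}{'Mac':>6}{'Total':>7}")
for gender in ("Male", "Female"):
    print(f"{gender:<8}{counts[(gender, 'PC')]:>6}{counts[(gender, 'Mac')]:>6}"
          f"{row_totals[gender]:>7}")
print(f"{'Total':<8}{col_totals['PC']:>6}{col_totals['Mac']:>6}{grand_total:>7}")
```

The row and column margins reproduce the totals quoted in the text, which is a quick consistency check on the derived cells.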
Bar Diagram:-It is the simplest and most commonly used diagrammatic representation of a
frequency distribution. It is appropriate for presenting Qualitative Data (nominal/ordinal).
It uses a series of separated and equally spaced bars in which the width of the bars is constant
and the height of each bar corresponds to the frequency of the category. The bars are separated by
a constant distance.
a. Simple Bar Diagram: is a diagram in which categories of a variable are marked on the X
axis and the frequencies of the categories are marked on the Y axis. It is applicable for
discrete variables, that is, for data given according to some period, places and timings.
These periods and timings are represented on the base line (X-axis) at regular interval and
the corresponding frequencies are represented on the Y-axis.
The width of the rectangle represents nothing (it is meaningless), but it should be equal for
all rectangles.
Each rectangle is separated by an equal space.
It can also represent some magnitude (on the Y axis) over time, space, groups, etc.(on the X
axis).
Example1:
[Figure: simple bar diagram with Frequency on the Y axis (0 to 100) and the categories Single, Married and Divorced on the X axis]
b. Component Bar Diagram: is used when there is a desire to show how a total or aggregate is
divided into its component parts. The bars represent the total value of a variable with each total
broken into its component parts, and different colors are used for identification. In such type
of diagrams, a bar is subdivided into parts in proportion to the size of the subdivision.
These subdivided rectangles are shaded differently by lines, dots and colors so that they will
be very easy to compare the components. Sometimes the volumes of different attributes may
be greatly different. For making meaningful comparisons, the components of the attributes
are reduced to percentages. In that case each attribute will have 100 as its maximum volume.
This sort of component bar diagram is known as percentage bar-diagram. Each rectangle
represents total value of a variable and is broken into its component parts.
Example:
c. Multiple Bars Diagram: used to display data on more than one variable. In the multiple
bars diagram two or more sets of inter-related data are interpreted.
Example:
d. Deviation Bar Diagram: used when the data contain both positive and negative values, such as
data on net profit, net expense, percent change, etc.
Example:
8. Pie chart: - A pie chart is popularly used in practice to show the percentage breakdown of data. A
pie chart is a circle representing a set of data, divided into sectors proportional to
the number of items in the categories; equivalently, it is a circle representing the total, cut into
slices in proportion to the size of the parts that make up the total. It gives the proportional
sizes of different data groups as slices of a pie or circle.
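Since the full circle is 360 degrees, the angle of each sector is the category's share of the total multiplied by 360. The sketch below illustrates this with a hypothetical set of categories and values that does not come from the notes.

```python
# Hypothetical category totals (illustrative values only).
values = {"Food": 50, "Rent": 30, "Transport": 20}
total = sum(values.values())

# Sector angle in degrees and percentage share for each category.
angles = {name: 360 * value / total for name, value in values.items()}
percents = {name: 100 * value / total for name, value in values.items()}

for name in values:
    print(f"{name:<10}{percents[name]:>6.1f}%  {angles[name]:>6.1f} degrees")
```

The sector angles always add up to 360 degrees and the percentages to 100.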
Example:
CHAPTER THREE
Usually the collected data is not suitable for drawing conclusions about the mass from which it has
been taken. Even though the data will be somewhat summarized after it is depicted using
frequency distributions and presented using graphs and diagrams, we still cannot make
inferences about the data since we have many groups. Hence, organizing data into a FD is not
sufficient; there is a need for further condensation. In particular, when we want to compare two or
more distributions we may reduce each entire distribution to one number that represents
it. A single value which can be considered as typical or representative of a
set of observations, and around which the observations can be considered as centered, is called an
'Average' (or average value or center of location). Since such typical values tend to lie centrally
within a set of observations when arranged according to magnitude, averages are called
Measures of Central Tendency.
1. To condense a mass of data into one single value. That is, to get a single value which is best
representative of the data (that describes the characteristics of the entire data). Measures of
central tendency, by condensing masses of data into one single value, enable us to get an idea of
the entire data. Thus one value can represent thousands of observations or even more.
2. To facilitate comparison. Statistical devices like averages, percentages and ratios are used for this
purpose. Measures of central tendency, by condensing masses of data into one single value,
facilitate comparison. For example, to compare two classes A and B, instead of comparing
each student's result, which is infeasible, we can compare the average marks of the two classes.
There are many types of measures of central tendency, each possessing particular properties and
each being typical in some unique way. The most frequently encountered ones are :-
Computed averages: Mean (Arithmetic Mean, Geometric Mean and Harmonic Mean)
Positional averages: Median and Quantiles (Quartiles, Deciles, Percentiles)
Mode
Summation Notation
The sum X1+X2+…+Xn is denoted using the Greek letter ∑ (sigma) as
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n$$
and
$$\sum_{i=1}^{n} X_i Y_i = X_1 Y_1 + X_2 Y_2 + \dots + X_n Y_n$$
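$$\sum_{i=1}^{n} (X_i \pm c) = \sum_{i=1}^{n} X_i \pm nc$$
$$\sum_{i=1}^{n} C X_i = C \sum_{i=1}^{n} X_i, \text{ where } C \text{ is a constant.}$$
$$\sum_{i=1}^{n} a = na, \text{ where } a \text{ is a constant.}$$
From now onwards we will use ∑X in place of $\sum_{i=1}^{n} X_i$ just for simplicity.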
3.1.1. Mean
1. Arithmetic Mean
Simple Arithmetic Mean:-is the sum of all observations divided by total number of observations.
For a sample of n observations X1, X2, …, Xn the sample mean is denoted by $\bar{X}$ (X-bar) and
calculated as follows.
$$\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$$
Example 1: The high temperatures for a 7-day week during December at Haramaya University
were 29, 31, 28, 32, 29, 27, and 55. Find the mean high temperature for the
week.
Solution: $\bar{X} = \frac{\sum X}{n} = \frac{29 + 31 + 28 + 32 + 29 + 27 + 55}{7} = \frac{231}{7} = 33$.
Example2: The amounts of drops of water in drip irrigation were registered from 43 sample drip
holes in one day and the data are as follows:
The algebraic sum of the deviations of each value from the arithmetic mean is zero. That is,
$\sum (X - \bar{X}) = 0$.
The sum of the squares of the deviations from the mean is less than the sum of the squares of
the deviations about any other value A.
That is, $\sum (X - \bar{X})^2 \le \sum (X - A)^2$, $A \ne \bar{X}$.
If a constant C is added to or subtracted from each value in a distribution, then the new mean
will be $\bar{X}_{new} = \bar{X}_{old} \pm C$ respectively.
If each value of a distribution is multiplied by a constant C, the new mean will be the original
mean multiplied by C.
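These properties can be checked numerically. The sketch below uses the seven temperatures from Example 1; the constant C = 5 and the alternative value A = 30 are arbitrary choices for illustration.

```python
from statistics import mean

x = [29, 31, 28, 32, 29, 27, 55]   # temperatures from Example 1
x_bar = mean(x)

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(v - x_bar for v in x)

# Property 2: squared deviations are smallest when taken about the mean.
def sum_of_squares(about):
    return sum((v - about) ** 2 for v in x)

a = 30                              # any value other than the mean (arbitrary)
ss_mean = sum_of_squares(x_bar)
ss_other = sum_of_squares(a)

# Properties 3 and 4: adding or multiplying by a constant C.
c = 5                               # arbitrary constant
shifted_mean = mean([v + c for v in x])
scaled_mean = mean([v * c for v in x])

print(x_bar, deviation_sum, ss_mean, ss_other, shifted_mean, scaled_mean)
```

A numerical check on one data set is an illustration, not a proof, but it shows what each statement claims.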
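Combined Mean: If there are p different groups (having the same unit of measurement) with
means $\bar{X}_1, \bar{X}_2, \dots, \bar{X}_p$ and numbers of observations n1, n2, …, np respectively, then the mean of all
the observations (the combined mean) is
$$\bar{X}_C = \frac{\sum n_i \bar{X}_i}{\sum n_i} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \dots + n_p \bar{X}_p}{n_1 + n_2 + \dots + n_p}$$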
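A short sketch of the combined mean follows. The two groups (30 students with mean 60 and 20 students with mean 70) are hypothetical figures chosen for illustration, and the function name `combined_mean` is not from the notes.

```python
def combined_mean(sizes, means):
    """Mean of all observations, given each group's size and mean."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Hypothetical groups: 30 students averaging 60 and 20 students averaging 70.
sizes = [30, 20]
means = [60, 70]
result = combined_mean(sizes, means)
print(result)
```

The combined mean lies closer to the mean of the larger group than the simple average of the two group means (65) would suggest.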
While calculating the simple arithmetic mean we gave equal importance to all values. But
there are cases where the relative importance is not the same for all items. When this is the case, it is
necessary to assign them weights (i.e. relative importance) and then calculate a weighted
arithmetic mean. Let X1, X2, …, Xn be the values and W1, W2, …, Wn be the corresponding weights;
then the weighted arithmetic mean is $\bar{X}_W = \frac{\sum W X}{\sum W}$.
Example: If a final examination in a course is weighted three times as much as a quiz and a
student has a final examination grade of 85 and quiz grades of 70 & 90, find the mean grade of a
student.
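Solution: let X1 = 1st quiz = 70, X2 = 2nd quiz = 90 and X3 = final = 85, with the corresponding weights
W1 = 1, W2 = 1 and W3 = 3.
$\bar{X}_W = \frac{\sum WX}{\sum W} = \frac{1(70) + 1(90) + 3(85)}{1 + 1 + 3} = \frac{415}{5} = 83$, so the average grade of the student is 83.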
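As a quick check of the weighted mean calculation, here is a small sketch. The function name `weighted_mean` is an illustrative choice; the grades and weights are those of the example above (the final exam counts three times as much as a quiz).

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W * X) / sum(W)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grades = [70, 90, 85]   # quiz 1, quiz 2, final examination
weights = [1, 1, 3]     # the final is weighted three times as much as a quiz

print(weighted_mean(grades, weights))  # 83.0
```

The combined mean is the same computation, with the group means as the values and the group sizes as the weights.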
Arithmetic mean fulfills almost all characteristics of good measures of central tendency with the
exception that it is highly affected by extreme values. And it cannot be calculated for a FD with
open-ended classes (a FD with no lower class boundary of the first class or with no upper class
boundary of the last class or with both).
$GM = \sqrt[n]{\prod X} = \sqrt[n]{X_1 X_2 \cdots X_n}$
But this formula is used if n is small. If n is large, it is difficult to calculate the nth root. Thus to
facilitate the computation, we make use of logarithms.
$GM = \text{Antilog}\left(\frac{1}{n} \sum \log X\right)$
For ungrouped FD, $GM = \text{Antilog}\left(\frac{1}{\sum f} \sum f \log X\right)$
For grouped FD, X represents the class mark.
If the variable values are measured as ratios, proportions or percentages, and some values are
large in magnitude while others are small, then the geometric mean is a better representative of
the data than the simple average. In a “geometric series”, the most meaningful average is the
geometric mean, because the arithmetic mean is strongly influenced by the large numbers in the series.
The geometric mean is important in determining the average rate of growth and in averaging percentages, ratios
and proportions.
The disadvantage of GM is that it cannot be calculated if one or more observations are zero or
negative. It is also affected by extreme values but not to the extent of AM.
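The logarithm form of the geometric mean can be sketched as follows. This is only an illustration: the function name `geometric_mean_log` and the growth factors are assumptions made for the example, and natural logarithms are used in place of base-10 logarithms (the result is the same).

```python
import math


def geometric_mean_log(values):
    """Geometric mean computed as antilog of the mean of the logarithms."""
    if any(x <= 0 for x in values):
        # The GM cannot be calculated if an observation is zero or negative.
        raise ValueError("all observations must be positive")
    mean_log = sum(math.log(x) for x in values) / len(values)
    return math.exp(mean_log)


# Hypothetical yearly growth factors (1.10 means a 10% increase).
growth_factors = [1.10, 1.20, 1.30]
average_factor = geometric_mean_log(growth_factors)
print(f"average growth rate: {(average_factor - 1) * 100:.2f}%")
```

For growth rates, each percentage change is first turned into a factor (1 + rate), the geometric mean of the factors is taken, and 1 is subtracted at the end.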
Exercise:
1. Find the geometric mean of A) 1, 2, 3, 4, 5. B) 1, 2, 3, 4, 100. Is there a great difference
between the GM of A and that of B?
2. The price of a commodity increased by 5% from 1989 to 1990, 8% from 1990 to 1991 and by
77% from 1991 to 1992. Find the average price increase.
3. A machine depreciated by 10% each in the first two years and by 40% in the third year. Find
out the average rate of depreciation.
4. Decadal percentage growth of population in country A is given below. Find the average rate
growth.
Harmonic Mean is another specialized average which is useful in averaging variables expressed
as a rate per unit of time, such as speed or number of units produced per day. It is the reciprocal of
the arithmetic mean of the reciprocals of the numbers.
$HM = \frac{n}{\sum \frac{1}{X}} = \frac{n}{\frac{1}{X_1} + \frac{1}{X_2} + \dots + \frac{1}{X_n}}$
For n positive observations, AM ≥ GM ≥ HM.
For two positive observations, $GM = \sqrt{AM \times HM}$.
Solution: $\bar{X}_{HM} = \frac{n}{\sum \frac{1}{X}} = 3.43$
Example 2: In a small company two typists are employed. Typist A types one page in 10 minutes
and typist B types one page in 20 minutes.
a) Both are asked to type 10 pages. What is the average time taken for typing one page?
b) Both are asked to type for one hour. What is the average time taken by them for
typing one page?
Solution: a) The total time is 10(10) + 10(20) = 300 minutes for 20 pages, so the average time per page is 300/20 = 15 minutes.
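The two parts of the typist example lead to different averages, which the following sketch works out from first principles (total time divided by total pages). Variable names are illustrative. Under this reading, part (a) reduces to the arithmetic mean of the two times and part (b) to their harmonic mean.

```python
from statistics import harmonic_mean

minutes_per_page = [10, 20]  # typist A, typist B

# a) Each typist types the same number of pages (10 each).
pages_each = 10
total_minutes_a = sum(pages_each * m for m in minutes_per_page)
avg_fixed_pages = total_minutes_a / (pages_each * len(minutes_per_page))

# b) Each typist types for the same amount of time (60 minutes each).
minutes_each = 60
total_pages_b = sum(minutes_each / m for m in minutes_per_page)
avg_fixed_time = (minutes_each * len(minutes_per_page)) / total_pages_b

print(avg_fixed_pages)                   # minutes per page in case (a)
print(round(avg_fixed_time, 2))          # minutes per page in case (b)
print(round(harmonic_mean(minutes_per_page), 2))
```

When the time is held fixed, the faster typist contributes more pages, which is why the harmonic mean (about 13.33 minutes) is smaller than the arithmetic mean (15 minutes).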
Exercise:
1. Find the harmonic mean of A) 1, 2, 3, 4, 5. B) 1, 2, 3, 4, 100. Is there a great difference
between the HM of A and that of B?
2. A driver traveled 400 km per day for three days at a speed of 60, 50 and 40 kilometers per
hour. Find the average speed of the driver.
3. A student reads the first 100 pages of a book at a rate of 5 pages per hour and the next 100 pages
at a rate of 8 pages per hour. What is the student's average reading speed?
4. Suppose a train moves 100 km with a speed of 40 km per hour, then 150 km with a speed of
50 km per hour and the next 135 km with a speed of 45 km per hour. Calculate the average
speed of the train.
5. In a factory a mechanic takes 15 days to fabricate a machine, the second mechanic takes 18
days, the third takes 30 days and the fourth takes 90 days. Find the average number of days
taken by the workers to fabricate the machine.
6. Suppose a train moves 5 hours at a speed of 40 km per hour, then 3 hours at a speed of 50 km
per hour and the next 5 hours with a speed of 45 km per hour. Calculate the average speed of
the train.
3.1.2. Median
Median is the half-way point in a data set. It divides a data set into two equal parts such that half
of the numbers have values less than the median and half have values greater than the
median. Graphically, the median is the intersection of the less than and more than cumulative
frequency curves.
$F_{\tilde{X}-1}$ is the less than cumulative frequency just before the median class.
To locate the median class, first obtain the less than cumulative frequencies. From the cumulative
frequencies, select the smallest one that contains the value $\frac{n}{2}$. The median class is the class
corresponding to this smallest cumulative frequency.
Median is not influenced by extreme values. It can be calculated for a FD with open-ended classes,
and it can even be located if the data are incomplete.
Examples:
Find the median of the following data sets.
180, 201, 220, 191, 219, 209 and 220.
Solution: 4th value=209
62, 63, 64, 65, 66, 66, 68 and 78.
Solution: (4th value+5th value)/2= (65+66)/2=65.5
Find the median weight of the 40 male college students at a state university and interpret the
result.
Weight class    Frequency    Less than cumulative frequency
118-126         3            3
127-135         5            8
136-144         9            17
145-153         12           29
154-162         5            34
163-171         4            38
172-180         2            40
Total           40
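Solution: The median class is the class whose less than cumulative frequency contains the
value n/2 = 40/2 = 20. This implies that 145-153 is the median class.
$L_{\tilde{X}} = 144.5$, n = 40, $F_{\tilde{X}-1} = 17$, $f_{\tilde{X}} = 12$ and w = 9
$\tilde{X} = L_{\tilde{X}} + \left(\frac{\frac{n}{2} - F_{\tilde{X}-1}}{f_{\tilde{X}}}\right) w = 144.5 + (20 - 17) \times \frac{9}{12} = 146.75 \approx 146.8$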
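The grouped median steps (accumulate frequencies, find the median class, interpolate) can be sketched in code. This is a minimal illustration: the function name `grouped_median` is an assumption, and the class boundary is taken as the lower class limit minus 0.5, which presumes weights recorded to the nearest whole unit.

```python
# (lower limit, upper limit, frequency) for each weight class in the example.
classes = [
    (118, 126, 3), (127, 135, 5), (136, 144, 9), (145, 153, 12),
    (154, 162, 5), (163, 171, 4), (172, 180, 2),
]


def grouped_median(class_table):
    """Median of a grouped frequency distribution with equal class widths."""
    n = sum(f for _, _, f in class_table)
    width = class_table[1][0] - class_table[0][0]
    cumulative = 0  # less than cumulative frequency before the current class
    for lower, _, f in class_table:
        if cumulative + f >= n / 2:
            lower_boundary = lower - 0.5
            return lower_boundary + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency table")


print(grouped_median(classes))  # 146.75
```

The loop stops at the first class whose cumulative frequency reaches n/2, which is the median class described above.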
3.1.3. Mode
The mode denoted by X̂ , is the most frequently occurring value in a set of observations or it is
the value with the highest frequency. A data set may have one mode (uni-modal), two modes (bi-
modal), more than two modes (multi-modal) or no mode at all (i.e. when all observations are
equally frequent).
Ungrouped (individual series): Arrange the data in ascending order and take the value
appearing most frequently (the most frequent value).
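Grouped (continuous) series: In a frequency distribution, the mode is located in the class with the
highest frequency, and that class is the modal class.
Then the formula for the mode is
$\hat{X} = L_{\hat{X}} + \left(\frac{f_{\hat{X}} - f_{\hat{X}-1}}{(f_{\hat{X}} - f_{\hat{X}-1}) + (f_{\hat{X}} - f_{\hat{X}+1})}\right) w$
where $L_{\hat{X}}$ is the lower class boundary of the modal class, $f_{\hat{X}}$ is the frequency of the modal class, $f_{\hat{X}-1}$ and $f_{\hat{X}+1}$ are the frequencies of the classes just before and just after the modal class, and w is the class width.
Mode is not affected by extreme values and can be calculated for open-ended classes. But it
often does not exist, and its value may not be unique.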
Example 1: A study of the relationship between age and various visual functions (such as acuity and
depth perception) reported the following observations on the area of scleral lamina (mm²) from human
optic nerve heads (Experimental Eye Research, 1988): 2.75, 2.62, 2.74, 3.85, 2.34, 2.74, 3.93, 4.21,
3.88, 4.33, 3.46, 4.52, 2.43, 3.65, 2.78, 3.56, 3.01. Find the mean, median, mode, Q1, D5 and P75.
Solution: Check the answer (mean=3.341, median=3.46, mode=2.74, Q1=2.74, D5=3.46 &
P75=3.93)
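The mean, median and mode in this example can be verified with Python's standard `statistics` module. This is only a checking aid; quartiles, deciles and percentiles are left out because their value depends on the positional convention used.

```python
from statistics import mean, median, mode

# Area of scleral lamina (mm^2), as listed in the example.
areas = [2.75, 2.62, 2.74, 3.85, 2.34, 2.74, 3.93, 4.21, 3.88,
         4.33, 3.46, 4.52, 2.43, 3.65, 2.78, 3.56, 3.01]

print(round(mean(areas), 3))  # 3.341
print(median(areas))          # 3.46, the 9th of the 17 ordered values
print(mode(areas))            # 2.74, the only value that occurs twice
```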
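Example 2: Find the mode and interpret the result for the 40 male college students.
Solution: the highest frequency occurs in the class interval 145-153, so
$L_{\hat{X}} = 144.5$, n = 40, $f_{\hat{X}-1} = 9$, $f_{\hat{X}+1} = 5$, $f_{\hat{X}} = 12$ and w = 9
$\hat{X} = L_{\hat{X}} + \left(\frac{f_{\hat{X}} - f_{\hat{X}-1}}{(f_{\hat{X}} - f_{\hat{X}-1}) + (f_{\hat{X}} - f_{\hat{X}+1})}\right) w = 144.5 + \frac{12 - 9}{(12 - 9) + (12 - 5)} \times 9 = 144.5 + 2.7 = 147.2$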
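The grouped mode formula can be written as a small function. The name `grouped_mode` and its parameter names are illustrative assumptions; the numbers are those of the weight example.

```python
def grouped_mode(lower_boundary, f_modal, f_before, f_after, width):
    """Mode of a grouped frequency distribution by interpolation in the modal class."""
    d1 = f_modal - f_before  # excess over the class just before the modal class
    d2 = f_modal - f_after   # excess over the class just after the modal class
    return lower_boundary + d1 / (d1 + d2) * width


mode_weight = grouped_mode(lower_boundary=144.5, f_modal=12,
                           f_before=9, f_after=5, width=9)
print(round(mode_weight, 1))  # 147.2
```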
In the third chapter, we concentrated on a central value (measures of central tendency), which
gives an idea of the whole mass, that is, a complete set of values. However, the information so
obtained is neither exhaustive nor comprehensive, as the mean does not tell us whether
the observations are close to each other or far apart. Median is a positional average and has
nothing to do with the variability of the observations in a data set. Mode is the most frequently occurring
value, independent of the other values in the set. This leads us to conclude that a measure of
central tendency is not enough to give a clear idea about the data unless all observations are the
same. Moreover, two or more data sets may have the same mean and/or median but still be
quite different. So a MCT alone does not provide enough information about the nature of the data.
To illustrate this, let us consider the following four data sets: the price of a certain commodity in
four cities in five different months.
City    Price in each of the five months
A       30    30    30    30    30
B       28    29    31    30    32
C       15    5     55    45    30
D       3     5     37    30    75
Now if we calculate the mean and median for each city, we will come up with the value
30. This value implies that the price of the commodity in the four cities A, B, C and D is, on
average, the same.
But by inspection, it is apparent that the prices of the commodity in the cities differ remarkably
from one another. For city A the average describes the data exactly, for city B it is a reasonable description, but for cities C and D it is
not realistic to say the price of the commodity is 30. This means that by looking only at the
average we cannot describe the data set confidently. So, along with the average values
(measures of central tendency), we have to study the dispersion of the data.
Dispersion or variation may be defined as the extent to which values are scattered around the measures of
central tendency. Thus a measure of dispersion tells us the extent to which the values of a variable
vary about the measure of central tendency.
1. To have an idea about the reliability of the measure of central tendency. If the degree
of dispersion is large, an average is less reliable. If the value of the dispersion is small, it
indicates that a central value is a good representative of all the values in the data set.
2. To compare two or more sets of data with regard to their variability. Two or more
data sets can be compared by calculating the same measure of dispersion, provided they have the same
unit of measurement. A set with a smaller value possesses less variability, or is more uniform
(or more consistent).
3. To provide information about the structure of the data. A value of a measure of
dispersion gives an idea about the spread of the observations. Further, one can surmise
the limits within which the values in the data set lie.
4. To pave way to the use of other statistical measures. Measures of dispersion,
especially variance and standard deviation, lead to many statistical techniques like
correlation, regression, analysis of variance.
1. Range
It is the simplest and crudest measure of dispersion. Range is defined as the difference between
the largest and the smallest values in the data.
Range hardly satisfies any property of a good measure of dispersion, as it is based on the two
extreme values only, ignoring the others. It is not amenable to further algebraic treatment.
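As a small illustration of why an average alone is not enough, the sketch below computes the mean, median and range for the commodity prices in the four cities A, B, C and D given earlier. The dictionary layout and names are illustrative choices.

```python
from statistics import mean, median

# Prices of the commodity in five months for each city.
prices = {
    "A": [30, 30, 30, 30, 30],
    "B": [28, 29, 31, 30, 32],
    "C": [15, 5, 55, 45, 30],
    "D": [3, 5, 37, 30, 75],
}

# (mean, median, range) for each city; range = largest value - smallest value.
summary = {
    city: (mean(p), median(p), max(p) - min(p))
    for city, p in prices.items()
}

for city, (avg, med, rng) in summary.items():
    print(f"City {city}: mean={avg}, median={med}, range={rng}")
```

All four cities share the same mean and median, while the range separates them clearly.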
2. Quartile Deviation
3. Mean Deviation
It is the arithmetic mean of the absolute values of the deviations from some measure of central
tendency, usually the mean or the median of a distribution. Hence we have the mean deviation
about the mean, $MD(\bar{X})$, and the mean deviation about the median, $MD(\tilde{X})$.
Ungrouped Data: $MD(\bar{X}) = \frac{\sum |X - \bar{X}|}{n}$, $MD(\tilde{X}) = \frac{\sum |X - \tilde{X}|}{n}$
Grouped Data: $MD(\bar{X}) = \frac{\sum f |X - \bar{X}|}{\sum f}$, $MD(\tilde{X}) = \frac{\sum f |X - \tilde{X}|}{\sum f}$
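The variance and standard deviation are the most widely used measures of
dispersion, and both measure the average dispersion of the observations around the mean.
For a population containing N elements, the population variance ($\sigma^2$) is calculated by using the
formula $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$.
For a sample of n observations, the sample variance is $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$.
The standard deviation is the positive square root of the variance.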
If the standard deviation of the data is small, the values are concentrated near the mean, and if it is
large, the values are scattered away from the mean.
Interpretation of the Standard Deviation
If the data are a sample and the distribution is normal (bell-shaped) or
approximately normal, then the following conclusions can be reached:
approximately 68% of the scores in the sample fall within one standard deviation of the mean, i.e.
$\bar{X} \pm S$ will include approximately 68% of the data
approximately 95% of the scores in the sample fall within two standard deviations of the mean,
i.e. $\bar{X} \pm 2S$ will include approximately 95% of the data
approximately 99.7% of the scores in the sample fall within three standard deviations of the mean,
i.e. $\bar{X} \pm 3S$ will include approximately 99.73% of the data.
Even if the standard deviation is better than the variance, there is however one difficulty with it. If there
are two or more distributions of different variables (having different units of measurement), their
variability cannot be compared by comparing the values of the standard deviation.
Examples:
1) Compute the variance (S²) and standard deviation (S) for the following: 11, 12, 13, 14, 15, 16,
17, 18, 19, 20 and 21.
$S^2 = \frac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1} = \frac{2926 - (176)^2 / 11}{10} = 11$
So, $S = \sqrt{S^2} = \sqrt{11} \approx 3.317$
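The shortcut formula used in this example can be checked against the standard library. The sketch below recomputes the sums by hand and compares the result with `statistics.variance`, which also uses the n − 1 divisor for a sample.

```python
from statistics import stdev, variance

data = list(range(11, 22))  # 11, 12, ..., 21

n = len(data)
sum_x = sum(data)                  # 176
sum_x2 = sum(x * x for x in data)  # 2926

# Shortcut (computational) formula for the sample variance.
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(s2)                        # 11.0
print(variance(data))            # library value, for comparison
print(round(stdev(data), 3))     # 3.317
```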
2) Compute the variance and standard deviation for the data given below.
Observation (Xi)    32    36    40    44    48    Total
Frequency (fi)      2     5     8     4     1     20
$S^2 = \frac{\sum f_i x_i^2 - (\sum f_i x_i)^2 / \sum f_i}{\sum f_i - 1} = \frac{31376 - (788)^2 / 20}{19} = 17.31$
So, $S = \sqrt{17.31} \approx 4.16$
Class    f    m     fm    fm²
1-3      1    2     2     4
3-5      9    4     36    144
…
13-15    3    14    42    588
$S^2 = \frac{\sum f_i m_i^2 - (\sum f_i m_i)^2 / \sum f_i}{\sum f_i - 1} = \frac{7016 - (800)^2 / 100}{99} = 6.22$
$S^2 = 6.22$. So, $S = \sqrt{6.22} = 2.49$
Properties of Variance and Standard Deviation
2. If every value is multiplied by a constant C, the new variance is S²new = C²S²old and the new standard
deviation is Snew = |C|Sold.
3. When a constant C is added to (or subtracted from) each and every value, the standard
deviation and variance remain the same.
5. Coefficient of Variation
All absolute measures of dispersion have units. If two or more distributions differ in their units
of measurement, their variability cannot be compared by any of the absolute measures given
before. Also, the size of these measures of dispersion depends upon the size of the values. That
is, if the values are larger, the value of the absolute measures will also be larger. Hence,
in situations where either the two or more data sets have different units of measurement, or their
means differ sufficiently in size, absolute measures fail to be appropriate.
The coefficient of variation is a relative measure of dispersion. It is the ratio of the
standard deviation to the mean, and it is expressed as a percent.
$CV = \frac{\sigma}{\mu} \times 100\%$, for population
$CV = \frac{S}{\bar{X}} \times 100\%$, for sample
It is used for comparing the variability of two or more distributions. The distribution having less
CV is said to be less variable or more consistent or more uniform.
Since absolute measures depend on the units of measurement of the data, they fail to be
appropriate for comparing two or more groups if
1. The groups have different units of measurement.
2. The size of the data between the groups is not the same.
When either of these two conditions happens we have to use relative measures of variation. CV
is a unit less measure of variation and also takes into account the size of the means of the
distributions.
EX: Given Data Set A: 2 Meters, 4 Meters, 6 Meters
Data Set B: 1000 Liters, 800 Liters, 900Liters
Compare the variability of the two data sets using standard deviation and coefficient of variation.
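One way to carry out this comparison is sketched below, using the sample standard deviation (n − 1 divisor). The function name `coefficient_of_variation` is an illustrative choice.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Sample coefficient of variation, in percent: (S / mean) * 100."""
    return stdev(data) / mean(data) * 100


set_a = [2, 4, 6]           # meters
set_b = [1000, 800, 900]    # liters

print(stdev(set_a), stdev(set_b))
print(round(coefficient_of_variation(set_a), 2))
print(round(coefficient_of_variation(set_b), 2))
```

The standard deviations (2 meters and 100 liters) are in different units and cannot be compared directly, whereas the unitless CV values (50% for A and about 11.1% for B) indicate that data set B is relatively less variable.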
6. Standard Score(Z-score)
It is used to determine how many standard deviations a given value lies above or below the
mean, according to whether the z-score is positive or negative.
$Z = \frac{X - \mu}{\sigma}$, for Population
$Z = \frac{X - \bar{X}}{S}$, for Sample
Example: Suppose Ablakat scored 90 on a basic statistics test in which the mean and standard
deviation of the class were 70 and 10 respectively. In the second test, Meklit scored 60, where
the mean and standard deviation of the class were 56 and 4 respectively. Who is better off relative
to her class?
Solution:
$Z_{Ablakat} = \frac{90 - 70}{10} = 2.0$, $Z_{Meklit} = \frac{60 - 56}{4} = 1.0$
The score of Ablakat (90) is 2 standard deviations above the mean of her class, whereas the score of
Meklit (60) is 1 standard deviation above the mean of her class. This implies that Ablakat's score
is the better relative score when compared with Meklit's score.
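The comparison of standard scores can be expressed as a short function. The name `z_score` is an illustrative choice; the inputs are the values from the example.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


z_ablakat = z_score(90, mean=70, std_dev=10)
z_meklit = z_score(60, mean=56, std_dev=4)

print(z_ablakat, z_meklit)  # 2.0 1.0
```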
Although the terms correlation and association are often used interchangeably, correlation in a
stricter sense refers to linear correlation, and association refers to any relationship between
variables. The method used to determine the strength of an association depends on
the characteristics of the data for each variable. Data may be measured on an interval/ratio scale, an
ordinal/rank scale, or a nominal/categorical scale. These three characteristics can be thought of as
continuous, integer, and qualitative categories, respectively.
In this lesson we will deal with a bi-variate data i.e. data involving two variables.
Regression may be defined as the estimation of the unknown value of one variable from the known
values of one or more other variables. The variable whose values are to be estimated is known as the
dependent or explained variable, while the variables which are used in determining the value of the
dependent variable are called independent or predictor variables.
The regression study that involves only two variables is called simple regression and the regression
analysis that studies more than two variables is called multiple regression. If the relationship
between the two variables can be described by a straight line then the regression is known as linear
regression otherwise it is called non-linear.
The regression analysis involving only two variables and having a linear relationship is called
Simple Linear Regression. This linear relationship between the two variables is represented by a
straight line.
Regression Line (Line of Regression): is the line that gives the best estimate of one variable for
any given value of another variable. The regression line which is used to estimate the values of Y for
any given value of X is called regression line of Y on X.
Regression Equation: is a mathematical equation that defines the relationship between two
variables.
Regression of Y on X
Model: Y= α + βX + Є
α is the intercept
β is the slope
α is the value of the dependent variable when the value of the independent variable is zero.
β is the change in the value of the dependent variable when the value of the independent
variable increases by 1 unit. There is a direct linear relationship between the two variables
if β is positive, there is an indirect (inverse) linear relationship between the two variables if β is
negative, and there is no linear relationship between the two variables if β is zero.
a) Method of Estimation
The objective in the above model is to estimate the regression parameters (α and β) using the sample
data. The most common and widely used method of estimation is called Ordinary Least Squares
(OLS) which minimizes error sum of the squares.
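$\hat{Y} = \hat{\alpha} + \hat{\beta} X$
$\hat{\alpha}$ is the estimated intercept.
$\hat{\beta}$ is the estimated slope.
$\hat{\beta} = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}$, and $\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$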
2. Correlation
Most of the variables in economics and business are related. For example, price and
supply, income and expenditure, advertising expenditure and sales. Thus, in order to know the degree
and direction of such a relationship between variables, correlation analysis is important. Correlation is
a statistical tool designed to measure the degree of the relationship (degree of association)
between the variables. If changes in one variable are accompanied by changes in the other variable, then the
variables are correlated. Correlation that involves only two variables is called simple correlation.
Covariance: is a measure of the joint variation between two variables, i.e. it measures the way in
which the values of the two variables vary together. If the covariance is zero, there is no linear
relationship between the two variables.
If it is negative, there is an indirect linear relationship between them. If the covariance is positive,
there is a direct linear relationship between the variables. The sample covariance between two
variables is defined as:
$S_{xy} = \frac{1}{n - 1} \left[ \sum XY - \frac{\sum X \sum Y}{n} \right]$
The coefficient of correlation is a measure of the degree or strength of the linear association between
two variables. It is defined as a ratio of the covariance between the two variables and the product of
the standard deviations of the two variables. The sample correlation coefficient is denoted by r and
the population correlation coefficient is denoted by ρ.
$r = \frac{S_{xy}}{S_x S_y} = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$
Interpretation of r: The value of the correlation coefficient can be positive, zero or negative,
depending on the sign of the covariance between the two variables. But it always lies between -1 and +1;
that is, -1≤r≤1.
If the value of r is -1 or +1, there is a perfect negative or perfect positive linear relationship
between the variables, respectively.
If the value of r is approximately -1 or +1, there is a strong negative or strong positive linear
relationship between the variables, respectively.
If r is -0.5 (or approximately -0.5) or 0.5 (or approximately 0.5), there is moderate negative
or moderate positive linear relationship between the variables, respectively.
If the value of r is near zero, there is little or no linear relationship between the two variables.
So far, we were concerned with the problem of estimating the parameters of the regression model
and the correlation coefficient between two variables. We now consider the goodness of fit of the
estimated model to a set of data; that is, we shall find out how “well” the estimated model fits the
data.
The coefficient of determination tells how well the estimated model fits the data. For simple linear
regression (two variables case), it is defined as the square of the sample correlation coefficient, and
denoted by r². Hence r² measures the proportion or percentage of the variation in the dependent
variable explained by the independent variable. r² is a nonnegative quantity which lies
between 0 and 1, i.e., 0≤r²≤1. If it approaches 1, it indicates a good fit, and if it approaches 0, it indicates little or no
linear relationship between the variables.
Examples:
a. Given the following data on supply (X) and sales (Y) of a certain commodity
Supply (X) 60 62 65 70 73 75 71
Sales (Y) 10 11 13 15 16 19 14
a) Estimate the regression equation sales on supply and interpret the coefficients.
b) Calculate the correlation coefficient between supply and sales, and interpret it.
c) Find the coefficient of determination and interpret it.
d) Predict the amount of sales of the commodity if the supply amount is 80.
b. The following summary results are obtained from price and demand of a
commodity
c. Given n = 25, $\bar{X} = 3.95$, $\bar{Y} = 2.03$, $S_x^2 = 85.35$, $S_y^2 = 98.75$, $S_{xy} = 90$
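Solution: 1
n = 7, $\sum X = 476$, $\sum Y = 98$, $\sum XY = 6764$, $\sum X^2 = 32564$ and $\sum Y^2 = 1428$
a) $\hat{Y} = \hat{\alpha} + \hat{\beta} X$
$\hat{\beta} = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{7(6764) - (476)(98)}{7(32564) - (476)^2} = \frac{700}{1372} \approx 0.51$
$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} = 14 - 0.51(68) = -20.68$
$\hat{Y} = \hat{\alpha} + \hat{\beta} X = -20.68 + 0.51X$
b) $r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}} = \frac{700}{\sqrt{(1372)(392)}} = 0.9545$
d) $\hat{Y} = \hat{\alpha} + \hat{\beta} X = -20.68 + 0.51(80) = 20.12$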
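The regression and correlation calculations for the supply and sales data can be reproduced from the raw observations. The sketch below applies the OLS and correlation formulas directly; the variable names are illustrative. Because it keeps full precision for the slope instead of rounding to 0.51, the intercept comes out as about -20.69 rather than -20.68.

```python
import math

supply = [60, 62, 65, 70, 73, 75, 71]  # X
sales = [10, 11, 13, 15, 16, 19, 14]   # Y

n = len(supply)
sum_x, sum_y = sum(supply), sum(sales)
sum_xy = sum(x * y for x, y in zip(supply, sales))
sum_x2 = sum(x * x for x in supply)
sum_y2 = sum(y * y for y in sales)

# OLS estimates of the slope and intercept.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n

# Correlation coefficient and coefficient of determination.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2

# Predicted sales when supply is 80.
predicted = intercept + slope * 80

print(round(slope, 2), round(intercept, 2))
print(round(r, 4), round(r_squared, 3))
print(round(predicted, 2))
```

The positive slope indicates that sales increase by roughly 0.51 units for each additional unit of supply, and r² shows the share of the variation in sales explained by supply.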
3. Logistic Regression
Logistic regression analysis studies the association between a categorical dependent variable and a
set of independent (explanatory) variables. The name logistic regression is used when the dependent
variable has only two values, such as 0 and 1 or Yes and No. The name multinomial logistic
regression is usually reserved for the case when the dependent variable has three or more unique
values, such as Married, Single, Divorced, or Widowed. Although the type of data used for the
dependent variable is different from that of multiple regressions, the practical use of the procedure is
similar.
When we want to look at the relationship between a categorical dependent variable and a set of
explanatory variables (one or more), we can use the logistic regression framework. Multiple linear
regression may be used to investigate the relationship between a continuous dependent variable,
such as income, blood pressure or examination score, and a set of explanatory variables. However, socio-economic variables are very
often categorical, rather than interval scale. In many cases research focuses on models where the
dependent variable is categorical. For example, the dependent variable might be 'unemployed' or
'not', and we could be interested in how this variable is related to sex, age, ethnic group, etc. In this
case we could not carry out a multiple linear regression, as many of the assumptions of this technique
will not be met, as will be explained theoretically below. Instead we would carry out a logistic
regression.
If there is a categorical explanatory variable with two categories, it can be included in
the model as a single binary (0/1) indicator variable. However, if there is a categorical explanatory
variable with more than two categories, it is included in the model through a set of
indicator (dummy) variables, one for each category except a reference category. For example, suppose that one of the explanatory variables is marital status
with three categories: "Single", "Married", "Separated".
The chi-square distribution takes only non-negative values and is skewed to the right. We use the chi-
square distribution when we analyse categorical data. The chi-square test can be used to test the
association between two variables and to test goodness of fit.
Test of association
Example: A researcher wishes to determine whether there is a relationship between the gender of an
individual and the amount of alcohol consumed. A random sample of 68 people was selected and the
following data were obtained.
CHAPTER FOUR
Introduction
As a general concept, probability is the measure of a chance that something will occur. It is a
numerical measure with a value between 0 (0%) and 1 (100%) where the probability of 0 indicates
that the given event cannot occur and a probability of 1(100%) assures certainty of such an
occurrence.
Introduction to Set
1. Experiment: an activity or a trial that leads to well-defined results called outcomes, but for which it is
uncertain which result will occur.
2. Outcome: a particular result of an experiment.
3. Sample space: It is the set of all possible outcomes for the experiment. Each possible outcome
is called sample point. It is denoted by S.
Examples: Define the sample space for the following probability experiments.
Tossing a coin: S={H, T}
Tossing two coins: S={HH, HT, TH, TT}
Rolling a die: S={1, 2, 3, 4, 5, 6}
4. Event: An event is a subset of the sample space. In other words, an event is a set containing
sample points of a certain sample space under consideration.
Example: If we roll a fair die, then the experiment is rolling the die.
The sample space S for this experiment is
S= {1, 2, 3, 4, 5, 6}
If we are interested in the outcomes that are even numbers, then the event of our interest is E= {2, 4, 6}.
Elementary or simple event: An event having only one sample point is an elementary or simple
event.
Mutually exclusive events: Two events E1 and E2 are said to be mutually exclusive events if there is
no sample point which is common to both events E1 and E2. That means E1 ∩ E2 = ∅. Mutually
exclusive events are events which cannot happen at the same time. Example: consider the
experiment of tossing two coins. Let E1 be the event that no heads are shown, E2 be the event that one
head is shown and E3 be the event that two heads are shown. Are E1, E2 and E3 mutually exclusive?
Solution
S= {HH, HT, TH, TT}
E1= {TT}
E2= {HT, TH}
E3= {HH}
E1 ∩ E2 = E2 ∩ E3 = E1 ∩ E3 = ∅
Thus, E1 and E2, E2 and E3, E1 and E3 are mutually exclusive events.
Independent events: Two events E1 and E2 are said to be independent if the occurrence of E1 has no
effect on the occurrence of E2. That means the knowledge that event E1 has occurred gives no
information about the occurrence of the event E2. If two events are not independent, they are said to
be dependent.
Equally likely outcomes: In a certain experiment, if each outcome in the sample space has the same
chance of occurring, then we say that the outcomes are equally likely. Example: in
throwing a fair die, all possible outcomes are equally likely. That means the elements
of the sample space have the same chance to occur.
Random Variable is a variable whose values are determined by chance or with some probability. It
is denoted by a capital letter. The set consisting of all possible values of a random variable is called the
range space (Rx).
Discrete random variable: A random variable X is discrete if the number of its possible values (that is, Rx) is
finite or countably infinite.
Continuous random variable: A random variable is continuous if it assumes an uncountably infinite number of
possible values.
Probability Distribution is a listing of all possible values of a random variable together with their
corresponding probabilities. Based on the type of a random variable, a probability distribution can be
discrete or continuous.
With each possible value $x_i$ of a discrete random variable X, a number $p(x_i) = P(X = x_i)$, called the
probability of $x_i$, is associated. The numbers $p(x_i)$, $i = 1, 2, \dots$ must satisfy the following conditions.
$0 \le p(x_i) \le 1$
$\sum P(X = x_i) = 1$
This function p defined above is called the probability mass function (pmf) of the random variable X.
The collection of pairs $(x_i, p(x_i))$, $i = 1, 2, \dots$ is called the probability distribution of X.
Examples:
1. Construct a probability distribution for the number of heads observed in tossing a coin two
times.
2. Construct a probability distribution for the number of heads observed in tossing a coin three
times.
3. Construct a probability distribution for the number of girls if a family plans to have four
children.
Solutions:
Let X be the number of heads observed in tossing a coin two times. Rx={0, 1, 2}
x           0      1      2      Total
P(X = x)    1/4    2/4    1/4    1
Let X be the number of heads observed in tossing a coin three times. Rx={0, 1, 2, 3}
x           0      1      2      3      Total
P(X = x)    1/8    3/8    3/8    1/8    1
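These probability distributions can be built by listing the sample space and counting heads, as in the sketch below. It assumes a fair coin, so that all outcomes are equally likely; the function name `heads_distribution` is an illustrative choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(n_tosses):
    """Probability distribution of the number of heads in n_tosses of a fair coin."""
    outcomes = list(product("HT", repeat=n_tosses))  # the sample space
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}


for n in (2, 3):
    dist = heads_distribution(n)
    print(n, {x: str(p) for x, p in dist.items()}, "total =", sum(dist.values()))
```

The same enumeration works for the number of girls among four children if boys and girls are assumed to be equally likely.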
A continuous probability distribution is represented by the probability density function (pdf), having
the following characteristics: suppose X is continuous on an interval [a, b].
i. $f(x) \ge 0$, for all $x \in (a, b)$
ii. $\int_a^b f(x)\,dx = 1$
iii. $P(a \le X \le b) = \int_a^b f(x)\,dx$
Examples:
1. Show that each of the following functions is a pdf.
a. $f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$
b. $f(x) = \begin{cases} e^{-x}, & x \ge 0 \\ 0, & \text{otherwise} \end{cases}$
2. Find the value of b for the following function to be a pdf.
$f(x) = \begin{cases} bx^2, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$
The mean of a random variable X is known as the expected value of X, denoted by E(X). It is
defined as:
$E(X) = \sum x P(x)$, if X is a discrete r.v.
$E(X) = \int x f(x)\,dx$, if X is a continuous r.v.
The variance of the random variable X is the expected value of the square of the deviation of X from
its mean.
$\sigma^2 = E(X - \mu)^2 = \sum (x - \mu)^2 P(x)$, if X is a discrete r.v.
$\sigma^2 = E(X - \mu)^2 = \int (x - \mu)^2 f(x)\,dx$, if X is a continuous r.v.
$\sigma^2 = E(X - \mu)^2 = E(X - E(X))^2 = E(X^2) - (E(X))^2$
Examples:
1. Find the mean number of heads observed in tossing a coin three times.
2. Find the average number of girls if a family plans to have four children.
3. Find the mean of the following probability distributions.
a. $f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$
Solution:
Let X be the number of heads observed in tossing a coin three times. Rx= {0, 1, 2, 3}
x           0      1      2      3      Total
P(X = x)    1/8    3/8    3/8    1/8    1
$E(X) = \sum x\,p(x) = 0 \times \frac{1}{8} + 1 \times \frac{3}{8} + 2 \times \frac{3}{8} + 3 \times \frac{1}{8} = 1.5$
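The expected value computed above, and the variance of the same distribution, can be obtained with exact fractions. The function names below are illustrative; the pmf is the one for the number of heads in three tosses of a fair coin.

```python
from fractions import Fraction

# pmf of X = number of heads in three tosses of a fair coin.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}


def expected_value(dist):
    """E(X) = sum of x * p(x) over all possible values x."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = sum of (x - mu)^2 * p(x), with mu = E(X)."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


mu = expected_value(pmf)
var = variance(pmf)
e_x2 = sum(x ** 2 * p for x, p in pmf.items())  # E(X^2), for the shortcut formula

print(mu, var, e_x2 - mu ** 2)  # 3/2 3/4 3/4
```

The last two printed values agree, which illustrates the identity that the variance equals E(X²) minus the square of E(X).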
Binomial distribution is one of the simplest and most frequently used discrete probability
distribution and is very useful in many practical situations involving either /or types of events.
Let X be the number of successes. Then X follows a binomial distribution with parameters n, the
number of trials performed, and p, the probability of success, and we write X~Bin(n,p). Then, the
probability of getting exactly x successes in n trials is given by:
$P(X = x) = \binom{n}{x} p^x q^{n-x}$, $x = 0, 1, 2, \dots, n$.
Where p is the probability of success
q=1-p is the probability of failure
n is number of trials
x is number of successes.
This is called the Binomial Distribution. The mean of a binomial distribution is E(X)=np and
variance is V(X)=npq.
Examples:
1. Suppose a coin is tossed 10 times. What is the probability of getting
a) Exactly 3 heads
b) No head
c) At most 3 heads
d) At least 3 heads
e) More than 3 heads
Find the average and variance of the number of heads.
2. The probability of a man kicking into the goal is 2/3. If a person kicks 5 times, what is the
probability of scoring
a) At least one goal.
b) At most 3 goals.
Find the average, variance and standard deviation of the number of goals.
Solution:
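Let X be the number of heads observed in tossing a fair coin 10 times, Rx = {0, 1, 2, …, 10}
$P(X = x) = \binom{n}{x} p^x q^{n-x} = \binom{10}{x} 0.5^x\,0.5^{10-x} = \binom{10}{x} 0.5^{10}, \quad x = 0, 1, 2, \ldots, 10$
a) $P(X = 3) = \binom{10}{3} \left(\frac{1}{2}\right)^{10}$
b) $P(X = 0) = \binom{10}{0} \left(\frac{1}{2}\right)^{10}$
c) $P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)$
d) $P(X \ge 3) = P(X = 3) + P(X = 4) + \ldots + P(X = 10) = 1 - P(X < 3)$
e) $P(X > 3) = P(X = 4) + P(X = 5) + \ldots + P(X = 10) = 1 - P(X \le 3)$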
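The expressions above can be evaluated numerically. The following is a minimal sketch using only the standard library; the helper names binom_pmf and binom_cdf are chosen for illustration. It also evaluates the goal-kicking example (n = 5, p = 2/3), for which the notes give no worked solution.

```python
from math import comb, sqrt


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Bin(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


# Example 1: a fair coin tossed 10 times.
n, p = 10, 0.5
exactly_3 = binom_pmf(3, n, p)        # a) about 0.1172
no_head = binom_pmf(0, n, p)          # b) about 0.0010
at_most_3 = binom_cdf(3, n, p)        # c) about 0.1719
at_least_3 = 1 - binom_cdf(2, n, p)   # d) about 0.9453
more_than_3 = 1 - binom_cdf(3, n, p)  # e) about 0.8281
coin_mean, coin_var = n * p, n * p * (1 - p)  # 5 and 2.5

# Example 2: five kicks, each scoring with probability 2/3.
kicks, p_goal = 5, 2 / 3
at_least_one_goal = 1 - binom_pmf(0, kicks, p_goal)  # about 0.9959
at_most_3_goals = binom_cdf(3, kicks, p_goal)        # about 0.5391
goal_mean = kicks * p_goal                           # about 3.33
goal_var = kicks * p_goal * (1 - p_goal)             # about 1.11
goal_sd = sqrt(goal_var)                             # about 1.05

print(exactly_3, at_most_3, at_least_3, more_than_3)
print(at_least_one_goal, at_most_3_goals, goal_mean, goal_var, goal_sd)
```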
4.5.1.1. Application of Binomial Distribution
Evaluating binomial probabilities can be essential in many practical applications. For instance, statistical analysis in computer programming, data science and business analytics may all use the binomial distribution to evaluate the likelihood of various outcomes. Because the binomial distribution models events with two distinct outcomes, it is also useful in financial analysis and forecasting. Consider several more instances when it is useful to apply the binomial distribution:
The Poisson distribution is a discrete probability distribution. It differs from the binomial distribution in the sense that it is not possible to count the number of failures even though the number of successes is known.
Properties of Poisson distribution:
1. The probability of success, p, is very small.
2. The experiment is performed indefinitely (n is very large).
3. The average number of events per unit of time ($\lambda$) is known.
Thus, the random variable X (number of successes) has a Poisson distribution with parameter $\lambda$, written X ~ Poisson($\lambda$), and the probability of getting x successes is given by
$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$
where $\lambda$ is the average number of events per unit of time.
If X is a Poisson random variable, then E(X) = $\lambda$ and V(X) = $\lambda$.
Examples:
1. On average a typist commits 3 errors per page. Find the probability that she will make
a) No mistake.
b) More than one mistake.
2. Customers arrive at a photocopying machine at an average rate of two every 10 minutes. What is the probability that there will be
a) No arrivals during any period of ten minutes.
b) Exactly one arrival during this time period.
c) More than two arrivals during this time period.
Solution:
$X \sim \text{Poisson}(3), \quad P(X = x) = \frac{3^x e^{-3}}{x!}$
a) $P(X = 0) = \frac{3^0 e^{-3}}{0!} = e^{-3}$
b) $P(X > 1) = P(X = 2) + P(X = 3) + \ldots = 1 - P(X \le 1)$
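Numerical values for both Poisson examples can be obtained with a short sketch like the one below. The helper name poisson_pmf is illustrative; the photocopier example uses $\lambda = 2$ arrivals per ten minutes, as stated in the question.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam ** x * exp(-lam) / factorial(x)


# Example 1: typing errors, lambda = 3 per page.
no_mistake = poisson_pmf(0, 3)                              # about 0.0498
more_than_one = 1 - poisson_pmf(0, 3) - poisson_pmf(1, 3)   # about 0.8009

# Example 2: photocopier arrivals, lambda = 2 per ten minutes.
no_arrivals = poisson_pmf(0, 2)                               # about 0.1353
exactly_one = poisson_pmf(1, 2)                               # about 0.2707
more_than_two = 1 - sum(poisson_pmf(k, 2) for k in range(3))  # about 0.3233

print(no_mistake, more_than_one)
print(no_arrivals, exactly_one, more_than_two)
```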
4.5.2.1. Application of Poisson distribution
The Poisson distribution can be practically applied to several business operations that are common
for companies to engage in. As noted above, analyzing operations with the Poisson distribution can
provide company management with insights into levels of operational efficiency and suggest ways to
increase efficiency and improve operations. Here are some of the ways that a company might utilize
analysis with the Poisson distribution.
Check for adequate customer service staffing. Calculate the average number of customer service calls per hour that require more than 10 minutes of handling. Then, use the Poisson distribution to find the probable maximum number of calls per hour that might come in requiring more than ten minutes of handling. Assuming that the maximum number of 10+ minute calls occurs, evaluate whether customer service staffing is adequate to handle all the calls without making customers wait on hold.
Use the Poisson formula to evaluate whether it is financially viable to keep a store open 24 hours a day. Calculate the average number of sales made by the store during the overnight shift (the period from midnight to 8 A.M.). Then, using the distribution formula, calculate the probable lowest number of sales that might be made during the overnight shift. Finally, determine whether that lowest probable sales figure represents sufficient revenue to cover all the costs (wages and salaries, electricity, etc.) of keeping the store open during that time period, while also providing a reasonable profit.
Review and evaluate business insurance coverage. Determine the average number of losses or claims that occur each year and that are covered by the company's business insurance.
Review the cost of your insurance and the coverage it provides. Consider whether perhaps you're overpaying, that is, paying for a coverage level that you probably don't need, given the probable maximum number of claims. Alternatively, you may find that you're underinsured: if what the Poisson distribution shows as the probable highest number of claims actually occurred one year, your insurance coverage would be inadequate to cover the losses.
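The notes do not define "probable maximum" precisely. One possible reading, assumed here purely for illustration, is the smallest count k such that P(X ≤ k) reaches a chosen level such as 0.99. The sketch below uses a hypothetical average of 4 long calls per hour; neither the level nor the average comes from the notes.

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(lam ** x * exp(-lam) / factorial(x) for x in range(k + 1))


def probable_maximum(lam, level=0.99):
    """Smallest k with P(X <= k) >= level (an assumed definition of 'probable maximum')."""
    k = 0
    while poisson_cdf(k, lam) < level:
        k += 1
    return k


average_long_calls = 4.0  # hypothetical average per hour, for illustration only
max_calls = probable_maximum(average_long_calls)
print(max_calls)
```

With these assumed inputs, staffing would be judged against the returned count rather than against the average of 4 calls.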
The hyper-geometric distribution is a discrete probability distribution that describes the probability of x successes (draws in which the object drawn has some specified feature) in n draws, without replacement, from a finite population of size N that contains exactly M objects having that feature, where each draw either succeeds or fails. The hyper-geometric distribution arises when one samples from a finite population without replacement, thus making the trials dependent on each other. There are five characteristics of a hyper-geometric experiment.
$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$
Where:
N: population size
M: number of objects in population with a certain feature
n: sample size
x: number of objects in sample with a certain feature
Example 1
There are 4 Queens in a standard deck of 52 cards. Suppose we randomly pick a card from the deck, then, without replacement, randomly pick another card from the deck. What is the probability that both cards are Queens? To answer this, we can use the hyper-geometric distribution with the following parameters: N = 52, M = 4, n = 2 and x = 2.
Solution
$P(X = 2) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}} = \frac{\binom{4}{2}\binom{52-4}{2-2}}{\binom{52}{2}} = \frac{6 \times 1}{1326} = 0.00452$
Example 2
An urn contains 3 red balls and 5 green balls. You randomly choose 4 balls. What is the probability
that you choose exactly 2 red balls?
To answer this, we can use the hyper-geometric distribution with the following parameters: N = 8, M = 3, n = 4 and x = 2.
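Both examples can be evaluated with a short function. This is a minimal sketch using the standard library; the name hypergeom_pmf is illustrative, and the arguments follow the symbols N, M, n and x defined earlier.

```python
from math import comb


def hypergeom_pmf(x, N, M, n):
    """P(X = x): x featured objects in n draws without replacement
    from a population of N objects, M of which have the feature."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


two_queens = hypergeom_pmf(2, N=52, M=4, n=2)  # 6 / 1326, about 0.00452
two_red = hypergeom_pmf(2, N=8, M=3, n=4)      # 30 / 70, about 0.4286

print(round(two_queens, 5), round(two_red, 4))
```

For the urn, there are 3 ways to choose the two red balls, 10 ways to choose the two green balls, and 70 ways to choose any 4 of the 8 balls.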
The hyper-geometric test uses the hyper-geometric distribution to measure the statistical significance of having drawn a sample consisting of a specific number x of successes (out of n total draws) from a population of size N containing M successes.
The uniform distribution is a symmetric probability distribution where all outcomes have an equal
likelihood of occurring. All values in the distribution have a constant probability, making them
uniformly distributed. This distribution is also known as the rectangular distribution because of its
shape in probability distribution plots.
The uniform distribution is a probability distribution in which every value in an interval from a to b is equally likely to occur.
The uniform distribution gets its name from the fact that the probabilities for all outcomes are the
same. Unlike a normal distribution with a hump in the middle or a chi-square distribution, a uniform
distribution has no mode. Instead, every outcome is equally likely to occur. Unlike a chi-square
distribution, there is no skewness to a uniform distribution. As a result, the mean and
median coincide. Since every outcome in a uniform distribution occurs with the same relative
frequency, the resulting shape of the distribution is that of a rectangle.
If a random variable X follows a uniform distribution on the interval from a to b, then the probability that X takes on a value between $x_1$ and $x_2$ (where $a \le x_1 < x_2 \le b$) can be found by the following formula:
$P(x_1 < X < x_2) = \frac{x_2 - x_1}{b - a}$
Analysts can use the uniform distribution to approximate new processes when there is insufficient data to estimate the actual distribution of outcomes. In other cases, analysts use this distribution because it's a close approximation and the formula is simple.
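As a small illustration of how simple the calculation is, the sketch below computes a uniform probability as the length of the sub-interval divided by the length of the whole interval. The function name uniform_prob and the bus-waiting numbers are hypothetical and are not taken from the notes.

```python
def uniform_prob(x1, x2, a, b):
    """P(x1 < X < x2) for X uniform on [a, b]; the sub-interval is clipped to [a, b]."""
    if b <= a:
        raise ValueError("b must be greater than a")
    low, high = max(x1, a), min(x2, b)
    return max(high - low, 0.0) / (b - a)


# Hypothetical example: a bus arrives at a time uniformly distributed
# between 0 and 20 minutes from now.
a, b = 0.0, 20.0
p_between_5_and_10 = uniform_prob(5, 10, a, b)  # (10 - 5) / (20 - 0)
mean = (a + b) / 2                              # mean of a uniform distribution
variance = (b - a) ** 2 / 12                    # variance of a uniform distribution

print(p_between_5_and_10, mean, variance)
```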
The most often used continuous probability distribution is the normal distribution. This distribution
plays a very important role in statistical theory and practice, particularly in the area of statistical
inference and statistical quality control. Its importance is due to the fact that in practice, the
experimental results, very often seem to follow the normal distribution or bell shaped curve.
A random variable X is said to have a normal distribution if its probability density function is given
by
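$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty, \; -\infty < \mu < \infty, \; \sigma > 0$
Where $\mu = E(X)$ and $\sigma^2 = \text{Variance}(X)$.
$\mu$ and $\sigma^2$ are the parameters of the normal distribution.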
1. It is bell shaped, symmetrical about its mean, and mesokurtic. The maximum ordinate is at $x = \mu$ and is given by
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}$
2. It is asymptotic to the x-axis, i.e., it extends indefinitely in either direction from the mean.
3. It is a continuous distribution.
4. It is a family of curves, i.e., every unique pair of mean and standard deviation defines a different
normal distribution. Thus, the normal distribution is completely described by two parameters:
mean and standard deviation.
5. It is unimodal, i.e., values mound up only in the center of the curve.
6. Mean = Median = Mode
Note: To facilitate the use of the normal distribution, the following distribution, known as the standard normal distribution, was derived by using the transformation
$Z = \frac{X - \mu}{\sigma}$
$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2}$
Properties of the Standard Normal Distribution:
Mean is zero
Variance is one
Standard Deviation is one
The total area under the (standard) normal curve is 1. Hence, the area to the right and left of
the center value (µ=0) of the standard normal distribution is 0.5 (as it is symmetric about 0).
Examples:
1. Find the area under the standard normal distribution which lies
a) Between Z = 0 and Z = 0.96
Solution:
Area $= P(0 < Z < 0.96) = 0.3315$
Solution (area between Z = -1.45 and Z = 0):
Area $= P(-1.45 < Z < 0) = P(0 < Z < 1.45) = 0.4265$
Solution (area to the right of Z = -0.35):
Area $= P(Z > -0.35) = P(-0.35 < Z < 0) + P(Z > 0) = P(0 < Z < 0.35) + P(Z > 0) = 0.1368 + 0.50 = 0.6368$
Solution (area to the left of Z = -0.35):
Area $= P(Z < -0.35) = 1 - P(Z > -0.35) = 1 - 0.6368 = 0.3632$
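Areas read from the standard normal table can be cross-checked with Python's standard library, which provides the cumulative distribution function of the normal distribution. This is only a checking aid, a minimal sketch; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

area_0_to_096 = Z.cdf(0.96) - Z.cdf(0)      # about 0.3315
area_m145_to_0 = Z.cdf(0) - Z.cdf(-1.45)    # about 0.4265
area_right_of_m035 = 1 - Z.cdf(-0.35)       # about 0.6368
area_left_of_m035 = Z.cdf(-0.35)            # about 0.3632

print(round(area_0_to_096, 4), round(area_m145_to_0, 4))
print(round(area_right_of_m035, 4), round(area_left_of_m035, 4))
```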
Solution
$P(Z < z) = 0.9868$
$P(Z < 0) + P(0 < Z < z) = 0.9868$
$0.50 + P(0 < Z < z) = 0.9868$
$P(0 < Z < z) = 0.9868 - 0.50 = 0.4868$
and from the table
$P(0 < Z < 2.22) = 0.4868$
$z = 2.22$
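Finding z from a given area is the reverse of a table lookup. As a hedged cross-check, the standard library's inverse cumulative distribution function can be used; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Find z such that the area to the left of z is 0.9868.
z = Z.inv_cdf(0.9868)
print(round(z, 2))

# Going back the other way recovers the area between 0 and z.
area_0_to_z = Z.cdf(z) - 0.5
print(round(area_0_to_z, 4))
```

To two decimal places, the value of z is about 2.22.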
3. A random variable X has a normal distribution with mean 80 and standard deviation 4.8. What is the probability that it will take a value
a) Less than 87.2
b) Greater than 76.4
c) Between 81.2 and 86.0
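Solution
a) $P(X < 87.2) = P\left(\frac{X - \mu}{\sigma} < \frac{87.2 - \mu}{\sigma}\right) = P\left(Z < \frac{87.2 - 80}{4.8}\right) = P(Z < 1.5)$
$= P(Z < 0) + P(0 < Z < 1.5) = 0.50 + 0.4332 = 0.9332$
b) $P(X > 76.4) = P\left(\frac{X - \mu}{\sigma} > \frac{76.4 - \mu}{\sigma}\right) = P\left(Z > \frac{76.4 - 80}{4.8}\right) = P(Z > -0.75)$
$= P(Z > 0) + P(0 < Z < 0.75) = 0.50 + 0.2734 = 0.7734$
c) $P(81.2 < X < 86.0) = P\left(\frac{81.2 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{86.0 - \mu}{\sigma}\right) = P\left(\frac{81.2 - 80}{4.8} < Z < \frac{86.0 - 80}{4.8}\right) = P(0.25 < Z < 1.25)$
$= P(0 < Z < 1.25) - P(0 < Z < 0.25) = 0.3944 - 0.0987 = 0.2957$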
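The three probabilities can also be computed directly, without standardising by hand, as a check on the table work. This is a minimal sketch using the standard library; small differences in the fourth decimal place arise because table values are rounded before being subtracted.

```python
from statistics import NormalDist

X = NormalDist(mu=80, sigma=4.8)

p_less_than_872 = X.cdf(87.2)               # a) about 0.9332
p_greater_than_764 = 1 - X.cdf(76.4)        # b) about 0.7734
p_between = X.cdf(86.0) - X.cdf(81.2)       # c) about 0.2956 (0.2957 from rounded table values)

# The same z-scores used in the hand calculation.
z_a = (87.2 - 80) / 4.8   # 1.5
z_b = (76.4 - 80) / 4.8   # -0.75

print(round(p_less_than_872, 4), round(p_greater_than_764, 4), round(p_between, 4))
```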
Companies use different statistical methodologies and calculations to help them make strategic decisions that optimize operations and return on investment. One method of analysis employs normal distribution charts or graphs to determine how different values in a given dataset relate to the data's average. For a career in accounting, finance, business or analysis, understanding how the normal distribution works is an essential skill.
This type of distribution can help finance professionals, such as market researchers and stock market
traders, determine whether the price of the assets is fair. A price above the curve indicates an
overvaluation of an asset in comparison with similar commodities or resources. When a price falls
below the average, the asset has been under-priced. Determining if a company has an asset they have
overvalued, underpriced or priced fairly can help other companies and traders make effective
decisions.
Many industries and companies incorporate this type of distribution analysis into their business
decision-making processes. It can provide valuable insights into customer behaviours, market trends
and purchasing patterns. Among the industries to use this type of distribution analysis are:
The exponential distribution is a probability distribution that is used to model the time we must
wait until a certain event occurs.
How long does a shop owner need to wait until a customer enters his shop?
How long will a laptop continue to work before it breaks down?
How long will a car battery continue to work before it dies?
How long do we need to wait until the next volcanic eruption in a certain region?
In each scenario, we're interested in calculating how long we'll have to wait until a certain event occurs. Thus, each scenario could be modeled using an exponential distribution.
Mean: $1/\lambda$
Variance: $1/\lambda^2$
Example 1
Suppose the mean number of minutes between eruptions for a certain geyser is 40 minutes. We would calculate the rate as λ = 1/μ = 1/40 = 0.025.
Example 2
A new customer enters a shop every two minutes, on average. After a customer arrives, find the
probability that a new customer arrives in less than one minute.
Solution 1: The average time between customers is two minutes. Thus, the rate can be calculated as:
λ = 1/μ
λ = 1/2
λ = 0.5
$P(X \le x) = 1 - e^{-\lambda x}$
$P(X \le 1) = 1 - e^{-0.5(1)}$
$P(X \le 1) = 0.3935$
The probability that we'll have to wait less than one minute for the next customer to arrive is 0.3935.
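The calculation in both exponential examples can be reproduced with a few lines. This is a minimal sketch; the function name exp_cdf is illustrative, and it implements the cumulative probability $P(X \le x) = 1 - e^{-\lambda x}$ used in the solution.

```python
from math import exp


def exp_cdf(x, lam):
    """P(X <= x) for an exponential waiting time with rate lam."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0


# Example 1: geyser with a mean of 40 minutes between eruptions.
geyser_rate = 1 / 40  # 0.025 eruptions per minute

# Example 2: a customer arrives every two minutes on average.
lam = 1 / 2
p_within_one_minute = exp_cdf(1, lam)  # about 0.3935
mean_wait = 1 / lam                    # 2 minutes
variance_wait = 1 / lam ** 2           # 4

print(geyser_rate, round(p_within_one_minute, 4), mean_wait, variance_wait)
```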
The exponential distribution is used to predict the amount of waiting time until the next event (i.e., success, failure, arrival, etc.). For example, we may want to predict the following:
The amount of time until the customer finishes browsing and actually purchases something in
your store (success).
The amount of time until the hardware on AWS EC2 fails (failure).
The amount of time you need to wait until the bus arrives (arrival).
Exponential distributions are commonly used in calculations of product reliability, or the length of
time a product lasts.
There are many applications of exponential functions in business and economics. Below are examples where an exponential function is used to model and predict cost and revenue:
If a population's growth rate is proportional to the number in the population, then we say that the population grows exponentially.
If the rate of decay of a substance is proportional to the amount of substance present, then the substance will follow an exponential decay model.
The compound interest formula follows an exponential model.