Normality, Standard Deviation, Symmetry
The Emergence of the Normal Distribution
As in many areas of statistics, the emergence of the normal distribution is associated with controlling chance. It begins with a question that a professional gambler named Antoine Gombaud asked Blaise Pascal, his close friend and one of the most important mathematicians in the history of science, in the 1650s.
Scenario
A coin will be tossed 5 times in a row. If tails come up more often, the first player wins the game; if heads come up more often, the second player wins. Suppose both players bet 500 liras each. When the score is 2 tails and 1 head (i.e. the first player is ahead), imagine that one of the players receives a phone call and must stop the game immediately.
There is a total bet of 1,000 liras, and the two players have different opinions about it:
The first player claims he deserves the full bet, since he is ahead at the moment the game is stopped.
The second player, on the other hand, wants the game to be terminated by returning the initial stakes to the players; since the game has not been completed, no player should take the entire bet.
The question gambler Antoine Gombaud asked mathematician Blaise Pascal was: what would you decide if you were the referee?
At the time the question was asked, as mentioned earlier under the title "Pre-Statistical Period", predictions about what might happen in the future were left to fortune-tellers, clergy, astrologers, and augurs. Mathematics did not yet have a say in matters of the future. Blaise Pascal turned to his friend, the French mathematician Pierre de Fermat, for help, and together they developed the following solution proposal.
Fair Solution
The fairest solution is to determine the possible completion scenarios of the remaining 2 tosses, count how many possible scenarios there are, and distribute the bet according to the number of scenarios each player wins.
Current state   Last two tosses   Result   Winner
2T 1H           TT                4T 1H    1st Player
2T 1H           HT                3T 2H    1st Player
2T 1H           TH                3T 2H    1st Player
2T 1H           HH                2T 3H    2nd Player
As can be seen, there are 4 possible outcomes; in 3 of them the first player wins, and in 1 the second player wins. Pascal decides that 750 liras, which is 3/4 of the bet, should be paid to the first player, and 250 liras, which is 1/4 of the bet, should be paid to the second player.
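The enumeration above is easy to reproduce in code; below is a minimal Python sketch of the same count (the variable names and structure are illustrative, not from the source):

```python
# Enumerate the 4 ways the remaining 2 tosses can land and count
# how many of them leave the first player (tails) ahead.
from itertools import product

current_tails, current_heads = 2, 1  # score when the game is interrupted
bet = 1000

outcomes = list(product("TH", repeat=2))  # TT, TH, HT, HH
wins_first = sum(
    1 for a, b in outcomes
    if current_tails + (a + b).count("T") > current_heads + (a + b).count("H")
)

print(bet * wins_first / len(outcomes))        # 750.0 liras to the first player
print(bet - bet * wins_first / len(outcomes))  # 250.0 liras to the second player
```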
Having solved his gambling friend's problem, Pascal focused on the question: "What would happen if there were more tosses to be completed?" He saw that the counts of possible outcomes follow the coefficients of the binomial triangle, described by the Persian mathematician Omar Khayyam (Ömer Hayyam) in the 11th century. He noticed that the histograms of possible outcomes follow a pattern:
[Figures: histograms of the possible outcomes for six more tosses, 20 more tosses, and 40 more tosses; as the number of tosses grows, the histogram approaches a bell shape.]
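A short sketch of where those histogram shapes come from: for n remaining tosses, the number of outcomes with k tails is the binomial coefficient C(n, k), which rises toward the middle and falls off symmetrically.

```python
# For n remaining tosses, the count of outcomes with k tails is C(n, k);
# printing the coefficients shows the emerging bell shape.
from math import comb

for n in (6, 20, 40):
    counts = [comb(n, k) for k in range(n + 1)]
    print(n, counts)
```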
Approximately 70 years after the death of Blaise Pascal, the French mathematician Abraham de Moivre, building on Pascal's work, proposed that the decrease of the histogram of possible outcomes to the right and left of its peak can be described by a mathematical parameter, which he named the "modulus".
This insight of Abraham de Moivre was taken up roughly 80 years later, in 1810, by one of the most influential mathematicians in history, Johann Carl Friedrich Gauss. Gauss defined a differential equation that could parameterize Pascal's histograms; its solution is the familiar bell curve:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$
Normal Distribution Function
The normal distribution forms the basis of today's modern statistics. Almost all analyses of continuous data are conducted on the basis of the normal distribution. Why normality is essential will be discussed in more depth in later chapters.
The normal distribution assumes that the median of the data (the midpoint of the sorted data), its mode (the most frequently observed value), and its mean (the sum of the data divided by the number of observations) are the same, and it works on this assumption.
The integral of the function is equal to 1. This means that the total area under the normal distribution curve is 1, representing the sum of all probabilities for the analyzed situation. The function is continuous and extends to infinity. Continuity enables it to generate probability values for observations at any distance from the mean. The infinity property admits the existence of a probability, however small, no matter how far you get from the mean. For example, the average height of adult American males is 178 centimeters and is considered normally distributed. The normal distribution function assigns a probability even to adult American males who are 240 centimeters or 110 centimeters tall, whether or not such men exist.
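A minimal sketch of this "probability everywhere" idea with scipy (assumed installed); the 178 cm mean is from the text, while the 7 cm standard deviation is an assumed figure for illustration only:

```python
# Tail probabilities far from the mean are astronomically small, but never zero.
from scipy.stats import norm

height = norm(loc=178, scale=7)  # sd of 7 cm is an assumption, not from the text
print(height.sf(240))   # P(X >= 240 cm): tiny, but positive
print(height.cdf(110))  # P(X <= 110 cm): likewise tiny, but positive
```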
The most crucial feature of the normal distribution is symmetry. As can be clearly seen in the figure, equal heights (equal probabilities) are observed when moving an equal distance to the right and to the left of the peak. Continuing with the example above: the average height of American adult males is 178 cm, which means the number of men 177 cm tall and the number of men 179 cm tall are equal. Not only that, the same holds for men 175 cm and 181 cm tall, and likewise for
170 cm and 186 cm
160 cm and 196 cm
146 cm and 210 cm.
As you will appreciate, symmetry is a very difficult property to establish in statistics.
Standard Deviation
The standard deviation is the most commonly used measure of dispersion (scatter) in statistics. It shows how the data are distributed around the mean. It plays a very important role in the differential equation (the normal distribution function) put forward by Gauss: the inflection points of the curve, where its slope changes direction (where the second derivative changes sign), lie exactly one standard deviation from the mean.
The probability density function of a normally distributed data series with a mean of 80 and a standard deviation of 20 is shown in the figure. In the region from the mean (80) out to one standard deviation (±20), i.e. the range [60, 100], the curve bends inward (it is concave); outside the one-standard-deviation range (values less than 60 and greater than 100), it bends outward (it is convex).
That is, data near the mean (within the ±1 standard deviation range) tend to be "aggregated", while data far from the mean (outside the ±1 standard deviation range) tend to "scatter". The standard deviation marks the point where this aggregate-scatter separation begins.
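A small symbolic check of this property, using sympy (assumed installed): solving f''(x) = 0 for the normal density returns exactly μ ± σ.

```python
# Verify that the inflection points of the normal density sit at mu ± sigma.
import sympy as sp

x, mu = sp.symbols("x mu", real=True)
sigma = sp.symbols("sigma", positive=True)

# Normal probability density function
f = sp.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sp.sqrt(2 * sp.pi))

# Points where the second derivative vanishes, i.e. the curve changes direction
print(sp.solve(sp.Eq(sp.diff(f, x, 2), 0), x))  # [mu - sigma, mu + sigma]
```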
The calculation was proposed by Karl Pearson in 1895. Particular attention should be paid to this point: the second derivative of Gauss's normal distribution curve and the approach proposed by Pearson have no direct mathematical connection. To understand why this calculation is appropriate, one must know the empirical rule.
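For reference, the standard deviation in its usual sample form (the text does not reproduce Pearson's formula) is computed as:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}$$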
Empirical Rule
If a population conforms fully to the normal distribution, the following three rules are observable:
68% of the data should lie within ±1 standard deviation of the mean;
95% of the data within ±2 standard deviations of the mean;
99.7% of the data within ±3 standard deviations of the mean.
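These percentages can be checked numerically with scipy (assumed installed):

```python
# Probability mass within k standard deviations of the mean of a normal curve.
from scipy.stats import norm

for k in (1, 2, 3):
    mass = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k} sd: {mass:.4f}")
# within ±1 sd: 0.6827, within ±2 sd: 0.9545, within ±3 sd: 0.9973
```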
Absolute Deviation from Mean
Apart from the standard deviation proposed by Karl Pearson, another measure of dispersion, not as widely used, is the absolute deviation from the mean (mean absolute deviation). Its formula is given below:

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right|$$

The reason the standard deviation, rather than the absolute deviation from the mean, is used as the parameter that describes the dispersion is that the result produced by the standard deviation agrees better with the empirical rule.
Skewness
Any case where symmetry cannot be observed in the histogram is called skewness. If the median, mode, and mean of the data differ from each other, this indicates skewness in the data. When the skewness in a data set exceeds a certain level, that data set cannot be evaluated with the normal distribution function. Let us now discuss the approaches used to detect symmetry and skewness.
Coefficient of Variation
The coefficient of variation is obtained by dividing the standard deviation of a data set by its arithmetic mean and multiplying by 100.
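Written as a formula, with s the standard deviation and x̄ the arithmetic mean:

$$CV = \frac{s}{\bar{x}} \times 100$$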
A coefficient of variation greater than 30 indicates skewness rather than symmetry.
Normality Tests
In statistical programs such as SPSS, Stata, Minitab, and R, the Kolmogorov-Smirnov, Anderson-Darling, and Shapiro-Wilk tests can be used to determine normality and give an idea of whether the distribution is normal or not.
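A minimal sketch of running the same three tests in Python with scipy (assumed installed), on hypothetical data:

```python
# Small p-values suggest the sample deviates from normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=80, scale=20, size=200)  # hypothetical data

print(stats.shapiro(sample))                # Shapiro-Wilk
print(stats.anderson(sample, dist="norm"))  # Anderson-Darling
print(stats.kstest(sample, "norm", args=(sample.mean(), sample.std())))  # K-S
```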
Skewness-Kurtosis
In the same programs, a decision about normality can also be made by looking at skewness and kurtosis. Calculated values in the range [-2, 2] are an indicator of normality.
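Skewness and kurtosis can likewise be computed directly, e.g. with scipy (assumed installed):

```python
# Values within roughly [-2, 2] are taken in the text as an indicator of normality.
from scipy.stats import skew, kurtosis

data = [49, 56, 57, 58, 58, 58, 59, 60, 60, 61]  # hypothetical sample
print(skew(data), kurtosis(data))  # kurtosis() returns excess kurtosis
```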
Median, Mean and Range
Another method that gives a simple but satisfying indication of the normality of a distribution is the inequality below:

$$\left|\,\text{median} - \text{mean}\,\right| < \frac{\max - \min}{20}$$

If this inequality does not hold, it indicates skewness rather than normality.
Outliers
Within a data array, values that are much smaller or much larger than the rest are called outliers. If these very small or very large values are not equidistant from the middle (from the mean or the median, for the normal distribution), the extreme value that is far away pulls the mean towards itself, distorting the distribution. Since it is assumed that normality can be established when such data are removed from the data set, these values, which are expected to be discarded and ignored, are called "outliers". For example, suppose the heights of 7 girls in a class are as follows:
163; 168; 166; 170; 158; 164; 194
The average height of the female students in this class is 169 cm, and their sample standard deviation is about 11.7. However, as you may notice, there is one very tall girl in this class, the last student. She is probably a basketball or volleyball player. This student "breaks" the expected symmetry of the girls' heights, because the mean of the heights (169 cm) has moved away from their middle, the median (166 cm). By the median-mean-range check above, the distribution of heights of the girls in this class is not symmetrical:

|166 - 169| < (194 - 158)/20
3 ≮ 1.8
In order to bring the mean closer to the median and make the distribution of the female students' heights resemble the normal distribution, the 194 cm student is ignored, i.e. excluded. The remaining 6 students now have an average height of 164.83 cm and a sample standard deviation of about 4.2; the mean is now very close to the median (165 cm). We can now assume that the heights of the girls in this class are normally distributed.
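The numbers in this example can be re-checked with Python's standard statistics module:

```python
# Mean, sample standard deviation, and median before and after the exclusion.
import statistics

heights = [163, 168, 166, 170, 158, 164, 194]
print(statistics.mean(heights))    # 169.0
print(statistics.stdev(heights))   # ~11.68 (sample standard deviation)
print(statistics.median(heights))  # 166

trimmed = [h for h in heights if h != 194]  # exclude the outlier
print(statistics.mean(trimmed))    # ~164.83
print(statistics.stdev(trimmed))   # ~4.21
print(statistics.median(trimmed))  # 165.0
```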
The purpose of exclusion is to establish normality so that the arithmetic mean and standard deviation become usable.
Exclusion is described in the statistical literature by John Wilder Tukey in his 1977 book Exploratory Data Analysis. According to this book, exclusion is only appropriate in two cases:
Measurement error: there may be an error of the measuring instrument or of the person measuring, or the measured value may have been recorded incorrectly.
Belonging to another group: the observation being measured may not be a member of the measured group. For example, in the previous example, while the class in which the measurement was made consists of second-year students of a medical faculty, the 194 cm tall girl may be a second-year student of the sports sciences faculty.
Except for these two cases, it is not appropriate or correct to exclude data.
Detecting Outliers
Outliers are found by following the steps below:
The interquartile range (IQR = Q3 - Q1) is calculated.
(1) 1.5 times the IQR is added to the 3rd quartile (Q3).
(2) 1.5 times the IQR is subtracted from the 1st quartile (Q1).
Values greater than (1) or less than (2) are flagged as outliers. The flagged outliers can be removed from the data set after certain checks.
Excluding Outliers
If the detected outliers are one-sided (that is, there are values only above (1) or only below (2)), and the total number of outliers is less than 5% of the number of observations, they can be discarded.
Identify potential outliers in the data set below and
decide whether they can be discarded.
49 60 63 64 67 69 74
56 60 63 64 67 69 76
57 61 63 64 67 71 77
58 61 63 64 67 72 77
58 62 63 65 68 72 77
58 62 64 65 68 72 77
59 63 64 67 69 73 79
First, the median, first quartile, and third quartile are found.
The median is the middle value of the sorted data array; in this data set it is 64.
The first quartile (Q1) is the middle value of the lower half of the sorted series (between the smallest value and the median); in this data set it is 62.
The third quartile (Q3) is the middle value of the upper half of the sorted series (between the median and the largest value); in this data set it is 69.
Next, the interquartile range (IQR) is calculated.
Q3-Q1 = 7
(1) and (2) are calculated.
(1) Q3 + 1.5 * IQR = 69 + 10.5 = 79.5
(2) Q1 - 1.5 * IQR = 62 - 10.5 = 51.5
Values greater than (1) and less than (2) are determined: there is no value greater than (1); 49 is less than (2).
It is checked whether the detected outliers are one-sided and whether they number less than 5% of the observations. The outliers here are one-sided, because there is no outlier above (1). And the number of outliers is 1, which is less than 5% of the number of observations (49 × 0.05 = 2.45). As a result, 49 is an outlier and can be omitted from the data set.
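A minimal sketch of the same Tukey-fence computation with numpy (assumed installed), applied to the 49-value data set above:

```python
# Compute quartiles, IQR, fences, and flag outliers.
import numpy as np

data = np.array([
    49, 60, 63, 64, 67, 69, 74,
    56, 60, 63, 64, 67, 69, 76,
    57, 61, 63, 64, 67, 71, 77,
    58, 61, 63, 64, 67, 72, 77,
    58, 62, 63, 65, 68, 72, 77,
    58, 62, 64, 65, 68, 72, 77,
    59, 63, 64, 67, 69, 73, 79,
])

q1, q3 = np.percentile(data, [25, 75])          # 62.0, 69.0
iqr = q3 - q1                                   # 7.0
upper, lower = q3 + 1.5 * iqr, q1 - 1.5 * iqr   # 79.5, 51.5
print(data[(data > upper) | (data < lower)])    # [49]
```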
Box-Whisker Plot
The box-whisker plot is created from these 5 calculated values:
Median
1st quartile (Q1)
3rd quartile (Q3)
Q1 - 1.5 IQR
Q3 + 1.5 IQR
The 1st and 3rd quartile values define the boundaries of the "box". The median is marked in the middle of the box. The "whiskers" are drawn at the Q1 - 1.5 IQR and Q3 + 1.5 IQR values.
Outliers are marked individually on the plot, which thus contains visually significant clues about whether the data set is normally distributed.
In a normally distributed data array:
The median centers the box.
The lengths of the whiskers are equal.
There are no outliers.
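A minimal sketch of drawing such a plot with matplotlib (assumed installed); the sample values are hypothetical:

```python
# Draw a horizontal box-whisker plot; matplotlib marks outliers ("fliers") as points.
import matplotlib.pyplot as plt

values = [49, 57, 58, 60, 62, 63, 63, 64, 64, 65, 67, 68, 69, 72, 77, 79]
plt.boxplot(values, vert=False)
plt.title("Box-Whisker Plot")
plt.show()
```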
Z Scores
The z-score is a standardized version of a raw score (x) and gives information about its relative position within the distribution. The raw score is converted to a z-score with the formula:

$$z = \frac{x - \mu}{\sigma}$$
As you can see, it gives information about where the element of interest is located in the distribution. The z-score has 2 important components. The first is the sign of the score: it shows which side of the mean the element of interest is on. The second is the magnitude of the score: it shows how far the element of interest is from the mean, in units of standard deviation.
The Z distribution can be defined as the distribution of z-scores, with mean µ = 0 and standard deviation σ = 1. Its overall shape is the same as that of the normal distribution. For this reason, the Z distribution is also called the "standard normal distribution".
The normal distribution is a population distribution defined by its mean and standard deviation values. Therefore, an infinite number of population distributions can be specified by choosing different means and standard deviations. The Z distribution is a standard distribution whose mean and standard deviation are fixed. This makes the Z distribution easier to use in probability calculations.
One advantage of working in standard deviation units is the ease of calculating probabilities over intervals whose endpoints are an integer number of standard deviations from the mean. For example, in a normal distribution curve with a mean of 50 and a standard deviation of 10, the probability of values falling between 50 and 60, above 70, or below 40 can be easily determined.
Since the curve of the Z distribution is standard and well-behaved, the probabilities of non-integer intervals, as well as integer ranges of standard deviations, can be determined easily. Moreover, there are "standard Z value" tables prepared from these calculations. Those who do not know the calculation method can find the probabilities of given intervals on the Z distribution using these tables. A Z table can be accessed from the link below:
https://www.math.arizona.edu/~rsims/ma464/standardnormaltable.pdf
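Instead of the printed table, the same left-tail areas can be computed in code; a minimal sketch with scipy (assumed installed):

```python
# norm.cdf(z) is the area to the left of z on the standard normal curve.
from scipy.stats import norm

print(norm.cdf(2.0))    # ~0.97725, the table value for z = 2
print(norm.cdf(0.25))   # ~0.5987
print(norm.cdf(-1.63))  # ~0.0516
```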
Suppose a student gets 90 in math, 75 in science, 60 in history, and 81 in Turkish. If we were asked to determine the course in which the student is most successful, with this insufficient information we would have to answer mathematics, the course in which the student received the highest grade. However, we should determine the student's success relative to his or her class. Let us examine the table below, which shows the student's grades and the class statistics:
Class     Grade   Class Average   Standard Deviation
Maths     90      90              5
Science   75      80              10
History   60      50              5
Turkish   81      76              10
We must evaluate the student's success according to his position in the class. For this, we calculate the z-scores of the grades the student received:
Class     Grade   Class Average   Standard Deviation   Z Score
Maths     90      90              5                    0
Science   75      80              10                   -0.5
History   60      50              5                    2
Turkish   81      76              10                   0.5
We can now comment more consistently and meaningfully on the student's course performance. In mathematics, where the student got the highest grade, he is actually not very successful, because his performance is exactly at the level of the class average. In history, where the student received the lowest grade, he is very successful, because his grade is very high relative to the class. Science is the course in which the student performs worst, because he received a grade below the class average.
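A minimal re-computation of the z-scores in the table above:

```python
# z = (grade - class mean) / class standard deviation, per course.
grades = {
    # course: (grade, class mean, class standard deviation)
    "Maths":   (90, 90, 5),
    "Science": (75, 80, 10),
    "History": (60, 50, 5),
    "Turkish": (81, 76, 10),
}
for course, (x, mu, sigma) in grades.items():
    print(course, (x - mu) / sigma)
# Maths 0.0, Science -0.5, History 2.0, Turkish 0.5
```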
Now let us examine a little more deeply how successful the student was in the history course. While the class average was 50, the student got 60, and the standard deviation of the class's history grades is 5. The student is more successful than the students in the blue region of the figure. If we can calculate the size of the blue area, we can calculate by how much the student outperforms the class.
To make this calculation, open the Z table at the link given in the introduction of the z-scores topic. The table values give the area to the left of a z-score. The table value for the z-score 2 is given as 0.97725. This is the area of the blue region, and it means that the student does better than 97.72% of the students who took the history course.
The gestation period of the human species is normally distributed with a mean of 266 days and a standard deviation of 16 days. What proportion of gestation periods fall between 240 days and 270 days (roughly 8 to 9 months)?
The desired probability is that of pregnancies longer than 240 days and shorter than 270 days. For this, let us calculate the z-scores of 240 and 270 and, since the table values give the area to the left of a z-score, read off their table values:

Z270 = (270 - 266)/16 = 0.25, table value 0.5987
Z240 = (240 - 266)/16 = -1.625 ≈ -1.63, table value 0.0516
The answer is found by subtracting the table values:

0.5987 - 0.0516 = 0.5471

The area of the blue-shaded region is 54.71%. In other words, 54.71% of human pregnancies last between 8 and 9 months.
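The same interval probability can be computed without the table; a minimal scipy sketch (assumed installed):

```python
# P(240 < X < 270) for X ~ N(266, 16).
from scipy.stats import norm

p = norm.cdf(270, loc=266, scale=16) - norm.cdf(240, loc=266, scale=16)
print(p)  # ~0.5466 (the table-based answer 0.5471 differs only by rounding)
```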
In the example above, let us define the shortest 10% of all pregnancies as preterm births and the longest 10% as late births, and determine the limits of a normal pregnancy based on its duration.

The goal is now to find the z-scores bounding the "normal delivery" zone, the middle 80% of the area. For this, we need to read the Z table from a different angle: we look up the z-values whose left-tail areas are 10% and 90% (so that their difference, 90% - 10% = 80%, is the middle region). In the Z table, the 10% value corresponds to -1.28 and the 90% value corresponds to 1.28. These are the points bounding the blue region.
Now we calculate the point 1.28 standard deviations below the mean and the point 1.28 standard deviations above the mean:

266 - 1.28 × 16 ≈ 246 days
266 + 1.28 × 16 ≈ 286 days
We conclude that the 80% of human pregnancies considered normal fall between 246 and 286 days.
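The same cutoffs follow from the inverse CDF (the percent point function in scipy, assumed installed):

```python
# 10th and 90th percentiles of X ~ N(266, 16).
from scipy.stats import norm

print(norm.ppf(0.10, loc=266, scale=16))  # ~245.5 days
print(norm.ppf(0.90, loc=266, scale=16))  # ~286.5 days
```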
A large group of test scores is normally distributed with a mean of 78.2 and a standard deviation of 4.3. What percent of the students scored 85 or better (to the nearest whole percent)?
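A quick way to check this exercise with scipy (assumed installed): z = (85 - 78.2)/4.3 ≈ 1.58, and the right-tail area gives about 6%.

```python
# Right-tail probability above 85 for X ~ N(78.2, 4.3).
from scipy.stats import norm

z = (85 - 78.2) / 4.3
print(z)                                 # ~1.58
print(norm.sf(85, loc=78.2, scale=4.3))  # ~0.057, i.e. about 6%
```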