MH8131 Probability and Statistics
Chapter 2 Descriptive statistics
In this chapter we describe several techniques for visualizing data, as well as
for computing quantities that summarize it effectively. Such quantities are known
as descriptive statistics. As we will see in the following chapters, these statistics
can often be interpreted within a probabilistic framework, but they are also useful
when probabilistic assumptions are not warranted. Because of this, we present
them from a deterministic point of view.
1 Histogram
When we speak of the shape of a data set, we are usually referring to the shape
exhibited by an associated graphical display, such as a histogram. The shape can
tell us a lot about any underlying structure to the data, and can help us decide
which statistical procedure we should use to analyze them.
DEFINITION 1 A histogram is constructed by first deciding on a set of classes, or
bins, which partition the real line into a set of boxes into which the data values
fall. Then vertical bars are drawn over the bins with height proportional to the
number of observations that fell into the bin.
In other words, it is a plot of the frequency table with the bins on the x-axis
and the count (or proportion) on the y-axis. The width of the bins is a parameter
that can be adjusted to yield higher or lower resolution. If we interpret the data
as corresponding to samples from a random variable, then the histogram would
be a piecewise constant approximation to their pmf or pdf.
Figure 1 shows two histograms computed from temperature data gathered
at a weather station in Oxford over 150 years. Each data point represents the
maximum temperature recorded in January or August of a particular year. Figure
2 shows a histogram of the GDP per capita of all countries in the world in 2014
according to the United Nations.
There are many ways to plot histograms in R, and one of the easiest is with
the hist function. The following R code produces the two histograms above.
Figure 1: Histograms of temperature data taken in a weather station in Oxford
over 150 years. Each data point equals the maximum temperature recorded in a
certain month in a particular year.
Figure 2: Histogram of the GDP per capita of all countries in the world in 2014.
>temp=read.table("/Users/lym/Desktop/plott/temp.txt", fill = TRUE, header = TRUE)
>t1=temp[temp$mon==1,] # select the first month (January)
>t8=temp[temp$mon==8,] # select the eighth month (August)
>tt1=t1[!is.na(t1$degC), ] # remove the missing values
>tt8=t8[!is.na(t8$degC), ] # remove the missing values
>ss=c(tt1$degC,tt8$degC) # put the two months' data together
>h=hist(ss,breaks = 20) # plot the histogram
>cuts <- cut(h$breaks, c(-Inf,15,Inf)) # break the x axis into two parts
>plot(h, col=cuts, xlab = "Degrees", main = "") # add the colors
>legend('topright', legend=c("Jan", "Aug"), # add the legend
fill=c("black", "red"))
>install.packages("readxl")
>library("readxl")
>gdp=read_xlsx("/Users/lym/Desktop/plott/GDPPCconstant-USD-countries.xlsx", col_names = TRUE)
>extract=as.matrix(gdp[,which(gdp[2,]==2014)])[-c(1:2),]/1000
>h=hist(extract, breaks = 30, col = "red", xlim = c(0,200),
main = "GDP per capita of all countries", xlab = "Thousands of dollars")
Watch what happens when we change the bins (with the breaks argument to hist). To
produce a relative frequency histogram instead of a frequency histogram, add freq =
FALSE to the call.
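For instance, assuming the vector ss created in the temperature code above, the following sketch compares a coarser binning with a density-scaled version:

hist(ss, breaks = 10)                 # coarser bins, lower resolution
hist(ss, breaks = 40, freq = FALSE)   # finer bins; bar heights on the density scale, areas sum to one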
2 Measures of Center
DEFINITION 2 The sample mean is denoted by x̄ (read "x-bar") and is simply the
arithmetic average of the observations:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
It is appropriate for use with data sets that are not highly skewed and do not contain
extreme observations.
◃ Good: natural, easy to compute, has nice mathematical properties
◃ Bad: sensitive to extreme values
You can calculate the sample mean of a data vector x with the R command
mean(x).
DEFINITION 3 The sample median is another popular measure of center and is
denoted med(x) or x̃. To calculate its value, first sort the data into an increasing
sequence of numbers. If the data set has an odd number of observations then
med(x) is the value of the middle observation, which lies in position (n + 1)/2;
otherwise, there are two middle observations and med(x) is the average of those
middle values.
One desirable property of the sample median is that it is resistant to extreme
observations, in the sense that the value of med(x) depends only on the values of
the middle observations. It is quite a robust estimate of location since it is not
influenced by outliers (extreme cases) that could skew the results. The same
cannot be said for the sample mean.
◃ Good: resistant to extreme values, easy to describe
◃ Bad: not as mathematically tractable, need to sort the data to calculate
You can calculate the sample median of x with the R command median(x).
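As a quick illustration with made-up numbers, a single extreme value shifts the sample mean substantially but leaves the sample median almost unchanged:

x <- c(2, 3, 5, 7, 11)   # small made-up data set
mean(x)                  # 5.6
median(x)                # 5
x2 <- c(x, 1000)         # add one extreme observation
mean(x2)                 # about 171.3, pulled up by the outlier
median(x2)               # 6, barely affected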
3 Measures of Spread
DEFINITION 4 The sample variance of x1 , x2 , · · · , xn is denoted s2 and is calcu-
lated with the formula
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
The sample standard deviation is $s = \sqrt{s^2}$.
Intuitively, the sample variance is approximately the average squared distance of
the observations from the sample mean. The sample standard deviation is used to
scale the estimate back to the measurement units of the original data. Empirical
Rule: If data follow a bell-shaped curve, then approximately 68%, 95%, and 99.7%
of the data are within 1, 2, and 3 standard deviations of the mean, respectively.
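As a rough check of the Empirical Rule, here is a small sketch using simulated bell-shaped data (not the data sets above):

set.seed(1)
x <- rnorm(10000)              # simulated bell-shaped data
z <- (x - mean(x)) / sd(x)     # standardize
mean(abs(z) <= 1)              # close to 0.68
mean(abs(z) <= 2)              # close to 0.95
mean(abs(z) <= 3)              # close to 0.997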
You might be wondering why the normalizing constant is 1/(n − 1) instead of
1/n. The reason is that this ensures that the expectation of the sample variance
equals the true variance when the data are iid. In practice there is not much
difference between the two normalizations.
From the console we may compute the sample range with range(x) and the
sample variance with var(x), where x is a numeric vector. The sample standard
deviation is sqrt(var(x)) or just sd(x).
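The difference between the two normalizations can be checked directly; var and sd use the 1/(n − 1) version (a small sketch with made-up numbers):

x <- c(4, 8, 6, 5, 3)               # small made-up sample
n <- length(x)
sum((x - mean(x))^2) / (n - 1)      # 3.7, the same as var(x)
var(x)
sum((x - mean(x))^2) / n            # 2.96, the 1/n version, slightly smaller
sd(x)                               # square root of var(x)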
4 Outlier
DEFINITION 5 An outlier is any value that is very distant from the other values
in a data set. If a value falls beyond three standard deviations away from the
mean, that data point is identified as an outlier.
Being an outlier in itself does not make a data value invalid or erroneous. Still,
outliers are often the result of data errors such as mixing data of different units
(kilometers versus meters) or bad readings from a sensor. When outliers are the
result of bad data, the mean will be a poor estimate of location, while the
median will still be valid.
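As a minimal sketch of the three-standard-deviation rule in Definition 5 (the data vector here is made up):

x <- c(rnorm(100, mean = 50, sd = 2), 500)    # made-up measurements plus one bad reading
flag <- abs(x - mean(x)) > 3 * sd(x)          # values beyond 3 standard deviations of the mean
x[flag]                                       # the flagged outlier(s)
mean(x)                                       # distorted by the bad reading
median(x)                                     # still close to 50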
What do we do about outliers? They merit further investigation. The primary
goal is to determine why the observation is outlying, if possible. If the observation
is a typographical error, then it should be corrected before continuing. If the
observation is from a subject that does not belong to the population of interest,
then perhaps the datum should be removed. Otherwise, perhaps the value is
hinting at some hidden structure to the data.
DEFINITION 6 (ANOMALY DETECTION) In contrast to typical data analysis,
where outliers are sometimes informative and sometimes a nuisance, in
anomaly detection the points of interest are the outliers, and the greater mass
of data serves primarily to define the "normal" against which anomalies are
measured.
5 Order statistics
In some cases, a data set is well described by its mean and standard deviation.
For example, in January the temperature in Oxford is around 6.73°C, give or take
2°C. This is a pretty accurate account of the temperature data from the previous
section.
However, imagine that someone describes the GDP data set in Figure 2 as:
Countries typically have a GDP per capita of about $16,500 give or take $25,300.
This description is pretty terrible. The problem is that most countries have
very small GDPs per capita, whereas a few have really large ones and the sample
mean and standard deviation don’t really convey this information. Order statistics
provide an alternative description, which is usually more informative when there
are extreme values in the data.
DEFINITION 7 (Quantiles and percentiles). Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ denote the or-
dered elements of a set of data $\{x_1, x_2, \ldots, x_n\}$. The $q$ quantile of the data for $0 < q
< 1$ is $x_{([q(n+1)])}$, where $[q(n+1)]$ is the result of rounding $q(n+1)$ to the closest
integer. The $q$ quantile is also known as the $100\,q$ percentile.
The 0.25 and 0.75 quantiles are known as the first and third quartiles, whereas
the 0.5 quantile is known as the sample median. A quarter of the data are smaller
than the 0.25 quantile, half are smaller (or larger) than the median and three
quarters are smaller than the 0.75 quantile. If $n$ is even, the sample median is
usually set to
$$\frac{x_{(n/2)} + x_{(n/2+1)}}{2}. \qquad (5.1)$$
The difference between the third and the first quartile is known as the interquar-
tile range (IQR).
For the GDP data set, the median is $6,350. This means that half of the
countries have a GDP of less than $6,350. In contrast, 71% of the countries
have a GDP per capita lower than the sample mean! The IQR of these data is
$18,200. To provide a more complete description of the data set, we can list a
five-number summary of order statistics: the minimum x(1) , the first quartile, the
sample median, the third quartile and the maximum x(n) . For the GDP data set
these are $130, $1,960, $6,350, $20,100, and $188,000 respectively.
You can calculate the sample quantiles of any order p where 0 < p < 1 for
a data set stored in a data vector x with the quantile function, for instance, the
command quantile(x, probs = c(0, 0.25, 0.37)) will return the smallest observation,
the first quartile q̃0.25, and the 0.37 quantile q̃0.37. For q̃p simply change
the values in the probs argument to the value p.
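For instance, assuming the vector extract of GDP values (in thousands of dollars) computed in the histogram code earlier, the five-number summary and the IQR can be obtained as follows; note that quantile interpolates between order statistics by default, so its output may differ slightly from the rounding rule in Definition 7:

quantile(extract, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)  # five-number summary
IQR(extract, na.rm = TRUE)                                         # interquartile range
summary(extract)                                                   # similar information, plus the mean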
6 Boxplot
We can visualize the main order statistics of a data set by using a box plot.
DEFINITION 8 Boxplots, introduced by Tukey [Tukey-1977], are based on per-
centiles and give a quick way to visualize the distribution of data.
The median is shown by the horizontal line in the box. The dashed lines,
referred to as whiskers, extend from the top and bottom to indicate the range
for the bulk of the data. The bottom and top of the box are the first and third
quartiles. The lower whisker is a line extending from the bottom of the box to
the smallest value within 1.5 IQR of the first quartile. The upper whisker extends
from the top of the box to the largest value within 1.5 IQR of the third quartile.
Values beyond the whiskers are considered outliers and are plotted separately.
Figure 3 applies box plots to visualize the temperature data set used in Figure 1.
Each box plot corresponds to the maximum temperature in a particular month
(January, April, August and November) over the last 150 years. The box plots
allow us to quickly compare the spread of temperatures in the different months.
Figure 4 shows a box plot of the GDP data from Figure 2. From the box plot
it is immediately apparent that most countries have very small GDPs per capita,
that the spread between countries increases for larger GDPs per capita and that
a small number of countries have very large GDPs per capita.
Figure 3: Box plots of the Oxford temperature data set used in Figure 1.
The R format is boxplot(x), where x is the data.
t1=temp[temp$mon==1,] # select the first month
t4=temp[temp$mon==4,]
t8=temp[temp$mon==8,]
t11=temp[temp$mon==11,]
tt1=t1[!is.na(t1$degC), ]$degC # remove the missing values and keep the temperature column
tt4=t4[!is.na(t4$degC), ]$degC
tt8=t8[!is.na(t8$degC), ]$degC
tt11=t11[!is.na(t11$degC), ]$degC
ma=list(tt1,tt4,tt8,tt11) # put the four months' temperatures together
boxplot(ma,names = c("Jan","Apr","Aug","Nov"),ylab="Degrees (Celsius)") # plot the boxplots
and, for the GDP data,
install.packages("readxl")
library("readxl")
gdp=read_xlsx("/Users/lym/Desktop/plott/GDPPCconstant-USD-countries.xlsx",col_names = TRUE)
gdp1=gdp[,which(gdp[2,]==2014)][-c(1:2),]/1000
boxplot(gdp1,ylim=c(0,60),ylab="Thousands of dollars")
You will need to change the path to point to the folder where the GDP data file is stored.
Figure 4: Box plot of the GDP per capita of all countries in the world in 2014.
Not all of the outliers are shown.
7 Sample Covariance
In the previous sections we mostly considered data sets consisting of one-
dimensional data. In machine learning lingo, there was only one feature per
data point. We now study a multidimensional scenario, where there are several
features associated with each data point.
The sample covariance quantifies whether the two features of a two-dimensional
data set tend to vary in a similar way on average, just as the covariance quantifies
the expected joint variation of two random variables.
Figure 5: Scatterplot of the temperature in January and in August (left) and of the
maximum and minimum monthly temperature (right) in Oxford over the last 150
years.
DEFINITION 9 (Sample covariance). Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be a data set
where each example consists of a measurement of two different features. The
sample covariance is defined as
$$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$
In order to take into account that each individual feature may vary on a dif-
ferent scale, a common preprocessing step is to normalize each feature, dividing
it by its sample standard deviation. If we normalize before computing the covari-
ance, we obtain the sample correlation coefficient of the two features. One of
the advantages of the correlation coefficient is that we don't need to worry about
the units in which the features are measured. In contrast, measuring a feature
representing distance in inches rather than in miles can severely distort the covariance
if we don't scale the other feature accordingly.
DEFINITION 10 (Sample correlation coefficient). Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be
a data set where each example consists of a measurement of two different
features. The sample correlation coefficient is defined as
$$\rho_{xy} = \frac{S_{xy}}{s_x s_y},$$
where $s_x$ is the sample standard deviation of the $x$ values and $s_y$ is defined analogously.
By the Cauchy-Schwarz inequality the magnitude of the sample correlation
coefficient is bounded by one. If it is equal to 1 or -1, then the two centered data
sets are collinear.
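The quantities in Definitions 9 and 10 correspond to the built-in functions cov and cor; a quick check with two made-up vectors:

x <- c(1, 2, 4, 7)                                     # made-up feature 1
y <- c(2, 1, 5, 9)                                     # made-up feature 2
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)   # sample covariance by hand
cov(x, y)                                              # the same value
cov(x, y) / (sd(x) * sd(y))                            # sample correlation coefficient
cor(x, y)                                              # the same value, always in [-1, 1]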
Figure 5 is annotated with the sample correlation coefficients corresponding
to the two plots. Maximum and minimum temperatures within the same month
are highly correlated, whereas the maximum temperatures in January and August
within the same year are only somewhat correlated.
The table below, called a correlation matrix, shows the correlation between the
daily returns for telecommunication stocks from July 2012 through June 2015.
From the table, you can see that Verizon (VZ) and AT&T (T) have the highest
correlation. Level Three (LVLT), which is an infrastructure company, has the
lowest correlation. Note the diagonal of 1s (the correlation of a stock with itself
is 1), and the redundancy of the information above and below the diagonal.
Correlation between telecommunication stock returns
T CTL FTR VZ LVLT
T 1.000 0.475 0.328 0.678 0.279
CTL 0.475 1.000 0.420 0.417 0.287
FTR 0.328 0.420 1.000 0.287 0.260
VZ 0.678 0.417 0.287 1.000 0.242
LVLT 0.279 0.287 0.260 0.242 1.000
You may evaluate the correlations and produce a scatter plot matrix in R as follows,
assuming the data frames sp500_px (daily returns) and sp500_sym (symbols and
sectors) are available:
etfs <- sp500_px[row.names(sp500_px) > "2012-07-01",
                 sp500_sym[sp500_sym$sector == "etf", "symbol"]]
cor(etfs)    # correlation matrix
plot(etfs)   # scatter plot matrix
You could also use the data set mtcars, which is available in the R environment, to illustrate this yourself.
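For instance, a small sketch with a few of the mtcars variables:

data(mtcars)                                    # built-in data set
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]  # a few numeric features
round(cor(vars), 3)                             # correlation matrix
pairs(vars)                                     # scatter plot matrix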
8 Sample Covariance matrix and PCA
We now consider sets of multidimensional data. In particular, we are interested
in analyzing the variation in the data. The sample covariance matrix of a data set
contains the pairwise sample covariance between every pair of features.
DEFINITION 11 Let x1 , · · · , xn be a set of p dimensional real-valued data vectors.
The sample covariance matrix of these data is the p × p matrix
$$S_n = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T,$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
The $(i,j)$ entry of the covariance matrix is given by
$$(S_n)_{ij} = \begin{cases} \text{the sample variance of the $i$th entries of } x_1, \ldots, x_n, & \text{when } i = j,\\ \text{the sample covariance of the $i$th and $j$th entries of } x_1, \ldots, x_n, & \text{when } i \neq j. \end{cases}$$
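In R, applying cov to a matrix or data frame with one observation per row returns exactly this $p \times p$ matrix; a small sketch with made-up data:

set.seed(2)
X <- matrix(rnorm(100 * 3), ncol = 3)   # n = 100 data points, p = 3 features (made up)
S <- cov(X)                             # 3 x 3 sample covariance matrix
diag(S)                                 # sample variances of the individual features
S[1, 2]; cov(X[, 1], X[, 2])            # off-diagonal entry = pairwise sample covariance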
In order to characterize the variation of a multidimensional data set around
its center, we consider its variation in different directions. The average variation
of the data in a certain direction is quantified by the sample variance of the
projections of the data onto that direction. Let v be a unit-norm vector aligned
with a direction of interest. The sample variance of the data set in the direction
of v is given by
$$\frac{1}{n-1}\sum_{i=1}^{n}\left(v^T x_i - \overline{v^T x}\right)^2
= \frac{1}{n-1}\sum_{i=1}^{n}\left(v^T (x_i - \bar{x})\right)^2
= v^T \left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T\right] v
= v^T S_n v, \qquad (8.1)$$
where
$$\overline{v^T x} = \frac{1}{n}\sum_{i=1}^{n} v^T x_i.$$
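Equation (8.1) can be checked numerically: the sample variance of the projections onto a unit-norm vector $v$ equals $v^T S_n v$. A sketch with made-up two-dimensional data:

set.seed(3)
X <- cbind(rnorm(200), rnorm(200, sd = 2))   # made-up data, one point per row
v <- c(1, 1) / sqrt(2)                       # unit-norm direction
var(X %*% v)                                 # sample variance of the projections
t(v) %*% cov(X) %*% v                        # v^T S_n v, the same value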
Consider the eigendecomposition of the covariance matrix $S_n$:
$$S_n = U \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix} U^T,$$
where the matrix of eigenvectors is $U = (u_1, \ldots, u_p)$ and the eigenvalues are ordered
so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$.
THEOREM 1 For the above eigendecomposition of the covariance matrix $S_n$, we have
$$\lambda_1 = \max_{\|v\|=1} v^T S_n v, \qquad u_1 = \arg\max_{\|v\|=1} v^T S_n v,$$
and, for $k = 2, \ldots, p$,
$$\lambda_k = \max_{\|v\|=1,\; v \perp u_1, \ldots, u_{k-1}} v^T S_n v, \qquad u_k = \arg\max_{\|v\|=1,\; v \perp u_1, \ldots, u_{k-1}} v^T S_n v.$$
This means that u1 is the direction of maximum variation. The eigenvector u2
corresponding to the second largest eigenvalue λ2 is the direction of maximum
variation that is orthogonal to u1 .
In data analysis, the eigenvectors of the sample covariance matrix are usually
called principal directions. Computing these eigenvectors to quantify the variation
of a data set in different directions is called principal component analysis (PCA).
Figure 6 shows the principal directions for several 2D examples.
Figure 6: PCA of a set consisting of n = 200 two-dimensional data points with
different configurations.
Figure 7 illustrates the importance of centering before applying PCA. If the
data are concentrated around a point that is far from the origin, the first principal
direction tends to be aligned with that point. This makes sense, as projecting onto
that direction captures more energy. As a result, the principal directions do not
reflect the directions of maximum variation within the cloud of data. Centering
the data set before applying PCA solves the issue.
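As a small sketch with made-up data, the principal directions can be computed as the eigenvectors of the sample covariance matrix; prcomp centers the data by default and returns the same directions (possibly with flipped signs):

set.seed(4)
x1 <- rnorm(200)
X <- cbind(x1, x1 + 0.5 * rnorm(200))   # correlated, made-up 2D data
eigen(cov(X))$vectors                   # principal directions u1, u2
eigen(cov(X))$values                    # variances along those directions
prcomp(X)$rotation                      # same directions up to sign (data centered by prcomp)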
EXAMPLE 1 You can compute principal components in R using the princomp
function. The following performs a PCA on the stock price returns for Chevron
(CVX) and ExxonMobil (XOM):
oil_px <- sp500_px[, c('CVX', 'XOM')]
pca <- princomp(oil_px)
pca$loadings
Loadings:
Comp.1 Comp.2
CVX -0.747 0.665
XOM -0.665 -0.747
summary(pca)
Importance of components:
Comp.1 Comp.2
Standard deviation 0.8820924 0.3007512
Proportion of Variance 0.8958580 0.1041420
Cumulative Proportion 0.8958580 1.0000000
The weights for CVX and XOM for the first principal component are -0.747 and
-0.665, and for the second principal component they are 0.665 and -0.747. How do we
interpret this? The first principal component is essentially an average of CVX and
XOM, reflecting the correlation between the two energy companies. The second
principal component measures when the stock prices of CVX and XOM diverge.
Using the sample covariance matrix we can express the variation in every direction.
Figure 7: PCA applied on the left to the data without centering. As a result, the
dominant principal direction u1 lies in the direction of the mean of the data and
PCA does not reflect the actual structure. u1 becomes aligned with the direction
of maximal variation for the centered data.
9 QQ plot
DEFINITION 12 A QQ plot is a plot used to visualize how close the distribution of
a sample is to a normal distribution.
To compare data to a standard normal distribution, you subtract the mean and then
divide by the standard deviation; this is also called normalization or standardization.
Write
$$z = \frac{x - \bar{x}}{s}.$$
The resulting values are commonly referred to as z-scores.
A QQ-Plot is used to visually determine how close a sample is to the normal
distribution. The QQ-Plot orders the z-scores from low to high and plots each
value's z-score on the y-axis; the x-axis is the corresponding quantile of a normal
distribution for that value's rank. Since the data are normalized, the units corre-
spond to the number of standard deviations of the data away from the mean. If
the points roughly fall on the diagonal line, then the sample distribution can be
considered close to normal. Figure 8 shows a QQ-Plot for a sample of 100 values
randomly generated from a normal distribution; as expected, the points closely
follow the line. This figure can be produced in R with the qqnorm function:
norm_samp <- rnorm(100)
qqnorm(norm_samp)
abline(a=0, b=1, col='grey')
In contrast, in practice one tail of a distribution is often longer than the other,
particularly in finance. A good example to illustrate the long-tailed nature of data
is stock returns. Figure 9 shows the QQ-Plot for the daily stock returns for
Netflix (NFLX).
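We do not reproduce the NFLX returns here, but a heavy-tailed sample, for instance one drawn from a t distribution with few degrees of freedom, shows the same effect: the points bend away from the line in the tails. A sketch:

set.seed(5)
heavy <- rt(100, df = 2)                  # heavy-tailed sample, a stand-in for stock returns
z <- (heavy - mean(heavy)) / sd(heavy)    # z-scores
qqnorm(z)
abline(a = 0, b = 1, col = 'grey')        # the tails deviate from the line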
Figure 8: QQ-Plot of a sample of 100 values drawn from a normal distribution.
Figure 9: QQ-Plot of the returns for NFLX.