Problem Set 1 Solutions
Statistics 104
Due February 6, 2020 at 11:59 pm
Problem set policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, be sure to include
the code in your solution.
Please submit your problem set via Canvas as a PDF, along with the R Markdown source file.
We encourage you to discuss problems with other students (and, of course, with the course head and the TFs), but
you must write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If
you do collaborate with classmates on a problem, please list your collaborators on your solution.
Problem 1.
For each of the following scenarios, discuss (in at most five sentences) the main issue(s) with
respect to sampling or reporting bias.
a) A particular city has 14 architects who own their own firm. To select a survey sample, each
architect was contacted via telephone by order of appearance in the telephone directory,
then the first 8 that agreed to be interviewed formed the sample.
This is an example of sampling bias. Since architects are contacted by order of appearance
in the directory and the sampling ends once 8 have agreed to be in the sample, those listed
earlier in the directory (i.e., with names earlier in the alphabet) are more likely to be
sampled than those listed later. In a well-chosen sample, each individual in the population
has an equal chance of being selected for the sample.
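To see how unequal the selection chances can be, here is a minimal sketch under an assumption that is not part of the problem: each architect independently agrees to be interviewed with the same probability (0.8 below, a made-up value). The architect in directory position k can only be sampled if fewer than 8 of the k - 1 architects listed earlier have already agreed.

```r
# Hypothetical assumption: every architect agrees independently with probability 0.8.
p_agree <- 0.8
position <- 1:14          # order of appearance in the telephone directory
sample_size <- 8          # sampling stops once 8 architects have agreed

# P(position k is reached) = P(at most 7 of the k - 1 earlier architects agreed)
p_reached <- pbinom(sample_size - 1, size = position - 1, prob = p_agree)

# P(position k ends up in the sample) = P(reached) * P(agrees)
p_included <- p_reached * p_agree

print(round(data.frame(position, p_included), 3))
```

Under this assumption the first eight positions share the same chance of selection, and the chance falls for every position after that.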
b) The September 1992 issue of Prevention magazine included a women’s health survey; ap-
proximately 16,500 women responded to the survey. The May 1993 issue reported on the
survey results, claiming that “92% of our readers rated their health as excellent, very good,
or good”.
This is an example of inaccurate reporting; 92% of the women who responded to the survey
rated their health as excellent, very good, or good. This is not equivalent to claiming that
92% of women who read the magazine gave such ratings, or that 92% of readers did so.
The respondents are almost certainly not a random sample of the women who read the
magazine. For example, perhaps those in excellent health were more likely to respond to
the survey than not; if this were true, 92% overestimates the percentage of female readers
who rate their health as excellent, very good, or good.
c) Many scholars and policymakers are interested in estimating the prevalence of mental ill-
ness among the homeless population. In one study, the authors sampled homeless persons
who received medical attention from a clinic that was part of the Health Care for the Home-
less project, resulting in an estimated prevalence of 33%.1 The authors maintain that se-
lection bias is not a serious problem because the clinics are easily accessible to homeless
people.
Even if the clinics are easily accessible, there may still be selection bias. It is not reasonable
to assume that each person is equally likely to seek medical attention from a clinic. In
particular, individuals with a mental illness may be less likely to seek medical attention
than someone in need of general health services, due to factors such as social stigma.
Problem 2.
A recently published analysis examined 10 studies that measured optimism and pessimism by
asking participants about their level of agreement with statements like “In uncertain times, I
usually expect the best,” or “I rarely expect good things to happen to me”. Optimistic people
tend to expect that they will encounter favorable outcomes, whereas less optimistic people tend
to expect that they will encounter unfavorable outcomes.2
These studies also measured other variables on participants, including factors related to heart
disease. The analysis found that compared with pessimists, people with the most optimistic out-
look had a 35% lower risk for cardiovascular events (e.g., heart attacks). The studies, on average,
observed people over a 14-year period and compared the rate of cardiovascular events between
those classified as optimists versus pessimists.
a) A popular newspaper reports on the analysis with the headline “Thinking Positively Im-
proves Cardiovascular Health”. Write a short response to the editor explaining clearly why
the headline is potentially misleading. Be sure to use language accessible to a general audi-
ence without a statistics background. Limit your answer to at most five sentences.
The headline is potentially misleading because it interprets evidence for an association be-
tween optimism and lower risk for cardiovascular events as a causal relationship. It is not
prudent to make causal claims from observational studies since there could be unmeasured
variables (i.e., confounding variables) obscuring the true causal relationship. For example,
perhaps optimists tend to lead lifestyles that promote cardiovascular health (e.g., exercising
more, eating healthier diets), and it is these lifestyle factors that actually have a causal effect
on cardiovascular health. The analysis only demonstrates evidence that optimistic people
tend to have lower risk for cardiovascular events; it does *not* demonstrate evidence that
optimism improves cardiovascular health.
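The confounding argument above can be illustrated with a small simulation. This is only a sketch with made-up probabilities: exercise is a hypothetical confounder that makes people both more likely to be optimists and less likely to have a cardiovascular event, while optimism itself has no effect by construction.

```r
set.seed(104)  # arbitrary seed so the sketch is reproducible
n <- 100000

# Hypothetical confounder: whether a person exercises regularly
exercise <- rbinom(n, 1, 0.5)

# Exercisers are more likely to be optimists (made-up probabilities)
optimist <- rbinom(n, 1, ifelse(exercise == 1, 0.7, 0.3))

# Event risk depends on exercise only; optimism has no effect by construction
event <- rbinom(n, 1, ifelse(exercise == 1, 0.1, 0.3))

# Crude comparison: event rate for pessimists ("0") versus optimists ("1")
crude <- tapply(event, optimist, mean)

# Stratified comparison: event rate by optimism within each exercise group
stratified <- tapply(event, list(exercise = exercise, optimist = optimist), mean)

print(round(crude, 3))
print(round(stratified, 3))
```

In the crude comparison optimists show a lower event rate, yet within each exercise group the rates for optimists and pessimists are nearly equal. An observational analysis that did not measure exercise could not tell these two situations apart.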
b) Briefly describe a plausible study design that has the potential to demonstrate the effect of
thinking positively on cardiovascular health.
Randomly select a study sample from the population of interest. Randomize half the
participants to the control group and half the participants to the treatment group. The control
group will receive typical advice about behaviors that promote cardiovascular health. The
treatment group will receive typical advice in addition to attending mindfulness sessions
that teach strategies for thinking positively. Observe participants over a set period of time,
then compare the rate of cardiovascular events between the groups.
1 This project is a federally funded program that brings general health and mental health services to homeless people.
2 Alan Rozanski, MD, et al. Association of optimism with cardiovascular events and all-cause mortality. JAMA
Network Open 2019; 2(9):e1912200.
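The random assignment step in the study design for part b) can be sketched as follows. The sample size of 200 and the object names are hypothetical; the point is only that shuffling a balanced vector of labels gives each participant the same chance of landing in either group.

```r
set.seed(104)  # arbitrary seed so the assignment can be reproduced
n_participants <- 200  # hypothetical sample size
participant_id <- seq_len(n_participants)

# Shuffle a vector holding exactly n/2 "control" and n/2 "treatment" labels
group <- sample(rep(c("control", "treatment"), each = n_participants / 2))

assignment <- data.frame(participant_id, group)
print(table(assignment$group))
```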
c) Suppose someone who is very optimistic reads about the analysis and concludes that the
findings suggest he has a 35% lower risk for cardiovascular events than his friend who is
extremely pessimistic. Explain why this is not necessarily the case.
This is not necessarily the case because "the individual is not the average". Each person’s
individual risk for cardiovascular events is influenced by a myriad of factors specific to that
person, such as diet and family history. The 35% risk figure is an overall average calculated
using data from many individuals; it would be flawed to assume that all individuals have
the same risk as the average risk. In mathematical terms, for example, if $\bar{x} = 5$, this does not
imply that $x_1 = x_2 = \dots = x_n = 5$.
Problem 3.
The following graphs are based on data from the National Center for Health Statistics.
a) Describe what you see in the two graphs, with particular focus on the differences between
the two distributions.
The 1980 graph shows a unimodal distribution with a slight right skew; the mode is
around 20 years. The 2016 graph shows a bimodal distribution, with a peak around 20
years and another peak around 29 years.
b) Economists are interested in the possible causes driving the shape of the age distribution in
2016.
i. Discuss a possible reason behind the discrepancy between the 1980 distribution and
the 2016 distribution; i.e., what is a potential factor driving the difference in the distri-
butions?
Societal shifts that occurred during this time period led to women having more oppor-
tunities to join the workforce and pursue higher education. As a result, women may
choose to delay having children until completion of a professional degree or reaching a
stable point in a career.
ii. Discuss a possible reason behind the shape of the age distribution in 2016.
The bimodal shape can be due to factors such as geography. In certain areas of the US
(the coasts, predominantly), it is more common for women to delay having children un-
til their late 20s; in more conservative regions such as the South, many women choose
to have children in their early 20s.
Problem 4.
FiveThirtyEight is a data journalism site devoted to applying statistical analysis to a variety of
current topics in politics, sports, science, economics, and culture. They recently published a series
of articles on gun deaths in America, based on data collected from the Centers for Disease Control
and Prevention (and other governmental agencies) on all gun deaths in the United States from
2012 - 2014.
The main dataset used for the analysis is available as the gun_deaths data. Each case represents a
single gun death.
– month: Month of death, coded as a numerical variable taking values 1 - 12.
– intent: The underlying cause of death, either Accidental, Homicide, Suicide, or
Undetermined. Deaths from legal intervention are coded as homicides.
– police: Coded Yes if the death is a result of legal intervention by police and No otherwise.
– sex: The sex of the individual who died, either F or M.
– age: Age of the individual who died, in years.
– race: Race of the individual who died, either Asian/Pacific Islander, Black, Hispanic,
Native American/Native Alaskan, or White.
– place: Place of injury, either Home, Residential institution, School/instiution[sic],
Sports, Street, Trade/service area, Industrial/construction, Farm, Other specified, or
Other unspecified.
– education: Educational status of the individual who died, either Less than HS, HS/GED, Some
college, or BA+.
a) Using numerical and graphical summaries, describe and compare the distribution of
age (of death) between males and females.
The median age at death for females in the dataset is 44 years, which is slightly higher
than the median age at death for males at 41 years. The age distribution for females
is roughly symmetric around the center. The age distribution for males, in contrast,
shows a clear mode at just above 20 years of age.
#load the data
load("datasets/gun_deaths.Rdata")
#numerical summaries
summary(gun_deaths$age[gun_deaths$sex == "F"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 29.0 44.0 43.7 56.0 101.0 3
summary(gun_deaths$age[gun_deaths$sex == "M"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 41.00 43.88 58.00 107.00 15
#graphical summaries
boxplot(gun_deaths$age ~ gun_deaths$sex, horizontal = TRUE)
[Figure: horizontal side-by-side boxplots of gun_deaths$age (x-axis, 0 to 100) by gun_deaths$sex (F, M).]
par(mfrow = c(1, 2))
hist(gun_deaths$age[gun_deaths$sex == "F"],
main = "Age Distribution (Females)", xlab = "Age (yrs)")
hist(gun_deaths$age[gun_deaths$sex == "M"],
main = "Age Distribution (Males)", xlab = "Age (yrs)")
[Figure: two histograms, "Age Distribution (Females)" and "Age Distribution (Males)"; x-axis Age (yrs), y-axis Frequency.]
b) Which underlying cause of death contributes the most toward gun deaths?
At 63,175 deaths, suicide is the most common underlying cause of gun deaths.
#summarize intent
table(gun_deaths$intent)
##
## Accidental Homicide Suicide Undetermined
## 1639 35176 63175 807
c) Identify the proportion of deaths classified as homicides that involved legal interven-
tion by police.
The proportion of deaths classified as homicides that involved legal intervention by
police is 1402/35176 = 0.0399.
#two-way table of intent and police
addmargins(table(gun_deaths$intent, gun_deaths$police))
##
## No Yes Sum
## Accidental 1639 0 1639
## Homicide 33774 1402 35176
## Suicide 63175 0 63175
## Undetermined 807 0 807
## Sum 99395 1402 100797
#use r as a calculator
1402/35176
## [1] 0.03985672
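The same proportion can also be read off a table of row proportions instead of being typed in by hand. The sketch below rebuilds the two-way table from the counts printed above so that it runs on its own; with the data loaded, applying prop.table(..., margin = 1) to table(gun_deaths$intent, gun_deaths$police) should give the same row.

```r
# Counts copied from the two-way table of intent and police shown above
counts <- matrix(
  c(1639, 0,
    33774, 1402,
    63175, 0,
    807, 0),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    intent = c("Accidental", "Homicide", "Suicide", "Undetermined"),
    police = c("No", "Yes")
  )
)

# margin = 1 divides each cell by its row total
row_props <- prop.table(counts, margin = 1)
homicide_police <- row_props["Homicide", "Yes"]
print(round(homicide_police, 4))
```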
d) During which season did gun deaths most often occur? Assume that spring is March -
May, summer is June - August, fall is September - November, and winter is December -
February.
Gun deaths occurred most often in the summer (over 26,000 deaths).
#subset by season
spring = gun_deaths[(gun_deaths$month == 3 |
gun_deaths$month == 4 |
gun_deaths$month == 5), ]
summer = gun_deaths[(gun_deaths$month == 6 |
gun_deaths$month == 7 |
gun_deaths$month == 8), ]
fall = gun_deaths[(gun_deaths$month == 9 |
gun_deaths$month == 10 |
gun_deaths$month == 11), ]
winter = gun_deaths[(gun_deaths$month == 12 |
gun_deaths$month == 1 |
gun_deaths$month == 2), ]
#tabulate by season
spring.deaths = nrow(spring); spring.deaths
## [1] 25413
summer.deaths = nrow(summer); summer.deaths
## [1] 26449
fall.deaths = nrow(fall); fall.deaths
## [1] 25157
winter.deaths = nrow(winter); winter.deaths
## [1] 23779
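An alternative to building four separate subsets is to map each month to its season with a lookup vector and tabulate once. The month values below are a small hypothetical vector so the sketch runs without the data; with the data loaded, gun_deaths$month would take its place.

```r
# Hypothetical toy vector standing in for gun_deaths$month
month <- c(12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 7, 7)

# Season for months 1 through 12, following the definitions in part d)
season_of_month <- c("winter", "winter",
                     "spring", "spring", "spring",
                     "summer", "summer", "summer",
                     "fall", "fall", "fall",
                     "winter")

# Index the lookup vector by month number, then count deaths per season
season <- factor(season_of_month[month],
                 levels = c("spring", "summer", "fall", "winter"))
season_counts <- table(season)

print(season_counts)
print(names(which.max(season_counts)))
```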
e) Of the gun deaths in 2012 among individuals with at least a high school education,
what proportion of those individuals were white males? Be sure to account for any
missing values in the calculation.
When accounting for missing values: of the gun deaths in 2012 among individuals with
at least a high school education, 15199/25548 = 59.5% occurred among white males.
This can be done either by using is.na() to remove cases with missing education or by
subsetting and looking at a two-way table, since table() drops missing values by default.
Without accounting for missing values: the same calculation gives 15484/26040 = 59.5%.
Both counts are inflated here because logical subsetting returns a row of NAs for each
2012 case with missing education.
# CONSIDERING NAs #
#check for NAs
summary(gun_deaths$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2012 2012 2013 2013 2014 2014
summary(gun_deaths$education) # education shows 1422 NAs
## Less than HS HS/GED Some college BA+ NA's
## 21823 42927 21680 12946 1422
summary(gun_deaths$sex)
## F M
## 14449 86349
summary(gun_deaths$race)
## Asian/Pacific Islander Black
## 1326 23296
## Hispanic Native American/Native Alaskan
## 9022 917
## White
## 66237
gun_deaths_clean = gun_deaths[!is.na(gun_deaths$education), ]
nrow(gun_deaths) - nrow(gun_deaths_clean) #check that rows with NAs have been removed
## [1] 1422
#redo counts
white.male.hs.clean = gun_deaths_clean[gun_deaths_clean$race == "White" &
gun_deaths_clean$sex == "M" &
gun_deaths_clean$education != "Less than HS" &
gun_deaths_clean$year == 2012, ]
all.clean = gun_deaths_clean[gun_deaths_clean$education != "Less than HS" &
gun_deaths_clean$year == 2012, ]
nrow(white.male.hs.clean)/nrow(all.clean)
## [1] 0.5949194
#another method...
hs.2012 = gun_deaths[gun_deaths$education!= "Less than HS" & gun_deaths$year == 2012, ]
addmargins(table(hs.2012$race, hs.2012$sex))
##
## F M Sum
## Asian/Pacific Islander 69 303 372
## Black 593 4614 5207
## Hispanic 212 1439 1651
## Native American/Native Alaskan 25 172 197
## White 2922 15199 18121
## Sum 3821 21727 25548
15199/25548
## [1] 0.5949194
# WITHOUT CONSIDERING NAs #
#count number of white males with at least HS education in 2012
white.male.hs = gun_deaths[gun_deaths$race == "White" &
gun_deaths$sex == "M" &
gun_deaths$education != "Less than HS" &
gun_deaths$year == 2012, ]
nrow(white.male.hs)
## [1] 15484
#count all with at least HS education in 2012
all = gun_deaths[gun_deaths$education != "Less than HS" &
gun_deaths$year == 2012, ]
nrow(all)
## [1] 26040
#calculate proportion
nrow(white.male.hs)/nrow(all)
## [1] 0.5946237
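The difference between the two sets of counts comes from how R handles NA in a logical row index. The toy data frame below is hypothetical and is only meant to show the behavior: a comparison involving NA yields NA, and indexing rows with NA returns a row filled with NAs, which nrow() then counts.

```r
# Hypothetical toy data with two missing education values
toy <- data.frame(
  education = c("HS/GED", NA, "Less than HS", "BA+", NA),
  year = c(2012, 2012, 2012, 2012, 2013),
  stringsAsFactors = FALSE
)

# NA != "Less than HS" is NA; NA & TRUE is NA, while NA & FALSE is FALSE
keep <- toy$education != "Less than HS" & toy$year == 2012
print(keep)

# Indexing with NA adds a row of NAs, so the naive count is too large
naive <- toy[keep, ]

# which() keeps only the positions that are TRUE and drops the NAs
clean <- toy[which(keep), ]

print(nrow(naive))
print(nrow(clean))
```

Wrapping the condition in which(), or removing rows with missing values first as in the solution, avoids counting the NA rows.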
f) Compare the proportion of gun deaths due to suicide versus homicide between
race/ethnicity groups. In a few sentences, summarize the main findings.
For two race/ethnicity groups, Black and Hispanic, homicide is the most prevalent
cause of gun death: 83.75% for Black and 62.4% for Hispanic. This is counter to the
overall trend observed in part b), where we observed that overall, the most common
underlying cause of gun death is suicide.
#summary table for white individuals
whites = gun_deaths[gun_deaths$race == "White", ]
prop.table(table(whites$intent))
##
## Accidental Homicide Suicide Undetermined
## 0.017090404 0.138097107 0.835980434 0.008832055
#summary table for black individuals
blacks = gun_deaths[gun_deaths$race == "Black", ]
prop.table(table(blacks$intent))
##
## Accidental Homicide Suicide Undetermined
## 0.014079670 0.837482830 0.143028846 0.005408654
#summary table for hispanic individuals
hispanics = gun_deaths[gun_deaths$race == "Hispanic", ]
prop.table(table(hispanics$intent))
##
## Accidental Homicide Suicide Undetermined
## 0.016071824 0.624473509 0.351474174 0.007980492
#summary table for asian individuals
asians = gun_deaths[gun_deaths$race == "Asian/Pacific Islander", ]
prop.table(table(asians$intent))
##
## Accidental Homicide Suicide Undetermined
## 0.009049774 0.421568627 0.561840121 0.007541478
#summary table for native americans
native.americans = gun_deaths[gun_deaths$race == "Native American/Native Alaskan", ]
prop.table(table(native.americans$intent))
##
## Accidental Homicide Suicide Undetermined
## 0.02399128 0.35550709 0.60523446 0.01526718
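The five separate tables can also be produced in one step with a two-way table of row proportions. The data frame below is a hypothetical toy example so the sketch runs on its own; with the data loaded, gun_deaths$race and gun_deaths$intent would take the place of the toy columns.

```r
# Hypothetical toy data standing in for the race and intent columns
toy <- data.frame(
  race = c("Black", "Black", "Black",
           "White", "White", "White", "White",
           "Hispanic", "Hispanic"),
  intent = c("Homicide", "Homicide", "Suicide",
             "Suicide", "Suicide", "Suicide", "Homicide",
             "Homicide", "Suicide")
)

# margin = 1 gives, for each race, the proportion of deaths by intent
by_race <- prop.table(table(toy$race, toy$intent), margin = 1)
print(round(by_race, 3))
```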
Problem 5.
Employment statistics represent an important source of metrics for policymakers to use in gaug-
ing the overall health of the economy. In the United States, the government measures unemploy-
ment using the Current Population Survey (CPS). This survey collects demographic and employ-
ment information each month from about 60,000 occupied households.
To be eligible to participate in the survey, individuals must be 15 years of age or older and not in
the Armed Forces. One person generally responds for all members of the household; this person
is called the “reference person” and usually is the person who owns or rents the housing unit.
The CPSData.csv file contains data from the September 2016 survey. Descriptions of the relevant
variables are as follows:
– PeopleInHousehold: number of people in the household, including the respondent
– Region: the census region the respondent lives in, either Midwest, Northeast, South, or West
– State: the state the respondent lives in, either one of the 50 states or the District of Columbia
– Age: age in years of the respondent, where 80 represents individuals ages 80-84, and 85
represents individuals ages 85 and higher
– Married: marital status of the respondent, either Divorced, Married, Never Married,
Separated, or Widowed
– Sex: sex of the respondent, either Female or Male
– Education: highest educational level of the respondent, either Associate degree,
Bachelor’s degree, Doctorate degree, High school, Master’s degree, No high school
diploma, Professional degree, or Some college, no degree
– Race: race of the respondent, either American Indian, Asian, Black, Multiracial, Pacific
Islander, White
– Hispanic: coded 1 if the respondent is of Hispanic ethnicity, and 0 otherwise
– Citizenship: citizenship status of the respondent, either Citizen, Native, Citizen,
Naturalized, or Non-Citizen
– EmploymentStatus: employment status of the respondent, either Disabled, Employed, Not in
Labor Force, Retired, or Unemployed
– Industry: industry of employment, available only if the respondent is employed
a) Explore the variables Age, Sex, and Race.
i. Based on these three variables, write a short paragraph describing the basic demo-
graphics of the survey respondents. Reference numerical and graphical summaries as
appropriate.
Age is roughly evenly distributed until about age 60; there are fewer individuals of ages
older than 60 years. The median age is 39 years, and the middle 50% of individuals are
between 19 and 57 years old. The sex distribution is roughly equal, with 51% females
and 49% males. The majority of individuals are white (81%); 11% of individuals are
black, and 5% of individuals are Asian. American Indians, Pacific Islanders, and those
identifying as multiracial make up a very small proportion of the sample.
#load the data
cps = read.csv("datasets/CPSData.csv", header = TRUE)
#explore age
par(mfrow = c(1, 2))
hist(cps$Age, main = "Histogram of Age", xlab = "Age (yrs)")
boxplot(cps$Age, ylab = "Age (yrs)")
[Figure: left panel, "Histogram of Age" (x-axis Age (yrs), y-axis Frequency); right panel, boxplot of Age (yrs).]
summary(cps$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 19.00 39.00 38.83 57.00 85.00
#explore sex
prop.table(table(cps$Sex))
##
## Female Male
## 0.5139373 0.4860627
#explore race
prop.table(table(cps$Race))
##
## American Indian Asian Black Multiracial
## 0.010913771 0.049656517 0.105961828 0.022063640
## Pacific Islander White
## 0.004706707 0.806697537
ii. Do you notice anything odd about the distribution of Age? Point out what you think is
unusual.
According to the description of the data, individuals must be 15 years of age or older
to participate in the survey. Yet the distribution shows that there are many individuals
younger than 15 represented in the data who "responded" to the survey (the minimum age is 0).
b) Describe the distribution of the number of people in a household, referencing appropriate
numerical and graphical summaries.
The distribution of the number of people in a household is heavily right skewed; many
more respondents live in small households than in large ones. The
median number of people in a household is 3, and the middle 50% is between 2 and 4.
#numerical summaries
summary(cps$PeopleInHousehold)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.284 4.000 15.000
#graphical summary
hist(cps$PeopleInHousehold,
main = "Histogram of Number of People in Household",
xlab = "Number of People in Household")
[Figure: "Histogram of Number of People in Household"; x-axis Number of People in Household, y-axis Frequency.]
c) Describe the distribution of citizenship status.
A majority of people in the data are native citizens (88.8%); naturalized citizens (5.4%)
and non-citizens (5.8%) make up roughly equal proportions of the remainder.
table(cps$Citizenship)
##
## Citizen, Native Citizen, Naturalized Non-Citizen
## 116639 7073 7590
prop.table(table(cps$Citizenship))
##
## Citizen, Native Citizen, Naturalized Non-Citizen
## 0.88832615 0.05386818 0.05780567
d) The CPS differentiates between race and ethnicity. For which races do 15% or more of re-
spondents identify as ethnically Hispanic?
For the race categories American Indian, Multiracial, and White, more than 15% of respon-
dents identify as ethnically Hispanic.
prop.table(table(cps$Race, cps$Hispanic), 1)
##
## 0 1
## American Indian 0.78785764 0.21214236
## Asian 0.98266871 0.01733129
## Black 0.95536549 0.04463451
## Multiracial 0.84535727 0.15464273
## Pacific Islander 0.87540453 0.12459547
## White 0.84204265 0.15795735
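The races meeting the 15% threshold can also be picked out programmatically rather than by eye. The sketch below copies the Hispanic proportions from the output above into a named vector (the vector name is introduced here for illustration) and filters on the threshold.

```r
# Proportion identifying as Hispanic, copied from the column labeled 1 above
prop_hispanic <- c(
  "American Indian"  = 0.21214236,
  "Asian"            = 0.01733129,
  "Black"            = 0.04463451,
  "Multiracial"      = 0.15464273,
  "Pacific Islander" = 0.12459547,
  "White"            = 0.15795735
)

# Keep the races where 15% or more of respondents identify as Hispanic
at_least_15 <- names(prop_hispanic)[prop_hispanic >= 0.15]
print(at_least_15)
```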
e) Create a graphical summary that shows the association between age and marital status. De-
scribe what you see and comment on whether it is what you might expect intuitively.
Yes, the association generally matches what one would expect intuitively: individuals
typically marry in their 20s to 30s, may experience separation or divorce in the decades
following, and are most likely to lose a partner late in life. The
median age in the Never Married group is the lowest, around 30 years, and the median age in
the Widowed group is the highest, around 80 years (although there are high outliers in the
Never Married group and low outliers in the Widowed group). The median age in the Divorced,
Married, and Separated categories is similar, around 50 years.
boxplot(cps$Age ~ cps$Married)
[Figure: side-by-side boxplots of cps$Age by cps$Married (Divorced, Married, Never Married, Separated, Widowed).]
Problem 6.
This problem uses stock data from Apple Inc. (AAPL) and Microsoft Corporation (MSFT).
The following code uses commands from the quantmod package to fetch daily return data. The
adjusted closing price can be thought of as the most accurate reflection of a stock’s value at closing;
the adjusted closing price factors in events that might affect the stock price after the market closes. The
daily volume of a stock refers to how many shares were traded that day. The daily return quantifies
how much value was gained/lost in a day.
#load quantmod package
library(quantmod)
#load AAPL and MSFT data
getSymbols("AAPL", from = "2018-01-01", to = "2019-07-22")
## [1] "AAPL"
getSymbols("MSFT", from = "2018-01-01", to = "2019-07-22")
## [1] "MSFT"
#obtain adjusted closing prices
aapl.closing = Ad(AAPL)
msft.closing = Ad(MSFT)
#obtain daily volume
aapl.volume = Vo(AAPL)
msft.volume = Vo(MSFT)
#obtain daily returns
aapl.return = as.numeric(dailyReturn(Ad(AAPL)))
msft.return = as.numeric(dailyReturn(Ad(MSFT)))
a) Run the code in the template to plot Apple’s and Microsoft’s stock prices between 01 January
2018 and 22 July 2019. Describe what you see.
Over this time period, both stock prices rise gradually from the starting price (ignoring small
dips attributable to daily variation). However, while Microsoft’s stock rises
steadily over time, Apple’s stock dips well below its starting price
in January 2019 before recovering.
#plot prices
plot(aapl.closing, col = "blue", ylim = c(50, 300))
[Figure: time series plot of aapl.closing (blue), 2018-01-02 to 2019-07-19; y-axis ticks from 100 to 250.]
lines(msft.closing, col = "red")
[Figure: the same time series plot of aapl.closing with msft.closing added as a red line.]
b) Compare the return from holding one share of AAPL for this time period versus one share of
MSFT. The return over a period of time is the change in value divided by the original price.
The return for holding one share of AAPL over this time period is 20.3%, which is lower
than the return of 63.0% for holding one share of MSFT over this time period.
#calculate return for aapl
aapl.closing[1, ]; aapl.closing[nrow(aapl.closing), ]
## AAPL.Adjusted
## 2018-01-02 166.804
## AAPL.Adjusted
## 2019-07-19 200.7426
(as.numeric(aapl.closing[nrow(aapl.closing), ]) - as.numeric(aapl.closing[1, ])) /
as.numeric(aapl.closing[1, ])
## [1] 0.2034641
#calculate return for msft
msft.closing[1, ]; msft.closing[nrow(msft.closing), ]
## MSFT.Adjusted
## 2018-01-02 83.25638
## MSFT.Adjusted
## 2019-07-19 135.7048
(as.numeric(msft.closing[nrow(msft.closing), ]) - as.numeric(msft.closing[1, ])) /
as.numeric(msft.closing[1, ])
## [1] 0.6299628
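The return calculation is the same for both stocks, so it can be wrapped in a small function. The helper name period_return is hypothetical, and the prices are copied from the printed output above, so the results agree with the solution only up to the rounding of those printed values.

```r
# Hypothetical helper: return over a period = (final - initial) / initial
period_return <- function(initial, final) {
  (final - initial) / initial
}

# Adjusted closing prices copied from the output above (rounded as printed)
aapl_return <- period_return(initial = 166.804, final = 200.7426)
msft_return <- period_return(initial = 83.25638, final = 135.7048)

print(round(c(AAPL = aapl_return, MSFT = msft_return), 3))
```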
c) Compute the standard deviation of daily returns for AAPL and MSFT over this time period.
Based on standard deviation, which stock is less volatile? Volatility refers to the degree of
fluctuation in a stock's price over time.
The standard deviation of daily returns for AAPL is 0.0180, while the standard deviation of
daily returns for MSFT is 0.0163. Based on standard deviation, MSFT is less volatile.
sd(aapl.return)
## [1] 0.01800519
sd(msft.return)
## [1] 0.01632335
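For intuition about what is being summarized, the sketch below computes simple daily returns by hand from a short, hypothetical price series and then takes their standard deviation. It illustrates the usual definition of a simple return; it is not a reimplementation of quantmod's dailyReturn, which may treat the first observation differently.

```r
# Hypothetical closing prices for four consecutive trading days
price <- c(100, 110, 99, 108.9)

# Simple daily return: (today's price - yesterday's price) / yesterday's price
daily_return <- diff(price) / head(price, -1)

# A larger standard deviation means returns fluctuate more from day to day
return_sd <- sd(daily_return)

print(daily_return)
print(return_sd)
```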
d) Identify the highest and lowest adjusted closing prices of AAPL during this time period, as
well as the dates on which they occurred. Do the same for MSFT.
For AAPL, the highest adjusted closing price, $227.30, occurred on October 03 2018, and the
lowest, $139.75, occurred on January 03 2019. For MSFT, the highest adjusted closing price,
$137.97, occurred on July 12 2019, and the lowest, $82.35, occurred on February
08 2018.
aapl.closing[which.max(aapl.closing)]
## AAPL.Adjusted
## 2018-10-03 227.3003
aapl.closing[which.min(aapl.closing)]
## AAPL.Adjusted
## 2019-01-03 139.7535
msft.closing[which.max(msft.closing)]
## MSFT.Adjusted
## 2019-07-12 137.9695
msft.closing[which.min(msft.closing)]
## MSFT.Adjusted
## 2018-02-08 82.34585
e) Both MSFT and AAPL are in the S&P 500 Index, a stock index that measures the stock
performance of 500 large publicly traded companies on the US market. It is a weighted
index, such that larger companies contribute more to the index than smaller companies.
The two largest components of the index are AAPL and MSFT. Thus, we might expect that
the returns of these two companies are highly related.
i. From a graphical summary, describe your impression of the relationship between daily
return for AAPL and daily return for MSFT.
There appears to be a positive linear relationship between daily return for AAPL
and daily return for MSFT; higher daily returns for AAPL are associated with higher daily
returns for MSFT.
ii. Calculate an appropriate numerical summary for the relationship observed in part i.
The correlation between the daily return for AAPL and the daily return for MSFT is
0.678.
iii. Interpret the value from part ii. in language accessible to an audience who has not
taken a statistics course.
The correlation of 0.678 indicates a moderately strong linear relationship between the
daily return for AAPL and daily return for MSFT. In other words, an increase in the
daily return of AAPL generally occurs with an increase in the daily return of MSFT.
However, the relationship is not strong enough to conclude that an increase in the daily
return of AAPL always occurs with an increase in the daily return of MSFT.
#graphical summary (part i.)
plot(aapl.return, msft.return,
cex = 0.8,
xlab = "Daily Return for AAPL", ylab = "Daily Return for MSFT")
[Figure: scatterplot of Daily Return for MSFT (y-axis) against Daily Return for AAPL (x-axis).]
#numerical summary (part ii.)
cor(aapl.return, msft.return)
## [1] 0.6779739