R Functions for Loss Distributions Analysis
INDEX
Question 2. Define a function in R to simulate the values from 𝑃𝑎𝑟𝑒𝑡𝑜 (𝛼, 𝜆) distribution.
Name this function rpareto.
Question 4. (i) Using the function defined in Q2, simulate 500 values from a
𝑃𝑎𝑟𝑒𝑡𝑜 (3,1000) distribution, using a seed value of 80.
Question 5. Repeat Q1 to Q3 for the Weibull(c, γ) distribution. You may refer to the Actuarial
Tables for the formulae.
Question 6. Repeat Q1 to Q3 for the Burr(α, λ, γ) distribution. You may refer to the Actuarial
Tables for the formulae.
Question 7. Adapting the code of dpareto from Q1, create a corresponding function for the
three-parameter Pareto distribution. You may refer to the Actuarial Tables for
the formulae.
Solutions
Answer 1. dpareto <- function(x,a,l){
a*(l^a)/((l+x)^(a+1))
}
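As a quick check (an addition, not part of the original answer), the density should integrate to 1; for example with α = 3 and λ = 1000:
integrate(function(x) dpareto(x, 3, 1000), 0, Inf)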
Answer 4. (i)
set.seed(80)
rpareto <- function(n,a,l){
rp <- l*((1-runif(n))^(-1/a)-1)
rp
}
x<-rpareto(500,3,1000)
(ii)
skewness<-sum((x-mean(x))^3)/length(x)
coeff.of.skew<-skewness/var(x)^(3/2)
coeff.of.skew
[1] 6.903348
Question 2. Calculate f (1000) and F(1000) for the random variable Y where Y is lognormally
distributed with parameters μ = 6 and σ² = 4.
Question 5. Calculate the 99.5th percentile for the random variable Y, where Y is lognormally
distributed with parameters μ = 6 and σ² = 4.
Question 6. Calculate the IQR for the random variable X where X~Gamma(6,0.1).
Question 9. You believe that claim amounts follow a Gamma(6,𝜆) distribution, where
𝜆~𝐸𝑥𝑝(10). Generate a random sample of five claims, using a seed value equal
to 1.
"
Question 10. The random variable Z is defined as Z = # where:
Question 11. Claim sizes X for a particular class of insurance business are believed to follow a
three‐parameter Pareto distribution such that X ~ Pa(2,200,7) . Summarise the
code required to calculate P(X > 10,000) , and display the result.
Question 12. The claim frequency on a portfolio of life insurance policies follows a beta
distribution with parameters 𝛼 = 0.02 and 𝛽 = 0.4. Calculate the probability that
the claim frequency exceeds 0.75.
Question 13. The random variable X follows a three‐parameter Pareto distribution with
parameters 𝛼 = 3, 𝜆 = 5 and k = 4. Calculate the median of X.
Hint: write a function to calculate |P(X <M) - 0.5| and minimise this with respect to M.
Question 14. The random variable X follows a LogN(0,0.25) distribution (i.e. with parameters
μ = 0 and σ² = 0.25).
(i) Calculate the mode of X, to 6 significant figures.
(ii) Check your answer against a plot of the lognormal distribution.
Question 16. This question uses the following file of claim sizes for a portfolio of general
insurance policies: ClaimSize.csv
(i) An analyst believes that claims are independent and follow a
Gamma(𝛼, 𝛽) distribution. Determine the maximum likelihood estimates
of the parameters 𝛼 and 𝛽, choosing from the following possibilities:
• 𝛼 = 1.7,1.8,1.9 or 2
• 𝛽 = 0.0015, 0.0016, 0.0017, …, 0.002
(ii) Plot the fitted density function against the data, and use it to comment on
your results.
Question 17. Insurance claims are believed to follow an exponential distribution with parameter
𝜆. The file ‘exp.txt’ gives the claim size for 1,000 of these claims.
(i) Find the maximum likelihood estimate of 𝜆. (You should use a starting
value of 𝜆 = 0.001 for the iteration process.)
(ii) Plot the empirical density function against your fitted distribution and
hence comment on the goodness of fit.
Question 18. A fellow student has collected a set of motor insurance claims data in the
following file: MotorClaims.txt
The student believes that this data follows an exponential distribution, and has
calculated the maximum likelihood estimate as λ̂ = 0.003.
Plot the Q‐Q plot of the data against the student’s fitted distribution, and comment
on the result.
set.seed(1066)
x<-rlnorm(1000,5,2)
(i) Use the data stored in vector x to find the MLE of parameters of Log-
Normal distribution.
(ii) Use the method of moments on the simulated data above to estimate the
lognormal parameters.
(iii) Verify that your estimates can be used to give sensible maximum
likelihood estimates.
Solutions
Answer 1. dgamma(50,10,0.2)
[1] 0.02502201
pgamma(50,10,0.2)
[1] 0.5420703
Answer 2. dlnorm(1000,6,2)
[1] 0.0001799479
plnorm(1000,6,2)
[1] 0.6750416
Answer 3. c<-0.00002
g<-4
pweibull(12,4,c^(-1/g))-pweibull(8,4,c^(-1/g))
[1] 0.2608205
Answer 4. c<-0.00002
g<-4
sum(dweibull(y,4,c^(-1/g),log=TRUE))
[1] -56.0779
Answer 5. qlnorm(0.995,5,2)
[1] 25633.58
Answer 6. qgamma(0.75,6,0.1)-qgamma(0.25,6,0.1)
[1] 32.03492
Answer 7. c<-0.00002
g<-4
qweibull(0.7,g,c^(-1/g))
[1] 15.66378
Answer 8. set.seed(50)
rlnorm(10,5,2)
[1] 445.563506 27.571772 158.538247 423.392403 4.686996
[6] 85.137785 305.410535 45.521062 1044.382530 8.235879
rgamma(5,6,rexp(1,10))
[1] 63.05133 49.19141 86.27594 32.85776 88.75371
Note, since your answer will depend on the order in which you generate numbers
from the two distributions, it is possible that your solution will be different to
ours, but still equally valid!
For example, you may have first generated a vector of random variates from the
exponential distribution, and then used these to simulate values from the
gamma distribution. In this case, your answer would be:
set.seed(1)
lambda <- rexp(5,10)
rgamma(5,6,lambda)
[1] 32.85776 56.72204 505.65996 495.95443 110.23761
Answer 13. a<-3; l<-5; k<-4
f<-function(x) {
gamma(a+k)*(l^a)*x^(k-1)/(gamma(a)*gamma(k)*(l+x)^(a+k))
}
F<-function(M) {
abs(integrate(f,0,M)$value-0.5)
}
M.start<-10
nlm(F,M.start)
$minimum
[1] 1.947338e-08
$estimate
[1] 6.865008
$gradient
[1] -3.221498e-05
$code
[1] 2
$iterations
[1] 61
So the median is M = 6.865.
Answer 14. m<-0
s2<-0.25
s<-s2^0.5
exp(m+0.5*s^2)   # this is the mean E(X), shown for comparison
[1] 1.133148
x<-seq(0.000001,2,by=0.000001)
f<-function(x) {
dlnorm(x,m,s)
}
max<-max(f(x)); max
[1] 0.9041217
i<-which(f(x)==max); i
[1] 778801
x[i]
[1] 0.778801
(ii)
plot(x,f(x),type="l",main="Density function for
LogN(0,0.25)")
The highest point on the curve is at x ≈0.8 , which is close to our answer for part (i).
p<-c(111,0.189)
f(p)
[1] 5433.073
$minimum
[1] 5433.067
$estimate
[1] 110.9999999 0.1890614
$gradient
[1] -0.0301862472 -0.0002226943
$code
[1] 2
$iterations
[1] 2
In the code above, the function rbind appends each iteration
into a new row of the dataframe result.
(ii) The maximum claim size observation is 4,664 so we’ll plot a histogram with the x axis
on the range (0,5000):
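A sketch of the histogram call described in the next paragraph (the vector name claims is an assumption; it should hold the values read from ClaimSize.csv):
hist(claims, breaks=20, freq=FALSE, xlim=c(0,5000),
main="Histogram of claim sizes", xlab="Claim size")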
We have also chosen to set breaks=20, ie twenty bars in our histogram. This is an
arbitrary choice; you may have chosen a different number altogether. In any case, R only
treats this as a suggestion and will automatically set the breakpoints to 'pretty' values.
The option freq=FALSE tells R to plot the PDF, rather than the frequencies. Now we
plot our fitted gamma distribution on the same graph as the histogram:
lines(0:5000,dgamma(0:5000,shape=1.7,rate=0.002),
col="blue")
We can see that the data is more positively skew than the fitted distribution. There are
far too many claims greater than 800 and far too few claims of less than 800.
Extracting the data into a vector will make it easier to work with:
claims<-claims[,1]
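The negative log-likelihood to be minimised can then be written as a short function (a sketch; the name f is assumed to match the nlm() call below):
f <- function(lambda) {
-sum(dexp(claims, rate=lambda, log=TRUE))
}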
p<-0.001
MLE<-nlm(f,p)
MLE
$minimum
[1] 75087.7
$estimate
[1] 0.001490436
$gradient
[1] -9535.74
$code
[1] 3
$iterations
[1] 5
Hence, λ̂ = 0.0015.
We want to plot the fitted PDF on a similar range, so we check the maximum claim size:
max(claims)
[1] 5612.656
Hence, we set our x coordinates as:
x<-seq(0,5613)
Now we calculate the PDF of the exp(0.0015) distribution, for each value of x:
y<-dexp(x,MLE$estimate)
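The comparison plot itself can then be drawn, for example (a sketch; colours and titles are assumptions):
plot(density(claims), main="Empirical density vs fitted Exp(0.0015)",
xlab="Claim size", col="blue")
lines(x, y, col="red")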
set.seed(77)
y<-rexp(1000,0.003)
qqplot(x,y,xlab="Sample quantiles",ylab="Quantiles of
exp(0.003)", main="Q-Q plot")
abline(0,1,col="red")
The data doesn’t lie anywhere near the reference line for claims greater than about 500,
so the fit is particularly bad in the tail of the distribution. Nor is it a good fit for smaller
claims. Overall we would recommend the student tries to fit an alternative distribution.
We have written params[1] instead of mu, and params[2] instead of sigma. So when
we specify params (ie our vector of inputs), we need the first element of params to
correspond to the parameter mu and the second element to correspond to the
parameter sigma.
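A sketch of the negative log-likelihood function described above (assuming the simulated data is in the vector x created earlier):
f <- function(params) {
lnL <- dlnorm(x, meanlog=params[1], sdlog=params[2], log=TRUE)
sum(-lnL)
}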
p<-c(100,100)
MLE<-nlm(f,p)
MLE
$minimum
[1] 7108.991
$estimate
[1] 4.988397 2.017083
$gradient
[1] 7.110559e-05 -5.455841e-05
$code
[1] 1
$iterations
[1] 20
MLE$estimate
[1] 4.988397 2.017083
(iii) Using the method of moments estimates as inputs in the MLE function
We enter the estimates mu and sigma from part (i) as inputs in our function f, and
minimise the negative log-likelihood:
p<-c(mu,sigma)
MLE<-nlm(f,p)
MLE
$minimum
[1] 7108.991
$estimate
[1] 4.988397 2.017082
$gradient
[1] 8.897315e-05 -5.830088e-04
$code
[1] 1
$iterations
[1] 8
This time, our maximum likelihood estimates are μ̂ = 4.99 and σ̂ = 2.02, which are very
close to the population parameters μ = 5 and σ = 2.
Reinsurance
Question 1. Claims from a particular portfolio of insurance policies are expected to follow a
Pareto distribution with parameters 𝛼 = 2 and 𝜆 = 800. An individual excess of
loss arrangement is in force where the insurer pays a maximum of 1,500 on any
claim.
(i) Simulate 10,000 claims from this portfolio, using a seed value of 5.
(ii) Estimate the probability that a claim involves the reinsurer.
Question 2. Claim sizes this year, X, from a particular class of insurance follow an exponential
distribution with parameter 𝜆 = 0.001.
(i) Simulate 10,000 claims from this portfolio, using a seed value of 2357.
(ii) Inflate each claim in the sample by 5%, to derive a sample of 10,000
claims from next year’s claim size distribution.
(iii) Apply the policy excess of 100 to each claim and find the mean amount
paid by the insurer after inflation.
The vector yclaims denotes the amounts paid by an insurer for a book of general
insurance business, net of reinsurance. The reinsurance is an excess of loss
arrangement with retention limit M = 1,000 so that the insurer pays a maximum of
1,000 for any individual claim. (Hence, many claims in yclaims are exactly equal
to 1,000.)
You believe that the underlying claim size random variable X (before reinsurance)
is exponentially distributed with unknown parameter 𝜆.
(i) Calculate the maximum likelihood estimate of 𝜆.
(ii) Plot a graph of your fitted distribution against the empirical density plot of
the data yclaims. Your x axis should extend from 100 to 1,000 and you
should label your graph appropriately.
(iii) Comment on the key features of your graph.
Question 4. Individual claim amounts follow a gamma distribution with parameters α = 5 and
β = 0.04. The insurer arranges excess of loss reinsurance with a retention limit of
200. Calculate the proportion of claims on which the reinsurer is involved.
increase by 4% next year. Estimate the skewness of the number of claims the
insurer can expect to pay next year.
Question 6. Before attempting this question, you should run the following line of code:
set.seed(16550)
An insurer believes that claim amounts, X, on its portfolio of pet insurance policies
follow an exponential distribution with parameter λ = 0.0005. The insurer wants to
purchase an excess of loss reinsurance policy with retention level M, such that the
reinsurer pays Z = X - M on claims that exceed size M, and zero otherwise.
(i) Simulate 10,000 claims from the insurer’s gross claims distribution and
hence estimate the insurer’s mean claims size net of reinsurance,
for M = 2, 500 .
The reinsurer imposes a maximum reinsurance payment of £1,000 on each claim,
so that:
Z = 0          if X ≤ M
Z = X − M      if M < X ≤ M + 1000
Z = 1000       if X > M + 1000
(ii) Estimate the retention level M that minimises the variance of the insurer’s
claims net of reinsurance.
Question 7. A reinsurer has entered into an excess of loss reinsurance arrangement with a
retention limit of £45,000. The amounts paid by the reinsurer on each claim are
provided in the file: Reinsurance payments.txt
arrangement, by
(a) the insurer
(b) the reinsurer.
The insurer wishes to determine an appropriate retention limit and has asked an
analyst to investigate the effect of different retention limits on the mean and
variance of claims.
(iv) Calculate the mean and variance of the claims paid by:
(a) the insurer
(b) the reinsurer
under each of the following six retention levels:
£5,000, £10,000, £20,000, £30,000, £40,000 and £50,000.
(v) Plot your results from part (iv) on four separate charts to show how the
mean and variance of the claims paid by the insurer and reinsurer vary
with the retention level.
(vi) Comment, with reference to the total variance, on your results in part (v).
Question 10. Claims from a particular portfolio of insurance policies are expected to follow a
Pareto distribution with parameters 𝛼 = 3 and 𝜆 = 400. A policyholder
deductible of £100 is applied to these policies.
(i) Simulate 10,000 claims from this portfolio, using a seed value of 225 and
store the values in vector x.
(ii) Calculate the expected claim size paid by the insurance company.
(iii) Comment on the difference between your answer to (ii) and the mean of
vector x.
Solutions
Answer 1. (i) Simulating claims
Since the Pareto distribution is not built in to the basic R functions, we will write
our own function. You will have seen how to do this before in a previous unit:
rpareto <- function(n,a,l){
rp <- l*((1-runif(n))^(-1/a)-1)
rp
}
Simulating a set of 10,000 claims from this distribution, using the seed value of
5:
set.seed(5)
x<-rpareto(10000,2,800)
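(ii) The probability that a claim involves the reinsurer can then be estimated as the proportion of simulated claims exceeding the retention of 1,500 (a sketch):
length(x[x > 1500]) / length(x)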
Answer 3. (i) Extracting the vector of claims that do not exceed the retention level 1,000:
yclaims.excl1000 <- yclaims[yclaims<1000]
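One possible form for the negative log-likelihood used below, treating the claims recorded at exactly 1,000 as right-censored observations (a sketch; the exact form used to produce the output below is an assumption):
flnL <- function(lambda) {
lnL <- sum(dexp(yclaims.excl1000, rate=lambda, log=TRUE)) +
sum(yclaims == 1000) * pexp(1000, rate=lambda, lower.tail=FALSE, log.p=TRUE)
-lnL
}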
p<-1
nlm(flnL,p)
$minimum
[1] 64624.84
$estimate
[1] 0.002720991
$gradient
[1] 5.491267
$code
[1] 2
$iterations
[1] 14
(ii) Plot
First we extract the maximum likelihood estimate 𝜆" from the output of the nlm
function, and call this lambda:
lambda<-nlm(flnL,p)$estimate
Now we plot the empirical density function of the original data. In the code
below we have split the title of the graph into three lines, so that R will do the
same when it creates the graph:
plot(density(yclaims,from=100,to=1000),xlab="Amounts paid
by insurer",main="Empirical density function vs
Exp(0.00272)",col="blue")
Now we create a vector of values x, on the same range as the empirical density
plot:
x<-seq(100,1000)
y<-dexp(x,rate=lambda)
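Overlaying the fitted density on the empirical plot (a sketch; the colour is an assumption):
lines(x, y, col="red")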
(iii) Comment
The blue line (ie the empirical density function) has a peak at x =1,000 . This is to
be expected since the data yclaims represents the insurer’s claim amounts
after reinsurance (ie there are many claims that have hit the reinsurance layer
and for which the insurer pays the maximum 1,000).
The two lines diverge in the right-hand tail of the distribution since the data
represents claim amounts after reinsurance whereas the fitted distribution
models the underlying claim size before reinsurance.
The two lines are quite close across the rest of the range (ie where the
underlying claim sizes are the same as the insurer’s claim amounts), so the fitted
distribution appears to be a good fit.
Answer 5. set.seed(1)
x <- sort(rpois(365,rgamma(365,20,0.01)))
f <- 0.01*x
f2 <- 1.04*f
nonf <- (1-0.01)*x
r2 <- nonf+f2
x2 <- r2*(1-0.01*1.04)
skewX2 <- sum((x2-mean(x2))^3)/length(x2); skewX2
skewX2^(1/3)
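Answer 6. (i) A minimal sketch of the gross claims simulation, using the seed given in the question; the vector x is also needed for part (ii) below:
set.seed(16550)
x <- rexp(10000, 0.0005)
# insurer's mean claim size net of reinsurance, with retention M = 2,500
mean(pmin(x, 2500))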
# Part (ii)
f <- function(M) {
var(x-pmin(pmax(x-M,0),1000))
}
nlm(f,1000)
The number of claims that fall below the retention limit of £45,000 is:
nm <- length(Payments)-length(RI.claims)
nm
[1] 1634
mu <- mean(log(RI.claims))
sigma <- sd(log(RI.claims))
p <- c(mu,sigma)
flogN(p)
nlm(flogN,p)
$minimum
[1] 4578.015
$estimate
[1] 10.4747560 0.2651853
$gradient
[1] -8.453505e-06 -1.749413e-05
$code
[1] 1
$iterations
[1] 25
# Part (iii)
M <- 45000
x <- Payments+M
max(x)
breakpoints <- c(20000,45000,seq(46000,86000,1000))
hist(x,freq=FALSE, breaks=breakpoints, xlim=c(0,86000),
ylim=c(0,0.00005),ylab="Density",main="Truncated data
versus logN(10.475,0.265)",
col="grey")
lines(seq(0,86000),dlnorm(seq(0,86000),meanlog=10.475,
sdlog=0.265),col="blue")
m <- nlm(flogN,p)$estimate[1]
s <- nlm(flogN,p)$estimate[2]
lines(seq(0,86000),dlnorm(seq(0,86000),meanlog=m,sdlog=s),
col="blue")
# Part (v)
x <- M+max(Payments)
plnorm(x,10.475,0.265,lower.tail=FALSE)
m <- nlm(flogN,p)$estimate[1]
s <- nlm(flogN,p)$estimate[2]
plnorm(x,m,s,lower.tail=FALSE)
Answer 8. (i)
claims <- rexp(10000, 0.0001)
hist(claims, main = "Histogram of 10,000 Simulated Claims
Values from the Exponential Distribution")
(ii)(a)
mean(claims)
[1] 10037.81
var(claims)
[1] 99976440
(b)
The theoretical value of the mean should be 1/0.0001 = 10,000.
The theoretical value of the variance should be (1/0.0001)² = 100,000,000.
The simulated values are very close to the theoretical values.
The difference is due to sampling error.
(iii)(a)
For the insurer:
InsClaims3 <- pmin(claims, 20000)
mean(InsClaims3)
[1] 8671.328
var(InsClaims3)
[1] 43964610
(b)
For the reinsurer:
ReClaims3 <- pmax(0, claims - 20000)
mean(ReClaims3)
[1] 1366.485
var(ReClaims3)
(iv)(a)
InsClaims1 <- pmin(claims, 5000)
InsClaims2 <- pmin(claims, 10000)
InsClaims4 <- pmin(claims, 30000)
InsClaims5 <- pmin(claims, 40000)
InsClaims6 <- pmin(claims, 50000)
MeanIns <- c(mean(InsClaims1), mean(InsClaims2),
mean(InsClaims3), mean(InsClaims4), mean(InsClaims5),
mean(InsClaims6))
MeanIns
[1] 3940.658 6339.937 8671.328 9538.658 9869.412 9971.611
VarIns <- c(var(InsClaims1), var(InsClaims2), var(InsClaims3),
var(InsClaims4), var(InsClaims5), var(InsClaims6))
VarIns
[1] 2533775 12858446 43964610 70204248 86376566 93370367
(b)
ReClaims1 <- pmax(0, claims - 5000)
ReClaims2 <- pmax(0, claims - 10000)
ReClaims4 <- pmax(0, claims - 30000)
ReClaims5 <- pmax(0, claims - 40000)
ReClaims6 <- pmax(0, claims - 50000)
MeanRe <- c(mean(ReClaims1), mean(ReClaims2),
mean(ReClaims3), mean(ReClaims4), mean(ReClaims5),
mean(ReClaims6))
MeanRe
[1] 6097.15469 3697.87535 1366.48492 499.15454 168.40057
66.20174
VarRe <- c(var(ReClaims1), var(ReClaims2), var(ReClaims3),
var(ReClaims4), var(ReClaims5), var(ReClaims6))
VarRe
[1] 84523429 60046376 25047814 9343406 3450843 1305645
(v)
x <- c(5000, 10000, 20000, 30000, 40000, 50000)
plot(x, MeanIns, xlab = "Retention Limit", ylab = "Mean
amount paid by Insurer", main = "Mean Claim Amount for
Insurer by Retention Limit", type = "l")
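The remaining three charts can be produced in the same way, for example (a sketch):
plot(x, VarIns, xlab = "Retention Limit", ylab = "Variance of amount paid by Insurer",
main = "Variance of Claim Amount for Insurer by Retention Limit", type = "l")
plot(x, MeanRe, xlab = "Retention Limit", ylab = "Mean amount paid by Reinsurer",
main = "Mean Claim Amount for Reinsurer by Retention Limit", type = "l")
plot(x, VarRe, xlab = "Retention Limit", ylab = "Variance of amount paid by Reinsurer",
main = "Variance of Claim Amount for Reinsurer by Retention Limit", type = "l")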
(vi) The mean claim amount paid by the insurer increases in size with the retention
limit, but at a decreasing rate. It tends towards the unrestricted mean as the retention
limit increases.
The mean claim amount paid by the reinsurer is equal to the total mean claim minus
the mean amount paid by the insurer.
Hence the mean claim amount paid by the reinsurer decreases in size with the
retention limit, also at a decreasing rate.
The trends in the variance of the claims for the insurer and the reinsurer
are not “mirror images”.
Among the six retention limits investigated, the sum of the variance of the reinsurer
and the insurer reaches a minimum at a retention limit of around £20,000.
TotalVar <- VarIns + VarRe
TotalVar
[1] 87057204 72904822 69012424 79547654 89827408 94676012
Answer 9. (i)
(a)
Exp_Vector <- rexp(1000,0.4)
mean(Exp_Vector) # or summary(Exp_Vector)
var(Exp_Vector)
The mean and variance will vary due to the random number generation. If the
sample size is large enough, the mean and variance should be close to those of the
underlying distribution (exponential with parameter 0.4) as follows:
Mean = 2.5
Variance = 6.25
(b)
hist(Exp_Vector)
(c) 1.
x <- sort(Exp_Vector)
PDF <- dexp(x, 0.4)
plot(x, PDF)
(c) 2.
plot(x, PDF, type="l")
(ii) (a)
LNorm_Vector <- rlnorm(1000, meanlog = 0, sdlog = 1)
mean(LNorm_Vector)
var(LNorm_Vector)
The mean and variance will vary due to the random number generation. If the sample size
is large enough, the mean and variance should be close to those of the underlying distribution
(lognormal with parameters μ = 0, σ² = 1) as follows:
Mean = 1.649
Variance = 4.6708
(ii) (b)
hist(LNorm_Vector)
(c)
hist(LNorm_Vector, freq = FALSE, xlim = c(0,25), ylim = c(0,0.7))
lines(ecdf(LNorm_Vector),col="red")
legend("bottomright",c("True cumulative
distribution","Estimate"),lty=1,col=c("black", "red"))
Answer 10. (i)
set.seed(225)
x<-rpareto(10000,3,400)
(ii)
y <- x[x>100]
mean(y)
[1] 355.6002
(iii)
mean(x)
[1] 204.6041
The mean amount of Y is greater than this because we have removed some of
the smallest claims by introducing the deductible.
Risk Models
Question 1. Aggregate claims S follow a compound distribution where the number of claims N
follows a (type 2) negative binomial distribution with parameters k = 10 and p = 0.1.
Individual claim sizes X follow a lognormal distribution with parameters µ = 10 and
σ² = 5.
(i) Simulate 10,000 random variates from the compound distribution and display the
first five values of your data. (Use seed value 16.)
(ii) Compare the sample mean and variance of your dataset to the population mean
and variance of S. Hint: Use the formula for E(S) and Var(S) given on page 16 of
the Tables.
(iii) Adjust your answer to part (i) to simulate the insurer's aggregate claims net of
reinsurance, SI , if an individual excess of loss arrangement applies, with
retention limit M = 2,000,000.
Question 2. An insurer’s total annual claim amount S follows a compound Poisson distribution
with parameter λ = 500. The individual claim size X is exponentially distributed
with parameter β = 0.001.
(i) Simulate 1,000 aggregate claim amounts, using the seed value 1975, and display
the last five results.
The insurer believes that aggregate claims can be approximated using a normal
distribution, with two parameters μ and σ.
(ii) Calculate the maximum likelihood estimates of the parameters μ and σ² based on
your simulated data. Hint: You should use the sample mean and standard deviation
as starting values for your iteration.
Question 3. This question uses the data file ‘IRM.txt’. Therefore please run the following code
before attempting this question:
probability.of.claim <-
read.table("IRM.txt",header=T)[,2]
exponential.parameter <-
read.table("IRM.txt",header=T)[,3]
An insurer has sold 500 life insurance policies. Each policy can give rise to at
most one claim, and if a claim occurs on risk i the benefit payment follows an
exponential distribution with parameter λᵢ (for i = 1, ..., 500). Policies are
independent, and claim sizes (and claim numbers) are not necessarily identically
distributed.
(i) Calculate the insurer’s expected total claim payment for the portfolio.
(ii) Estimate the probability that the insurer’s aggregate claim amount, S, will be
within plus or minus 10% of the population mean. (Hint: You should carry out
10,000 simulations, using the seed value 5.)
(iii) (a) Calculate the probability of the aggregate claims on the portfolio being
zero, and compare this to the probability implied by your simulation.
(b) Hence or otherwise, comment on the shape of the distribution of S.
Question 4. We have generated the data required for this question (policies) using the
following code. You should run this code before attempting the question:
set.seed(1812)
q <- runif(200,0,0.01)
mu <- runif(200,5,13)
sigma <- runif(200,0.5,0.8)
policies <- cbind(q,mu,sigma)
(i) Simulate the reinsurer’s mean aggregate claim amount, using 5,000
simulations and a seed value of 1854.
Question 5. Aggregate claims S follow a compound distribution where the number of claims N
follows a compound Poisson distribution with parameter 𝜆 = 100. Individual
claim sizes X follow an exponential distribution with parameter 𝛽 = 0.002.
Use seed value 8,143 to simulate 1,000 random variates from the compound
distribution.
(i) Hence state the value of the 98th simulated value of S.
(ii) Estimate the median of the insurer’s aggregate claims payment.
(iii) Estimate the coefficient of skewness of the insurer’s aggregate claims
payment S
(i) Use a seed of 123 to generate a sample of 10,000 observations from T and
display the last five values from your simulation.
(ii) Estimate q such that P(T > q) = 0.05.
(iii) Fit a gamma distribution to the simulated dataset t that you derived above.
(iv) Create a Q‐Q plot of the maximum likelihood gamma distribution against
the simulated aggregate claims data t. Comment on the result.
(v) Check the interpretation of the Q‐Q plot in the previous question by
plotting the right‐hand tail of the fitted density function against the right‐
hand tail of the empirical density function.
Question 7. The random variable W follows a compound negative binomial distribution with
N ~NBin(5,0.8) and the claim size random variable X follows a Burr distribution
with parameters 𝛼 = 0.3 , 𝜆 = 10 and 𝛾 = 1.2. Use a seed of 123 to generate a
sample of 10 observations from W.
Question 8. Company B has sold 100 life insurance policies which each gives rise to a
maximum of one claim per year. There are three types of policy:
• for Type 1 policies, claims amounts (in £000s) follow an
Exp(0.005)distribution, and the probability of a claim occurring on the
policy is 0.001
• for Type 2 policies, claims amounts (in £000s) follow an Exp(0.006)
distribution, and the probability of a claim occurring on the policy is 0.002
• for Type 3 policies, claims amounts (in £000s) follow an Exp(0.007)
distribution, and the probability of a claim occurring on the policy is 0.003
There are 30 Type 1 policies, 50 Type 2 policies and 20 Type 3 policies. All
policies are independent.
Simulate a set of 5,000 observations from the insurer’s aggregate claims size
distribution, and hence estimate the probability that it will have to pay out more
than £100,000 a year for these policies.
(You should use the seed value 54321.)
SOLUTIONS
Answer 1
(i)
k <- 10; p <- 0.1
mu <- 10; sigma <- 5^0.5
set.seed(16)
n <- rnbinom(10000,size=k,prob=p)
s <- numeric(10000)
for(i in 1:10000) {
x <- rlnorm(n[i],meanlog=mu,sdlog=sigma)
s[i] <- sum(x)
}
head(s,5)
(ii)
sample.mean <- mean(s)
sample.variance <- var(s)
meanN<-k*(1-p)/p
varN<-k*(1-p)/p^2
meanX<-exp(mu+0.5*sigma^2)
varX<-exp(2*mu+sigma^2)*(exp(sigma^2)-1)
meanS<-meanN*meanX
varS<-meanN*varX+varN*meanX^2
c(meanS, sample.mean)
c(varS,sample.variance)
meanS/sample.mean
varS/sample.variance
(iii)
set.seed(16)
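A sketch of the net-of-reinsurance simulation, applying the individual retention M = 2,000,000 to each simulated claim (the structure mirrors part (i)):
n <- rnbinom(10000, size=k, prob=p)
sI <- numeric(10000)
for(i in 1:10000) {
x <- rlnorm(n[i], meanlog=mu, sdlog=sigma)
sI[i] <- sum(pmin(x, 2000000))
}
head(sI, 5)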
Answer 2
(i)
set.seed(1975)
n <- rpois(1000,500)
s <- numeric(1000)
for(i in 1:1000) {
x <- rexp(n[i],rate=0.001)
s[i] <- sum(x)
}
tail(s,5)
(ii)
f <- function(params) {
lnL <- dnorm(s,mean=params[1],sd=params[2],log=TRUE)
sum(-lnL)
}
p <- c(mean(s),sd(s))
MLE<-nlm(f,p); MLE
(iii)
mu <-MLE$estimate[1]
sigma <- MLE$estimate[2]
agg <- rnorm(100, mean=mu, sd=sigma)
qqplot(agg,s,xlab = "Quantiles from fitted
distribution",ylab = "Quantiles of simulated data",main =
"QQ plot of data")
abline(0,1,col="red")
plot(density(s),main="Simulated data versus fitted
distribution",xlab="Aggregate claim size",col="blue")
agg <- seq(from=min(s),to=max(s))
y <- dnorm(agg, mean=mu, sd=sigma)
lines(agg,y,col="red")
Answer 3
probability.of.claim <- read.table("IRM.txt",header=T)[,2]
exponential.parameter <- read.table("IRM.txt",header=T)[,3]
(i)
E.Si <- probability.of.claim/exponential.parameter
sum(E.Si)
(ii)
lower.bound <- sum(E.Si)*0.9
lower.bound
upper.bound <- sum(E.Si)*1.1
upper.bound
death <- rbinom(500,1,probability.of.claim)
x <- rexp(500,rate=exponential.parameter)
sim <- x*death
sim <- sum(sim)
set.seed(5)
S.sim <- numeric(10000)
for (i in 1:10000) {
death <- rbinom(500,1,probability.of.claim)
x <- rexp(500,rate=exponential.parameter)
S.sim[i] <- sum(x*death)
}
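The required probability can then be estimated as the proportion of simulations falling within the bounds, and for part (iii)(a) the exact probability of zero aggregate claims can be compared with the simulated proportion (a sketch):
mean(S.sim > lower.bound & S.sim < upper.bound)
# (iii)(a) exact probability that no policy gives rise to a claim
prod(1 - probability.of.claim)
sum(S.sim == 0) / 10000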
(b)
summary(probability.of.claim)
Answer 4
set.seed(1812)
q <- runif(200,0,0.01)
mu <- runif(200,5,13)
sigma <- runif(200,0.5,0.8)
policies <- cbind(q,mu,sigma)
(i)
set.seed(1854)
SR.sim <- numeric(5000)
for (i in 1:5000) {
death <- rbinom(200,1,policies[,1])
x <- rlnorm(200,meanlog=policies[,2],sdlog=policies[,3])
z <- pmax(0,x-60000)
sim <- z*death
sim <- sum(sim)
SR.sim[i] <- sim
}
mean(SR.sim)
(ii)
set.seed(1854)
S.sim <- numeric(5000)
for (i in 1:5000) {
death <- rbinom(200,1,policies[,1])
x <- rlnorm(200,meanlog=policies[,2],sdlog=policies[,3])
sim <- x*death
sim <- sum(sim)
S.sim[i] <- sim
}
SR.sim <- pmax(0,S.sim-300000)
mean(SR.sim)
[1] 20635.58
So the reinsurer’s mean aggregate claims payment is approximately 20,636
Answer 5.
(i)
set.seed(8143)
n <- rpois(1000,100)
s <- numeric(1000)
for(i in 1:1000) {
x <- rexp(n[i], rate=0.002)
s[i] <- sum(x)
}
The 98th element of s is:
s[98]
[1] 37309.99
(ii)
quantile(s,0.5)
50%
49812.93
(iii)
skewness<-sum((s-mean(s))^3)/length(s)
skewness/var(s)^(3/2)
[1] 0.1570086
Answer 6.
(i)
set.seed(123)
n <- rbinom(10000,size=80,prob=0.3)
t <- numeric(10000)
rpareto <- function(n,a,l){
rp <- l*((1-runif(n))^(-1/a)-1)
rp
}
for(i in 1:10000) {
x <- rpareto(n[i],a=4,l=4500)
t[i] <- sum(x)
}
(ii)
We need to find the 95th percentile q, which we can do
using the quantile function:
quantile(t,0.95)
95%
57819.2
(iii)
#First we write the negative log-likelihood function f:
f <- function(params) {
lnL <- dgamma(t,shape=params[1],rate=params[2],log=TRUE)
sum(-lnL)
}
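The function can then be minimised with nlm(); method of moments estimates are used here as starting values (an assumption, any sensible starting values could be used):
p <- c(mean(t)^2/var(t), mean(t)/var(t))
MLE <- nlm(f, p)
MLE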
(iv)
#First we extract the maximum likelihood estimates from the
#output of nlm (stored above in the object MLE):
a <- MLE$estimate[1]
l <- MLE$estimate[2]
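The Q-Q plot itself can then be produced, for example (a sketch; axis labels are assumptions):
qqplot(qgamma(ppoints(length(t)), shape=a, rate=l), t,
xlab="Quantiles of fitted gamma distribution",
ylab="Quantiles of simulated data", main="Q-Q plot")
abline(0, 1, col="red")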
The data appears to be a reasonable fit, except for the right‐hand tail of the distribution.
Here the quantiles of the simulated data are higher than the quantiles of the fitted
distribution, and there is one outlier that is very far from the maximum likelihood
distribution. So the simulated data is more positively skewed than the fitted gamma
distribution. (In fact, the data is not a particularly good fit for the left‐hand tail of the
distribution either.)
(v)
#Plotting the empirical density function:
plot(density(t),ylim=c(0,0.000001),xlim=c(50000,300000),
main="Simulated data versus fitted distribution",
xlab="Aggregate claim size",col="blue")
#(We have chosen the range of the empirical density function plot
#by trial and error, using the ylim and xlim arguments.)
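Adding the right-hand tail of the fitted gamma density to the plot (a sketch; the colour is an assumption):
lines(seq(50000, 300000), dgamma(seq(50000, 300000), shape=a, rate=l), col="red")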
The simulated data is indeed more positively skew than the fitted distribution.
Answer 7.
set.seed(123)
n <- rnbinom(10,size=5,prob=0.8)
w <- numeric(10)
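The rburr() function is not built into R; a minimal inverse-transform sketch, consistent with the Burr distribution F(x) = 1 − (λ/(λ + x^γ))^α given in the Tables:
rburr <- function(n, a, l, g){
(l*((1-runif(n))^(-1/a)-1))^(1/g)
}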
for(i in 1:10) {
x <- rburr(n[i],a=0.3,l=10,g=1.2)
w[i] <- sum(x)
}
w
[1] 558.14071 387.40722 23.26043 39.80472 50.99664
Answer 8.
set.seed(54321)
rate1 <- 0.005; q.1 <- 0.001
rate2 <- 0.006; q.2 <- 0.002
rate3 <- 0.007; q.3 <- 0.003
S.sim <- numeric(5000)
for (j in 1:5000) {
x<-numeric(100)
for (i in 1:30) {
x[i] <- rexp(1,rate=rate1)
}
for (i in 31:80) {
x[i] <- rexp(1,rate=rate2)
}
for (i in 81:100) {
x[i] <- rexp(1,rate=rate3)
}
death<-numeric(100)
for (i in 1:30) {
death[i] <- rbinom(1,size=1,prob=q.1)
}
for (i in 31:80) {
death[i] <- rbinom(1,size=1,prob=q.2)
}
for (i in 81:100) {
death[i] <- rbinom(1,size=1,prob=q.3)
}
sim <- x*death
sim<-sum(sim)
S.sim[j] <- sim
}
# claims are recorded in £000s, so estimate P(S > 100)
sum(S.sim > 100)/5000
Survival Models
Question 1. Let T₄₀ be the random variable denoting the future lifetime of lives
currently aged 40. It is believed that T₄₀ follows the W(0.00001, 3.5) distribution.
(i) (a) Generate a sample of 10,000 values of T₄₀, setting a seed of 100 and
storing the simulated values in the vector sims.
(b) Estimate P(T₅₀ > 10) using your sample.
Under this model, the hazard at age 40 + t is given by:
μ₄₀₊ₜ = αβt^(β−1)
where α = 0.00001 and β = 3.5.
(ii) (a) Construct a function in R to calculate μₓ for an inputted age, x ≥ 40.
(b) Calculate P(T₅₀ > 10) using your function for μₓ.
(iii) Comment on your answers to parts (i)(b) and (ii)(b).
Question 2. Mortality of a group of lives is assumed to follow Gompertz' law for ages 25 and
over. The force of mortality at age x ≥ 25 is given by:
μₓ = Bc^(x−25) for x ≥ 25
where B = 0.00000729 and c = 1.128.
(i) Plot the force of mortality from ages 25 to 100.
(ii) Plot the PDF of T₄₅, the future lifetime random variable of someone aged
45.
(iii) Calculate a numerical estimate of e̊₅₀.
(iv) (a) Simulate 100,000 values of T₅₀, the future lifetime random variable of
someone aged 50, setting a seed of 50 and storing your simulated values in
the vector sims.
(b) Estimate e̊₅₀ from your simulated values.
Question 3. The ‘Very-ruthless Management Consultancy Company’ pays very high wages
but also has a very high failure rate, both from sackings and through people
leaving.
The R code to create a data frame representing the life table for a typical new
recruit (with durations measured in years) is given below:
life.table = data.frame(duration = 0:9,
lives = c(100000,72000,51000,36000,24000,15000,10000,6000,2500,0))
(i) Construct an extra column in the table that contains in each row the
probabilities of a new recruit’s survival to that year (the first row should
be set to 1).
(ii) Calculate the expected number of complete years that a new graduate will
complete with the company.
(iii) 45 of the graduates that started working at the company on 1 September
last year are still at the company exactly one year later. Calculate the
expected number of complete additional years that these graduates will
complete with the company.
Question 4. The values of μₓ for the AM92 mortality table for ages in the range 17 ≤ x ≤ 119
are calculated using the following formula:
μₓ = a₀ + a₁t + exp[b₀ + b₁t + b₂(2t² − 1)]
where t = (x − 70)/50 and a₀ = 0.00005887, a₁ = −0.0004988, b₀ = −4.363378,
b₁ = 5.544956, b₂ = −0.620345.
(i) (a) Construct a function in R to calculate μₓ for an inputted age, x.
(b) Calculate the value of μ₈₀ using your function, checking it against
the tabulated value on page 81 of the Tables.
(c) Plot a graph of μₓ using a log scale. [Hint: use the log = "y"
argument in the plot() function.]
You are given that qₓ can be approximated using the relationship
qₓ ≈ 1 − exp(−μ_{x+½}).
(ii) (a) Use this approximation to calculate values of qₓ, rounded to 6 decimal
places, to complete the table below:
(iii) (a) Calculate an approximate value for the curtate expectation of life e₂₅.
Hint: you can use the cumprod() function and the fact that ₜpₓ = Π_{j=0}^{t−1} p_{x+j}
(b) Check your value of e₂₅ against the tabulated value on page 82 of the
Tables.
SOLUTIONS
Answer 1
(i)(a) a = 0.00001
b = 3.5
shape = b
scale = a^(-1/b)
We then simulate the values using the rweibull() function:
set.seed(100)
sims = rweibull(10000,shape,scale)
(b) In part (i) we simulated values of T₄₀. So, we first need to convert P(T₅₀ > 10) into a probability
in terms of T₄₀. Using the principle of consistency:
₂₀p₄₀ = ₁₀p₄₀ × ₁₀p₅₀   ⇒   ₁₀p₅₀ = ₂₀p₄₀ / ₁₀p₄₀
To estimate this from the simulated values, we need to calculate how many of the simulated
values are at least 20 and divide by how many are at least 10.
In R, we can do this using square bracket notation with logical conditions and the length()
function:
length(sims[sims > 20]) / length(sims[sims > 10])
[1] 0.7277898
(ii)(a)
mux = function(x){
a * b * (x - 40)^(b - 1)
}
(b) exp(-integrate(mux, 50, 60)$value)
[1] 0.7216983
(iii) The answers are very similar.
We expect the answer to part (ii)(b) to be fairly accurate, even though R evaluates the integral
numerically. The answer to part (i)(b) is slightly different due to sampling error. If we took a
larger sample, we would expect the answer to be closer to that in part (ii)(b).
Answer 2
(i) B = 0.00000729
c = 1.128
mux = function(x){
B * c ^ (x - 25)
}
plot(25:100, mux(25:100), type = "l", col = "blue",
main = "Graph of mu_x against age",
xlab = "age", ylab = "Force of mortality", lwd = 3)
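(iv)(a) The vector sims used in part (b) below can be generated by inverting the Gompertz survival function of T₅₀, S(t) = exp(−B·c^25·(c^t − 1)/ln c); a minimal sketch:
set.seed(50)
u <- runif(100000)
sims <- log(1 - log(u)*log(c)/(B*c^25)) / log(c)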
(b) mean(sims)
[1] 50.94788
Answer 3
(i) First we load in the data:
life.table = data.frame(duration = 0:9, lives =
c(100000,72000,51000,36000,24000,15000,10000,6000,2500,0))
The probabilities of surviving to each year are the number of lives corresponding to that year
divided by the starting number of lives.
We can calculate these using square brackets for indexing and creating a new column using $:
life.table$surv = life.table$lives / life.table$lives[1]
life.table
duration lives surv
1 0 100000 1.000
2 1 72000 0.720
3 2 51000 0.510
4 3 36000 0.360
5 4 24000 0.240
6 5 15000 0.150
7 6 10000 0.100
8 7 6000 0.06
9 8 2500 0.025
10 9 0 0.000
(ii) We can do this using the sum() function and square brackets with a negative number to
remove the first entry:
sum(life.table$surv[-1])
[1] 2.165
(iii) Here, we have w = 7. We first need to calculate the probabilities ₖp₁. These probabilities are the lives in
each year divided by the number of lives in year 1. We can calculate these and add a
column to the table similar to part (i). We set the first row to NA as it represents ₋₁p₁,
which doesn't make any sense:
life.table$surv_after_1_year=
c(NA,life.table$surv[2:nrow(life.table)] / life.table$surv[2])
life.table
e₁ is now the sum of the probabilities in the last column from the third row on. To
calculate this, we can take the sum of the column excluding the first and second entries:
sum(life.table$surv_after_1_year[-c(1:2)])
[1] 2.006944
Answer 4
(i) (a) a0 = 0.00005887
a1 = -0.0004988
b0 = -4.363378
b1 = 5.544956
b2 = -0.620345
mu <- function(x){
t = (x-70)/50
a0 + a1*t + exp(b0 + b1*t + b2*(2*t^2 - 1))
}
(b) mu(80)
[1] 0.06827114
(ii) (a)
We can create a list of ages for the first column using the following R command:
xlist = seq(20,110,by=10)
We can now define a function to calculate the values in the right-hand column of the table in
the question:
fnexp <- function(x) {1-exp(-mu(x + 0.5))}
We can then input the required ages to this function, rounding the answers to 6 decimal
places:
qlist = round(fnexp(xlist),6)
There are various ways to combine the ages and probabilities in the form of a table. One
method is to use the cbind function to join the columns together:
cbind(xlist,qlist)
(b)
The accurate values for ages 20, 60 and 110 (say), from pages 78 and 79 of the Tables, are:
q₂₀ = 0.000582, q₆₀ = 0.008022 and q₁₁₀ = 0.607918
We can see that all the approximations are very good – accurate to at least 4 decimal places.
However, the accuracy diminishes as the age increases.
This is to be expected, because there will be a larger variation in the value of μₓ over the year
of age at the older ages. So the approximation of using the value for the middle of the age
range will not be as accurate here.
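(iii)(a) An approximate value of e₂₅ can then be built from the same approximation using cumprod() (a sketch; summing the survival probabilities up to age 119, the end of the table, is assumed to be sufficient):
ages <- 25:119
px <- exp(-mu(ages + 0.5))   # one-year survival probabilities p_x
e25 <- sum(cumprod(px))      # curtate expectation of life at age 25
e25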
Question 2. A study aimed at estimating hazard rates and survival probabilities from the point
of contracting a particular disease has produced the results in the data file
surv_table_2.csv, which has the following columns:
• Time – the time an event took place measured in weeks since contraction of the
condition
• Deaths – the number of deaths that took place at each time
• Censorings – the number of right censorings that took place at each time
A total of 100 patients were observed since the day that they contracted the
illness. Observation of patients stopped after 25 weeks following contraction of
the condition.
(iv) (a) Update your table to include a column called Lj that contains an estimate
of Λ(t) using the Nelson-Aalen model.
(b) Calculate a set of approximate 95% confidence intervals for Λ(t) at the
time of each death event.
(c) Produce an appropriate graph in R showing a plot of the estimated
integrated hazard function with approximate 95% confidence intervals.
The following code creates a vector of event times and a vector of event codes (1
indicating death and 0 indicating censoring) for the 100 patients:
(v) (a) Create a survival object using the vectors event.times and event.codes.
(b) Calculate the Nelson-Aalen estimate of the survival function and
approximate 99% confidence intervals.
Hint: the conf.int argument in the survfit() function sets the level of the
estimated confidence interval.
(c) Produce an appropriate graph in R showing the Nelson-Aalen estimate and
approximate 99% confidence intervals.
(vi) (a) State an inequality relating the values of Ŝ_KM(t) and Ŝ_NA(t), where Ŝ_NA(t) is the
estimated survival function based on the Nelson-Aalen model.
(b) Demonstrate numerically the validity of the inequality in part (vi)(a) using
the data in this study.
SOLUTIONS
Answer 1
(i) surv.data = read.csv("surv_table_7.1.csv")
(ii) We first need to ensure the survival package is active:
library(survival)
We then need to create a vector of event times as the difference between the end of
observation and time of exposure:
surv.data$event.time = surv.data$observation_end -
surv.data$exposure_time
We also need a vector of event codes. The table has a reason column, which we need to use to
create a vector of 1’s (to indicate death) and 0’s (to indicate censoring):
surv.data$event.code = ifelse(surv.data$reason == "D", 1, 0)
We can then use the Surv() function to create the survival object:
surv.obj = Surv(surv.data$event.time, surv.data$event.code)
surv.obj
[1] 2 3+ 5 8 2 10+ 14 3 10+ 3
(iii) (a)
We can use the survfit() function with the survival object from part (ii):
fitkm = survfit(surv.obj ~ 1, conf.type = "plain")
(b) plot(fitkm,
main = "Kaplan-Meier estimate of survival function",
xlab = "weeks",
ylab = "SKM(t)")
Answer 2
(i) surv.data = read.csv("surv_table_2.csv")
surv.data
(ii) The data in the table is by event time rather than by patient. So, first we need to just extract the
rows that relate to death times:
deaths = surv.data[surv.data$Death > 0, 1:2]
deaths
This gives both tj and dj . We can rename the columns to these titles:
colnames(deaths) = c("tj", "dj")
deaths
Next we need to calculate the nj . As each patient was observed since they contracted the
illness, nj is simply the number of patients that experienced their event on or after tj. To make
the calculation a little easier, we can first create a new column in the original data set that is the
sum of deaths and censoring, ie the total number of events at each time:
surv.data$Events = surv.data$Death + surv.data$Censorings
Next, we calculate the number of events happening at times greater than or equal to all the
death times
deaths$nj = numeric(nrow(deaths))
for (i in 1:nrow(deaths)) {
deaths$nj[i] = sum(surv.data$Events[surv.data$Time >= deaths$tj[i]])
}
deaths
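The Nelson-Aalen estimate and its approximate 95% confidence intervals, referred to in the plotting code below, can be added to the table as follows (a sketch, using Λ̂(t) = Σ dj/nj with variance Σ dj(nj − dj)/nj³; the column names are assumptions):
deaths$Lj = cumsum(deaths$dj / deaths$nj)
deaths$var.Lj = cumsum(deaths$dj * (deaths$nj - deaths$dj) / deaths$nj^3)
deaths$lower = deaths$Lj - 1.96 * sqrt(deaths$var.Lj)
deaths$upper = deaths$Lj + 1.96 * sqrt(deaths$var.Lj)
deaths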
We can then add the confidence intervals using the lines() function and setting the line
type to 2 for a dotted line:
Unfortunately, the upper part of the approximate confidence interval is being cut off by
the limits of the y -axis. We can fix this with the ylim argument in the plot() function:
plot(c(0, deaths$tj), c(0, deaths$Lj), type = "s", lwd = 2,
main = "Nelson-Aalen estimate of the integrated hazard
function",
xlab = "weeks",
ylab = "Lambda(t)", ylim = c(0,0.3))
We now have a vector of times and codes to use with the Surv() function:
surv.obj = Surv(event.times, event.codes)
surv.obj
(b) Using the survfit() function with stype set to 2 and conf.int set to 0.99:
fitna = survfit(surv.obj ~ 1, conf.type = "plain", stype = 2,
conf.int = 0.99)
We can then look at the estimate using the summary() function:
summary(fitna)
(c) plot(fitna,
main = "Nelson-Aalen estimate of survival function",
xlab = "weeks",
ylab = "SKM(t)", ylim = c(0.6, 1))
legend("bottomleft",
c("Nelson-Aalen estimate", "Kaplan-Meier estimate"),
col = c("black", "blue"), lty = 1)
The blue line is never above the black line, as expected based on the relationship between the
estimates.
Question 1. The table below shows the number of circuits a group of 10 army recruits were
able to complete in a training exercise and their ages, split by sex.
Circuits completed    Males (ages)    Females (ages)
1                     -               -
2                     -               30
3                     21, 20          24, 25
4                     -               19
5                     18, 18, 17      -
6                     -               21
Two Cox proportional hazards models are being considered for modelling the
results:
Model 1: ln λ(t; z₁) = ln λ₀(t) + β₁z₁
Model 2: ln λ(t; z₁, z₂) = ln λ₀(t) + β₁z₁ + β₂z₂
Here:
• λ(t; z₁) and λ(t; z₁, z₂) are the hazard rates for 'dropping out', i.e. for the
recruit reaching their limit in terms of the number of circuits they can
complete
• z₁ is the recruit's sex (z₁ = 0 for males, z₁ = 1 for females)
• z₂ is the recruit's age.
(i) Construct a data frame that contains a row for each recruit and the
following columns:
• Age – the age of the recruit
• Sex – the sex of the recruit
• Circuits – the number of circuits completed by the recruit
• Status – whether each recruit experienced the death event (1) or
was censored (0)
Hint: the data.frame(<col 1> = <vector 1>, ..., <col n> = <vector
n>) function creates a data frame with columns called <col 1>,...,<col n>
containing the values in <vector 1> to <vector n>.
Question 2.
(i) (a) Construct a function that returns the partial log-likelihood for an
inputted value of β₁.
(b) Show that the value of the partial log-likelihood (ignoring
constants) is –16.78165 when β₁ = 0.3.
(ii) Plot a graph of the partial log-likelihood function over a suitable range,
indicating an appropriate interval that contains β̂₁.
(iv) Compare your answer to part (iii) to your answer to Question 1 part (ii)(b).
SOLUTIONS
Answer 1
(i) Creating the necessary vectors and putting them into a data frame:
circuits = c(2,3,3,3,3,4,5,5,5,6)
status = rep(1,10)
sex = c(1,0,0,1,1,1,0,0,0,1)
age = c(30,21,20,24,25,19,18,18,17,21)
PH_data = data.frame(circuits,status,sex,age)
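The two models can then be fitted with the survival package (a sketch; the object names fit1 and fit2 match those used below, and the handling of tied event times is an assumption):
library(survival)
fit1 = coxph(Surv(circuits, status) ~ sex, data = PH_data)
fit2 = coxph(Surv(circuits, status) ~ sex + age, data = PH_data)
summary(fit2)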
(b) The estimated values of β₁ and β₂ are now -2.0086 and 0.5744.
This gives the fitted model:
Model 2: ln λ(t; z₁, z₂) = ln λ₀(t) - 2.0086z₁ + 0.5744z₂
We now have:
exp(fit2$coefficients)
sex age
0.1341723 1.7760945
Since β̂₁ is negative, this model now suggests that the drop-out rate is lower for the
females compared to the males, but again this difference is not significant (p-value =
14.1%).
Since β̂₂ is positive, this model suggests that the drop-out rate increases with age (77.6%
per year of age according to the fitted model). This effect is significant in this model,
since the p-value of 4.3% is less than 5%.
(iv) We are testing the null hypothesis that Model 2 is not a significant improvement over
Model 1.
Using the hint in the question, we can use the anova() function.
anova(fit1, fit2, test = "Chi")
This tells us that the partial log-likelihood for the first model is –16.781 and –13.419 for
the second. The observed value of the test statistic, −2(ℓ₁ − ℓ₂), is 6.7042 and the
associated p-value is 0.009618.
As the p-value of 0.962% is less than 5%, there is sufficient evidence to reject the null
hypothesis that Model 2 is not a significant improvement over Model 1. It appears
reasonable to conclude that incorporating age has improved the model.
Answer 2
(i) (a) Constructing the function:
part.log.lik = function(b){
4 * b - log(5 * exp(b) + 5) - 4 * log(4 * exp(b) + 5) -
log(2 * exp(b) + 3) - 3 * log(exp(b) + 3)
}
(b) Checking the value:
part.log.lik(0.3)
[1] -16.78165
(ii) plot(part.log.lik, 0, 1, lwd = 2,
main = "Partial log-likelihood",
xlab = "beta", ylab = "log(L(beta))")
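For part (iii), the maximum can be found numerically, for example by minimising the negative of the function with nlm() (the starting value 0.5 is an assumption, chosen from the interval identified in part (ii)):
nlm(function(b) -part.log.lik(b), 0.5)$estimate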
(iv) This value is almost exactly the same as the estimate from Question 1.
Markov Chains
Question 1. A Markov chain has the following state space and one-step transition
probabilities:
(i) Construct a matrix that contains the one-step transition probabilities, labelling the
rows and columns with ‘state 1’, ‘state 2’ and ‘state 3’.
(ii) Construct a markovchain object, mc, with the name 'Q1MC' and using the
matrix from part (i) as the transition probability matrix.
(iii) (a) Determine whether the Markov chain is irreducible using functions in R.
The @ symbol can be used with markovchain objects to extract its components. The
following code extracts the transition matrix:
mc@transitionMatrix
(iv) Check the row sums of the extracted transition matrix are all 1.
(v) (a) Calculate the probability that the chain will be in state 1 after 4 steps,
given it is currently in state 2, using mc.
(b) Calculate the probability that the chain will be in state 2 after 8 steps, using
mc.
Question 2. The most recent 18 values of a discrete-time process with states 1, 2 and 3 were as
follows:
2 1 2 1 3 1 1 2 2 1 1 3 3 1 2 1 2 1
(i) Create a vector named past18 containing the values of the past states.
Enter the following lines of R code, which fit a Markov chain model to the past
data:
fit1 = markovchainFit(past18)
You should tackle questions 3, 4 and 5 using either the basic R functions or functionality in the
markovchain package (or both). Answers for both approaches are presented in the solutions.
Question 3. A no-claims discount (NCD) system for motor insurance is to be modelled using a
Markov chain with constant transition probabilities, as shown in the diagram:
State 1 No discount
State 2 15% discount
State 3− 30% discount (at least one claim in previous year)
State 3+ 30% discount (no claims in previous year)
State 4 40% discount
(i) (a) Construct the one-step transition probability matrix for this process P.
(b) Check that the rows of the matrix sum to 1.
(ii) Calculate the probabilities for the discount rate this policyholder will be
receiving after five years.
(iii) Calculate the probabilities for the discount rate for this policyholder after 20
years.
(iv) (a) Calculate P²⁵.
(b) Comment on the rows of P²⁵.
Hint: you may wish to use the functions matrix(), diag(), rowSums() and the
matrix multiplication operator %*%.
Question 4. An insurance company has been using a Markov chain to model its no-claims
discount (NCD) system, which offers the following discounts to motorists on their
annual premium:
Level 1 No discount
Level 2 15% discount
Level 3 25% discount
Level 4 30% discount
• After a claim-free year (probability 0.85), policyholders move up a level or stay at
Level 4.
• After a year with exactly one claim (probability 0.11), policyholders move down a
level or stay at Level 1.
• After a year with more than one claim, policyholders move down two levels,
subject to a minimum level of Level 1.
For a particular class of policyholder (same type of car, type of cover etc.), the
insurance company has re-evaluated the one-step probability distribution so that there is
now a 0.83 probability of a claim-free year and a 0.12 probability of a policyholder
claiming exactly once in any given year. The full premium in this class is 750.
(v) Construct a function that calculates the expected total premium income over n
years for a given starting distribution, income vector and one-step transition
matrix, without using the markovchain package.
The insurance company is considering setting the additional premium at between £10 and
£15 per annum.
(viii) Calculate the new expected premium income over the next 10 years for a driver
currently on Level 4 who pays the additional premium for the case where:
(a) the additional premium is £10, storing your answer in the object prem.10
(b) the additional premium is £15, storing your answer in the object prem.15.
(ix) Comment on whether the insurer should set the additional premium at £10, with
reference to your answer to part (vi)(c).
The overall expected premium income is linearly related to the additional premium
amount.
(x) Find the additional premium required for NCD protection (to 2 decimal places)
so that the new expected premium income is equal to that calculated in part
(vi)(c), by using linear interpolation or otherwise.
(xi) Explain why the insurance company might opt for a much higher additional
premium than that calculated in part (vi).
Question 5. An insurance company is using a Markov chain to model its no-claims discount
(NCD) system, which offers the following discounts to motorists on their annual
premium:
Level 1 No discount
Level 2 10% discount
Level 3 20% discount
Level 4 30% discount
Level 5 40% discount
After a claim-free year (probability 0.9), policyholders move up a level or stay at Level
5. After a year with exactly one claim (probability 0.08), policyholders move down a
level or stay at Level 1. After a year with more than one claim, policyholders move to
Level 1.
(i) Construct the one-step transition probability matrix P.
The probability that a policyholder on Level i is on Level j after n years is
denoted by pᵢⱼ(n).
(ii) Construct a function that calculates the n -step transition probability matrix as a
function of P and use it to calculate the following probabilities:
(a) p₁₅(5)
(b) p₅₃(4)
(c) p₃₄(6)
For a particular class of policyholder (same type of car, type of cover etc) the full
premium is 800 and there are 2,000 policyholders in the class. You may assume that
the policyholders have been driving for a number of years.
(iii) Calculate the stationary distribution and use it to calculate the expected annual
premium income for this policy class.
To improve the overall premium income and make the system fairer to the safer
drivers, the insurance company decides to change the rules of the NCD system slightly,
so that any policyholder making exactly one claim in a given year will be moved down
two levels if they also made a claim the previous year, subject to a minimum level of
Level 1.
(iv) Explain how many more states will be needed to continue to model the NCD
system as a Markov chain, and give the new one-step transition probability
matrix Q .
(v) Recalculate the probabilities in part (ii) under the new rules. You may assume
that the driver in (c) did NOT have a claim the previous year.
(vi) Calculate the increase in expected annual premium income from the change in
the rules and comment on your answer.
Question 6. Data is available for the movement of taxis in London. The city is divided into
three zones “North”, “South” and “West”. The movement of a taxi from one zone
to another will depend only on its current position. The following probabilities
have been determined for taxi movements:
• Of all taxis in the North zone, 30% will remain in North and 30% will move to
South, with the remaining 40% moving to West.
• In the South zone, taxis have a 40% chance of moving to North, 40% chance of
staying in South and 20% chance of moving to West.
• Of all the drivers in the West zone, 50% will move to North and 30% to South
with the remaining 20% staying in West.
The movement of taxis in London will be modelled in R using a Markov Chain.
(i) Create a vector with the state space of the Markov Chain, using R code. You
should print this to the screen and paste into your answer.
(ii) Construct a transition matrix of the zone movement probabilities. You should
print this to the screen and paste into your answer.
(iii) Load the R package for Markov Chains and paste your coding into your answer.
(iv) Create a Markov Chain object with state space equal to your vector in part (i)
and transition matrix from part (ii). You should print this to the screen and
paste into your answer.
(v) Calculate the probability that a driver currently in the North zone will be in the
North zone after:
(a) two trips (b) three trips
(vi) Determine the stationary state of the Markov Chain.
Question 7. Before answering this question, the R package for Markov chains should be
loaded into R using the following code:
install.packages("markovchain")
library(markovchain)
Xt is a Markov chain on the state space {1,2,3} with the following transition
matrix:
Question 8. Suppose the transition probabilities are a function of the age of a person. The
transition probability of a person aged x moving:
• from Healthy to Sick state is 0.007x
• from Healthy to Death state is 0.001x
• from Sick to Death state is 0.002(100-x)
• from Sick to Healthy is 0.006(100-x).
Assuming 100 is the maximum age. Calculate the probability of:
(i) Healthy person aged 30 will be in Sick state after 4 years.
(ii) Sick Person aged 25 will be Dead after 7 years.
Solutions
Answer 1.
Answer 2
Answer 3
Answer 4
Answer 5
Answer 6
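A minimal sketch for this answer, using the movement probabilities given in the question (object names and labels are assumptions):
library(markovchain)
zones <- c("North", "South", "West")
P <- matrix(c(0.3, 0.3, 0.4,
0.4, 0.4, 0.2,
0.5, 0.3, 0.2), nrow = 3, byrow = TRUE,
dimnames = list(zones, zones))
mc <- new("markovchain", states = zones, transitionMatrix = P, name = "Taxis")
# (v) probability of being in North after two and three trips, starting from North
(P %*% P)["North", "North"]
(P %*% P %*% P)["North", "North"]
# (vi) stationary distribution
steadyStates(mc)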
Answer 7
Answer 8
(i) H2S<-function(x){ 0.007*x}
H2D<-function(x){ 0.001*x}
S2H<-function(x){ 0.006*(100-x)}
S2D<-function(x){ 0.002*(100-x)}
transmat<-function(x){
M<-matrix(0,nrow=3,ncol=3)
M[1,1]<-1-H2S(x)-H2D(x)
M[1,2]<-H2S(x)
M[1,3]<-H2D(x)
M[2,1]<-S2H(x)
M[2,2]<-1-S2H(x)-S2D(x)
M[2,3]<-S2D(x)
M[3,1]<-0
M[3,2]<-0
M[3,3]<-1
M
}
n=30
B<-c(1,0,0)
for (i in 1:4){
B=B%*%transmat(n+i-1)
}
B[2]
Hence the probability of Healthy person aged 30 will be in Sick state after 4 years is 0.2608138.
(ii) n=25
C<-c(0,1,0)
for (i in 1:7){
C=C%*%transmat(n+i-1)
}
C[3]
Hence the probability of sick person aged 25 will be in Death state after 7 years is 0.4333386.
The annual transition rates collected from past data are given here:
μ₁₂ = 0.12    μ₁₃ = 0.02    μ₂₁ = 0.18
μ₂₃ = 0.05    μ₃₁ = 0.013    μ₃₂ = 0.05
(iv) (a) Recalculate the probabilities from part (iii) using these new rates.
(b) Comment on your answer to part (iv)(a).
Question 3. In a particular European country, where anyone over the age of 18 may vote, there
are 3 main political parties:
• the ‘Reds’ (R)
• the ‘Blues’ (B)
• the ‘Xenophobes’ (X).
A company that produces and sells opinion polls has suggested that the voting
population’s political affiliation be monitored over time using a 3-state Markov
process.
(i) Explain the most suitable choice for the time set if a Markov chain is used.
(c) the rate at which a 50-year old member of the Blues moves to the
Xenophobes.
Over a very small time period, h, the transition probability matrix P(t, t+h) can be
approximated as:
P(t, t+h) ≈ I + h·A(t)
where I is the appropriate identity matrix with the same dimensions as A(t).
Probabilities over longer time periods can then be estimated by successively
multiplying the probabilities over shorter time periods (increments of length h ) as
follows:
P(s, t) ≈ Π_{j=0}^{(t−s)/h − 1} P(s + hj, s + hj + h)
(iv) Construct a function to estimate the value of P(s,t) for a given subdivision
value of h and start and end times s and t .
(v) Estimate:
(a) the probability that a 25-year old member of the Blues moves to the Reds
over the next 10 years, using h=1/12
(b) the probability that a 60-year old member of the Reds moves to the
Blues over the next 5 years, using h=1/365
(c) the probability that a 50-year old member of the Xenophobes moves to
the Reds over the next 7 years, using h=1/2.
(vi) Discuss whether this time-inhomogeneous Markov jump process is a good
model for tracking political party affiliation.
Solutions
Answer 1
Answer 2
Answer 3
Time Series
Question 1. Plot the ACF and PACF up to lag k = 20 for the AR(2) time series Yₜ with defining
equation: Yₜ = 15 + 0.8Yₜ₋₁ − 0.6Yₜ₋₂ + Zₜ
where the white noise terms {Zₜ} are independent and identically distributed random
variables with mean 0 and standard deviation σ = 1.
You should plot your graphs one directly above the other and label them clearly.
Question 2. The time series ldeaths is available in R. It measures the number of deaths in the
UK from lung disease, over a period of time.
(a) the number of observations that have been made per year
(b) the time of the first observation
(c) the time of the last observation
(d) the autocorrelation function at lag 2 (ie ρ₂)
(e) the number of people who died from lung disease in February 1978
(f) the total number of people who died from lung disease during 1976.
Question 3. Plot the sample ACF and sample PACF for the data ldeaths, up to lag k = 5. Your
graphs should be side by side, and should have appropriate labels.
Question 4. The following code generates 1,000 observations from a time series, and saves it as
an object s.
xx <- numeric(1003)
set.seed(4567)
ww <- rnorm(1003)
xx[1:3] <- ww[1:3]
for(t in 4:1003) { xx[t] <- 0.8*xx[t-1] + 0.2*xx[t-3] + ww[t] + 2*ww[t-1] }
s<-ts(xx[4:1003])
Run the code above and hence determine how many times the time series s should be
differenced.
Question 5. The file sales.txt records the number of sales per month of a particular type of
insurance product, over the ten‐year period from January 2005 to December 2014.
(i) Import this data and classify it as a time series object. You should specify
the relevant dates.
(ii) Apply the method of moving averages and the method of seasonal means
to the data. (You may assume the time series is additive.)
(iii) Plot the ACF and the PACF of the remainder, and comment on the result.
Question 6. The volume of water in a reservoir is measured every month from 2010, for five
years. The results (in millions of gallons) are stored in the file seasonality.txt
(i) Import this data and classify it as a time series object, specifying the
relevant dates.
(ii) Plot the data, the sample ACF and the sample PACF in a single plotting
area.
(iii) Remove the seasonality from the data using seasonal differencing. (You
should assume that the data exhibits yearly seasonal variation.)
(iv) Use your answer to part (iii) to plot the seasonally‐adjusted data and its
corresponding sample ACF and sample PACF. Your graphs should be
displayed in a single plotting area.
Question 7. This question uses the file ‘TS.txt’.
Run the following code:
TS<-read.table("TS.txt",sep="\t")
TS<-ts(TS[,1],start=1,end=length(TS[,1]),frequency=1)
(i) Fit an AR(2) model to the time series TS and interpret the result.
(ii) Carry out a Ljung‐Box test using five lags, and hence determine whether
the fitted model is an appropriate fit.
Question 8. This question uses the file ‘Wt.csv’.
Run the following code:
Wt<-read.table("Wt.csv",sep="",header=TRUE)
Wt<-ts(Wt$value,start=min(Wt$day),end=max(Wt$day))
The time series Wt represents the value of a particular commodity, recorded daily.
Plot the correlogram and partial correlogram for the data, in a 2 × 1 layout. Hence
suggest what order of ARIMA you would fit to this time series.
Question 9. This question uses the file ‘Xt.csv’.
Run the following code:
Xt<-read.table("Xt.csv",sep=",",header=TRUE)
Xt<-ts(Xt,1990,frequency=12)
Xt is a set of time series data recorded monthly from January 1990 to August
2006.
(i) Calculate the Akaike Information Criteria for ARIMA(p,d,q) models with p ≤ 2,
d ≤ 2 and q ≤ 2.
(ii) Hence, determine which model is the best fit.
Question 11. Generate the time series object data using the following code:
set.seed(1558)
data<-100+arima.sim(list(ar=runif(1,0.3,0.6),ma=runif(1,12,15)), n=40)
data<-ts(data,start=c(2008,1),frequency=4)
This stationary time series represents a company’s revenue in thousands, over a
ten‐year period.
(i) Fit an ARMA(1,1) to the data, and hence estimate the company’s revenue
up until the end of 2019. (You do not have to state the values of the fitted
parameters.)
(ii) Use exponential smoothing to estimate the company’s revenue, over the
same period.
(iii) Create a graph that shows the time series data and the two sets of
forecasts. You should label each line clearly.
Question 14. The following two equations form a bivariate time series (Xₜ, Yₜ)ᵀ:
Xₜ = 0.3Xₜ₋₁ + 0.1Yₜ₋₁ + eₜ
Yₜ = −0.2Xₜ₋₁ + 0.6Yₜ₋₁ + eₜ′
where {eₜ} and {eₜ′} are independent white noise processes.
Determine the eigenvalues of the parameter matrix and hence state whether
(Xₜ, Yₜ)ᵀ is stationary.
SOLUTIONS
Answer 1
We calculate the theoretical ACF and PACF using the ARMAacf function, with
lag.max=20:
AR2acf<-ARMAacf(ar=c(0.8,-0.6),lag.max=20)
AR2pacf<-ARMAacf(ar=c(0.8,-0.6),lag.max=20,pacf=TRUE)
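A possible sketch for the ACF bar chart, stacking the two graphs one above the other as the question requires:
par(mfrow=c(2,1))
barplot(AR2acf,main="ACF of AR(2)",col="blue",xlab="Lag")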
Similarly, we create the bar chart for AR2pacf, but including the argument
names.arg=seq(1,20), in order to label the bars explicitly:
barplot(AR2pacf,main="PACF of AR(2)",col="red",xlab="Lag",
names.arg=seq(1,20))
Answer 2
(a) Number of observations per year
frequency(ldeaths)
[1] 12
So there are 12 observations per year.
(b) Time of the first observation
start(ldeaths)
[1] 1974 1
Since data is recorded monthly, the first observation is for January 1974.
(c) Time of the last observation
end(ldeaths)
[1] 1979 12
So the last observation is for December 1979.
(d) Autocorrelation function at lag 2
The acf function will plot a graph of the sample autocorrelation function. So to access the
information behind the graph, we can define the object a, and set plot=FALSE:
a<-acf(ldeaths,plot=FALSE,lag.max=72)
(We’ve stipulated lag.max=72 since there are 72 observations in the dataset:
length(ldeaths)
[1] 72
However, any lag of 25 or over would score marks.)
We've calculated the sample ACF ρₖ for lags k = 0, 1/12, 2/12, …, 71/12, 72/12 (measured in
years, since the data are monthly). So to find ρ₂ we need the 25th element of a$acf.
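Extracting the value itself (using the object a created above):
a$acf[25]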
(e) Number of people who died from lung disease in February 1978
One option is to print the whole dataset and read off the required value:
ldeaths
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
So the number of deaths in February 1978 was 3,137.
However, we wouldn’t want to print out all the data for very large datasets. So instead, we can
extract the answer like this:
Since the time series starts at January 1974 and we have monthly data, the observation for
February 1978 is the 50th element of the series:
ldeaths[50]
[1] 3137
(f) Total number of people who died from lung disease during 1976
We could sum the required elements:
sum(ldeaths[25:36])
[1] 25718
Answer 3
First we set our graphical parameters:
par(mfrow=c(1,2))
Then we plot the ACF using the acf function with lag.max=60 (since, for k = 5 and
monthly data, we need to calculate 5 × 12 = 60 autocorrelations):
acf(ldeaths,lag.max=60,main="Number of deaths from lung
disease", ylab="Sample ACF",col="red")
Similarly, plotting the PACF using the pacf function:
pacf(ldeaths,lag.max=60,main="Number of deaths from lung
disease", ylab="Sample PACF",col="blue")
Answer 4
Differencing s
It’s very quick to difference data in R so we will do this a few times and then analyse the
results:
ds<-diff(s)
dds<-diff(ds)
(Alternatively, we could have calculated dds as dds<-diff(s,differences=2).)
• we’ll tell R to plot the graphs in a 1 × 3 array using par(mfrow=c(1,3)) and then
reset the graphical parameters afterwards using par(mfrow=c(1,1))
• we use semicolons ; to carry out more than one command in a single line
par(mfrow=c(1,3))
acf(s); acf(ds); acf(dds)
par(mfrow=c(1,1))
There is a steady linear decrease in the ACF of the undifferenced time series,
which indicates that differencing is required. This disappears after differencing
once (ie in the graph of ds), which indicates that we don’t need to difference a
second time.
Analysing the variance
Calculating the sample variance for each of our datasets:
var(s); var(ds); var(dds)
[1] 652.342 [1] 4.760322 [1] 7.903931
The variance reduces (from 652 to 4.76) if we difference s once, but increases (up to
7.90) if we difference a second time. Again this indicates that we should difference the
data only once.
Answer 5
(i) Importing the data:
We use the read.table function, and check this looks sensible using the head
function:
sales<-read.table("sales.txt")
head(sales)
V1
1 1057
2 1093
3 1120
4 1120
5 1091
6 1082
Now we classify the data as a time series, where:
• we specify the start date as January 2005 using the argument start=c(2005,1)
• we specify monthly observations using the argument frequency=12.
sales<-ts(sales,frequency=12,start=c(2005,1))
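For part (ii), a possible sketch is shown below; decompose applies the method of moving averages to an additive series, and the seasonal means can be obtained by averaging the observations for each calendar month (object names are illustrative):
d <- decompose(sales)     # method of moving averages (additive decomposition)
plot(d)
# method of seasonal means: average the observations month by month
seasonal.means <- tapply(as.numeric(sales), cycle(sales), mean)
seasonal.means - mean(seasonal.means)   # seasonal estimates centred on zero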
The ACF still shows significant seasonal variation in the data. Hence there appears to be
some residual seasonality that the decomposition hasn’t captured.
However, the PACF cuts off for lags k > 2, which suggests that (assuming we can remove
the remaining seasonal variation) an AR(2) model might be appropriate.
Answer 6
(i) Importing the data
Using the read.table function and checking that the data has been imported sensibly:
seas<-read.table("seasonality.txt")
head(seas)
V1
1 212.7483
2 239.6343
3 242.7506
4 239.3206
5 233.8502
6 207.3584
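A minimal sketch of one way to approach parts (ii) to (iv), with illustrative object names:
seas <- ts(seas[,1], start = c(2010, 1), frequency = 12)
par(mfrow = c(3, 1))
plot(seas, main = "Reservoir volume"); acf(seas); pacf(seas)
# part (iii): remove the yearly seasonality by seasonal differencing
sa <- diff(seas, lag = 12)
# part (iv): the seasonally-adjusted data with its sample ACF and PACF
plot(sa, main = "Seasonally differenced data"); acf(sa); pacf(sa)
par(mfrow = c(1, 1))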
Answer 7
(i) Fitting an autoregressive model of order 2
We can use the arima function, where the order argument represents p, d and q
respectively. Since an AR(2) is the same as an ARIMA(2,0,0) , we have:
fit<-arima(TS,order=c(2,0,0))
fit
Call:
arima(x = TS, order = c(2, 0, 0))
Coefficients:
ar1 ar2 intercept
0.2193 0.4475 500.1856
s.e. 0.0889 0.0894 0.2803
sigma^2 estimated as 0.9245: log likelihood = -138.28, aic
= 284.56
So we have fitted a model of the form:
Xₜ = μ + α(Xₜ₋₁ − μ) + β(Xₜ₋₂ − μ) + Zₜ
where: μ̂ = 500.1856, α̂ = 0.2193, β̂ = 0.4475, and the estimated variance of the
white noise terms Zₜ is σ̂² = 0.9245.
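For part (ii), a possible sketch of the Ljung-Box test on the residuals of this fitted model (treating the two AR coefficients as the fitted degrees of freedom) is:
Box.test(fit$residuals, lag = 5, type = "Ljung-Box", fitdf = 2)
A p-value above 5% would suggest that the fitted model is an adequate fit.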
Answer 8
We’ll plot the ACF and the PACF one above the other by setting the graphical
parameter:
par(mfrow=c(2,1))
(This means the layout of the graphs will be two rows deep and one column wide.)
Now we use the acf and pacf functions:
acf(Wt, main="Data: Wt",ylab="sample ACF")
pacf(Wt,main="",ylab="sample PACF")
The ACF appears to decay. The PACF cuts off (ie the values fall inside the confidence
interval) for lags greater than 1.
Hence, we would fit an AR(1) model, ie an ARIMA(1,0,0).
Answer 9
We need to fit a model for every combination of p, d and q and calculate the AIC. That’s
a lot of work! So we’ll write a ‘loop’ to do the work for us. In our solution below, we’ll
start by calculating the AIC for just one model, and then extend this to do the same for all
models.
Calculating the AIC for one model
First let’s create a vector with four entries:
answer<-numeric(4)
All the entries in this vector are zero for now. We’ll populate it with our answers in a
moment.
To calculate the AIC for an ARIMA(0,1,2) model (for example), we fit the model using
the arima function, and then extract the AIC from this model by using $aic, like this:
model<-arima(Xt,order=c(0,1,2))
aic<-model$aic
In fact, we can simplify these last two lines to:
aic<-arima(Xt,order=c(0,1,2))$aic
We create a vector to display this result alongside the values of p , d and q respectively:
row<-c(0,1,2,aic)
row
[1] 0.0000 1.0000 2.0000 886.8326
Finally we append this onto the bottom of our vector answer, using the rbind
function:
answer<-rbind(answer,row)
answer
[,1] [,2] [,3] [,4]
answer 0 0 0 0.0000
row 0 1 2 886.8326
(The first row is a ‘dummy’ row, we can ignore it.)
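Extending this to all the required combinations with a triple loop might look like the following sketch (some parameter combinations can fail to converge, in which case wrapping the arima call in try() is a sensible precaution):
answer <- numeric(4)
for (p in 0:2){
  for (d in 0:2){
    for (q in 0:2){
      aic <- arima(Xt, order = c(p, d, q))$aic
      answer <- rbind(answer, c(p, d, q, aic))
    }
  }
}
answer
The model with the smallest AIC in the final column is then the best fit for part (ii).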
Answer 10
(i) Fitting the model
To fit an ARMA(1,1) model we use the arima function, with order=c(1,0,1):
fit<-arima(Yt,order=c(1,0,1))
fit
Call:
arima(x = Yt, order = c(1, 0, 1))
Coefficients:
ar1 ma1 intercept
0.8772 0.4089 109.8152
s.e. 0.0387 0.0815 1.7909
sigma^2 estimated as 4.749: log likelihood = -396.76,
aic = 801.52
So for our fitted model we have: μ̂ = 109.8152, α̂ = 0.8772, β̂ = 0.4089 and σ̂² = 4.749.
for (i in 2:(n-1)) {
if (et[i]>et[i+1] && et[i]>et[i-1]) {
turns[i]<-1
} else {
next
}
}
In the code above:
• notice that our loop runs from i =2,…,(n-1), not from i=1,…. n, because the first and
last residuals cannot be turning points
• the square brackets [] let us extract individual entries from the vector, so et[i] is the ith
element of vector et.
Similarly, to check whether each residual is a minimum, we can simply change the
direction of the inequality signs:
for (i in 2:(n-1)) {
if (et[i]<et[i+1] && et[i]<et[i-1]) {
turns[i]<-1
} else {
next
}
}
We can calculate the total number of turning points in the residuals as:
nt<-sum(turns)
If the residuals form a white noise process, we would expect the mean and variance of the
number of turning points in the residuals to be:
E<-2/3*(n-2)
V<-(16*n-29)/90
where n is the number of residuals.
Our test statistic should include a continuity correction. Comparing the expected number
of turning points with the actual number of turning points:
E; nt
[1] 118.6667
[1] 110
Since nt is less than E, we will apply the continuity correction by including +0.5 in our
calculation of the test statistic:
ts<-(nt-E+0.5)/V^0.5
ts
[1] -1.451
Our null hypothesis is H₀: the residuals form a white noise process. The 5% critical values are
±1.96, so we have insufficient evidence to reject H₀, and we conclude that the residuals
are white noise.
Answer 11
(i)
fit<-arima(data,order=c(1,0,1))
start(data);end(data);frequency(data)
p<-predict(fit,n.ahead=8)
p<-p$pred; p
(ii)
HW<-HoltWinters(data,beta=FALSE,gamma=FALSE)
p2<-predict(HW,n.ahead=8); p2
(iii)
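A possible sketch for part (iii), using the forecast objects p and p2 created above:
ts.plot(data, p, p2, col = c("black", "red", "blue"),
        main = "Revenue and forecasts", ylab = "Revenue (000s)")
legend("topleft", legend = c("Data", "ARMA(1,1) forecast", "Exponential smoothing"),
       col = c("black", "red", "blue"), lty = 1)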
Answer 12
(i)
dx<-diff(x)
fit<-ar(dx)
pd<-predict(fit,n.ahead=60)$pred
px<-cumsum(pd)+tail(x,1)
px<-ts(px,start=c(2010,1),frequency=12)
px
px[time(px)==2014+11/12]
(ii)
min(px,x); max(px,x)
ts.plot(x,main="Past data and forecast",ylab="Number of diagnoses",
xlim=c(1990,2015),ylim=c(1796,1833),col="blue")
lines(px,col="red")
Answer 13
p<-predict(fit,n.ahead=100)$pred
p[time(p)==2038.75]
Answer 14
m<-matrix(c(0.3,-0.2,0.1,0.6),2,2); m   # parameter matrix of the bivariate (VAR(1)) process
eigen(m)$values
The process is stationary if both eigenvalues have modulus strictly less than 1.
Answer 15
(i)
PP.test(0.2*s+1.5*t)$p.value
(ii)
PP.test(s)$p.value
PP.test(diff(s))$p.value
PP.test(t)$p.value; PP.test(diff(t))$p.value
find.coint<-function(coint) {
comb<-coint[1]*s+coint[2]*t
test<-PP.test(comb)$p.value
test
}
v<-c(1,1)
find.coint(v)
fit<-nlm(find.coint,v)
alpha<-fit$estimate[1]
beta<-fit$estimate[2]
alpha; beta
PP.test(alpha*s+beta*t)$p.value
Question 1. An insurer has recorded its claims, alongside the year and month in which each
claim event occurred, in the following file:
table.txt
(ii) Group the claims according to the quarter in which they occurred (ie
quarter 1 refers to the months January to March, quarter 2 refers to months
April to June, etc) and hence find the maximum claim that occurred in the
third quarter of 2016.
Question 2. Calculate the threshold exceedances for these claims at the threshold u = 3,000,000.
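A minimal sketch of the calculation, assuming the claim amounts are stored in a numeric vector called claims (an illustrative name):
u <- 3000000
exceedances <- claims[claims > u] - u   # amounts by which claims exceed the threshold
exceedances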
Question 3. Claim sizes in the file ‘exp.txt’ are assumed to follow an exponential distribution
with parameter λ. Estimate the distribution of the threshold exceedances for this
data.
Question 4. Plot the hazard rate for the Weibull(0.4, 5) distribution, ie where c = 0.4 and γ =
5. Hence state whether this distribution has a heavy tail.
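A possible sketch, assuming the Actuarial Tables parameterisation F(x) = 1 − exp(−c·x^γ), for which the hazard rate is h(x) = cγx^(γ−1):
cc <- 0.4; gam <- 5
x <- seq(0.01, 10, by = 0.01)
haz <- cc * gam * x^(gam - 1)           # hazard rate c*gamma*x^(gamma-1)
plot(x, haz, type = "l", col = "blue", main = "Hazard rate of Weibull(0.4, 5)",
     xlab = "x", ylab = "h(x)")
An increasing hazard rate indicates a lighter tail; a decreasing hazard rate indicates a heavier tail.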
Question 5. Plot the density ratios for the Pareto(2, 4) and Pareto(7,12) distributions and hence
state which distribution has the heavier tail.
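A possible sketch, using the dpareto function defined earlier in this document:
x <- seq(0, 100, by = 0.1)
ratio <- dpareto(x, 2, 4) / dpareto(x, 7, 12)
plot(x, ratio, type = "l", col = "blue",
     main = "Density ratio: Pareto(2,4) / Pareto(7,12)", xlab = "x", ylab = "ratio")
If the ratio increases without limit as x grows, the Pareto(2,4) distribution has the heavier tail.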
Solutions
Answer 1
So the maximum claim in the third quarter of 2016 was 31,075. This is consistent with our answer to
part (i).
Answer 2
Answer 3
Answer 4
Answer 5
Machine Learning
Question 1. The built-in data set in R called faithful contains 272 records of the waiting time
between eruptions and the duration of the eruption for the Old Faithful geyser.
(iii) Show the result of the clustering, explaining each component of the output.
(iv) (a) Show the standardised cluster centres from part (iii).
(b) Calculate the cluster centres in the original units of the data.
Someone suggests rerunning the algorithm with initial cluster centres of (2.2,80) and
(4.5,55).
(viii) (a) Calculate z-scores for the proposed cluster centres based on the data in
faithful.
where <initial centres> is an r × c matrix (or data frame) with r as the number
of clusters and c as the number of variables in the data set.
(ix) Show the standardised cluster centres for the clusters calculated in part (viii)(b)
and compare them to those calculated in part (iv)(a).
Question 2. A movie streaming website has collected data about movie preferences on some of
their customers. The customers rated various movies, each assigned a unique genre,
giving a score from 1 to 100. The file movie.data.csv contains the following
information:
• Horror: the average score given by the customer for horror films
• Romcom: the average score given by the customer for romantic comedies
• Action: the average score given by the customer for action movies
• Comedy: the average score given by the customer for comedies
• Fantasy: the average score given by the customer for fantasy films.
• Cluster: the proposed cluster allocation for the customer (1, 2 or 3).
An analyst working for the website has attempted to allocate customers into clusters
using the k-means algorithm based on their average movie scores across the genres.
The analyst has stored the proposed cluster for each customer in the column Cluster.
The analyst realises that the algorithm was terminated too early and there is one data
point that is not correctly allocated to the nearest centroid.
(ii) (a) Identify the data point that is not correctly allocated to its nearest
cluster centre.
Hint: the which.min(<vector>) function outputs the index
of the minimum value in <vector>.
The analyst now further realises that the data should really be standardised before
using the k-means algorithm. The analyst decides to use z-scores to standardise the
data.
(iii) (a) Perform k-means clustering using the kmeans() function with 3
clusters and setting a seed of 6 after appropriately standardising the data
using the scale() function.
(b) Output the unscaled cluster centres of the clusters calculated in part
(iii)(a).
(c) Compare the cluster centres calculated in part (iii)(b) to those
calculated in part (ii)(c).
(iv) Describe the clusters calculated in part (iii)(a) by considering the centres
in part (iii)(c).
(v) Plot the first five columns of the data set, colouring each point by its
allocated cluster from the k-means algorithm.
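A possible sketch for part (v), assuming the data are stored in a data frame movie.data and the k-means output in an object km (illustrative names):
pairs(movie.data[, 1:5], col = km$cluster,
      main = "Movie scores coloured by k-means cluster")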
The total within-groups sum of squares, summed across all clusters, is stored in the k-means
output as the item ‘tot.withinss’. This measures the within-cluster
homogeneity (ie how similar the points in each cluster are). This item may be
referenced using the following code:
kmeans(<data>, <centers>)$tot.withinss
The website wants to consider other numbers of clusters and wants to use the total
within-groups sum of squares metric to determine an appropriate number to use.
(vi) (a) Calculate the total within-groups sum of squares when performing
k-means clustering on standardised data with 1 to 10 clusters, setting a
seed of 6 before running each iteration and storing the results in a vector
called total.within.ss.
(b) Plot total.within.ss against the number of clusters using a
combined line and points graph.
(c) Suggest an appropriate number of clusters.
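A minimal sketch of parts (vi)(a) and (b), assuming the standardised data are stored in scaled.data (an illustrative name):
total.within.ss <- numeric(10)
for (k in 1:10){
  set.seed(6)
  total.within.ss[k] <- kmeans(scaled.data, centers = k)$tot.withinss
}
plot(1:10, total.within.ss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-groups SS")
An 'elbow' in this plot, where adding further clusters gives little extra reduction, would suggest an appropriate number of clusters for part (c).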
Question 3. A scientist is analysing a large number of items, which are known to be of one of
three types (1, 2 or 3). She has measured two factors for each one in order to help
identify which type each item belongs to. She has found that the values of the factors
for each of the types conform to the distributions shown in the table below. Here
U(a,b) indicates the discrete uniform distribution, which is equally likely to take any
of the values a, a+1,…,b.
(ii) Construct a data frame of 30 simulated items (10 of each type) with the
characteristics shown in the table, after setting a seed value of 19. The data
frame should contain three columns: the type of the object, the value of
FACTOR1 and the value of FACTOR2.
Hint: the sample(<vector>, n, replace =TRUE) function
generates a sample of size n using the values in <vector> with
replacement.
In this question ‘integer’ refers to whether x ∈ ℤ and not the integer data
type in R.
Hint: (sum(x != floor(x)) != 0) returns TRUE if any value in
<vector> is not an integer and FALSE otherwise.
(c) Write a function for each Type that calculates the conditional
probability of obtaining values for the factors F1 and F2 , given the
Type, ie 𝑃(𝐹|𝑇𝑦𝑝𝑒). You should assume that the factors operate
independently given the Type.
You are given that the prior probabilities of each type are P(Type 1) = 0.5 ,
P(Type 2) = 0.3 and P(Type3)=0.2. You are also given that: