CMM020 Data Visualisation and analysis
Ephraim
4/10/2022
DATA IMPORTING
library(tidyverse) # provides read_csv(), the %>% pipe, dplyr verbs and ggplot2
plankton <- read_csv("plankton.csv") # this file is in the current directory
# dimension of dataset
dim(plankton)
## [1] 754 12
• The plankton dataset has 754 observations and 12 columns.
TASKS
1. Use univariate statistics to analyse the plankton dataset. Do not use any plots.
# column names
colnames(plankton)
## [1] "Sample" "Pseudonitzschia.A.Sp" "Alexandrium.Sp"
## [4] "Robgordia.Sp" "Water.Temp" "Species"
## [7] "Region" "Site" "day"
## [10] "month" "year" "period"
# univariate statistics of the dataset
summary(plankton)
## Sample Pseudonitzschia.A.Sp Alexandrium.Sp Robgordia.Sp
## Min. : 1.0 Min. : 787.3 Min. : 0.00 Min. : 7.3
## 1st Qu.:189.2 1st Qu.: 1350.9 1st Qu.: 0.00 1st Qu.: 185.7
## Median :377.5 Median : 2180.2 Median : 0.00 Median : 276.6
## Mean :377.5 Mean : 3761.5 Mean : 145.96 Mean : 425.0
## 3rd Qu.:565.8 3rd Qu.: 4206.9 3rd Qu.: 40.04 3rd Qu.: 458.6
## Max. :754.0 Max. :35056.5 Max. :30530.50 Max. :3611.7
## Water.Temp Species Region Site
## Min. :-0.50 Length:754 Length:754 Length:754
## 1st Qu.: 9.70 Class :character Class :character Class :character
## Median :12.10 Mode :character Mode :character Mode :character
## Mean :12.17
## 3rd Qu.:14.90
## Max. :24.60
## day month year period
## Min. : 1.00 Min. : 3.000 Min. :2009 Length:754
## 1st Qu.: 9.00 1st Qu.: 6.000 1st Qu.:2011 Class :character
## Median :16.00 Median : 7.000 Median :2015 Mode :character
## Mean :16.35 Mean : 7.042 Mean :2015
## 3rd Qu.:23.00 3rd Qu.: 8.000 3rd Qu.:2020
## Max. :31.00 Max. :10.000 Max. :2021
2. Use a boxplot to show the distribution of Pseudonitzschia and a second one to show the
distribution of water temperature. Comment on the plots.
# boxplot of pseudonitzschia
boxplot(plankton$Pseudonitzschia.A.Sp, main = "Boxplot of Pseudonitzschia")
[Figure: Boxplot of Pseudonitzschia; y-axis from 0 to 30000]
• The median of Pseudonitzschia is 2180.25. The distribution is positively skewed and has many outliers above the upper whisker.
# boxplot of water temperature
boxplot(plankton$Water.Temp, main = "Boxplot of Water Temperature")
[Figure: Boxplot of Water Temperature; y-axis from 0 to 25]
• The median water temperature is 12.1. The distribution is approximately normal, with a few outliers.
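The comments on outliers can be backed by a count instead of a visual impression. The following is a minimal sketch, assuming the default 1.5 x IQR whisker rule that boxplot() applies; count_boxplot_outliers is a hypothetical helper name and is not part of the original analysis.

```r
# Hypothetical helper: count the points that boxplot() would draw as outliers.
# boxplot.stats() comes with base R (grDevices); coef = 1.5 is its default rule.
count_boxplot_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]                          # ignore missing values
  length(boxplot.stats(x, coef = coef)$out)  # $out holds the flagged points
}

# Usage (assumes `plankton` has been loaded as above):
# count_boxplot_outliers(plankton$Pseudonitzschia.A.Sp)
# count_boxplot_outliers(plankton$Water.Temp)
```

Comparing the two counts gives a numeric basis for saying that one attribute has "many" outliers and the other only "a few".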
3. Use univariate statistics to compare data for Pseudonitzschia in year 2021 with its data in
previous years (consider all previous years together). Comment on the results. Do not use
plots.
# univariate statistics for Pseudonitzschia for every year
# (the table printed below was produced with mean() in the Median column,
#  which is why its Median values repeat Average)
Q3_stats <- plankton %>%
  group_by(year) %>%
  summarise(Average = mean(Pseudonitzschia.A.Sp),
            Median = median(Pseudonitzschia.A.Sp),
            Variance = var(Pseudonitzschia.A.Sp),
            Standard_Devaition = sd(Pseudonitzschia.A.Sp),
            minimum = min(Pseudonitzschia.A.Sp),
            maximum = max(Pseudonitzschia.A.Sp))
Q3_stats
## # A tibble: 13 x 7
## year Average Median Variance Standard_Devaition minimum maximum
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2009 5974. 5974. 47111699. 6864. 1085. 34185
## 2 2010 5061. 5061. 54605746. 7390. 948. 35056.
## 3 2011 2650. 2650. 2633139. 1623. 848. 8385.
## 4 2012 4400. 4400. 18973507. 4356. 1045. 20253.
## 5 2013 4335. 4335. 15291753. 3910. 831. 18050.
## 6 2014 3364. 3364. 10006100. 3163. 787. 11891.
## 7 2015 2649. 2649. 4848512. 2202. 806. 12710.
## 8 2016 2260. 2260. 6352045. 2520. 791. 11290.
## 9 2017 3814. 3814. 16266831. 4033. 804. 23827.
## 10 2018 2992. 2992. 8974246. 2996. 984. 15297.
## 11 2019 1871. 1871. 1010392. 1005. 822. 5086.
## 12 2020 2525. 2525. 3215184. 1793. 849 11625.
## 13 2021 4884. 4884. 25958427. 5095. 816. 21710.
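• The lowest yearly average of Pseudonitzschia is 1871.175, in 2019. The highest is 5973.740, in 2009.
• In 2021 the average (4884) and the standard deviation (5095) are both the third highest of all years, behind only 2009 and 2010, so counts in 2021 were higher and more variable than in most previous years.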
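The task asks for 2021 to be compared with all previous years taken together, which the per-year table does not show directly. The following is a minimal sketch of that pooled comparison; compare_2021 and the year_group column are hypothetical names introduced here for illustration.

```r
library(dplyr)

# Hypothetical helper: summarise Pseudonitzschia for 2021 versus all earlier
# years pooled into one group. `year_group` is an illustrative column name.
compare_2021 <- function(df) {
  df %>%
    mutate(year_group = if_else(year == 2021, "2021", "before 2021")) %>%
    group_by(year_group) %>%
    summarise(n = n(),
              Average = mean(Pseudonitzschia.A.Sp),
              Median = median(Pseudonitzschia.A.Sp),
              Standard_Deviation = sd(Pseudonitzschia.A.Sp),
              minimum = min(Pseudonitzschia.A.Sp),
              maximum = max(Pseudonitzschia.A.Sp))
}

# Usage (assumes `plankton` has been loaded as above):
# compare_2021(plankton)
```

The result has two rows, one per group, so the 2021 statistics can be read directly against the pooled earlier years.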
4. Produce two histograms: one showing an attribute with a skewed distribution and one
showing an attribute with a normal distribution. Comment on the plots.
# making a skewed distribution histogram of Pseudonitzschia
plankton %>%
ggplot(aes(x = Pseudonitzschia.A.Sp)) +
geom_histogram() +
labs(title = "Histogram of Pseudonitzschia - Skewed Distribution",
x = "Pseudonitzschia") +
theme_bw()
[Figure: Histogram of Pseudonitzschia - Skewed Distribution; x-axis: Pseudonitzschia (0 to 30000), y-axis: count]
• The distribution of Pseudonitzschia is right-skewed.
# Making histogram of Water Temperature Normal Distribution
plankton %>%
ggplot(aes(x = Water.Temp)) +
geom_histogram() +
labs(title = "Histogram of Water Temperature - Normal Distribution",
x = "Water Temp") +
theme_bw()
[Figure: Histogram of Water Temperature - Normal Distribution; x-axis: Water Temp (0 to 25), y-axis: count]
• Water temperature is approximately normally distributed.
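The descriptions "skewed" and "normal" can be supported with a number as well as a histogram. The following is a minimal sketch using the moment-based skewness coefficient; sample_skewness is a hypothetical helper written in base R, not a function from the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (base R only).
# Values near 0 suggest a symmetric distribution; clearly positive values
# suggest a right (positive) skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Usage (assumes `plankton` has been loaded as above):
# sample_skewness(plankton$Pseudonitzschia.A.Sp)
# sample_skewness(plankton$Water.Temp)
```

A formal check such as shapiro.test() from the stats package could be used alongside this, bearing in mind that with several hundred observations it flags even small departures from normality.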
5. Produce a bar plot for Species. Comment on the plot.
# calculating species for each group and making barplot
plankton %>%
group_by(Species) %>%
summarise(Number_of_Species = n()) %>%
ggplot(aes(x = Species, y = Number_of_Species)) +
geom_bar(stat = "identity") +
labs(title = "Barplot of Species - Four Groups",
x = "Species", y = "Number of Species") +
theme_bw()
[Figure: Barplot of Species - Four Groups; x-axis: Species (Common cockles, Common mussels, Pacific oysters, Razors), y-axis: Number of Species]
• Common mussels have the most samples and razors the fewest. Each bar counts the samples recorded for that species.
6. Produce a pie chart for an attribute of your choice. Comment on the plot.
# pie chart of average water temperature by species
pie_chart <- plankton %>%
group_by(Species) %>%
summarise(Water_temp_average = mean(Water.Temp))
pie(pie_chart$Water_temp_average,
labels = c("Common cockles", "Common mussels", "Pacific oysters", "Razors"),
col = c("Brown", "Purple", "Black", "Blue"),
main = "Pie Chart for Average Water Temperature by Species")
[Figure: Pie Chart for Average Water Temperature by Species; slices: Common cockles, Common mussels, Pacific oysters, Razors]
• Common mussels have the highest average water temperature of all the species.
7. Use a plot to show values of Robgordia.Sp against values of Pseudonitzschia.A.Sp where the
species is either common mussels or pacific oysters. Use colour to show the Species. Comment
on the plot.
# keep only common mussels and Pacific oysters
Q7 <- plankton %>%
  filter(Species %in% c("Pacific oysters", "Common mussels"))
Q7 %>%
  ggplot(aes(x = Robgordia.Sp, y = Pseudonitzschia.A.Sp, color = Species)) +
  geom_point() +
  labs(title = "Relationship between Robgordia.Sp and Pseudonitzschia.A.Sp",
       x = "Robgordia.Sp", y = "Pseudonitzschia.Sp") +
  theme_classic()
[Figure: Relationship between Robgordia.Sp and Pseudonitzschia.A.Sp; x-axis: Robgordia.Sp (0 to 3000), y-axis: Pseudonitzschia.Sp (0 to 30000), colour: Species (Common mussels, Pacific oysters)]
• There is a strong positive linear relationship between Robgordia.Sp and Pseudonitzschia.Sp, with some outliers.
• There are more samples for common mussels than for Pacific oysters.
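When selecting rows where the species is one of two values, comparing with == against a two-element vector and testing membership with %in% behave differently. The following is a small sketch on a made-up vector (these are illustrative labels, not rows from the plankton data) showing why %in% is the safer choice for this kind of filter.

```r
# Illustrative toy vector (made-up labels, not rows from the plankton data).
species <- c("Common mussels", "Pacific oysters", "Common mussels", "Razors")
wanted  <- c("Pacific oysters", "Common mussels")

# `==` recycles `wanted` position by position: species[1] is compared with
# wanted[1], species[2] with wanted[2], species[3] with wanted[1], and so on.
# Matching rows are dropped whenever they fall on the "wrong" position.
species == wanted
# FALSE FALSE FALSE FALSE

# `%in%` tests membership, which is what "either species A or species B" means.
species %in% wanted
# TRUE TRUE TRUE FALSE
```

Inside filter(), the same difference decides whether all matching samples are kept or only those that happen to line up with the recycled vector.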
8. Use a plot to show Alexandrium.Sp in different regions by farming species. Use jitter if
needed. Comment on the plot.
plankton %>%
ggplot(aes(y = Alexandrium.Sp, x = Region, fill = Species)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Alexandrium.Sp in different regions for each Species") +
theme_classic()
[Figure: Alexandrium.Sp in different regions for each Species; horizontal stacked bars, y-axis: Region (AGB, CESLH, CESUB, FC, HCL, HCRC, HCS, HCSL, NAC, SAC, SIC), x-axis: Alexandrium.Sp (0 to 80000), fill: Species (Common cockles, Common mussels, Pacific oysters, Razors)]
• There are 11 regions and four species.
• The bar lengths show the summed Alexandrium.Sp values rather than the number of samples. The largest total is for common mussels in the SIC region.
• Common cockles and razors contribute the smallest totals.
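The task mentions jitter, which suits a plot where each sample is drawn as its own point. The following is a minimal sketch of that alternative, offered as one possible reading of the task and not as the plot used above; plot_alexandrium_jitter is a hypothetical helper name, and the jitter width and transparency are arbitrary choices.

```r
library(ggplot2)

# Hypothetical helper: one point per sample, jittered sideways within each
# region so that overlapping samples stay visible. height = 0 keeps the
# Alexandrium.Sp values themselves unchanged.
plot_alexandrium_jitter <- function(df) {
  ggplot(df, aes(x = Region, y = Alexandrium.Sp, colour = Species)) +
    geom_jitter(width = 0.2, height = 0, alpha = 0.6) +
    coord_flip() +
    labs(title = "Alexandrium.Sp in different regions for each Species") +
    theme_classic()
}

# Usage (assumes `plankton` has been loaded as above):
# plot_alexandrium_jitter(plankton)
```

Unlike stacked bars, this view shows individual values, so a single extreme sample can be told apart from a region with many moderate ones.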
9. Find a pair of plankton species which are correlated and a pair which are not correlated.
Do not use plots. Justify your answers.
# correlation between Pseudonitzschia.Sp and Alexandrium.Sp
cor(plankton$Pseudonitzschia.A.Sp, plankton$Alexandrium.Sp)
## [1] 0.06138349
# correlation between Pseudonitzschia.Sp and Robgordia.Sp
cor(plankton$Pseudonitzschia.A.Sp, plankton$Robgordia.Sp)
## [1] 0.975273
# correlation between Alexandrium.Sp and Robgordia.Sp
cor(plankton$Alexandrium.Sp, plankton$Robgordia.Sp)
## [1] 0.06482684
• Pseudonitzschia.Sp and Robgordia.Sp are strongly positively correlated (r = 0.975).
• Pseudonitzschia.Sp and Alexandrium.Sp are essentially uncorrelated (r = 0.061), as are Alexandrium.Sp and Robgordia.Sp (r = 0.065).
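The three pairwise correlations can also be obtained in one call as a matrix. The following is a minimal sketch, assuming Pearson correlation as in the calls above; plankton_cor_matrix is a hypothetical helper name.

```r
# Hypothetical helper: pairwise Pearson correlations for the three plankton
# count columns, rounded for easier reading.
plankton_cor_matrix <- function(df) {
  cols <- c("Pseudonitzschia.A.Sp", "Alexandrium.Sp", "Robgordia.Sp")
  round(cor(df[, cols]), 3)
}

# Usage (assumes `plankton` has been loaded as above):
# plankton_cor_matrix(plankton)
# cor.test() additionally tests the null hypothesis of zero correlation:
# cor.test(plankton$Pseudonitzschia.A.Sp, plankton$Alexandrium.Sp)
```

The matrix makes it easy to see at a glance which pair stands out and which pairs are close to zero.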
10. Produce a line plot which shows the water temperature (y axis) against the sample index
(x axis), for samples of common cockles and pacific oysters. Both lines (one per species) should
be shown in the same plot.
plankton %>%
  filter(Species %in% c("Common cockles", "Pacific oysters")) %>%
  ggplot(aes(x = Sample, y = Water.Temp, color = Species)) +
  geom_line() +
  labs(title = "Behaviour of Common Cockles and Pacific Oyster by Sample Index",
       x = "Sample Index", y = "Water Temp") +
  theme_classic()
[Figure: Behaviour of Common Cockles and Pacific Oyster by Sample Index; x-axis: Sample Index (0 to 600), y-axis: Water Temp, colour: Species (Common cockles, Pacific oysters)]
11. Produce a linear regression model of Pseudonitzschia.A.Sp on Robgordia.Sp for
Common mussels. Estimate the value of Pseudonitzschia.A.Sp for values of Robgordia.Sp of 1000, 2500
and 4000 cells per litre. Justify the appropriateness of the model and comment on any concerns you may
have about your predictions. Can you use this model to predict the value of Robgordia?
# Common mussels dataset for linear model
Q11 <- plankton %>%
filter(Species == "Common mussels")
# linear regression model Pseudonitzschia.Sp on Robgordia.Sp
model <- lm(Pseudonitzschia.A.Sp ~ Robgordia.Sp, data = Q11)
# equation of linear model
equatiomatic::extract_eq(model)
$$\operatorname{Pseudonitzschia.A.Sp} = \alpha + \beta_{1}(\operatorname{Robgordia.Sp}) + \epsilon \quad (1)$$
# summary of model
summary(model)
##
## Call:
## lm(formula = Pseudonitzschia.A.Sp ~ Robgordia.Sp, data = Q11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13805.7 -474.2 58.9 568.7 1892.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -240.63541 53.07440 -4.534 6.89e-06 ***
## Robgordia.Sp 9.53108 0.08256 115.440 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 989 on 654 degrees of freedom
## Multiple R-squared: 0.9532, Adjusted R-squared: 0.9531
## F-statistic: 1.333e+04 on 1 and 654 DF, p-value: < 2.2e-16
# model equation with parameters coefficients
equatiomatic::extract_eq(model, use_coefs = TRUE)
$$\widehat{\operatorname{Pseudonitzschia.A.Sp}} = -240.64 + 9.53(\operatorname{Robgordia.Sp}) \quad (2)$$
# Robgordia.Sp new values for estimation of Pseudonitzschia.Sp
newdata <- data.frame(Robgordia.Sp = c(1000, 2500, 4000))
newdata
## Robgordia.Sp
## 1 1000
## 2 2500
## 3 4000
# Estimating Pseudonitzschia.Sp values based on newdata
predict(model, newdata)
## 1 2 3
## 9290.448 23587.073 37883.699
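• The model fits the common mussels data well: the R squared value is 0.9532, so the model explains about 95% of the variation in the response variable.
• Robgordia.Sp is a significant predictor of Pseudonitzschia.Sp (p < 2e-16).
• The value 4000 is above the largest Robgordia.Sp value in the whole dataset (3611.7), so the estimate of 37883.699 is an extrapolation and should be treated with caution. The minimum residual of -13805.7 also points to at least one sample that the model fits badly.
• The model predicts Pseudonitzschia.Sp from Robgordia.Sp. It should not simply be inverted to predict Robgordia.Sp; that would need a separate regression of Robgordia.Sp on Pseudonitzschia.Sp.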
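Point predictions alone say nothing about their uncertainty or about whether a new value lies outside the data used to fit the model. The following is a minimal sketch that adds 95% prediction intervals and an extrapolation flag; predict_with_check is a hypothetical helper name, and the 95% level is an assumption, not a choice made in the original analysis.

```r
# Hypothetical helper: predictions with 95% prediction intervals, plus a flag
# for new values outside the range of the predictor used to fit the model.
predict_with_check <- function(model, newdata, predictor) {
  observed <- model$model[[predictor]]  # predictor values used in the fit
  out <- cbind(newdata,
               predict(model, newdata, interval = "prediction", level = 0.95))
  out$extrapolated <- newdata[[predictor]] < min(observed) |
                      newdata[[predictor]] > max(observed)
  out
}

# Usage (assumes `model` and `newdata` exist as above):
# predict_with_check(model, newdata, "Robgordia.Sp")
```

The returned data frame has the columns fit, lwr and upr from predict(), so a wide interval or a TRUE in the extrapolated column signals a prediction to treat with caution.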
12. Create a data frame with three columns: month, year, and the mean of the water tem-
peratures observed in the plankton dataset during that month-year period. Check whether
the mean temperature is 12 degrees at 99% confidence. Without conducting another test, can
you say whether the mean temperature is 12.5 at 95% confidence?
# creating new dataset of three columns
Q12 <- plankton %>%
group_by(year, month) %>%
summarise(Water_Temp_Average_by_month_year = mean(Water.Temp))
# first six rows of newly created Q12 dataset
head(Q12)
## # A tibble: 6 x 3
## # Groups: year [1]
## year month Water_Temp_Average_by_month_year
## <dbl> <dbl> <dbl>
## 1 2009 4 14.5
## 2 2009 5 16.6
## 3 2009 6 11.8
## 4 2009 7 12.9
## 5 2009 8 11.8
## 6 2009 9 11.6
# Check whether the mean temperature is 12 degrees at 99% confidence
t.test(Q12$Water_Temp_Average_by_month_year, conf.level = 0.99)
##
## One Sample t-test
##
## data: Q12$Water_Temp_Average_by_month_year
## t = 47.965, df = 81, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 11.46471 12.79912
## sample estimates:
## mean of x
## 12.13191
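• Yes. The 99% confidence interval [11.46471, 12.79912] contains 12, so we cannot reject a mean temperature of 12 degrees at 99% confidence. The t statistic and p-value printed above test a mean of 0, the t.test() default, and are not relevant to this question.
• Yes. From the output, the standard error is 12.13191 / 47.965 ≈ 0.253, which gives a 95% confidence interval of [11.62866, 12.63517]. This contains 12.5, so we cannot reject a mean temperature of 12.5 at 95% confidence.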
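The hypothesised mean can also be passed to t.test() directly through its mu argument, and the 95% interval can be rebuilt from figures already printed. The following is a minimal sketch under those assumptions; test_mean is a hypothetical helper name, and the numbers in the second part are copied from the output above.

```r
# Hypothetical helper: two-sided one-sample t-test of H0: mean = mu, returning
# the p-value and the confidence interval at the chosen level.
test_mean <- function(x, mu, conf.level) {
  res <- t.test(x, mu = mu, conf.level = conf.level)
  list(p.value = res$p.value, conf.int = as.numeric(res$conf.int))
}

# Usage (assumes `Q12` exists as above):
# test_mean(Q12$Water_Temp_Average_by_month_year, mu = 12, conf.level = 0.99)

# Rebuilding the 95% interval from the printed output, without a new test.
# The printed t was computed against 0, so standard error = mean / t.
se <- 12.13191 / 47.965
ci95 <- 12.13191 + c(-1, 1) * qt(0.975, df = 81) * se
ci95
```

If a hypothesised value lies inside the interval for a given confidence level, the corresponding two-sided test does not reject it at that level.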
13. You suspect that the water temperature is affected by the period of the year. Use the
plankton dataset to produce a test to check this suspicion, clearly highlighting the NULL
hypothesis and the alternative hypothesis and justifying whether the suspicion is supported
by the data or not. Justify your answer.
# Null Hypothesis: Water Temperature is not affected by period of year (the mean is the same in every period)
# Alternative Hypothesis: Water Temperature is affected by period of year
model_13 <- lm(Water.Temp ~ period, data = plankton)
# applying test on model to check the suspicion
anova(model_13)
## Analysis of Variance Table
##
## Response: Water.Temp
## Df Sum Sq Mean Sq F value Pr(>F)
## period 1 0.9 0.9353 0.055 0.8147
## Residuals 752 12798.3 17.0190
• The ANOVA gives F = 0.055 with a p-value of 0.8147, well above 0.05, so we fail to reject the NULL hypothesis: there is no evidence that water temperature differs between periods.
• The suspicion is not supported by the plankton dataset.
14. NO CODE NEEDED FOR THIS TASK. You think that an ANOVA test may be useful
to determine whether different species are farmed in different water temperatures. Comment
on any concerns that you may have if applying this test to the plankton dataset.
• Yes, an ANOVA test could be useful for checking whether different species are farmed in different water temperatures.
• The concerns relate to its assumptions: water temperature should be approximately normal within each species, the observations should be independent of each other, and the variance should be approximately equal across species. The groups are also very unbalanced (656 of the 754 samples are common mussels), which makes the test more sensitive to unequal variances.
15. Using the plankton dataset (or one derived from it), select some data that has not been
visualised already, and produce TWO alternative plots of that data which you have not pre-
sented earlier. One of the plots should be very informative and effective while the other one
should have deliberate deficiencies. Compare and contrast the 2 plots, clearly highlighting
their merits and drawbacks.
# Informative plot
# Average Water Temp across all years
# Deriving dataset
Q15 <- plankton %>%
  group_by(year) %>%
  summarize(Avg_Water_Temp_across_year = mean(Water.Temp))
# Making Plot
ggplot(Q15, aes(x = year, y = Avg_Water_Temp_across_year)) +
  geom_bar(stat = "identity", fill = "brown") +
  coord_flip() +
  labs(title = "Average Water Temperature across years [2009-2021]",
       y = "Average Water Temperature", x = "Year") +
  theme_classic()
[Figure: Average Water Temperature across years [2009-2021]; horizontal bars, y-axis: Year, x-axis: Average Water Temperature]
# Plot with some deficiencies
plotrix::pie3D(Q15$Avg_Water_Temp_across_year, labels = Q15$year,
main = "Pie Chart of Average Temperature across years")
[Figure: Pie Chart of Average Temperature across years; 3D pie with one slice per year, 2009 to 2021]
We made two plots of the average water temperature for each year in the dataset: one informative and one with deliberate deficiencies.
Plot: 01 - Bar Chart of average temperature
• The plot is clear, and the pattern in the derived dataset is easy to read.
• The highest average water temperature is in 2010 and the lowest is in 2013.
• The plot has a descriptive title and correctly labelled axes.
Plot: 02 - Pie Chart of average temperature
• The pie chart has several deficiencies: the title is uninformative, no value or percentage is shown for any year, the colours are poorly chosen, and the 3D perspective distorts the slice sizes.
• The pattern is hard to read because a pie chart is unsuitable for comparing more than three or four values.
• The drawbacks of this pie chart outweigh its merits.
Comparison
• Plot 1, the bar chart, is more suitable than the pie chart for this dataset derived from plankton.