Diamond Pricing for Data Analysts

Predicting Diamond Price Using a Linear Model. This article is part of an exploration of ways to develop an algorithm that predicts the price of a diamond from its characteristics. In subsequent articles, machine learning algorithms will be explored and the best algorithm for determining the diamond price will be identified. The diamonds dataset is taken from the ggplot2 library.

Predicting Diamond Price using Linear Model

Sarajit Poddar
26 July 2015

Contents

1 Executive Summary
1.1 Objective
1.2 About the data
2 Exploratory Analysis
2.1 Loading relevant libraries
2.2 Subsetting the dataset
2.3 Plotting the characteristics of dataset
3 Predicting the diamond price
3.1 Determining the Significant Predictors of Diamond price
3.2 Exploring the predictors using box plot
3.3 Generating the Model
3.4 Analysing the variance between multiple models
3.5 Analysing the Residuals
3.6 Predicting using the fitted model
3.7 Plotting the predicted data with actual data
4 Final conclusion

1 Executive Summary
1.1 Objective

Using the diamonds dataset from the ggplot2 library, we determine the predictors of diamond prices. We fit a linear model to predict the prices, compare the predicted prices with the actual prices, and observe the residuals. The accuracy of the fit is to be checked against other predictive models built with machine learning algorithms, which will be explored in subsequent articles.

1.2 About the data
1.2.1 Description

Prices of 50,000 round cut diamonds

Description: A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

1.2.2 Details

price. price in US dollars ($326-$18,823)
carat. weight of the diamond (0.2-5.01)
cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color. diamond colour, from J (worst) to D (best)
clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x. length in mm (0-10.74)
y. width in mm (0-58.9)
z. depth in mm (0-31.8)
depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
table. width of top of diamond relative to widest point (43-95)
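
The depth formula above can be checked by hand. The sketch below is an illustration rather than part of the original analysis: it recomputes depth for the first three diamonds, using the x, y, z and depth values that str(diamonds) prints in section 2.2. The data frame name `rows` and the columns `depth_calc` and `diff` are made up for this example.

```r
# First three rows of the diamonds data, as printed by str(diamonds)
rows <- data.frame(
  x     = c(3.95, 3.89, 4.05),
  y     = c(3.98, 3.84, 4.07),
  z     = c(2.43, 2.31, 2.31),
  depth = c(61.5, 59.8, 56.9)
)

# depth is a percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)
rows$depth_calc <- round(100 * 2 * rows$z / (rows$x + rows$y), 1)

# Small differences are expected because x, y and z are themselves rounded
rows$diff <- abs(rows$depth - rows$depth_calc)
print(rows)
```

The recomputed values agree with the recorded depth to within a few tenths of a percentage point.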

2 Exploratory Analysis
2.1 Loading relevant libraries

# Load required libraries
library(dplyr)
library(tidyr)
library(ggplot2)

2.2 Subsetting the dataset

The dataset is large, so we work with a smaller subset of it.

# Load the diamonds dataset
data(diamonds)
# Bin the continuous variables; as.numeric() keeps the bin index
# Cut price by intervals of 1000
diamonds$price2 <- as.numeric(cut(diamonds$price,
                                  seq(from = 0, to = 20000, by = 1000)))
# Cut carat by intervals of 0.1
diamonds$carat2 <- as.numeric(cut(diamonds$carat,
                                  seq(from = 0, to = 6, by = 0.1)))
# Summary of diamonds dataset
summary(diamonds)
##      carat               cut        color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z              price2           carat2
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 2.000
##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 1.000   1st Qu.: 4.000
##  Median : 5.710   Median : 3.530   Median : 3.000   Median : 7.000
##  Mean   : 5.735   Mean   : 3.539   Mean   : 4.398   Mean   : 8.468
##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 6.000   3rd Qu.:11.000
##  Max.   :58.900   Max.   :31.800   Max.   :19.000   Max.   :51.000
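
To make the binning step concrete, here is a small sketch (not from the original report) that applies the same cut() calls to a handful of carat and price values taken from the outputs in this section. The vector names `carat`, `price`, `carat_bin` and `price_bin` are made up for the illustration.

```r
# A few carat and price values that appear in the outputs of this section
carat <- c(0.23, 0.21, 0.29, 0.31)
price <- c(326, 2401, 5324, 18823)

# cut() assigns each value to an interval (lower, upper];
# as.numeric() then returns the position of that interval
carat_bin <- as.numeric(cut(carat, seq(from = 0, to = 6, by = 0.1)))
price_bin <- as.numeric(cut(price, seq(from = 0, to = 20000, by = 1000)))

carat_bin   # 0.23, 0.21 and 0.29 share the interval (0.2, 0.3]
price_bin   # 326 falls in (0, 1000], 18823 in (18000, 19000]
```

So price2 and carat2 are interval indices rather than prices or weights: a price2 of 3 means a price above 2000 and up to 3000.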

# Structure of the diamond dataset
str(diamonds)
## 'data.frame':    53940 obs. of  12 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ price2 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ carat2 : num  3 3 3 3 4 3 3 3 3 3 ...
# Let's say that the input price range is 1000 to 5000 and
# the number of observations is 5000
input.pricerange.low <- 1000
input.pricerange.high <- 5000
input.obs <- 5000
# Subset the data based on the price range
data.sample <- subset(diamonds,
                      price >= input.pricerange.low &
                      price <= input.pricerange.high)
# Draw a random sample from the subset
data.sample <- data.sample[sample(1:nrow(data.sample), input.obs,
                                  replace=FALSE),]
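
Because sample() draws different rows on every run, the fitted coefficients reported later can change between runs. The following sketch shows one way to make such a sample repeatable. It is an assumption-laden illustration, not the author's code: the data frame `toy`, the seed values and the names `in_range`, `n_wanted` and `n_take` are all hypothetical.

```r
# Hypothetical stand-in for the diamonds data: 200 rows with a price column
set.seed(42)                      # assumed seed, only to build the toy data
toy <- data.frame(id = 1:200, price = round(runif(200, 300, 19000)))

# Same kind of price filter as in the subsetting step
in_range <- subset(toy, price >= 1000 & price <= 5000)

# With replace = FALSE, never ask for more rows than the subset holds
n_wanted <- 50
n_take   <- min(n_wanted, nrow(in_range))

# Fixing the seed right before sample() makes the draw repeatable
set.seed(2015)
sample_a <- in_range[sample(seq_len(nrow(in_range)), n_take), ]
set.seed(2015)
sample_b <- in_range[sample(seq_len(nrow(in_range)), n_take), ]

identical(sample_a, sample_b)     # same seed, same rows
```

The min() guard matters because sample() stops with an error when asked for more rows than are available without replacement.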

2.3 Plotting the characteristics of dataset
2.3.1 Plotting using base graphics

#-------------------------------
# Plotting with Base graphics
#-------------------------------
x <- data.sample$price
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
               xlab="Diamond Price",
               main="Frequency Distribution of Diamond Price")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve, rescaled from density to counts
multiplier <- myhist$counts / myhist$density
mydensity <- density(x)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)
# Add legend
legend('topright', c("Mean", "Density Curve", "Normal Curve"),
       lty=c(1,1,1), lwd=c(2,2,2), col = c("darkgreen", "blue", "red"))
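
The rescaling used above relies on the fact that, for equal-width bins, counts divided by density equals the number of observations times the bin width. The sketch below checks this on simulated data. It is only an illustration: the simulated prices, the seed and the names `h`, `bin_width`, `scale_factor` and `d` are assumptions, not part of the original analysis.

```r
set.seed(1)                               # assumed seed for a repeatable toy sample
x <- rnorm(1000, mean = 3000, sd = 800)   # hypothetical prices

h <- hist(x, breaks = 10, plot = FALSE)
bin_width <- diff(h$breaks[1:2])

# counts / density equals n * bin_width in every non-empty bin
multiplier   <- h$counts / h$density
scale_factor <- length(x) * bin_width

all.equal(multiplier[h$counts > 0],
          rep(scale_factor, sum(h$counts > 0)))

# Scaling by n * bin_width does not depend on the first bin being non-empty
d <- density(x)
d$y <- d$y * scale_factor
```

Using `length(x) * bin_width` directly is a slightly safer variant of `multiplier[1]`, which would be NaN if the first bin happened to contain no observations.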

[Figure: Frequency Distribution of Diamond Price. Histogram of frequency against diamond price (1000 to 5000), with the mean line, the density curve and the normal curve.]
2.3.2 Plotting using ggplot

g <- ggplot(data.sample, aes(x=price))
g <- g + geom_histogram(aes(y = ..density..), fill="dark grey")
g <- g + geom_density(alpha=.3, fill="#FF6666")
g <- g + stat_function(fun = dnorm, colour = "red",
                       args = list(mean = mean(data.sample$price),
                                   sd=sd(data.sample$price)))
g <- g + xlab("Diamond price")
g <- g + ylab("Frequency")
g <- g + ggtitle("Frequency Distribution of Diamond Price")
g

[Figure: Frequency Distribution of Diamond Price. Density-scaled histogram of diamond price (1000 to 5000), with the density curve and the normal curve overlaid.]
2.3.3 Diamond price distribution with regards to Cut

g <- ggplot(data.sample)
# Using the cut to show the differences in the price due to the
# quality of the cut
# Note: from ggplot2 2.0.0 onwards geom_bar() counts each distinct x value;
# geom_histogram() gives the binned plot shown here
g <- g + geom_bar(aes(x=price, fill= cut))
g <- g + xlab("Price of Diamonds")
g <- g + ylab("Number of Diamonds")
g <- g + ggtitle("Prices of Sampled Diamonds")
g <- g + theme(legend.position="bottom")
g

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: Prices of Sampled Diamonds. Stacked histogram of the number of diamonds against price (1000 to 5000), filled by cut (Fair, Good, Very Good, Premium, Ideal).]
2.3.4 Regression line showing the impact of Carat on the price (Using lm)

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: price against carat, points coloured by clarity, with the fitted lm regression line.]
2.3.5 Regression line showing the impact of Carat on the price (Using Loess)

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity))
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: price against carat, points coloured by clarity, with the loess smoothing curve.]
2.3.6 Regression line faceted by Colour and Cut

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + facet_grid(color~cut)
g <- g + geom_smooth(method=lm, col="salmon", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: price against carat with lm regression lines, faceted by color (rows D to J) and cut (columns Fair to Ideal), points coloured by clarity.]
2.3.7 Correlation plot between all variables

library(corrplot)
# Convert all fields of the sampled diamonds dataset to numeric
diamonds.num <- data.sample
diamonds.num[, 1:12] <- sapply(diamonds.num[, 1:12], as.numeric)
# Remove price and carat and retain price2 and carat2
diamonds.num <- select(diamonds.num, cut:table, x:carat2)
M <- cor(diamonds.num)
corrplot.mixed(M)

[Figure: mixed correlation plot of cut, color, clarity, depth, table, x, y, z, price2 and carat2. Values readable in the plot include price2 with x (0.85), y (0.73), z (0.82) and carat2 (0.86), and carat2 with x (0.97), y (0.82) and z (0.93).]

From the correlation matrix it appears that the variables most highly correlated with price are carat, x, y and z. The other variables do not have much correlation with price. However, x, y and z are also highly correlated with carat. This can also mean that taking carat2 as the ...
2.3.8 Exploratory plot

# Loading required libraries
library(ggplot2)
library(GGally)
library(scales)
# Sampling the data for the plot generation
diasamp <- diamonds[sample(1:length(diamonds$price), 500),]
# Generating the plot
ggpairs(diasamp, params = c(shape = I('.'), outlier.shape = I('.')))

3 Predicting the diamond price
3.1 Determining the Significant Predictors of Diamond price

model.data <- subset(data.sample, select = -c(price2, carat2))
full.model <- lm(price ~ ., data = model.data)
reduced.model <- step(full.model, direction="backward", k=2, trace=0)
summary(reduced.model)
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
##     table + x + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2307.76  -186.11   -18.29   179.54  1564.51
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1757.964    436.220  -4.030 5.66e-05 ***
## carat        5946.169    114.468  51.946  < 2e-16 ***
## cut.L         289.409     18.773  15.416  < 2e-16 ***
## cut.Q        -118.602     14.754  -8.039 1.13e-15 ***
## cut.C         122.496     13.433   9.119  < 2e-16 ***
## cut^4          27.127     11.409   2.378   0.0175 *
## color.L      -889.472     16.658 -53.397  < 2e-16 ***
## color.Q      -205.282     14.891 -13.785  < 2e-16 ***
## color.C       -54.810     14.049  -3.901 9.69e-05 ***
## color^4        31.203     13.128   2.377   0.0175 *
## color^5        54.780     12.283   4.460 8.38e-06 ***
## color^6        43.795     11.144   3.930 8.62e-05 ***
## clarity.L    1938.848     28.227  68.689  < 2e-16 ***
## clarity.Q    -726.394     24.116 -30.121  < 2e-16 ***
## clarity.C     451.869     20.706  21.823  < 2e-16 ***
## clarity^4    -248.135     17.122 -14.492  < 2e-16 ***
## clarity^5      77.947     14.643   5.323 1.06e-07 ***
## clarity^6     -17.521     13.226  -1.325   0.1853
## clarity^7      18.544     11.973   1.549   0.1215
## depth         -10.051      4.490  -2.239   0.0252 *
## table          -4.957      2.624  -1.889   0.0589 .
## x              91.671     49.098   1.867   0.0619 .
## z              97.390     44.027   2.212   0.0270 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.1 on 4977 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.9267
## F-statistic:  2873 on 22 and 4977 DF,  p-value: < 2.2e-16

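In the output above, backward selection retains carat, cut, color, clarity, depth, table, x and z. Carat, cut, color and clarity have by far the strongest effect on the diamond price (p < 2e-16 for their leading terms), while depth, table, x and z are at most weakly significant. Because the data are sampled at random, the weaker predictors that are retained can vary between runs; the model carried forward in the following sections uses cut, color, clarity, table, y, z and carat.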
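
The backward elimination in this section can be tried on any data frame. The sketch below is an illustration that uses the mtcars dataset shipped with base R as a stand-in for the diamonds sample; the names `full` and `reduced` are made up here. With k = 2, step() penalises each coefficient as in the AIC, so a predictor is dropped only when dropping it lowers the AIC.

```r
# mtcars ships with base R and stands in for the diamonds sample here
full    <- lm(mpg ~ ., data = mtcars)

# Backward elimination: start from all predictors, drop one at a time
# for as long as the AIC (penalty k = 2 per coefficient) keeps decreasing
reduced <- step(full, direction = "backward", k = 2, trace = 0)

formula(reduced)                               # predictors that survived
c(full = AIC(full), reduced = AIC(reduced))    # the reduced model is never worse
```

Setting k = log(n), with n the number of observations, would apply the heavier BIC penalty instead and typically keeps fewer predictors.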

3.2 Exploring the predictors using box plot

#-------------------------------
# Exploring the predictors using box plot
#-------------------------------
# Exploring association of Cut with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = cut)) + xlab("Carat") +
  theme(legend.position="bottom")

[Figure: box plots of price (1000 to 5000) by carat bin, filled by cut.]

# Exploring association of Clarity with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = clarity)) + xlab("Carat") +
  theme(legend.position="bottom")

[Figure: box plots of price (1000 to 5000) by carat bin, filled by clarity.]

# Exploring association of Color with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = color)) + xlab("Carat") +
  theme(legend.position="bottom")

[Figure: box plots of price (1000 to 5000) by carat bin, filled by color.]
3.3 Generating the Model

# The starting model and the suggested model
simple.model <- lm(price ~ carat, data = model.data)
fitted.model <- lm(price ~ carat + cut + clarity + color + table + y + z,
                   data = model.data)

3.4 Analysing the variance between multiple models

# Summary of the simple model and fitted model
summary(simple.model)
##
## Call:
## lm(formula = price ~ carat, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3138.17  -307.81   -14.44   299.14  2393.19
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -728.13      24.62  -29.57   <2e-16 ***
## carat        4694.20      32.72  143.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 529.1 on 4998 degrees of freedom
## Multiple R-squared:  0.8047, Adjusted R-squared:  0.8046
## F-statistic: 2.059e+04 on 1 and 4998 DF,  p-value: < 2.2e-16

summary(fitted.model)
##
## Call:
## lm(formula = price ~ carat + cut + clarity + color + table +
##     y + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2340.21  -184.12   -17.18   178.61  1571.15
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2253.841    180.766 -12.468  < 2e-16 ***
## carat        6154.635     66.980  91.888  < 2e-16 ***
## cut.L         316.192     17.328  18.247  < 2e-16 ***
## cut.Q        -126.597     14.558  -8.696  < 2e-16 ***
## cut.C         123.818     13.368   9.262  < 2e-16 ***
## cut^4          25.591     11.421   2.241  0.02509 *
## clarity.L    1936.765     28.124  68.866  < 2e-16 ***
## clarity.Q    -733.049     24.012 -30.529  < 2e-16 ***
## clarity.C     454.915     20.721  21.954  < 2e-16 ***
## clarity^4    -247.327     17.145 -14.426  < 2e-16 ***
## clarity^5      79.403     14.658   5.417 6.34e-08 ***
## clarity^6     -18.140     13.241  -1.370  0.17075
## clarity^7      19.361     11.993   1.614  0.10653
## color.L      -890.439     16.679 -53.387  < 2e-16 ***
## color.Q      -205.497     14.892 -13.800  < 2e-16 ***
## color.C       -54.116     14.056  -3.850  0.00012 ***
## color^4        32.902     13.141   2.504  0.01232 *
## color^5        54.910     12.298   4.465 8.18e-06 ***
## color^6        44.704     11.158   4.007 6.25e-05 ***
## table          -1.865      2.462  -0.757  0.44892
## y              30.742     12.057   2.550  0.01081 *
## z              64.395     38.046   1.693  0.09060 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.6 on 4978 degrees of freedom
## Multiple R-squared:  0.9268, Adjusted R-squared:  0.9265
## F-statistic:  3001 on 21 and 4978 DF,  p-value: < 2.2e-16

# Conduct Analysis of Variance between the simple model and the best fitted model
anova(simple.model, fitted.model)
## Analysis of Variance Table
##
## Model 1: price ~ carat
## Model 2: price ~ carat + cut + clarity + color + table + y + z
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1   4998 1399334980
## 2   4978  524428896 20 874906084 415.24 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
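
The F statistic in the analysis of variance table can be reproduced by hand from the two residual sums of squares. The sketch below is a worked check using only the numbers printed in the table; the variable names are made up for the illustration.

```r
# Values copied from the analysis of variance table
rss_simple <- 1399334980; df_simple <- 4998
rss_fitted <-  524428896; df_fitted <- 4978

df_diff <- df_simple - df_fitted     # 20 additional coefficients
ss_diff <- rss_simple - rss_fitted   # reduction in the residual sum of squares

# F = (reduction in RSS per added coefficient) / (residual variance of the larger model)
f_stat  <- (ss_diff / df_diff) / (rss_fitted / df_fitted)
p_value <- pf(f_stat, df_diff, df_fitted, lower.tail = FALSE)

round(f_stat, 2)
p_value
```

The result agrees with the F value of 415.24 in the table, which is why the additional predictors are judged to improve the fit significantly over carat alone.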

3.5 Analysing the Residuals
3.5.1 Checking unidentified patterns in the Residuals

The graph shows that the gap between the actual and the predicted price widens as the price of the diamond increases. It is possible that a factor which raises the price in the higher price range is not captured in the model. Hence the variation in price cannot be adequately explained by the model with the available predictors.

x <- model.data$price
y <- resid(fitted.model)
# Note: x holds the actual price, so the x-axis is labelled accordingly;
# use fitted(fitted.model) for a conventional residuals-versus-fitted plot
ggplot(data.frame(x, y), aes(x,y)) +
  geom_hline(yintercept=0, size=1) +
  geom_point(size=3, colour="black", alpha = 0.1) +
  geom_point(size=2, colour="salmon", alpha = 0.2) +
  xlab("Actual price") +
  ylab("Residual") +
  geom_smooth(method="loess", colour="red", lwd=1)

[Figure: residuals plotted against price (1000 to 5000), with a loess smoothing curve.]
3.5.2 Density plot of residuals to check Normal Distribution

The graph shows that the residuals are approximately normally distributed.

x <- residuals(fitted.model)
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
               xlab="Residuals",
               main="Frequency Distribution of residuals")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve, rescaled from density to counts
multiplier <- myhist$counts / myhist$density
mydensity <- density(x)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)

[Figure: Frequency Distribution of residuals. Histogram of frequency against residuals, with the mean line, the density curve and the normal curve.]
3.6 Predicting using the fitted model

The formula for prediction is

Diamond price = -4253.844 + 4920.324 * carat + xx * cut + yy * clarity + zz * color + 1.462 * table + 376.099 * y + 275.481 * z

Note: The values of xx, yy and zz depend on the level of the respective variable (cut, clarity and color).

Coefficients:
(Intercept)      carat      cut.L      cut.Q      cut.C      cut^4  clarity.L  clarity.Q
  -4253.844   4920.324    261.542    -83.890     66.899     37.187   2011.021   -749.420
  clarity.C  clarity^4  clarity^5  clarity^6  clarity^7    color.L    color.Q    color.C
    477.879   -296.209     77.985    -30.944     27.387   -944.980   -226.105    -80.026
    color^4    color^5    color^6      table          y          z
     18.037     12.063     30.269      1.462    376.099    275.481
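
The terms written as xx, yy and zz arise because cut, clarity and color are ordered factors, which lm() codes by default with orthogonal polynomial contrasts (the .L, .Q, .C and ^4 terms). The sketch below shows, for cut only, how the four listed coefficients combine into one price contribution per cut level. It is an illustration under that assumption about the default contrasts; the names `cut_levels`, `contrasts_cut`, `coef_cut` and `cut_effect` are made up here.

```r
# cut is an ordered factor with 5 levels, so it is coded with
# orthogonal polynomial contrasts (.L, .Q, .C, ^4)
cut_levels    <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
contrasts_cut <- contr.poly(length(cut_levels))
rownames(contrasts_cut) <- cut_levels

# Coefficients for cut as listed in the coefficient table of this section
coef_cut <- c(.L = 261.542, .Q = -83.890, .C = 66.899, `^4` = 37.187)

# Contribution of each cut level to the predicted price (the "xx * cut" term)
cut_effect <- drop(contrasts_cut %*% coef_cut)
round(cut_effect, 1)
```

Because each contrast column sums to zero, the contributions are deviations around the intercept rather than differences from a baseline level.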
# Join the predicted and model data for comparison
pred.data <- model.data
pred.data <- select(pred.data, cut:z, carat)
pred <- predict(fitted.model, pred.data)
pred <- data.frame(model.data, pred)
# Round the predicted data
pred$pred <- round(pred$pred, 0)
# Determining RMSE (Root Mean Squared Error) to assess fit
model.rmse <- sqrt(mean(residuals(fitted.model)^2))
model.rmse
## [1] 323.8607
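
The RMSE differs slightly from the residual standard error of 324.6 reported by summary(fitted.model) because the two divide the same residual sum of squares by different quantities. The sketch below reproduces both from the numbers printed in section 3.4; the variable names are made up for the illustration.

```r
rss <- 524428896   # residual sum of squares of the fitted model (anova table)
n   <- 5000        # observations in the sample
p   <- 22          # estimated coefficients, intercept included

rmse <- sqrt(rss / n)          # what sqrt(mean(residuals^2)) computes
rse  <- sqrt(rss / (n - p))    # residual standard error from summary()

round(c(rmse = rmse, rse = rse), 1)
```

Dividing by n - p corrects for the coefficients estimated from the data, so the residual standard error is always a little larger than the RMSE.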

3.7 Plotting the predicted data with actual data

Here we see that the prediction is most accurate within the price range of USD 1000 to USD 5000. Outside this range the prediction is not accurate, so a different prediction model should perhaps be built for data that fall outside the range.
For the price range below USD 1000, the predicted price is lower than the actual price. Similarly, for the price range above USD 4500, the predicted price is higher than the actual price.

g <- ggplot(pred, aes(y = price, x = pred))
g <- g + geom_point(size=3, colour="black", alpha = 0.1)
g <- g + geom_point(size=2, colour="salmon", alpha = 0.2)
# g <- g + geom_point()
g <- g + ylab("Actual Price")
g <- g + xlab("Predicted Price")
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g

[Figure: actual price against predicted price (0 to 6000 on both axes), with a loess curve and an lm line.]

par(mfrow=c(2, 2))
plot(fitted.model)

[Figure: diagnostic plots of fitted.model in a 2 x 2 grid: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook's distance contours).]

The points in the Q-Q plot lie more or less on the line, indicating that the residuals are normally distributed.

4 Final conclusion

We have seen that a good predictive model can be developed using a linear model, provided that the variables (predictors) which significantly impact the outcome (price in this case) can be accurately identified.
We also observe that the prediction may work only within some boundary conditions. If the boundary conditions are accurately identified, then different models can be built for predicting data outside the range of the fitted model.
