Automobile Price Data

The document outlines an analysis of automobile price data, focusing on error detection and data cleaning methods, including imputation techniques for missing values. It examines numerical and categorical features that influence car prices, revealing positive correlations with attributes like engine size and curb weight, while also analyzing price variations across different categories. The final dataset is to be saved in .csv format after encoding non-numeric columns and cleaning the data.

Automobile Price Data (Analysis and Predictive Models)

Submitted By: Debanjan Chowdhury


Objective
• Check for errors in the dataset
• Replace them using a suitable approach and explain why that method or technique was chosen
• Produce a final report on the dataset before and after cleaning
• The final dataset must contain only numeric columns: encode the remaining columns and save the result in .csv format
• Find the factors that affect the price
Features in the dataset
Numerical Features
Feature Name        Meaning
normalized-losses   relative average loss payment per insured vehicle year
wheel-base          horizontal distance between the centers of the front and rear wheels
length              length of the car
width               width of the car
height              height of the car
curb-weight         total mass of the car with standard equipment and all necessary operating consumables, while not loaded with either passengers or cargo
engine-size         size of the engine
bore                diameter of each cylinder
stroke              distance travelled by the piston in one stroke
compression-ratio   ratio of maximum to minimum volume in the cylinder of an IC engine
horsepower          power produced by the engine
peak-rpm            engine speed (revolutions per minute) at which peak power is produced
city-mpg            mileage in the city (miles per gallon)
highway-mpg         mileage on the highway (miles per gallon)
price               price of the car (response variable)
Features in the dataset
Categorical Features
Feature Name        Meaning
symboling           insurance risk level of a car
make                manufacturing company
fuel-type           type of fuel used by the car
aspiration          type of aspiration of the car
num-of-doors        number of doors in the car
body-style          style of the body of the car
drive-wheels        drive wheels of the car
engine-location     location of engine in the car
engine-type         type of engine in the car
num-of-cylinders    number of cylinders in the car
fuel-system         fuel system of the car
Errors in the dataset
There are two kinds of errors in the dataset:

1. Certain columns have '?' in place of values. These columns are:

a. normalized-losses
b. num-of-doors
c. bore
d. stroke
e. peak-rpm
f. price
g. horsepower

Therefore we need to use imputation techniques to fill in the missing values.
Errors in the dataset
2. Certain columns have object as their datatype although they hold numeric values. These columns, with the datatype they should have, are:
a. normalized-losses (float64)
b. stroke (float64)
c. horsepower (float64)
d. peak-rpm (float64)
e. price (float64)
f. bore (float64)

Therefore we first need to convert these columns to their respective datatypes.
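The two fixes above can be sketched together in pandas. This is only an illustration: the three-row inline CSV is a made-up stand-in for the real file, and the slides do not show the loading code that was actually used.

```python
import io
import pandas as pd

# Tiny made-up stand-in for the raw file (the real dataset has many more rows and columns).
raw = io.StringIO(
    "normalized-losses,num-of-doors,bore,stroke,horsepower,peak-rpm,price\n"
    "?,two,3.47,2.68,111,5000,13495\n"
    "164,four,3.19,3.40,102,5500,13950\n"
    "?,?,?,?,?,?,?\n"
)

# Treat '?' as missing so that pandas does not fall back to the object dtype.
df = pd.read_csv(raw, na_values="?")

# Columns that should be float64 according to the list above.
numeric_cols = ["normalized-losses", "bore", "stroke",
                "horsepower", "peak-rpm", "price"]
for col in numeric_cols:
    # errors="coerce" turns any remaining non-numeric token into NaN.
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("float64")

print(df.dtypes)        # datatype of every column after conversion
print(df.isna().sum())  # number of missing values per column
```

Declaring '?' as a missing-value marker at load time means the count of missing values per column can be reported before any imputation is done.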


What did I do with the columns with '?'
First I checked the numerical columns ('normalized-losses', 'bore', 'stroke', 'peak-rpm', 'price', 'horsepower') for outliers by plotting a boxplot for each column. For the columns that had outliers, I filled the missing values with the median, because the median is not pulled towards extreme values. For the columns that had none, I filled the missing values with the mean.

For the categorical column ('num-of-doors'), I filled the missing values with the mode of that column.
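A minimal sketch of this imputation rule follows. It assumes that "outlier" means a point beyond 1.5 times the interquartile range from the quartiles, which is the usual boxplot whisker rule; the slides only say the boxplots were inspected. The helper name has_outliers and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

def has_outliers(s: pd.Series) -> bool:
    """Boxplot rule (assumed): any value beyond 1.5 * IQR from the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return bool(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).any())

# Made-up sample: 'price' has one extreme value, 'bore' has none.
df = pd.DataFrame({
    "price": [5000.0, 6000.0, 7000.0, 8000.0, 45000.0, np.nan],
    "bore": [3.0, 3.2, 3.4, 3.6, 3.8, np.nan],
    "num-of-doors": ["four", "four", "two", "four", None, "two"],
})

for col in ["price", "bore"]:
    # Median when outliers are present, mean otherwise.
    fill = df[col].median() if has_outliers(df[col]) else df[col].mean()
    df[col] = df[col].fillna(fill)

# Categorical column: fill with the most frequent value.
df["num-of-doors"] = df["num-of-doors"].fillna(df["num-of-doors"].mode()[0])

print(df)
```

In this sample the missing price is filled with the median (7000) because 45000 lies outside the whiskers, while the missing bore is filled with the mean.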
Overall Data Analysis (Numerical Features)

wheel-base

We can see that as the wheel-base value increases, the price of the car increases. It means wheel-base has a positive correlation with the price (0.584847).
Overall Data Analysis (Numerical Features)

length

We can see that as the length value increases, the price of the car increases. It means length has a positive correlation with the price (0.686567).
Overall Data Analysis (Numerical Features)

width

We can see that as the width value increases, the price of the car increases. It means width has a positive correlation with the price (0.724558).
Overall Data Analysis (Numerical Features)

height

We can see that as the height value increases, the price of the car increases. It means height has a positive (although weak) correlation with the price (0.140439).
Overall Data Analysis (Numerical Features)

curb-weight

We can see that as the curb-weight value increases, the price of the car increases. It means curb-weight has a positive correlation with the price (0.819817).
Overall Data Analysis (Numerical Features)

engine-size

We can see that as the engine-size value increases, the price of the car increases. It means engine-size has a positive correlation with the price (0.860343).
Overall Data Analysis (Numerical Features)

bore

We can see that as the bore value increases, the price of the car increases. It means bore has a positive correlation with the price (0.532865).
Overall Data Analysis (Numerical Features)

horsepower

We can see that as the horsepower value increases, the price of the car increases. It means horsepower has a positive correlation with the price (0.749919).
Overall Data Analysis (Numerical Features)

city-mpg

We can see that as the city-mpg value increases, the price of the car decreases. It means city-mpg has a negative correlation with the price (-0.668822).
Overall Data Analysis (Numerical Features)

highway-mpg

We can see that as the highway-mpg value increases, the price of the car decreases. It means highway-mpg has a negative correlation with the price (-0.693037).
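Correlation values like the ones quoted on these slides can be obtained from the correlation matrix of the numeric columns. The sketch below assumes Pearson correlation (the pandas default); the slides do not name the coefficient, and the five rows are made-up numbers, not the real data.

```python
import pandas as pd

# Made-up sample with one positively and one negatively related feature.
df = pd.DataFrame({
    "engine-size": [90, 110, 130, 150, 180],
    "city-mpg": [35, 30, 26, 22, 18],
    "price": [6000, 9000, 13000, 17000, 25000],
})

# Pearson correlation of every numeric column with price, strongest positive first.
corr_with_price = (
    df.select_dtypes("number")
      .corr()["price"]
      .drop("price")
      .sort_values(ascending=False)
)
print(corr_with_price)
```

Sorting the resulting series gives a quick ranking of the features, with the negatively correlated mileage columns at the bottom.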
Overall Data Analysis (Categorical Features)

fuel-type
fuel-type price
diesel 15838.15
gas 12859.73

We can see that the mean price of diesel cars is higher than that of gas cars.
Overall Data Analysis (Categorical Features)

aspiration
aspiration price
turbo 16093.73
std 12502.05

We can see that the mean price of turbo cars is higher than that of std cars.
Overall Data Analysis (Categorical Features)

num-of-doors
num-of-doors price
four 13470.42
two 12733.08

We can see that the mean price of four-door cars is higher than that of two-door cars.
Overall Data Analysis (Categorical Features)

body-style
body-style price
hardtop 22208.5
convertible 21890.5
sedan 14372.99
wagon 12371.96
hatchback 9967.09

We can see that the mean price is highest for hardtop cars and lowest for hatchback cars.
Overall Data Analysis (Categorical Features)

drive-wheels
drive-wheels price
rwd 19633.11
4wd 10247
fwd 9262.28

We can see that the mean price is highest for rwd cars and lowest for fwd cars.
Overall Data Analysis (Categorical Features)

engine-location
engine-location price
rear 34528
front 12832.82

We can see that the mean price of rear-engine cars is higher than that of front-engine cars.
Overall Data Analysis (Categorical Features)

engine-type
engine-type price
ohcv 25098.38
dohc 18116.42
l 14627.58
ohcf 13738.6
rotor 13020
ohc 11541.57
dohcv 10295

We can see that the mean price is highest for cars with the ohcv engine type and lowest for cars with the dohcv engine type.
Overall Data Analysis (Categorical Features)

num-of-cylinders
num-of-cylinders price
twelve 36000
eight 33179
six 23671.83
five 20942.82
two 13020
four 10303.1
three 5151

We can see that the mean price is highest for cars with twelve cylinders and lowest for cars with three cylinders.
Overall Data Analysis (Categorical Features)

fuel-system
fuel-system price
mpfi 17449.61
idi 15838.15
mfi 12964
4bbl 12145
spfi 11048
spdi 10990.44
1bbl 7555.55
2bbl 7519.92

We can see that the mean price is highest for cars with the mpfi fuel system and lowest for cars with the 2bbl fuel system.
Overall Data Analysis (Categorical Features)

make
make price
jaguar 34600
mercedes-benz 33647
porsche 27179.4
bmw 26118.75
volvo 18063.18
audi 16778.57
mercury 16503
alfa-romero 15498.33
peugot 15489.1
saab 15223.33
mazda 10652.88
nissan 10415.67
volkswagen 10077.5
toyota 9885.81
isuzu 9605.75
renault 9595
mitsubishi 9239.77
subaru 8541.25
honda 8184.69
plymouth 7963.43
dodge 7875.44
chevrolet 6007

We can see that chevrolet has the lowest mean price while jaguar has the highest mean price.
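The per-category tables in this part of the analysis are mean prices grouped by one categorical column and sorted in descending order. A minimal sketch with made-up rows follows; the helper name mean_price_by is hypothetical.

```python
import pandas as pd

# Made-up sample rows, not the real data.
df = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel", "diesel", "gas"],
    "price": [8000.0, 12000.0, 15000.0, 17000.0, 10000.0],
})

def mean_price_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Mean price per category of `column`, highest first."""
    return (frame.groupby(column)["price"]
                 .mean()
                 .round(2)
                 .sort_values(ascending=False))

by_fuel = mean_price_by(df, "fuel-type")
print(by_fuel)
```

The same helper can be called once per categorical column (make, body-style, engine-type and so on) to produce each table.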
Further Analysis...
• For further analysis, I grouped the entire dataset into three groups based on the price:
• Low price (< 10,000)
• Medium price (>= 10,000 and < 20,000)
• High price (>= 20,000)
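One way to form these three groups is pandas' cut with left-closed bins, so that a price of exactly 10,000 falls in the medium group and exactly 20,000 in the high group. This is a sketch; the slides do not show how the grouping was coded, and the sample prices are made up to sit on the boundaries.

```python
import numpy as np
import pandas as pd

# Made-up prices chosen to sit on and around the group boundaries.
prices = pd.Series([5151.0, 9999.0, 10000.0, 19999.0, 20000.0, 36000.0],
                   name="price")

# right=False makes each bin closed on the left: [low, high).
groups = pd.cut(
    prices,
    bins=[-np.inf, 10_000, 20_000, np.inf],
    labels=["low", "medium", "high"],
    right=False,
)
print(pd.concat([prices, groups.rename("price_group")], axis=1))
```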
Comparison of different groups (Numerical Features)

Feature        High price     Medium price    Low price
wheel-base     101.2-106.7    96.0-104.3      93.7-96.5
length         180.3-197      173.5-186.7     158.1-172
width          66.9-70.6      65.7-67.8       63.8-65.4
height         52.8-55.9      52-56.1         51.6-54.5
curb-weight    3012-3740      2501.75-3052    1974.5-2303.5
engine-size    164-234        120-152         92.0-109.75
bore           3.46-3.74      3.27-3.62       3.03-3.31
horsepower     123-182        95-145          68.0-88.0
city-mpg       16-21          19-24           26.0-31.0
highway-mpg    19-25          24-30           32.0-38.0
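The slides do not say how these ranges were computed. Values such as 2501.75 and 109.75 look like quartiles, so one possible reading is that each range runs from the first to the third quartile within the group. The sketch below works under that assumption only, on made-up numbers.

```python
import pandas as pd

# Made-up sample: four cars per price group.
df = pd.DataFrame({
    "price_group": ["low"] * 4 + ["high"] * 4,
    "curb-weight": [1900, 2000, 2200, 2400, 3000, 3200, 3500, 3800],
})

# First and third quartile of the feature within each price group (assumed definition).
q = (df.groupby("price_group")["curb-weight"]
       .quantile([0.25, 0.75])
       .unstack())
print(q)
```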
Comparison of different groups (Categorical Features)

          fuel-type (mean price)    fuel-type (share percent)
Groups    diesel      gas           diesel    gas
high      27209.2     31429.1       20%       80%
medium    14740.78    14301.92      11%       89%
low       8008.33     7678.57       20%       80%
• For high price data:
Gas-fuelled cars have the higher mean price (31429.1) and diesel-fuelled cars the lower (27209.2).
Gas-fuelled cars are more common (80%) than diesel-fuelled cars (20%).
• For medium price data:
Diesel-fuelled cars have the higher mean price (14740.78) and gas-fuelled cars the lower (14301.92).
Gas-fuelled cars are more common (89%) than diesel-fuelled cars (11%).
• For low price data:
Diesel-fuelled cars have a higher mean price (8008.33) than gas-fuelled cars (7678.57).
Gas-fuelled cars are more common (80%) than diesel-fuelled cars (20%).
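Each of these group comparisons combines two numbers per cell: the mean price of a category within a price group, and the share of that category within the group. A sketch of both computations on made-up rows follows; the price_group column is assumed to already exist.

```python
import pandas as pd

# Made-up sample rows with an existing price_group column.
df = pd.DataFrame({
    "price_group": ["high"] * 4 + ["low"] * 5,
    "fuel-type": ["gas", "gas", "gas", "diesel",
                  "gas", "gas", "gas", "gas", "diesel"],
    "price": [30000.0, 32000.0, 34000.0, 27000.0,
              7000.0, 7500.0, 8000.0, 8500.0, 8200.0],
})

# Mean price per (price group, category) pair, one column per category.
mean_price = df.groupby(["price_group", "fuel-type"])["price"].mean().unstack()

# Share of each category within its price group, in percent (rows sum to 100).
share = (pd.crosstab(df["price_group"], df["fuel-type"], normalize="index")
           .mul(100)
           .round(1))

print(mean_price)
print(share)
```

A category that does not occur in a group shows up as NaN in the mean-price table, which corresponds to the blank cells in the comparison tables.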
Comparison of different groups (Categorical Features)

          aspiration (mean price)    aspiration (share percent)
Groups    turbo       std            turbo     std
high      26078       32337          28%       72%
medium    15305.57    13977.61       28%       72%
low       8699.14     7621.80        7.1%      92.9%
• For high price data:
std cars have the higher mean price (32337) and turbo cars the lower (26078).
std cars are more common (72%) than turbo cars (28%).
• For medium price data:
turbo cars have the higher mean price (15305.57) and std cars the lower (13977.61).
std cars are more common (72%) than turbo cars (28%).
• For low price data:
turbo cars have a higher mean price (8699.14) than std cars (7621.80).
std cars are more common (92.9%) than turbo cars (7.1%).
Comparison of different groups (Categorical Features)

          num-of-doors (mean price)    num-of-doors (share percent)
Groups    two         four             two       four
high      33251.9     28807.27         40%       60%
medium    14007.61    14580.73         40.2%     59.8%
low       7358.13     8000.08          46.9%     53.1%
• For high price data:
Two-door cars have the higher mean price (33251.9) and four-door cars the lower (28807.27).
Four-door cars are more common (60%) than two-door cars (40%).
• For medium price data:
Four-door cars have the higher mean price (14580.73) and two-door cars the lower (14007.61).
Four-door cars are more common (59.8%) than two-door cars (40.2%).
• For low price data:
Four-door cars have a higher mean price (8000.08) than two-door cars (7358.13).
Four-door cars are more common (53.1%) than two-door cars (46.9%).
Comparison of different groups (Categorical Features)

body-style (mean price)
Groups    convertible    wagon       sedan       hatchback    hardtop
high      36042          28248       29538       22018        35033
medium    14814.75       14784.69    14528.47    13921.77     11199
low       ____           8077.27     7940.95     7295.63      8779

body-style (share percent)
Groups    convertible    wagon       sedan       hatchback    hardtop
high      8%             4%          68%         4%           16%
medium    4.9%           15.9%       46.3%       31.7%        1.2%
low       ____           11.2%       41.8%       43.9%        3%
Comparison of different groups (Categorical Features)
• For high price data:
Convertible cars have the highest mean price (36042) and hatchback cars the lowest (22018).
Sedan cars are the most common (68%); wagon and hatchback cars are the least common (4% each).
• For medium price data:
Convertible cars have the highest mean price (14814.75) and hardtop cars the lowest (11199).
Sedan cars are the most common (46.3%) and hardtop cars the least common (1.2%).
• For low price data:
Hardtop cars have the highest mean price (8779) and hatchback cars the lowest (7295.63).
Hatchback cars are the most common (43.9%) and hardtop cars the least common (3%).
Comparison of different groups (Categorical Features)

          drive-wheels (mean price)           drive-wheels (share percent)
Groups    rwd         fwd         4wd         rwd       fwd       4wd
high      30864.71    23875       ____        96%       4%        ____
medium    15485.66    13077.65    12674.5     53.7%     41.5%     4.9%
low       8749.25     7564.22     8305        8.2%      86.7%     5.1%
• For high price data:
rwd cars have the higher mean price (30864.71) and fwd cars the lower (23875).
rwd cars are more common (96%) than fwd cars (4%).
• For medium price data:
rwd cars have the highest mean price (15485.66) and 4wd cars the lowest (12674.5).
rwd cars are the most common (53.7%) and 4wd cars the least common (4.9%).
• For low price data:
rwd cars have the highest mean price (8749.25) and fwd cars the lowest (7564.22).
fwd cars are the most common (86.7%) and 4wd cars the least common (5.1%).
Comparison of different groups (Categorical Features)

          engine-location (mean price)    engine-location (share percent)
Groups    front       rear                front     rear
high      30047.45    34528               88%       12%
medium    14350.1     ____                100%      ____
low       7698.76     ____                100%      ____


Comparison of different groups (Categorical Features)

engine-type (mean price)
Groups    dohc        ohcv        l          ohc         rotor    ohcf        dohcv
high      33900       35514.17    ____       27154.21    ____     34582       ____
medium    16345.13    16170.57    15489.1    13892.63    13020    105?0.33    10295
low       9418        ____        5151       7687.76     ____     7704.89     ____

engine-type (share percent)
Groups    dohc        ohcv        l          ohc         rotor    ohcf        dohcv
high      8%          24%         ____       56%         ____     12%         ____
medium    9.8%        8.5%        13.4%      58.5%       4.9%     3.7%        1.2%
low       2%          ____        1%         87.8%       ____     9.2%        ____
• For high price data:
Cars with ohcv engines have the highest mean price (35514.17) and cars with ohc engines the lowest (27154.21).
Cars with ohc engines are the most common (56%) and cars with dohc engines the least common (8%).
• For medium price data:
Cars with dohc engines have a high mean price (16345.13) whereas cars with dohcv engines have a low mean price (10295).
Cars with ohc engines are the most common (58.5%) and cars with dohcv engines the least common (1.2%).
• For low price data:
Cars with dohc engines have the highest mean price (9418) and cars with l engines the lowest (5151).
Cars with ohc engines are the most common (87.8%) and cars with l engines the least common (1%).
Encoding
I found that, for each of the following columns, the category values are related to one another with respect to the price column, so I label encoded them. The other categorical columns do not show this kind of relationship, so I one-hot encoded them.

The columns which are label encoded are :


fuel-type
aspiration
num-of-doors
drive-wheels
engine-location
num-of-cylinders

The columns which are one hot encoded are:


make
body-style
engine-type
fuel-system
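The two encodings can be sketched in pandas as follows. This is an illustration under assumptions: the slides do not say which library or which code-to-category mapping was used, so the label encoding here simply numbers the categories in alphabetical order, and only three of the listed columns appear in the made-up sample.

```python
import pandas as pd

# Made-up sample rows using three of the columns listed above.
df = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "num-of-doors": ["two", "four", "four"],
    "body-style": ["sedan", "hatchback", "wagon"],
    "price": [13495.0, 16500.0, 8921.0],
})

label_cols = ["fuel-type", "num-of-doors"]   # label encoded
onehot_cols = ["body-style"]                 # one-hot encoded

for col in label_cols:
    # Categories are sorted alphabetically and numbered 0..k-1 (assumed mapping).
    df[col] = df[col].astype("category").cat.codes

# One new 0/1 column per category; the original column is dropped.
df = pd.get_dummies(df, columns=onehot_cols, dtype=int)

# With no path argument, to_csv returns the CSV text; pass a file name to write a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

After both steps every column is numeric, which is the condition the objective sets for the final .csv file.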
Model Prediction results
Model                                   r2 score    MSE value
Linear Regression                       0.8617      9410258.931
Linear Regression with RFE = 56         0.9595      4354786.504
Linear Regression with Normalization    0.9689      0.00019
Decision Tree Regressor                 0.88004     4558429.659
Random Forest Regressor                 0.8098      7167520.255
KNN Regressor (n_neighbors = 5)         0.9984      1.19E-05
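When reading this table, note that MSE depends on the units of the target while the r2 score does not. The slides do not say how the rows with very small MSE values were computed; one possible explanation is that the price was scaled to the 0 to 1 range for those models, in which case their MSE cannot be compared directly with the other rows. The sketch below uses made-up predictions and min-max scaling as an assumption to show the effect.

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error: average of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up true prices and predictions.
y_true = np.array([6000.0, 9000.0, 13000.0, 17000.0, 25000.0])
y_pred = np.array([6500.0, 8500.0, 12000.0, 18000.0, 24000.0])

# Min-max scale both arrays with the same constants (assumed normalization).
lo, hi = y_true.min(), y_true.max()
y_true_s = (y_true - lo) / (hi - lo)
y_pred_s = (y_pred - lo) / (hi - lo)

print("raw:    r2 =", r2(y_true, y_pred), " MSE =", mse(y_true, y_pred))
print("scaled: r2 =", r2(y_true_s, y_pred_s), " MSE =", mse(y_true_s, y_pred_s))
```

Under this reading, the r2 column is the safer basis for comparing the six models.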
