Data Mining Project
(Predict Election Winners)
- By Harshal Kolhatkar
Problem Statement
• An election is to be held in next month, ABC Corporation
a data analytics company wants to predict the future of
two largest parties in country.
• Two major parties are BJP & Congress.
• Goal : Use Polling data to predict state Winner.
Given Dataset
Instance represent a state in a given election
• State : Name of the state
• Year : Election year (2004,2008,2012)
Dependent Variable
• BJP : 1 if BJP won state, 0 if congress won.
Independent Variable
• Times now, India Today : Polled BJP% - Polled Congress%
• DiffCount : Polls with BJP winner – Polls with congress winner
• PropBJP : Polls with BJP winner / # polls
Data Cleaning
Summary of Polling Data
Data Cleaning – Packages to handle Missing Values
List of R Packages
1. MICE (Multiple Imputation Via Chain Equation)
2. Amelia
3. miss Forest
4. Hmisc
5. mi
Data Cleaning
Graphical Representation of Missing Value
Before Cleansing After Cleansing
Data Visualization
Mean = 0.2525253 Mean = 0.02858385
Standard Deviation = 14.27238 Standard Deviation = 1.026924
Before Normalizing After Normalizing
Data Visualization
Mean = 0.3838384 Mean = 0.02858385
Standard Deviation = 15.45745 Standard Deviation = 1.026924
Before Normalizing After Normalizing
Data Visualization – Checking Normality
Before Normalizing After Normalizing
Times Now
Data Visualization – Checking Normality
Before Normalizing After Normalizing
India Today
Data Modeling
Collinearity is a linear association between two explanatory variables.
Two variables are perfectly collinear if there is an exact linear relationship
between them.
Data Modeling (Using Train & Test)
Years : 2004, 2008, 2012
Train : 2004, 2008
Test : 2012
Data Model ( Logistic regression )
With India Today + Prop BJP With Prop BJP
Data Model ( Logistic regression )
• Train model ( Year 2004,2008)
• Accuracy of model = 94.11%
• Test Model (Year 2012)
• Accuracy of data = 96.77%
Conclusion
• Finally I conclude that the model that I have
made is performing well to predict data of
year 2012.
• So we can use this model to predict the state
winners.