Linear Regression Assignment

House Price Prediction

Course: Machine Learning / Data Science

Due Date: Wed. Oct 22 2025

Overview
This assignment will help you develop practical skills in linear regression modeling, from basic
single-predictor models to advanced techniques including regularization. You will work with
real-world housing data to predict house prices.
Notation: Throughout this assignment, use b for the intercept and w (with subscripts w1, w2, ...
for multiple features) for coefficients.

Learning Outcomes
By completing this assignment, you will be able to:

• Build and evaluate linear regression models
• Perform exploratory data analysis and feature selection
• Validate model assumptions and diagnose problems
• Apply advanced techniques like feature engineering and regularization
• Communicate findings effectively through visualizations and reports

Dataset
California Housing Dataset (available via scikit-learn)

• Target Variable: Median house value (MedHouseVal)
• Features: Median income (MedInc), house age (HouseAge), average rooms (AveRooms),
average bedrooms (AveBedrms), population (Population), average occupancy (AveOccup),
latitude (Latitude), longitude (Longitude)
• Size: 20,640 observations

Loading the dataset:


from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()


1 Part 1: Data Exploration & Single Predictor Modeling


Total Points: 30

1.1 Task 1.1: Understanding Your Data (10 points)


Perform the following exploratory data analysis steps:
1. Load the California Housing dataset using scikit-learn
2. Display the first 10 rows of the dataset
3. Report the dataset shape (number of rows and columns)
4. Display basic statistics for all features:
• Mean
• Standard deviation
• Minimum value
• Maximum value
5. Check for missing values in the dataset

Deliverable: Include a summary table of statistics and a statement about data quality in your
report.
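
The exploration steps above can be sketched roughly as follows. This is only an illustration on a small synthetic table, not the real dataset: the values, the variable names (`df`, `summary`, `missing`), and the injected missing value are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing table (NOT the real data): three column
# names borrowed from the feature list, filled with made-up values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MedInc": rng.uniform(0.5, 15.0, size=100),
    "HouseAge": rng.integers(1, 53, size=100).astype(float),
    "AveRooms": rng.uniform(2.0, 10.0, size=100),
})
df.loc[3, "AveRooms"] = np.nan  # injected gap so the missing-value check finds something

print(df.head(10))  # first 10 rows
print(df.shape)     # (number of rows, number of columns)

# Summary table: one row per feature, columns mean / std / min / max
summary = df.agg(["mean", "std", "min", "max"]).T
print(summary)

# Count of missing values per column
missing = df.isna().sum()
print(missing)
```

With the real data, the same calls would be applied to the DataFrame built from the scikit-learn loader, with the target added as an extra column if you want it in the summary.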

1.2 Task 1.2: Building Your First Model (20 points)


Build a simple linear regression model with a single predictor:
1. Select MedInc (median income) as the predictor variable (X)
2. Select MedHouseVal (median house value) as the target variable (y)
3. Create a scatter plot showing the relationship between median income and house value
4. Fit a simple linear regression model using scikit-learn
5. Report the following model parameters:
• Slope (coefficient): w
• Intercept: b
• Write the equation: ŷ = b + wx
6. Provide an interpretation: “For every 1-unit increase in median income, house value
increases/decreases by ___ units”
7. Calculate and report model performance metrics:
• R² (coefficient of determination)
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
8. Create a visualization: Plot the regression line on top of the scatter plot
9. Make a prediction: What is the predicted house value for a median income of 5.0?

Deliverable: Scatter plot with regression line, model equation, performance metrics, and
interpretation.
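
As a rough sketch of the fitting and reporting steps in this task, the example below uses synthetic data generated from a known line, so the recovered w and b can be checked by eye. The data, the true slope 0.4, and the true intercept 0.5 are assumptions for illustration and say nothing about the California Housing results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (NOT the California Housing values):
# y = 0.5 + 0.4 * x + noise, so the fit should land near w = 0.4, b = 0.5.
rng = np.random.default_rng(42)
x = rng.uniform(0.5, 15.0, size=500)
y = 0.5 + 0.4 * x + rng.normal(0.0, 0.5, size=500)

X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
model = LinearRegression().fit(X, y)

w = model.coef_[0]    # slope
b = model.intercept_  # intercept
y_hat = model.predict(X)

r2 = r2_score(y, y_hat)
mse = mean_squared_error(y, y_hat)
rmse = np.sqrt(mse)   # RMSE is the square root of MSE

print(f"y_hat = {b:.3f} + {w:.3f} * x")
print(f"R^2 = {r2:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")

# Prediction for a single new input, x = 5.0
pred_at_5 = model.predict(np.array([[5.0]]))[0]
print(f"prediction at x = 5.0: {pred_at_5:.3f}")
```

The prediction is simply b + w × 5.0, which is a quick way to sanity-check the value returned by `predict`.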


2 Part 2: Multiple Predictors & Model Diagnostics


Total Points: 40

2.1 Task 2.1: Expanding to Multiple Features (15 points)


Build a multiple linear regression model:

1. Select the following 5 features as predictors:

• MedInc (Median Income)
• HouseAge (House Age)
• AveRooms (Average Rooms)
• AveBedrms (Average Bedrooms)
• Population (Population)

2. Fit a multiple linear regression model using these features

3. Report all coefficients with their corresponding feature names in a table


(Use w1, w2, w3, w4, w5 for the 5 features and b for the intercept)

4. Compare the performance with the simple linear regression from Part 1:

• R² comparison
• RMSE comparison

5. Analysis: Which model performs better? Explain why in 2-3 sentences.

Deliverable: Coefficient table, performance comparison, and written analysis.
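
One possible way to organize the coefficient table and the comparison is sketched below. The data is synthetic, and `true_w`, `true_b`, and the helper `fit_and_score` are hypothetical names introduced only for this example; only the five feature names come from the task.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population"]

# Synthetic stand-in data; true_w and true_b are made-up values for illustration.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=feature_names)
true_w = np.array([0.8, 0.1, -0.3, 0.5, 0.0])
true_b = 2.0
y = true_b + X.to_numpy() @ true_w + rng.normal(0.0, 0.5, size=1000)

def fit_and_score(X_subset, y):
    """Fit ordinary least squares; return (model, R^2, RMSE) on the same data."""
    model = LinearRegression().fit(X_subset, y)
    y_hat = model.predict(X_subset)
    return model, r2_score(y, y_hat), np.sqrt(mean_squared_error(y, y_hat))

simple_model, simple_r2, simple_rmse = fit_and_score(X[["MedInc"]], y)
multi_model, multi_r2, multi_rmse = fit_and_score(X, y)

# Coefficient table: w1..w5 paired with feature names; b is reported separately
coef_table = pd.DataFrame({
    "symbol": [f"w{i}" for i in range(1, 6)],
    "feature": feature_names,
    "coefficient": multi_model.coef_,
})
print(coef_table)
print(f"b = {multi_model.intercept_:.3f}")
print(f"simple:   R^2 = {simple_r2:.3f}, RMSE = {simple_rmse:.3f}")
print(f"multiple: R^2 = {multi_r2:.3f}, RMSE = {multi_rmse:.3f}")
```

Keep in mind that R² measured on the same data used for fitting cannot decrease when predictors are added to an ordinary least squares model, so a higher value here does not by itself show that the larger model generalizes better.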

2.2 Task 2.2: Understanding Feature Relationships (10 points)


Analyze the relationships between features:

1. Create a correlation matrix including all features and the target variable

2. Visualize the correlation matrix using a heatmap (use seaborn or matplotlib)

3. Identify and list the top 3 features most correlated with house value

4. Check for multicollinearity: Are there any pairs of features highly correlated with each
other? (correlation > 0.7 or < −0.7)

5. Discuss: How might multicollinearity affect your model?

Deliverable: Correlation heatmap and written analysis of feature relationships.
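
The correlation steps can be sketched along these lines. The table is synthetic, and one column is deliberately constructed to track another so that the |r| > 0.7 check flags a pair; none of the numbers reflect the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: "AveBedrms" is built to follow "AveRooms" closely,
# so that pair should be flagged. Values are NOT from the real dataset.
rng = np.random.default_rng(1)
n = 500
rooms = rng.normal(5.0, 1.0, size=n)
df = pd.DataFrame({
    "MedInc": rng.normal(4.0, 2.0, size=n),
    "AveRooms": rooms,
    "AveBedrms": 0.2 * rooms + rng.normal(0.0, 0.05, size=n),
    "Population": rng.normal(1400.0, 300.0, size=n),
})
df["MedHouseVal"] = 0.4 * df["MedInc"] + rng.normal(0.0, 0.5, size=n)

corr = df.corr()  # Pearson correlation matrix, features and target together

# Features ranked by absolute correlation with the target
target_corr = corr["MedHouseVal"].drop("MedHouseVal")
top3 = target_corr.abs().sort_values(ascending=False).head(3)
print(top3)

# Feature pairs (target excluded) with |r| > 0.7
features = [c for c in df.columns if c != "MedHouseVal"]
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > 0.7
]
print(high_pairs)
```

The `corr` matrix is what you would hand to a heatmap call such as `sns.heatmap(corr, annot=True)`.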


2.3 Task 2.3: Validating Model Assumptions (15 points)


Linear regression relies on several assumptions. Check these assumptions:

2.3.1 Residual Analysis (5 points)


1. Calculate residuals: residuals = y_actual − ŷ

2. Create a residual plot: residuals vs. predicted values

3. Analyze the plot: Is there any visible pattern (e.g., funnel shape, curves)?

4. Interpretation: What does the pattern (or lack thereof) indicate about your model?

2.3.2 Normality Check (5 points)


1. Create a histogram of residuals

2. Create a Q-Q (quantile-quantile) plot of residuals

3. Assessment: Do the residuals appear to be normally distributed?

4. Explain: Why is normality of residuals important?
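
To get a feel for what the normality check looks like numerically, the sketch below compares two synthetic residual samples, one roughly normal and one right-skewed. It assumes SciPy is available (it is installed alongside scikit-learn); `qq_summary` is a hypothetical helper name, and neither sample comes from a fitted model.

```python
import numpy as np
from scipy import stats

# Synthetic residuals for illustration only: one roughly normal sample and
# one right-skewed sample, to show how the Q-Q fit differs between them.
rng = np.random.default_rng(0)
normal_resid = rng.normal(0.0, 1.0, size=1000)
skewed_resid = rng.exponential(1.0, size=1000) - 1.0

def qq_summary(residuals):
    """Return the Q-Q points and the correlation r of the fitted Q-Q line."""
    (theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
        residuals, dist="norm"
    )
    return theoretical_q, ordered_resid, r

_, _, r_normal = qq_summary(normal_resid)
_, _, r_skewed = qq_summary(skewed_resid)
print(f"Q-Q correlation, normal sample: {r_normal:.4f}")
print(f"Q-Q correlation, skewed sample: {r_skewed:.4f}")

# Histogram counts (the same binning a histogram plot would draw)
counts, bin_edges = np.histogram(normal_resid, bins=30)
print(counts)
```

Points that hug the Q-Q line give a correlation close to 1; the skewed sample bends away from the line in the tails and scores lower. Passing `plot=plt` to `stats.probplot` draws the plot directly with matplotlib.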

2.3.3 Outlier Detection (5 points)


1. Define outliers as observations where:
|residual| > 3 × σ_residuals

2. Count: How many outliers did you find?

3. Display the first 5 outliers showing:

• Actual house value
• Predicted house value
• Residual value

4. Discuss: Should these outliers be removed? Why or why not?

Deliverable: Residual plot, histogram, Q-Q plot, outlier analysis table, and interpretations.
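
The 3σ outlier rule from 2.3.3 could be applied roughly as shown here. The actual and predicted values are synthetic, with five large errors injected on purpose, so the counts below are an artifact of the example rather than anything you should expect from the housing model.

```python
import numpy as np
import pandas as pd

# Synthetic actual/predicted values (NOT real model output); five large errors
# are injected so the 3-sigma rule has something to flag.
rng = np.random.default_rng(0)
y_pred = rng.uniform(0.5, 5.0, size=1000)
y_actual = y_pred + rng.normal(0.0, 0.3, size=1000)
y_actual[:5] += 3.0

residuals = y_actual - y_pred
sigma = residuals.std()  # standard deviation of the residuals (NumPy default, ddof=0)
is_outlier = np.abs(residuals) > 3 * sigma

# Table of flagged observations: actual, predicted, residual
outlier_table = pd.DataFrame({
    "actual": y_actual[is_outlier],
    "predicted": y_pred[is_outlier],
    "residual": residuals[is_outlier],
})
print(f"outliers found: {is_outlier.sum()}")
print(outlier_table.head(5))
```

Note that the outliers themselves inflate σ, which makes the threshold somewhat more lenient than it would be on clean residuals.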


3 Part 3: Model Optimization & Enhancement


Total Points: 30

3.1 Task 3.1: Evaluating Model Generalization (10 points)


Test how well your model generalizes to unseen data:

1. Split the dataset into training and testing sets:

• Training set: 80% of the data
• Testing set: 20% of the data
• Use random_state=42 for reproducibility

2. Fit the multiple linear regression model (from Task 2.1) on the training data only

3. Evaluate the model on both sets and report:

• Training R²
• Testing R²
• Training RMSE
• Testing RMSE

4. Create a comparison table or bar chart showing training vs. testing performance

5. Analysis:

• Is there evidence of overfitting? (Training performance much better than testing)
• Is there evidence of underfitting? (Both performances are poor)
• Explain your conclusion in 2-3 sentences

Deliverable: Performance comparison table/chart and overfitting/underfitting analysis.
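
A minimal sketch of the split-and-evaluate workflow, assuming synthetic data with five features in place of the real table. The helper `evaluate` is a hypothetical name for this example; the 80/20 split and `random_state=42` follow the task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the housing data): 5 features, made-up weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 2.0 + X @ np.array([0.8, 0.1, -0.3, 0.5, 0.0]) + rng.normal(0.0, 0.5, size=1000)

# 80% training, 20% testing, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)  # fit on training rows only

def evaluate(model, X_part, y_part):
    """Return (R^2, RMSE) of the model on the given rows."""
    y_hat = model.predict(X_part)
    return r2_score(y_part, y_hat), np.sqrt(mean_squared_error(y_part, y_hat))

train_r2, train_rmse = evaluate(model, X_train, y_train)
test_r2, test_rmse = evaluate(model, X_test, y_test)
print(f"train: R^2 = {train_r2:.3f}, RMSE = {train_rmse:.3f}")
print(f"test:  R^2 = {test_r2:.3f}, RMSE = {test_rmse:.3f}")
```

A large gap between the two rows would point toward overfitting, while two similar but low R² values would point toward underfitting.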

3.2 Task 3.2: Creating Better Features (10 points)


Engineer new features to improve model performance:

1. Create two new interaction features:

• RoomsPerHousehold = AveRooms / AveOccup
• BedroomsPerRoom = AveBedrms / AveRooms

2. Create a polynomial feature:

• MedInc_squared = (MedInc)²

3. Fit a new model including:

• All original features from Task 2.1


• The 3 newly engineered features

4. Evaluate on the test set and report R² and RMSE

5. Compare with the model from Task 3.1:


• Did the R² improve? By how much?


• Did the RMSE decrease? By how much?

6. Interpretation: Which engineered feature(s) seem most valuable? How can you tell?

Deliverable: Performance comparison and analysis of feature engineering impact.
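
The three engineered features can be computed column-wise with pandas, as in this small sketch. The three-row table is made up so the arithmetic is easy to verify by hand; the helper name `add_engineered_features` and the column spelling `MedInc_squared` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up table (NOT real data) so the results can be checked by hand.
df = pd.DataFrame({
    "MedInc": [2.0, 4.0, 8.0],
    "AveRooms": [4.0, 6.0, 5.0],
    "AveBedrms": [1.0, 1.2, 1.0],
    "AveOccup": [2.0, 3.0, 2.5],
})

def add_engineered_features(frame):
    """Return a copy of the frame with the three engineered columns appended."""
    out = frame.copy()
    out["RoomsPerHousehold"] = out["AveRooms"] / out["AveOccup"]
    out["BedroomsPerRoom"] = out["AveBedrms"] / out["AveRooms"]
    out["MedInc_squared"] = out["MedInc"] ** 2
    return out

engineered = add_engineered_features(df)
print(engineered)
```

Because each new column depends only on values in the same row, the function can be applied before or after the train/test split without leaking information between the two sets.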


3.3 Task 3.3: Preventing Overfitting with Regularization (10 points)


Apply regularization techniques to improve model generalization:

1. Implement Ridge Regression with the following alpha values: [0.1, 1, 10, 100]

• Use the feature set from Task 3.2 (with engineered features)
• For each alpha, train on the training set
• Report the testing R² score for each alpha

2. Implement Lasso Regression with the same alpha values: [0.1, 1, 10, 100]

• Use the same feature set


• Report the testing R² score for each alpha

3. Create a line plot showing:

• X-axis: Alpha values (consider using log scale)


• Y-axis: Testing R²
• Two lines: one for Ridge, one for Lasso

4. Identify the best model:

• Which regularization technique performs best?


• What is the optimal alpha value?

5. Create a coefficient comparison table with three columns:

• Column 1: Standard Linear Regression coefficients


• Column 2: Best Ridge model coefficients
• Column 3: Best Lasso model coefficients

6. Analysis:

• Which features did Lasso shrink to zero (or near zero, |w| < 0.01)?
• What does this tell you about feature importance?
• Explain the difference between Ridge and Lasso in your own words

Deliverable: Alpha vs. R² plot, coefficient comparison table, and regularization analysis.
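
The alpha sweep might be structured as in the sketch below. It runs on synthetic data with eight unit-scale features, the last three of which have no true effect, so it only illustrates the mechanics; `true_w` and the other variable names are assumptions, and the scores have no bearing on the housing results.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the housing data): 8 features on unit scale,
# the last three with zero true effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
true_w = np.array([0.8, 0.1, -0.3, 0.5, 0.2, 0.0, 0.0, 0.0])
y = 2.0 + X @ true_w + rng.normal(0.0, 0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

alphas = [0.1, 1, 10, 100]
ridge_r2, lasso_r2 = [], []
lasso_models = {}
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    lasso = Lasso(alpha=alpha).fit(X_train, y_train)
    ridge_r2.append(ridge.score(X_test, y_test))  # .score returns R^2 for regressors
    lasso_r2.append(lasso.score(X_test, y_test))
    lasso_models[alpha] = lasso

for alpha, r, l in zip(alphas, ridge_r2, lasso_r2):
    print(f"alpha = {alpha:>5}: Ridge R^2 = {r:.3f}, Lasso R^2 = {l:.3f}")

best_ridge_alpha = alphas[int(np.argmax(ridge_r2))]
best_lasso_alpha = alphas[int(np.argmax(lasso_r2))]

# Coefficients the best Lasso model shrank to (near) zero, using |w| < 0.01
near_zero = np.abs(lasso_models[best_lasso_alpha].coef_) < 0.01
print(f"best Lasso alpha: {best_lasso_alpha}, near-zero coefficients: {near_zero}")
```

The lists `ridge_r2` and `lasso_r2` are what the line plot would draw against `alphas`. Both penalties depend on feature scale, so with real features measured in very different units, standardizing them first is a common choice.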


4 Deliverables
4.1 1. Python Code (Jupyter Notebook or .py file)
• Well-commented code explaining each step
• Clear section headers matching the assignment parts
• Code should run without errors
• Include all necessary imports at the beginning

4.2 2. Written Report (PDF, 4-6 pages)


Your report should include the following sections:

Introduction (0.5 page)


• Brief description of the California Housing dataset
• Statement of the prediction objective
• Overview of your approach

Results (3-4 pages)


For each task, present:

• Key findings with supporting tables and figures


• All requested metrics and values
• Clear figure captions and table labels

Analysis & Discussion (1-1.5 pages)


Address these key questions:

• Which features are most important for predicting house prices?


• How did model performance improve from Part 1 to Part 3?
• What are the limitations of your final model?
• In what situations might your model make poor predictions?

Conclusion (0.5 page)


• Summary of key learnings
• Recommendations for future improvements
• Reflection on the linear regression technique

4.3 3. Visualizations (Minimum 8 plots)


Required plots:

1. Scatter plot with regression line (Part 1)


2. Correlation heatmap (Part 2)
3. Residual plot (Part 2)
4. Histogram of residuals (Part 2)
5. Q-Q plot of residuals (Part 2)
6. Training vs. Testing performance comparison (Part 3)
7. Alpha vs. R² plot for regularization (Part 3)
8. At least one additional plot supporting your analysis


Visualization Guidelines:

• All plots must have clear titles


• Label all axes with units where applicable
• Use appropriate colors and legends
• Ensure plots are readable (not too small)


5 Grading Rubric

Component Points
Part 1: Data Exploration & Single Predictor Modeling 30
Task 1.1: Understanding Your Data 10
Task 1.2: Building Your First Model 20
Part 2: Multiple Predictors & Model Diagnostics 40
Task 2.1: Expanding to Multiple Features 15
Task 2.2: Understanding Feature Relationships 10
Task 2.3: Validating Model Assumptions 15
Part 3: Model Optimization & Enhancement 30
Task 3.1: Evaluating Model Generalization 10
Task 3.2: Creating Better Features 10
Task 3.3: Preventing Overfitting with Regularization 10
Code Quality & Documentation 10
Clear comments and structure 5
Code runs without errors 5
Report Quality & Analysis 15
Clear writing and organization 5
Depth of analysis and insights 7
Professional formatting 3
Visualizations 10
All required plots included 5
Quality and clarity of visualizations 5
Total 100

Letter Grade Conversion


• A: 90 - 100 points
• B: 80 - 89.9 points
• C: 70 - 79.9 points
• D: 60 - 69.9 points
• F: Below 60 points


6 Submission Guidelines
What to Submit
Create a ZIP file containing:

1. Python code file (.ipynb or .py)

2. Written report (PDF format)

3. All figures/plots as separate image files (optional, if not embedded in report)

File Naming Convention


• ZIP file: LastName_FirstName_LinearRegression.zip
• Code file: LastName_FirstName_Code.ipynb
• Report: LastName_FirstName_Report.pdf

Submission Platform
Moodle

Important Dates
• Assignment Release: Thu. Oct 09 2025
• Due Date: Wed. Oct 22 2025 at 11:50 pm

Academic Integrity
• You may discuss concepts with classmates, but all code and writing must be your own
• Properly cite any external resources or code snippets used
• Use of AI tools (ChatGPT, Copilot, etc.) must be disclosed in your report
• Plagiarism will result in zero points and potential disciplinary action

7 Resources
Required Python Libraries
Install using pip or conda:
pip install numpy pandas matplotlib seaborn scikit-learn

Helpful Documentation
• Scikit-learn: https://scikit-learn.org/stable/
• Linear Regression: https://scikit-learn.org/stable/modules/linear_model.html
• California Housing Dataset: https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset
• Matplotlib: https://matplotlib.org/
• Seaborn: https://seaborn.pydata.org/
• Pandas: https://pandas.pydata.org/


Getting Started Code

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error

# Load dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = california.target

# Display basic info
print(X.head())
print(X.shape)

Good Luck!
Remember: The goal is to learn and understand linear regression,
not just to get the right answers. Show your thinking process!
