PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No. 2
A.1 Aim:
To implement Linear Regression.
A.2 Prerequisite:
Python Basic Concepts
A.3 Outcome:
Students will be able to implement Linear Regression.
A.4 Theory:
Machine Learning, being a subset of Artificial Intelligence (AI), has been playing a dominant
role in our daily lives. Data science engineers and developers working in various domains are
widely using machine learning algorithms to make their tasks simpler and life easier.
What is a Regression Problem?
The majority of machine learning algorithms fall under the supervised learning category, in which
an algorithm learns to predict a result from previously observed inputs and the outputs recorded
for them. Suppose we have an input variable ‘x’ and an output variable ‘y’, where y is a function
of x (y = f(x)). Supervised learning reads known pairs of ‘x’ and ‘y’ so that it can later predict
‘y’ accurately for a new value of ‘x’. A regression problem is one in which the output variable
takes a real, continuous value. Regression tries to draw the line of best fit through the
observed data points.
Linear Regression
Linear regression is a simple statistical regression method used for predictive analysis; it
models the relationship between continuous variables. It assumes a linear relationship between
the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear
regression. If there is a single input variable (x), the model is called simple linear
regression; if there is more than one input variable, it is called multiple linear regression.
The linear regression model gives a sloped straight line describing the relationship between the
variables.
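The difference between simple and multiple linear regression can be illustrated with a small
sketch. The data below is hypothetical and noise-free, and NumPy's least-squares solver stands in
for whichever fitting routine is actually used; the names X_design, intercept and coefs are
illustrative assumptions.

```python
import numpy as np

# Hypothetical data: two input variables, with the target generated from
# y = 1 + 2*x1 + 3*x2 and no noise.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem; the first returned item is the parameter vector.
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = params[0], params[1:]

print("intercept:", round(intercept, 3))
print("coefficients:", np.round(coefs, 3))
```

With a single column in X this reduces to simple linear regression; each extra column adds one
more coefficient.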
To calculate the best-fit line, linear regression uses the traditional slope-intercept form:
y = a0 + a1*x
where
y = dependent variable.
x = independent variable.
a0 = intercept of the line.
a1 = linear regression coefficient (slope of the line).
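For a single input variable, a0 and a1 have a closed-form least-squares solution. The sketch below
is one minimal way to compute it in plain Python; the function name and the sample data are
illustrative assumptions, not part of the experiment.

```python
# Minimal sketch: closed-form least squares for simple linear regression.

def fit_simple_linear_regression(xs, ys):
    """Return (a0, a1) minimising the sum of squared errors for y = a0 + a1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = y_mean - a1 * x_mean
    return a0, a1

# Made-up data generated from y = 1 + 2*x, so the fit recovers it exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a0, a1 = fit_simple_linear_regression(xs, ys)
print(f"a0 = {a0:.2f}, a1 = {a1:.2f}")
```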
Need for Linear Regression
As mentioned above, linear regression estimates the relationship between a dependent variable
and an independent variable. Let’s understand this with an easy example:
Let’s say we want to estimate the salary of an employee based on years of experience. We have
recent company data that indicates the relationship between experience and salary. Here years of
experience is the independent variable, and the salary of an employee is the dependent variable,
as the salary depends on the experience of the employee. Using this insight, we can predict the
future salary of an employee based on current and past information.
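The salary example can be sketched as follows. The figures are invented purely for illustration
(they are not company data), and np.polyfit with degree 1 is used here as one convenient way to
obtain a least-squares straight line.

```python
import numpy as np

# Hypothetical data: years of experience and salary (in thousands).
years = np.array([1, 2, 3, 4, 5], dtype=float)
salary = np.array([30, 35, 41, 44, 50], dtype=float)

# A degree-1 polynomial fit is a least-squares straight line.
# np.polyfit returns the highest-degree coefficient first: [slope, intercept].
a1, a0 = np.polyfit(years, salary, deg=1)

# Predict the salary for an employee with 6 years of experience.
predicted = a0 + a1 * 6
print(f"salary = {a0:.2f} + {a1:.2f} * years")
print(f"predicted salary at 6 years: {predicted:.2f}")
```

The prediction for 6 years lies just outside the observed range, so it should be read as an
extrapolation of the fitted trend.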
A regression line can show a positive linear relationship (y increases as x increases) or a
negative linear relationship (y decreases as x increases).
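The direction of the relationship corresponds to the sign of the fitted slope. A small sketch,
using a hypothetical helper name and made-up data:

```python
import numpy as np

def relationship_direction(xs, ys):
    """Classify a fitted line by the sign of its slope (hypothetical helper)."""
    slope, _intercept = np.polyfit(xs, ys, deg=1)
    if slope > 0:
        return "positive"
    if slope < 0:
        return "negative"
    return "none"

# Illustrative data: one rising trend and one falling trend.
print(relationship_direction([1, 2, 3, 4], [10, 12, 15, 19]))
print(relationship_direction([1, 2, 3, 4], [20, 16, 13, 9]))
```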
PART B
(PART B : TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical.
The soft copy must be uploaded on the Blackboard or emailed to the concerned lab in-charge faculty at
the end of the practical in case there is no Blackboard access available)
Roll. No. BE-A10 Name: Nishad Sutar
Class: BE-Comps A Batch: A1
Date of Experiment: 14/07/2025 Date of Submission: 21/07/2025
Grade:
B.1 Software Code written by student:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedianHouseValue'] = housing.target
print(df.head())
df.info()
# Data Wrangling
print("Missing values in dataset:\n", df.isnull().sum())
# Feature Engineering
# Create ratio features: average rooms relative to block population, and bedrooms per room
df['RoomsPerPerson'] = df['AveRooms'] / df['Population']
df['BedroomsPerRoom'] = df['AveBedrms'] / df['AveRooms']
# Handle infinite or NaN values (in case Population = 0)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
# Optional log transformation (can reduce skewness)
df['LogPopulation'] = np.log1p(df['Population'])
df['LogAveOccup'] = np.log1p(df['AveOccup'])
# Drop raw columns if using log-transformed versions
df.drop(['Population', 'AveOccup'], axis=1, inplace=True)
# Feature Matrix and Target
X = df.drop('MedianHouseValue', axis=1)
y = df['MedianHouseValue']
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Model Training
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
# Evaluation
print(f"Intercept (a0): {model.intercept_}")
print(f"First Coefficient (a1): {model.coef_[0]}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred):.2f}")
# Plotting (a single panel, so no subplot grid is needed)
plt.figure(figsize=(6, 5))
# Actual vs Predicted
plt.scatter(y_test, y_pred, alpha=0.5, color='blue')
plt.xlabel("Actual Median House Value")
plt.ylabel("Predicted Median House Value")
plt.title("Actual vs Predicted Values")
plt.grid(True)
plt.tight_layout()
plt.show()
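The two metrics printed in the Evaluation step can also be computed by hand, which makes their
meaning explicit. The sketch below uses the standard definitions of mean squared error and the
R² score on small made-up arrays; the function name mse_and_r2 is an illustrative assumption.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) computed from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # MSE: average of the squared prediction errors.
    mse = np.mean(residuals ** 2)
    # R^2: 1 minus (residual sum of squares / total sum of squares).
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

# Made-up values, used only to exercise the formulas.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
mse, r2 = mse_and_r2(y_true, y_pred)
print(f"MSE: {mse:.3f}, R2: {r2:.3f}")
```

An R² close to 1 means the predictions explain most of the variation in the target, while a
lower MSE means smaller average squared errors.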
B.2 Input and Output:
B.3 Observations and learning:
In this experiment, I successfully implemented the Linear Regression algorithm, a fundamental
supervised learning technique used for predictive analysis. I observed that this model works by
establishing a linear relationship between a dependent (target) variable and one or more independent
(predictor) variables. The core task was to find the "line of best fit" that most accurately represents the
data points, which is mathematically described by the slope-intercept formula, y = a0 + a1*x. I noted the
distinction between simple linear regression, which involves a single independent variable, and multiple
linear regression, which uses several. The example of predicting an employee's salary based on years of
experience clearly illustrated how this algorithm can be used to forecast continuous values in real-world
scenarios.
B.4 Conclusion:
In conclusion, this experiment fulfilled its aim of implementing a Linear Regression model. Through this
process, I have gained a practical understanding of how to predict continuous outcomes by modeling the
relationships between variables. The experiment reinforces that Linear Regression is a straightforward
and powerful statistical tool that serves as a cornerstone of machine learning. Its simplicity and
interpretability make it an essential algorithm for data scientists to master for tasks involving forecasting
and understanding data relationships.
B.5 Question of Curiosity
(To be answered by student based on the practical performed and learning/observations)