
RollNo – 21131A4416

MACHINE INTELLIGENCE APPLICATIONS LAB

Week 07

Aim:
Write a program to perform Linear Discriminant Analysis for binary classification,
considering a real-time dataset.

Description:
Linear Discriminant Analysis (LDA) is a dimensionality reduction and
classification technique. It seeks to find a projection that maximizes the separation
between two classes while minimizing the variance within each class. By
transforming the data into a lower-dimensional space, LDA aims to enhance the
class separability, making it useful for binary classification tasks.
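
For reference, the criterion that standard (Fisher) LDA maximizes for two classes can be written as

J(w) = (w^T S_B w) / (w^T S_W w)

where S_B = (m1 - m2)(m1 - m2)^T is the between-class scatter, S_W is the sum of the within-class scatter matrices, and the maximizing projection direction is w proportional to S_W^(-1)(m1 - m2). This is standard background, not part of the original program.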

Dataset –‘iris.csv’
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

dataset = pd.read_csv("/content/Iris.csv")

# Perform one-hot encoding for the 'Species' column
encoder = OneHotEncoder()
encoded_species = encoder.fit_transform(dataset[['Species']]).toarray()
encoded_species_df = pd.DataFrame(encoded_species, columns=encoder.get_feature_names_out(['Species']))
dataset = pd.concat([dataset, encoded_species_df], axis=1)

# Drop the original 'Species' column after encoding
dataset.drop(['Species'], axis=1, inplace=True)

X = dataset.iloc[:, :-3].values  # Corrected slicing for feature matrix
y = dataset.iloc[:, -3:].values  # Corrected slicing for target variable

sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()

# Convert the one-hot columns back to a single class index before applying LabelEncoder
y = y.argmax(axis=1)
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='rainbow', alpha=0.7, edgecolors='b')

classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print('Accuracy : ' + str(accuracy_score(y_test, y_pred)))
conf_m = confusion_matrix(y_test, y_pred)
print(conf_m)

Output:
Accuracy : 0.9333333333333333
[[13 0 0]
[ 1 6 1]
[ 0 0 9]]
WEEK-8
AIM: Write a program to implement linear regression for a sample training dataset stored as a .CSV file.
DESCRIPTION: Linear regression is a type of supervised machine learning algorithm
that computes the linear relationship between a dependent variable and one or more
independent features. When the number of independent features is 1, it is known as
univariate linear regression; with more than one feature, it is known as multivariate
linear regression. The goal of the algorithm is to find the best linear equation that
can predict the value of the dependent variable from the independent variables. The
equation provides a straight line that represents the relationship between the dependent and
independent variables; the slope of the line indicates how much the dependent variable
changes for a unit change in the independent variable(s).
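
As a minimal illustration of the idea (synthetic data; this is independent of the housing dataset used in the program below), fitting y = b0 + b1*x by ordinary least squares recovers the slope and intercept:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))                 # one independent feature
y = 3.0 * x[:, 0] + 5.0 + rng.normal(0, 1.0, 100)     # true slope 3, intercept 5, plus noise

toy_model = LinearRegression().fit(x, y)
print("slope:", toy_model.coef_[0], "intercept:", toy_model.intercept_)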

Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("/content/data.csv")
data.head()

data= data.drop(['date','street','city','statezip','country'], axis=1)


data.isnull().sum()

data.info()

data.describe()

Exploratory Data Analysis:

import warnings
warnings.filterwarnings('ignore')

sns.distplot(data["price"], hist=True, kde=True)  # note: distplot is deprecated in newer seaborn; sns.histplot(..., kde=True) is the replacement
plt.xlabel("Price")
plt.ylabel("Density")
plt.title("Density of price")
plt.legend(["Price"])
plt.show()
print("Standard Deviation",np.std(data["price"]))

sns.pairplot(data)

sns.heatmap(data.corr(),annot=True)

x = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
          'floors', 'waterfront', 'view', 'condition', 'sqft_above',
          'sqft_basement', 'yr_built', 'yr_renovated']]
y = data['price']
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3, random_state= 42)

Building a Model:

from sklearn.linear_model import LinearRegression

LR = LinearRegression()
LR.fit(x_train, y_train)
y_pred = LR.predict(x_test)
from sklearn import metrics
from sklearn.metrics import r2_score
R2 = r2_score(y_test, y_pred)
print("R2Score:",R2)
MAE = metrics.mean_absolute_error(y_test, y_pred)
MSE = metrics.mean_squared_error(y_test,y_pred)

print("Mean absolute Error",MAE)
print("Mean squared Error",MSE)
R2Score: 0.05963787830481415
Mean absolute Error 196135.02968581286
Mean squared Error 679366480217.377

print('Intercept of the model:', LR.intercept_)
print('Coefficient of the line:', LR.coef_)

Intercept of the model: 5100419.087753873
Coefficient of the line: [-6.96263921e+04  2.99215643e+04  1.93470477e+02 -4.90437951e-01
  7.17703866e+04  4.07818603e+05  4.10106223e+04  2.90841996e+04
  1.00645260e+02  9.28252165e+01 -2.65359055e+03  5.49482827e+00]

coefficients = LR.coef_
feature_names = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
'floors', 'waterfront', 'view', 'condition', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated']
plt.figure(figsize=(10, 6))
plt.barh(feature_names, coefficients, color='skyblue')
plt.xlabel('Coefficient Value')
plt.ylabel('Features')
plt.title('Coefficients of Linear Regression Model')
plt.gca().invert_yaxis()
plt.show()
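
Because the features are on very different scales (square feet vs. counts vs. years), the raw coefficient magnitudes above are not directly comparable. A small optional sketch, assuming the same x_train and y_train as above, that standardizes the features first so the coefficients can be compared:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_LR = make_pipeline(StandardScaler(), LinearRegression())
scaled_LR.fit(x_train, y_train)
print(scaled_LR.named_steps['linearregression'].coef_)   # coefficients on standardized features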

from sklearn.feature_selection import SelectKBest, f_regression

# k_best was not defined in the original listing; SelectKBest with k=8 is an assumed reconstruction
k_best = SelectKBest(score_func=f_regression, k=8)
k_best.fit(x_train, y_train)
x_train.columns[k_best.get_support()]
x_train_selected = x_train[x_train.columns[k_best.get_support()]]
x_test_selected = x_test[x_train.columns[k_best.get_support()]]
LR = LinearRegression()
LR.fit(x_train_selected, y_train)
y_pred1 = LR.predict(x_test_selected)

from sklearn import metrics


from sklearn.metrics import r2_score
R2 = r2_score(y_test, y_pred1)
print("R2Score:",R2)
MAE = metrics.mean_absolute_error(y_test, y_pred1)
MSE = metrics.mean_squared_error(y_test,y_pred1)
print("Mean absolute Error",MAE)
print("Mean squared Error",MSE)
R2Score: 0.0539628699713669
Mean absolute Error 204387.02881027036
Mean squared Error 683466401245.4042

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold

# Create a Linear Regression model
model = LinearRegression()

# Define the number of folds (in this case, 10)
n_folds = 10

# Create a KFold object
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

# Initialize lists to store evaluation metrics for each fold
mse_scores = []
mae_scores = []
R2_scores = []

# Split your data into folds and perform cross-validation
for train_index, val_index in kf.split(x_train):
    x_train_fold, x_val_fold = x_train.iloc[train_index], x_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Train the model on the training fold
    model.fit(x_train_fold, y_train_fold)

    # Make predictions on the validation fold
    y_val_pred = model.predict(x_val_fold)

    # Calculate Mean Squared Error (MSE), Mean Absolute Error (MAE) and R2 for this fold
    mse = mean_squared_error(y_val_fold, y_val_pred)
    mae = mean_absolute_error(y_val_fold, y_val_pred)
    R2 = r2_score(y_val_fold, y_val_pred)

    # Append the scores to the lists
    mse_scores.append(mse)
    mae_scores.append(mae)
    R2_scores.append(R2)

# Calculate the average and standard deviation of MSE and MAE across all folds
average_mse = sum(mse_scores) / len(mse_scores)
average_mae = sum(mae_scores) / len(mae_scores)
average_R2 = sum(R2_scores)/len(R2_scores)

# Print the average scores
print("Average Mean Squared Error (MSE):", average_mse)
print("Average Mean Absolute Error (MAE):", average_mae)
print("Average R2Score:", average_R2)
Average Mean Squared Error (MSE): 68091180895.09597
Average Mean Absolute Error (MAE): 159891.7275636576
Average R2Score: 0.5306868413035454
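
The same averages can be cross-checked more concisely with cross_val_score (a sketch, assuming the x_train, y_train and kf objects defined above):

from sklearn.model_selection import cross_val_score

cv_r2 = cross_val_score(LinearRegression(), x_train, y_train, cv=kf, scoring='r2')
cv_mse = -cross_val_score(LinearRegression(), x_train, y_train, cv=kf, scoring='neg_mean_squared_error')
print("Average R2Score:", cv_r2.mean())
print("Average MSE:", cv_mse.mean())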

import pandas as pd

# Example: List of results for different models


results = [
{'Model': 'Without Preprocessing Dataset', 'R2 Score': 0.05963787830481415, 'MAE':
196135.02968581286, 'MSE': 679366480217.377},
{'Model': 'Using k-best features', 'R2 Score':0.0539628699713669,
'MAE':204387.02881027036, 'MSE': 683466401245.4042},
{'Model': 'Using k-fold Cross Validation', 'R2 Score': 0.5306868413035454, 'MAE':
159891.7275636576, 'MSE': 68091180895.09597},
]

# Convert the results list into a DataFrame


df = pd.DataFrame(results)

# Calculate the correlation matrix


correlation_matrix = df.corr()

# Print the correlation matrix


print(correlation_matrix)

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Model Metrics")
plt.show()
          R2 Score       MAE       MSE
R2 Score  1.000000 -0.986445 -0.999990
MAE      -0.986445  1.000000  0.985682
MSE      -0.999990  0.985682  1.000000

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Example: Results for three models


results = [
{'Model': 'Without Preprocessing Dataset', 'Metric': 'R2 Score', 'Value': 0.0596},
{'Model': 'Without Preprocessing Dataset', 'Metric': 'MAE', 'Value': 196135.03},
{'Model': 'Without Preprocessing Dataset', 'Metric': 'MSE', 'Value': 679366480217.38},
{'Model': 'Using k-best features', 'Metric': 'R2 Score', 'Value': 0.05396},
{'Model': 'Using k-best features', 'Metric': 'MAE', 'Value': 204387.03},
{'Model': 'Using k-best features', 'Metric': 'MSE', 'Value': 683466401245.40},
{'Model': 'Using k-fold Cross Validation', 'Metric': 'R2 Score', 'Value': 0.5307},
{'Model': 'Using k-fold Cross Validation', 'Metric': 'MAE', 'Value': 159891.73},
{'Model': 'Using k-fold Cross Validation', 'Metric': 'MSE', 'Value': 68091180895.10},
]

# Convert the results list into a DataFrame


df = pd.DataFrame(results)
# Create a grouped bar plot for each metric
plt.figure(figsize=(12, 6))
sns.barplot(x='Metric', y='Value', hue='Model', data=df)
plt.title('Model Comparison for Different Metrics')
plt.ylabel('Metric Value')
plt.legend(title='Model')
plt.xticks(rotation=45)
plt.show()

Conclusion:
The model using k-fold cross-validation outperforms both the model using k-best features and the
model trained on the dataset without preprocessing in terms of R2 score, MAE, and MSE.
The model with k-fold cross-validation is generally considered better because it provides a
more robust and accurate evaluation of the model's performance across multiple data splits.

WEEK-9
AIM: Write a program to implement Non-linear Regression for a sample training dataset
stored as a .CSV file. Compute the Mean Squared Error by considering a few test datasets.

DESCRIPTION: Non-linear regression is a type of polynomial regression. It is a method
to model a non-linear relationship between the dependent and independent variables. It is
used when the data shows a curved trend, where linear regression would not produce very
accurate results compared to non-linear regression. Many different regressions exist that can
be used to fit whatever shape the dataset takes, such as quadratic or cubic regression, and so
on to higher degrees according to our requirement.
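
As a small self-contained sketch of the idea (synthetic data, separate from the demos below): fit a quadratic with numpy's polyfit and compute the Mean Squared Error on a held-out portion.

import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = np.linspace(-5, 5, 200)
y = 2 * x**2 - 3 * x + 1 + rng.normal(0, 4, size=x.size)   # noisy quadratic

idx = rng.permutation(x.size)                    # random train/test split
train, test = idx[:150], idx[150:]
coeffs = np.polyfit(x[train], y[train], deg=2)   # fit a degree-2 polynomial
y_pred = np.polyval(coeffs, x[test])
print("Test MSE:", mean_squared_error(y[test], y_pred))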

Program:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

LINEAR: (y=mx+c)

x = np.arange(-5.0, 5.0, 0.1)

## You can adjust the slope and intercept to verify the changes in the graph
y = 5*(x) + 9
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
#plt.figure(figsize=(8,6))
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

POLYNOMIAL: (P(x) = a_n x^n + a_(n-1) x^(n-1) + a_(n-2) x^(n-2) + ... + a_1 x + a_0)

x = np.arange(-5.0, 5.0, 0.1)


y = 1*(x**3) + 5*(x**2) + 2*x + 3
y_noise = 20 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

QUADRATIC: (y = x^2)

x = np.arange(-5.0, 5.0, 0.1)


y = np.power(x,2)
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

EXPONENTIAL: (y=e^x)

X = np.arange(-5.0, 5.0, 0.1)

## You can adjust the slope and intercept to verify the changes in the graph

Y= np.exp(X)

plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

LOGARITHMIC : (y = log(x))

X = np.arange(1.0, 10.0, 0.1)

Y = np.log(X)

plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

SIGMOIDAL/LOGISTIC: (Y = a + b / (1 + c^(X−d)))

X = np.arange(-5.0, 5.0, 0.1)

Y = 1-4/(1+np.power(3, X-2))

plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

data = pd.read_csv("/content/china_gdp.csv")
data.head()

   Year         Value
0  1960  5.918412e+10
1  1961  4.955705e+10
2  1962  4.668518e+10
3  1963  5.009730e+10
4  1964  5.906225e+10

plt.figure(figsize=(8,5))
x_data, y_data = (data["Year"].values, data["Value"].values)
plt.plot(x_data, y_data, 'go')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()

The above graph resembles a sigmoid/logistic function.

def sigmoid(x, Beta_1, Beta_2):
    # Beta_1 controls the steepness of the curve, Beta_2 its midpoint
    y = 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))
    return y

# Let's normalize our data
xdata = x_data / max(x_data)
ydata = y_data / max(y_data)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, xdata, ydata)
# Now we plot our resulting regression model.
x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
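
Since the fit was done on normalized values, a prediction can be converted back to GDP units by multiplying by the normalization constants (a small usage sketch using the popt obtained above):

gdp_2014 = sigmoid(2014 / max(x_data), *popt) * max(y_data)   # prediction rescaled to original GDP units
print("Predicted GDP for 2014: %.3e" % gdp_2014)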

# split data into train/test

msk = np.random.rand(len(data)) < 0.8
train_x = xdata[msk]
test_x = xdata[~msk]
train_y = ydata[msk]
test_y = ydata[~msk]
# build the model using train set
popt, pcov = curve_fit(sigmoid, train_x, train_y)
# predict using test set
y_pred = sigmoid(test_x, *popt)
# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_pred -
test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_pred - test_y)
** 2))
from sklearn.metrics import r2_score
print("R2-score: %.2f" % r2_score(y_pred , test_y))

Mean absolute error: 0.18


Residual sum of squares (MSE): 0.13
R2-score: -2255226626698815320889491456.00

WEEK-11
AIM: Write a program to implement the k-Nearest Neighbor algorithm to classify the iris
data set. Print both correct and wrong predictions.

DESCRIPTION: K-Nearest Neighbour (K-NN) is one of the simplest machine learning
algorithms, based on the supervised learning technique. The K-NN algorithm assumes
similarity between the new case/data and the available cases and puts the new case into the
category that is most similar to the available categories. K-NN stores all the available data
and classifies a new data point based on this similarity, which means that when new data
appears it can easily be classified into a well-suited category.
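
The voting idea behind K-NN can be shown in a few lines (a minimal sketch with made-up 2-D points, separate from the iris program below): compute distances from a query point to all stored points and take a majority vote among the k closest.

import numpy as np
from collections import Counter

X_stored = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
labels = np.array(['A', 'A', 'B', 'B'])
query = np.array([3.5, 3.9])
k = 3

dists = np.linalg.norm(X_stored - query, axis=1)    # Euclidean distance to every stored point
nearest = labels[np.argsort(dists)[:k]]             # labels of the k closest points
print(Counter(nearest).most_common(1)[0][0])        # majority vote -> 'B'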

Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
# imports used later in this program (train/test split, KNN classifier, metrics)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

iris = pd.read_csv("Iris.csv")
iris.head()
iris.isnull().sum()
Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

sns.boxplot(x="Species", y="SepalLengthCm", data=iris)

sns.FacetGrid(data=iris, hue="Species", height=5).map(plt.scatter, "SepalLengthCm", "SepalWidthCm").add_legend()

plt.figure(figsize=(8,6))
sns.heatmap(iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
'PetalWidthCm']].corr(),annot=True,cmap="YlGnBu")
plt.show()

X = iris.drop(['Id', 'Species'], axis=1)
y = iris['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)
from sklearn import metrics
# experimenting with different n values
k_range = list(range(1,26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

plt.plot(k_range, scores)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()

Based on the graph, we can take a k value of approximately 12.

knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X, y)  # note: fitting on the full dataset means the test points below were also seen during training, which explains the 100% accuracy

# Make predictions on the test data


y_pred = knn.predict(X_test)

# Calculate the accuracy of the model


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Print correct and wrong predictions


correct_predictions = X_test[y_test == y_pred]
wrong_predictions = X_test[y_test != y_pred]

print("\nCorrect Predictions:")
print(correct_predictions)

print("\nWrong Predictions:")

21131A4436 9
print(wrong_predictions)

# Print the confusion matrix


conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

Accuracy: 100.00%

Correct Predictions:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
82 5.8 2.7 3.9 1.2
134 6.1 2.6 5.6 1.4
114 5.8 2.8 5.1 2.4
42 4.4 3.2 1.3 0.2
109 7.2 3.6 6.1 2.5
57 4.9 2.4 3.3 1.0
1 4.9 3.0 1.4 0.2
70 5.9 3.2 4.8 1.8
25 5.0 3.0 1.6 0.2
84 5.4 3.0 4.5 1.5
66 5.6 3.0 4.5 1.5
133 6.3 2.8 5.1 1.5
102 7.1 3.0 5.9 2.1
107 7.3 2.9 6.3 1.8
26 5.0 3.4 1.6 0.4
23 5.1 3.3 1.7 0.5
123 6.3 2.7 4.9 1.8
130 7.4 2.8 6.1 1.9
21 5.1 3.7 1.5 0.4
12 4.8 3.0 1.4 0.1
71 6.1 2.8 4.0 1.3
128 6.4 2.8 5.6 2.1
48 5.3 3.7 1.5 0.2
72 6.3 2.5 4.9 1.5
88 5.6 3.0 4.1 1.3
148 6.2 3.4 5.4 2.3
74 6.4 2.9 4.3 1.3
96 5.7 2.9 4.2 1.3
63 6.1 2.9 4.7 1.4
132 6.4 2.8 5.6 2.2
39 5.1 3.4 1.5 0.2
53 5.5 2.3 4.0 1.3
79 5.7 2.6 3.5 1.0
10 5.4 3.7 1.5 0.2
50 7.0 3.2 4.7 1.4
49 5.0 3.3 1.4 0.2
43 5.0 3.5 1.6 0.6
135 7.7 3.0 6.1 2.3
40 5.0 3.5 1.3 0.3
115 6.4 3.2 5.3 2.3
142 5.8 2.7 5.1 1.9
69 5.6 2.5 3.9 1.1
17 5.1 3.5 1.4 0.3
46 5.1 3.8 1.6 0.2
54 6.5 2.8 4.6 1.5
126 6.2 2.8 4.8 1.8
61 5.9 3.0 4.2 1.5
124 6.7 3.3 5.7 2.1
117 7.7 3.8 6.7 2.2
20 5.4 3.4 1.7 0.2
146 6.3 2.5 5.0 1.9
35 5.0 3.2 1.2 0.2
6 4.6 3.4 1.4 0.3
15 5.7 4.4 1.5 0.4
28 5.2 3.4 1.4 0.2
97 6.2 2.9 4.3 1.3
56 6.3 3.3 4.7 1.6
81 5.5 2.4 3.7 1.0
98 5.1 2.5 3.0 1.1
149 5.9 3.0 5.1 1.8

Wrong Predictions:
Empty DataFrame
Columns: [SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm]
Index: []

Confusion Matrix:
[[20 0 0]
[ 0 21 0]
[ 0 0 19]]

