RollNo – 21131A4416
MACHINE INTELLIGENCE APPLICATIONS LAB
Week 07
Aim:
Write a program to perform Linear Discriminant Analysis for binary classification,
considering a real-time dataset.
Description:
Linear Discriminant Analysis (LDA) is a dimensionality reduction and
classification technique. It seeks to find a projection that maximizes the separation
between two classes while minimizing the variance within each class. By
transforming the data into a lower-dimensional space, LDA aims to enhance the
class separability, making it useful for binary classification tasks.
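As a rough illustration of what LDA optimizes, the within-class and between-class scatter matrices and the resulting projection direction can be computed directly with NumPy (a minimal sketch on synthetic two-class data; every name and value here is illustrative and not part of the lab program):

import numpy as np

# Two synthetic classes with two features each
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 0
X1 = rng.normal(loc=[3, 2], scale=1.0, size=(50, 2))   # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
m = np.vstack([X0, X1]).mean(axis=0)

# Within-class scatter: spread of each class around its own mean
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
# Between-class scatter: separation of the class means from the overall mean
Sb = len(X0) * np.outer(m0 - m, m0 - m) + len(X1) * np.outer(m1 - m, m1 - m)

# Fisher's discriminant direction maximizes between-class over within-class scatter
w = np.linalg.inv(Sw) @ (m0 - m1)
print("Projection direction:", w / np.linalg.norm(w))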
Dataset – 'Iris.csv'
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
dataset = pd.read_csv("/content/Iris.csv")
# Perform one-hot encoding for the 'Species' column
encoder = OneHotEncoder()
encoded_species = encoder.fit_transform(dataset[['Species']]).toarray()
encoded_species_df = pd.DataFrame(encoded_species, columns=encoder.get_feature_names_out(['Species']))
dataset = pd.concat([dataset, encoded_species_df], axis=1)
# Drop the original 'Species' column after encoding
dataset.drop(['Species'], axis=1, inplace=True)
X = dataset.iloc[:, :-3].values  # feature matrix
y = dataset.iloc[:, -3:].values  # target variable (one-hot encoded species)
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
# Flatten the y array for LabelEncoder
y = y.argmax(axis=1)
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='rainbow', alpha=0.7, edgecolors='b')
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print('Accuracy : ' + str(accuracy_score(y_test, y_pred)))
conf_m = confusion_matrix(y_test, y_pred)
print(conf_m)
Output:
Accuracy : 0.9333333333333333
[[13 0 0]
[ 1 6 1]
[ 0 0 9]]
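The confusion matrix can also be summarized per class (a small follow-up sketch using the conf_m array computed above; per-class recall is the diagonal entry divided by that row's total):

# Per-class recall: correctly classified samples divided by the true count of each class
per_class_recall = np.diag(conf_m) / conf_m.sum(axis=1)
print('Per-class recall:', per_class_recall)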
WEEK-8
AIM: Write a program to implement linear regression for a sample training dataset stored as a .CSV file.
DESCRIPTION: Linear regression is a type of supervised machine learning algorithm
that computes the linear relationship between a dependent variable and one or more
independent features. When there is a single independent feature it is known as
univariate linear regression, and when there is more than one feature it is known as
multivariate linear regression. The goal of the algorithm is to find the best linear equation
that can predict the value of the dependent variable from the independent variables. The
equation defines a straight line that represents the relationship between the dependent and
independent variables; the slope of the line indicates how much the dependent variable
changes for a unit change in the independent variable(s).
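To make the "best linear equation" idea concrete, the ordinary least-squares solution for a single feature can be computed in closed form (a minimal sketch on made-up data, separate from the house-price program below):

import numpy as np

# Made-up univariate data: y is roughly 3x + 2 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, 50)

# Ordinary least squares on the design matrix [x, 1] solves for slope and intercept
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"slope={slope:.2f}, intercept={intercept:.2f}")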
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("/content/data.csv")
data.head()
data= data.drop(['date','street','city','statezip','country'], axis=1)
data.isnull().sum()
data.info()
data.describe()
Exploratory Data Analysis:
import warnings
warnings.filterwarnings('ignore')
sns.distplot(data["price"], hist=True, kde=True)
plt.xlabel("Price")
plt.ylabel("Density")
plt.title("Density of price")
plt.legend(["Price"])
plt.show()
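Note: sns.distplot is deprecated in recent seaborn releases; sns.histplot(data["price"], kde=True) is the current equivalent and produces the same kind of plot.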
print("Standard Deviation",np.std(data["price"]))
sns.pairplot(data)
sns.heatmap(data.corr(),annot=True)
x = data[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
'floors', 'waterfront', 'view', 'condition', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated']]
y = data['price']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3, random_state= 42)
Building a Model:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(x_train, y_train)
y_pred = LR.predict(x_test)
from sklearn import metrics
from sklearn.metrics import r2_score
R2 = r2_score(y_test, y_pred)
print("R2Score:",R2)
MAE = metrics.mean_absolute_error(y_test, y_pred)
MSE = metrics.mean_squared_error(y_test,y_pred)
print("Mean absolute Error",MAE)
print("Mean squared Error",MSE)
R2Score: 0.05963787830481415
Mean absolute Error 196135.02968581286
Mean squared Error 679366480217.377
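The same metrics can also be computed by hand from their definitions, which makes clear what the numbers above measure (a small sketch reusing y_test and y_pred from the code above):

import numpy as np

residuals = np.asarray(y_test) - y_pred
mae_manual = np.mean(np.abs(residuals))          # mean absolute error
mse_manual = np.mean(residuals ** 2)             # mean squared error
r2_manual = 1 - mse_manual / np.var(y_test)      # 1 - SS_res / SS_tot
print(mae_manual, mse_manual, r2_manual)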
print('Intercept of the model:',LR.intercept_)
print('Coefficient of the line:',LR.coef_)
Intercept of the model: 5100419.087753873
Coefficient of the line: [-6.96263921e+04  2.99215643e+04  1.93470477e+02 -4.90437951e-01
  7.17703866e+04  4.07818603e+05  4.10106223e+04  2.90841996e+04
  1.00645260e+02  9.28252165e+01 -2.65359055e+03  5.49482827e+00]
coefficients = LR.coef_
feature_names = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
'floors', 'waterfront', 'view', 'condition', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated']
plt.figure(figsize=(10, 6))
plt.barh(feature_names, coefficients, color='skyblue')
plt.xlabel('Coefficient Value')
plt.ylabel('Features')
plt.title('Coefficients of Linear Regression Model')
plt.gca().invert_yaxis()
plt.show()
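The next lines use a SelectKBest object named k_best whose definition does not appear above; a plausible definition is sketched here (the scoring function and the value of k are assumptions, not taken from the original program):

from sklearn.feature_selection import SelectKBest, f_regression

# Assumed definition: univariate regression scoring, keeping the k best features (k=8 is a guess)
k_best = SelectKBest(score_func=f_regression, k=8)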
k_best.fit(x_train, y_train)
x_train.columns[k_best.get_support()]
x_train_selected = x_train[x_train.columns[k_best.get_support()]]
x_test_selected = x_test[x_train.columns[k_best.get_support()]]
LR = LinearRegression()
LR.fit(x_train_selected, y_train)
y_pred1 = LR.predict(x_test_selected)
from sklearn import metrics
from sklearn.metrics import r2_score
R2 = r2_score(y_test, y_pred1)
print("R2Score:",R2)
MAE = metrics.mean_absolute_error(y_test, y_pred1)
MSE = metrics.mean_squared_error(y_test,y_pred1)
print("Mean absolute Error",MAE)
print("Mean squared Error",MSE)
R2Score: 0.0539628699713669
Mean absolute Error 204387.02881027036
Mean squared Error 683466401245.4042
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold
# Create a Linear Regression model
model = LinearRegression()
# Define the number of folds (in this case, 10)
n_folds = 10
# Create a KFold object
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
# Initialize lists to store evaluation metrics for each fold
mse_scores = []
mae_scores = []
R2_scores = []
# Split your data into folds and perform cross-validation
for train_index, val_index in kf.split(x_train):
    x_train_fold, x_val_fold = x_train.iloc[train_index], x_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
    # Train the model on the training fold
    model.fit(x_train_fold, y_train_fold)
    # Make predictions on the validation fold
    y_val_pred = model.predict(x_val_fold)
    # Calculate Mean Squared Error (MSE), Mean Absolute Error (MAE) and R2 for this fold
    mse = mean_squared_error(y_val_fold, y_val_pred)
    mae = mean_absolute_error(y_val_fold, y_val_pred)
    R2 = r2_score(y_val_fold, y_val_pred)
    # Append the scores to the lists
    mse_scores.append(mse)
    mae_scores.append(mae)
    R2_scores.append(R2)
# Calculate the average of MSE, MAE and R2 across all folds
average_mse = sum(mse_scores) / len(mse_scores)
average_mae = sum(mae_scores) / len(mae_scores)
average_R2 = sum(R2_scores) / len(R2_scores)
# Print the average scores
print("Average Mean Squared Error (MSE):", average_mse)
print("Average Mean Absolute Error (MAE):", average_mae)
print("Average R2Score:", average_R2)
Average Mean Squared Error (MSE): 68091180895.09597
Average Mean Absolute Error (MAE): 159891.7275636576
Average R2Score: 0.5306868413035454
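For reference, scikit-learn can run the same evaluation in a single call with cross_val_score; reusing the KFold object defined above should reproduce the fold-wise MSE values from the manual loop (a compact sketch):

from sklearn.model_selection import cross_val_score

# scikit-learn maximizes scores, so MSE is returned negated
neg_mse = cross_val_score(LinearRegression(), x_train, y_train,
                          cv=kf, scoring='neg_mean_squared_error')
print("Average MSE:", -neg_mse.mean())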
import pandas as pd
# Example: List of results for different models
results = [
    {'Model': 'Without Preprocessing Dataset', 'R2 Score': 0.05963787830481415, 'MAE': 196135.02968581286, 'MSE': 679366480217.377},
    {'Model': 'Using k-best features', 'R2 Score': 0.0539628699713669, 'MAE': 204387.02881027036, 'MSE': 683466401245.4042},
    {'Model': 'Using k-fold Cross Validation', 'R2 Score': 0.5306868413035454, 'MAE': 159891.7275636576, 'MSE': 68091180895.09597},
]
# Convert the results list into a DataFrame
df = pd.DataFrame(results)
# Calculate the correlation matrix
correlation_matrix = df.corr()
# Print the correlation matrix
print(correlation_matrix)
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Model Metrics")
plt.show()
          R2 Score       MAE       MSE
R2 Score  1.000000 -0.986445 -0.999990
MAE      -0.986445  1.000000  0.985682
MSE      -0.999990  0.985682  1.000000
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Example: Results for three models
results = [
{'Model': 'Without Preprocessing Dataset', 'Metric': 'R2 Score', 'Value': 0.0596},
{'Model': 'Without Preprocessing Dataset', 'Metric': 'MAE', 'Value': 196135.03},
{'Model': 'Without Preprocessing Dataset', 'Metric': 'MSE', 'Value': 679366480217.38},
{'Model': 'Using k-best features', 'Metric': 'R2 Score', 'Value': 0.05396},
{'Model': 'Using k-best features', 'Metric': 'MAE', 'Value': 204387.03},
{'Model': 'Using k-best features', 'Metric': 'MSE', 'Value': 683466401245.40},
{'Model': 'Using k-fold Cross Validation', 'Metric': 'R2 Score', 'Value': 0.5307},
{'Model': 'Using k-fold Cross Validation', 'Metric': 'MAE', 'Value': 159891.73},
{'Model': 'Using k-fold Cross Validation', 'Metric': 'MSE', 'Value': 68091180895.10},
]
# Convert the results list into a DataFrame
df = pd.DataFrame(results)
# Create a grouped bar plot for each metric
plt.figure(figsize=(12, 6))
sns.barplot(x='Metric', y='Value', hue='Model', data=df)
plt.title('Model Comparison for Different Metrics')
plt.ylabel('Metric Value')
plt.legend(title='Model')
plt.xticks(rotation=45)
plt.show()
Conclusion:
The model evaluated with k-fold cross-validation outperforms both the model using k-best features and
the model trained on the unpreprocessed dataset in terms of R2 Score, MAE, and MSE.
The model with k-fold cross-validation is generally considered better because it provides a
more robust and accurate evaluation of the model's performance across multiple data splits.
WEEK-9
AIM: Write a program to implement the Non-linear Regression for a sample training dataset
stored as a .CSV file. Compute the Mean Squared Error by considering a few test data sets.
DESCRIPTION: Polynomial regression is a common form of non-linear regression. Non-linear regression
is a method to model a non-linear relationship between the dependent and independent variables. It is
used when the data shows a curvy trend and linear regression would not produce very accurate results.
Many different regression curves exist (quadratic, cubic, and so on to higher degrees) that can be
chosen to fit whatever shape the dataset takes, according to our requirement.
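As a concrete example of the idea, a cubic trend can be fitted by least squares with numpy.polyfit (a minimal sketch on synthetic data; the degree is chosen by inspection here):

import numpy as np

# Synthetic curvy data: a cubic plus noise
x = np.arange(-5.0, 5.0, 0.1)
y = 1 * x**3 + 5 * x**2 + 2 * x + 3 + 20 * np.random.normal(size=x.size)

# Fit a degree-3 polynomial and evaluate it on the same points
coeffs = np.polyfit(x, y, deg=3)
y_fit = np.polyval(coeffs, x)
print("Fitted coefficients:", coeffs)
print("MSE:", np.mean((y - y_fit) ** 2))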
Program:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
LINEAR: (y=mx+c)
x = np.arange(-5.0, 5.0, 0.1)
## You can adjust the slope and intercept to verify the changes in the graph
y = 5*(x) + 9
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
#plt.figure(figsize=(8,6))
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
POLYNOMIAL: (P(x) = a_n x^n + a_(n-1) x^(n-1) + a_(n-2) x^(n-2) + ... + a_1 x + a_0)
x = np.arange(-5.0, 5.0, 0.1)
y = 1*(x**3) + 5*(x**2) + 2*x + 3
y_noise = 20 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
QUADRATIC: (y = x^2)
x = np.arange(-5.0, 5.0, 0.1)
y = np.power(x,2)
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
EXPONENTIAL: (y=e^x)
X = np.arange(-5.0, 5.0, 0.1)
## You can adjust the slope and intercept to verify the changes in the graph
Y= np.exp(X)
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
LOGARITHMIC : (y = log(x))
X = np.arange(1.0, 10.0, 0.1)
Y = np.log(X)
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
SIGMOIDAL/LOGISTIC: (Y = a + b/(1 + c^(X−d)))
X = np.arange(-5.0, 5.0, 0.1)
Y = 1-4/(1+np.power(3, X-2))
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
data = pd.read_csv("/content/china_gdp.csv")
data.head()
   Year         Value
0  1960  5.918412e+10
1  1961  4.955705e+10
2  1962  4.668518e+10
3  1963  5.009730e+10
4  1964  5.906225e+10
plt.figure(figsize=(8,5))
x_data, y_data = (data["Year"].values, data["Value"].values)
plt.plot(x_data, y_data, 'go')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
The above graph resembles a sigmoid/logistic function.
def sigmoid(x, Beta_1, Beta_2):
    y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))
    return y
# Let's normalize our data
xdata =x_data/max(x_data)
ydata =y_data/max(y_data)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, xdata, ydata)
# Now we plot our resulting regression model.
x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
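curve_fit can be sensitive to its starting values; if the fit fails to converge, rough initial guesses for Beta_1 and Beta_2 can be passed through p0 (the values below are illustrative guesses for the normalized data, not taken from the original run):

# Optional: supply starting values for Beta_1 and Beta_2 to help convergence
popt, pcov = curve_fit(sigmoid, xdata, ydata, p0=[5.0, 0.99])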
# split data into train/test
msk = np.random.rand(len(data)) < 0.8
train_x = xdata[msk]
test_x = xdata[~msk]
train_y = ydata[msk]
test_y = ydata[~msk]
# build the model using train set
popt, pcov = curve_fit(sigmoid, train_x, train_y)
# predict using test set
y_pred = sigmoid(test_x, *popt)
# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_pred -
test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_pred - test_y)
** 2))
from sklearn.metrics import r2_score
print("R2-score: %.2f" % r2_score(y_pred , test_y))
Mean absolute error: 0.18
Residual sum of squares (MSE): 0.13
R2-score: -2255226626698815320889491456.00
WEEK-11
AIM: Write a program to implement the k-Nearest Neighbor algorithm to classify the iris data
set. Print both correct and wrong predictions.
DESCRIPTION: K-Nearest Neighbour (K-NN) is one of the simplest machine learning
algorithms, based on the supervised learning technique. The K-NN algorithm assumes similarity
between the new case/data and the available cases and puts the new case into the category that is
most similar to the available categories. K-NN stores all the available data and classifies a new
data point based on its similarity to that data, so when new data appears it can easily be assigned
to a well-suited category.
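At its core, K-NN is just a distance computation followed by a majority vote, which can be written in a few lines (a minimal NumPy sketch, independent of the scikit-learn classifier used in the program below):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(np.asarray(X_train) - x_query, axis=1)
    # Indices of the k closest neighbours
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]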
Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
iris = pd.read_csv("Iris.csv")
iris.head()
iris.isnull().sum()
Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
sns.boxplot(x="Species", y="SepalLengthCm", data=iris)
sns.FacetGrid(data=iris, hue="Species", height=5).map(plt.scatter, "SepalLengthCm", "SepalWidthCm").add_legend()
plt.figure(figsize=(8,6))
sns.heatmap(iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].corr(), annot=True, cmap="YlGnBu")
plt.show()
X = iris.drop(['Id', 'Species'], axis=1)
y = iris['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)
from sklearn import metrics
# experimenting with different n values
k_range = list(range(1,26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))
plt.plot(k_range, scores)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()
Based on the graph, we can take the value of k to be approximately 12.
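A more systematic way to choose k is cross-validation; a short sketch using GridSearchCV is given here (the parameter grid and number of folds are illustrative choices, not part of the original program):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 26))},
                    cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print("Best k:", grid.best_params_['n_neighbors'])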
knn = KNeighborsClassifier(n_neighbors=12)
# Note: the model is fitted on the full dataset (X, y), so the test points used
# below were already seen during training, which inflates the reported accuracy
knn.fit(X, y)
# Make predictions on the test data
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Print correct and wrong predictions
correct_predictions = X_test[y_test == y_pred]
wrong_predictions = X_test[y_test != y_pred]
print("\nCorrect Predictions:")
print(correct_predictions)
print("\nWrong Predictions:")
print(wrong_predictions)
# Print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
Accuracy: 100.00%
Correct Predictions:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
82 5.8 2.7 3.9 1.2
134 6.1 2.6 5.6 1.4
114 5.8 2.8 5.1 2.4
42 4.4 3.2 1.3 0.2
109 7.2 3.6 6.1 2.5
57 4.9 2.4 3.3 1.0
1 4.9 3.0 1.4 0.2
70 5.9 3.2 4.8 1.8
25 5.0 3.0 1.6 0.2
84 5.4 3.0 4.5 1.5
66 5.6 3.0 4.5 1.5
133 6.3 2.8 5.1 1.5
102 7.1 3.0 5.9 2.1
107 7.3 2.9 6.3 1.8
26 5.0 3.4 1.6 0.4
23 5.1 3.3 1.7 0.5
123 6.3 2.7 4.9 1.8
130 7.4 2.8 6.1 1.9
21 5.1 3.7 1.5 0.4
12 4.8 3.0 1.4 0.1
71 6.1 2.8 4.0 1.3
128 6.4 2.8 5.6 2.1
48 5.3 3.7 1.5 0.2
72 6.3 2.5 4.9 1.5
88 5.6 3.0 4.1 1.3
148 6.2 3.4 5.4 2.3
74 6.4 2.9 4.3 1.3
96 5.7 2.9 4.2 1.3
63 6.1 2.9 4.7 1.4
132 6.4 2.8 5.6 2.2
39 5.1 3.4 1.5 0.2
53 5.5 2.3 4.0 1.3
79 5.7 2.6 3.5 1.0
10 5.4 3.7 1.5 0.2
50 7.0 3.2 4.7 1.4
49 5.0 3.3 1.4 0.2
43 5.0 3.5 1.6 0.6
135 7.7 3.0 6.1 2.3
40 5.0 3.5 1.3 0.3
115 6.4 3.2 5.3 2.3
142 5.8 2.7 5.1 1.9
69 5.6 2.5 3.9 1.1
17 5.1 3.5 1.4 0.3
46 5.1 3.8 1.6 0.2
54 6.5 2.8 4.6 1.5
126 6.2 2.8 4.8 1.8
61 5.9 3.0 4.2 1.5
124 6.7 3.3 5.7 2.1
117 7.7 3.8 6.7 2.2
20 5.4 3.4 1.7 0.2
146 6.3 2.5 5.0 1.9
35 5.0 3.2 1.2 0.2
6 4.6 3.4 1.4 0.3
15 5.7 4.4 1.5 0.4
28 5.2 3.4 1.4 0.2
97 6.2 2.9 4.3 1.3
56 6.3 3.3 4.7 1.6
81 5.5 2.4 3.7 1.0
98 5.1 2.5 3.0 1.1
149 5.9 3.0 5.1 1.8
Wrong Predictions:
Empty DataFrame
Columns: [SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm]
Index: []
Confusion Matrix:
[[20 0 0]
[ 0 21 0]
[ 0 0 19]]