Machine Learning Lab
1. Develop a program to create histograms for all numerical features and analyze the distribution of
each feature. Generate box plots for all numerical features and identify any outliers. Use the
California Housing dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# Load the dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
data['MedHouseVal'] = california_housing.target # Add the target variable to the dataframe
# Plot histograms for all numerical features
data.hist(bins=30, figsize=(15, 10))
plt.suptitle('Histograms of Numerical Features')
plt.show()
# Plot box plots for all numerical features
plt.figure(figsize=(15, 10))
for i, column in enumerate(data.columns):
    plt.subplot(3, 3, i + 1)  # 9 columns fit a 3 x 3 grid
    sns.boxplot(y=data[column])
    plt.title(column)
plt.suptitle('Box Plots of Numerical Features')
plt.tight_layout(rect=[0, 0, 1, 0.96])  # leave room at the top for the suptitle
plt.show()
# Identify outliers using the IQR method
for column in data.columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Rows whose value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"Outliers in {column}: {len(outliers)}")
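The IQR rule used in the loop above can be checked by hand on a small example. The following is a minimal sketch, not part of the lab program: the helper name iqr_outlier_summary and the toy DataFrame are assumptions made for illustration, and the toy values are not taken from the California Housing dataset.

```python
import pandas as pd


def iqr_outlier_summary(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column.

    Hypothetical helper for illustration; it applies the same rule as the
    loop in the lab program and collects the results in one table.
    """
    rows = []
    for column in df.columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        mask = (df[column] < lower) | (df[column] > upper)
        rows.append({
            "feature": column,
            "lower_bound": lower,
            "upper_bound": upper,
            "n_outliers": int(mask.sum()),
            "pct_outliers": 100.0 * mask.mean(),
        })
    return pd.DataFrame(rows).set_index("feature")


# Toy data (made up, not the housing dataset): column "a" has one extreme value.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18],
})
summary = iqr_outlier_summary(toy)
print(summary)
```

For column "a", Q1 = 3 and Q3 = 7, so the IQR is 4 and the bounds are -3 and 13; only the value 100 is flagged. Column "b" has no value outside its bounds. Passing the housing DataFrame in place of the toy one should give the same counts as the printed loop, along with the bounds and the percentage of rows flagged.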
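Output: histograms of all features, a 3 x 3 grid of box plots, and the printed outlier count per feature (figures not reproduced here).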
Dept. of CSE, SJMIT Chitradurga
2. Develop a program to compute the correlation matrix to understand the relationships between
pairs of features. Visualize the correlation matrix using a heatmap to identify which variables have
strong positive or negative correlations. Create a pair plot to visualize pairwise relationships between
features. Use the California Housing dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# Load the dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
data['MedHouseVal'] = california_housing.target # Add the target variable to the dataframe
# Compute the correlation matrix
correlation_matrix = data.corr()
print("Correlation Matrix:")
print(correlation_matrix)
# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
# Create a pair plot
sns.pairplot(data)
plt.suptitle('Pair Plot of Numerical Features', y=1.02)
plt.show()
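The heatmap shows every pair twice and includes the diagonal of 1.0 values, so the strongest relationships can be easier to read from a ranked list. The following is a minimal sketch under stated assumptions: the helper name strongest_pairs and the toy columns are hypothetical, and the toy data is constructed so that the expected ranking is obvious. It is not taken from the housing dataset.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, top_n: int = 3) -> pd.Series:
    """Return the top_n feature pairs ranked by absolute Pearson correlation.

    Hypothetical helper for illustration only.
    """
    corr = df.corr()
    # Keep the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Rank by magnitude but keep the sign in the returned values.
    ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
    return ranked.head(top_n)


# Toy data (made up): "pos" rises with "x", "neg" falls with "x",
# and "noise" alternates in sign so it is only weakly related to the rest.
x = np.arange(10.0)
toy = pd.DataFrame({
    "x": x,
    "pos": 2 * x + 1,
    "neg": -x,
    "noise": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
})
top = strongest_pairs(toy, top_n=3)
print(top)
```

On the toy data, the three pairs among "x", "pos", and "neg" come out on top with correlations of magnitude close to 1, and the pairs involving "neg" carry a negative sign. Calling strongest_pairs(data) on the housing DataFrame would list its most strongly related feature pairs in the same way.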
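Output: the printed correlation matrix, the correlation heatmap, and the pair plot (figures not reproduced here).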