LO3 Develop a machine learning
application using an appropriate
programming language or
machine learning tool for solving
a real-world problem
RAJAD SHAKYA
Data Analysis
● process of examining, cleaning, transforming, and
modeling data to discover useful information, draw
conclusions, and support decision-making.
● Data Cleaning (Handling Missing Values, Outlier
Detection)
● Data Transformation (Normalization/Scaling,
Encoding Categorical Variables)
NumPy
● Numerical Python is a powerful library for numerical
computations in Python.
● provides support for arrays, matrices, and a variety
of mathematical functions to operate on these data
structures
● pip install numpy.
NumPy
● import numpy as np
● np.__version__
● np.info(np.logspace)
NumPy Arrays
● Data Type: Homogeneous (all elements must be of
the same data type, e.g., integers, floats).
● Performance: Significantly faster than lists for
numerical computations (see the timing sketch below)
● Efficient use of memory because of homogeneous
data types and contiguous memory allocation.
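As a rough illustration of this speed difference, the sketch below times a Python-list sum against np.sum on an equivalent array (exact numbers depend on the machine):
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.perf_counter()
sum(py_list)                  # element-by-element Python loop
list_time = time.perf_counter() - start

start = time.perf_counter()
np.sum(np_arr)                # vectorized loop over contiguous memory
numpy_time = time.perf_counter() - start

print("list sum: ", list_time)
print("numpy sum:", numpy_time)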
Creating Arrays
import numpy as np
# Creating arrays from lists
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr1)
print(arr2)
Creating Arrays
zeros = np.zeros((2, 3)) # Array of zeros
ones = np.ones((2, 3)) # Array of ones
arange = np.arange(0, 10, 2) # Array with a range
linspace = np.linspace(0, 1, 5) # Array with linearly spaced values
Array Attributes
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Array Shape:", arr.shape)
print("Array Dimensions:", arr.ndim)
print("Array Size:", arr.size)
print("Array Data Type:", arr.dtype)
Array Indexing and Slicing
arr = np.array([1, 2, 3, 4, 5])
# Indexing
print(arr[0]) # First element
print(arr[-1]) # Last element
# Slicing
print(arr[1:4]) # Elements from index 1 to 3
print(arr[:3]) # First three elements
print(arr[::2]) # Every second element
Array Indexing and Slicing
# Multi-dimensional array indexing
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2[0, 1])
# Element at first row, second column
print(arr2[:, 1])
# All elements in second column
Array Manipulation
arr = np.array([[1, 2, 3], [4, 5, 6]])
reshaped = arr.reshape((3, 2))
print(reshaped)
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
concatenated = np.concatenate((arr1, arr2))
print(concatenated)
Array Manipulation
# Split
split = np.split(concatenated, 2)
print(split)
# Flatten
flattened = arr.flatten()
print(flattened)
Array Operations
arr = np.array([1, 2, 3, 4])
# Element-wise operations
print(arr + 2) # Add 2 to each element
print(arr * 3) # Multiply each element by 3
print(arr ** 2) # Square each element
Array Operations
print(np.sum(arr)) # Sum of all elements
print(np.mean(arr)) # Mean of elements
print(np.min(arr)) # Minimum element
print(np.max(arr)) # Maximum element
print(np.std(arr)) # Standard deviation
Statistical and Mathematical Functions
arr = np.array([1, 2, 3, 4, 5])
# Statistical functions
print(np.mean(arr)) # Mean
print(np.median(arr)) # Median
print(np.var(arr)) # Variance
# Mathematical functions
print(np.sin(arr)) # Sine of each element
print(np.log(arr)) # Natural logarithm of each element
print(np.exp(arr)) # Exponential of each element
Broadcasting
arr1 = np.array([1, 2, 3])
arr2 = np.array([[4], [5], [6]])
# Broadcasting addition
result = arr1 + arr2
print(result)
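Broadcasting stretches the smaller array across the larger one when the shapes are compatible. A common data-preparation use is centering each column of a matrix by subtracting its column means (the values below are made up for illustration):
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
col_means = X.mean(axis=0)     # shape (3,)
X_centered = X - col_means     # (3, 3) minus (3,) broadcasts across rows
print(X_centered)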
Random Module
# Generating random numbers
random_num = np.random.rand(5)
# 5 random numbers between 0 and 1
print(random_num)
Random Module
# Random integers
random_ints = np.random.randint(1, 10, size=5)
# 5 random integers between 1 and 9 (the upper bound 10 is exclusive)
print(random_ints)
Random Module
np.random.seed(10)              # fix the seed so results are reproducible
a1 = np.random.randint(1, 10, (3, 3))
a2 = np.random.randint(1, 10, (3, 3))
np.dot(a1, a2)                  # matrix product of a1 and a2
a1.T                            # transpose of a1
Linear Algebra
A = np.array([[1, 2], [3, 4]])   # example matrices, assumed here so the snippet runs
B = np.array([[5, 6], [7, 8]])
det_A = np.linalg.det(A)
# Determinant of A
I = A * B
# Multiply corresponding elements of A and B
J = A / B
# Divide corresponding elements of A by B
K = np.sqrt(A)
# Square root of each element in A
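Note that A * B above multiplies corresponding elements; it is not matrix multiplication. A short sketch with the same example matrices, contrasting it with the matrix product and two other np.linalg routines:
matrix_prod = A @ B                        # matrix multiplication (same as np.dot(A, B))
A_inv = np.linalg.inv(A)                   # inverse of A (A must be square and non-singular)
x = np.linalg.solve(A, np.array([1, 2]))   # solve the linear system A x = b
print(matrix_prod)
print(A_inv)
print(x)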
Aggregations
sum_A = np.sum(A)
mean_A = np.mean(A)
max_A = np.max(A)
min_A = np.min(A)
sum_axis0 = np.sum(A, axis=0)
# Sum along axis 0 (columns)
mean_axis1 = np.mean(B, axis=1)
# Mean along axis 1 (rows)
Cheatsheet
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
More
● np.eye(5)
● x = np.array([[0,1],
[2,3]])
● np.sum(x,axis=1)
● np.sqrt
● np.exp
Matplotlib
● Matplotlib is a comprehensive library for creating
static, animated, and interactive visualizations in
Python.
● useful for generating plots and charts
● pip install matplotlib
Matplotlib
● import matplotlib.pyplot as plt
● Line Plot
○ plt.plot().
○ x = [1, 2, 3, 4, 5]
○ y = [2, 3, 5, 7, 11]
○
○ plt.plot(x, y)
Matplotlib
● plt.title('Simple Line Plot')
● plt.xlabel('X-axis')
● plt.ylabel('Y-axis')
● plt.show()
● plt.plot(x, y, color='red')
● plt.plot(x, y, color='#FF5733')
Matplotlib
● plt.plot(x, y, linestyle='--') # Dashed line
● plt.plot(x, y, linestyle='-.') # Dash-dot line
● plt.plot(x, y, linestyle=':') # Dotted line
● plt.plot(x, y, marker='o') # Circle marker
● plt.plot(x, y, marker='s') # Square marker
● plt.plot(x, y, marker='^') # Triangle up marker
Matplotlib
● plt.plot(x, y, color='green', linestyle='--',
linewidth=2, marker='o', markersize=8)
● plt.plot(x, y, label='Prime Numbers')
● plt.legend()
Matplotlib
● x = [1, 2, 3, 4, 5]
● y = [2, 3, 5, 7, 11]
● plt.scatter(x, y)
● plt.title('Simple Scatter Plot')
● plt.xlabel('X-axis')
● plt.ylabel('Y-axis')
● plt.show()
Matplotlib
● plt.scatter(x, y, s=100)
● plt.scatter(x, y, color='blue')
● plt.scatter(x, y, c='green')
# Using 'c' as shorthand for color
● plt.scatter(x, y, c='#FF5733')
# Using HEX color codes
Matplotlib
● plt.scatter(x, y, marker='^') # Triangle up marker
Shortcuts
● plt.plot(X,Y,'r--')
● plt.show()
●
● # ro  -> red circle markers
● # r-  -> solid red line
● # r-- -> dashed red line
● # r:  -> dotted red line
Shortcuts
● X = np.arange(-16,16)
● Y = X **3
● plt.plot(X,Y,'b^-',linewidth=4.5)
Shortcuts
● X = np.arange(0,10,0.4)
● plt.plot(X,X,'r--',X,X**2,'bs',X,X**3,'g^')
● plt.show()
Shortcuts
● X = np.arange(0,10,0.4)
●
● plt.plot(X,X,'r--',label="x vs x")
● plt.plot(X,X**2,'bs',label="x vs x**2")
● plt.plot(X,X**3,'g^',label="x vs x**3")
● plt.xlabel("this is value of x")
● plt.ylabel("this is value of y")
● plt.title("graph for x vs x,x2,x3")
● plt.legend()
● plt.grid(True)
● plt.show()
Shortcuts
● Create a graph for each of the following equations
(all in one figure as well as individually); a starter sketch follows below.
1. Y = 4X^3 + 3X^2
2. Y = 3X^2 + 2X
3. Y = X + 5
(Assume any data range using np.arange)
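One possible starter sketch for this exercise (the arange range below is just an assumed choice):
import numpy as np
import matplotlib.pyplot as plt

X = np.arange(-5, 5, 0.1)            # assumed data range
Y1 = 4 * X**3 + 3 * X**2
Y2 = 3 * X**2 + 2 * X
Y3 = X + 5

# All three equations in one figure
plt.plot(X, Y1, 'r--', label='Y = 4X^3 + 3X^2')
plt.plot(X, Y2, 'b-', label='Y = 3X^2 + 2X')
plt.plot(X, Y3, 'g^', label='Y = X + 5')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(True)
plt.show()

# Individually, e.g. the first equation
plt.plot(X, Y1, 'r-')
plt.title('Y = 4X^3 + 3X^2')
plt.show()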
Bar Plot
● used to represent categorical data with rectangular bars.
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 10]
plt.bar(categories, values)
plt.title('Basic Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
Bar Plot
● plt.barh(categories, values)
● plt.bar(categories, values,
color='skyblue',
edgecolor='black',
linewidth=1.5,
width=0.6)
Sub Plot
X = np.array([1,2,3,4])
plt.subplot(2,1,1)
plt.plot(X,X**2)
plt.subplot(2,1,2)
plt.plot(X,X**3)
Sub Plot
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
# First subplot
axs[0, 0].plot([1, 2, 3], [4, 5, 6], color='blue', linewidth=2)
axs[0, 0].set_title('First Plot')
axs[0, 1].plot([1, 2, 3], [6, 5, 4], color='green', linewidth=1.5)
axs[0, 1].set_title('Second Plot')
axs[1, 0].bar(['A', 'B', 'C'], [5, 6, 7], color='red')
axs[1, 1].scatter([1, 2, 3], [4, 5, 6], color='purple', marker='o')
plt.tight_layout()
plt.show()
Histogram
● used to represent the distribution of numerical data
by dividing it into bins.
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
plt.hist(data, bins=5) # bins can also be given as an array of bin edges
plt.title('Basic Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Histogram
● plt.hist(data, bins=5, color='lightgreen',
edgecolor='black', linewidth=1.5)
● Line Width: linewidth
● Line Style: linestyle (solid, dashed, etc.)
● Color: color
● Marker: marker (o, s, ^, etc.)
Box plot
np.random.seed(10)
data = np.random.randn(100)
# Create box plot
plt.boxplot(data)
plt.title('Box Plot')
plt.ylabel('Values')
plt.show()
Pandas
● powerful Python library specifically designed for
data manipulation and analysis.
● offers two primary data structures:
○ Series (one-dimensional) and
○ DataFrame (two-dimensional)
Series
● Like a single column of a spreadsheet: a one-dimensional labeled array.
● collection of elements all of the same data type
along with an index (labels) to identify each
element.
● import pandas as pd
● data = [10, 20, 30, 40]
● my_series = pd.Series(data, index=['Apple',
'Banana', 'Cherry', 'Mango'])
● print(my_series)
DataFrame
● full spreadsheet or a two-dimensional table.
● consists of multiple Series (columns) with
potentially different data types and a set of labels
(index) for rows.
● List- or dictionary-based initialization (see the sketch below)
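As a sketch of dictionary-based initialization (the names and values below are made up, but the columns match those used on the following slides):
import pandas as pd

data = {
    'Name':  ['Asha', 'Bibek', 'Chandra', 'Dawa'],
    'Age':   [24, 31, 28, 35],
    'City':  ['Kathmandu', 'Pokhara', 'Kathmandu', 'Lalitpur'],
    'Score': [88, 92, 79, 85]
}
df = pd.DataFrame(data)
print(df)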
DataFrame
● df['Name']
● df[['Name', 'City']]
● # Accessing the first row
● print(df.iloc[0])
●
● # Accessing rows by index
● print(df.iloc[1:3]) # Rows 1 to 2
DataFrame
● df.loc[row_label, column_label]
● # Accessing rows by label
● print(df.loc[0]) # First row
● # Accessing multiple rows by label
● print(df.loc[1:2]) # Rows 1 to 2
● print(df.loc[:, ['Name', 'Score']])
● df.loc[df['Age'] > 25]
DataFrame
● df.iloc[row_index, column_index]
● df.iloc[0]
● df.iloc[0:2]
● df.iloc[0:2, [0, 3]]
● print(df.loc[1:3, ['Name', 'City']])
● print(df.iloc[1:3, [0, 2]])
DataFrame
● df['Score'] = df['Score'] + 5
● # Applying a lambda function to increase age by 1
● df['Age'] = df['Age'].apply(lambda x: x + 1)
● print(df)
● def multiply_by_2(x):
● return x * 2
●
● result = df.apply(multiply_by_2)
● print(result)
DataFrame
● df_sorted = df.sort_values(by='Age')
● df.sort_values(by=['City', 'Score'])
● df.groupby('City')['Score'].mean()
Functions
● print(df.head(2))
● print(df.tail(2))
● df.describe()
● df['City'].value_counts()
● df_dropped = df.drop('Age', axis=1)
● # Dropping a row
● df_dropped = df.drop(2, axis=0)
Functions
● df.groupby('City')['Score'].mean()
● data_with_na = {
● 'A': [1, 2, None, 4],
● 'B': [None, 2, 3, 4]
● }
● df_with_na = pd.DataFrame(data_with_na)
● dropped_na_df = df_with_na.dropna()
● print(dropped_na_df)
● filled_na_df = df_with_na.fillna(0)
● print(filled_na_df)
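fillna also accepts per-column values; for example, filling each column's missing entries with that column's mean (a simple form of imputation):
mean_filled_df = df_with_na.fillna(df_with_na.mean())
print(mean_filled_df)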
EDA : Iris Dataset
● df = pd.read_csv('Iris.csv')
● df.head()
● df.shape
● df.info()
● df.describe()
● df.isnull().sum()
EDA : Iris Dataset
● data = df.drop_duplicates(subset="Species")
● df.value_counts("Species")
● import seaborn as sns
● import matplotlib.pyplot as plt
● sns.countplot(x='Species', data=df)
● plt.show()
EDA : Iris Dataset
● import seaborn as sns
● import matplotlib.pyplot as plt
●
● sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm', hue='Species', data=df)
●
● # Placing Legend outside the Figure
● plt.legend(bbox_to_anchor=(1, 1), loc=2)
● plt.show()
EDA : Iris Dataset
sns.pairplot(df.drop(['Id'], axis = 1),
hue='Species', height=2)
fig, axes = plt.subplots(2, 2, figsize=(10,10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5);
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6);
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6);
EDA : Iris Dataset
data.corr(method='pearson', numeric_only=True)
# numeric_only skips the text Species column (requires pandas >= 1.5)
sns.heatmap(df.drop(['Id'], axis=1).corr(method='pearson', numeric_only=True), annot=True)
plt.show()
EDA : Iris Dataset
def graph(y):
sns.boxplot(x="Species", y=y, data=df)
plt.figure(figsize=(10,10))
plt.subplot(221)
graph('SepalLengthCm')
plt.subplot(222)
graph('SepalWidthCm')
plt.subplot(223)
graph('PetalLengthCm')
plt.subplot(224)
graph('PetalWidthCm')
plt.show()
EDA : Iris Dataset
df = pd.read_csv('Iris.csv')
sns.boxplot(x='SepalWidthCm', data=df)
Q1 = np.percentile(df['SepalWidthCm'], 25,
interpolation = 'midpoint')
Q3 = np.percentile(df['SepalWidthCm'], 75,
interpolation = 'midpoint')
IQR = Q3 - Q1
print("Old Shape: ", df.shape)
EDA : Iris Dataset
# Upper bound
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5*IQR))
# Lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1-1.5*IQR))
# Removing the Outliers
df.drop(upper[0], inplace = True)
df.drop(lower[0], inplace = True)
print("New Shape: ", df.shape)
sns.boxplot(x='SepalWidthCm', data=df)
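An alternative sketch for the same outlier removal using a boolean mask, which keeps only the in-range rows and does not rely on the default integer index:
mask = (df['SepalWidthCm'] > (Q1 - 1.5 * IQR)) & (df['SepalWidthCm'] < (Q3 + 1.5 * IQR))
df_no_outliers = df[mask]
print("New Shape: ", df_no_outliers.shape)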
EDA : Titanic Dataset
● titanic = pd.read_csv(url)
● titanic.head()
● titanic.info()
● titanic.isnull().sum()
● titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
● titanic.drop(columns=['Cabin'], inplace=True)
EDA : Titanic Dataset
import seaborn as sns
import matplotlib.pyplot as plt
# Set up the plotting environment
sns.set(style='whitegrid')
# Count plot for 'Survived'
plt.figure(figsize=(8, 6))
sns.countplot(x='Survived', data=titanic)
plt.title('Survival Count')
plt.show()
EDA : Titanic Dataset
# Distribution plot for 'Age'
plt.figure(figsize=(8, 6))
sns.histplot(titanic['Age'], kde=True, bins=30)
plt.title('Age Distribution')
plt.show()
# Count plot for 'Pclass'
plt.figure(figsize=(8, 6))
sns.countplot(x='Pclass', data=titanic)
plt.title('Passenger Class Count')
plt.show()
EDA : Titanic Dataset
# Count plot for 'Sex'
plt.figure(figsize=(8, 6))
sns.countplot(x='Sex', data=titanic)
plt.title('Gender Count')
plt.show()
# Survival rate by 'Sex'
plt.figure(figsize=(8, 6))
sns.barplot(x='Sex', y='Survived', data=titanic)
plt.title('Survival Rate by Gender')
plt.show()
# Survival rate by 'Pclass'
plt.figure(figsize=(8, 6))
sns.barplot(x='Pclass', y='Survived', data=titanic)
plt.title('Survival Rate by Class')
plt.show()
EDA : Titanic Dataset
# Correlation matrix
plt.figure(figsize=(12, 8))
corr_matrix = titanic.corr(numeric_only=True)  # numeric columns only (Name, Sex, etc. are text)
sns.heatmap(corr_matrix, annot=True,
cmap='coolwarm', linewidths=0.2)
plt.title('Correlation Matrix')
plt.show()
EDA on Haberman's Survival Dataset
● What is the dataset about ?
● The dataset consists of several columns what do they mean?
● Get basic information about the dataset, including the number
of rows and columns and a preview of the data.
● Check if there are any missing values in the dataset
● Get basic statistics for numerical columns.
● Find the counts for each category in Survival_Status column.
● Visualize the distribution of numerical columns using
histograms and boxplots.
● Analyze the distribution of the survival status.
● Check the correlation between the features.
● Perform feature scaling on columns wherever needed (a starter sketch for these steps follows below).
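A possible starter sketch for the first few steps, assuming the file is saved as haberman.csv with columns named Age, Operation_Year, Axil_Nodes, and Survival_Status (rename to match your copy of the dataset):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('haberman.csv')                # assumed file name
print(df.shape)                                 # number of rows and columns
print(df.head())                                # preview of the data
print(df.isnull().sum())                        # missing values per column
print(df.describe())                            # basic statistics for numerical columns
print(df['Survival_Status'].value_counts())     # counts for each survival category

sns.histplot(df['Axil_Nodes'], bins=20)         # distribution of one numerical column
plt.show()
sns.boxplot(x='Survival_Status', y='Axil_Nodes', data=df)
plt.show()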
Categorical variables
● Nominal Data
○ categories with no inherent order or ranking.
○ Blood type (A, B, AB, O)
● Ordinal Data:
○ Represents categories with an inherent order
or ranking.
○ Education level (high school, bachelor's
degree, master's degree)
Feature encoding
● crucial step in the machine learning pipeline,
particularly when working with categorical data.
● Most ML algorithms require numerical input, so
categorical data must be converted into numerical
form.
● Label Encoding
● One-Hot Encoding
Label Encoding
● converts each unique category in a feature into an
integer value.
● useful for ordinal categorical variables where the
order of the categories is important.
● For example, the categories low, medium, and high
could be encoded as 0, 1, and 2, respectively.
● Can introduce ordinal relationships where none
exist, which can mislead some machine learning
models.
Label Encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data = {'color': ['red', 'blue', 'green', 'blue']}
df = pd.DataFrame(data)
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])
print(df)
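The fitted LabelEncoder keeps the category-to-integer mapping, which can be inspected and reversed:
print(label_encoder.classes_)                      # classes_[i] is the category encoded as i
print(label_encoder.inverse_transform([0, 1, 2]))  # map integers back to the original labels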
One-Hot Encoding
● converts categorical variables into a set of binary
variables (one-hot vectors).
● Each category is represented by a binary vector
where only one element is 1 and the rest are 0.
● useful for nominal categorical variables where no
ordinal relationship exists.
● Can result in high-dimensional data if the
categorical feature has many unique values.
One-Hot Encoding
import pandas as pd
data = {'color': ['red', 'blue', 'green', 'blue']}
df = pd.DataFrame(data)
df_one_hot = pd.get_dummies(df, columns=['color'])
print(df_one_hot)
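pd.get_dummies is the quickest route; scikit-learn's OneHotEncoder does the same job and can be reused on new data inside a pipeline. A minimal sketch (sparse_output assumes scikit-learn >= 1.2; older releases call the argument sparse):
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['color']])    # note the double brackets: a 2-D input
print(encoder.get_feature_names_out(['color']))   # generated column names
print(encoded)                                    # one-hot rows as a NumPy array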
Feature Scaling
● process of normalizing or standardizing the range of
features in your dataset.
● prevents features with larger scales from
dominating the model's learning process.
● Min-Max Scaling (Normalization)
● Standardization (Z-score Normalization):
Min-Max Scaling (Normalization)
● Transforms features by scaling each feature to a
given range, usually [0, 1].
● X′ = (X − Xmin) / (Xmax − Xmin)
● Makes features easier to understand and interpret.
● Improves performance for distance-based algorithms (e.g., KNN, K-Means).
● Drawbacks: sensitive to outliers, and does not account for the variance of the data.
Min-Max Scaling (Normalization)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
data = {'height': [150, 160, 170, 180, 190], 'weight':
[50, 60, 70, 80, 90]}
df = pd.DataFrame(data)
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df),
columns=df.columns)
print(df_scaled)
Standardization (Z-score Normalization)
● X′ = (X − μ) / σ, where μ is the mean and σ the standard deviation of the feature.
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = {'height': [150, 160, 170, 180, 190], 'weight': [50,
60, 70, 80, 90]}
df = pd.DataFrame(data)
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df),
columns=df.columns)
print(df_scaled)
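After standardization each column should have mean ≈ 0 and standard deviation ≈ 1; a quick check (StandardScaler divides by the population standard deviation, hence ddof=0):
print(df_scaled.mean())        # approximately 0 for every column
print(df_scaled.std(ddof=0))   # approximately 1 for every column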
Thank You
RAJAD SHAKYA