ML3 Data Analysis

The document provides a comprehensive overview of developing a machine learning application using Python, focusing on data analysis, manipulation, and visualization techniques with libraries such as NumPy, Matplotlib, and Pandas. It covers key concepts like data cleaning, array operations, statistical functions, and exploratory data analysis (EDA) using the Iris dataset. Additionally, it includes practical examples and code snippets for creating various types of plots and handling data structures.

Uploaded by

ramzanrawal777
LO3: Develop a machine learning application using an appropriate
programming language or machine learning tool for solving
a real-world problem
RAJAD SHAKYA
Data Analysis
● process of examining, cleaning, transforming, and
modeling data to discover useful information, draw
conclusions, and support decision-making.
● Data Cleaning (handling missing values, outlier
detection)
● Data Transformation (normalization/scaling,
encoding categorical variables)
NumPy
● NumPy (Numerical Python) is a library for numerical
computation in Python.
● provides arrays, matrices, and a variety of
mathematical functions that operate on these data
structures
● pip install numpy


NumPy
● import numpy as np

● np.__version__

● np.info(np.logspace)
NumPy Arrays
● Data Type: Homogeneous (all elements must be of
the same data type, e.g., integers, floats).
● Performance: Significantly faster than lists for
numerical computations.
● Memory: Efficient, because of homogeneous
data types and contiguous memory allocation.
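As a small illustration of the homogeneity and memory points above, the sketch below shows NumPy converting mixed input to one common data type and storing every element in the same number of bytes. The variable names are illustrative and do not come from the slides.

```python
import numpy as np

# Mixed int/float input is upcast to one common dtype (float64).
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)            # float64

# Every element occupies the same number of bytes, stored contiguously.
ints = np.array([1, 2, 3, 4], dtype=np.int64)
print(ints.itemsize)          # 8 bytes per element
print(ints.nbytes)            # 32 bytes in total (4 elements * 8 bytes)

# Vectorized arithmetic runs over the whole buffer without a Python loop.
doubled = ints * 2
print(doubled)                # [2 4 6 8]
```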
Creating Arrays
import numpy as np

# Creating arrays from lists


arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr1)
print(arr2)
Creating Arrays
zeros = np.zeros((2, 3))  # 2x3 array of zeros
ones = np.ones((2, 3))  # 2x3 array of ones
arange = np.arange(0, 10, 2)  # Values 0, 2, 4, 6, 8 (stop is exclusive)
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1 (inclusive)
Array Attributes

arr = np.array([[1, 2, 3], [4, 5, 6]])

print("Array Shape:", arr.shape)


print("Array Dimensions:", arr.ndim)
print("Array Size:", arr.size)
print("Array Data Type:", arr.dtype)
Array Indexing and Slicing
arr = np.array([1, 2, 3, 4, 5])
# Indexing
print(arr[0]) # First element
print(arr[-1]) # Last element
# Slicing
print(arr[1:4]) # Elements from index 1 to 3
print(arr[:3]) # First three elements
print(arr[::2]) # Every second element
Array Indexing and Slicing
# Multi-dimensional array indexing
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr2[0, 1])
# Element at first row, second column
print(arr2[:, 1])
# All elements in second column
Array Manipulation
arr = np.array([[1, 2, 3], [4, 5, 6]])

reshaped = arr.reshape((3, 2))


print(reshaped)

arr1 = np.array([1, 2, 3])


arr2 = np.array([4, 5, 6])
concatenated = np.concatenate((arr1, arr2))
print(concatenated)
Array Manipulation
# Split
split = np.split(concatenated, 2)
print(split)

# Flatten
flattened = arr.flatten()
print(flattened)
Array Operations
arr = np.array([1, 2, 3, 4])

# Element-wise operations
print(arr + 2) # Add 2 to each element
print(arr * 3) # Multiply each element by 3
print(arr ** 2) # Square each element
Array Operations

print(np.sum(arr)) # Sum of all elements


print(np.mean(arr)) # Mean of elements
print(np.min(arr)) # Minimum element
print(np.max(arr)) # Maximum element
print(np.std(arr)) # Standard deviation
Statistical and Mathematical Functions
arr = np.array([1, 2, 3, 4, 5])

# Statistical functions
print(np.mean(arr)) # Mean
print(np.median(arr)) # Median
print(np.var(arr)) # Variance

# Mathematical functions
print(np.sin(arr)) # Sine of each element
print(np.log(arr)) # Natural logarithm of each element
print(np.exp(arr)) # Exponential of each element
Broadcasting
arr1 = np.array([1, 2, 3])
arr2 = np.array([[4], [5], [6]])

# Broadcasting addition
result = arr1 + arr2
print(result)
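To make the broadcasting rule concrete, here is a minimal sketch of the same addition with the shapes spelled out. The names row, col and explicit are assumptions introduced for illustration. The explicit version built with np.tile is only there to show what broadcasting does implicitly.

```python
import numpy as np

row = np.array([1, 2, 3])          # shape (3,)
col = np.array([[4], [5], [6]])    # shape (3, 1)

# Shapes (3,) and (3, 1) are stretched to a common shape (3, 3).
result = row + col
print(result.shape)   # (3, 3)
print(result)
# [[5 6 7]
#  [6 7 8]
#  [7 8 9]]

# The same result built explicitly, without relying on broadcasting.
explicit = np.tile(row, (3, 1)) + np.tile(col, (1, 3))
print(np.array_equal(result, explicit))   # True
```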
Random Module
# Generating random numbers
random_num = np.random.rand(5)
# 5 random floats drawn uniformly from [0, 1)
print(random_num)
Random Module
# Random integers
random_ints = np.random.randint(1, 10, size=5)
# 5 random integers from 1 to 9 (the upper bound 10 is exclusive)
print(random_ints)
Random Module
np.random.seed(10)

a1 = np.random.randint(1,10,(3,3))
a2 = np.random.randint(1,10,(3,3))

np.dot(a1,a2)

a1.T
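The sketch below illustrates two points from this slide, using small hand-written matrices a and b (assumed values, not from the slides). Re-seeding replays the same random sequence. np.dot on 2-D arrays is a matrix product, not an element-wise product.

```python
import numpy as np

# Re-seeding with the same value replays the same pseudo-random sequence.
np.random.seed(10)
first = np.random.randint(1, 10, (3, 3))
np.random.seed(10)
second = np.random.randint(1, 10, (3, 3))
print(np.array_equal(first, second))   # True

# np.dot on 2-D arrays is matrix multiplication; '*' is element-wise.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
matrix_product = np.dot(a, b)
elementwise = a * b
print(matrix_product)   # rows: [19 22] and [43 50]
print(elementwise)      # rows: [ 5 12] and [21 32]
print(a.T)              # transpose, rows: [1 3] and [2 4]
```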
Linear Algebra
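A = np.array([[1.0, 2.0], [3.0, 4.0]])  # example matrices (values assumed)
B = np.array([[5.0, 6.0], [7.0, 8.0]])

np.linalg.det(A)  # Determinant of A
I = A * B  # Multiply corresponding elements of A and B
J = A / B  # Divide corresponding elements of A by B
K = np.sqrt(A)  # Square root of each element in A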
Aggregations
sum_A = np.sum(A)
mean_A = np.mean(A)
max_A = np.max(A)
min_A = np.min(A)

sum_axis0 = np.sum(A, axis=0)


# Sum along axis 0 (columns)
mean_axis1 = np.mean(B, axis=1)
# Mean along axis 1 (rows)
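The axis argument is easiest to read on a concrete array. In this sketch the 2x3 matrix A is an assumed example. axis=0 reduces down the rows and gives one value per column. axis=1 reduces across the columns and gives one value per row.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3): 2 rows, 3 columns

# axis=0 collapses the rows, leaving one value per column.
col_sums = np.sum(A, axis=0)
print(col_sums)             # [5 7 9]

# axis=1 collapses the columns, leaving one value per row.
row_means = np.mean(A, axis=1)
print(row_means)            # [2. 5.]

# With no axis, the whole array is reduced to a single number.
print(np.sum(A))            # 21
```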
Cheatsheet
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
More
● np.eye(5)
● x = np.array([[0,1],
[2,3]])

● np.sum(x,axis=1)

● np.sqrt
● np.exp
Matplotlib
● Matplotlib is a comprehensive library for creating
static, animated, and interactive visualizations in
Python.

● useful for generating plots and charts

● pip install matplotlib


Matplotlib
● import matplotlib.pyplot as plt

● Line Plot
○ plt.plot().

○ x = [1, 2, 3, 4, 5]
○ y = [2, 3, 5, 7, 11]

○ plt.plot(x, y)
Matplotlib
● plt.title('Simple Line Plot')
● plt.xlabel('X-axis')
● plt.ylabel('Y-axis')
● plt.show()

● plt.plot(x, y, color='red')
● plt.plot(x, y, color='#FF5733')
Matplotlib
● plt.plot(x, y, linestyle='--') # Dashed line
● plt.plot(x, y, linestyle='-.') # Dash-dot line
● plt.plot(x, y, linestyle=':') # Dotted line

● plt.plot(x, y, marker='o') # Circle marker


● plt.plot(x, y, marker='s') # Square marker
● plt.plot(x, y, marker='^') # Triangle up marker
Matplotlib
● plt.plot(x, y, color='green', linestyle='--',
linewidth=2, marker='o', markersize=8)

● plt.plot(x, y, label='Prime Numbers')


● plt.legend()


Matplotlib
● x = [1, 2, 3, 4, 5]
● y = [2, 3, 5, 7, 11]

● plt.scatter(x, y)
● plt.title('Simple Scatter Plot')
● plt.xlabel('X-axis')
● plt.ylabel('Y-axis')
● plt.show()
Matplotlib
● plt.scatter(x, y, s=100)

● plt.scatter(x, y, color='blue')
● plt.scatter(x, y, c='green')
# Using 'c' as shorthand for color
● plt.scatter(x, y, c='#FF5733')
# Using HEX color codes
Matplotlib
● plt.scatter(x, y, marker='^') # Triangle up marker


Shortcuts
● plt.plot(X,Y,'r--')
● plt.show()

● # ro -> red color marker circle
● # r- -> red color - line
● # r-- -> red color dashed line
Shortcuts
● X = np.arange(-16,16)
● Y = X **3

● plt.plot(X,Y,'b^-',linewidth=4.5)
Shortcuts
● X = np.arange(0,10,0.4)

● plt.plot(X,X,'r--',X,X**2,'bs',X,X**3,'g^')
● plt.show()
Shortcuts
● X = np.arange(0,10,0.4)

● plt.plot(X,X,'r--',label="x vs x")
● plt.plot(X,X**2,'bs',label="x vs x**2")
● plt.plot(X,X**3,'g^',label="x vs x**3")
● plt.xlabel("this is value of x")
● plt.ylabel("this is value of y")
● plt.title("graph for x vs x,x2,x3")
● plt.legend()
● plt.grid(True)
● plt.show()
Shortcuts
● - create a graph of equation
(all in one graph as well and individually)

1. Y = 4X^3 + 3X^2
2. Y = 3X^2 + 2X
3. Y = X+5

(Assume any data range using np arange)
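One possible way to approach this exercise is sketched below. The data range (-10 to 10), the dictionary name curves and the figure layout are all assumptions, and other ranges or layouts work equally well.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.arange(-10, 11)           # assumed range: integers -10..10

# The three equations from the exercise.
curves = {
    "Y = 4X^3 + 3X^2": 4 * X**3 + 3 * X**2,
    "Y = 3X^2 + 2X": 3 * X**2 + 2 * X,
    "Y = X + 5": X + 5,
}

# All three curves on one set of axes.
fig_all, ax_all = plt.subplots()
for label, Y in curves.items():
    ax_all.plot(X, Y, label=label)
ax_all.set_xlabel("X")
ax_all.set_ylabel("Y")
ax_all.legend()
ax_all.grid(True)

# Each curve in its own subplot.
fig_each, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (label, Y) in zip(axes, curves.items()):
    ax.plot(X, Y)
    ax.set_title(label)
fig_each.tight_layout()
```

Call plt.show() afterwards to display both figures.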


Bar Plot
● used to represent categorical data with rectangular bars.
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 10]
plt.bar(categories, values)
plt.title('Basic Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
Bar Plot
● plt.barh(categories, values)

● plt.bar(categories, values,
color='skyblue',
edgecolor='black',
linewidth=1.5,
width=0.6)
Sub Plot
X = np.array([1,2,3,4])

plt.subplot(2,1,1)
plt.plot(X,X**2)

plt.subplot(2,1,2)
plt.plot(X,X**3)
Sub Plot
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# One plot per subplot; only the second subplot is given a title
axs[0, 0].plot([1, 2, 3], [4, 5, 6], color='blue', linewidth=2)
axs[0, 1].plot([1, 2, 3], [6, 5, 4], color='green', linewidth=1.5)
axs[0, 1].set_title('Second Plot')
axs[1, 0].bar(['A', 'B', 'C'], [5, 6, 7], color='red')
axs[1, 1].scatter([1, 2, 3], [4, 5, 6], color='purple', marker='o')

plt.tight_layout()
plt.show()
Histogram
● used to represent the distribution of numerical data
by dividing it into bins.
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
plt.hist(data, bins=5)  # bins can also be a sequence of bin edges
plt.title('Basic Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
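To see what "dividing into bins" means numerically, the sketch below uses np.histogram, which performs the same kind of binning without drawing anything. The explicit edge list [1, 3, 5] is an assumed example of passing bins as a sequence.

```python
import numpy as np

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

# bins=5 splits the range [min, max] = [1, 5] into 5 equal-width bins.
counts, edges = np.histogram(data, bins=5)
print(edges)    # bin edges: 1.0, 1.8, 2.6, 3.4, 4.2, 5.0
print(counts)   # [1 2 3 4 5]

# Explicit edges can be passed instead of a bin count.
counts_custom, _ = np.histogram(data, bins=[1, 3, 5])
print(counts_custom)   # [ 3 12]
```

Every bin is half-open except the last one, which also includes its right edge. That is why the value 5 is counted.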
Histogram
● plt.hist(data, bins=5, color='lightgreen',
edgecolor='black', linewidth=1.5)

● Line Width: linewidth


● Line Style: linestyle (solid, dashed, etc.)
● Color: color
● Marker: marker (o, s, ^, etc.)
Box plot
np.random.seed(10)
data = np.random.randn(100)

# Create box plot


plt.boxplot(data)
plt.title('Box Plot')
plt.ylabel('Values')
plt.show()
Pandas
● powerful Python library specifically designed for
data manipulation and analysis.

● offers two primary data structures:


○ Series (one-dimensional) and
○ DataFrame (two-dimensional)
Series
● single column from a spreadsheet or a labeled array.
● collection of elements all of the same data type
along with an index (labels) to identify each
element.
● import pandas as pd
● data = [10, 20, 30, 40]
● my_series = pd.Series(data, index=['Apple',
'Banana', 'Cherry', 'Mango'])
● print(my_series)
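A short sketch of how the index labels are used once the Series exists. It reuses the slide's data and shows lookup by label, lookup by position, filtering, and element-wise arithmetic.

```python
import pandas as pd

my_series = pd.Series([10, 20, 30, 40],
                      index=['Apple', 'Banana', 'Cherry', 'Mango'])

print(my_series['Banana'])            # 20, looked up by label
print(my_series.iloc[0])              # 10, looked up by position

# A boolean condition keeps only the matching labels.
above_15 = my_series[my_series > 15]
print(above_15.index.tolist())        # ['Banana', 'Cherry', 'Mango']

# Arithmetic is applied element-wise and keeps the labels.
doubled = my_series * 2
print(doubled['Mango'])               # 80
```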
DataFrame
● full spreadsheet or a two-dimensional table.

● consists of multiple Series (columns) with
potentially different data types and a set of labels
(index) for rows.

● List or dictionary based initialization
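The two initialization styles can be sketched as follows. The column names Name, Age, City and Score match the ones used on the later slides, but the row values are made up for illustration.

```python
import pandas as pd

# Dictionary-based: keys become column names (values here are made up).
data = {
    'Name': ['Asha', 'Bikash', 'Chandra', 'Dipa'],
    'Age': [24, 31, 28, 22],
    'City': ['Kathmandu', 'Pokhara', 'Kathmandu', 'Lalitpur'],
    'Score': [85, 72, 90, 66],
}
df = pd.DataFrame(data)
print(df.shape)    # (4, 4)
print(df)

# List-based: one inner list per row, with column names given separately.
rows = [['Asha', 24], ['Bikash', 31]]
df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age'])
print(df_from_rows)
```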


DataFrame
● df['Name']

● df[['Name', 'City']]

● # Accessing the first row


● print(df.iloc[0])

● # Accessing rows by index
● print(df.iloc[1:3]) # Rows 1 to 2
DataFrame
● df.loc[row_label, column_label]
● # Accessing rows by label
● print(df.loc[0]) # First row

● # Accessing multiple rows by label


● print(df.loc[1:2]) # Rows 1 to 2

● print(df.loc[:, ['Name', 'Score']])


● df.loc[df['Age'] > 25]
DataFrame
● df.iloc[row_index, column_index]
● df.iloc[0]

● df.iloc[0:2]
● df.iloc[0:2, [0, 3]]

● print(df.loc[1:3, ['Name', 'City']])

● print(df.iloc[1:3, [0, 2]])
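The last two calls look alike but return different rows. The sketch below uses a made-up four-row DataFrame with the default integer index. It shows that loc includes the end label while iloc excludes the end position.

```python
import pandas as pd

# Made-up rows; the default index labels are 0, 1, 2, 3.
df = pd.DataFrame({
    'Name': ['Asha', 'Bikash', 'Chandra', 'Dipa'],
    'Age': [24, 31, 28, 22],
    'City': ['Kathmandu', 'Pokhara', 'Kathmandu', 'Lalitpur'],
})

# loc slices by label and includes the end label: rows 1, 2 and 3.
by_label = df.loc[1:3, ['Name', 'City']]
print(len(by_label))      # 3

# iloc slices by position and excludes the end position: rows 1 and 2.
by_position = df.iloc[1:3, [0, 2]]
print(len(by_position))   # 2

# Boolean conditions go through loc.
older = df.loc[df['Age'] > 25, 'Name'].tolist()
print(older)              # ['Bikash', 'Chandra']
```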


DataFrame
● df['Score'] = df['Score'] + 5
● # Applying a lambda function to increase age by 1
● df['Age'] = df['Age'].apply(lambda x: x + 1)
● print(df)
● def multiply_by_2(x):
●     return x * 2

● result = df.apply(multiply_by_2)
● print(result)
DataFrame
● df_sorted = df.sort_values(by='Age')

● df.sort_values(by=['City', 'Score'])

● df.groupby('City')['Score'].mean()
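A small worked example of the sorting and grouping calls above. The data is made up, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Asha', 'Bikash', 'Chandra', 'Dipa'],
    'City': ['Kathmandu', 'Pokhara', 'Kathmandu', 'Pokhara'],
    'Score': [80, 70, 90, 60],
})

# Sort by City, then by Score within each city (both ascending).
sorted_names = df.sort_values(by=['City', 'Score'])['Name'].tolist()
print(sorted_names)   # ['Asha', 'Chandra', 'Dipa', 'Bikash']

# Split rows by City, then average Score within each group.
mean_by_city = df.groupby('City')['Score'].mean()
print(mean_by_city['Kathmandu'])   # 85.0
print(mean_by_city['Pokhara'])     # 65.0
```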
Functions
● print(df.head(2))
● print(df.tail(2))
● df.describe()
● df['City'].value_counts()
● df_dropped = df.drop('Age', axis=1)
● # Dropping a row
● df_dropped = df.drop(2, axis=0)
Functions
● df.groupby('City')['Score'].mean()
● data_with_na = {
● 'A': [1, 2, None, 4],
● 'B': [None, 2, 3, 4]
● }
● df_with_na = pd.DataFrame(data_with_na)
● dropped_na_df = df_with_na.dropna()
● print(dropped_na_df)
● filled_na_df = df_with_na.fillna(0)
● print(filled_na_df)
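Filling with 0 is only one option. As a hedged alternative, the sketch below fills each gap with the mean of its own column. That is a common choice, not something the slides prescribe, and the name filled_mean is an assumption.

```python
import pandas as pd

df_with_na = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
})

# dropna keeps only rows with no missing value: rows 1 and 3 survive.
kept_rows = df_with_na.dropna().index.tolist()
print(kept_rows)                 # [1, 3]

# fillna accepts per-column values; here each gap gets its column mean.
filled_mean = df_with_na.fillna(df_with_na.mean())
print(filled_mean.loc[0, 'B'])   # 3.0  (mean of 2, 3, 4)
print(int(filled_mean.isnull().sum().sum()))   # 0 missing values remain
```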
EDA : Iris Dataset
● df.head()

● df.shape

● df.info()

● df.describe()

● df.isnull().sum()
EDA : Iris Dataset
● data = df.drop_duplicates(subset ="Species",)

● df.value_counts("Species")

● import seaborn as sns


● import matplotlib.pyplot as plt

● sns.countplot(x='Species', data=df, )
● plt.show()
EDA : Iris Dataset
● import seaborn as sns
● import matplotlib.pyplot as plt

● sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
                  hue='Species', data=df)

● # Placing Legend outside the Figure
● plt.legend(bbox_to_anchor=(1, 1), loc=2)
● plt.show()
EDA : Iris Dataset
sns.pairplot(df.drop(['Id'], axis = 1),
hue='Species', height=2)
fig, axes = plt.subplots(2, 2, figsize=(10,10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5);
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6);
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6);
EDA : Iris Dataset
# numeric_only=True skips the non-numeric 'Species' column
data.corr(method='pearson', numeric_only=True)

sns.heatmap(df.drop(['Id'], axis=1).corr(method='pearson', numeric_only=True),
            annot=True)

plt.show()
EDA : Iris Dataset
def graph(y):
    sns.boxplot(x="Species", y=y, data=df)

plt.figure(figsize=(10,10))
plt.subplot(221)
graph('SepalLengthCm')
plt.subplot(222)
graph('SepalWidthCm')
plt.subplot(223)
graph('PetalLengthCm')
plt.subplot(224)
graph('PetalWidthCm')
plt.show()
EDA : Iris Dataset
df = pd.read_csv('Iris.csv')
sns.boxplot(x='SepalWidthCm', data=df)

# NumPy 1.22+ names this argument method=...; older releases use interpolation=...
Q1 = np.percentile(df['SepalWidthCm'], 25, method='midpoint')

Q3 = np.percentile(df['SepalWidthCm'], 75, method='midpoint')
IQR = Q3 - Q1
print("Old Shape: ", df.shape)
EDA : Iris Dataset
upper = np.where(df['SepalWidthCm'] >= (Q3+1.5*IQR))

# Lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1-1.5*IQR))

# Removing the Outliers


df.drop(upper[0], inplace = True)
df.drop(lower[0], inplace = True)
print("New Shape: ", df.shape)

sns.boxplot(x='SepalWidthCm', data=df)
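The same IQR rule can be sketched without the Iris file. The example below uses a made-up Series named widths and a boolean mask in place of np.where and drop. It also uses strict inequalities and pandas' default linear-interpolation quantiles, so the bounds can differ slightly from the 'midpoint' version above.

```python
import pandas as pd

# Made-up measurements standing in for one numeric column.
widths = pd.Series([2.0, 3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 4.4, 3.1, 3.0])

q1 = widths.quantile(0.25)
q3 = widths.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
is_outlier = (widths < lower_bound) | (widths > upper_bound)

print(widths[is_outlier].tolist())       # [2.0, 4.4]
cleaned = widths[~is_outlier]
print(len(widths), "->", len(cleaned))   # 10 -> 8
```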
EDA : Titanic Dataset
● titanic = pd.read_csv(url)
● titanic.head()
● titanic.info()

● titanic.isnull().sum()
● titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
● titanic.drop(columns=['Cabin'], inplace=True)
EDA : Titanic Dataset
import seaborn as sns
import matplotlib.pyplot as plt
# Set up the plotting environment
sns.set(style='whitegrid')
# Count plot for 'Survived'
plt.figure(figsize=(8, 6))
sns.countplot(x='Survived', data=titanic)
plt.title('Survival Count')
plt.show()
EDA : Titanic Dataset
# Distribution plot for 'Age'
plt.figure(figsize=(8, 6))
sns.histplot(titanic['Age'], kde=True, bins=30)
plt.title('Age Distribution')
plt.show()
# Count plot for 'Pclass'
plt.figure(figsize=(8, 6))
sns.countplot(x='Pclass', data=titanic)
plt.title('Passenger Class Count')
plt.show()
EDA : Titanic Dataset
# Count plot for 'Sex'
plt.figure(figsize=(8, 6))
sns.countplot(x='Sex', data=titanic)
plt.title('Gender Count')
plt.show()
# Survival rate by 'Sex'
plt.figure(figsize=(8, 6))
sns.barplot(x='Sex', y='Survived', data=titanic)
plt.title('Survival Rate by Gender')
plt.show()
# Survival rate by 'Pclass'
plt.figure(figsize=(8, 6))
sns.barplot(x='Pclass', y='Survived', data=titanic)
plt.title('Survival Rate by Class')
plt.show()
EDA : Titanic Dataset
# Correlation matrix
plt.figure(figsize=(12, 8))
corr_matrix = titanic.corr(numeric_only=True)  # skip text columns such as 'Sex'
sns.heatmap(corr_matrix, annot=True,
cmap='coolwarm', linewidths=0.2)
plt.title('Correlation Matrix')
plt.show()
EDA on Haberman's Survival Dataset
● What is the dataset about?
● The dataset consists of several columns. What do they mean?
● Get basic information about the dataset, including the number
of rows and columns and a preview of the data.
● Check if there are any missing values in the dataset
● Get basic statistics for numerical columns.
● Find the counts for each category in Survival_Status column.
● Visualize the distribution of numerical columns using
histograms and boxplots.
● Analyze the distribution of the survival status.
● Check the correlation between the features.
● Perform feature scaling on columns wherever needed.
Categorical variables
● Nominal Data
○ categories with no inherent order or ranking.
○ Blood type (A, B, AB, O)

● Ordinal Data:
○ Represents categories with an inherent order
or ranking.
○ Education level (high school, bachelor's
degree, master's degree)
Feature encoding
● crucial step in the machine learning pipeline,
particularly when working with categorical data.

● ML algorithms require numerical input, so
categorical data must be converted into numerical
form.

● Label Encoding
● One-Hot Encoding
Label Encoding
● converts each unique category in a feature into an
integer value.
● useful for ordinal categorical variables where the
order of the categories is important.
● For example, the categories low, medium, and high
could be encoded as 0, 1, and 2, respectively.
● Can introduce ordinal relationships where none
exist, which can mislead some machine learning
models.
Label Encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data = {'color': ['red', 'blue', 'green', 'blue']}
df = pd.DataFrame(data)
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])
print(df)
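One detail worth checking on ordinal data: LabelEncoder assigns codes in sorted (alphabetical) order, not in the order you have in mind. The sketch below uses an assumed column named level. It contrasts the encoder's codes with an explicit mapping that preserves low < medium < high. The dictionary name order is also an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'level': ['low', 'high', 'medium', 'low']})

# LabelEncoder sorts the classes, so the codes follow alphabetical order.
encoder = LabelEncoder()
df['level_label'] = encoder.fit_transform(df['level'])
print(encoder.classes_.tolist())     # ['high', 'low', 'medium']
print(df['level_label'].tolist())    # [1, 0, 2, 1]

# An explicit mapping keeps the intended order low < medium < high.
order = {'low': 0, 'medium': 1, 'high': 2}
df['level_ordinal'] = df['level'].map(order)
print(df['level_ordinal'].tolist())  # [0, 2, 1, 0]
```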
One-Hot Encoding
● converts categorical variables into a set of binary
variables (one-hot vectors).
● Each category is represented by a binary vector
where only one element is 1 and the rest are 0.
● useful for nominal categorical variables where no
ordinal relationship exists.
● Can result in high-dimensional data if the
categorical feature has many unique values.
One-Hot Encoding
import pandas as pd

data = {'color': ['red', 'blue', 'green', 'blue']}
df = pd.DataFrame(data)
# pd.get_dummies creates one indicator column per category
df_one_hot = pd.get_dummies(df, columns=['color'])
print(df_one_hot)
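The same encoding can also be produced with scikit-learn's OneHotEncoder. A minimal sketch on the same data follows. The default output is a sparse matrix, so it is converted with toarray() for printing.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# The encoder expects a 2-D input, hence the double brackets.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['color']]).toarray()

# Columns follow the sorted category order.
print(encoder.categories_[0].tolist())   # ['blue', 'green', 'red']
print(encoded)
# [[0. 0. 1.]    red
#  [1. 0. 0.]    blue
#  [0. 1. 0.]    green
#  [1. 0. 0.]]   blue
```

Each row contains exactly one 1, in the column of its category.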
Feature Scaling
● process of normalizing or standardizing the range of
features in your dataset.

● prevents features with larger scales from
dominating the model's learning process.

● Min-Max Scaling (Normalization)

● Standardization (Z-score Normalization):


Min-Max Scaling (Normalization)
● Transforms features by scaling each feature to a
given range, usually [0, 1].
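● $X' = \dfrac{X - X_{\min}}{X_{\max} - X_{\min}}$
● Scaled values share a common, bounded range, which
makes features easier to compare and interpret.
● Often improves distance-based algorithms (e.g.
k-nearest neighbours), where large-valued features
would otherwise dominate.
● Sensitive to outliers: a single extreme value
compresses the remaining values into a narrow band.
● Does not standardize variance: features keep
different spreads within the range.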
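The min-max formula, X' = (X - Xmin) / (Xmax - Xmin), can be applied by hand in NumPy. The sketch below uses the height values from the next slide. The added value 400 is a made-up outlier that shows the sensitivity mentioned above.

```python
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)

# X' = (X - X_min) / (X_max - X_min)
scaled = (heights - heights.min()) / (heights.max() - heights.min())
print(scaled)   # [0.   0.25 0.5  0.75 1.  ]

# One extreme value compresses every other point toward 0.
with_outlier = np.append(heights, 400.0)
scaled_outlier = (with_outlier - with_outlier.min()) / (
    with_outlier.max() - with_outlier.min())
print(scaled_outlier.round(2))   # [0.   0.04 0.08 0.12 0.16 1.  ]
```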
Min-Max Scaling (Normalization)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
data = {'height': [150, 160, 170, 180, 190], 'weight':
[50, 60, 70, 80, 90]}
df = pd.DataFrame(data)
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df),
columns=df.columns)
print(df_scaled)
Standardization (Z-score Normalization)
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = {'height': [150, 160, 170, 180, 190], 'weight': [50,
60, 70, 80, 90]}
df = pd.DataFrame(data)
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df),
columns=df.columns)
print(df_scaled)
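For reference, the z-score behind StandardScaler is z = (x - mean) / std. The sketch below computes it by hand for the height column. It uses the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)

# z = (x - mean) / std, with the population standard deviation (ddof=0).
mean = heights.mean()          # 170.0
std = heights.std()            # sqrt(200), about 14.14
z_scores = (heights - mean) / std
print(z_scores.round(3))       # about -1.414, -0.707, 0, 0.707, 1.414

# After standardization the column has mean 0 and standard deviation 1.
print(np.isclose(z_scores.mean(), 0.0), np.isclose(z_scores.std(), 1.0))
```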
Thank You

RAJAD SHAKYA
