AIML Notes
Python Language Overview
Python is a high-level, interpreted, interactive and object-oriented scripting
language. Python is designed to be highly readable. It frequently uses English keywords
where other languages use punctuation, and it has fewer syntactical
constructions than other languages. It is a case-sensitive language.
Python is interpreted: Python is processed at runtime by the interpreter. You
do not need to compile your program before executing it. This is similar to Perl
and PHP.
Python is interactive: you can sit at a Python prompt and interact with
the interpreter directly to write your programs.
Values:-
A value is a basic piece of data that a program works with, such as a number, a string, or a list.
Example:-
# String values
name = "John"
message = "Hello, World!"
# List value
fruits = ["apple", "banana", "orange"]
# Dictionary value
person = {"name": "John", "age": 30}
Types:-
In Python, each value has a specific type that defines the kind of data it represents.
For example, integers, floating-point numbers, strings, lists, tuples, and dictionaries
are all different types in Python.
Example:-
# Numeric types
num_type = type(42) # int
float_type = type(3.14) # float
# String type
str_type = type("Hello, World!") # str
Keywords:-
Keywords are reserved words in Python that have a special meaning and cannot be
used as variable names. They are used to define the structure and flow of the code.
Example:-
if True:
    # 'if' is a keyword
    pass

def my_function():
    # 'def' is a keyword
    pass
Statements:-
Statements are individual lines of code that perform actions or computations. They
can be simple or compound, like loops or conditionals.
Example:-
x = 10 # Assignment statement
if x > 5:  # If statement
    print("x is greater than 5")
Expressions:-
Expressions are combinations of values, variables, and operators that evaluate to a
single value. They can involve arithmetic, comparisons, and other operations.
Example:-
# Arithmetic expressions
result1 = 2 + 3 * 5 # Result: 17
result2 = (2 + 3) * 5 # Result: 25
# Comparison expressions
greater_than = 5 > 3 # Result: True
equal_to = 10 == 5 # Result: False
Variables:-
Variables are used to store values in Python. They act as placeholders for data and
allow us to refer to values by their names.
Example:-
name = "John" # 'name' is a variable storing the value "John"
Data Structures
Lists:-
Lists are ordered, mutable collections of elements. Elements can be added, removed,
or changed after the list is created.
List Methods:-
Python provides a variety of built-in methods that you can use to manipulate lists.
1. append(): Adds an element to the end of the list.
Example:-
fruits.append("orange")
2. insert(): Inserts an element at a specified index in the list.
Example:-
fruits.insert(1, "grape")
3. remove(): Removes the first occurrence of a specific element from the list.
Example:-
fruits.remove("grape")
4. pop(): Removes the element at the specified index or the last element if no index
is provided, and returns the removed element.
Example:-
removed_fruit = fruits.pop(1)
# removed_fruit: "banana"
5. index(): Returns the index of the first occurrence of a specified element in the
list.
Example:-
index = fruits.index("banana")
# index: 1
6. count(): Returns the number of times a specified element appears in the list.
Example:-
count = fruits.count("banana")
# count: 2
7. sort(): Sorts the list in place, in ascending order for numbers or alphabetical
order for strings.
Example:-
numbers = [4, 2, 7, 1, 5]
numbers.sort()
# numbers: [1, 2, 4, 5, 7]
8. reverse(): Reverses the order of the elements in the list.
Example:-
fruits.reverse()
9. clear(): Removes all elements from the list.
Example:-
fruits.clear()
# fruits: []
Tuples:-
1. Tuples are ordered collections of elements, similar to lists, but they are
immutable, meaning once created, their elements cannot be changed.
2. Tuples are generally used when you want to ensure data integrity and prevent
accidental changes.
3. Since tuples are immutable, you cannot modify their elements after creation.
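A minimal sketch illustrating tuple immutability (the variable name point is only for illustration):
point = (3, 4)
print(point[0])  # 3 - elements can be read by index
point[0] = 10    # raises TypeError: 'tuple' object does not support item assignment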
Tuple Methods:-
1. count(): Returns the number of times a specified element appears in the tuple.
Example:-
my_tuple = (1, 2, 3, 2, 4, 2)
count = my_tuple.count(2)
# count: 3
2. index(): Returns the index of the first occurrence of a specified element in the
tuple.
Example:-
index = my_tuple.index(3)
# index: 2
Dictionaries:-
1. Dictionaries store data as key-value pairs and allow fast lookup of values by their keys.
2. Each key in a dictionary must be unique, and the values can be of different data
types.
Dictionary Methods:-
Dictionaries are a powerful data structure that stores data in key-value pairs. They
allow fast access to values based on their keys.
1. keys(): Returns a view object of all the keys in the dictionary.
Example:-
my_dict = {"name": "John", "age": 30, "city": "New York"}
keys = my_dict.keys()
# keys: ["name", "age", "city"]
2. values(): Returns a view object of all the values in the dictionary.
Example:-
my_dict = {"name": "John", "age": 30, "city": "New York"}
values = my_dict.values()
# values: ["John", 30, "New York"]
3. items(): Returns a view object of all the key-value pairs in the dictionary as
tuples.
Example:-
my_dict = {"name": "John", "age": 30, "city": "New York"}
items = my_dict.items()
# items: [("name", "John"), ("age", 30), ("city", "New York")]
4. get(): Retrieves the value associated with a given key. It returns a default value
if the key is not found (instead of raising an error).
Example:-
my_dict = {"name": "John", "age": 30}
name = my_dict.get("name", "Unknown")
city = my_dict.get("city", "Unknown")
# name: "John"
# city: "Unknown" (since "city" key is not present)
5. pop(): Removes and returns the value associated with the specified key.
Example:-
my_dict = {"name": "John", "age": 30}
age = my_dict.pop("age")
# age: 30, my_dict: {"name": "John"}
6. popitem(): Removes and returns the last key-value pair added to the dictionary
(since Python 3.7, dictionaries maintain insertion order).
Example:-
my_dict = {"name": "John", "age": 30, "city": "New York"}
item = my_dict.popitem()
# item: ("city", "New York"), my_dict: {"name": "John", "age": 30}
8. update(): Updates the dictionary with key-value pairs from another dictionary
or an iterable of key-value pairs.
Example:-
my_dict = {"name": "John", "age": 30}
my_dict.update({"city": "New York"})
# my_dict: {"name": "John", "age": 30, "city": "New York"}
Python Functions
Python functions are blocks of reusable code that perform a specific task. They
allow you to break down your code into smaller, more manageable pieces, making
it easier to understand, maintain, and reuse. Functions in Python are defined using
the def keyword, followed by the function name, parameters (if any), and a colon.
Syntax:-
def function_name(parameter1, parameter2):
    # function body
    # write some action
    return value
Example:-
1. Function without parameters and return value:-
def greet():
    print("Hello, how are you?")

# Calling the function
greet()  # Output: Hello, how are you?
5. Recursive function (a function that calls itself):
def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n - 1)

result = factorial(5)
print(result)  # Output: 120 (5! = 5 * 4 * 3 * 2 * 1)
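The next program uses a user-defined module named math_operations, whose definition
does not appear above. A minimal sketch of what it is assumed to contain (an add() and
a subtract() function saved in a file named math_operations.py):
# math_operations.py
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b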
Now, you can import this module in another Python script and use the
functions it provides:
Program:-
# main.py
import math_operations
result1 = math_operations.add(5, 3)
result2 = math_operations.subtract(10, 4)
print(result1) # Output: 8
print(result2) # Output: 6
Package:-
A package is a collection of multiple modules that are organized in a directory
structure. It enables you to group related modules together under a common
namespace.
Example:-
Let's create a package called `shapes` that contains two modules, `circle.py` and
`rectangle.py`, each defining functions related to their respective shapes.
Package:-
shapes/
|-- __init__.py
|-- circle.py
|-- rectangle.py
Program 1
# circle.py
import math

def area(radius):
    return math.pi * radius ** 2
Program 2
# rectangle.py
def area(length, width):
    return length * width
The `shapes` directory contains an `__init__.py` file, which is required to treat the
directory as a package.
Now, we can use the `circle` and `rectangle` modules from the `shapes` package in
our main script:
Program:-
# main.py
from shapes import circle, rectangle
circle_area = circle.area(5)
rectangle_area = rectangle.area(4, 6)
2. Built-in Modules:-
Python comes with a set of built-in modules that are available for use without any
additional installation. These modules provide a wide range of functionalities and
cover areas such as math, file handling, networking, and more. To use built-in
modules, you simply import them into your script.
Program:-
import math
result = math.sqrt(25)
print(result) # Output: 5.0
4. Third-Party Modules:-
Python also allows you to use third-party modules developed by the Python
community. These modules are not part of the Python distribution and need to be
installed separately using package managers like `pip`.
To use a third-party module, you first need to install it, and then you can import and
utilize its functionalities in your scripts.
Program:-
# Installation using pip: pip install pandas
import pandas as pd
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
Some commonly used built-in modules:-
5. sys: Provides access to some variables used or maintained by the interpreter and
functions that interact with the interpreter.
7. json: Provides functions for working with JSON (JavaScript Object Notation)
data.
8. csv: Allows reading and writing CSV (Comma Separated Values) files.
Exception Handling
Exception handling is essential for writing robust and error-resistant code.
It allows you to handle potential problems gracefully and provides an opportunity
to log or report errors, which is crucial for debugging and maintaining the code's
stability.
It allows you to gracefully handle errors or exceptional situations that may occur
during the execution of a program.
It helps prevent the program from crashing and allows you to take appropriate
actions when errors occur.
Python provides a straightforward syntax for handling exceptions using the try,
except, else, and finally blocks.
1. try: The try block contains the code that might raise an exception. It is the part of
the code where you anticipate potential errors.
2. except: The except block is used to specify how to handle a specific type of
exception. If an exception occurs in the try block and matches the specified type,
the code inside the except block is executed.
3. else: The else block is optional and used to specify code that should be executed
only if no exception occurs in the try block.
4. finally: The finally block is optional and will always be executed, regardless of
whether an exception occurred or not. It's commonly used for clean-up tasks, like
closing files or releasing resources.
Program:-
try:
    num1 = int(input("Enter a number: "))
    num2 = int(input("Enter another number: "))
    result = num1 / num2
    print("Result:", result)
except ValueError:
    print("Please enter valid integers.")
except ZeroDivisionError:
    print("Cannot divide by zero.")
else:
    print("No exceptions occurred.")
finally:
    print("This block always gets executed, regardless of exceptions.")
NumPy Library
NumPy is a powerful library for numerical computing in Python. It stands for
"Numerical Python" and provides support for large, multi-dimensional arrays and
matrices, along with a collection of mathematical functions to operate on these
arrays efficiently. NumPy is widely used in data science, machine learning, and
scientific computing due to its speed and versatility.
Arrays: The core data structure in NumPy is the `ndarray`, which stands for n-
dimensional array. It is similar to Python lists but allows for more efficient and
faster operations on large datasets.
Creating Arrays: You can create NumPy arrays using the `numpy.array()`
function or other specific functions like `numpy.zeros()`, `numpy.ones()`, or
`numpy.arange()`.
Program:-
import numpy as np
# Creating a 1D array
arr1 = np.array([1, 2, 3, 4, 5])
# Creating a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
Arithmetic Operations: NumPy allows you to perform various operations on
arrays, such as element-wise addition, subtraction, multiplication, and division.
1. Addition: The `np.add()` function is used to add two arrays element-wise or add
a constant value to all elements in an array.
Example:-
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
result = np.add(arr1, arr2)
print(result) # Output: [5 7 9]
4. Division: The `np.divide()` function is used to perform element-wise division of
two arrays or divide all elements in an array by a constant value.
Example:-
import numpy as np
arr1 = np.array([10, 20, 30])
arr2 = np.array([2, 5, 10])
result = np.divide(arr1, arr2)
print(result) # Output: [ 5. 4. 3.]
Program:-
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# Element-wise addition
result_add = arr1 + arr2 # Output: [5, 7, 9]
# Element-wise multiplication
result_mul = arr1 * arr2 # Output: [4, 10, 18]
Program:-
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Sum of array elements
sum_arr = np.sum(arr) # Output: 15
# Mean of array elements
mean_arr = np.mean(arr) # Output: 3.0
# Minimum and maximum elements
min_val = np.min(arr) # Output: 1
max_val = np.max(arr) # Output: 5
Indexing and Slicing: You can access elements of NumPy arrays using indexing
and perform slicing operations to get subsets of the array.
Program:-
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
# Accessing a specific element
print(arr[2]) # Output: 3
# Slicing
print(arr[1:5]) # Output: [2, 3, 4, 5]
It provides powerful tools for working with arrays, handling large datasets, and
performing complex mathematical operations efficiently.
Broadcasting:-
Broadcasting allows NumPy to perform element-wise operations on arrays of different
shapes. When the shapes of two arrays are compared:
1. The comparison starts from the trailing (rightmost) dimensions.
2. The dimensions must either be equal, one of them must be 1, or one of them must
not exist.
Program:-
import numpy as np
# Broadcasting example
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
# Element-wise addition between a and b
result = a + b
print(result) # Output: [11 22 33]
NumPy Functions:-
NumPy is a powerful library in Python for numerical computing. It provides a wide
range of functions for various mathematical and array manipulation operations.
Functions:-
np.array(): Create a numpy array from a Python list or tuple.
np.zeros(): Create an array filled with zeros.
np.ones(): Create an array filled with ones.
np.arange(): Create an array with evenly spaced values within a given interval.
np.linspace(): Create an array with a specified number of evenly spaced values
within a given interval.
np.reshape(): Reshape an array into a specified shape.
np.sum(): Compute the sum of array elements.
np.mean(): Compute the mean of array elements.
np.max(): Find the maximum value in an array.
np.min(): Find the minimum value in an array.
np.dot(): Compute the dot product of two arrays.
np.transpose(): Transpose an array.
Program:-
import numpy as np
# Create an array from a list
arr1 = np.array([1, 2, 3, 4, 5])
# Create an array filled with zeros
arr2 = np.zeros(5)
# Create an array filled with ones
arr3 = np.ones(5)
# Create an array with evenly spaced values from 0 to 9
arr4 = np.arange(10)
# Compute the sum of array elements
sum_arr1 = np.sum(arr1)
# Find the maximum value in an array
max_arr4 = np.max(arr4)
print(arr1)
print(arr2)
print(arr3)
print(arr4)
print(sum_arr1)
print(max_arr4)
Pandas Library
Pandas is a powerful Python library for data manipulation and analysis. It provides
data structures like DataFrames and Series, which allow you to handle and process
data efficiently.
The DataFrame is the most commonly used data structure in pandas and is similar
to a table in a relational database or an Excel spreadsheet. It allows you to store and
manipulate data in a tabular format with rows and columns.
Example:-
1. Importing pandas:-
import pandas as pd
2. Creating a DataFrame:-
# Create a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'San Francisco']
}
df = pd.DataFrame(data)
print(df)
3. Accessing columns:-
# Access a single column
ages = df['Age']
print(ages)
4. Accessing rows:-
# Access a single row by index
row1 = df.loc[1]
print(row1)
5. Data Manipulation:-
# Adding a new column
df['Salary'] = [50000, 60000, 70000]
# Dropping a column
df.drop(columns='City', inplace=True)
6. Data Input/Output:-
# Reading data from CSV file
data = pd.read_csv('data.csv')
Aggregation Function
In pandas, aggregation functions are used to perform summary operations on data
in a DataFrame. These functions help to collapse multiple data points into a single
value, providing valuable insights into the dataset.
Some commonly used aggregation functions in pandas include sum, mean, median,
min, max, count, std, var, and many others.
Example:-
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 28, 40],
'Salary': [50000, 60000, 70000, 55000, 80000]
}
df = pd.DataFrame(data)
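A few aggregation calls on this DataFrame, as a minimal sketch (the printed values follow from the sample data above):
print(df['Salary'].sum())    # 315000
print(df['Salary'].mean())   # 63000.0
print(df['Age'].min())       # 25
print(df['Age'].max())       # 40
print(df.describe())         # summary statistics for all numeric columns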
Matplotlib:-
Matplotlib is a widely used Python library for creating charts and plots.
Line Plot Example:-
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a line plot
plt.plot(x, y)
# Add labels and title
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Simple Line Plot")
# Display the plot
plt.show()
Seaborn:-
`seaborn` is a higher-level library built on top of `matplotlib`, designed to create
attractive and informative statistical graphics. It simplifies the process of creating
complex visualizations and enhances the default aesthetics.
Scatter Plot Example:-
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a scatter plot
sns.scatterplot(x=x, y=y)
plt.show()
CSV (Comma Separated Values)
CSV stands for Comma-Separated Values. It is a file format used to store tabular
data, such as spreadsheets or databases, in a plain text format. In a CSV file, each
line represents a row of data, and within each line, the individual values are
separated by commas. CSV files provide a lightweight and flexible way to store and
exchange structured data, making them a popular choice in various fields.
Characteristics:-
1. Plain Text Format: CSV files are plain text files that can be easily read and
edited with a text editor or spreadsheet software.
2. Tabular Structure: CSV files are structured as tables, with rows representing
records and columns representing fields or attributes of the data.
3. Comma Separation: The values within each row are separated by commas.
However, other delimiters like semicolons or tabs can also be used depending on
the regional settings or specific requirements.
4. No Data Type Enforcement: CSV files do not enforce any specific data types
for the values. All values are treated as strings unless explicitly converted by the
software reading the file.
5. No Cell Formatting: CSV files do not preserve any formatting such as font
styles, colors, or formulas. They focus solely on storing the raw data.
Example:-
Name,Age,Department,Salary
John Doe,30,Marketing,50000
Jane Smith,35,Finance,60000
Mark Johnson,28,IT,55000
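As a minimal sketch, the data above could be read with Python's built-in csv module (the file name employees.csv is only an assumption for illustration):
import csv

with open('employees.csv', newline='') as f:
    reader = csv.DictReader(f)   # uses the first row as column names
    for row in reader:
        print(row['Name'], row['Salary'])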
CSV Uses:-
1. Data Import/Export: CSV files are widely used to import or export data
between different software systems or databases. They provide a standardized
format that can be easily understood and processed by different applications.
2. Data Analysis: CSV files are often used in data analysis tasks. Data scientists
and analysts can import CSV files into tools like Excel, Python, R, or SQL
databases to perform statistical analysis, generate reports, or visualize the data.
3. Data Interchange: CSV files can be shared and exchanged between different
users or systems. They provide a simple and widely supported format for sharing
data, making it easier to collaborate or integrate data from multiple sources.
4. Web Development: CSV files can be used to store and exchange data in web
applications. For example, they can be used to upload and download data from
web forms or to provide data feeds for other applications.
Machine Learning
Machine learning is a subset of artificial intelligence (AI) that focuses on enabling
computers to learn and make predictions or decisions without being explicitly
programmed. It involves developing algorithms and models that allow machines to
learn from data and improve their performance over time.
(Machine learning is like teaching a computer to recognize patterns and make
decisions based on examples.)
Unsupervised Learning: The algorithm learns from unlabeled data, where there
are no predefined labels or desired outputs. The algorithm analyzes the input data
and identifies patterns, structures, or relationships within the data. It is often used
for tasks such as clustering, where the algorithm groups similar data points
together based on their characteristics.
Example Customer segmentation, image compression, market basket analysis,
and outlier detection.
Example Text classification, Image classification, Anomaly detection, Speech
analysis, Internet content classification.
Regression:-
Regression algorithms are used if there is a relationship between the input variable
and the output variable. They are used for the prediction of continuous variables, in
applications such as weather forecasting, market trend analysis, etc.
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
Classification:-
Classification algorithms are used when the output variable is categorical, meaning
the output belongs to a set of discrete classes such as Yes-No, Male-Female, True-False, etc.
1. Random Forest
2. Decision Trees
3. Logistic Regression
4. Support Vector Machines
5. KNN (K-Nearest Neighbors)
K-Nearest Neighbors (KNN):-
It operates based on the assumption that similar instances are likely to have
similar class labels or values.
It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead it stores the dataset and, at the time of classification,
performs an action on the dataset.
Step-1: Select the number K of the neighbors.
Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each
category.
Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
Step-6: Our model is ready.
KNN Program
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )
classifier.fit(x_train, y_train)
#Predicting the test set result
y_pred= classifier.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Naïve Bayes Classifier
Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
It is mainly used in text classification that includes a high-dimensional training
dataset.
Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms and helps in building fast machine learning models that can
make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
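A minimal sketch of a Naïve Bayes classifier with scikit-learn; the Iris dataset is used here only as a stand-in, since the notes' own user_data.csv is not reproduced:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Gaussian Naive Bayes for numeric features
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))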
Support Vector Machine (SVM)
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is termed
a Support Vector Machine.
Handle Outliers: SVMs are robust to outliers, which are data points that deviate
significantly from other points. Outliers have a minimal impact on the decision
boundary because SVMs focus on maximizing the margin using the support
vectors.
Handle High-dimensional Spaces: SVMs perform well even when the number
of features is much larger than the number of instances, a situation where many
algorithms suffer from the "curse of dimensionality." SVMs are less prone to
overfitting in high-dimensional spaces, making them suitable for problems with a
large number of features.
Types of SVM:-
Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line.
Non-linear SVM: Non-Linear SVM is used for non-linearly separated data,
which means if a dataset cannot be classified by using a straight line.
SVM Program
#Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# "Support vector classifier"
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Decision Tree
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
There are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given
dataset.
It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; these final nodes are called leaf nodes.
Attribute Selection Measures (ASM):-
1. Entropy
2. Gini Index
Pruning:-
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture
all the important features of the dataset. Therefore, a technique that decreases the
size of the learning tree without reducing accuracy is known as Pruning.
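A minimal sketch of training a decision tree with scikit-learn; the Iris dataset and the chosen parameters are only illustrative assumptions:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# criterion selects the attribute selection measure ('gini' or 'entropy'),
# and max_depth limits the size of the tree, acting like a simple form of pruning
classifier = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
classifier.fit(X_train, y_train)
print(classifier.score(X_test, y_test))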
Ensemble Methods
Ensemble methods combine several individual models (for example by bagging,
boosting, or stacking) to produce a stronger overall classifier.
3. Stacking: Stacking combines multiple heterogeneous models by training a meta-
model on the predictions of the individual models. The individual models serve
as base models that make predictions on the training data. The meta-model takes
these predictions as inputs and learns to make the final prediction. Stacking
leverages the different strengths and weaknesses of various models to create a
more powerful classifier.
Ensemble methods are widely used in classification tasks as they often produce
better results than individual models. They can increase accuracy, improve
generalization, and handle complex datasets.
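A minimal sketch of stacking with scikit-learn's StackingClassifier; the base models, meta-model, and dataset below are illustrative choices, not the notes' own example:
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Two heterogeneous base models; logistic regression acts as the meta-model
base_models = [('tree', DecisionTreeClassifier()), ('knn', KNeighborsClassifier())]
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))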
Cross Validation
Cross-validation is a technique used in machine learning to assess the performance
and generalization ability of a model. It helps in estimating how well the model will
perform on unseen data. The basic idea behind cross-validation is to split the
available data into multiple subsets, train the model on some of these subsets, and
evaluate its performance on the remaining subset.
Basic steps of cross-validations:-
Reserve a subset of the dataset as a validation set.
Provide the training to the model using the training dataset.
Now, evaluate model performance using the validation set. If the model performs
well on the validation set, proceed to the next step; otherwise, check for issues.
K-Fold Cross-Validation: The dataset is divided into k subsets, and the model
is trained and evaluated k times, each time using a different subset for evaluation
and the remaining subsets for training.
Validation Set Approach: We divide our input dataset into a training set and
test or validation set in the validation set approach. Both the subsets are given
50% of the dataset.
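A minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score; the KNN model and the Iris dataset are only illustrative assumptions:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: the data is split into 5 parts, and the model is
# trained and evaluated 5 times, each time holding out a different part
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # accuracy on each fold
print(scores.mean())  # average accuracy across folds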
Applications:-
Computer Vision
Computer vision is a field of study that focuses on enabling computers to understand
and interpret visual information from images or videos. It involves developing
algorithms and techniques to extract meaningful information from visual data.
Some types of computer vision tasks with examples:-
1. Image Classification: Identifying and categorizing objects or scenes in images.
Example classifying whether an image contains a cat or a dog.
7. Pose Estimation: Estimating the pose or position of objects or human body parts.
Example determining the 3D pose of a person's joints from an image or video.
Face Recognition and Detection with OpenCV
OpenCV:-
OpenCV is a huge open-source library for computer vision, machine learning, and
image processing, and it now plays a major role in real-time operation, which is very
important in today's systems. When it is integrated with various libraries, such as
NumPy, Python is capable of processing the OpenCV array structure for analysis.
To identify image patterns and their various features, we use vector spaces and perform
mathematical operations on these features. OpenCV is a computer vision library that
supports programming languages like Python, C++, and Java. The package was
initially created by Intel in 1999 and was later made open-source and released to the
public.
OpenCV provides a solid foundation for implementing these tasks, but additional
steps, such as data preprocessing, feature selection, and model training, may be
required to build a robust face recognition system. OpenCV can be used in
conjunction with deep learning frameworks like TensorFlow or PyTorch for such
applications.
Applications of OpenCV:-
1. Face recognition
2. Automated inspection and surveillance
3. Vehicle counting on highways along with their speeds
4. Interactive art installations
5. Street view image stitching
6. Robot and driver-less car navigation and control
7. Object recognition
8. Medical image analysis
9. Movies – 3D structure from motion
10. TV Channels advertisement recognition
Face Detection:-
1. Face detection is the process of locating human faces in an image or video frame.
2. To perform face detection, you can use the `CascadeClassifier` class in OpenCV,
which can load the pre-trained Haar cascades for face detection.
3. The Haar cascades or deep learning models are applied to the input image or
video frame, and they identify regions that potentially contain faces.
4. Detected faces are typically represented by bounding boxes that enclose the
detected face regions.
Face Recognition:-
1. Face recognition goes a step further than detection and identifies whose face
appears in an image.
2. OpenCV can be used in conjunction with other libraries like Dlib or Face
Recognition to implement face recognition algorithms.
3. The first step in face recognition is to capture and preprocess face images.
OpenCV provides functions for face alignment, normalization, and feature
extraction.
5. Once the features are extracted, they are typically used to train a machine learning
model, such as a Support Vector Machine (SVM) or a Convolutional Neural
Network (CNN).
6. OpenCV provides machine learning functions and classes for training and using
these models, such as `cv2.ml.SVM` or `cv2.dnn.Net`.
7. During recognition, the trained model is applied to new face images to predict or
classify the individuals they belong to.
8. The predicted results can be used for tasks like face identification, verification,
or attendance systems.
Installing the OpenCV library is required before we can begin face detection in
Python.
pip install opencv-python
Once the library is installed, we can start writing our code. The first step is to import
the relevant modules and read in an image.
Program
import cv2

# Load the cascade classifier for face detection
face_cascade = cv2.CascadeClassifier('path/to/haarcascade_frontalface_default.xml')

# Load the image
img = cv2.imread('path/to/image.jpg')

# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect faces in the image
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around the detected faces
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)

# Display the image
cv2.imshow('img', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
detectMultiScale():- This method takes in the image and a few parameters, such as
the scale factor and the minimum number of neighbors.
Scale factor:- It is used to control the size of the detection window, and the
minimum number of neighbors is used to control the number of false positives.
Installing the OpenCV library is required before we can begin face detection in
Python.
pip install opencv-python
conda install -c conda-forge dlib
pip install face_recognition
Once the libraries are installed, we can start writing our code. The first step is to
import the relevant modules and read in an image.
Program
import cv2
# Defining the variable and reading the file
image_1 = cv2.imread("Elon_musk.png")
FaceRecognizer()
In OpenCV, the `FaceRecognizer` class provides a set of methods and
functionalities for face recognition. It is an abstract class that serves as a base for
implementing different face recognition algorithms. These FaceRecognizer classes
provide methods for training the model with labeled face images and performing
face recognition on new, unseen face images.
FaceRecognizer classes in OpenCV:-
`cv2.face.LBPHFaceRecognizer`: Local Binary Patterns Histograms (LBPH) is
a simple yet effective face recognition algorithm. It extracts local binary patterns
from the face images and generates histograms as features for recognition.
Artificial Neural Network (ANN)
An ANN consists of interconnected nodes called artificial neurons or units. These
neurons are organized into layers, typically an input layer, one or more hidden
layers, and an output layer. Each neuron receives input from the neurons in the
previous layer, performs a computation, and passes the result to the neurons in the
next layer.
The connections between neurons in the network are represented by weights, which
determine the strength and significance of the information being passed. During the
training phase, the network learns by adjusting these weights based on the input data
and the desired output. This process is often done using a technique called
backpropagation, which compares the network's output with the expected output and
adjusts the weights accordingly.
Artificial Neural Networks Models:-
Feedforward Neural Networks (FNN): In this type of network, information
flows in one direction, from the input layer to the output layer. It's like a data
pipeline where the input is processed layer by layer, and the output is generated.
Convolutional Neural Networks (CNN): CNNs are mainly used for image and
video analysis. They have specialized layers called convolutional layers that
extract features from the input data, such as edges, textures, or shapes. CNNs are
effective in tasks like image recognition and object detection.
Artificial neural networks have become powerful tools in machine learning and have
been applied to a wide range of fields, including image and speech recognition,
natural language processing, recommendation systems, and more.
ANN Structure
An artificial neural network consists of interconnected nodes called artificial
neurons or units. These units are organized into layers. There are typically three
types of layers:
Input Layer: This is the first layer of the network where the data is initially fed.
Each neuron in the input layer represents a feature or an attribute of the input
data.
Example if you're training a network to recognize handwritten digits, each
neuron in the input layer could represent a pixel of an image.
Hidden Layers: These layers are placed between the input layer and the output
layer. They are called "hidden" because we don't directly observe their outputs.
Each hidden layer contains multiple neurons, and the number of hidden layers
can vary depending on the complexity of the problem. The hidden layers perform
computations on the input data and progressively extract more complex features
as we move deeper into the network.
Output Layer: This is the final layer of the network, where the network's
prediction or output is generated. The number of neurons in the output layer
depends on the specific task the network is designed for.
Example In a digit recognition task, there would be 10 neurons in the output
layer, each representing a digit from 0 to 9.
Characteristics of ANN
Neural networks can face some challenges when learning from data. There are a few
common problems along with their characteristics:-
1. Overfitting: Overfitting occurs when a neural network becomes too specialized
and performs exceptionally well on the training data but fails to generalize to
new, unseen data. It's like a student who memorizes answers for a specific set of
questions but struggles when faced with new ones. Overfitting can happen when
the network becomes too complex or when there is insufficient training data.
How it Works:-
1. Input Layer: The input layer is where we provide the initial data to the network.
Each neuron in the input layer represents a feature or an attribute of the input
data.
Example If we're working with images, each neuron in the input layer could
represent a pixel's intensity.
2. Hidden Layers: Between the input layer and the output layer, there can be one
or more hidden layers. The hidden layers consist of neurons that process the
information from the input layer. Each neuron in the hidden layers takes input
from the previous layer, performs a computation using weighted connections,
and passes the result to the next layer.
3. Weights and Activation: The special thing about neural networks is that there
are weights associated with the connections between neurons. These weights
determine the importance of the information flowing through each connection.
Additionally, each neuron in the hidden layers applies an activation function to
the result of its computation. The activation function introduces non-linearities
and helps the network to learn complex patterns in the data.
4. Output Layer: After passing through one or more Hidden Layers, the data
reaches the Output Layer. This layer provides the final result of the network's
computation. Each neuron in the Output Layer corresponds to an element of the
desired output.
Example If the network is trained for binary classification (yes/no), there
would be one neuron in the output layer.
Activation Function
An activation function in a neural network is a mathematical function that adds non-
linearity to the network's computations. It determines whether a neuron should be
activated (fire) or not based on the input it receives.
In simple words, think of an activation function as a gate that allows information to
flow through a neuron only if it meets a certain criterion. This gate introduces
flexibility and allows neural networks to model complex relationships in data.
Common types of activation functions:-
Sigmoid Function:-
The sigmoid function squeezes the input between 0 and 1. It's like a "squashing"
function that maps large positive and negative numbers to values close to 0 or 1. It
was widely used in the past but is less popular now due to some limitations, like the
"vanishing gradient problem."
ReLU (Rectified Linear Unit) Function:-
ReLU is a popular activation function that simply returns the input if it's positive
and sets it to zero if it's negative. This means ReLU allows the information to flow
if it's positive and blocks it if it's negative. ReLU is computationally efficient and
helps mitigate the vanishing gradient problem.
Leaky ReLU Function:-
Leaky ReLU is similar to ReLU, but instead of setting negative values to zero, it
allows a small, non-zero slope for negative inputs. This prevents neurons from
becoming inactive and addresses some of the limitations of the regular ReLU.
Tanh Function:-
The tanh function maps the input between -1 and 1, making it similar to the sigmoid
function but centered around zero. It allows negative values, which helps with
centering the data and avoiding the vanishing gradient problem to some extent.
Softmax Function:-
The softmax function converts a vector of raw scores into probabilities that sum to 1.
It is typically used in the output layer of a network for multi-class classification.
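A minimal NumPy sketch of these activation functions (a simple illustration, not a library implementation):
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))           # squashes values into (0, 1)

def relu(x):
    return np.maximum(0, x)               # passes positives, blocks negatives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope for negative inputs

def tanh(x):
    return np.tanh(x)                     # squashes values into (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))             # subtract the max for numerical stability
    return e / e.sum()                    # outputs sum to 1, like probabilities

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(sigmoid(x))
print(relu(x))
print(softmax(x))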
Backpropagation
Backpropagation is a technique used in neural networks to adjust the weights of the
connections between neurons during the training phase. It helps the network learn
from its mistakes and improve its predictions.
Backward Pass: In the backward pass, the network compares the generated
output with the desired output and calculates the error. The error is a measure of
how much the network's prediction deviates from the expected outcome. The
goal of backpropagation is to minimize this error.
1. Forward Pass: In the forward pass, the input data is passed through the network,
layer by layer, to generate an output (the prediction).
2. Error Calculation: After the output is generated, we compare it with the desired
output to calculate the error. The error is a measure of how much the network's
prediction deviates from the expected outcome.
3. Backward Pass: In the backward pass, the network propagates the error
backward from the output layer to the hidden layers. It computes the gradients of
the error with respect to each weight in the network using the chain rule from
calculus.
4. Weight Update: The gradients obtained from the backward pass represent the
direction and magnitude of the weight adjustments needed to reduce the error.
The network updates the weights by moving in the opposite direction of the
gradients. This process of updating the weights is called gradient descent.
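A minimal NumPy sketch of these steps for a single sigmoid neuron; the tiny dataset and learning rate are illustrative assumptions:
import numpy as np

X = np.array([[0.0], [1.0]])   # one input feature
y = np.array([[0.0], [1.0]])   # desired outputs
w, b = 0.5, 0.0                # initial weight and bias
lr = 0.5                       # learning rate

for epoch in range(1000):
    z = X * w + b                      # forward pass
    out = 1 / (1 + np.exp(-z))         # sigmoid activation
    error = out - y                    # error calculation
    # Backward pass: chain rule through the sigmoid (out * (1 - out) is its derivative)
    grad_z = error * out * (1 - out)
    grad_w = np.mean(grad_z * X)
    grad_b = np.mean(grad_z)
    w -= lr * grad_w                   # weight update (gradient descent)
    b -= lr * grad_b

print(w, b)  # the neuron now maps input 0 to a low output and input 1 to a high output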
2. Feature Extraction: As the filters slide over the input image, they perform a
mathematical operation called "convolution," which combines the values of the
image pixels within the window to create a "feature map." These feature maps
highlight the presence of different features in the input image.
4. Pooling Layer: After the convolutional layer, there is usually a pooling layer,
which reduces the spatial dimensions of the feature maps while retaining the most
important information. Pooling helps in reducing computation and making the
network more robust to variations in the input.
5. Fully Connected Layers: Following the convolutional and pooling layers, there
are one or more fully connected layers. These layers are similar to those in
traditional neural networks and help combine the features learned from the
previous layers to make predictions.
6. Output Layer: The final layer in the CNN is the output layer, which produces
the network's prediction or classification. The number of neurons in the output
layer depends on the specific task the CNN is designed for. For example, if it's a
digit recognition task, there would be 10 neurons, each representing a digit from
0 to 9.
CNNs have revolutionized computer vision tasks by achieving impressive results in
various applications, such as image classification, object detection, and image
segmentation.
ARCHITECTURE OF CNN
Convolutional Layer:-
It is the first and core layer of the CNN. It is one of the building blocks of a CNN
and is used for extracting important features from the image. We use a convolution
operation that will extract all the important features from the image.
Convolutional Operations
1. Kernel or Filter:-
Every input image is represented by a matrix of pixel values.
Apart from the input matrix, we also have another matrix called the filter matrix.
The filter matrix is also known as a kernel, or simply a filter.
We take the filter matrix, slide it over the input matrix by one pixel, perform
element-wise multiplication, sum up the results, and produce a single number.
2. Stride:-
It refers to the step size at which the convolutional kernel or filter is moved across
the input data. It determines how much the kernel is shifted at each step during the
convolution operation.
There are two types of stride:-
A. Small Stride: When the stride is set to a small number, we can encode a more
detailed representation of the image than when the stride is set to a large number.
B. Large Stride: A stride with a high value takes less time to compute than one
with a low value. It means we capture less detail, and it also reduces the
dimensionality of the feature maps compared to lower stride values.
3. Feature Map:-
A. Various filters are used for extracting different features from the image. If we use
a sharpen filter, then it will sharpen our image.
B. There are several kinds of filters in the image processing domain. Some of them
are used to detect edges, while others may blur the image.
C. Our goal is to learn the filter values from the input data and output labels.
D. We can use multiple filters for extracting different features from the image, and
produce multiple feature maps. So, the depth of the feature map will be the
number of filters.
E. If we use seven filters to extract different features from the image, then the depth
of our feature map will be seven.
4. Padding:-
We can simply pad the input matrix with zeros so that the filter can fit the input
matrix.
There are two types of padding:-
A. Zero Padding: Padding with zeros on the input matrix is called same padding or
zero padding.
B. Valid Padding: We can also simply discard the region of the input matrix where
the filter doesn't fit in. This is called valid padding.
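A minimal NumPy sketch of the convolution operation described above, with stride and zero padding as parameters; the small input matrix and filter values are only for illustration:
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Zero padding: pad the input with zeros so the filter can fit
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)  # element-wise multiply and sum
    return out

image = np.array([[1, 2, 3, 0],
                  [4, 5, 6, 1],
                  [7, 8, 9, 2],
                  [1, 0, 1, 3]])
kernel = np.array([[1, 0],
                   [0, -1]])   # a simple 2x2 filter
print(conv2d(image, kernel, stride=1, padding=0))  # the resulting feature map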
Pooling Layer:-
A pooling layer is a common component used in CNNs for image analysis and other
tasks. Its purpose is to reduce the spatial dimensions (width and height) of the input
data while retaining important features.
The pooling layer divides the input into small, non-overlapping regions called
pooling windows. Each pooling window collects information from a local
neighborhood of the input. It helps retain important features while discarding
irrelevant details, contributing to the overall effectiveness and efficiency of the
network.
Types of Pooling Operations:-
There are three major types of pooling operations:-
1. Max Pooling
2. Average Pooling
3. Sum Pooling
A. Max Pooling:-
In max pooling, we slide over the filter on the input matrix and simply take the
maximum value from the filter window.
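A minimal NumPy sketch of 2x2 max pooling with stride 2; the input values are only for illustration:
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# Split the 4x4 input into non-overlapping 2x2 windows and take the max of each
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 8]
               #  [3 4]]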
Fully Connected Layer:-
A fully connected layer, also known as a dense layer, is a fundamental component
in neural networks that connects every neuron from one layer to every neuron in the
subsequent layer. We flatten the feature map and convert it into a vector, and feed
it as an input to the feedforward network.
The feedforward network takes this flattened feature map as an input, applies an
activation function, such as sigmoid, and returns the output, this is called a fully
connected layer or dense layer.
The main goal of fully connected layers is to learn and model complex relationships
in the data, enabling the network to make accurate predictions or classifications
based on the learned features.
2. Layers: A neural network has multiple layers, typically divided into three types:
a) Input Layer: This layer receives the initial data, like an image or text.
b) Hidden Layers: These layers process the data and learn patterns from it.
c) Output Layer: The final layer produces the network's prediction or result,
depending on the task.
3. Weights and Biases: To learn from data, each connection between neurons has
a weight. These weights determine the importance of one neuron's output on
another. Additionally, each neuron has a bias, which helps in adjusting the
output.
4. Activation Function: After the weighted sum of inputs and biases is calculated,
an activation function is applied to introduce non-linearity into the model. It helps
the network learn complex relationships in the data.
5. Training: During training, the network adjusts its weights and biases to
minimize the difference between predicted output and actual output. It does this
by comparing the predicted values to the correct answers from the training data.
6. Loss Function: The loss function measures how far off the predictions are from
the actual values. The goal is to minimize this difference during training.
8. Epochs: The training process is divided into epochs. In each epoch, the network
goes through the entire dataset once, making adjustments to improve its
predictions.
10. Prediction: After successful training, the neural network can be used to make
predictions on new, unseen data.
Gradient Descent
2. Direction of Steepest Descent: You take small steps downhill in the direction
of the steepest slope. This way, you'll eventually reach the lowest point, which
corresponds to the minimum of the error.
4. Learning Rate: The learning rate controls the size of the steps you take. If the
learning rate is too large, you might overshoot the minimum and miss it. If it's
too small, you might take too many tiny steps and take forever to reach the
minimum.
5. Convergence: With each step, the error reduces, and the model's parameters get
closer to the optimal values. The process continues until the parameters converge
to a point where further adjustments do not significantly improve the model's
performance.
6. Local Minima: It's essential to note that gradient descent might get stuck in a
local minimum, which is not the absolute lowest point but a low point nearby.
Various techniques, such as using different starting points or advanced
optimization algorithms, are employed to mitigate this issue.
7. Non-Convex Functions: In complex models, the error surface might have many
valleys and hills (non-convex). In such cases, gradient descent can still find good
solutions, but it might take more time and care.
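A minimal sketch of gradient descent on a simple one-dimensional error function; the function, starting point, and learning rate are illustrative assumptions:
# Minimize error(w) = (w - 3) ** 2, whose gradient is 2 * (w - 3).
# Stepping against the gradient moves w toward the minimum at w = 3.
w = 0.0              # starting point
learning_rate = 0.1  # controls the size of each step

for step in range(50):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient

print(w)  # close to 3.0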
Perceptron Learning Algorithm
The Perceptron Learning Algorithm is one of the simplest learning algorithms used
for training a type of neural network called a "perceptron.". It is the foundation of
neural networks and deep learning. The perceptron is a single-layer neural network
used for binary classification tasks, which means it can decide between two possible
outcomes.
How the Perceptron Learning Algorithm Works:-
1. Perceptron: A perceptron is a basic building block of an artificial neural
network. It takes multiple inputs, multiplies each input by a corresponding
weight, and sums them up. The perceptron then applies an activation function to
the sum to produce an output.
2. Initialization: In the beginning, the weights of the perceptron are set to random
values or initialized to small numbers.
3. Learning from Data: The perceptron is trained using a labeled dataset. Each
data point in the dataset consists of input features and a known output (a label).
5. Prediction: With the initial weights, the perceptron predicts the output for each
data point in the training set.
6. Comparison and Weight Update: For each data point, the predicted output is
compared to the actual label. If the prediction is correct, no changes are made to
the weights. However, if the prediction is wrong, the weights are adjusted to
correct the error.
7. Updating Weights: To update the weights, the Perceptron Learning Algorithm
takes into account the difference between the predicted output and the actual
label. The weights are then adjusted to reduce this difference.
9. Convergence: The algorithm keeps iterating until it reaches a point where the
perceptron makes no mistakes on the entire training dataset or a certain number
of epochs are reached.
10. Termination: The training process stops when the perceptron achieves
satisfactory accuracy on the training data or when it reaches a predefined number
of epochs.
AND or OR Boolean Function using Perceptron
AND Function using Perceptron:-
1. We have two binary inputs, let's call them “input1” and “input2”.
2. Assign equal weights of `0.5` to both inputs (this means both inputs are equally
important in determining the output).
5. If `sum >= 1.0`, the perceptron outputs `1`, representing "True" for the AND
operation; otherwise, it outputs `0`, representing "False."
Truth Table for AND Function:-
INPUT 1 INPUT 2 OUTPUT
0 0 0
0 1 0
1 0 0
1 1 1
OR Function using Perceptron:-
2. Assign equal weights of `0.5` to both inputs (similarly, both inputs are equally
important).
5. If `sum >= 0.5`, the perceptron outputs `1`, representing "True" for the OR
operation; otherwise, it outputs `0`, representing "False."
Truth Table for OR Function:-
INPUT 1 INPUT 2 OUTPUT
0 0 0
0 1 1
1 0 1
1 1 1
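A minimal sketch of a perceptron implementing the AND and OR functions with the weights and thresholds described above:
def perceptron(input1, input2, threshold, weights=(0.5, 0.5)):
    # Weighted sum of the two binary inputs
    s = weights[0] * input1 + weights[1] * input2
    # Step activation: output 1 only if the sum reaches the threshold
    return 1 if s >= threshold else 0

# AND uses threshold 1.0 (only 1 and 1 give 0.5 + 0.5 = 1.0),
# OR uses threshold 0.5 (a single 1 already gives 0.5)
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", perceptron(a, b, 1.0), "OR:", perceptron(a, b, 0.5))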
Single-Layer Perceptron (SLP):-
1. A single-layer perceptron consists of only one layer of neurons (also called the
input layer) that directly connects to the output layer.
2. It can only learn and solve linearly separable problems. Linearly separable
problems are those that can be separated by a straight line or a hyperplane in
higher dimensions.
3. SLPs use a simple activation function, typically the step function, which can only
make binary decisions (output 0 or 1).
4. Since it has no hidden layers, it lacks the ability to capture complex patterns or
relationships in the data.
5. SLPs were introduced earlier in the history of neural networks and were limited
in their capabilities.
Multi-Layer Perceptron (MLP):-
1. A multi-layer perceptron consists of an input layer, one or more hidden layers, and
an output layer.
2. MLPs can learn and solve both linearly separable and non-linearly separable
problems. Non-linearly separable problems are those that require more complex
decision boundaries to separate the data.
3. MLPs can use a variety of activation functions, such as the sigmoid, ReLU, or
tanh functions, allowing them to model non-linear relationships and capture more
intricate patterns in the data.
5. MLPs are the basis for modern deep learning models and are widely used in a
wide range of applications due to their ability to handle complex and high-
dimensional data.
Natural Language Processing
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that
deals with the interaction between computers and human languages. It focuses on
enabling computers to understand, interpret, and generate human language in a way
that is meaningful and useful.
Differentiating Natural Language and Computer Language:-
Natural Language: Natural language refers to the languages that humans use
for communication, such as English, Spanish, Chinese, etc. It is characterized by
its complexity, ambiguity, and nuances, making it challenging for computers to
process and understand without specialized techniques like NLP.
Computer Language: Computer languages, such as Python, C++, or Java, are
formal languages with a strict, unambiguous syntax designed to be parsed and
executed by machines.
4. Voice Assistants: NLP powers voice assistants like Siri and Alexa, enabling
users to interact with their devices through spoken language.
5. Sentiment Analysis: NLP can determine the sentiment or emotions expressed in
text, making it useful for understanding customer feedback, social media
monitoring, etc.
Disadvantages of NLP:-
1. Ambiguity: Natural languages often have ambiguous meanings, which can be
challenging for computers to interpret correctly, leading to potential
misunderstandings.
3. Data Limitations: NLP models require extensive and diverse training data to
perform well, which might not always be available for certain languages or
domains.
5. Privacy and Ethical Concerns: NLP systems may encounter issues related to
data privacy, bias, and ethical considerations, especially when dealing with
sensitive information or user-generated content.
Text Processing
Text processing is a broader term that encompasses various techniques used to
handle and manipulate text data in NLP.
The goal of text processing is to clean and prepare the raw text data for further
analysis or modelling.
Text processing aims to convert unstructured text into a structured format that
can be easily understood and analyzed by computers.
1. Tokenization: The first step in text processing is tokenization. It involves
breaking down the text into smaller units called "tokens." Tokens can be words,
sentences, or even characters, depending on the level of granularity needed for
analysis.
3. Stopword Removal: Stopwords are common words (e.g., "the," "is," "and") that
don't carry much meaning and are often removed to reduce noise in the text and
focus on more important words.
Lexical Processing
It focuses on understanding and analyzing individual words (lexical units) in a
text.
The goal of lexical processing is to gain insights into the vocabulary, word
meanings, and grammatical relationships between words in a text.
2. Bag of Words (BoW): BoW is a popular text representation. It creates a "bag" of all unique words in the text, ignoring grammar and word order, and counts the frequency of each word (a short sketch follows this list).
3. Named Entity Recognition (NER): NER is a task in NLP that identifies and
classifies named entities such as names of people, organizations, locations, and
dates in the text.
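The Bag of Words representation from point 2 can be sketched in a few lines. This assumes a recent version of the scikit-learn library; the CountVectorizer class and the two toy documents are used only for illustration.
Program:-
from sklearn.feature_extraction.text import CountVectorizer
# Two toy documents
docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the "bag" of unique words (vocabulary)
print(bow.toarray())  # frequency of each word in each document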
Semantics:-
The study of meaning in language and how words and sentences convey
information. It is essential for building NLP systems that can comprehend the
context and intent behind human language. Several NLP tasks involve semantic
analysis to extract and interpret meaning from text.
1. Named Entity Recognition (NER): Identifying and classifying named entities,
like names of people, organizations, locations, and dates, in a sentence.
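A rough sketch of NER with NLTK's built-in chunker is given below. It assumes the listed NLTK resources can be downloaded (their exact names may differ between NLTK versions), and the example sentence is invented for illustration.
Program:-
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "Barack Obama was born in Hawaii in 1961."
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tags)  # groups tokens into entities such as PERSON and GPE
print(tree)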
Pragmatics:
It deals with the study of language use in context. It focuses on understanding the
meaning of sentences beyond their literal interpretations and takes into account the
context, speaker's intentions, and the implied meaning of the communication. It
plays a vital role in several NLP tasks that require a deeper understanding of
language use and context.
1. Anaphora Resolution: Identifying and resolving references (anaphora) in a
sentence that refer back to previously mentioned words or phrases.
Segmentation
Sentence segmentation involves splitting a text into individual sentences. One
common way to achieve this is by using the sent_tokenize function from the NLTK
library (Natural Language Toolkit).
Program:-
import nltk
from nltk.tokenize import sent_tokenize
# Make sure to download the punkt tokenizer data
nltk.download('punkt')
# Sample text
text = "This is a sample text. It contains multiple sentences. Let's split them."
# Split the text into sentences
sentences = sent_tokenize(text)
print(sentences)
Tokenization
Tokenization is the process of splitting a string or text into a list of tokens. A token can be thought of as a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph. NLTK supports several kinds of tokenization:
Tokenizing text into sentences
Tokenizing sentences into words
Tokenizing sentences using regular expressions
Program:-
# Import the NLTK library and the word tokenizer
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
text = "Tokenization splits a sentence into individual words."  # sample text
print(word_tokenize(text))
Stopword Removal
Program:-
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    # Keep only the words that are not in the stopword list
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)
# Sample text
text = "This is an example sentence that contains some stop words."
# Remove stopwords
processed_text = remove_stopwords(text)
print(processed_text)
Stemming
Stemming is a text pre-processing technique used in NLP to reduce words to their
base or root form, called the "stem." It involves removing suffixes from words to
obtain the core meaning of the word.
For example, stemming would convert words like "running" and "runs" to the common stem "run."
Program:-
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
def perform_stemming(text):
    stemmer = PorterStemmer()
    words = word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in words]
    return " ".join(stemmed_words)
# Sample text
text = "I am running and eating, while others are playing and laughing."
# Perform stemming
stemmed_text = perform_stemming(text)
print(stemmed_text)
Lemmatization
Lemmatization is a text normalization technique in NLP that reduces words to their
base or root form (known as the lemma) while preserving the context and meaning
of the word. It helps in grouping different forms of a word together, such as plurals
and verb conjugations, making it easier for algorithms to process the text and extract
meaningful information.
For example, lemmatization would convert "running" and "ran" to their base form "run," "cats" to "cat," and "better" to "good."
Program:-
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)
    # Note: without a POS argument, WordNetLemmatizer treats each word as a noun
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(lemmatized_words)
# Sample text
text = "The cats were running and playing in the garden."
# Lemmatize the text
lemmatized_text = lemmatize_text(text)
print(lemmatized_text)
Part-of-speech Tagging
It involves assigning a specific grammatical category (such as noun, verb, adjective,
etc.) to each word in a given sentence. This information helps in understanding the
syntactic structure and grammatical relationships within the text.
Program:-
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
def pos_tagging(text):
    words = word_tokenize(text)
    pos_tags = nltk.pos_tag(words)
    return pos_tags
# Sample text
text = "The quick brown fox jumps over the lazy dog."
# Tag each word with its part of speech
pos_tags = pos_tagging(text)
print(pos_tags)
The “pos_tagging” function tokenizes the input text into individual words using the
“word_tokenize” function from NLTK. Then, the “pos_tag” function is used to
assign POS tags to each word.
OUTPUT:-
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over',
'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
'DT' (Determiner)
'JJ' (Adjective)
'NN' (Noun)
'VBZ' (Verb, 3rd person singular present)
'IN' (Preposition)
'.' (Punctuation)
Sentiment Analysis
Sentiment Analysis, also known as Opinion Mining, is a text analysis technique that
aims to determine the sentiment or emotional tone expressed in a piece of text. The
goal is to identify whether the sentiment is positive, negative, neutral, or sometimes
more fine-grained emotions like happy, sad, angry, etc.
Applications:-
1. Social Media Monitoring: Analyzing social media posts, tweets, or comments
to understand public opinion about products, services, or events.
2. Customer Feedback Analysis: Identifying customer sentiment in reviews,
surveys, or feedback to gauge customer satisfaction.
Text: "I am disappointed with the service. It was very slow and unresponsive."
Sentiment: Negative
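This kind of label can be reproduced with NLTK's built-in VADER analyzer. The snippet below is a minimal sketch; it assumes the vader_lexicon resource can be downloaded, and the threshold mentioned in the comment is a common convention rather than part of these notes.
Program:-
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
text = "I am disappointed with the service. It was very slow and unresponsive."
scores = sia.polarity_scores(text)
print(scores)  # a 'compound' score below about -0.05 is usually read as negative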
Text Classification
Text Classification, also known as Document Classification or Text Categorization,
is a broader task where text documents are categorized into predefined classes or
categories based on their content or topic.
Applications:-
1. Spam Detection: Classifying emails as spam or not spam.
3. Intent Recognition: Identifying the intent of user queries for chatbots or virtual
assistants.
Example:-
For an e-commerce website, categorizing product descriptions into different classes
like "Electronics," "Clothing," "Books," etc.
Text: "The latest fashion trends for summer are now available."
Category: Clothing
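A minimal sketch of such a classifier is shown below, assuming scikit-learn is available. The tiny training texts, the labels, and the choice of CountVectorizer with Multinomial Naive Bayes are illustrative assumptions rather than the method prescribed by these notes.
Program:-
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Toy training data: one short description per category
train_texts = ["new smartphone with a great camera",
               "summer dresses and fashion trends",
               "a bestselling mystery novel"]
train_labels = ["Electronics", "Clothing", "Books"]
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)
test_text = ["The latest fashion trends for summer are now available."]
X_test = vectorizer.transform(test_text)
print(classifier.predict(X_test))  # expected to print ['Clothing']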
Difference between Training Set and Test Set
Training Set:-
1. The training set is a subset of the data used to train a machine learning model.
2. It contains labeled data, where the input features (independent variables) and the corresponding target labels (dependent variable) are known.
3. The model learns from this data during the training phase, adjusting its internal
parameters to fit the patterns in the data.
Test Set:-
1. The test set is another subset of the data that the trained machine learning model
has not seen during training.
2. It also contains labeled data with input features and corresponding target labels.
3. The test set is used to evaluate the model's performance and its ability to
generalize to new, unseen data.
4. By using a test set, we can estimate how well the model will perform on real-
world data and assess its accuracy, precision, recall, and other performance
metrics.
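In practice, the split is usually done by random sampling, as in the brief sketch below. It assumes scikit-learn's train_test_split function, and the toy feature and label lists are invented for illustration.
Program:-
from sklearn.model_selection import train_test_split
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]  # input features
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # target labels
# Randomly sample 20% of the rows for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 training examples, 2 test examples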
Regarding splitting on the dependent variable:-
When we split the data into a training set and a test set, we want to ensure that both
sets represent the same distribution of data. By randomly sampling data for the
training and test sets, we ensure that both sets have similar characteristics, and the
model learns and evaluates on a representative sample.
However, we do not split the data based on the dependent variable (target label)
because that would introduce a bias. The dependent variable is the variable we are
trying to predict or classify, and it is used to measure the model's performance
during testing.
If we split the data based on the dependent variable, we might end up with a test set whose target values or class proportions differ significantly from those in the training set. Such a test set would not accurately reflect the model's performance on new, unseen data, because the evaluation would be biased by the artificial way the target values were divided between the two sets.