Aditya Report

The document is a project report for the Salary Prediction Project submitted by Aditya Kumar Gupta as part of his Bachelor of Technology degree in Computer Science and Engineering at Ramgarh Engineering College. It includes acknowledgments, a certificate from the supervisor, and a detailed table of contents covering various topics related to data science, artificial intelligence, and machine learning. The report outlines the project's objectives, methodologies, and the significance of Python and its data structures in implementing the project.


SALARY PREDICTION PROJECT

A Project Report
for Industrial Training and Internship
Submitted by
ADITYA KUMAR GUPTA
STUDENT ID:- 249011001185
In partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
In the
COMPUTER SCIENCE & ENGINEERING
OF
RAMGARH ENGINEERING COLLEGE

At

Ardent Computech Pvt. Ltd.


CERTIFICATE FROM SUPERVISOR
This is to certify that ADITYA KUMAR GUPTA, 23033445002, has completed the project
titled SALARY PREDICTION PROJECT
under my supervision during the period from 02-06-2025 to 15-07-2025, in partial
fulfillment of the requirements for the award of the B.Tech degree, submitted to the
Department of COMPUTER SCIENCE ENGINEERING of RAMGARH ENGINEERING COLLEGE.

___________________________

Signature of the Supervisor

Date: dd/mm/yy

Name of the Project Supervisor: SOURAV GOSWAMI SIR


BONAFIDE CERTIFICATE
Certified that this project work, SALARY PREDICTION PROJECT, carried out under my
supervision, is the bonafide work of
Name of the student: Signature:
ADITYA KUMAR GUPTA

SIGNATURE
Name :
PROJECT MENTOR

SIGNATURE
Name:
EXAMINERS
Ardent Original Seal
1. Title of the Project: SALARY PREDICTION PROJECT
2. Project Member: ADITYA KUMAR GUPTA 23033445002
KUMARI NEHA 23033445018
PUNITA SINGH 23033445033
SNEHA KUMARI 23033445047
AJIT KUMAR SINGH 23033445003
3. Name of the guide: Mr. SOURAV GOSWAMI SIR
4. Address: Ardent Computech Pvt. Ltd (An ISO 9001:2008 Certified)
CF-137, Sector - 1, Salt Lake City, Kolkata - 700064

Signature of Approver
Date:
MR. Sourav Goswami Sir
Ardent Original Seal
ACKNOWLEDGEMENT

The achievement that is associated with the successful completion of any task
would be incomplete without mentioning the names of those people whose
endless cooperation made it possible. Their constant guidance and
encouragement made all our efforts successful.

We take this opportunity to express our deep gratitude towards our project
mentor, Mr. SOURAV GOSWAMI SIR, for giving such valuable suggestions,
guidance and encouragement during the development of this project work.

Last but not the least, we are grateful to all the faculty members of Ardent
Computech Pvt. Ltd. for their support.
Table of Contents
DATA SCIENCE, ARTIFICIAL INTELLIGENCE,
MACHINE LEARNING AND BIG DATA

1. Understanding Python – Syntax, Concepts, and Basics

2. Python Data Structures – List, Tuple, Set, Dictionary

3. Understanding Artificial Intelligence (AI) and Machine Learning (ML)

4. Full NumPy – Arrays, Indexing, Broadcasting, Operations

5. Full Pandas – DataFrames, Series, Data Analysis Techniques

6. Feature Engineering – Handling Nulls, Encoding, Scaling, etc.

7. Data Processing – Cleaning, Normalization, Standardization

8. Matplotlib – Data Visualization using Python

9. Streamlit and Plotly – Interactive Data Dashboards & Graphs

10. Regression Algorithms – Linear, Polynomial, etc.

11. Classification Algorithms – Logistic Regression, KNN, etc.

12. NLP (Natural Language Processing) – Text Data Handling

13. Computer Vision – Image Data Processing & Use Cases


What is Python?

Python is a popular programming language. It was created by Guido van Rossum, and released in 1991.

It is used for:

 web development (server-side),

 software development,

 mathematics,

 system scripting.

What can Python do?

 Python can be used on a server to create web applications.

 Python can be used alongside software to create workflows.

 Python can connect to database systems. It can also read and modify files.

 Python can be used to handle big data and perform complex mathematics.

 Python can be used for rapid prototyping, or for production-ready software development.

Why Python?

 Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).

 Python has a simple syntax similar to the English language.

 Python has syntax that allows developers to write programs with fewer lines than some other programming
languages.

 Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This means
that prototyping can be very quick.

 Python can be treated in a procedural way, an object-oriented way or a functional way.

Good to know

 The most recent major version of Python is Python 3, which we shall be using in this tutorial. However,
Python 2, although not being updated with anything other than security updates, is still quite popular.

 In this tutorial Python will be written in a text editor. It is possible to write Python in an Integrated
Development Environment, such as Thonny, PyCharm, NetBeans or Eclipse, which are particularly useful when
managing larger collections of Python files.

Python Syntax compared to other programming languages

 Python was designed for readability, and has some similarities to the English language with influence from
mathematics.

 Python uses new lines to complete a command, as opposed to other programming languages which often
use semicolons or parentheses.

 Python relies on indentation, using whitespace, to define scope, such as the scope of loops, functions and
classes. Other programming languages often use curly brackets for this purpose.

Example

print("Hello, World!")
Data Structures are a way of organizing data so that it can be accessed more efficiently depending upon the
situation. Data structures are the fundamentals of any programming language, around which a program is built. Python
helps us learn these fundamental data structures in a simpler way than many other programming
languages.

In this article, we will discuss the data structures in the Python programming language and how they are related to
some specific Python data types. We will discuss the built-in data structures such as lists, tuples, dictionaries, etc., as
well as some advanced data structures like trees, graphs, etc.

Lists

Python Lists are just like the arrays declared in other languages: an ordered collection of data. A list is very
flexible, as the items in a list do not need to be of the same type.

The implementation of a Python List is similar to Vectors in C++ or ArrayList in Java. The costly operation is inserting or
deleting an element from the beginning of the list, as all the elements need to be shifted. Insertion and
deletion at the end of the list can also become costly when the preallocated memory becomes full.

We can create a list in Python as shown below.

Example: Creating a Python List

List = [1, 2, 3, "GFG", 2.3]
print(List)

Output

[1, 2, 3, 'GFG', 2.3]

List elements can be accessed by their index. In Python, the starting index of a list
sequence is 0 and the ending index (if there are N elements) is N-1.
Example: Python List Operations

# Creating a List with
# the use of multiple values
List = ["Geeks", "For", "Geeks"]
print("\nList containing multiple values: ")
print(List)

# Creating a Multi-Dimensional List
# (By Nesting a list inside a List)
List2 = [['Geeks', 'For'], ['Geeks']]
print("\nMulti-Dimensional List: ")
print(List2)

# accessing an element from the
# list using its index number
print("Accessing element from the list")
print(List[0])
print(List[2])

# accessing an element using
# negative indexing
print("Accessing element using negative indexing")

# print the last element of the list
print(List[-1])

# print the third-last element of the list
print(List[-3])

Output

List containing multiple values:
['Geeks', 'For', 'Geeks']

Multi-Dimensional List:
[['Geeks', 'For'], ['Geeks']]

Accessing element from the list
Geeks
Geeks

Accessing element using negative indexing
Geeks
Geeks

Dictionary

A Python dictionary is like a hash table in other languages, with an average lookup time complexity of O(1). It is a
collection of data values used to store data like a map. Unlike other data types that hold only a single
value as an element, a dictionary holds key:value pairs, which is what makes lookups so
efficient.

Indexing of a Python dictionary is done with the help of keys. Keys must be of a hashable type, i.e., an object whose value
can never change, like strings, numbers, tuples, etc. We can create a dictionary by using curly braces ({}) or dictionary
comprehension.

Example: Python Dictionary Operations

# Creating a Dictionary
Dict = {'Name': 'Geeks', 1: [1, 2, 3, 4]}
print("Creating Dictionary: ")
print(Dict)

# accessing an element using a key
print("Accessing a element using key:")
print(Dict['Name'])

# accessing an element using the get() method
print("Accessing a element using get:")
print(Dict.get(1))

# creation using Dictionary comprehension
myDict = {x: x**2 for x in [1,2,3,4,5]}
print(myDict)

Output

Creating Dictionary:
{'Name': 'Geeks', 1: [1, 2, 3, 4]}

Accessing a element using key:
Geeks

Accessing a element using get:
[1, 2, 3, 4]

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

Tuple

A Python Tuple is a collection of Python objects, much like a list, but tuples are immutable in nature, i.e., elements
cannot be added to or removed from the tuple once it is created. Just like a list, a tuple can also contain elements of various
types.

In Python, tuples are created by placing a sequence of values separated by commas, with or without the use of
parentheses for grouping the data sequence.

Note: Tuples can also be created with a single element, but it is a bit tricky. Having one element in parentheses is
not sufficient; there must be a trailing comma to make it a tuple.
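The rules above can be checked directly in the interpreter; this short sketch (variable names are illustrative) shows tuple creation with and without parentheses, the trailing-comma rule, and immutability:

```python
# Tuples can be created with or without parentheses
t1 = (1, 2, "GFG")
t2 = 1, 2, "GFG"          # parentheses are optional
print(t1 == t2)           # True

not_a_tuple = ("GFG")     # just a parenthesized string
single = ("GFG",)         # the trailing comma makes it a tuple
print(type(not_a_tuple))
print(type(single))

try:
    t1[0] = 99            # tuples are immutable: item assignment fails
except TypeError as e:
    print("Cannot modify a tuple:", e)
```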

String

Python strings are arrays of bytes representing Unicode characters. In simpler terms, a string is an immutable array
of characters. Python does not have a character data type; a single character is simply a string with a length of 1.

Note: As strings are immutable, modifying a string will result in creating a new copy.
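A brief sketch of these two points (the variable names are illustrative): a "character" is a length-1 string, and any "modification" produces a new string object while the original stays unchanged:

```python
s = "Geeks"
print(s[0], len(s))       # a single character is just a length-1 string
print(type(s[0]))

upper = s.upper()         # "modifying" returns a new string
print(upper, s)           # the original is unchanged

try:
    s[0] = "g"            # item assignment is not allowed on strings
except TypeError as e:
    print("Strings are immutable:", e)
```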

What Is Artificial Intelligence?

AI enables machines to think and act like us. Yes! Like humans. Imagine a computer or robot that can learn, reason,
solve problems, and even understand language, just like we do. From recognizing patterns to making decisions, AI
mimics the cognitive abilities of the human brain.

By studying how we think and process information, AI systems are designed to think smarter and more efficiently,
enabling them to perform tasks that once required human intelligence. Whether it's recommending your next
favorite movie or driving a car autonomously, AI is shaping the future in ways we once only imagined.

Does this sound intriguing, making you wonder how AI accomplishes all of this? In the following part, we will learn
how AI works, the types of AI, and more. Let's dig in.

Types of Artificial Intelligence

Below are the various types of AI:

1. Purely Reactive

These machines do not have any memory or data to work with, specializing in just one field of work. For example, in
a chess game, the machine observes the moves and makes the best possible decision to win.

2. Limited Memory

These machines collect previous data and continue adding it to their memory. They have enough memory or
experience to make proper decisions, but memory is minimal. For example, such a machine can suggest a restaurant
based on the location data that has been gathered.

3. Theory of Mind

This kind of AI can understand thoughts and emotions, as well as interact socially. However, a machine based on this
type is yet to be built.

4. Self-Aware

Self-aware machines are the future generation of these new technologies. They will be intelligent, sentient, and
conscious.


How Does Artificial Intelligence Work?

AI systems work by merging large data sets with intelligent, iterative processing algorithms. This combination allows AI to learn
from patterns and features in the analyzed data. Each time an Artificial Intelligence system performs a round of data
processing, it tests and measures its performance and uses the results to develop additional expertise.

Now that you know how AI works at a basic level, let's take a closer look at the two major categories of AI and how
they differ in purpose and capabilities.
Weak AI vs. Strong AI

Next up in the what-is-AI discussion is weak and strong AI. When discussing artificial intelligence, it is common to
distinguish between two broad categories: weak AI and strong AI. Let's explore the characteristics of each type:

1. What is Weak AI (Narrow AI)?

Weak AI, or Narrow AI, refers to AI systems that are designed to perform specific tasks and are limited to those tasks
only. These AI systems excel at their designated functions but lack general intelligence. Examples of weak AI include
voice assistants like Siri or Alexa, recommendation algorithms, and image recognition systems. Weak AI operates
within predefined boundaries and cannot generalize beyond its specialized domain.

2. What is Strong AI (General AI)?

Strong AI, also known as general AI, refers to AI systems that possess human-level intelligence or even surpass
human intelligence across a wide range of tasks. Strong AI would be capable of understanding, reasoning, learning,
and applying knowledge to solve complex problems in a manner similar to human cognition. However, the
development of strong AI is still largely theoretical and has not been achieved to date.

Ways of Implementing AI

Let's explore the following ways that explain how we can implement AI:

1. Machine Learning

Machine learning gives AI the ability to learn. This is done by using algorithms to discover patterns and generate
insights from the data it is exposed to.

2. Deep Learning

Deep learning, a subcategory of machine learning, allows AI to mimic a human brain's neural network. It can make
sense of patterns, noise, and sources of confusion in the data.

Consider an image-classification example: various kinds of images can be segregated using deep learning. The machine goes through multiple features of
the photographs and distinguishes them with feature extraction, segregating the features of each photo into
different categories, such as landscape, portrait, or others.

Let us understand how a deep learning model works. A neural network is made up of three main layers:

 Input Layer

 Hidden Layer

 Output Layer

Machine learning is a branch of Artificial Intelligence that focuses on developing models and algorithms that let
computers learn from data without being explicitly programmed for every task. In simple words, ML teaches
systems to think and understand like humans by learning from the data.

Machine Learning is mainly divided into three core types: Supervised, Unsupervised and Reinforcement Learning,
along with two additional types, Semi-Supervised and Self-Supervised Learning.

 Supervised Learning: Trains models on labeled data to predict or classify new, unseen data.

 Unsupervised Learning: Finds patterns or groups in unlabeled data, like clustering or dimensionality
reduction.

 Reinforcement Learning: Learns through trial and error to maximize rewards, ideal for decision-making tasks.

Note: The following are not part of the original three core types of ML, but they have become increasingly important
in real-world applications, especially in deep learning.

Additional Types:

 Self-Supervised Learning: Self-supervised learning is often considered a subset of unsupervised learning, but
it has grown into its own field due to its success in training large-scale models. It generates its own labels
from the data, without any manual labeling.

 Semi-Supervised Learning: This approach combines a small amount of labeled data with a large amount of
unlabeled data. It's useful when labeling data is expensive or time-consuming.

Module 1: Machine Learning Pipeline

In order to make predictions, data passes through a series of steps that together produce a machine learning
model capable of making predictions:

1. ML workflow

2. Data Cleaning

3. Feature Scaling

4. Data Preprocessing in Python
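As a hedged sketch of how these steps fit together (the dataset here, years of experience versus salary, is invented purely for illustration and is not the report's project data), scaling and modelling can be chained with scikit-learn's Pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Steps 1-2: a small, already-clean dataset (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # years of experience
y = np.array([30000, 35000, 41000, 46000, 52000, 58000])  # salary

# Steps 3-4: feature scaling and the prediction model chained into one object
pipe = Pipeline([
    ("scale", StandardScaler()),    # feature scaling
    ("model", LinearRegression()),  # prediction step
])
pipe.fit(X, y)

pred = pipe.predict(np.array([[3.5]]))  # predict salary for 3.5 years' experience
print(pred)
```

The pipeline object applies the same scaling learned during fit to any new data passed to predict, which avoids a common source of leakage between preprocessing and modelling.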

Module 2: Supervised Learning

Supervised learning algorithms are generally categorized into two main types:

 Classification - where the goal is to predict discrete labels or categories

 Regression - where the aim is to predict continuous numerical values.

Supervised Learning

There are many algorithms used in supervised learning, each suited to different types of problems. Some of the most
commonly used supervised learning algorithms are:

1. Linear Regression

This is one of the simplest ways to predict numbers using a straight line. It helps find the relationship between input
and output.

 Introduction to Linear Regression

 Gradient Descent in Linear Regression

 Multiple Linear Regression

 Ridge Regression

 Lasso Regression

 Elastic Net Regression
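As a quick, hedged sketch of fitting a straight line (the data below is synthetic, generated only for illustration), ordinary least squares can be compared with ridge regression, which adds L2 regularization:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic near-linear data: y = 3x + 7 plus a little noise
rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 7.0 + rng.normal(0, 0.5, 20)

lin = LinearRegression().fit(X, y)
print("slope:", lin.coef_[0], "intercept:", lin.intercept_)

# Ridge shrinks the coefficient toward zero; larger alpha = more shrinkage
ridge = Ridge(alpha=10.0).fit(X, y)
print("ridge slope:", ridge.coef_[0])
```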

2. Logistic Regression

Used when the output is a "yes or no" type answer. It helps in predicting categories like pass/fail or spam/not spam.

 Understanding Logistic Regression

 Cost function in Logistic Regression


3. Decision Trees

A model that makes decisions by asking a series of simple questions, like a flowchart. Easy to understand and use.

 Decision Tree in Machine Learning

 Types of Decision Tree algorithms

 Decision Tree - Regression (Implementation)

 Decision Tree - Classification (Implementation)

4. Support Vector Machines (SVM)

A bit more advanced: it tries to draw the best line (or boundary) to separate different categories of data.

 Understanding SVMs

 SVM Hyperparameter Tuning - GridSearchCV

 Non-Linear SVM

5. k-Nearest Neighbors (k-NN)

This model looks at the closest data points (neighbors) to make predictions. Super simple and based on similarity.

 Introduction to KNN

 Decision Boundaries in K-Nearest Neighbors (KNN)

6. Naïve Bayes

A quick and smart way to classify things based on probability. It works well for text and spam detection.

 Introduction to Naive Bayes

 Gaussian Naive Bayes

 Multinomial Naive Bayes

 Bernoulli Naive Bayes

 Complement Naive Bayes

7. Random Forest (Bagging Algorithm)

A powerful model that builds lots of decision trees and combines them for better accuracy and stability.

 Introduction to Random Forest

 Random Forest Classifier

 Random Forest Regression

 Hyperparameter Tuning in Random Forest

Introduction to Ensemble Learning

Ensemble learning combines multiple simple models to create a stronger, smarter model. There are mainly two types
of ensemble learning:

 Bagging, which combines multiple models trained independently.

 Boosting, which builds models sequentially, each correcting the errors of the previous one.
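The bagging/boosting contrast above can be sketched on a toy task (the dataset is synthetic, produced by scikit-learn's make_classification, and the scores are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: many trees trained independently on bootstrap samples
bagging = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: trees built sequentially, each correcting the previous ones' errors
boosting = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("bagging accuracy:", bagging.score(X_te, y_te))
print("boosting accuracy:", boosting.score(X_te, y_te))
```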

Module 3: Unsupervised Learning

Unsupervised learning is again divided into three main categories based on purpose:

 Clustering

 Association Rule Mining

 Dimensionality Reduction

NumPy Array Broadcasting

Broadcasting in NumPy allows us to perform arithmetic operations on arrays of different shapes without reshaping
them. It automatically adjusts the smaller array to match the larger array's shape by replicating its values along the
necessary dimensions. This makes element-wise operations more efficient by reducing memory usage and
eliminating the need for loops. In this article, we will see how broadcasting works.

Let's see an example:

import numpy as np

array_2d = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 10
result = array_2d + scalar
print(result)

Output:

[[11 12 13]
[14 15 16]]

Working of Broadcasting in NumPy

Broadcasting applies specific rules to determine whether two arrays can be aligned for operations:

1. Check Dimensions: Ensure the arrays have the same number of dimensions or expandable dimensions.

2. Dimension Padding: If the arrays have different numbers of dimensions, the smaller array's shape is left-padded with
ones.

3. Shape Compatibility: Two dimensions are compatible if they are equal or one of them is 1.

Example 1: Broadcasting a Scalar to a 1D Array

This creates a NumPy array arr with values [1, 2, 3], then adds the scalar value 1 to each element of the array using
broadcasting.
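Written out (the variable name arr follows the description above), the example is:

```python
import numpy as np

arr = np.array([1, 2, 3])
result = arr + 1   # the scalar 1 is broadcast to every element
print(result)      # [2 3 4]
```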
Example 2: Broadcasting a 1D Array to a 2D Array

This example shows how a 1D array a1 is added to a 2D array a2. NumPy automatically expands the 1D array along
the rows of the 2D array to perform element-wise addition.

import numpy as np

a1 = np.array([2, 4, 6])
a2 = np.array([[1, 3, 5], [7, 9, 11]])
res = a1 + a2
print(res)

Output:

[[ 3 7 11]
[ 9 13 17]]

It allows us to perform addition on arrays of different shapes without needing to reshape them.

Example 3: Broadcasting in Conditional Operations

We may need to apply a condition to an entire array or a subset of it. Broadcasting can help to perform these
operations efficiently without needing loops.

import numpy as np

ages = np.array([12, 24, 35, 45, 60, 72])
age_group = np.array(["Adult", "Minor"])
result = np.where(ages > 18, age_group[0], age_group[1])
print(result)

Output:

['Minor' 'Adult' 'Adult' 'Adult' 'Adult' 'Adult']

In this example, broadcasting is used to efficiently check which elements in the ages array are greater than 18.
Instead of looping through the array, the condition is applied across the entire array using NumPy's where() function.
The result is an array of 'Adult' or 'Minor' based on the ages.

Example 4: Using Broadcasting for Element-Wise Multiplication

In this example, each element of a 2D matrix is multiplied by the corresponding element in a broadcasted vector.

import numpy as np

matrix = np.array([[1, 2], [3, 4]])
vector = np.array([10, 20])
result = matrix * vector
print(result)

Output:

[[10 40]
[30 80]]

The 1D vector is broadcast across the rows of the 2D matrix. This allows element-wise multiplication of
corresponding elements between the matrix and the vector.

Example 5: Scaling Data with Broadcasting

Consider a real-world scenario where we need to calculate the total calories in foods based on the amount of fats,
proteins and carbohydrates. Each nutrient has a specific caloric value per gram:

 Fats: 9 calories per gram (CPG)

 Proteins: 4 CPG

 Carbohydrates: 4 CPG

The left table shows the original data with food items and their respective grams of fats, proteins and carbs. The array [9,
4, 4] represents the caloric values per gram for fats, proteins and carbs respectively. This array is broadcast to
match the dimensions of the original data; the arrow indicates the broadcasting operation.

 The broadcast array is multiplied element-wise with each row of the original data.

 As a result, the right table shows the result of the multiplication, where each cell represents the caloric
contribution of that specific nutrient in the food item.
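Since the food table itself is not reproduced here, the sketch below invents two food rows for illustration; the per-gram values [9, 4, 4] are the ones stated above:

```python
import numpy as np

# Rows: grams of (fats, proteins, carbs) per food item.
# The food items and gram values are invented for illustration only.
foods = np.array([
    [10, 20, 30],
    [5,  15, 50],
])

cpg = np.array([9, 4, 4])   # calories per gram: fats, proteins, carbs

# cpg (shape (3,)) is broadcast across each row of foods (shape (2, 3))
calories_per_nutrient = foods * cpg
total_calories = calories_per_nutrient.sum(axis=1)
print(calories_per_nutrient)
print(total_calories)
```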

Pandas

Pandas is an open-source software library designed for data manipulation and analysis. It provides data structures
like Series and DataFrames to easily clean, transform and analyze large datasets, and it integrates with other Python
libraries such as NumPy and Matplotlib.

It offers functions for data transformation, aggregation and visualization, which are important for analysis. Created by
Wes McKinney in 2008, Pandas is widely used by data scientists, analysts and researchers worldwide. Pandas revolves
around two primary data structures: Series (1D) for single columns and DataFrame (2D) for tabular data, enabling
efficient data manipulation.

Important Facts to Know:

 DataFrames: A two-dimensional data structure constructed with rows and columns, similar
to an Excel spreadsheet.
 pandas: The name is derived from the term "panel data", an econometrics term for multi-dimensional data sets.

What is Pandas Used for?

With pandas, you can perform a wide range of data operations, including:

 Reading and writing data from various file formats like CSV, Excel and SQL databases.

 Cleaning and preparing data by handling missing values and filtering entries.

 Merging and joining multiple datasets seamlessly.

 Reshaping data through pivoting and stacking operations.

 Conducting statistical analysis and generating descriptive statistics.

 Visualizing data with integrated plotting capabilities.

Why Learn Pandas

Here's why it's worth learning:

 It offers a simple and intuitive way to work with structured data, especially using DataFrames.

 Makes data exploration easy, so you can quickly understand patterns or spot issues.

 Saves time by reducing the need for complex code.

 It's widely used in industries like finance, healthcare, marketing and research.

 A must-have skill for data science, analytics and machine learning roles.

Pandas Basics

In this section, we will explore the fundamentals of Pandas. We will start with an introduction to Pandas, learn how
to install it and get familiar with its functionalities. Additionally, we will cover how to use Jupyter Notebook, a
popular tool for interactive coding. By the end of this section, we will have a solid understanding of how to set up and
start working with Pandas for data analysis.

 Pandas Introduction

 Pandas Installation

 Getting started with Pandas

 How To Use Jupyter Notebook

Pandas DataFrame

A DataFrame is a two-dimensional, size-mutable and potentially heterogeneous tabular data structure with labeled
axes (rows and columns).

 Creating a DataFrame

 Pandas Dataframe Index

 Pandas Access DataFrame

 Indexing and Selecting Data with Pandas

 Slicing Pandas Dataframe

 Filter Pandas Dataframe with multiple conditions

 Merging, Joining and Concatenating Dataframes

 Sorting Pandas DataFrame

 Pivot Table in Pandas
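A brief sketch of the operations listed above (the column names and values are invented for illustration):

```python
import pandas as pd

# A small DataFrame built from a dict of columns
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Neha"],
    "Salary": [52000, 61000, 58000],
})

print(df.head())
print(df["Salary"].mean())        # column selection + aggregation
print(df[df["Salary"] > 55000])   # filtering with a boolean condition
print(df.sort_values("Salary"))   # sorting by a column
```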

Pandas Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point
numbers, Python objects, etc.). It's similar to a column in a spreadsheet or a database table.

 Creating a Series

 Accessing elements of a Pandas Series

 Binary Operations on Series

 Pandas Series Index() Methods

 Create a Pandas Series from array

Data Input and Output (I/O)

Pandas offers a variety of functions to read data from and write data to different file formats, as given below:

 Read CSV Files with Pandas

 Writing data to CSV Files

 Export Pandas dataframe to a CSV file

 Read JSON Files with Pandas

 Parsing JSON Dataset

 Exporting Pandas DataFrame to JSON File

 Working with Excel Files in Pandas

 Read Text Files with Pandas

 Text File to CSV using Python Pandas

Data Cleaning in Pandas

Data cleaning is an essential step in data preprocessing to ensure accuracy and consistency. Here are some articles to
know more about it:

 Handling Missing Data

 Removing Duplicates

 Pandas Change Datatype

 Drop Empty Columns in Pandas

 String manipulations in Pandas

 String methods in Pandas

 Detect Mixed Data Types and Fix it

Pandas Operations

We will cover data processing, normalization, manipulation and analysis, along with techniques for grouping and
aggregating data. These concepts will help you efficiently clean, transform and analyze datasets. By the end of this
section, you'll be able to use Pandas operations to handle real-world data effectively.

 Data Processing with Pandas

 Data Normalization in Pandas

 Data Manipulation in Pandas

 Data Analysis using Pandas

 Grouping and Aggregating with Pandas

 Different Types of Joins in Pandas

Advanced Pandas Operations

In this section, we will explore advanced Pandas functionalities for deeper data analysis and visualization. We will
cover techniques for finding correlations, working with time series data and using Pandas' built-in plotting functions
for effective data visualization.

Feature Engineering: Scaling, Normalization, and Standardization

Feature Scaling is a technique to standardize the independent features present in the data. It is performed during
data pre-processing to handle highly varying values. If feature scaling is not done, a machine learning algorithm
tends to treat greater values as higher and smaller values as lower, regardless of the unit of the values. For
example, it will treat 10 m and 10 cm as the same, because both are just the number 10. In this article we will learn about different
techniques which are used to perform feature scaling.

1. Absolute Maximum Scaling

This method of scaling requires two steps:

1. We first select the maximum absolute value out of all the entries of a particular column.

2. We then divide each entry of the column by this maximum absolute value.

X_scaled = X_i / max(|X|)

After performing the above two steps we will observe that each entry of the column lies in the range -1
to 1. But this method is not used that often; the reason is that it is too sensitive to outliers, and when
dealing with real-world data the presence of outliers is a very common thing.

For demonstration purposes we will use a dataset that is a simpler
version of the original house price prediction dataset, having only two columns from the original dataset. The first five
rows of the data are shown below:

import pandas as pd

df = pd.read_csv('SampleFile.csv')
print(df.head())

Output:

LotArea MSSubClass
0 8450 60
1 9600 20
2 11250 60
3 9550 70
4 14260 60
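The two steps above can be applied directly; since the report's SampleFile.csv is not reproduced here, this sketch uses an inline copy of the five rows shown above:

```python
import pandas as pd

# Inline copy of the rows printed above (the report reads them from a CSV)
df = pd.DataFrame({
    "LotArea":    [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 60, 70, 60],
})

# Step 1: maximum absolute value of each column; Step 2: divide every entry by it
scaled_df = df / df.abs().max()
print(scaled_df)
```

After scaling, the largest entry in each column becomes exactly 1 and every value lies in [-1, 1].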

2. Min-Max Scaling

This method of scaling requires the below two steps:

1. First we find the minimum and the maximum value of the column.

2. Then we subtract the minimum value from each entry and divide the result by the difference between the
maximum and the minimum value.

X_scaled = (X_i - X_min) / (X_max - X_min)

As we are using the maximum and the minimum value, this method is also prone to outliers, but after performing
the above two steps the data will lie in the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
scaled_df.head()

Output:

LotArea MSSubClass
0 0.033420 0.235294
1 0.038795 0.000000
2 0.046507 0.235294
3 0.038561 0.294118
4 0.060576 0.235294

3. Normalization

Normalization is the process of rescaling each data point (row vector) so that it has a length of 1. This is done by
dividing each data point by its "length", called the Euclidean norm. Think of it like resizing a vector so that it
fits within a standard length of 1.

The formula for Normalization looks like this:

X_scaled = X_i / ‖X‖

Where:

 X_i is each individual value.

 ‖X‖ represents the Euclidean norm (or length) of the vector X.

from sklearn.preprocessing import Normalizer

scaler = Normalizer()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print(scaled_df.head())

Output:

LotArea MSSubClass

0 0.999975 0.007100
1 0.999998 0.002083

2 0.999986 0.005333

3 0.999973 0.007330

4 0.999991 0.004208
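To see what the Normalizer does to a single row, we can divide the row vector by its Euclidean norm with NumPy and check that the result has unit length:

```python
import numpy as np

# First row of the sample data: (LotArea, MSSubClass)
row = np.array([8450.0, 60.0])

# X_i / ||X|| : divide the row by its Euclidean norm
normalized = row / np.linalg.norm(row)
print(normalized)                   # dominated by LotArea, matching the output above
print(np.linalg.norm(normalized))   # ~1.0 -- unit length
```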

4. Standardization

This method of scaling is based on the central tendency and variance of the data.

1. First we calculate the mean and standard deviation of the data we would like to standardize.

2. Then we subtract the mean value from each entry and divide the result by the standard deviation.

This rescales the data to have a mean equal to zero and a standard deviation equal to 1.

X_scaled = (X_i − X_mean) / σ

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print(scaled_df.head())

Output:

LotArea MSSubClass

0 -0.207142 0.073375

1 -0.091886 -0.872563

2 0.073480 0.073375

3 -0.096897 0.309859

4 0.375148 0.073375
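We can verify the standardization formula by hand: after subtracting the mean and dividing by the standard deviation, the column's mean is (numerically) 0 and its standard deviation is 1:

```python
import numpy as np

# Five-row sample of the LotArea column (illustrative)
x = np.array([8450, 9600, 11250, 9550, 14260], dtype=float)

# (X_i - X_mean) / sigma
z = (x - x.mean()) / x.std()
print(round(abs(z.mean()), 10), round(z.std(), 10))  # -> 0.0 1.0
```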

Customizing Matplotlib Visualizations

Matplotlib allows many ways to customize and style our plots. We can change colors, add labels, adjust
styles and much more. By applying these customization techniques to basic plots we can make our visualizations
clearer and more informative. Let's see the various ways of customizing:

1. Customizing Line Chart
We can customize line charts using these properties:
1. Color: Change the color of the line
2. Linewidth: Adjust the width of the line
3. Marker: Change the style of the plotted points
4. Markersize: Change the size of the markers
5. Linestyle: Define the style of the line, like solid, dashed, etc.
Example:
import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
plt.plot(x, y, color='green', linewidth=3, marker='o',
         markersize=15, linestyle='--')
plt.title("Customizing Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()

2. Customizing Bar Chart

Example:

import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('/content/tip.csv')
x = data['day']
y = data['total_bill']
plt.bar(x, y, color='green', edgecolor='blue',
        linewidth=2)
plt.title("Customizing Bar Chart")
plt.ylabel('Total Bill')
plt.xlabel('Day')
plt.show()
3. Customizing Histogram Plot

Example:

import matplotlib.pyplot as plt

import pandas as pd

data = pd.read_csv('/content/tip.csv')

x = data['total_bill']

plt.hist(x, bins=25, color='green', edgecolor='blue',
         linestyle='--', alpha=0.5)

plt.title("Customizing Histogram Plot")

plt.ylabel('Frequency')

plt.show()

Introduction to Streamlit and Plotly

Streamlit is an open-source Python framework that allows developers to create interactive web applications with
ease. It's designed to help data scientists and machine learning engineers turn data scripts into shareable web apps
in just a few lines of code. Streamlit's simplicity and flexibility have made it a popular choice among data
professionals.

Plotly, on the other hand, is a versatile library that enables the creation of beautiful and interactive plots. It supports
over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional
use-cases. Plotly's interactivity gives users the ability to zoom, pan, hover, and drill down into the visualizations,
making data exploration intuitive and informative.

When used together, Streamlit and Plotly form a powerful combination, allowing developers to create interactive
dashboards with complex visualizations with relative ease.

Advanced Streamlit and Plotly Techniques

While creating basic interactive visualizations with Streamlit and Plotly is simple, these tools also offer advanced
features that allow for more complex and customized visualizations.

One such feature is the ability to update Plotly figures in Streamlit, using the update_layout and update_traces
methods in Plotly. For instance, you can update the layout of a figure to change its title or dimensions, and you can
update the traces of a figure to change the properties of the plotted data, such as the marker color.

Another useful technique is resolving sizing issues with Plotly charts in Streamlit. Sometimes a Plotly chart does
not fit well within the layout of a Streamlit app, causing it to be cut off or to overlap with other elements. This can
be resolved by adjusting the figure's height and width, or by passing use_container_width=True to the
st.plotly_chart function:

import streamlit as st

import plotly.express as px

import pandas as pd

df = pd.read_csv('data.csv')

chart_type = st.selectbox('Choose a chart type', ['Bar', 'Line'])

if chart_type == 'Bar':
    fig = px.bar(df, x='Fruit', y='Amount', color='City', barmode='group')
elif chart_type == 'Line':
    fig = px.line(df, x='Fruit', y='Amount', color='City')

st.plotly_chart(fig, use_container_width=True)

Regression in machine learning refers to a supervised learning technique where the goal is to predict a continuous
numerical value based on one or more independent features. It finds relationships between variables so that
predictions can be made. We have two types of variables present in regression:

 Dependent Variable (Target): The variable we are trying to predict, e.g. house price.

 Independent Variables (Features): The input variables that influence the prediction, e.g. locality, number of
rooms.

Regression analysis applies when the output variable is a real or continuous value, such as "salary" or "weight".
Many different regression models can be used, but the simplest among them is linear regression.

Types of Regression

Regression can be classified into different types based on the number of predictor variables and the nature of the
rela onship between variables:

1. Simple Linear Regression

Linear regression is one of the simplest and most widely used statistical models. It assumes that there is a linear
relationship between the independent and dependent variables, meaning that the change in the dependent
variable is proportional to the change in the independent variables. For example, predicting the price of a house
based on its size.
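A minimal sketch of simple linear regression with scikit-learn, using made-up house sizes and prices (the numbers are purely illustrative and lie on an exact linear trend):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size (sq ft) vs price, on an exact linear trend
size = np.array([[500], [1000], [1500], [2000]])
price = np.array([50.0, 100.0, 150.0, 200.0])

# Fit y = a*x + b by least squares
model = LinearRegression().fit(size, price)
print(model.predict([[1200]]))  # -> [120.]
```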

2. Multiple Linear Regression

Multiple linear regression extends simple linear regression by using multiple independent variables to predict the
target variable. For example, predicting the price of a house based on multiple features such as size, location,
number of rooms, etc.

3. Polynomial Regression

Polynomial regression is used to model non-linear relationships between the dependent variable and the
independent variables. It adds polynomial terms to the linear regression model to capture more complex
relationships. For example, when we want to predict a non-linear trend like population growth over time, we use
polynomial regression.

4. Ridge & Lasso Regression

Ridge and lasso regression are regularized versions of linear regression that help avoid overfitting by penalizing
large coefficients. When there's a risk of overfitting due to too many features, we use these types of regression
algorithms.

5. Support Vector Regression (SVR)

SVR is a regression algorithm based on the Support Vector Machine (SVM) algorithm. SVM is primarily used for
classification tasks, but it can also be used for regression. SVR works by finding a hyperplane that minimizes the
sum of the squared residuals between the predicted and actual values.

6. Decision Tree Regression

Decision tree regression uses a tree-like structure to make decisions, where each branch of the tree represents a
decision and the leaves represent outcomes. For example, we use decision tree regression when predicting
customer behavior based on features like age, income, etc.

7. Random Forest Regression

Random forest is an ensemble method that builds multiple decision trees, where each tree is trained on a different
subset of the training data. The final prediction is made by averaging the predictions of all the trees. For example,
we can model customer churn or sales data using this method.
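The averaging idea above can be sketched on toy data (the dataset and hyperparameters are illustrative): each tree is fit on a bootstrap sample, and the forest averages their predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy non-linear target: y = x^2
X = np.arange(10).reshape(-1, 1).astype(float)
y = X.ravel() ** 2

# 50 trees, each fit on a bootstrap sample; predictions are averaged
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(reg.predict([[5.0]]))  # close to 25, the true value of 5^2
```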

List of Machine Learning Classification Algorithms

Classification algorithms organize and understand complex datasets in machine learning. These algorithms are
essential for categorizing data into classes or labels, automating decision-making and pattern identification.
Classification algorithms are often used to detect email spam by analyzing email content. These algorithms enable
machines to quickly recognize spam trends and make real-time judgments, improving email security.

Some of the top-ranked machine learning algorithms for classification are:

1. Logistic Regression

2. Decision Tree

3. Random Forest

4. Support Vector Machine (SVM)

5. Naive Bayes

6. K-Nearest Neighbors (KNN)

Let us look at each of them one by one:

1. Logistic Regression Classification Algorithm in Machine Learning

Logistic regression is a classification algorithm used to estimate discrete values, typically binary, such as 0 and 1,
yes or no. It predicts the probability of an instance belonging to a class, which makes it essential for binary
classification problems like spam detection or diagnosing disease.

Logistic functions are ideal for classification problems since their output is between 0 and 1. Many fields employ
logistic regression because of its simplicity, interpretability, and efficiency. It works well when the relationship
between the features and the log-odds of the event is linear. Despite its name, it is a linear model that predicts
class membership likelihood, with a logistic function modeling the probability.
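A minimal sketch, assuming a made-up pass/fail dataset: the model outputs a probability between 0 and 1 and thresholds it to a class label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.0], [6.0]]))    # hard class labels -> [0 1]
print(clf.predict_proba([[3.5]])[0])  # the two class probabilities sum to 1
```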

2. Decision Tree

Decision trees are versatile and simple classification and regression techniques. Recursively splitting the dataset
into subgroups by key criteria produces a tree-like structure: decisions at internal nodes lead down to leaf nodes
that hold the outcomes. Decision trees are easy to understand and depict, making them useful for decision-making.
Overfitting may occur, so pruning improves generality. A decision tree is a tree-like model of decisions and their
consequences, including chance event outcomes, resource costs and utility.

3. Random Forest

Random forests are an ensemble learning technique that combines multiple decision trees to improve predictive
accuracy and control over-fitting. By aggregating the predictions of numerous trees, random forests enhance
the decision-making process, making them robust against noise and bias.

Random forest uses numerous decision trees to increase prediction accuracy and reduce overfitting. It constructs
many trees and integrates their predictions to create a reliable model. Diversity is added by using a random subset
of the data and of the features in each tree. Random forests excel at high-dimensional data, provide feature
importance metrics, and resist overfitting. Many fields use them for classification and regression.
4. Support Vector Machine (SVM)

SVM is an effective classification and regression algorithm. It seeks the hyperplane that best separates the classes
while maximizing the margin. SVM works well in high-dimensional spaces and handles nonlinear feature
interactions with its kernel technique. It is a powerful classification algorithm known for its accuracy in
high-dimensional spaces.

5. Naive Bayes

Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem that benefits text categorization and
spam filtering. Despite its simplicity and "naive" assumption of feature independence, Naive Bayes often works well
in practice. It uses conditional probabilities of features to calculate the class likelihood of an instance, and it
handles high-dimensional datasets quickly.

6. K-Nearest Neighbors (KNN)

KNN uses the majority class of the k nearest neighbours for simple and adaptive classification and regression. KNN
is non-parametric and makes no assumptions about the data distribution. It works well with uneven decision
boundaries and performs well across varied tasks. K-Nearest Neighbors is an instance-based, or lazy learning,
algorithm, where the function is only approximated locally and all computation is deferred until function
evaluation.
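The KNN idea above can be sketched in a few lines (toy data, k = 3): the predicted class is the majority label among the three nearest training points.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters on a line
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# Each prediction takes the majority vote of the 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.5], [10.5]]))  # -> [0 1]
```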

Natural Language Processing (NLP)

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that helps machines to understand and
process human languages, in either text or audio form. It is used across a variety of applications, from speech
recognition to language translation and text summarization.
Natural Language Processing can be categorized into two components:

1. Natural Language Understanding: It involves interpreting the meaning of the text.

2. Natural Language Generation: It involves generating human-like text based on processed data.


Phases of Natural Language Processing

NLP involves a series of phases that work together to process and interpret language, with each phase contributing
to understanding its structure and meaning.


Libraries for NLP

Some popular natural language processing libraries include:

 NLTK (Natural Language Toolkit)

 spaCy

 TextBlob

 Transformers (by Hugging Face)

 Gensim


Normalizing Textual Data in NLP

Text normalization transforms text into a consistent format, which improves quality and makes it easier to process
in NLP tasks.

Key steps in text normalization include:

1. Regular Expressions (RE) are sequences of characters that define search patterns.

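As a small illustration of search patterns, the sketch below extracts email addresses with a deliberately simplified, hypothetical pattern (a production matcher would be more elaborate):

```python
import re

text = "Contact us at support@example.com or sales@example.org."

# Simplified pattern: local part, '@', domain, dot, top-level domain
emails = re.findall(r"[\w.+-]+@[\w-]+\.\w+", text)
print(emails)  # -> ['support@example.com', 'sales@example.org']
```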

2. Tokenization is a process of splitting text into smaller units called tokens. Common tokenization approaches
include:

 Word Tokenization

 Rule-based Tokenization

 Subword Tokenization

 Dictionary-Based Tokenization

 Whitespace Tokenization

 WordPiece Tokenization

3. Lemmatization reduces words to their base or root form.

4. Stemming reduces words to their root by removing suffixes. Types of stemmers include:

 Porter Stemmer

 Lancaster Stemmer

 Snowball Stemmer

 Rule-based Stemming

5. Stopword removal is a process to remove common words from the document.

6. Parts of Speech (POS) Tagging assigns a part of speech to each word in a sentence based on definition and
context.
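A library-free sketch of steps 2 and 5 above, combining regex tokenization with stopword removal (the tiny stopword list is purely illustrative):

```python
import re

text = "The striped bats are hanging on their feet!"
stopwords = {"the", "are", "on", "their"}  # tiny illustrative stopword list

tokens = re.findall(r"[a-z]+", text.lower())          # tokenization via a regex
filtered = [t for t in tokens if t not in stopwords]  # stopword removal
print(filtered)  # -> ['striped', 'bats', 'hanging', 'feet']
```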

Computer Vision – Image Data Processing & Use Cases

The consistent advancement of AI is transforming the dynamics between businesses and customers. Image
processing in computer vision is one such revolutionary technology that enables businesses to create an impeccable
experience for their customers. It provides expertise, efficiency and excellence by bridging the gap between
imperfection and perfection. Be it extracting details from a sea of information or finding a specific product amidst
thousands of similar products, the user is able to do it in a few taps thanks to computer vision. Below are some
more use cases of image processing in computer vision.
Identifying Defects in Machines for Seamless Operation

Manufacturing companies are leveraging image processing to combat potential hazards by detecting defective
components or materials during the manufacturing process. It multiplies production speed with accuracy and
streamlines operations. Image processing in computer vision has enhanced product visibility in the
manufacturing process and eliminated response downtime, ensuring seamless operations. This subsequently
enables manufacturers to deliver a quality product in limited time while earning maximum profits.

Classifying Images for Enhanced User Experience

Image processing in computer vision categorizes images according to their features, patterns and content. It eases
the process of searching for a similar type of product amidst millions of products. Shoppers can upload a picture of
an existing product for which they want an alternative option, and computer vision will provide them with products
of similar-looking patterns and features in a few seconds. Such a seamless experience increases customer retention
and business sales. Image processing has reshaped the e-commerce industry by improving the interaction between
consumers and service providers.

Enabling Autonomous Driving Using Image Processing in Computer Vision

Image processing in computer vision detects text, objects and images and understands what they symbolize. So
when a car incorporates such technology, it will automatically detect a pedestrian walking on the road or
understand the traffic signal light and act accordingly. For example, the car will stop if it sees a red light on a
signal, or slow down when it recognizes a school area. Such automation is not only providing comfort and
convenience but also enhancing users' safety and driving experience.

Providing Facial Recognition for Enhanced Security

Image processing in computer vision combined with other AI applications has strengthened the potential of local
and national security. Image processing recognizes blacklisted faces, and computer vision further retrieves every
piece of information regarding that image. This helps the government combat terrorism and maintain peace in the
country.
SALARY PREDICTION PROJECT

import streamlit as st
import pandas as pd
import numpy as np
from model import train_model, predict_salary, get_model_metrics
from preprocess import preprocess_input
import plotly.express as px
import plotly.graph_objects as go

def load_css():
    with open('style.css') as f:
        st.markdown(f'<style>{f.read()}</style>', unsafe_allow_html=True)

load_css()

data = pd.read_csv('data/salary_dataset.csv')

# Train Model
model, X, y = train_model(data)

# Streamlit UI
st.markdown("<h1 style='text-align: center; color: #4CAF50;'>SalarySense AI: Smart Salary Predictor 💰</h1>",
            unsafe_allow_html=True)
st.markdown("<p style='text-align: center;'>Predict your salary with real-time insights and feature impact analysis.</p>",
            unsafe_allow_html=True)

st.sidebar.header('Enter Your Profile Details')
experience = st.sidebar.slider('Years of Experience', 0, 20, 1)
education = st.sidebar.selectbox('Education Level', ['Bachelor', 'Master', 'PhD'])
company = st.sidebar.selectbox('Company Type', ['Service', 'Product', 'Startup'])
city = st.sidebar.selectbox('City', ['Delhi', 'Mumbai', 'Bangalore'])

if st.sidebar.button('Predict Salary'):
    input_df = preprocess_input(experience, education, company, city)
    predicted_salary = predict_salary(model, input_df)
    st.markdown(f"<h2 style='color: #ff6347;'>Predicted Salary: ₹{predicted_salary[0]:,.2f}</h2>",
                unsafe_allow_html=True)

    # Plot using Plotly
    fig = px.scatter(data, x='Experience', y='Salary', color='Education',
                     title='Experience vs Salary with Education Levels')
    fig.add_traces(go.Scatter(x=data['Experience'], y=model.predict(X),
                              mode='lines', name='Regression Line',
                              line=dict(color='red')))
    st.plotly_chart(fig)

    # Model Performance
    mae, mse, r2 = get_model_metrics(model, X, y)
    st.subheader('Model Performance Metrics')
    st.write(f"**Mean Absolute Error (MAE):** {mae:.2f}")
    st.write(f"**Mean Squared Error (MSE):** {mse:.2f}")
    st.write(f"**R² Score:** {r2:.2f}")
    st.success('Prediction complete! 🎉')

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def train_model(data):
    X = data[['Experience', 'Education', 'Company', 'City']]
    y = data['Salary']

    categorical_features = ['Education', 'Company', 'City']

    # One-hot encode the categorical columns; pass 'Experience' through unchanged
    preprocessor = ColumnTransformer(transformers=[
        ('cat', OneHotEncoder(), categorical_features)
    ], remainder='passthrough')

    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])
    model.fit(X, y)
    return model, X, y

def predict_salary(model, input_df):
    return model.predict(input_df)

def get_model_metrics(model, X, y):
    y_pred = model.predict(X)
    mae = mean_absolute_error(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    return mae, mse, r2

import pandas as pd

def preprocess_input(experience, education, company, city):
    # Wrap the single user profile in a one-row DataFrame for the pipeline
    input_df = pd.DataFrame({
        'Experience': [experience],
        'Education': [education],
        'Company': [company],
        'City': [city]
    })
    return input_df

/* Custom Button Styling */


.stButton>button {
background-color: #4CAF50;
color: white;
padding: 10px 24px;
border: none;
border-radius: 12px;
font-size: 16px;
transition-duration: 0.4s;
}

.stButton>button:hover {
background-color: #45a049;
}
[data-testid="stSidebar"] {
background-color: #f0f0f5;
}

Experience,Education,Company,City,Salary
2,Bachelor,Service,Delhi,40000
5,Master,Product,Mumbai,90000
3,Bachelor,Startup,Bangalore,60000
7,PhD,Product,Delhi,150000
4,Master,Service,Bangalore,75000
6,Bachelor,Startup,Mumbai,85000
8,PhD,Product,Delhi,160000
CONCLUSION

I collected all the raw data from online resources, books and research articles, and then formatted this raw data. In
this project, I first gathered the dataset from the Kaggle website and visualized the data. After that, I checked
whether the data set contained any null values and removed all the null values found. I then handled all the
categorical features in the data set by using the dummy variable technique. Later on, I used a Random Forest model
and measured the accuracy of the model. Future modifications can be made in order to increase the accuracy of the
model.

Acknowledgement

I would like to express my special thanks to my project guide Mr. Sourav Goswami, as well as to Ardent Computech,
who gave me the golden opportunity to be trained and to do this wonderful real-time project on the topic of the
SALARY PREDICTION PROJECT, which also helped me do a lot of research of my own and gather new knowledge.
