A Project Report
for Industrial Training and Internship
Submitted by
ADITYA KUMAR GUPTA
STUDENT ID:- 249011001185
In partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
In the
COMPUTER SCIENCE & ENGINEERING
OF
RAMGARH ENGINEERING COLLEGE
At
___________________________
Date: dd/mm/yy
SIGNATURE
Name:
PROJECT MENTOR
SIGNATURE
Name:
EXAMINERS
Ardent Original Seal
1. Title of the Project: SALARY PREDICTION PROJECT
2. Project Member: ADITYA KUMAR GUPTA 23033445002
KUMARI NEHA 23033445018
PUNITA SINGH 23033445033
SNEHA KUMARI 23033445047
AJIT KUMAR SINGH 23033445003
3. Name of the guide: Mr. SOURAV GOSWAMI
4. Address: Ardent Computech Pvt. Ltd (An ISO 9001:2008 Certified)
CF-137, Sector - 1, Salt Lake City, Kolkata - 700064
Signature of Approver
Date:
Mr. Sourav Goswami
Ardent Original Seal
ACKNOWLEDGEMENT
The achievement that is associated with the successful completion of any task
would be incomplete without mentioning the names of those people whose
endless cooperation made it possible. Their constant guidance and
encouragement made all our efforts successful.
We take this opportunity to express our deep gratitude towards our project
mentor, Mr. SOURAV GOSWAMI, for giving such valuable suggestions,
guidance and encouragement during the development of this project work.
Last but not least, we are grateful to all the faculty members of Ardent
Computech Pvt. Ltd. for their support.
Table of Contents
DATA SCIENCE, ARTIFICIAL INTELLIGENCE,
MACHINE LEARNING AND BIG DATA
Python is a popular programming language. It was created by Guido van Rossum, and released in 1991.
It is used for:
software development,
mathematics.
Python can connect to database systems. It can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Python can be used for rapid prototyping, or for production-ready software development.
Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
Python has syntax that allows developers to write programs with fewer lines than some other programming
languages.
Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This means
that prototyping can be very quick.
Python can be treated in a procedural way, an object-oriented way or a functional way.
Good to know
The most recent major version of Python is Python 3, which we shall be using in this tutorial. However,
Python 2, although not being updated with anything other than security updates, is still quite popular.
In this tutorial Python will be written in a text editor. It is possible to write Python in an Integrated
Development Environment, such as Thonny, PyCharm, NetBeans or Eclipse, which are particularly useful when
managing larger collections of Python files.
Python was designed for readability, and has some similarities to the English language with influence from
mathematics.
Python uses new lines to complete a command, as opposed to other programming languages which often
use semicolons or parentheses.
Python relies on indentation, using whitespace, to define scope, such as the scope of loops, functions and
classes. Other programming languages often use curly brackets for this purpose.
Example
print("Hello, World!")
Data Structures are a way of organizing data so that it can be accessed more efficiently depending upon the
situation. Data Structures are fundamentals of any programming language, around which a program is built. Python
helps to learn the fundamentals of these data structures in a simpler way as compared to other programming
languages.
In this article, we will discuss the Data Structures in the Python Programming Language and how they are related to
some specific Python Data Types. We will discuss all the built-in data structures like lists, tuples, dictionaries, etc. as
well as some advanced data structures like trees, graphs, etc.
Lists
Python Lists are just like arrays declared in other languages: an ordered collection of data. Lists are very
flexible, as the items in a list do not need to be of the same type.
The implementation of a Python List is similar to Vectors in C++ or ArrayList in Java. The costly operation is inserting or
deleting an element at the beginning of the List, as all the elements need to be shifted. Insertion and
deletion at the end of the list can also become costly in the case where the preallocated memory becomes full.

List = ["Geeks", "For", "Geeks"]
print(List)

Output

['Geeks', 'For', 'Geeks']

List elements can be accessed by the assigned index. In Python the starting index of a list sequence is 0, and the ending
index (if N elements are there) is N-1.
Example: Python List Operations

List = ["Geeks", "For", "Geeks"]
List2 = [1, 2, 3, 4]   # second list values are illustrative
print(List)
print(List2)

# accessing elements by positive index
print(List[0])
print(List[2])

# negative indexing: -1 refers to the last element
print(List[-1])
# print the third last element of list
print(List[-3])

Output

['Geeks', 'For', 'Geeks']
[1, 2, 3, 4]
Geeks
Geeks
Geeks
Geeks
Dictionary
A Python dictionary is like a hash table in other languages, with a time complexity of O(1) for lookups. It is an unordered
collection of data values, used to store data like a map: unlike other data types that hold only a single
value as an element, a Dictionary holds key:value pairs. The key-value structure is what makes the dictionary so
efficient.
Indexing of a Python Dictionary is done with the help of keys. Keys can be of any hashable type, i.e. an object whose
value can never change, like strings, numbers or tuples. We can create a dictionary by using curly braces ({}) or dictionary
comprehension.

Dict = {'Name': 'Geeks', 1: [1, 2, 3, 4]}
print(Dict)

# accessing an element using a key
print(Dict['Name'])

# accessing an element using the get() method
print("Accessing an element using get:")
print(Dict.get(1))

Output

{'Name': 'Geeks', 1: [1, 2, 3, 4]}
Geeks
Accessing an element using get:
[1, 2, 3, 4]
Tuple
A Python Tuple is a collection of Python objects, much like a list, but Tuples are immutable in nature, i.e. elements
cannot be added to or removed from the tuple once it is created. Just like a List, a Tuple can also contain elements of
various types.
In Python, tuples are created by placing a sequence of values separated by commas, with or without the use of
parentheses for grouping of the data sequence.
Note: Tuples can also be created with a single element, but it is a bit tricky. Having one element in the parentheses is
not sufficient; there must be a trailing comma to make it a tuple, as the example below shows.
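A short illustration of the note above:

# a tuple with several values; parentheses are optional
t = 1, 'Geeks', 3.5
print(type(t))            # <class 'tuple'>

# a single-element tuple needs the trailing comma
single = ('Geeks',)
not_a_tuple = ('Geeks')
print(type(single))       # <class 'tuple'>
print(type(not_a_tuple))  # <class 'str'>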
String
Python Strings are arrays of bytes representing Unicode characters. In simpler terms, a string is an immutable array
of characters. Python does not have a character data type; a single character is simply a string with a length of 1.
Note: As strings are immutable, modifying a string will result in creating a new copy.
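A brief illustration:

s = "Geeks"
print(s[0], s[-1])   # indexing works like an array: G s

# strings are immutable, so replace() returns a new string
t = s.replace("G", "P")
print(s, t)          # Geeks Peeks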
By studying how we think and process information, AI systems are designed to think smarter and more efficiently,
enabling them to perform tasks that once required human intelligence. Whether it's recommending your next
favorite movie or driving a car autonomously, AI is shaping the future in ways we once only imagined.
Does this sound intriguing, making you wonder how AI accomplishes all of this? In the following part, we will learn
how AI works, types of AI, and more. Let's dig in. To explore AI further, check out this artificial intelligence tutorial.
1. Purely Reactive
These machines do not have any memory or data to work with, specializing in just one field of work. For example, in
a chess game, the machine observes the moves and makes the best possible decision to win.
2. Limited Memory
These machines collect previous data and continue adding it to their memory. They have enough memory or
experience to make proper decisions, but memory is minimal. For example, such a machine can suggest a restaurant
based on the location data that has been gathered.
3. Theory of Mind
This kind of AI can understand thoughts and emotions, as well as interact socially. However, a machine based on this
type is yet to be built.
4. Self-Aware
Self-aware machines are the future generation of these new technologies. They will be intelligent, sentient, and
conscious.
AI systems work by merging large data sets with intelligent, iterative processing algorithms. This combination allows AI to learn
from patterns and features in the analyzed data. Each time an Artificial Intelligence system performs a round of data
processing, it tests and measures its performance and uses the results to develop additional expertise.
Now that you know how AI works at a basic level, let's take a closer look at the two major categories of AI and how
they differ in purpose and capabilities.
Weak AI vs. Strong AI
Next up in the what-is-AI discussion are weak and strong AI. When discussing artificial intelligence, it is common to
distinguish between two broad categories: weak AI and strong AI. Let's explore the characteristics of each type:
Weak AI, or Narrow AI, refers to AI systems that are designed to perform specific tasks and are limited to those tasks
only. These AI systems excel at their designated functions but lack general intelligence. Examples of weak AI include
voice assistants like Siri or Alexa, recommendation algorithms, and image recognition systems. Weak AI operates
within predefined boundaries and cannot generalize beyond its specialized domain.
Strong AI, also known as general AI, refers to AI systems that possess human-level intelligence or even surpass
human intelligence across a wide range of tasks. Strong AI would be capable of understanding, reasoning, learning,
and applying knowledge to solve complex problems in a manner similar to human cognition. However, the
development of strong AI is still largely theoretical and has not been achieved to date.
Ways of Implementing AI
Let's explore the following ways that explain how we can implement AI:
1. Machine Learning
Machine learning gives AI the ability to learn. This is done by using algorithms to discover patterns and generate
insights from the data it is exposed to.
2. Deep Learning
Deep learning, a subcategory of machine learning, allows AI to mimic a human brain's neural network. It can make
sense of patterns, noise, and sources of confusion in the data.
Here, we segregate the various kinds of images using deep learning. The machine goes through multiple features of
photographs and distinguishes them with feature extraction. The machine segregates the features of each photo into
different categories, such as landscape, portrait, or others.
[Figure: a neural network with an input layer, hidden layers and an output layer]
Machine learning is a branch of Artificial Intelligence that focuses on developing models and algorithms that let
computers learn from data without being explicitly programmed for every task. In simple words, ML teaches the
systems to think and understand like humans by learning from the data.
Machine Learning is mainly divided into three core types: Supervised, Unsupervised and Reinforcement Learning,
along with two additional types, Semi-Supervised and Self-Supervised Learning.
Supervised Learning: Trains models on labeled data to predict or classify new, unseen data.
Unsupervised Learning: Finds patterns or groups in unlabeled data, like clustering or dimensionality
reduction.
Reinforcement Learning: Learns through trial and error to maximize rewards, ideal for decision-making tasks.
Note: The following are not part of the original three core types of ML, but they have become increasingly important
in real-world applications, especially in deep learning.
Semi-Supervised Learning: This approach combines a small amount of labeled data with a large amount of
unlabeled data. It's useful when labeling data is expensive or time-consuming.
In order to make predictions, data passes through a series of steps that produce a machine learning model
capable of making predictions:
1. ML workflow
2. Data Cleaning
3. Feature Scaling
4. Data Preprocessing in Python
Supervised Learning
Supervised learning algorithms are generally categorized into two main types: classification and regression.
There are many algorithms used in supervised learning, each suited to different types of problems. Some of the most
commonly used supervised learning algorithms are:
1. Linear Regression
This is one of the simplest ways to predict numbers using a straight line. It helps find the relationship between input
and output.
Ridge Regression
Lasso Regression
2. Logistic Regression
Used when the output is a "yes or no" type answer. It helps in predicting categories like pass/fail or spam/not spam.
3. Decision Tree
A model that makes decisions by asking a series of simple questions, like a flowchart. Easy to understand and use.
4. Support Vector Machine (SVM)
A bit more advanced: it tries to draw the best line (or boundary) to separate different categories of data.
Understanding SVMs
Non-Linear SVM
5. K-Nearest Neighbors (KNN)
This model looks at the closest data points (neighbors) to make predictions. Super simple and based on similarity.
Introduction to KNN
6. Naïve Bayes
A quick and smart way to classify things based on probability. It works well for text and spam detection.
7. Random Forest
A powerful model that builds lots of decision trees and combines them for better accuracy and stability.
Ensemble learning combines multiple simple models to create a stronger, smarter model. There are mainly two types
of ensemble learning:
Bagging, which trains multiple models in parallel on random subsets of the data and averages their predictions.
Boosting, which builds models sequentially, each one correcting the errors of the previous one.
Clustering
Broadcasting in NumPy allows us to perform arithmetic operations on arrays of different shapes without reshaping
them. It automatically adjusts the smaller array to match the larger array's shape by replicating its values along the
necessary dimensions. This makes element-wise operations more efficient by reducing memory usage and
eliminating the need for loops. In this article, we will see how broadcasting works.

import numpy as np

# array values reconstructed to match the output below
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
scalar = 10
result = arr + scalar   # the scalar is broadcast over every element
print(result)

Output:
[[11 12 13]
 [14 15 16]]
Broadcasting applies specific rules to determine whether two arrays can be aligned for operations:
1. Check Dimensions: Ensure the arrays have the same number of dimensions or expandable dimensions.
2. Dimension Padding: If arrays have different numbers of dimensions, the smaller array is left-padded with
ones.
3. Shape Compatibility: Two dimensions are compatible if they are equal or one of them is 1.

Example 1: Broadcasting a Scalar to a 1D Array
This creates a NumPy array arr with values [1, 2, 3] and adds the scalar value 1 to each element of the array using
broadcasting.
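A minimal version of this example:

import numpy as np

arr = np.array([1, 2, 3])
result = arr + 1   # the scalar is broadcast to every element
print(result)      # [2 3 4]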
Example 2: Broadcasting a 1D Array to a 2D Array
This example shows how a 1D array a1 is added to a 2D array a2. NumPy automatically expands the 1D array along
the rows of the 2D array to perform element-wise addition.

import numpy as np

a1 = np.array([2, 4, 6])
# a2 values reconstructed to match the output below
a2 = np.array([[1, 3, 5],
               [7, 9, 11]])
res = a1 + a2
print(res)

Output:
[[ 3  7 11]
 [ 9 13 17]]
It allows us to perform addition on arrays of different shapes without needing to reshape them.
We may need to apply a condition to an entire array or a subset of it. Broadcasting can help perform these
operations efficiently without needing loops.

import numpy as np

# illustrative ages; the original values were not shown in the report
ages = np.array([12, 24, 17, 45])
result = np.where(ages > 18, 'Adult', 'Minor')
print(result)

Output:
['Minor' 'Adult' 'Minor' 'Adult']

In this example, broadcasting is used to efficiently check which elements in the ages array are greater than 18.
Instead of looping through the array, the condition is applied across the entire array using NumPy's where() function.
The result is an array of 'Adult' or 'Minor' based on the ages.
In this example, each element of a 2D matrix is multiplied by the corresponding element in a broadcasted vector.

import numpy as np

# values reconstructed to match the output below
matrix = np.array([[1, 2],
                   [3, 4]])
vector = np.array([10, 20])
result = matrix * vector
print(result)

Output:
[[10 40]
 [30 80]]

The 1D vector is broadcast across the rows of the 2D matrix. This allows element-wise multiplication of
corresponding elements between the matrix and the vector.
Consider a real-world scenario where we need to calculate the total calories in foods based on the amounts of fats,
proteins and carbohydrates. Each nutrient has a specific caloric value per gram.
Proteins: 4 CPG
Carbohydrates: 4 CPG
The left table shows the original data with food items and their respective grams of fats, proteins and carbs. The array [3,
3, 8] represents the caloric values per gram for fats, proteins and carbs respectively. This array is broadcast to
match the dimensions of the original data, and the arrow indicates the broadcasting operation.
The broadcasting array is multiplied element-wise with each row of the original data.
As a result, the right table shows the result of the multiplication, where each cell represents the caloric
contribution of that specific nutrient in the food item.
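A sketch of this calculation (the food rows are made-up values; the per-gram array [3, 3, 8] is taken from the text):

import numpy as np

# grams of fats, proteins and carbs for each food item (illustrative)
foods = np.array([[0.8, 2.9, 3.9],
                  [52.4, 23.6, 36.5],
                  [55.2, 31.7, 23.9]])
calories_per_gram = np.array([3, 3, 8])

# the 1D array is broadcast across every row of the 2D table
calorie_table = foods * calories_per_gram
print(calorie_table)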
Pandas
Pandas is an open-source software library designed for data manipulation and analysis. It provides data structures
like Series and DataFrames to easily clean, transform and analyze large datasets, and integrates with other Python
libraries, such as NumPy and Matplotlib.
It offers functions for data transformation, aggregation and visualization, which are important for analysis. Created by
Wes McKinney in 2008, Pandas is widely used by data scientists, analysts and researchers worldwide. Pandas revolves
around two primary data structures: Series (1D) for single columns and DataFrame (2D) for tabular data, enabling
efficient data manipulation.
DataFrames: A two-dimensional data structure constructed with rows and columns, similar to an Excel
spreadsheet.
pandas: The name is derived from the term "panel data", an econometrics term for multidimensional data sets.
With pandas, you can perform a wide range of data operations, including:
Reading and writing data from various file formats like CSV, Excel and SQL databases.
Cleaning and preparing data by handling missing values and filtering entries.
It offers a simple and intuitive way to work with structured data, especially using DataFrames.
Makes data exploration easy, so you can quickly understand patterns or spot issues.
It's widely used in industries like finance, healthcare, marketing and research.
A must-have skill for data science, analytics and machine learning roles.
Pandas Basics
In this section, we will explore the fundamentals of Pandas. We will start with an introduction to Pandas, learn how
to install it and get familiar with its functionalities. Additionally, we will cover how to use Jupyter Notebook, a
popular tool for interactive coding. By the end of this section, we will have a solid understanding of how to set up and
start working with Pandas for data analysis.
Pandas Introduction
Pandas Installation
Pandas DataFrame
A DataFrame is a two-dimensional, size-mutable and potentially heterogeneous tabular data structure with labeled
axes (rows and columns).
Creating a DataFrame
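A minimal example (column names and values are illustrative):

import pandas as pd

# build a DataFrame from a dictionary of equal-length columns
df = pd.DataFrame({'Name': ['Asha', 'Ravi', 'Meena'],
                   'Age': [25, 30, 28]})
print(df)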
Pandas Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point
numbers, Python objects, etc.). It's similar to a column in a spreadsheet or a database table.
Creating a Series
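A minimal example (values are illustrative):

import pandas as pd

# a Series from a list, with a custom label index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)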
Pandas offers a variety of functions to read data from and write data to different file formats, as given below:
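For instance (the file names here are placeholders), CSV files can be read and written like this; read_excel, read_sql and their to_* counterparts follow the same pattern:

import pandas as pd

df = pd.read_csv('input.csv')         # read a CSV file into a DataFrame
df.to_csv('output.csv', index=False)  # write it back out without the index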
Data cleaning is an essential step in data preprocessing to ensure accuracy and consistency. Here are some articles to
know more about it:
Removing Duplicates
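For instance, duplicate rows can be dropped in one call (a sketch with made-up data):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'score': [90, 90, 85]})
df = df.drop_duplicates()   # removes the repeated first row
print(df)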
We will cover data processing, normalization, manipulation and analysis, along with techniques for grouping and
aggregating data. These concepts will help you efficiently clean, transform and analyze datasets. By the end of this
section, you'll learn Pandas operations to handle real-world data effectively.
In this section, we will explore advanced Pandas functionalities for deeper data analysis and visualization. We will
cover techniques for finding correlations, working with time series data and using Pandas' built-in plotting functions
for effective data visualization. By the end of this section, you'll be able to apply these techniques for deeper analysis.
Feature Scaling is a technique to standardize the independent features present in the data. It is performed during
data pre-processing to handle highly varying values. If feature scaling is not done, a machine learning algorithm
tends to treat greater values as higher and smaller values as lower regardless of the unit of the values. For
example, it would treat 10 m and 10 cm the same regardless of their unit. In this article we will learn about different
techniques which are used to perform feature scaling.
1. Absolute Maximum Scaling
1. We should first select the maximum absolute value out of all the entries of a particular measure.
2. Then we divide each entry of the column by this maximum value.

X_scaled = X_i / max(|X|)

After performing the above-mentioned two steps we will observe that each entry of the column lies in the range of -1
to 1. But this method is not used that often because it is too sensitive to outliers, and when
dealing with real-world data the presence of outliers is very common.
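The later methods in this section use scikit-learn scalers; absolute maximum scaling can be sketched the same way with MaxAbsScaler (assuming the same SampleFile.csv introduced just below):

from sklearn.preprocessing import MaxAbsScaler
import pandas as pd

df = pd.read_csv('SampleFile.csv')
scaler = MaxAbsScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.head())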
For demonstration purposes we will use a dataset (SampleFile.csv) that is a simpler
version of the original house price prediction dataset, having only two columns from the original dataset. The first five
rows of the data are shown below:
import pandas as pd
df = pd.read_csv('SampleFile.csv')
print(df.head())
Output:
LotArea MSSubClass
0 8450 60
1 9600 20
2 11250 60
3 9550 70
4 14260 60
2. Min-Max Scaling
1. First we find the minimum and the maximum value of the column.
2. Then we subtract the minimum value from each entry and divide the result by the difference between the
maximum and the minimum value.

X_scaled = (X_i − X_min) / (X_max − X_min)

As we are using the maximum and the minimum value, this method is also prone to outliers, but after performing
the above two steps the data will lie in the range 0 to 1.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 0.033420 0.235294
1 0.038795 0.000000
2 0.046507 0.235294
3 0.038561 0.294118
4 0.060576 0.235294
3. Normalization
Normalization is the process of adjusting the values of data points so that they all have the same length or size,
specifically a length of 1. This is done by dividing each data point by the "length" (called the Euclidean norm) of that
data point. Think of it like adjusting the size of a vector so that it fits within a standard size of 1.

X_scaled = X_i / ||X||

Where ||X|| represents the Euclidean norm (or length) of the vector X.
from sklearn.preprocessing import Normalizer

scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 0.999975 0.007100
1 0.999998 0.002083
2 0.999986 0.005333
3 0.999973 0.007330
4 0.999991 0.004208
4. Standardization
This method of scaling is based on the central tendency and variance of the data.
1. First we calculate the mean and standard deviation of the data we would like to normalize.
2. Then we subtract the mean value from each entry and divide the result by the standard
deviation.
This helps us achieve a normal distribution of the data with a mean equal to zero and a standard deviation equal to 1.

X_scaled = (X_i − X_mean) / σ
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 -0.207142 0.073375
1 -0.091886 -0.872563
2 0.073480 0.073375
3 -0.096897 0.309859
4 0.375148 0.073375
Matplotlib allows many ways of customizing and styling our plots. We can change colors, add labels, adjust
styles and much more. By applying these customization techniques to basic plots we can make our visualizations
clearer and more informative. Let's see various ways of customizing:
Example:

import pandas as pd
import matplotlib.pyplot as plt

# reconstructed example; assumes a local copy of the tips dataset
data = pd.read_csv('tips.csv')
x = data['total_bill']

# customized histogram: color, edge color and a dashed, semi-transparent grid
plt.hist(x, bins=20, color='skyblue', edgecolor='black')
plt.grid(linestyle='--', alpha=0.5)
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.title('Distribution of Total Bill')
plt.show()
Streamlit is an open-source Python framework that allows developers to create interactive web applications with
ease. It's designed to help data scientists and machine learning engineers turn data scripts into shareable web apps
in just a few lines of code. Streamlit's simplicity and flexibility have made it a popular choice among data
professionals.
Plotly, on the other hand, is a versatile library that enables the creation of beautiful and interactive plots. It supports
over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional
use-cases. Plotly's interactivity gives users the ability to zoom, pan, hover, and drill down into the visualizations,
making data exploration intuitive and informative.
When used together, Streamlit and Plotly form a powerful combination, allowing developers to create interactive
dashboards with complex visualizations with relative ease.
While creating basic interactive visualizations with Streamlit and Plotly is simple, these tools also offer advanced
features that allow for more complex and customized visualizations.
One such feature is the ability to update Plotly figures in Streamlit. This can be done using
the update_layout and update_traces methods in Plotly. For instance, you can update the layout of a figure to change
its title or size; similarly, you can update the traces of a figure to change the properties of the plotted data, such as
the marker color (see the sketch after the snippet below). Another advanced feature is the ability to
resolve sizing issues with Plotly charts in Streamlit. Sometimes a Plotly chart might not fit well within the layout of a
Streamlit app, causing it to be cut off or overlap with other elements. This can be resolved by adjusting the chart's
height and width, or by passing use_container_width=True to the st.plotly_chart function:
import streamlit as st
import plotly.express as px
import pandas as pd

df = pd.read_csv('data.csv')

# reconstructed: the report omits the chart_type widget and figure creation;
# the column names 'category' and 'value' are placeholders
chart_type = st.selectbox('Chart type', ['Bar', 'Line'])
if chart_type == 'Bar':
    fig = px.bar(df, x='category', y='value')
else:
    fig = px.line(df, x='category', y='value')

st.plotly_chart(fig, use_container_width=True)
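The two update calls mentioned above can be sketched as follows (placeholder data; the title and color are arbitrary):

import plotly.express as px

fig = px.bar(x=['A', 'B', 'C'], y=[3, 1, 2])

# update_layout changes figure-level properties such as title and size
fig.update_layout(title='Sales Overview', height=450, width=700)

# update_traces changes properties of the plotted data, e.g. the marker color
fig.update_traces(marker_color='green')
fig.show()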
Regression in machine learning refers to a supervised learning technique where the goal is to predict a continuous
numerical value based on one or more independent features. It finds relationships between variables so that
predictions can be made. We have two types of variables present in regression:
Dependent Variable (Target): The variable we are trying to predict, e.g. house price.
Independent Variables (Features): The input variables that influence the prediction, e.g. locality, number of
rooms.
A regression analysis problem is one where the output variable is a real or continuous value, such as "salary" or
"weight". Many different regression models can be used, but the simplest among them is linear regression.
Types of Regression
Regression can be classified into different types based on the number of predictor variables and the nature of the
relationship between variables:
1. Simple Linear Regression
Linear regression is one of the simplest and most widely used statistical models. It assumes that there is a linear
relationship between the independent and dependent variables, meaning that the change in the dependent
variable is proportional to the change in the independent variables. An example is predicting the price of a house
based on its size, as sketched below.
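A minimal sketch with made-up numbers (size in square feet, price in thousands):

import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[500], [750], [1000], [1250]])
price = np.array([50, 75, 100, 125])

model = LinearRegression().fit(size, price)
print(model.predict([[1100]]))  # about 110, since this toy data is perfectly linear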
2. Multiple Linear Regression
Multiple linear regression extends simple linear regression by using multiple independent variables to predict the
target variable. For example, predicting the price of a house based on multiple features such as size, location,
number of rooms, etc.
3. Polynomial Regression
Polynomial regression is used to model non-linear relationships between the dependent variable and the
independent variables. It adds polynomial terms to the linear regression model to capture more complex
relationships. For example, when we want to predict a non-linear trend like population growth over time, we use
polynomial regression.
4. Ridge & Lasso Regression
Ridge and lasso regression are regularized versions of linear regression that help avoid overfitting by penalizing large
coefficients. When there is a risk of overfitting due to too many features, we use these types of regression algorithms.
5. Support Vector Regression (SVR)
SVR is a type of regression algorithm that is based on the Support Vector Machine (SVM) algorithm. SVM is mainly
used for classification tasks, but it can also be used for regression tasks. SVR works by finding a
hyperplane that minimizes the sum of the squared residuals between the predicted and actual values.
6. Decision Tree Regression
A decision tree uses a tree-like structure to make decisions, where each branch of the tree represents a decision and
the leaves represent outcomes. For example, for predicting customer behavior based on features like age, income,
etc., we use decision tree regression.
7. Random Forest Regression
Random Forest is an ensemble method that builds multiple decision trees, each trained on a different
subset of the training data. The final prediction is made by averaging the predictions of all of the trees, for example
when modeling customer churn or sales data.
Classification algorithms organize and understand complex datasets in machine learning. These algorithms are
essential for categorizing data into classes or labels, automating decision-making and pattern identification.
Classification algorithms are often used to detect email spam by analyzing email content. These algorithms enable
machines to quickly recognize spam trends and make real-time judgments, improving email security.
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Support Vector Machine (SVM)
5. Naive Bayes
6. K-Nearest Neighbors (KNN)
1. Logistic Regression
Logistic regression is a classification algorithm used to estimate discrete values, typically binary, such as 0 and 1, yes
or no. It predicts the probability of an instance belonging to a class, which makes it essential for binary classification
problems like spam detection or diagnosing disease.
Logistic functions are ideal for classification problems since their output is between 0 and 1. Many fields employ
logistic regression because of its simplicity, interpretability and efficiency. It works well when the relationship
between the features and the event probability is linear. Despite its name, it predicts class-membership likelihood:
a logistic function models the probability in this linear model, as the sketch below shows.
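A minimal sketch with made-up data (hours studied vs. pass/fail):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)
print(clf.predict([[3.5]]))        # predicted class (0 or 1)
print(clf.predict_proba([[3.5]]))  # probabilities, each between 0 and 1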
2. Decision Tree
Decision Trees are versatile and simple classification and regression techniques. Recursively splitting the dataset into
key-criteria subgroups produces a tree-like structure, where judgments at each node lead to leaf nodes. Decision trees
are easy to understand and depict, making them useful for decision-making. Overfitting may occur, so pruning
improves generality. The result is a tree-like model of decisions and their consequences, including chance event
outcomes, resource costs and utility.
3. Random Forest
Random forest are an ensemble learning techniques that combines mul ple decision trees to improve predic ve
accuracy and control over-fi ng. By aggrega ng the predic ons of numerous trees, Random Forests enhance
the decision-making process, making them robust against noise and bias.
Random Forest uses numerous decision trees to increase predic on accuracy and reduce overfi ng. It constructs
many trees and integrates their predic ons to create a reliable model. Diversity is added by using a random dataset
and characteris cs in each tree. Random Forests excel at high-dimensional data, feature importance metrics, and
overfi ng resistance. Many fields use them for classifica on and regression.
4. Support Vector Machine (SVM)
SVM is an effective classification and regression algorithm. It seeks the hyperplane that best separates the classes
while maximizing the margin. SVM works well in high-dimensional spaces and handles nonlinear feature interactions
with its kernel technique. It is a powerful classification algorithm known for its accuracy in high-dimensional spaces.
5. Naive Bayes
Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem; text categorization and spam
filtering benefit from it. Despite its simplicity and "naive" assumption of feature independence, Naive Bayes often
works well in practice. It uses conditional probabilities of features to calculate the class likelihood of an instance, and
it handles high-dimensional datasets quickly.
6. K-Nearest Neighbors (KNN)
KNN uses the majority class of the k nearest neighbours for simple and adaptive classification and regression. Being
non-parametric, KNN makes no assumptions about the data distribution. It copes well with uneven decision
boundaries and performs well across varied jobs. K-Nearest Neighbors (KNN) is an instance-based, or lazy learning,
algorithm, where the function is only approximated locally and all computation is deferred until function evaluation.
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that helps machines understand and
process human languages in either text or audio form. It is used across a variety of applications, from speech
recognition to language translation and text summarization.
Natural Language Processing can be categorized into two components:
1. Natural Language Understanding: It involves interpreting and extracting meaning from human language input.
2. Natural Language Generation: It involves generating human-like text based on processed data.
NLP involves a series of phases that work together to process and interpret language, with each phase contributing to
understanding its structure and meaning.
Phases of NLP
Popular Python libraries for NLP include spaCy, TextBlob and Gensim.
Text Normalization transforms text into a consistent format, improving quality and making it easier to process in
NLP tasks.
1. Regular Expressions (RE) are sequences of characters that define search patterns (see the sketch after this
list).
Text Normalization
Tokenization
Word Tokenization
Rule-based Tokenization
Subword Tokenization
Whitespace Tokenization
WordPiece Tokenization
Lemmatization
4. Stemming reduces words to their root by removing suffixes. Types of stemmers include:
Stemming
Porter Stemmer
Lancaster Stemmer
Snowball Stemmer
Rule-based Stemming
Stopword removal
7. Parts of Speech (POS) Tagging assigns a part of speech to each word in a sentence based on definition and
context.
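A small sketch tying together regular expressions, tokenization and stemming (assumes the nltk package is installed; the sentence is arbitrary):

import re
from nltk.stem import PorterStemmer

text = "The striped bats were hanging on their feet."

# regular expression: lowercase, then keep alphabetic tokens only
tokens = re.findall(r"[a-z]+", text.lower())

# stemming with the Porter stemmer reduces words to their roots
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])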
The consistent advancement in AI is transforming the dynamics between businesses and customers. Image
processing in computer vision is one such revolutionary technology that enables businesses to create an impeccable
experience for their customers. It provides expertise, efficiency and excellence by bridging the gap between
imperfection and perfection. Be it extracting details from a sea of information or finding a specific product amidst
thousands of similar products, the user is able to do it in a few taps due to computer vision! Explore some more use
cases of image processing in computer vision.
Identifying Defects in Machines for Seamless Operation
Manufacturing companies are leveraging image processing to combat potential hazards by detecting defective
components or materials during the manufacturing process. It multiplies the production speed with accuracy and
streamlines the mode of operations. Image processing in computer vision has enhanced product visibility in the
manufacturing process and eliminated response downtime, ensuring seamless operations. This subsequently enables
manufacturers to deliver a quality product in a limited time while earning maximum profits.
Image processing in computer vision categorizes images according to their features, patterns and content. It eases
the process of searching for a similar type of product amidst thousands of products. Shoppers can upload the existing
product picture for which they want an alternate option, and computer vision will provide them with a similar-
looking pattern/feature product in a few seconds. Such a seamless experience increases customer retention and
business sales. Image processing has reshaped the e-commerce industry by improving the interaction between
consumers and service providers.
Image processing in computer vision detects text, objects and images and understands what they symbolize. So when
a car incorporates such a technology, it will automatically detect a pedestrian walking on the road or
understand the traffic signal light and function accordingly. For example, if it sees a red light on a signal, the car will
stop, or go slow when it recognizes a school area. Such automation is not only providing comfort and convenience but
also enhancing users' safety and driving experience.
Image processing in computer vision combined with other AI applications has strengthened the potential of local and
national security. Image processing recognizes blacklisted faces, and computer vision further extracts every piece
of information regarding that image. This helps the government combat terrorism and maintain peace in the country.
SALARY PREDICTION PROJECT
# app.py (main Streamlit application)
import streamlit as st
import pandas as pd
import numpy as np
from model import train_model, predict_salary, get_model_metrics
from preprocess import preprocess_input
import plotly.express as px
import plotly.graph_objects as go

# Load custom CSS
def load_css():
    with open('style.css') as f:
        st.markdown(f'<style>{f.read()}</style>', unsafe_allow_html=True)

load_css()

# Load dataset
data = pd.read_csv('data/salary_dataset.csv')

# Train Model
model, X, y = train_model(data)

# Streamlit UI
st.markdown("<h1 style='text-align: center; color: #4CAF50;'>SalarySense AI: Smart Salary Predictor 💰</h1>",
            unsafe_allow_html=True)
st.markdown("<p style='text-align: center;'>Predict your salary with real-time insights and feature impact analysis.</p>",
            unsafe_allow_html=True)

st.sidebar.header('Enter Your Profile Details')
experience = st.sidebar.slider('Years of Experience', 0, 20, 1)
education = st.sidebar.selectbox('Education Level', ['Bachelor', 'Master', 'PhD'])
company = st.sidebar.selectbox('Company Type', ['Service', 'Product', 'Startup'])
city = st.sidebar.selectbox('City', ['Delhi', 'Mumbai', 'Bangalore'])

if st.sidebar.button('Predict Salary'):
    input_df = preprocess_input(experience, education, company, city)
    predicted_salary = predict_salary(model, input_df)
    st.markdown(f"<h2 style='color: #ff6347;'>Predicted Salary: ₹{predicted_salary[0]:,.2f}</h2>",
                unsafe_allow_html=True)

    # Plot using Plotly
    fig = px.scatter(data, x='Experience', y='Salary', color='Education',
                     title='Experience vs Salary with Education Levels')
    fig.add_traces(go.Scatter(x=data['Experience'], y=model.predict(X),
                              mode='lines', name='Regression Line',
                              line=dict(color='red')))
    st.plotly_chart(fig)

    # Model Performance
    mae, mse, r2 = get_model_metrics(model, X, y)
    st.subheader('Model Performance Metrics')
    st.write(f"**Mean Absolute Error (MAE):** {mae:.2f}")
    st.write(f"**Mean Squared Error (MSE):** {mse:.2f}")
    st.write(f"**R² Score:** {r2:.2f}")
    st.success('Prediction complete! 🎉')
# model.py
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def train_model(data):
    X = data[['Experience', 'Education', 'Company', 'City']]
    y = data['Salary']
    # reconstructed body: one-hot encode the categorical columns and fit a
    # linear regression, matching the imports above
    preprocessor = ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'),
          ['Education', 'Company', 'City'])],
        remainder='passthrough')
    model = Pipeline([('prep', preprocessor), ('reg', LinearRegression())])
    model.fit(X, y)
    return model, X, y
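app.py also imports predict_salary and get_model_metrics from model.py; their definitions are missing from the report, so the following is a minimal sketch consistent with how app.py calls them (continuing model.py above):

def predict_salary(model, input_df):
    # returns an array of predictions; app.py formats element [0]
    return model.predict(input_df)

def get_model_metrics(model, X, y):
    # metrics computed on the data app.py passes back in
    preds = model.predict(X)
    return (mean_absolute_error(y, preds),
            mean_squared_error(y, preds),
            r2_score(y, preds))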
# preprocess.py (only this import survives in the report)
import pandas as pd
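app.py calls preprocess_input(experience, education, company, city); a plausible minimal body, assuming the pipeline in model.py handles the encoding:

def preprocess_input(experience, education, company, city):
    # wrap the raw inputs in a one-row DataFrame with the training column names
    return pd.DataFrame([{'Experience': experience,
                          'Education': education,
                          'Company': company,
                          'City': city}])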
/* style.css (excerpt) */
.stButton>button:hover {
    background-color: #45a049;
}
[data-testid="stSidebar"] {
    background-color: #f0f0f5;
}
data/salary_dataset.csv (sample):
Experience,Education,Company,City,Salary
2,Bachelor,Service,Delhi,40000
5,Master,Product,Mumbai,90000
3,Bachelor,Startup,Bangalore,60000
7,PhD,Product,Delhi,150000
4,Master,Service,Bangalore,75000
6,Bachelor,Startup,Mumbai,85000
8,PhD,Product,Delhi,160000
CONCLUSION
I collected all the raw data from online resources, books and research articles, and then formatted this raw data.
In this project, I first gathered the dataset from the Kaggle website and used it. At first, I visualized the data.
After that I checked whether the data set contained any null values, and if so I removed all the null values from
the data set. After that I handled all the categorical features in the data set by using the dummy-variable technique. Later
on, I used a Random Forest model and measured the accuracy of the model. Future modifications can be made in order to
increase the accuracy of the model.
Acknowledgement
I would like to express my special thanks to my project guide, Mr. Sourav Goswami, as well as to Ardent Computech,
who gave me the golden opportunity to train and to do this wonderful real-time project on the topic of SALARY
PREDICTION, which also helped me do a lot of research of my own and gather new knowledge.