Programs MLT Lab Print

The document outlines several exercises related to data analysis and machine learning using Python, including calculating average salary, data processing methods, exploratory data analysis (EDA), linear regression, decision trees, and K-means clustering. Each exercise includes code snippets demonstrating the implementation of various data manipulation and analysis techniques using libraries like pandas, numpy, and scikit-learn. The document serves as a practical guide for students at K. Ramakrishnan College of Engineering to apply data science concepts.

K. Ramakrishnan College of Engineering (Autonomous), Trichy

EX.NO. 1 CALCULATION OF AVERAGE SALARY USING DATAFRAME

PROGRAM

import pandas as pd

import numpy as np

# Create the DataFrame with random integers

df = pd.DataFrame({

'Age': np.random.randint(18, 61, size=10), # 61 is exclusive

'Salary': np.random.randint(30000, 100001, size=10)})

# Display the DataFrame

print("Generated DataFrame:")

print(df)

# Calculate and display the average salary

average_salary = df['Salary'].mean()

print(f"\nAverage Salary: ₹{average_salary:,.2f}")
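Because the program above draws fresh random numbers on every run, the printed average changes each time. The following is a minimal sketch, assuming a fixed seed (42 is an arbitrary choice that is not part of the original exercise), of how the table could be made repeatable and how the mean relates to the column sum.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42 is arbitrary) so repeated runs give the same table.
rng = np.random.default_rng(42)

df = pd.DataFrame({
    "Age": rng.integers(18, 61, size=10),            # upper bound 61 is exclusive
    "Salary": rng.integers(30000, 100001, size=10),  # upper bound 100001 is exclusive
})

# The mean is the column sum divided by the number of rows.
average_salary = df["Salary"].mean()
manual_average = df["Salary"].sum() / len(df)
print(f"Average Salary: {average_salary:,.2f}")
```

Both `average_salary` and `manual_average` should agree, which is a quick sanity check on the result.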

OUTPUT

Department of Artificial Intelligence and Data Science Page



EX.NO. 2 IMPLEMENTATION OF DATA PROCESSING METHODS

PROGRAM

import pandas as pd

import numpy as np

# Sample data with some null values

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 22, 29],
    'Salary': [50000, None, 62000, 58000, 60000]
}

# Create DataFrame
df = pd.DataFrame(data)

print("Original DataFrame:\n", df)

# 1. Dropping rows with any null value

df_dropped = df.dropna()

print("\nDataFrame after dropping rows with null values:\n", df_dropped)

# 2. Filling null values
df_filled = df.fillna({
    'Name': 'Unknown',
    'Age': df['Age'].mean(),          # fill Age with the mean
    'Salary': df['Salary'].median()   # fill Salary with the median
})

print("\nDataFrame after filling null values:\n", df_filled)

# 3. Selecting specific data

selected_rows = df_filled[df_filled['Salary'] > 55000]

print("\nSelected rows where Salary > 55000:\n", selected_rows)

# 4. Convert 'Name' column into a list

name_list = df_filled['Name'].tolist()

print("\n'Name' column as list:\n", name_list)

OUTPUT
Original DataFrame:
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob  30.0      NaN
2  Charlie   NaN  62000.0
3    David  22.0  58000.0
4     None  29.0  60000.0

DataFrame after dropping rows with null values:
    Name   Age   Salary
0  Alice  25.0  50000.0
3  David  22.0  58000.0

DataFrame after filling null values:
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob  30.0  59000.0
2  Charlie  26.5  62000.0
3    David  22.0  58000.0
4  Unknown  29.0  60000.0

Selected rows where Salary > 55000:
      Name   Age   Salary
1      Bob  30.0  59000.0
2  Charlie  26.5  62000.0
3    David  22.0  58000.0
4  Unknown  29.0  60000.0

'Name' column as list:
['Alice', 'Bob', 'Charlie', 'David', 'Unknown']
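
The two fill values used in step 2 can be checked by hand. The sketch below recomputes them from the exercise's own Age and Salary columns; pandas skips missing entries when it computes a mean or median, so the mean of the four known ages is 26.5 and the median of the four known salaries is the average of the two middle values, 59000.

```python
import pandas as pd

# Same Age and Salary values as in the exercise; None becomes NaN.
age = pd.Series([25, 30, None, 22, 29], dtype="float64")
salary = pd.Series([50000, None, 62000, 58000, 60000], dtype="float64")

# mean() and median() skip NaN by default (skipna=True).
age_fill = age.mean()          # (25 + 30 + 22 + 29) / 4
salary_fill = salary.median()  # middle of 50000, 58000, 60000, 62000

print("Age fill value:", age_fill)
print("Salary fill value:", salary_fill)
```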


EX.NO:3 IMPLEMENTATION OF EDA

PROGRAM

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set seed for reproducibility
np.random.seed(42)
# Create synthetic dataset

df = pd.DataFrame({
'Age': np.random.randint(22, 60, size=100),
'Salary': np.random.randint(30000, 120000, size=100),
'Experience_Years': np.random.randint(0, 35, size=100),
'Performance_Score': np.random.normal(loc=6, scale=1.5, size=100).round(1),
'Training_Hours': np.random.randint(0, 100, size=100)
})
# Inject some missing values
df.loc[np.random.choice(df.index, size=5), 'Performance_Score'] = np.nan

# Display first few rows
print("Sample of the dataset:")
print(df.head())

# Summary statistics
print("\nSummary Statistics:")
print(df.describe())

# Check for missing data
print("\nMissing Values:")
print(df.isnull().sum())

# Histograms of numeric features
df.hist(figsize=(12, 8), edgecolor='black')
plt.suptitle('Histogram of Numeric Features', fontsize=16)
plt.tight_layout()
plt.show()

# Boxplots for all numeric features

plt.figure(figsize=(12, 6))
sns.boxplot(data=df,palette='Set3')
plt.title('Boxplot of Numeric Features', fontsize=16)
plt.xticks(rotation=45)
plt.show()
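
One detail of the missing-value injection is worth a note: `np.random.choice` samples with replacement by default, so the same row can be drawn twice and fewer than five values may end up missing. A minimal sketch, assuming a NumPy `Generator` with an arbitrary seed (neither is in the original program), that guarantees exactly five distinct rows:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumption: any fixed seed works here
df = pd.DataFrame({"Performance_Score": rng.normal(6, 1.5, size=100).round(1)})

# replace=False guarantees five distinct row labels, so exactly five NaNs.
missing_rows = rng.choice(df.index.to_numpy(), size=5, replace=False)
df.loc[missing_rows, "Performance_Score"] = np.nan

missing_count = int(df["Performance_Score"].isnull().sum())
print("Missing values injected:", missing_count)
```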

OUTPUT

EX. NO:4 IMPLEMENTATION OF SINGLE LINEAR REGRESSION

PROGRAM

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression


from sklearn.metrics import mean_squared_error, r2_score
# Step 1: Load dataset
df = pd.read_csv('placement.csv')

# Step 2: Display first few rows to understand structure
print("First 5 rows of the dataset:")
print(df.head())

# Step 3: Select input and output variables
# Here "placement_exam_marks" is the independent variable (X) and "placed" is the dependent variable (y)
X = df[['placement_exam_marks']]
y = df['placed']

# Step 4: Split the data into training and testing sets (80% train, 20% test)
X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.2, random_state=42)

# Step 5: Train the Linear Regression model

model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: Make predictions

y_pred = model.predict(X_test)

# Step 7: Evaluate the model
print("\nModel Performance:")
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))

# Step 8: Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=1, label='Predicted')
plt.xlabel('Placement Exam Marks')

plt.ylabel('Placed')
plt.title('Simple Linear Regression')
plt.legend()

plt.grid(True)

plt.show()

OUTPUT
First 5 rows of the dataset:
cgpa placement_exam_marks placed
0 7.19 26.0 1
1 7.46 38.0 1
2 7.54 40.0 1
3 6.42 8.0 1
4 7.23 17.0 0

Model Performance:
Mean Squared Error: 0.2501852253844579
R^2 Score: -0.005668678060327448
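
A mean squared error near 0.25 together with an R^2 score near zero is what one would expect when a model predicts roughly 0.5 for every sample of a balanced 0/1 target. The sketch below works through both metrics by hand on hypothetical labels and predictions (they are not taken from placement.csv), using R^2 = 1 - SS_res / SS_tot.

```python
import numpy as np

# Hypothetical 0/1 labels and near-constant predictions, for illustration only.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
y_hat = np.array([0.55, 0.52, 0.48, 0.50, 0.47, 0.53, 0.51, 0.49])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
mse = ss_res / len(y_true)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R^2:", r2)
```

R^2 turns negative whenever the predictions fit worse than simply predicting the mean of the target, which suggests that a straight line through the exam marks explains very little of the 0/1 `placed` column.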


EX.NO:5 IMPLEMENTATION OF DECISION TREE

PROGRAM

import pandas as pd
df1=pd.read_csv("Symptom-severity.csv")
df2=pd.read_csv("dataset.csv")
df3=pd.read_csv("symptom_Description.csv")
df4=pd.read_csv("symptom_precaution.csv")
df1.head()

df2.head()

OUTPUT


df3.head()

df4.head()

res = pd.concat([df1,df2,df3,df4])

res
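
Note that `pd.concat` with its default settings stacks the tables row-wise: it does not match rows on a shared key, and any column absent from one table is filled with NaN for that table's rows. A minimal sketch with two hypothetical miniature tables (not the exercise's CSV files) shows the effect, which is why the missing values are filled in the next step.

```python
import pandas as pd

# Hypothetical miniature tables, not the exercise's CSV files.
severity = pd.DataFrame({"Symptom": ["itching", "cough"], "weight": [1, 4]})
description = pd.DataFrame({"Disease": ["Allergy"], "Description": ["An immune reaction."]})

# concat stacks rows: columns missing from one table are filled with NaN.
stacked = pd.concat([severity, description])
print(stacked)
print(stacked.shape)
```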


res.info()

res = res.fillna(res.median(numeric_only=True))

for col in res.select_dtypes(include='object'):
    mode_val = res[col].mode()
    if not mode_val.empty:
        res[col] = res[col].fillna(mode_val[0])

res


res.describe()

res.info()


res['Precaution_Merged'] = res[['Precaution_1', 'Precaution_2', 'Precaution_3',
                                'Precaution_4']].astype(str).agg(', '.join, axis=1)
res['Symptom_Merged'] = res[['Symptom', 'Symptom_1', 'Symptom_2', 'Symptom_3', 'Symptom_4',
                             'Symptom_5', 'Symptom_6', 'Symptom_7', 'Symptom_8', 'Symptom_9',
                             'Symptom_10', 'Symptom_11', 'Symptom_12', 'Symptom_13', 'Symptom_14',
                             'Symptom_15', 'Symptom_17']].astype(str).agg(', '.join, axis=1)
res.head()

input=res[['weight','Description','Precaution_Merged','Symptom_Merged']].copy()

target=res['Disease']
input.head()


target.head()

from sklearn.preprocessing import LabelEncoder

le_Description = LabelEncoder()
le_Precaution_Merged = LabelEncoder()
le_Symptom_Merged = LabelEncoder()
input['Description_n'] = le_Description.fit_transform(input['Description'])
input['Precaution_Merged_n'] = le_Precaution_Merged.fit_transform(input['Precaution_Merged'])
input['Symptom_Merged_n'] = le_Symptom_Merged.fit_transform(input['Symptom_Merged'])

input.head()


input=input.drop(['Description','Precaution_Merged','Symptom_Merged'], axis='columns')
input.head()

le_Disease=LabelEncoder()
target=le_Disease.fit_transform(target)
target
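
Label encoding replaces each distinct label with an integer, assigned in sorted order of the labels. The same mapping can be sketched with NumPy alone; the disease names below are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

# Hypothetical disease labels, for illustration only.
labels = np.array(["Malaria", "Allergy", "Malaria", "Dengue"])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the position of each original label in that sorted list.
classes, codes = np.unique(labels, return_inverse=True)

print(classes)
print(codes)
```

The integer codes carry no numeric meaning beyond identity, so a tree split such as "code <= 1.5" only separates groups of labels that happen to be adjacent alphabetically.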

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test=train_test_split(input,target,test_size=0.33,random_state=42)
from sklearn import tree
model=tree.DecisionTreeClassifier(max_depth=3)
model.fit(x_train,y_train)
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier, plot_tree


import numpy as np
plt.figure(figsize=(20, 10))
plot_tree(
    model,
    feature_names=list(input.columns),
    class_names=[str(c) for c in sorted(np.unique(target))],
    filled=True,
    rounded=True,
    fontsize=10,
    max_depth=3)
plt.title('Decision Tree')
plt.show()
OUTPUT


EX.NO:6 IMPLEMENTATION OF K MEANS ALGORITHM

PROGRAM

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('Instagram visits clustering.csv')
df

plt.scatter(df[['Instagram visit score']],df['Spending_rank(0 to 100)'])


OUTPUT

# Building a K-means clustering model
from sklearn.cluster import KMeans

wcss = []  # within-cluster sum of squares for each k
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit_predict(df)
    wcss.append(km.inertia_)
wcss


plt.plot(range(1,11),wcss)
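
The quantity plotted in the elbow curve, `inertia_`, is the within-cluster sum of squares (WCSS): the sum of squared distances from every point to the centre of its assigned cluster. A minimal NumPy sketch on hypothetical points with already known cluster labels shows the arithmetic.

```python
import numpy as np

# Hypothetical 2-D points with known cluster labels, for illustration only.
points = np.array([[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [11.0, 9.0]])
labels = np.array([0, 0, 1, 1])

wcss = 0.0
for k in np.unique(labels):
    members = points[labels == k]
    centroid = members.mean(axis=0)             # cluster centre
    wcss += np.sum((members - centroid) ** 2)   # squared distances to the centre

print("WCSS:", wcss)
```

Here each cluster contributes 2.0, giving a total of 4.0. Adding clusters can only lower this total, so the elbow method looks for the value of k after which the decrease flattens out.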

X = df.iloc[:,:].values #numpy array


km = KMeans(n_clusters=4)
y_means = km.fit_predict(X)
y_means

X[y_means==0]

X[y_means==1]

X[y_means==2]

X[y_means==3]

X[y_means == 3,1]


array([ 70., 79., 28., 49., 92., 90., 80., 84., 76., 90., 36.,
84., 84., 49., 96., 43., 71., 10., 102., 78., 86., 50.,
71., 82., 18., 88., 74., 40., 92., 22., 118., 84., 27.,
78., 84., 86., 82., 72., 59., 66., 54., 79., 103., 51.,
45., 67., 93., 47., 74., 46., 90., 40., 11., 54., 84.,
99., 43., 20., 92., 22., 87., 53., 25., 69., 92., 86.,
32., 67., 56., 47., 88., 89., 20., 54., 86., 78., 9.,
36., 80., 67., 87., 28., 50., 88., 36., 71., 90., 67.,
80., 85., 95., 70., 80., 77., 34., 89., 81., 81., 96.,

71., 91., 46., 22., 86., 73., 24., 75., 15., 71., 55.,
38., 85., 62., 91., 23., 58., 32., 82., 23., 81., 86.,
97., 70., 100., 78., 64., 30., 21., 86., 86., 94., 56.,
39., 24., 84., 77., 90., 52., 79., 86., 24., 33., 77.,
87., 61., 90., 67., 26., 48., 27., 61., 89., 14., 53.,
97., 36., 83., 92., 83., 84., 93., 24., 28., 14., 90.,

78., 62., 73., 75., 72., 78., 80., 33., 91., 32., 91.,
90., 21., 70., 80., 66., 64., 89., 85., 25., 88., 14.,
64., 87., 26., 29., 70., 42., 77., 53., 28., 83., 101.,
92., 100., 10., 58., 88., 36., 38., 95., 36., 27., 97.,
100., 40., 88., 87., 85., 91., 41., 28., 96., 79., 89.,
78., 17., 70., 28., 91., 54., 66., 102., 80., 76., 16.,
28., 24., 74., 41., 41., 49., 44., 18., 74., 88., 77.,
82., 82., 43., 25., 47., 64., 81., 78., 22., 81., 38.,
61., 89., 77., 30., 45., 30., 22., 51., 13., 65., 85.,


84., 59., 84., 92., 21., 101., 18., 73., 35., 26., 24.,
37., 61., 90., 33., 32., 89., 98., 87., 83., 67., 56.,
51., 17., 40., 86., 86., 75., 78., 91., 69., 91., 33.,
28., 99., 26., 59., 22., 59., 52., 36., 89., 92., 64.,
33., 27., 82., 79., 47., 42., 82., 24., 90., 24., 28.,
97., 28., 22., 34., 18., 29., 78., 63., 70., 51., 66.,
87., 69., 80., 42., 70., 78., 68., 85., 54., 80., 48.,
39., 85., 31., 76., 30., 56., 93., 90., 104., 58., 74.,
32., 81., 83., 93., 15., 43., 91., 82., 22., 96., 64.,

18., 9., 37., 79., 86., 81., 95., 93., 59., 66., 40.,
58., 24., 41., 97., 79., 102., 73., 45., 78., 31., 61.,
61., 86., 78., 72., 82., 31., 87., 92., 56., 63., 38.,
88., 79., 48., 45., 92., 77., 48., 12., 86., 38., 76.,
22., 44., 94., 87., 85., 77., 90., 32., 52., 89., 65.,
31., 21., 92., 69., 98., 68., 90., 60., 86., 41., 89.,
83., 83., 43., 19., 13., 28., 80., 82., 7., 83., 41.,
81., 91., 94., 39., 53., 88., 102., 19., 79., 85., 94.,
34., 61., 80., 20., 60., 90., 60., 91., 91., 91., 73.,
92., 89., 89., 101., 87., 62., 69., 30., 114., 72., 59.,
45., 73., 38., 92., 95., 85., 80., 48., 73., 88., 28.,
26., 72., 104., 92., 86., 23., 88., 40., 88., 75., 97.,
88., 59., 90., 92., 35., 88., 82., 40., 78., 77., 87.,
77., 28., 89., 25., 99., 85., 24., 27., 42., 88., 85.,


81., 81., 72., 20., 83., 89., 44., 18., 88., 48., 65.,
90., 64., 76., 45., 69., 37., 83., 80., 87., 77., 15.,
31., 35., 37., 72., 46., 83., 90., 48., 86., 33., 90.,
26., 52., 86., 83., 75., 85., 92., 93., 81., 83., 90.,
27., 26., 82., 94., 30., 62., 20., 57., 75., 80., 33.,
104., 40., 88., 80., 5., 27., 27., 74., 61., 42., 64.,
27., 56., 82., 79., 96., 91., 72., 60., 40., 93., 85.,
98., 32., 90., 83., 91., 41., 19., 20., 43., 27., 79.,
86., 83., 31., 47., 84., 94., 84., 94., 82., 84., 37.,
66., 41., 86., 99., 40., 76., 102., 76., 90., 87., 77.,
37., 88., 76., 36., 28., 99., 96., 34., 91., 82., 84.,
27., 61., 95., 29., 38., 96., 70., 95.])

plt.scatter(X[y_means == 0,0],X[y_means == 0,1],color='blue')


plt.scatter(X[y_means == 1,0],X[y_means == 1,1],color='red')
plt.scatter(X[y_means == 2,0],X[y_means == 2,1],color='green')
plt.scatter(X[y_means == 3,0],X[y_means == 3,1],color='yellow')


from sklearn.datasets import make_blobs # k_means on 3D


centroids = [(-5,-5,5),(5,5,-5),(3.5,-2.5,4),(-2.5,2.5,-4)]
cluster_std = [1,1,1,1]
X,y=make_blobs(n_samples=200,cluster_std=cluster_std,centers=centroids,

n_features= 3,random_state=1)
X
array([[ 4.33424548, 3.32580419, 4.17497018],
[-3.32246719, 3.22171129, -4.625342 ],
[-6.07296862, -4.13459237, 2.6984613 ],
[ 6.90465871, 6.1110567 , -4.3409502 ],
[-2.60839207, 2.95015551, -2.2346649 ],
[ 5.88490881, 4.12271848, -5.86778722],
[-4.68484061, -4.15383935, 4.14048406],
[-1.82542929, 3.96089238, -3.4075272 ],


[-5.34385368, -4.95640314, 4.37999916],


[ 4.91549197, 4.70263812, -4.582698 ],
[-3.80108212, -4.81484358, 4.62471505],
[ 4.6735005 , 3.65732421, -3.88561702],
[-6.23005814, -4.4494625 , 5.79280687],
[-3.90232915, 2.95112294, -4.6949209 ],
[ 3.72744124, 5.31354772, -4.49681519],
[-3.3088472 , 3.05743945, -3.81896126],
[ 2.70273021, -2.21732429, 3.17390257],
[ 4.06438286, -0.36217193, 3.214466 ],
[ 4.69268607, -2.73794194, 5.15528789],
[ 4.1210827 , -1.5438783 , 3.29415949],
[-6.61577235, -3.87858229, 5.40890054],
[ 3.05777072, -2.17647265, 3.89000851],
[-1.48617753, 0.27288737, -5.6993336 ],
[-5.3224172 , -5.38405435, 6.13376944],
[-5.26621851, -4.96738545, 3.62688268],
[ 5.20183018, 5.66102029, -3.20784179],
[-2.9189379 , 2.02081508, -5.95210529],
[ 3.30977897, -2.94873803, 3.32755196],
[ 5.12910158, 6.6169496 , -4.49725912],
[-2.46505641, 3.95391758, -3.33831892],
[ 1.46279877, -4.44258918, 1.49355935],
[ 3.87798127, 4.48290554, -5.99702683],
[ 4.10944442, 3.8808846 , -3.0439211 ],


[-6.09989127, -5.17242821, 4.12214158],


[-3.03223402, 3.6181334 , -3.3256039 ],
[ 7.44936865, 4.45422583, -5.19883786],
[-4.47053468, -4.86229879, 5.07782113],
[-1.46701622, 2.27758597, -2.52983966],
[ 3.0208429 , -2.14983284, 4.01716473],
[ 3.82427424, -2.47813716, 3.53132618],
[-5.74715829, -3.3075454 , 5.05080775],
[-1.51364782, 2.03384514, -2.61500866],
[-4.80170028, -4.88099135, 4.32933771],
[ 6.55880554, 5.1094027 , -6.2197444 ],
[-1.48879294, 1.02343734, -4.14319575],
[ 4.30884436, -0.71024532, 4.45128402],
[ 3.58646441, -4.64246673, 3.16983114],
[ 3.37256166, 5.60231928, -4.5797178 ],
[-1.39282455, 3.94287693, -4.53968156],
[-4.64945402, -6.31228341, 4.96130449],
[ 3.88352998, 5.0809271 , -5.18657899],
[ 3.32454103, -3.43391466, 3.46697967],
[ 3.45029742, -2.03335673, 5.03368687],
[-2.95994283, 3.14435367, -3.62832971],
[-3.03289825, -6.85798186, 6.23616403],
[-4.13665468, -5.1809203 , 4.39607937],
[-3.6134361 , 2.43258998, -2.83856002],
[ 2.07344458, -0.73204005, 3.52462712],


[ 4.11798553, -2.68417633, 3.88401481],


[ 3.60337958, 4.13868364, -4.32528847],
[-5.84520564, -5.67124613, 4.9873354 ],
[-2.41031359, 1.8988432 , -3.44392649],
[-2.75898285, 2.6892932 , -4.56378873],
[-2.442879 , 1.70045251, -4.2915946 ],
[ 3.9611641 , -3.67598267, 5.01012718],
[-7.02220122, -5.30620401, 5.82797464],
[ 2.90019547, -1.37658784, 4.30526704],
[ 5.81095167, 6.04444209, -5.40087819],
[-5.75439794, -3.74713184, 5.51292982],
[-2.77584606, 3.72895559, -2.69029409],
[ 3.07085772, -1.29154367, 5.1157018 ],
[ 2.206915 , 6.93752881, -4.63366799],
[ 4.2996015 , 4.79660555, -4.75733056],
[ 4.86355526, 4.88094581, -4.98259059],
[-4.38161974, -4.76750544, 5.68255141],
[ 5.42952614, 4.3930016 , -4.89377728],
[ 3.69427308, 4.65501279, -5.23083974],
[ 5.90148689, 7.52832571, -5.24863478],
[-4.87984105, -4.38279689, 5.30017032],
[ 3.93816635, -1.37767168, 3.0029802 ],
[-3.32862798, 3.02887975, -6.23708651],
[-4.76990526, -4.23798882, 4.77767186],
[-2.12754315, 2.3515102 , -4.1834002 ],


[-0.64699051, 2.64225137, -3.48649452],


[-5.63699565, -4.80908452, 7.10025514],
[-1.86341659, 3.90925339, -2.37908771],
[ 4.82529684, 5.98633519, -4.7864661 ],
[-5.24937038, -3.53789206, 2.93985929],
[-4.59650836, -4.40642148, 3.90508815],
[-3.66400797, 3.19336623, -4.75806733],
[ 6.29322588, 4.88955297, -5.61736206],
[-2.85340998, 0.71208711, -3.63815268],
[-2.35835946, -0.01630386, -4.59566788],
[ 5.61060505, -3.80653407, 4.07638048],
[-1.78695095, 3.80620607, -4.60460297],
[-6.11731035, -4.7655843 , 6.65980218],
[-5.63873041, -4.57650565, 5.07734007],
[ 5.62336218, 4.56504332, -3.59246 ],
[-3.37234925, -4.6619883 , 3.80073197],
[-5.69166075, -5.39675353, 4.3128273 ],
[ 7.19069973, 3.10363908, -5.64691669],
[-3.86837061, -3.48018318, 7.18557541],
[-4.62243621, -4.87817873, 6.12948391],
[ 5.21112476, 5.01652757, -4.82281228],
[-2.61877117, 2.30100182, -2.13352862],
[-2.92449279, 1.76846902, -5.56573815],
[-2.80912132, 3.01093777, -2.28933816],
[ 4.35328122, -2.91302931, 5.83471763],


[ 2.79865557, -3.03722302, 4.15626385],


[-3.65498263, 2.3223678 , -5.51045638],
[ 4.8887794 , -3.16134424, 7.03085711],
[ 4.94317552, 5.49233656, -5.68067814],
[ 3.97761018, -3.52188594, 4.79452824],
[-3.41844004, 2.39465529, -3.36980433],
[ 3.50854895, -2.66819884, 3.82581966],
[-2.63971173, 3.88631426, -3.45187042],
[-3.37565464, -5.61175641, 4.47182825],
[-2.37162301, 4.26041518, -3.03346075],
[ 1.81594001, -3.6601701 , 5.35010682],
[ 5.04366899, 4.77368576, -3.66854289],
[-4.19813897, -4.9534327 , 4.81343023],
[ 5.1340482 , 6.20205486, -4.71525189],
[ 3.39320601, -1.04857074, 3.38196315],
[ 4.34086156, -2.60288722, 5.14690038],
[-0.80619089, 2.69686978, -3.83013074],
[-5.62353073, -4.47942366, 3.85565861],
[ 5.56578332, -3.97115693, 3.1698281 ],
[ 4.41347606, 3.76314662, -4.12416107],
[ 4.01507361, -5.28253447, 4.58464661],
[-5.02461696, -5.77516162, 6.27375593],
[ 5.55635552, -0.73975077, 3.93934751],
[-5.20075807, -4.81343861, 5.41005165],
[-2.52752939, 4.24643509, -4.77507029],


[-3.85527629, -4.09840928, 5.50249434],


[ 5.78477065, 4.04457474, -4.41408957],
[ 1.74407436, -1.7852104 , 4.85270406],
[ 3.27123417, -0.88663863, 3.62519531],
[ 7.18697965, 5.44136444, -5.10015523],
[-2.78899734, 2.10818376, -3.31599867],
[-3.37000822, 2.86919047, -3.14671781],
[-4.30196797, -5.44712856, 6.2245077 ],
[ 3.95541062, 7.05117344, -4.414338 ],
[ 3.55912398, 6.23225307, -5.25417987],
[-3.09384307, 2.15609929, -5.00016919],
[-5.93576943, -5.26788808, 5.53035547],
[ 5.83600472, 6.54335911, -4.24119434],
[ 4.68988323, 2.56516224, -3.9611754 ],
[-5.29809284, -4.51148185, 4.92442829],
[-1.30216916, 4.20459417, -2.95991085],
[ 4.9268873 , 6.16033857, -4.63050728],
[-3.30618482, 2.24832579, -3.61728483],
[ 4.50178644, 4.68901502, -5.00189148],
[ 3.86723181, -1.26710081, 3.57714304],
[ 4.32458463, -1.84541985, 3.94881155],
[ 4.87953543, 3.76687926, -6.18231813],
[ 3.51335268, -3.1946936 , 4.6218035 ],
[-4.83061757, -4.25944355, 4.0462994 ],
[-1.6290302 , 1.99154287, -3.22258079],


[ 1.62683902, -1.57938488, 3.96463208],


[ 6.39984394, 4.21808832, -5.43750898],
[ 5.82400562, 4.43769457, -3.04512192],
[-3.25518824, -5.7612069 , 5.3190391 ],
[-4.95778625, -4.41718479, 3.89938082],
[ 2.75003038, -0.4453759 , 4.05340954],
[ 3.85249436, -2.73643695, 4.7278135 ],
[-5.10174587, -4.13111384, 5.75041164],
[-4.83996293, -4.12383108, 5.31563495],
[ 1.086497 , -4.27756638, 3.22214117],
[ 4.61584111, -2.18972771, 1.90575218],
[-4.25795584, -5.19183555, 4.11237104],
[ 5.09542509, 5.92145007, -4.9392498 ],
[-6.39649634, -6.44411381, 4.49553414],
[ 5.26246745, 5.2764993 , -5.7332716 ],
[ 3.5353601 , -4.03879325, 3.55210482],
[ 5.24879916, 4.70335885, -4.50478868],
[ 5.61853913, 4.55682807, -3.18946509],
[-2.39265671, 1.10118718, -3.91823218],
[ 3.16871683, -2.11346085, 3.14854434],
[ 3.95161595, -1.39582567, 3.71826373],
[-4.09914405, -5.68372786, 4.87710977],
[-1.9845862 , 1.38512895, -4.76730983],
[-1.45500559, 3.1085147 , -4.0693287 ],
[ 2.94250528, -1.56083126, 2.05667659],


[ 2.77440288, -3.36776868, 3.86402267],


[ 4.50088142, -2.88483225, 5.45810824],
[-5.35224985, -6.1425182 , 4.65065728],
[-2.9148469 , 2.95194604, -5.57915629],
[-4.06889792, -4.71441267, 5.88514116],
[ 3.47431968, 5.79502609, -5.37443832],
[ 3.66804833, 3.23931144, -6.65072127],
[-3.22239191, 3.59899633, -4.90163449],
[-3.6077125 , 2.48228168, -5.71939447],
[ 5.5627611 , 5.24073709, -4.71933492],
[ 1.38583608, -2.91163916, 5.27852808],
[ 4.42001793, -2.69505734, 4.80539342],
[ 4.71269214, 5.68006984, -5.3198016 ],
[-4.13744959, 6.4586027 , -3.35135636],
[-5.20889423, -4.41337681, 5.83898341],
[ 2.6194224 , -2.77909772, 5.62284909],
[-1.3989998 , 3.28002714, -4.6294416 ]])
wcss = []  # Elbow method
for i in range(1, 21):
    km = KMeans(n_clusters=i)
    km.fit_predict(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 21), wcss)

km = KMeans(n_clusters=4)
y_pred = km.fit_predict(X)

df = pd.DataFrame()

df['col1'] = X[:,0]
df['col2'] = X[:,1]
df['col3'] = X[:,2]
df['label'] = y_pred
df


y_pred

OUTPUT

array([0, 2, 1, 0, 2, 0, 1, 2, 1, 0, 1, 0, 1, 2, 0, 2, 3, 3, 3, 3, 1, 3,
2, 1, 1, 0, 2, 3, 0, 2, 3, 0, 0, 1, 2, 0, 1, 2, 3, 3, 1, 2, 1, 0,
2, 3, 3, 0, 2, 1, 0, 3, 3, 2, 1, 1, 2, 3, 3, 0, 1, 2, 2, 2, 3, 1,
3, 0, 1, 2, 3, 0, 0, 0, 1, 0, 0, 0, 1, 3, 2, 1, 2, 2, 1, 2, 0, 1,
1, 2, 0, 2, 2, 3, 2, 1, 1, 0, 1, 1, 0, 1, 1, 0, 2, 2, 2, 3, 3, 2,
3, 0, 3, 2, 3, 2, 1, 2, 3, 0, 1, 0, 3, 3, 2, 1, 3, 0, 3, 1, 3, 1,
2, 1, 0, 3, 3, 0, 2, 2, 1, 0, 0, 2, 1, 0, 0, 1, 2, 0, 2, 0, 3, 3,
0, 3, 1, 2, 3, 0, 0, 1, 1, 3, 3, 1, 1, 3, 3, 1, 0, 1, 0, 3, 0, 0,
2, 3, 3, 1, 2, 2, 3, 3, 3, 1, 2, 1, 0, 0, 2, 2, 0, 3, 3, 0, 2, 1,
3, 2], dtype=int32)


EX.NO 7 CREATION OF MATRIX USING NUMPY

PROGRAM

import numpy as np
# Step 1: Create a 3x4 matrix with values from 10 to 21
matrix = np.arange(10, 22).reshape(3, 4)

print("Original Matrix:\n", matrix)

# Step 2: Replace value 21 with a square number (e.g., 25)

matrix[matrix == 21] = 25
print("\nMatrix after replacing 21 with 25:\n", matrix)

OUTPUT

Original Matrix:
[[10 11 12 13]
[14 15 16 17]
[18 19 20 21]]

Matrix after replacing 21 with 25:


[[10 11 12 13]
[14 15 16 17]
[18 19 20 25]]
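
The replacement in step 2 relies on a boolean mask: `matrix == 21` yields an array of True/False values, and assigning through it changes only the True positions. As a hedged alternative (not part of the original program), `np.where` builds a new array from the same mask and leaves the original untouched.

```python
import numpy as np

matrix = np.arange(10, 22).reshape(3, 4)

mask = matrix == 21                    # boolean array, True only where the value is 21
replaced = np.where(mask, 25, matrix)  # new array; `matrix` itself is left unchanged

print(mask.sum())
print(replaced[2, 3], matrix[2, 3])
```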


EX.NO:8 IMPLEMENTATION OF BASIC EDA OPERATIONS WITH DATASET

PROGRAM

import pandas as pd
df=pd.read_csv("Symptom-severity.csv")
df.head()
df.isnull().sum()
df.dtypes
df.describe()
df['weight'].value_counts()
df['Symptom'].value_counts()

OUTPUT


EX.NO:9 CREATION OF DATAFRAME FOR SORTING AND RANKING SALES DATA

PROGRAM

import pandas as pd
# Step 1: Create sample product sales data
data = {
'Product': ['Laptop', 'Smartphone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse', 'Printer'],
'Units_Sold': [150, 300, 120, 180, 400, 500, 90],
'Unit_Price': [700, 500, 300, 200, 50, 25, 150]
}

# Create DataFrame
df = pd.DataFrame(data)
# Step 2: Calculate Sales Amount
df['Sales_Amount'] = df['Units_Sold'] * df['Unit_Price']

# Step 3: Sort by Sales Amount in descending order

df_sorted = df.sort_values(by='Sales_Amount', ascending=False)

# Step 4: Rank products based on Sales Amount


df_sorted['Sales_Rank'] = df_sorted['Sales_Amount'].rank(ascending=False,
method='dense').astype(int)
# Display the final DataFrame
print("Product Sales Data Ranked:\n", df_sorted)

OUTPUT

Product Sales Data Ranked:
      Product  Units_Sold  Unit_Price  Sales_Amount  Sales_Rank
1  Smartphone         300         500        150000           1
0      Laptop         150         700        105000           2
3     Monitor         180         200         36000           3
2      Tablet         120         300         36000           3
4    Keyboard         400          50         20000           4
6     Printer          90         150         13500           5
5       Mouse         500          25         12500           6
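
The `method='dense'` argument decides how tied values are ranked. A short sketch on a handful of sales amounts that include one tie (36000 twice, as for Monitor and Tablet) contrasts it with `method='min'`:

```python
import pandas as pd

# Sales amounts with one tie, as in the Monitor and Tablet rows (36000 each).
sales = pd.Series([150000, 105000, 36000, 36000, 20000])

dense = sales.rank(ascending=False, method="dense").astype(int)  # ties share a rank, no gap after
minimum = sales.rank(ascending=False, method="min").astype(int)  # ties share a rank, gap after

print(dense.tolist())
print(minimum.tolist())
```

With dense ranking the value after the tie receives rank 4; with `min` it receives rank 5 because one rank is skipped.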


EX.NO:10 FINDING OF HIGHEST RAINFALL AREA IN TAMIL NADU

PROGRAM

import pandas as pd

import matplotlib.pyplot as plt

# --- 1. Load the data ---

data = {
    'District': ['Chennai', 'Coimbatore', 'Madurai', 'Tiruchirappalli', 'Kanyakumari', 'Vellore'],
    'November_Rainfall_mm': [450, 120, 180, 150, 500, 100]
}

df = pd.DataFrame(data)

df['District'] = df['District'].str.title() # Normalize casing

# --- 2. Find the highest rainfall area in November (handle ties) ---

max_rainfall = df['November_Rainfall_mm'].max()

highest_rainfall_areas = df[df['November_Rainfall_mm'] == max_rainfall]

print("District(s) with highest rainfall in November:")
for _, row in highest_rainfall_areas.iterrows():
    print(f"- {row['District']} ({row['November_Rainfall_mm']} mm)")
print()

# --- 3. Create a bar chart ---

plt.figure(figsize=(10, 6))

bars = plt.bar(df['District'], df['November_Rainfall_mm'], color='skyblue')

plt.xlabel('District')

plt.ylabel('Rainfall (mm)')

plt.title('November Rainfall in Tamil Nadu Districts')

plt.xticks(rotation=45, ha='right')

# Add value labels to each bar
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height + 10, f"{height} mm",
             ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

# --- 4. Calculate rainfall percentage for a specific district ---
specific_district = "chennai"  # Can be lowercase or mixed case
specific_district = specific_district.title()

if specific_district in df['District'].values:
    total_rainfall = df['November_Rainfall_mm'].sum()
    district_rainfall = df[df['District'] == specific_district]['November_Rainfall_mm'].iloc[0]
    rainfall_percentage = (district_rainfall / total_rainfall) * 100
    print(f"Rainfall percentage for {specific_district} in November: {rainfall_percentage:.2f}%")
else:
    print(f"District '{specific_district}' not found in the dataset.")
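
The percentage in step 4 is simply the district's rainfall divided by the total over all districts. The sketch below repeats that arithmetic in plain Python on the same six figures, which gives a quick way to check the program's result.

```python
# November rainfall figures (mm) copied from the exercise's data dictionary.
rainfall = {
    "Chennai": 450, "Coimbatore": 120, "Madurai": 180,
    "Tiruchirappalli": 150, "Kanyakumari": 500, "Vellore": 100,
}

total = sum(rainfall.values())
shares = {district: mm / total * 100 for district, mm in rainfall.items()}
wettest = max(rainfall, key=rainfall.get)

print(f"Total: {total} mm")
print(f"Chennai share: {shares['Chennai']:.2f}%")
print(f"Highest rainfall: {wettest}")
```

The total is 1500 mm, so Chennai's 450 mm corresponds to 30.00%, and Kanyakumari has the highest rainfall.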

OUTPUT


EX.NO:11 DIFFERENTIATION OF FAST AND LEAST MOVING ITEMS IN THE SHOP

PROGRAM

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

# Load the Supermarket dataset

df = pd.read_csv('/content/SuperMarket Analysis.csv')

# Show available columns

print("Columns in dataset:\n", df.columns)

# Group by item/product name and sum quantities sold

item_sales = df.groupby('Product line')['Quantity'].sum().sort_values(ascending=False)

# Identify fast-moving and least-moving items

fast_moving = item_sales.head(10)

least_moving = item_sales.tail(10)

# Plot fast and least moving items side-by-side

plt.figure(figsize=(14, 6))

# Fast-moving items

plt.subplot(1, 2, 1)

sns.barplot(x=fast_moving.values, y=fast_moving.index, palette='Greens_r')

plt.title('Top 10 Fast-Moving Items')

plt.xlabel('Total Quantity Sold')

plt.ylabel('Item')


# Least-moving items

plt.subplot(1, 2, 2)

sns.barplot(x=least_moving.values, y=least_moving.index, palette='Reds_r')

plt.title('Bottom 10 Least-Moving Items')

plt.xlabel('Total Quantity Sold')

plt.ylabel('Item')

plt.tight_layout()

plt.show()

# ----------------------------

# Stock availability subplot

# ----------------------------

# Assuming 'Stock' column is available for current stock per item

if 'Stock' in df.columns:

    # Group by the same 'Product line' column used for the sales totals above

    item_stock = df.groupby('Product line')['Stock'].sum().sort_values(ascending=False)

    plt.figure(figsize=(14, 6))

    sns.barplot(x=item_stock.index, y=item_stock.values, palette='Blues_r')

    plt.xticks(rotation=90)

    plt.title('Stock Available per Item in Store')

    plt.xlabel('Item')

    plt.ylabel('Available Stock')

    plt.tight_layout()

    plt.show()

else:

    print("\nColumn 'Stock' not found in the dataset. Please ensure a 'Stock' column exists for stock-level visualization.")
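Note that `head(10)` and `tail(10)` overlap whenever the grouped series has fewer than 20 entries, so the same item can appear in both charts. One possible way to avoid this, sketched below in plain Python, is to cap the group size at half the number of items. The function name `split_fast_and_slow` and the sample totals are assumptions made for illustration.

```python
def split_fast_and_slow(quantity_by_item, n=10):
    """Return (fast, slow) lists of (item, quantity) pairs with no item in both."""
    # Rank items from highest to lowest total quantity sold
    ranked = sorted(quantity_by_item.items(), key=lambda kv: kv[1], reverse=True)
    # Cap n at half the item count so the two groups cannot overlap
    n = min(n, len(ranked) // 2)
    fast = ranked[:n]
    slow = ranked[-n:] if n else []
    return fast, slow


# Hypothetical totals for six product lines
sample = {
    "Food": 950, "Fashion": 900, "Sports": 920,
    "Home": 910, "Health": 850, "Electronics": 970,
}
fast, slow = split_fast_and_slow(sample, n=10)
print("Fast-moving:", [name for name, _ in fast])
print("Least-moving:", [name for name, _ in slow])
```

With six items and `n=10`, each group is limited to three items, so the two charts would show disjoint sets.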

OUTPUT


EX.NO:12 CREATION OF PIE CHART TO DISPLAY FOOD ITEMS

PROGRAM

import matplotlib.pyplot as plt

# Sample data

food_items = ['Apples', 'Bread', 'Milk', 'Rice', 'Eggs']

prices = [120, 40, 60, 80, 50]

# Pie chart

plt.figure(figsize=(6,6))

plt.pie(prices, labels=food_items, autopct='%1.1f%%', startangle=140)

plt.title('Price Distribution of Food Items in Market')

plt.axis('equal') # Ensures the pie is a circle

plt.show()
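The `autopct` labels come from simple arithmetic: matplotlib divides each value by the sum of all values and formats the result as a percentage. A small sketch of that calculation for the sample prices above, written in plain Python so it runs without matplotlib:

```python
food_items = ['Apples', 'Bread', 'Milk', 'Rice', 'Eggs']
prices = [120, 40, 60, 80, 50]

# Each slice's share is its value divided by the total, expressed in percent
total = sum(prices)
shares = [round(price / total * 100, 1) for price in prices]

for item, share in zip(food_items, shares):
    print(f"{item}: {share}%")
```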


OUTPUT


EX.NO:13 CREATION OF LINEAR REGRESSION FOR STUDENT’S INTERNAL MARKS

PROGRAM

# Import libraries

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

# Load dataset

df = pd.read_csv("StudentAcademicData.csv - Sheet1.csv")

# Extract relevant columns

df = df[['Internal marks (out of 20)', 'External marks (out of 80)']].dropna()

df.columns = ['Internal_Marks', 'External_Marks']

# Split data into features and target

X = df[['Internal_Marks']]

y = df['External_Marks']

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,

random_state=42)

# Create and train model

model = LinearRegression()

model.fit(X_train, y_train)


# Make predictions

y_pred = model.predict(X_test)

# Evaluation

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

print("R-squared Score:", r2_score(y_test, y_pred))

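# Predict new result

# Use a DataFrame with the training column name so scikit-learn does not warn about missing feature names

new_mark = pd.DataFrame({'Internal_Marks': [15]}) # example internal mark

predicted_result = model.predict(new_mark)

print("Predicted Semester Result for Internal Mark 15:", predicted_result[0])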

# Plotting

plt.scatter(X, y, color='blue', label='Actual Data')

plt.plot(X, model.predict(X), color='red', label='Regression Line')

plt.xlabel("Internal Marks (out of 20)")

plt.ylabel("Semester Result (External Marks out of 80)")

plt.title("Linear Regression: Internal vs External Marks")

plt.legend()

plt.grid(True)

plt.show()


OUTPUT

Mean Squared Error: 284.50287708929307

R-squared Score: -0.007268107945806568

Predicted Semester Result for Internal Mark 15: 53.68967909800521
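The R-squared score above is slightly negative. Since R-squared is defined as 1 - SS_res / SS_tot, a value of 0 corresponds to always predicting the mean of the test targets, and a negative value means the fitted line did slightly worse than that baseline on the test split. The sketch below recomputes the metric by hand on hypothetical marks (not the exercise's data); `r2_score_manual` is an illustrative name.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


# Hypothetical external marks and two sets of predictions
y_true = [40, 50, 60, 70]
baseline = [55, 55, 55, 55]   # always predict the mean of y_true
worse = [70, 60, 50, 40]      # predictions that run against the data

print(r2_score_manual(y_true, baseline))  # 0.0
print(r2_score_manual(y_true, worse))     # -3.0
```

A score this close to zero suggests that, in this dataset, internal marks alone explain almost none of the variation in external marks.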


EX.NO:14 CREATION OF CLUSTER USING MACHINE LEARNING ALGORITHM

PROGRAM

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

from sklearn.metrics import silhouette_score

import matplotlib.pyplot as plt

# Step 1: Load the dataset

df = pd.read_csv("salaries.csv") # Update with your filename

# Step 2: Clean and combine text fields

# Check for the existence of 'job_title' column before using it

if 'job_title' in df.columns:

    df.dropna(subset=["job_title"], how='all', inplace=True)

    df["text"] = df["job_title"].fillna('')

else:

    # Stop here: every later step needs the 'text' column built from 'job_title'

    raise SystemExit("Error: 'job_title' column not found in the DataFrame.")

# Step 3: TF-IDF Vectorization

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

X = vectorizer.fit_transform(df["text"])

# Step 4: Apply KMeans Clustering

n_clusters = 4


kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init='auto')

df["cluster"] = kmeans.fit_predict(X)

# Step 5: Print Top Terms per Cluster

def print_top_keywords_per_cluster(model, vectorizer, n_terms=10):

    terms = vectorizer.get_feature_names_out()

    for i, center in enumerate(model.cluster_centers_):

        top_terms = center.argsort()[-n_terms:][::-1]

        print(f"\n Cluster {i} Top Keywords:")

        print(", ".join(terms[top_terms]))

print_top_keywords_per_cluster(kmeans, vectorizer)

# Step 6: Display Sample Job Titles per Cluster

for i in range(n_clusters):

    print(f"\n Cluster {i} - Sample Job Offers:")

    # Check for the existence of 'company' and 'job_title' columns before using them

    if 'company' in df.columns and 'job_title' in df.columns:

        print(df[df["cluster"] == i][["company", "job_title"]].head(3))

    elif 'job_title' in df.columns:

        print(df[df["cluster"] == i][["job_title"]].head(3))

    else:

        print("Error: 'company' or 'job_title' column not found in the DataFrame.")

# Step 7: Optional Visualization - Cluster Size

plt.figure(figsize=(6, 4))

df["cluster"].value_counts().sort_index().plot(kind='bar', color='skyblue')


plt.title(" Number of Job Offers per Cluster")

plt.xlabel("Cluster")

plt.ylabel("Count")

plt.xticks(rotation=0)

plt.grid(True)

plt.tight_layout()

plt.show()

# Step 8: Optional - Evaluate Clustering Quality

sil_score = silhouette_score(X, df["cluster"])

print(f"\n Silhouette Score: {sil_score:.3f}")

# Step 9: Optional - Save the results

df.to_csv("clustered_ml_jobs.csv", index=False)

OUTPUT

Cluster 0 Top Keywords:

executive, account, enterprise, visualization, writer, web, trainee, trader, technology, technologist

Cluster 1 Top Keywords:

engineer, software, data, learning, machine, analytics, ai, research, systems, intelligence

Cluster 2 Top Keywords:

manager, product, engineering, data, analytics, governance, operations, intelligence, business, ai


Cluster 3 Top Keywords:

data, scientist, analyst, architect, research, associate, developer, applied, specialist, consultant

Cluster 0 - Sample Job Offers:

job_title

1437 Executive

1438 Executive

1982 Account Executive

Cluster 1 - Sample Job Offers:

job_title

8 Software Engineer

9 Software Engineer

10 Machine Learning Engineer

Cluster 2 - Sample Job Offers:

job_title

24 Manager

25 Manager

72 Manager

Cluster 3 - Sample Job Offers:

job_title

0 Analyst

1 Analyst

2 Data Quality Lead
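To illustrate what Step 3 feeds into KMeans, here is a rough sketch of TF-IDF weighting in plain Python. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation) but uses a naive whitespace tokenizer and no stop-word removal, so its numbers will not match `TfidfVectorizer` exactly. The function name `tfidf` and the sample titles are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Leave empty documents as empty vectors instead of dividing by zero
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors


# Hypothetical job titles
titles = ["Data Engineer", "Data Scientist", "Data Analyst", "Software Engineer"]
vectors = tfidf(titles)
for title, vec in zip(titles, vectors):
    print(title, {term: round(weight, 3) for term, weight in vec.items()})
```

In this sample, "data" appears in three of the four titles and therefore receives a lower weight than rarer words such as "scientist", which is why the cluster keywords tend to be the terms that distinguish one group of titles from another.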


EX.NO:15 SALARY PREDICTION USING MACHINE LEARNING ALGORITHM

PROGRAM

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

# Simulated dataset

data = pd.DataFrame({

    'experience': [1, 3, 5, 7, 10, 12],

    'education_level': ['Bachelors', 'Masters', 'Masters', 'PhD', 'PhD', 'Masters'],

    'location': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad', 'Pune'],

    'skills_score': [65, 70, 80, 85, 90, 95],

    'salary': [600000, 800000, 1200000, 1600000, 2000000, 2200000]

})

# Features and target

X = data.drop('salary', axis=1)

y = data['salary']


# Preprocessing

preprocessor = ColumnTransformer(transformers=[

    ('num', StandardScaler(), ['experience', 'skills_score']),

    ('cat', OneHotEncoder(handle_unknown='ignore'), ['education_level', 'location'])

])

# Pipeline

model = Pipeline(steps=[

    ('preprocess', preprocessor),

    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))

])

# Train-test split for visualization

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# 1. Actual vs Predicted

plt.figure(figsize=(8,5))

plt.scatter(y_test, y_pred, color='blue')

plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')

plt.xlabel('Actual Salary')

plt.ylabel('Predicted Salary')

plt.title('Actual vs Predicted Salary')

plt.grid(True)

plt.tight_layout()

plt.show()


# 2. Residuals (Error)

errors = y_test - y_pred

plt.figure(figsize=(8,5))

plt.bar(range(len(errors)), errors, color='orange')

plt.xlabel('Test Instance')

plt.ylabel('Prediction Error (₹)')

plt.title('Prediction Errors per Test Case')

plt.grid(True)

plt.tight_layout()

plt.show()

# 3. Feature Importance

rf_model = model.named_steps['regressor']

# Names of the transformed columns (scaled numeric + one-hot encoded categorical)

feature_names = model.named_steps['preprocess'].get_feature_names_out()

importance = rf_model.feature_importances_

plt.figure(figsize=(10,5))

plt.bar(feature_names, importance)

plt.xticks(rotation=45, ha='right')

plt.title('Feature Importance Scores')

plt.xlabel('Feature')

plt.ylabel('Importance')

plt.grid(True)

plt.tight_layout()

plt.show()
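The program imports `mean_squared_error` but never calls it. As a rough sketch of how the test-set errors could be summarised in one number, the root mean squared error can be computed as below. The salary values are hypothetical, not results from the model above, and `rmse` is an illustrative helper.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the target (here rupees)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted salaries for two test rows
actual = [600000, 800000]
predicted = [700000, 1100000]

print(f"RMSE: {rmse(actual, predicted):,.0f}")
```

With only six rows in the simulated dataset, the test split holds two rows, so any such metric is a very noisy estimate of how the model would perform on new data.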


OUTPUT


EX.NO:16 CONTENT BEYOND SYLLABUS

SENTIMENT ANALYSIS USING LEXICON CLASSIFICATION ALGORITHM

PROGRAM

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import re

%matplotlib inline

import warnings

warnings.filterwarnings("ignore")

import os

print(os.listdir("../input"))

pd.set_option('display.max_columns', None)

# on_bad_lines='skip' (pandas 1.3+) replaces error_bad_lines=False, which was removed in pandas 2.0

US_comments = pd.read_csv('../input/youtube/UScomments.csv', on_bad_lines='skip')

US_videos = pd.read_csv('../input/youtube/USvideos.csv', on_bad_lines='skip')

US_videos.head()

US_videos.shape

US_videos.nunique()

US_videos.info()

US_videos.head()

US_comments.head()

US_comments.shape

US_comments.isnull().sum()

US_comments.dropna(inplace=True)


US_comments.isnull().sum()

US_comments.shape

US_comments.nunique()

US_comments.info()

US_comments.drop(41587,inplace=True)

US_comments=US_comments.reset_index().drop('index',axis=1)

US_comments.likes=US_comments.likes.astype(int)

US_comments.replies=US_comments.replies.astype(int)

US_comments.head()

# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default

US_comments['comment_text']=US_comments['comment_text'].str.replace("[^a-zA-Z#]"," ",regex=True)

US_comments['comment_text']=US_comments['comment_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

US_comments['comment_text']=US_comments['comment_text'].apply(lambda x:x.lower())

tokenized_tweet=US_comments['comment_text'].apply(lambda x:x.split())

tokenized_tweet.head()

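import nltk

from nltk.stem import WordNetLemmatizer

from nltk.corpus import stopwords

# The stop-word list and the lemmatizer need these NLTK resources

nltk.download('stopwords')

nltk.download('wordnet')

wnl = WordNetLemmatizer()

# Build the stop-word set once instead of once per word

stop_words = set(stopwords.words('english'))

# Assign the result: apply() returns a new Series and does not modify tokenized_tweet in place

tokenized_tweet = tokenized_tweet.apply(lambda x: [wnl.lemmatize(i) for i in x if i not in stop_words])

tokenized_tweet.head()

# Join the tokens back into strings, since VADER and the word clouds below expect text

US_comments['comment_text'] = tokenized_tweet.apply(' '.join)

nltk.download('vader_lexicon')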


from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

US_comments['Sentiment Scores'] = US_comments['comment_text'].apply(lambda x:sia.polarity_scores(x)['compound'])

US_comments.head()

US_comments['Sentiment'] = US_comments['Sentiment Scores'].apply(lambda s : 'Positive' if s > 0 else ('Neutral' if s == 0 else 'Negative'))

US_comments.head()

US_comments.Sentiment.value_counts()

videos = []

video_ids = US_comments.video_id.unique()

for vid in video_ids:

    # All comments for this video

    video_comments = US_comments[US_comments.video_id == vid]

    a = (video_comments.Sentiment == 'Positive').sum() # positive comments

    b = len(video_comments) # all comments

    Percentage = (a/b)*100

    videos.append(round(Percentage,2))

Positivity = pd.DataFrame(videos, video_ids).reset_index()

Positivity.columns = ['video_id','Positive Percentage']

Positivity.head()

channels = []

for i in range(0,Positivity.video_id.nunique()):

    channels.append(US_videos[US_videos.video_id == Positivity.video_id.unique()[i]]['channel_title'].unique()[0])

Positivity['Channel'] = channels

Positivity.head()

Positivity[Positivity['Positive Percentage'] == Positivity['Positive Percentage'].max()]


Positivity[Positivity['Positive Percentage'] == Positivity['Positive Percentage'].min()]


all_words = ' '.join([text for text in US_comments['comment_text']])

from wordcloud import WordCloud

wordcloud = WordCloud(width=800, height=500, random_state=21,

max_font_size=110).generate(all_words)

plt.figure(figsize=(10, 7))

plt.imshow(wordcloud, interpolation="bilinear")

plt.axis('off')

plt.show()

all_words_posi = ' '.join([text for text in

US_comments['comment_text'][US_comments.Sentiment == 'Positive']])

wordcloud_posi = WordCloud(width=800, height=500, random_state=21,

max_font_size=110).generate(all_words_posi)

plt.figure(figsize=(10, 7))

plt.imshow(wordcloud_posi, interpolation="bilinear")

plt.axis('off')

plt.show()

all_words_nega = ' '.join([text for text in

US_comments['comment_text'][US_comments.Sentiment == 'Negative']])

wordcloud_nega = WordCloud(width=800, height=500, random_state=21,

max_font_size=110).generate(all_words_nega)

plt.figure(figsize=(10, 7))

plt.imshow(wordcloud_nega, interpolation="bilinear")

plt.axis('off')

plt.show()


all_words_neu = ' '.join([text for text in


US_comments['comment_text'][US_comments.Sentiment == 'Neutral']])

wordcloud_neu = WordCloud(width=800, height=500, random_state=21,


max_font_size=110).generate(all_words_neu)

plt.figure(figsize=(10, 7))

plt.imshow(wordcloud_neu, interpolation="bilinear")

plt.axis('off')

plt.show()
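The idea behind lexicon classification can be sketched without NLTK: each word carries a fixed sentiment weight, the weights are summed, and the sign of the total decides the label, using the same thresholds as the program above. The tiny lexicon below is hypothetical and far simpler than VADER, which also accounts for negation, intensifiers, punctuation and capitalisation and normalises its compound score to the range -1 to 1.

```python
# Toy lexicon: hypothetical words and weights chosen only for illustration
LEXICON = {
    "good": 2.0, "great": 3.0, "love": 3.0,
    "bad": -2.0, "awful": -3.0, "hate": -3.0,
}


def lexicon_score(text):
    """Sum the weights of all known words; unknown words count as 0."""
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())


def classify(text):
    """Map the score to a label with the same rule as the program: >0, ==0, <0."""
    score = lexicon_score(text)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


for comment in ["I love this great video", "awful sound", "logan paul"]:
    print(f"{comment!r}: {classify(comment)}")
```

A comment made up entirely of words missing from the lexicon scores exactly 0 and is labelled Neutral, which helps explain why short comments consisting mostly of names end up in the neutral group.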

OUTPUT

['youtube']

(7992, 11)

video_id 2364

title 2398

channel_title 1230

category_id 16

tags 2204

views 7939

likes 6624

dislikes 2531

comment_total 4152

thumbnail_link 2364

date 40


dtype: int64

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 7992 entries, 0 to 7991

Data columns (total 11 columns):

video_id 7992 non-null object

title 7992 non-null object

channel_title 7992 non-null object

category_id 7992 non-null int64

tags 7992 non-null object

views 7992 non-null int64

likes 7992 non-null int64

dislikes 7992 non-null int64

comment_total 7992 non-null int64

thumbnail_link 7992 non-null object

date 7992 non-null float64

dtypes: float64(1), int64(5), object(5)

memory usage: 686.9+ KB

(691400, 4)

video_id 0

comment_text 25

likes 0

replies 0

dtype: int64

video_id 0

comment_text 0

likes 0

replies 0

dtype: int64

(691375, 4)


video_id 2266

comment_text 434076

likes 1284

replies 479

dtype: int64

<class 'pandas.core.frame.DataFrame'>

Int64Index: 691375 entries, 0 to 691399

Data columns (total 4 columns):

video_id 691375 non-null object

comment_text 691375 non-null object

likes 691375 non-null object

replies 691375 non-null object

dtypes: object(4)

memory usage: 26.4+ MB

0 [logan, paul]

1 [been, following, from, start, your, vine, cha...

2 [kong, maverick]

3 [attendance]

4 [trending]

Name: comment_text, dtype: object

0 [logan, paul]

1 [been, following, from, start, your, vine, cha...

2 [kong, maverick]

3 [attendance]


4 [trending]

Name: comment_text, dtype: object

Positive 305358
Neutral 260986
Negative 125030
Name: Sentiment, dtype: int64
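The counts above cover all comments together, while the program also reports the share of positive comments per video. A single-pass version of that calculation might look like the following plain-Python sketch; the function name and the sample rows are assumptions for illustration, and with pandas the same result could be obtained with a `groupby` on `video_id`.

```python
from collections import defaultdict


def positive_percentage_by_video(rows):
    """rows: iterable of (video_id, sentiment) pairs. Returns {video_id: percent positive}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for video_id, sentiment in rows:
        totals[video_id] += 1
        if sentiment == "Positive":
            positives[video_id] += 1
    return {vid: round(positives[vid] / totals[vid] * 100, 2) for vid in totals}


# Hypothetical labelled comments
rows = [
    ("v1", "Positive"), ("v1", "Negative"), ("v1", "Positive"),
    ("v2", "Neutral"),
]
result = positive_percentage_by_video(rows)
print(result)  # {'v1': 66.67, 'v2': 0.0}
```

Each comment is visited once, whereas filtering the full table separately for every video rescans all comments for each of the video IDs.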


POSITIVE COMMENTS

NEGATIVE COMMENTS


NEUTRAL COMMENTS
