K. Ramakrishnan College of Engineering (Autonomous), Trichy
EX.NO:1 CALCULATION OF AVERAGE SALARY USING DATAFRAME
PROGRAM
import pandas as pd
import numpy as np
# Create the DataFrame with random integers
df = pd.DataFrame({
    'Age': np.random.randint(18, 61, size=10),           # upper bound 61 is exclusive
    'Salary': np.random.randint(30000, 100001, size=10)  # upper bound 100001 is exclusive
})
# Display the DataFrame
print("Generated DataFrame:")
print(df)
# Calculate and display the average salary
average_salary = df['Salary'].mean()
print(f"\nAverage Salary: ₹{average_salary:,.2f}")
OUTPUT
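Because the program above draws Age and Salary at random, its output changes on every run. As a minimal sketch (not part of the original exercise), seeding NumPy's generator makes the table and the average repeatable. The seed value 42 and the use of `np.random.default_rng` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42) so repeated runs draw the same numbers
rng = np.random.default_rng(42)

df = pd.DataFrame({
    'Age': rng.integers(18, 61, size=10),            # upper bound is exclusive
    'Salary': rng.integers(30000, 100001, size=10),  # upper bound is exclusive
})

average_salary = df['Salary'].mean()
print(df)
print(f"\nAverage Salary: {average_salary:,.2f}")
```

Running the sketch twice should print the same ten rows and the same average both times.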
Department of Artificial Intelligence and Data Science Page
EX.NO:2 IMPLEMENTATION OF DATA PROCESSING METHODS
PROGRAM
import pandas as pd
import numpy as np
# Sample data with some null values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 22, 29],
    'Salary': [50000, None, 62000, 58000, 60000]
}
# Create DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# 1. Dropping rows with any null value
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with null values:\n", df_dropped)
# 2. Filling null values
df_filled = df.fillna({
'Name': 'Unknown',
'Age': df['Age'].mean(), # fill Age with average
'Salary': df['Salary'].median() # fill Salary with median
})
print("\nDataFrame after filling null values:\n", df_filled)
# 3. Selecting specific data
selected_rows = df_filled[df_filled['Salary'] > 55000]
print("\nSelected rows where Salary > 55000:\n", selected_rows)
# 4. Convert 'Name' column into a list
name_list = df_filled['Name'].tolist()
print("\n'Name' column as list:\n", name_list)
OUTPUT
Original DataFrame:
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob  30.0      NaN
2  Charlie   NaN  62000.0
3    David  22.0  58000.0
4     None  29.0  60000.0

DataFrame after dropping rows with null values:
    Name   Age   Salary
0  Alice  25.0  50000.0
3  David  22.0  58000.0

DataFrame after filling null values:
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob  30.0  59000.0
2  Charlie  26.5  62000.0
3    David  22.0  58000.0
4  Unknown  29.0  60000.0

Selected rows where Salary > 55000:
      Name   Age   Salary
1      Bob  30.0  59000.0
2  Charlie  26.5  62000.0
3    David  22.0  58000.0
4  Unknown  29.0  60000.0

'Name' column as list:
['Alice', 'Bob', 'Charlie', 'David', 'Unknown']
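The two fill values used in step 2 can be checked by hand. The sketch below recomputes them with Python's standard `statistics` module from the non-null entries of the sample data. It is an illustrative cross-check rather than part of the original program.

```python
from statistics import mean, median

# Non-null entries of the sample columns (None values left out)
ages = [25, 30, 22, 29]
salaries = [50000, 62000, 58000, 60000]

age_fill = mean(ages)           # (25 + 30 + 22 + 29) / 4
salary_fill = median(salaries)  # midpoint of the two middle values, 58000 and 60000

print("Age fill value:", age_fill)
print("Salary fill value:", salary_fill)
```

With an even number of salaries, the median is the average of the two middle values, so the missing salary is filled with 59000 rather than with one of the existing entries.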
EX.NO:3 IMPLEMENTATION OF EDA
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set seed for reproducibility
np.random.seed(42)
# Create synthetic dataset
df = pd.DataFrame({
'Age': np.random.randint(22, 60, size=100),
'Salary': np.random.randint(30000, 120000, size=100),
'Experience_Years': np.random.randint(0, 35, size=100),
'Performance_Score': np.random.normal(loc=6, scale=1.5, size=100).round(1),
'Training_Hours': np.random.randint(0, 100, size=100)
})
# Inject some missing values
df.loc[np.random.choice(df.index, size=5), 'Performance_Score'] = np.nan
# Display the first few rows
print("Sample of the dataset:")
print(df.head())
# Summary statistics
print("\nSummary Statistics:")
print(df.describe())
# Check for missing data
print("\nMissing Values:")
print(df.isnull().sum())
# Histograms of numeric features
df.hist(figsize=(12, 8), edgecolor='black')
plt.suptitle('Histogram of Numeric Features', fontsize=16)
plt.tight_layout()
plt.show()
# Boxplots for all numeric features
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, palette='Set3')
plt.title('Boxplot of Numeric Features', fontsize=16)
plt.xticks(rotation=45)
plt.show()
OUTPUT
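A boxplot such as the one drawn by the program above marks points lying more than 1.5 times the interquartile range beyond the quartiles as outliers (1.5 is the default whisker length in seaborn). The following sketch computes those bounds with NumPy for a small made-up sample. The values are hypothetical and only illustrate the rule.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside these bounds are drawn as individual outlier markers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Bounds:", lower_bound, upper_bound)
print("Outliers:", outliers)
```

The same calculation could be applied to any numeric column of the synthetic dataset, for example `df['Salary'].dropna().to_numpy()`.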
EX.NO:4 IMPLEMENTATION OF SIMPLE LINEAR REGRESSION
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Step 1: Load dataset
df = pd.read_csv('placement.csv')
# Step 2: Display first few rows to understand structure
print("First 5 rows of the dataset:")
print(df.head())
# Step 3: Select input and output variables
# Here "placement_exam_marks" is the independent variable (X) and "placed" is the dependent variable (y)
X = df[['placement_exam_marks']]
y = df['placed']
# Step 4: Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 5: Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 6: Make predictions
y_pred = model.predict(X_test)
# Step 7: Evaluate the model
print("\nModel Performance:")
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))
# Step 8: Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=1, label='Predicted')
plt.xlabel('Placement Exam Marks')
plt.ylabel('Placed')
plt.title('Simple Linear Regression')
plt.legend()
plt.grid(True)
plt.show()
OUTPUT
First 5 rows of the dataset:
   cgpa  placement_exam_marks  placed
0  7.19                  26.0       1
1  7.46                  38.0       1
2  7.54                  40.0       1
3  6.42                   8.0       1
4  7.23                  17.0       0
Model Performance:
Mean Squared Error: 0.2501852253844579
R^2 Score: -0.005668678060327448
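The negative R^2 score in the output means the fitted line predicts the held-out `placed` values slightly worse than simply predicting their mean. The sketch below spells out the standard definition R^2 = 1 - SS_res / SS_tot in plain Python. The four labels and the predictions are invented for illustration, and the helper name `r2` is hypothetical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical binary labels, as in a 0/1 "placed" column
y_true = [0, 1, 1, 0]

# Predicting the mean everywhere gives a score of zero
score_mean = r2(y_true, [0.5, 0.5, 0.5, 0.5])

# Predictions that lean the wrong way give a negative score
score_bad = r2(y_true, [0.6, 0.4, 0.4, 0.6])

print("R^2 when predicting the mean:", score_mean)
print("R^2 for misleading predictions:", round(score_bad, 2))
```

A score close to zero, as in this exercise, suggests that the exam marks carry little linear information about the 0/1 placement label.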
EX.NO:5 IMPLEMENTATION OF DECISION TREE
PROGRAM
import pandas as pd
df1=pd.read_csv("Symptom-severity.csv")
df2=pd.read_csv("dataset.csv")
df3=pd.read_csv("symptom_Description.csv")
df4=pd.read_csv("symptom_precaution.csv")
df1.head()
df2.head()
OUTPUT
df3.head()
df4.head()
res = pd.concat([df1,df2,df3,df4])
res
res.info()
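# Fill numeric nulls with the column median
res = res.fillna(res.median(numeric_only=True))
# Fill text nulls with the most frequent value of each column
for col in res.select_dtypes(include='object'):
    mode_val = res[col].mode()
    if not mode_val.empty:
        res[col] = res[col].fillna(mode_val[0])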
res
res.describe()
res.info()
res['Precaution_Merged'] = res[['Precaution_1', 'Precaution_2', 'Precaution_3',
    'Precaution_4']].astype(str).agg(', '.join, axis=1)
res['Symptom_Merged'] = res[['Symptom', 'Symptom_1', 'Symptom_2', 'Symptom_3', 'Symptom_4',
    'Symptom_5', 'Symptom_6', 'Symptom_7', 'Symptom_8', 'Symptom_9',
    'Symptom_10', 'Symptom_11', 'Symptom_12', 'Symptom_13', 'Symptom_14', 'Symptom_15',
    'Symptom_17']].astype(str).agg(', '.join, axis=1)
res.head()
input=res[['weight','Description','Precaution_Merged','Symptom_Merged']]
target=res['Disease']
input.head()
target.head()
from sklearn.preprocessing import LabelEncoder
le_Description=LabelEncoder()
le_Precaution_Merged=LabelEncoder()
le_Symptom_Merged=LabelEncoder()
input['Description_n'] = le_Description.fit_transform(input['Description'])
input['Precaution_Merged_n'] = le_Precaution_Merged.fit_transform(input['Precaution_Merged'])
input['Symptom_Merged_n'] = le_Symptom_Merged.fit_transform(input['Symptom_Merged'])
input.head()
input = input.drop(['Description', 'Precaution_Merged', 'Symptom_Merged'], axis='columns')
input.head()
le_Disease=LabelEncoder()
target=le_Disease.fit_transform(target)
target
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(input,target,test_size=0.33,random_state=42)
from sklearn import tree
model=tree.DecisionTreeClassifier(max_depth=3)
model.fit(x_train,y_train)
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
import numpy as np
plt.figure(figsize=(20, 10))
plot_tree(
    model,
    feature_names=list(input.columns),  # a plain list is accepted across scikit-learn versions
    class_names=[str(c) for c in sorted(np.unique(target))],
    filled=True,
    rounded=True,
    fontsize=10,
    max_depth=3)
plt.title('Decision Tree')
plt.show()
OUTPUT
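The program above relies on `LabelEncoder` to turn text columns into integers. As a rough sketch of that mapping, the plain-Python code below assigns codes to the distinct labels in sorted order, which is how `LabelEncoder` orders its `classes_`. The disease names and the helper `encode_labels` are hypothetical.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes

# Hypothetical target values
diseases = ['Malaria', 'Allergy', 'Malaria', 'Dengue']
codes, classes = encode_labels(diseases)

print("Classes:", classes)
print("Codes:", codes)

# Decoding back, similar in spirit to inverse_transform
decoded = [classes[c] for c in codes]
print("Decoded:", decoded)
```

Note that the integer codes carry no real ordering. A tree can still split on them, but thresholds such as "code <= 1.5" have no natural meaning for text fields like the merged symptom strings.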
EX.NO:6 IMPLEMENTATION OF K MEANS ALGORITHM
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('Instagram visits clustering.csv')
df
plt.scatter(df[['Instagram visit score']],df['Spending_rank(0 to 100)'])
OUTPUT
#Building a K-means Clustering model
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km = KMeans(n_clusters=i)
    km.fit_predict(df)
    wcss.append(km.inertia_)
wcss
plt.plot(range(1,11),wcss)
X = df.iloc[:,:].values #numpy array
km = KMeans(n_clusters=4)
y_means = km.fit_predict(X)
y_means
X[y_means==0]
X[y_means==1]
X[y_means==2]
X[y_means==3]
X[y_means == 3,1]
array([ 70., 79., 28., 49., 92., 90., 80., 84., 76., 90., 36.,
84., 84., 49., 96., 43., 71., 10., 102., 78., 86., 50.,
71., 82., 18., 88., 74., 40., 92., 22., 118., 84., 27.,
78., 84., 86., 82., 72., 59., 66., 54., 79., 103., 51.,
45., 67., 93., 47., 74., 46., 90., 40., 11., 54., 84.,
99., 43., 20., 92., 22., 87., 53., 25., 69., 92., 86.,
32., 67., 56., 47., 88., 89., 20., 54., 86., 78., 9.,
36., 80., 67., 87., 28., 50., 88., 36., 71., 90., 67.,
80., 85., 95., 70., 80., 77., 34., 89., 81., 81., 96.,
71., 91., 46., 22., 86., 73., 24., 75., 15., 71., 55.,
38., 85., 62., 91., 23., 58., 32., 82., 23., 81., 86.,
97., 70., 100., 78., 64., 30., 21., 86., 86., 94., 56.,
39., 24., 84., 77., 90., 52., 79., 86., 24., 33., 77.,
87., 61., 90., 67., 26., 48., 27., 61., 89., 14., 53.,
97., 36., 83., 92., 83., 84., 93., 24., 28., 14., 90.,
78., 62., 73., 75., 72., 78., 80., 33., 91., 32., 91.,
90., 21., 70., 80., 66., 64., 89., 85., 25., 88., 14.,
64., 87., 26., 29., 70., 42., 77., 53., 28., 83., 101.,
92., 100., 10., 58., 88., 36., 38., 95., 36., 27., 97.,
100., 40., 88., 87., 85., 91., 41., 28., 96., 79., 89.,
78., 17., 70., 28., 91., 54., 66., 102., 80., 76., 16.,
28., 24., 74., 41., 41., 49., 44., 18., 74., 88., 77.,
82., 82., 43., 25., 47., 64., 81., 78., 22., 81., 38.,
61., 89., 77., 30., 45., 30., 22., 51., 13., 65., 85.,
84., 59., 84., 92., 21., 101., 18., 73., 35., 26., 24.,
37., 61., 90., 33., 32., 89., 98., 87., 83., 67., 56.,
51., 17., 40., 86., 86., 75., 78., 91., 69., 91., 33.,
28., 99., 26., 59., 22., 59., 52., 36., 89., 92., 64.,
33., 27., 82., 79., 47., 42., 82., 24., 90., 24., 28.,
97., 28., 22., 34., 18., 29., 78., 63., 70., 51., 66.,
87., 69., 80., 42., 70., 78., 68., 85., 54., 80., 48.,
39., 85., 31., 76., 30., 56., 93., 90., 104., 58., 74.,
32., 81., 83., 93., 15., 43., 91., 82., 22., 96., 64.,
18., 9., 37., 79., 86., 81., 95., 93., 59., 66., 40.,
58., 24., 41., 97., 79., 102., 73., 45., 78., 31., 61.,
61., 86., 78., 72., 82., 31., 87., 92., 56., 63., 38.,
88., 79., 48., 45., 92., 77., 48., 12., 86., 38., 76.,
22., 44., 94., 87., 85., 77., 90., 32., 52., 89., 65.,
31., 21., 92., 69., 98., 68., 90., 60., 86., 41., 89.,
83., 83., 43., 19., 13., 28., 80., 82., 7., 83., 41.,
81., 91., 94., 39., 53., 88., 102., 19., 79., 85., 94.,
34., 61., 80., 20., 60., 90., 60., 91., 91., 91., 73.,
92., 89., 89., 101., 87., 62., 69., 30., 114., 72., 59.,
45., 73., 38., 92., 95., 85., 80., 48., 73., 88., 28.,
26., 72., 104., 92., 86., 23., 88., 40., 88., 75., 97.,
88., 59., 90., 92., 35., 88., 82., 40., 78., 77., 87.,
77., 28., 89., 25., 99., 85., 24., 27., 42., 88., 85.,
81., 81., 72., 20., 83., 89., 44., 18., 88., 48., 65.,
90., 64., 76., 45., 69., 37., 83., 80., 87., 77., 15.,
31., 35., 37., 72., 46., 83., 90., 48., 86., 33., 90.,
26., 52., 86., 83., 75., 85., 92., 93., 81., 83., 90.,
27., 26., 82., 94., 30., 62., 20., 57., 75., 80., 33.,
104., 40., 88., 80., 5., 27., 27., 74., 61., 42., 64.,
27., 56., 82., 79., 96., 91., 72., 60., 40., 93., 85.,
98., 32., 90., 83., 91., 41., 19., 20., 43., 27., 79.,
86., 83., 31., 47., 84., 94., 84., 94., 82., 84., 37.,
66., 41., 86., 99., 40., 76., 102., 76., 90., 87., 77.,
37., 88., 76., 36., 28., 99., 96., 34., 91., 82., 84.,
27., 61., 95., 29., 38., 96., 70., 95.])
plt.scatter(X[y_means == 0,0],X[y_means == 0,1],color='blue')
plt.scatter(X[y_means == 1,0],X[y_means == 1,1],color='red')
plt.scatter(X[y_means == 2,0],X[y_means == 2,1],color='green')
plt.scatter(X[y_means == 3,0],X[y_means == 3,1],color='yellow')
from sklearn.datasets import make_blobs # k_means on 3D
centroids = [(-5,-5,5),(5,5,-5),(3.5,-2.5,4),(-2.5,2.5,-4)]
cluster_std = [1,1,1,1]
X,y=make_blobs(n_samples=200,cluster_std=cluster_std,centers=centroids,
n_features= 3,random_state=1)
X
array([[ 4.33424548, 3.32580419, 4.17497018],
[-3.32246719, 3.22171129, -4.625342 ],
[-6.07296862, -4.13459237, 2.6984613 ],
[ 6.90465871, 6.1110567 , -4.3409502 ],
[-2.60839207, 2.95015551, -2.2346649 ],
[ 5.88490881, 4.12271848, -5.86778722],
[-4.68484061, -4.15383935, 4.14048406],
[-1.82542929, 3.96089238, -3.4075272 ],
[-5.34385368, -4.95640314, 4.37999916],
[ 4.91549197, 4.70263812, -4.582698 ],
[-3.80108212, -4.81484358, 4.62471505],
[ 4.6735005 , 3.65732421, -3.88561702],
[-6.23005814, -4.4494625 , 5.79280687],
[-3.90232915, 2.95112294, -4.6949209 ],
[ 3.72744124, 5.31354772, -4.49681519],
[-3.3088472 , 3.05743945, -3.81896126],
[ 2.70273021, -2.21732429, 3.17390257],
[ 4.06438286, -0.36217193, 3.214466 ],
[ 4.69268607, -2.73794194, 5.15528789],
[ 4.1210827 , -1.5438783 , 3.29415949],
[-6.61577235, -3.87858229, 5.40890054],
[ 3.05777072, -2.17647265, 3.89000851],
[-1.48617753, 0.27288737, -5.6993336 ],
[-5.3224172 , -5.38405435, 6.13376944],
[-5.26621851, -4.96738545, 3.62688268],
[ 5.20183018, 5.66102029, -3.20784179],
[-2.9189379 , 2.02081508, -5.95210529],
[ 3.30977897, -2.94873803, 3.32755196],
[ 5.12910158, 6.6169496 , -4.49725912],
[-2.46505641, 3.95391758, -3.33831892],
[ 1.46279877, -4.44258918, 1.49355935],
[ 3.87798127, 4.48290554, -5.99702683],
[ 4.10944442, 3.8808846 , -3.0439211 ],
[-6.09989127, -5.17242821, 4.12214158],
[-3.03223402, 3.6181334 , -3.3256039 ],
[ 7.44936865, 4.45422583, -5.19883786],
[-4.47053468, -4.86229879, 5.07782113],
[-1.46701622, 2.27758597, -2.52983966],
[ 3.0208429 , -2.14983284, 4.01716473],
[ 3.82427424, -2.47813716, 3.53132618],
[-5.74715829, -3.3075454 , 5.05080775],
[-1.51364782, 2.03384514, -2.61500866],
[-4.80170028, -4.88099135, 4.32933771],
[ 6.55880554, 5.1094027 , -6.2197444 ],
[-1.48879294, 1.02343734, -4.14319575],
[ 4.30884436, -0.71024532, 4.45128402],
[ 3.58646441, -4.64246673, 3.16983114],
[ 3.37256166, 5.60231928, -4.5797178 ],
[-1.39282455, 3.94287693, -4.53968156],
[-4.64945402, -6.31228341, 4.96130449],
[ 3.88352998, 5.0809271 , -5.18657899],
[ 3.32454103, -3.43391466, 3.46697967],
[ 3.45029742, -2.03335673, 5.03368687],
[-2.95994283, 3.14435367, -3.62832971],
[-3.03289825, -6.85798186, 6.23616403],
[-4.13665468, -5.1809203 , 4.39607937],
[-3.6134361 , 2.43258998, -2.83856002],
[ 2.07344458, -0.73204005, 3.52462712],
[ 4.11798553, -2.68417633, 3.88401481],
[ 3.60337958, 4.13868364, -4.32528847],
[-5.84520564, -5.67124613, 4.9873354 ],
[-2.41031359, 1.8988432 , -3.44392649],
[-2.75898285, 2.6892932 , -4.56378873],
[-2.442879 , 1.70045251, -4.2915946 ],
[ 3.9611641 , -3.67598267, 5.01012718],
[-7.02220122, -5.30620401, 5.82797464],
[ 2.90019547, -1.37658784, 4.30526704],
[ 5.81095167, 6.04444209, -5.40087819],
[-5.75439794, -3.74713184, 5.51292982],
[-2.77584606, 3.72895559, -2.69029409],
[ 3.07085772, -1.29154367, 5.1157018 ],
[ 2.206915 , 6.93752881, -4.63366799],
[ 4.2996015 , 4.79660555, -4.75733056],
[ 4.86355526, 4.88094581, -4.98259059],
[-4.38161974, -4.76750544, 5.68255141],
[ 5.42952614, 4.3930016 , -4.89377728],
[ 3.69427308, 4.65501279, -5.23083974],
[ 5.90148689, 7.52832571, -5.24863478],
[-4.87984105, -4.38279689, 5.30017032],
[ 3.93816635, -1.37767168, 3.0029802 ],
[-3.32862798, 3.02887975, -6.23708651],
[-4.76990526, -4.23798882, 4.77767186],
[-2.12754315, 2.3515102 , -4.1834002 ],
[-0.64699051, 2.64225137, -3.48649452],
[-5.63699565, -4.80908452, 7.10025514],
[-1.86341659, 3.90925339, -2.37908771],
[ 4.82529684, 5.98633519, -4.7864661 ],
[-5.24937038, -3.53789206, 2.93985929],
[-4.59650836, -4.40642148, 3.90508815],
[-3.66400797, 3.19336623, -4.75806733],
[ 6.29322588, 4.88955297, -5.61736206],
[-2.85340998, 0.71208711, -3.63815268],
[-2.35835946, -0.01630386, -4.59566788],
[ 5.61060505, -3.80653407, 4.07638048],
[-1.78695095, 3.80620607, -4.60460297],
[-6.11731035, -4.7655843 , 6.65980218],
[-5.63873041, -4.57650565, 5.07734007],
[ 5.62336218, 4.56504332, -3.59246 ],
[-3.37234925, -4.6619883 , 3.80073197],
[-5.69166075, -5.39675353, 4.3128273 ],
[ 7.19069973, 3.10363908, -5.64691669],
[-3.86837061, -3.48018318, 7.18557541],
[-4.62243621, -4.87817873, 6.12948391],
[ 5.21112476, 5.01652757, -4.82281228],
[-2.61877117, 2.30100182, -2.13352862],
[-2.92449279, 1.76846902, -5.56573815],
[-2.80912132, 3.01093777, -2.28933816],
[ 4.35328122, -2.91302931, 5.83471763],
[ 2.79865557, -3.03722302, 4.15626385],
[-3.65498263, 2.3223678 , -5.51045638],
[ 4.8887794 , -3.16134424, 7.03085711],
[ 4.94317552, 5.49233656, -5.68067814],
[ 3.97761018, -3.52188594, 4.79452824],
[-3.41844004, 2.39465529, -3.36980433],
[ 3.50854895, -2.66819884, 3.82581966],
[-2.63971173, 3.88631426, -3.45187042],
[-3.37565464, -5.61175641, 4.47182825],
[-2.37162301, 4.26041518, -3.03346075],
[ 1.81594001, -3.6601701 , 5.35010682],
[ 5.04366899, 4.77368576, -3.66854289],
[-4.19813897, -4.9534327 , 4.81343023],
[ 5.1340482 , 6.20205486, -4.71525189],
[ 3.39320601, -1.04857074, 3.38196315],
[ 4.34086156, -2.60288722, 5.14690038],
[-0.80619089, 2.69686978, -3.83013074],
[-5.62353073, -4.47942366, 3.85565861],
[ 5.56578332, -3.97115693, 3.1698281 ],
[ 4.41347606, 3.76314662, -4.12416107],
[ 4.01507361, -5.28253447, 4.58464661],
[-5.02461696, -5.77516162, 6.27375593],
[ 5.55635552, -0.73975077, 3.93934751],
[-5.20075807, -4.81343861, 5.41005165],
[-2.52752939, 4.24643509, -4.77507029],
[-3.85527629, -4.09840928, 5.50249434],
[ 5.78477065, 4.04457474, -4.41408957],
[ 1.74407436, -1.7852104 , 4.85270406],
[ 3.27123417, -0.88663863, 3.62519531],
[ 7.18697965, 5.44136444, -5.10015523],
[-2.78899734, 2.10818376, -3.31599867],
[-3.37000822, 2.86919047, -3.14671781],
[-4.30196797, -5.44712856, 6.2245077 ],
[ 3.95541062, 7.05117344, -4.414338 ],
[ 3.55912398, 6.23225307, -5.25417987],
[-3.09384307, 2.15609929, -5.00016919],
[-5.93576943, -5.26788808, 5.53035547],
[ 5.83600472, 6.54335911, -4.24119434],
[ 4.68988323, 2.56516224, -3.9611754 ],
[-5.29809284, -4.51148185, 4.92442829],
[-1.30216916, 4.20459417, -2.95991085],
[ 4.9268873 , 6.16033857, -4.63050728],
[-3.30618482, 2.24832579, -3.61728483],
[ 4.50178644, 4.68901502, -5.00189148],
[ 3.86723181, -1.26710081, 3.57714304],
[ 4.32458463, -1.84541985, 3.94881155],
[ 4.87953543, 3.76687926, -6.18231813],
[ 3.51335268, -3.1946936 , 4.6218035 ],
[-4.83061757, -4.25944355, 4.0462994 ],
[-1.6290302 , 1.99154287, -3.22258079],
[ 1.62683902, -1.57938488, 3.96463208],
[ 6.39984394, 4.21808832, -5.43750898],
[ 5.82400562, 4.43769457, -3.04512192],
[-3.25518824, -5.7612069 , 5.3190391 ],
[-4.95778625, -4.41718479, 3.89938082],
[ 2.75003038, -0.4453759 , 4.05340954],
[ 3.85249436, -2.73643695, 4.7278135 ],
[-5.10174587, -4.13111384, 5.75041164],
[-4.83996293, -4.12383108, 5.31563495],
[ 1.086497 , -4.27756638, 3.22214117],
[ 4.61584111, -2.18972771, 1.90575218],
[-4.25795584, -5.19183555, 4.11237104],
[ 5.09542509, 5.92145007, -4.9392498 ],
[-6.39649634, -6.44411381, 4.49553414],
[ 5.26246745, 5.2764993 , -5.7332716 ],
[ 3.5353601 , -4.03879325, 3.55210482],
[ 5.24879916, 4.70335885, -4.50478868],
[ 5.61853913, 4.55682807, -3.18946509],
[-2.39265671, 1.10118718, -3.91823218],
[ 3.16871683, -2.11346085, 3.14854434],
[ 3.95161595, -1.39582567, 3.71826373],
[-4.09914405, -5.68372786, 4.87710977],
[-1.9845862 , 1.38512895, -4.76730983],
[-1.45500559, 3.1085147 , -4.0693287 ],
[ 2.94250528, -1.56083126, 2.05667659],
[ 2.77440288, -3.36776868, 3.86402267],
[ 4.50088142, -2.88483225, 5.45810824],
[-5.35224985, -6.1425182 , 4.65065728],
[-2.9148469 , 2.95194604, -5.57915629],
[-4.06889792, -4.71441267, 5.88514116],
[ 3.47431968, 5.79502609, -5.37443832],
[ 3.66804833, 3.23931144, -6.65072127],
[-3.22239191, 3.59899633, -4.90163449],
[-3.6077125 , 2.48228168, -5.71939447],
[ 5.5627611 , 5.24073709, -4.71933492],
[ 1.38583608, -2.91163916, 5.27852808],
[ 4.42001793, -2.69505734, 4.80539342],
[ 4.71269214, 5.68006984, -5.3198016 ],
[-4.13744959, 6.4586027 , -3.35135636],
[-5.20889423, -4.41337681, 5.83898341],
[ 2.6194224 , -2.77909772, 5.62284909],
[-1.3989998 , 3.28002714, -4.6294416 ]])
wcss = []  # Elbow Method
for i in range(1, 21):
    km = KMeans(n_clusters=i)
    km.fit_predict(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 21), wcss)
km = KMeans(n_clusters=4)
y_pred = km.fit_predict(X)
df = pd.DataFrame()
df['col1'] = X[:,0]
df['col2'] = X[:,1]
df['col3'] = X[:,2]
df['label'] = y_pred
df
y_pred
OUTPUT
array([0, 2, 1, 0, 2, 0, 1, 2, 1, 0, 1, 0, 1, 2, 0, 2, 3, 3, 3, 3, 1, 3,
2, 1, 1, 0, 2, 3, 0, 2, 3, 0, 0, 1, 2, 0, 1, 2, 3, 3, 1, 2, 1, 0,
2, 3, 3, 0, 2, 1, 0, 3, 3, 2, 1, 1, 2, 3, 3, 0, 1, 2, 2, 2, 3, 1,
3, 0, 1, 2, 3, 0, 0, 0, 1, 0, 0, 0, 1, 3, 2, 1, 2, 2, 1, 2, 0, 1,
1, 2, 0, 2, 2, 3, 2, 1, 1, 0, 1, 1, 0, 1, 1, 0, 2, 2, 2, 3, 3, 2,
3, 0, 3, 2, 3, 2, 1, 2, 3, 0, 1, 0, 3, 3, 2, 1, 3, 0, 3, 1, 3, 1,
2, 1, 0, 3, 3, 0, 2, 2, 1, 0, 0, 2, 1, 0, 0, 1, 2, 0, 2, 0, 3, 3,
0, 3, 1, 2, 3, 0, 0, 1, 1, 3, 3, 1, 1, 3, 3, 1, 0, 1, 0, 3, 0, 0,
2, 3, 3, 1, 2, 2, 3, 3, 3, 1, 2, 1, 0, 0, 2, 2, 0, 3, 3, 0, 2, 1,
3, 2], dtype=int32)
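The elbow plots in this exercise are built from `inertia_`, the within-cluster sum of squares (WCSS). A minimal NumPy sketch of that quantity follows, assuming each point has already been assigned a cluster label. The four points and the helper name `wcss_for` are made up for illustration.

```python
import numpy as np

def wcss_for(points, labels):
    """Sum of squared distances from each point to its cluster's mean."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Hypothetical 2-D points forming two obvious groups
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = wcss_for(points, np.array([0, 0, 0, 0]))
two_clusters = wcss_for(points, np.array([0, 0, 1, 1]))

print("WCSS with 1 cluster:", one_cluster)
print("WCSS with 2 clusters:", two_clusters)
```

The sharp drop between one and two clusters is the kind of bend the elbow method looks for. Adding clusters beyond the true number of groups lowers WCSS only slightly.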
EX.NO:7 CREATION OF MATRIX USING NUMPY
PROGRAM
import numpy as np
# Step 1: Create a 3x4 matrix with values from 10 to 21
matrix = np.arange(10, 22).reshape(3, 4)
print("Original Matrix:\n", matrix)
# Step 2: Replace value 21 with a square number (e.g., 25)
matrix[matrix == 21] = 25
print("\nMatrix after replacing 21 with 25:\n", matrix)
OUTPUT
Original Matrix:
 [[10 11 12 13]
 [14 15 16 17]
 [18 19 20 21]]

Matrix after replacing 21 with 25:
 [[10 11 12 13]
 [14 15 16 17]
 [18 19 20 25]]
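The assignment `matrix[matrix == 21] = 25` changes the matrix in place. If the original values should be kept, `np.where` can build a new array instead. The sketch below is one such alternative, not the method used in the exercise.

```python
import numpy as np

matrix = np.arange(10, 22).reshape(3, 4)

# Take 25 where the condition holds, otherwise keep the existing element
replaced = np.where(matrix == 21, 25, matrix)

print("Original (unchanged):\n", matrix)
print("\nNew matrix:\n", replaced)
```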
EX.NO:8 IMPLEMENTATION OF BASIC EDA OPERATIONS WITH DATASET
PROGRAM
import pandas as pd
df=pd.read_csv("Symptom-severity.csv")
df.head()
df.isnull().sum()
df.dtypes
df.describe()
df['weight'].value_counts()
df['Symptom'].value_counts()
OUTPUT
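Since the output of this exercise depends on the CSV file, the sketch below runs the same inspection calls on a tiny hand-made table. The rows are invented. Only the column names `Symptom` and `weight` follow the program above.

```python
import pandas as pd

# Hypothetical stand-in for the symptom severity data
df = pd.DataFrame({
    'Symptom': ['itching', 'skin_rash', 'itching', None],
    'weight': [1, 3, 1, 2],
})

missing = df.isnull().sum()                  # null count per column
weight_counts = df['weight'].value_counts()  # frequency of each weight, most common first

print(missing)
print(weight_counts)
```

`value_counts()` leaves out null entries by default, so the missing symptom would not appear in `df['Symptom'].value_counts()`.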
EX.NO:9 CREATION OF DATAFRAME FOR SORTING AND RANKING SALES DATA
PROGRAM
import pandas as pd
# Step 1: Create sample product sales data
data = {
'Product': ['Laptop', 'Smartphone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse', 'Printer'],
'Units_Sold': [150, 300, 120, 180, 400, 500, 90],
'Unit_Price': [700, 500, 300, 200, 50, 25, 150]
}
# Create DataFrame
df = pd.DataFrame(data)
# Step 2: Calculate Sales Amount
df['Sales_Amount'] = df['Units_Sold'] * df['Unit_Price']
# Step 3: Sort by Sales Amount in descending order
df_sorted = df.sort_values(by='Sales_Amount', ascending=False)
# Step 4: Rank products based on Sales Amount
df_sorted['Sales_Rank'] = df_sorted['Sales_Amount'].rank(ascending=False,
method='dense').astype(int)
# Display the final DataFrame
print("Product Sales Data Ranked:\n", df_sorted)
OUTPUT
Product Sales Data Ranked:
      Product  Units_Sold  Unit_Price  Sales_Amount  Sales_Rank
1  Smartphone         300         500        150000           1
0      Laptop         150         700        105000           2
3     Monitor         180         200         36000           3
2      Tablet         120         300         36000           3
4    Keyboard         400          50         20000           4
6     Printer          90         150         13500           5
5       Mouse         500          25         12500           6
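Dense ranking gives tied amounts the same rank and does not skip the next rank number. The plain-Python sketch below reproduces that rule for the sales amounts computed in the program, as a cross-check of `rank(method='dense')`. The helper name `dense_rank` is hypothetical.

```python
def dense_rank(amounts):
    """Rank values in descending order; ties share a rank, no gaps."""
    distinct = sorted(set(amounts.values()), reverse=True)
    rank_of = {value: position for position, value in enumerate(distinct, start=1)}
    return {name: rank_of[value] for name, value in amounts.items()}

# Sales_Amount = Units_Sold * Unit_Price for each product
sales = {
    'Laptop': 150 * 700,
    'Smartphone': 300 * 500,
    'Tablet': 120 * 300,
    'Monitor': 180 * 200,
    'Keyboard': 400 * 50,
    'Mouse': 500 * 25,
    'Printer': 90 * 150,
}

ranks = dense_rank(sales)
for name in sorted(sales, key=sales.get, reverse=True):
    print(f"{name:<10} {sales[name]:>7} {ranks[name]}")
```

Tablet and Monitor both sell 36000 and share rank 3, and Keyboard follows with rank 4 rather than 5.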
EX.NO:10 FINDING OF HIGHEST RAINFALL AREA IN TAMIL NADU
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
# --- 1. Load the data ---
data = {
    'District': ['Chennai', 'Coimbatore', 'Madurai', 'Tiruchirappalli', 'Kanyakumari', 'Vellore'],
    'November_Rainfall_mm': [450, 120, 180, 150, 500, 100]
}
df = pd.DataFrame(data)
df['District'] = df['District'].str.title() # Normalize casing
# --- 2. Find the highest rainfall area in November (handle ties) ---
max_rainfall = df['November_Rainfall_mm'].max()
highest_rainfall_areas = df[df['November_Rainfall_mm'] == max_rainfall]
print("District(s) with highest rainfall in November:")
for _, row in highest_rainfall_areas.iterrows():
    print(f"- {row['District']} ({row['November_Rainfall_mm']} mm)")
print()
# --- 3. Create a bar chart ---
plt.figure(figsize=(10, 6))
bars = plt.bar(df['District'], df['November_Rainfall_mm'], color='skyblue')
plt.xlabel('District')
plt.ylabel('Rainfall (mm)')
plt.title('November Rainfall in Tamil Nadu Districts')
plt.xticks(rotation=45, ha='right')
# Add value labels to each bar
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height + 10, f"{height} mm",
             ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.show()
# --- 4. Calculate rainfall percentage for a specific district ---
specific_district = "chennai" # Can be lowercase or mixed case
specific_district = specific_district.title()
if specific_district in df['District'].values:
    total_rainfall = df['November_Rainfall_mm'].sum()
    district_rainfall = df[df['District'] == specific_district]['November_Rainfall_mm'].iloc[0]
    rainfall_percentage = (district_rainfall / total_rainfall) * 100
    print(f"Rainfall percentage for {specific_district} in November: {rainfall_percentage:.2f}%")
else:
    print(f"District '{specific_district}' not found in the dataset.")
OUTPUT
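For the sample figures used in the program above, the expected results can be worked out directly. This sketch repeats the maximum and percentage calculations in plain Python, as a check on the pandas version.

```python
rainfall = {
    'Chennai': 450, 'Coimbatore': 120, 'Madurai': 180,
    'Tiruchirappalli': 150, 'Kanyakumari': 500, 'Vellore': 100,
}

# Highest rainfall, keeping every district that ties for the maximum
max_rainfall = max(rainfall.values())
wettest = [d for d, mm in rainfall.items() if mm == max_rainfall]

# Share of one district in the total
total = sum(rainfall.values())
chennai_share = rainfall['Chennai'] / total * 100

print("Highest:", wettest, max_rainfall, "mm")
print(f"Chennai share: {chennai_share:.2f}%")
```

The total is 1500 mm, so Kanyakumari (500 mm) is the wettest district and Chennai accounts for 450 / 1500, that is 30.00% of the November rainfall.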
EX.NO:11 DIFFERENTIATION OF FAST AND LEAST MOVING ITEMS IN THE SHOP
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Supermarket dataset
df = pd.read_csv('/content/SuperMarket Analysis.csv')
# Show available columns
print("Columns in dataset:\n", df.columns)
# Group by item/product name and sum quantities sold
item_sales = df.groupby('Product line')['Quantity'].sum().sort_values(ascending=False)
# Identify fast-moving and least-moving items
fast_moving = item_sales.head(10)
least_moving = item_sales.tail(10)
# Plot fast and least moving items side-by-side
plt.figure(figsize=(14, 6))
# Fast-moving items
plt.subplot(1, 2, 1)
sns.barplot(x=fast_moving.values, y=fast_moving.index, palette='Greens_r')
plt.title('Top 10 Fast-Moving Items')
plt.xlabel('Total Quantity Sold')
plt.ylabel('Item')
# Least-moving items
plt.subplot(1, 2, 2)
sns.barplot(x=least_moving.values, y=least_moving.index, palette='Reds_r')
plt.title('Bottom 10 Least-Moving Items')
plt.xlabel('Total Quantity Sold')
plt.ylabel('Item')
plt.tight_layout()
plt.show()
# ----------------------------
# Stock availability subplot
# ----------------------------
# Assuming 'Stock' column is available for current stock per item
if 'Stock' in df.columns:
    item_stock = df.groupby('Item')['Stock'].sum().sort_values(ascending=False)
    plt.figure(figsize=(14, 6))
    sns.barplot(x=item_stock.index, y=item_stock.values, palette='Blues_r')
    plt.xticks(rotation=90)
    plt.title('Stock Available per Item in Store')
    plt.xlabel('Item')
    plt.ylabel('Available Stock')
    plt.tight_layout()
    plt.show()
else:
    print("\nColumn 'Stock' not found in the dataset. Please ensure a 'Stock' column exists for stock-level visualization.")
OUTPUT
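When a dataset has only a handful of product lines, `head(10)` and `tail(10)` return overlapping rows. One possible adjustment, sketched below on invented sales rows, caps the group size at half the number of categories. The data and the variable `n` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transactions; column names follow the program above
df = pd.DataFrame({
    'Product line': ['Food', 'Fashion', 'Food', 'Sports', 'Health', 'Fashion'],
    'Quantity': [7, 2, 5, 4, 1, 3],
})

item_sales = df.groupby('Product line')['Quantity'].sum().sort_values(ascending=False)

# Keep the two groups disjoint even when there are fewer than 20 categories
n = min(10, len(item_sales) // 2)
fast_moving = item_sales.head(n)
least_moving = item_sales.tail(n)

print("Fast-moving:\n", fast_moving)
print("\nLeast-moving:\n", least_moving)
```

With four product lines this gives two fast-moving and two least-moving entries, and no product line appears in both charts.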
EX.NO:12 CREATION OF PIE CHART TO DISPLAY FOOD ITEMS
PROGRAM
import matplotlib.pyplot as plt
# Sample data
food_items = ['Apples', 'Bread', 'Milk', 'Rice', 'Eggs']
prices = [120, 40, 60, 80, 50]
# Pie chart
plt.figure(figsize=(6,6))
plt.pie(prices, labels=food_items, autopct='%1.1f%%', startangle=140)
plt.title('Price Distribution of Food Items in Market')
plt.axis('equal') # Ensures the pie is a circle
plt.show()
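The percentages that `autopct='%1.1f%%'` prints on each slice are simply each price's share of the total. A minimal sketch of that arithmetic with the same sample data (the `shares` dictionary is a name introduced here for illustration, and no plotting library is needed):

```python
food_items = ['Apples', 'Bread', 'Milk', 'Rice', 'Eggs']
prices = [120, 40, 60, 80, 50]

# Each slice's percentage is its price divided by the sum of all prices
total = sum(prices)
shares = {item: 100 * price / total for item, price in zip(food_items, prices)}

for item, share in shares.items():
    print(f"{item}: {share:.1f}%")
```

The shares always add up to 100, so the chart shows relative prices only, not absolute ones.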
OUTPUT
EX.NO:13 CREATION OF LINEAR REGRESSION FOR STUDENT’S INTERNAL MARKS
PROGRAM
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
df = pd.read_csv("StudentAcademicData.csv - Sheet1.csv")
# Extract relevant columns
df = df[['Internal marks (out of 20)', 'External marks (out of 80)']].dropna()
df.columns = ['Internal_Marks', 'External_Marks']
# Split data into features and target
X = df[['Internal_Marks']]
y = df['External_Marks']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluation
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared Score:", r2_score(y_test, y_pred))
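# Predict new result
# Use a DataFrame with the training column name so the model sees matching feature names
new_mark = pd.DataFrame({'Internal_Marks': [15]}) # example internal mark
predicted_result = model.predict(new_mark)
print("Predicted Semester Result for Internal Mark 15:", predicted_result[0])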
# Plotting
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel("Internal Marks (out of 20)")
plt.ylabel("Semester Result (External Marks out of 80)")
plt.title("Linear Regression: Internal vs External Marks")
plt.legend()
plt.grid(True)
plt.show()
OUTPUT
Mean Squared Error: 284.50287708929307
R-squared Score: -0.007268107945806568
Predicted Semester Result for Internal Mark 15: 53.68967909800521
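The R-squared score above is slightly negative, which means that on the test split the fitted line does no better than always predicting the mean external mark. A minimal sketch of the standard definition, R² = 1 - SS_res / SS_tot, using hypothetical marks (not taken from the dataset) and a helper `r2_manual` introduced here for illustration:

```python
def r2_manual(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical external marks, for illustration only
y_true = [40, 50, 60, 70]

baseline = [55, 55, 55, 55]   # always predict the mean of y_true
off_mean = [60, 60, 60, 60]   # a constant prediction away from the mean

print("Mean baseline R2:", r2_manual(y_true, baseline))
print("Off-mean constant R2:", r2_manual(y_true, off_mean))
```

Predicting the mean gives exactly zero, and anything that fits worse than the mean goes negative, so a score near zero suggests internal marks alone explain little of the variation in external marks in this split.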
EX.NO:14 CREATION OF CLUSTER USING MACHINE LEARNING ALGORITHM
PROGRAM
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
# Step 1: Load the dataset
df = pd.read_csv("salaries.csv") # Update with your filename
# Step 2: Clean and combine text fields
# Check for the existence of 'job_title' column before using it
if 'job_title' in df.columns:
    df.dropna(subset=["job_title"], how='all', inplace=True)
    df["text"] = df["job_title"].fillna('')
else:
    # Without 'job_title' there is no text to vectorize, so stop here
    raise SystemExit("Error: 'job_title' column not found in the DataFrame.")
# Step 3: TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(df["text"])
# Step 4: Apply KMeans Clustering
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init='auto')
df["cluster"] = kmeans.fit_predict(X)
# Step 5: Print Top Terms per Cluster
def print_top_keywords_per_cluster(model, vectorizer, n_terms=10):
    terms = vectorizer.get_feature_names_out()
    for i, center in enumerate(model.cluster_centers_):
        top_terms = center.argsort()[-n_terms:][::-1]
        print(f"\n Cluster {i} Top Keywords:")
        print(", ".join(terms[top_terms]))
print_top_keywords_per_cluster(kmeans, vectorizer)
# Step 6: Display Sample Job Titles per Cluster
for i in range(n_clusters):
    print(f"\n Cluster {i} - Sample Job Offers:")
    # Check for the existence of 'company' and 'job_title' columns before using them
    if 'company' in df.columns and 'job_title' in df.columns:
        print(df[df["cluster"] == i][["company", "job_title"]].head(3))
    elif 'job_title' in df.columns:
        print(df[df["cluster"] == i][["job_title"]].head(3))
    else:
        print("Error: 'company' or 'job_title' column not found in the DataFrame.")
# Step 7: Optional Visualization - Cluster Size
plt.figure(figsize=(6, 4))
df["cluster"].value_counts().sort_index().plot(kind='bar', color='skyblue')
plt.title(" Number of Job Offers per Cluster")
plt.xlabel("Cluster")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.grid(True)
plt.tight_layout()
plt.show()
# Step 8: Optional - Evaluate Clustering Quality
sil_score = silhouette_score(X, df["cluster"])
print(f"\n Silhouette Score: {sil_score:.3f}")
# Step 9: Optional - Save the results
df.to_csv("clustered_ml_jobs.csv", index=False)
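Step 5 picks each cluster's top keywords by sorting the centroid weights in ascending order, taking the last `n_terms` positions, and reversing them. A minimal pure-Python sketch of that index trick, with hypothetical terms and weights (not taken from `salaries.csv`):

```python
# Hypothetical vocabulary and one cluster centre's TF-IDF weights
terms = ['analyst', 'data', 'engineer', 'manager', 'scientist']
center = [0.10, 0.45, 0.05, 0.00, 0.30]
n_terms = 3

# Indices sorted by ascending weight, which is what argsort() returns
order = sorted(range(len(center)), key=lambda i: center[i])

# Last n_terms indices are the heaviest; reverse so the heaviest comes first
top = order[-n_terms:][::-1]
top_terms = [terms[i] for i in top]

print(", ".join(top_terms))
```

Terms with the largest centroid weight are the ones that most characterise the job titles grouped in that cluster.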
OUTPUT
Cluster 0 Top Keywords:
executive, account, enterprise, visualization, writer, web, trainee, trader, technology, technologist
Cluster 1 Top Keywords:
engineer, software, data, learning, machine, analytics, ai, research, systems, intelligence
Cluster 2 Top Keywords:
manager, product, engineering, data, analytics, governance, operations, intelligence, business, ai
Cluster 3 Top Keywords:
data, scientist, analyst, architect, research, associate, developer, applied, specialist, consultant
Cluster 0 - Sample Job Offers:
job_title
1437 Executive
1438 Executive
1982 Account Executive
Cluster 1 - Sample Job Offers:
job_title
8 Software Engineer
9 Software Engineer
10 Machine Learning Engineer
Cluster 2 - Sample Job Offers:
job_title
24 Manager
25 Manager
72 Manager
Cluster 3 - Sample Job Offers:
job_title
0 Analyst
1 Analyst
2 Data Quality Lead
EX.NO:15 SALARY PREDICTION USING MACHINE LEARNING ALGORITHM
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Simulated dataset
data = pd.DataFrame({
    'experience': [1, 3, 5, 7, 10, 12],
    'education_level': ['Bachelors', 'Masters', 'Masters', 'PhD', 'PhD', 'Masters'],
    'location': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad', 'Pune'],
    'skills_score': [65, 70, 80, 85, 90, 95],
    'salary': [600000, 800000, 1200000, 1600000, 2000000, 2200000]
})
# Features and target
X = data.drop('salary', axis=1)
y = data['salary']
# Preprocessing: scale numeric columns, one-hot encode categorical columns
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['experience', 'skills_score']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['education_level', 'location'])
])
# Pipeline
model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
# Train-test split for visualization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# 1. Actual vs Predicted
plt.figure(figsize=(8,5))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.xlabel('Actual Salary')
plt.ylabel('Predicted Salary')
plt.title('Actual vs Predicted Salary')
plt.grid(True)
plt.tight_layout()
plt.show()
# 2. Residuals (Error)
errors = y_test - y_pred
plt.figure(figsize=(8,5))
plt.bar(range(len(errors)), errors, color='orange')
plt.xlabel('Test Instance')
plt.ylabel('Prediction Error (₹)')
plt.title('Prediction Errors per Test Case')
plt.grid(True)
plt.tight_layout()
plt.show()
# 3. Feature Importance
rf_model = model.named_steps['regressor']
# Names of the columns produced by the preprocessing step (scaled and one-hot encoded)
feature_names = model.named_steps['preprocess'].get_feature_names_out()
importance = rf_model.feature_importances_
plt.figure(figsize=(10,5))
plt.bar(feature_names, importance)
plt.xticks(rotation=45, ha='right')
plt.title('Feature Importance Scores')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.grid(True)
plt.tight_layout()
plt.show()
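Because the simulated dataset has only six rows, the evaluation plots above rest on very few test points. A rough sketch of the split arithmetic, assuming the test count is the test share rounded up and the rest goes to training (an assumption about scikit-learn's behaviour that is worth confirming with `len(X_test)`):

```python
import math

n_samples = 6      # rows in the simulated dataset
test_size = 0.33   # value passed to train_test_split

# Assumption: test rows = test share rounded up; training gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

With so few test rows, the scatter and error plots are illustrative rather than a reliable measure of model quality.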
OUTPUT
Ex.No:16 CONTENT BEYOND SYLLABUS
SENTIMENT ANALYSIS USING LEXICON CLASSIFICATION ALGORITHM
PROGRAM
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import os
print(os.listdir("../input"))
pd.set_option('display.max_columns', None)
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
US_comments = pd.read_csv('../input/youtube/UScomments.csv', on_bad_lines='skip')
US_videos = pd.read_csv('../input/youtube/USvideos.csv', on_bad_lines='skip')
US_videos.head()
US_videos.shape
US_videos.nunique()
US_videos.info()
US_videos.head()
US_comments.head()
US_comments.shape
US_comments.isnull().sum()
US_comments.dropna(inplace=True)
US_comments.isnull().sum()
US_comments.shape
US_comments.nunique()
US_comments.info()
US_comments.drop(41587, inplace=True)
US_comments = US_comments.reset_index().drop('index', axis=1)
US_comments.likes = US_comments.likes.astype(int)
US_comments.replies = US_comments.replies.astype(int)
US_comments.head()
# Keep only letters and '#'; regex=True is needed because newer pandas treats the pattern literally by default
US_comments['comment_text'] = US_comments['comment_text'].str.replace("[^a-zA-Z#]", " ", regex=True)
# Drop words of three characters or fewer, then lowercase
US_comments['comment_text'] = US_comments['comment_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
US_comments['comment_text'] = US_comments['comment_text'].apply(lambda x: x.lower())
tokenized_tweet = US_comments['comment_text'].apply(lambda x: x.split())
tokenized_tweet.head()
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
wnl = WordNetLemmatizer()
# The result of this apply is not assigned, so tokenized_tweet itself is unchanged
tokenized_tweet.apply(lambda x: [wnl.lemmatize(i) for i in x if i not in set(stopwords.words('english'))])
tokenized_tweet.head()
# Join the tokens back into strings: VADER and the word clouds below expect text, not lists
US_comments['comment_text'] = tokenized_tweet.apply(' '.join)
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
US_comments['Sentiment Scores'] = US_comments['comment_text'].apply(lambda x: sia.polarity_scores(x)['compound'])
US_comments.head()
US_comments['Sentiment'] = US_comments['Sentiment Scores'].apply(lambda s: 'Positive' if s > 0 else ('Neutral' if s == 0 else 'Negative'))
US_comments.head()
US_comments.Sentiment.value_counts()
# Percentage of positive comments for each video
videos = []
for i in range(0, US_comments.video_id.nunique()):
    a = US_comments[(US_comments.video_id == US_comments.video_id.unique()[i]) & (US_comments.Sentiment == 'Positive')].count().iloc[0]
    b = US_comments[US_comments.video_id == US_comments.video_id.unique()[i]]['Sentiment'].value_counts().sum()
    Percentage = (a / b) * 100
    videos.append(round(Percentage, 2))
Positivity = pd.DataFrame(videos, US_comments.video_id.unique()).reset_index()
Positivity.columns = ['video_id', 'Positive Percentage']
Positivity.head()
# Look up the channel title for each video
channels = []
for i in range(0, Positivity.video_id.nunique()):
    channels.append(US_videos[US_videos.video_id == Positivity.video_id.unique()[i]]['channel_title'].unique()[0])
Positivity['Channel'] = channels
Positivity.head()
Positivity[Positivity['Positive Percentage'] == Positivity['Positive Percentage'].max()]
Positivity[Positivity['Positive Percentage'] == Positivity['Positive Percentage'].min()]
all_words = ' '.join([text for text in US_comments['comment_text']])
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=21,
max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
all_words_posi = ' '.join([text for text in
US_comments['comment_text'][US_comments.Sentiment == 'Positive']])
wordcloud_posi = WordCloud(width=800, height=500, random_state=21,
max_font_size=110).generate(all_words_posi)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_posi, interpolation="bilinear")
plt.axis('off')
plt.show()
all_words_nega = ' '.join([text for text in
US_comments['comment_text'][US_comments.Sentiment == 'Negative']])
wordcloud_nega = WordCloud(width=800, height=500, random_state=21,
max_font_size=110).generate(all_words_nega)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_nega, interpolation="bilinear")
plt.axis('off')
plt.show()
all_words_neu = ' '.join([text for text in
US_comments['comment_text'][US_comments.Sentiment == 'Neutral']])
wordcloud_neu = WordCloud(width=800, height=500, random_state=21,
max_font_size=110).generate(all_words_neu)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_neu, interpolation="bilinear")
plt.axis('off')
plt.show()
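The program labels each comment by the sign of its VADER compound score and then reports the share of positive comments per video. A minimal pure-Python sketch of those two steps, using hypothetical video ids and scores (not from the dataset) and a helper `label` introduced here for illustration:

```python
from collections import defaultdict

def label(score):
    """Same rule as the program: the sign of the compound score."""
    if score > 0:
        return 'Positive'
    if score == 0:
        return 'Neutral'
    return 'Negative'

# Hypothetical (video_id, compound score) pairs, for illustration only
scored = [('vidA', 0.6), ('vidA', 0.0), ('vidA', -0.3), ('vidA', 0.8),
          ('vidB', -0.5), ('vidB', 0.2), ('vidB', -0.1), ('vidB', 0.0)]

totals = defaultdict(int)
positives = defaultdict(int)
for video_id, score in scored:
    totals[video_id] += 1
    if label(score) == 'Positive':
        positives[video_id] += 1

# Positive comments as a percentage of all comments for each video
positive_pct = {v: round(100 * positives[v] / totals[v], 2) for v in totals}
print(positive_pct)
```

Counting in a single pass avoids re-filtering the whole comment table once per video, which is what the loop in the program does.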
OUTPUT
['youtube']
(7992, 11)
video_id 2364
title 2398
channel_title 1230
category_id 16
tags 2204
views 7939
likes 6624
dislikes 2531
comment_total 4152
thumbnail_link 2364
date 40
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7992 entries, 0 to 7991
Data columns (total 11 columns):
video_id 7992 non-null object
title 7992 non-null object
channel_title 7992 non-null object
category_id 7992 non-null int64
tags 7992 non-null object
views 7992 non-null int64
likes 7992 non-null int64
dislikes 7992 non-null int64
comment_total 7992 non-null int64
thumbnail_link 7992 non-null object
date 7992 non-null float64
dtypes: float64(1), int64(5), object(5)
memory usage: 686.9+ KB
(691400, 4)
video_id 0
comment_text 25
likes 0
replies 0
dtype: int64
video_id 0
comment_text 0
likes 0
replies 0
dtype: int64
(691375, 4)
video_id 2266
comment_text 434076
likes 1284
replies 479
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691375 entries, 0 to 691399
Data columns (total 4 columns):
video_id 691375 non-null object
comment_text 691375 non-null object
likes 691375 non-null object
replies 691375 non-null object
dtypes: object(4)
memory usage: 26.4+ MB
0 [logan, paul]
1 [been, following, from, start, your, vine, cha...
2 [kong, maverick]
3 [attendance]
4 [trending]
Name: comment_text, dtype: object
0 [logan, paul]
1 [been, following, from, start, your, vine, cha...
2 [kong, maverick]
3 [attendance]
4 [trending]
Name: comment_text, dtype: object
Positive 305358
Neutral 260986
Negative 125030
Name: Sentiment, dtype: int64
POSITIVE COMMENTS
NEGATIVE COMMENTS
NEUTRAL COMMENTS