EXPERIMENT NO:1
CREATE A DATAFRAME AND DEMONSTRATE DIFFERENT
WAYS TO TREAT MISSING VALUES
AIM:
Create a DataFrame and Demonstrate Different ways to treat the missing
values.
DESCRIPTION:
Create a Sample DataFrame: First, we will create a sample DataFrame
representing student data. This DataFrame will intentionally have some
missing values.
Demonstrate Different Methods to Treat Missing Values: We will showcase
several methods, such as filling missing values with a default value,
dropping rows or columns with missing values, and imputing missing
values based on statistical measures like mean or median.
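The description mentions the median, while the program in this experiment fills only with the mean. The following is a minimal sketch of median imputation, assuming the same Age and Grade values as the student data; the variable names missing_counts and df_filled_median are illustrative and not part of the original program.

```python
import pandas as pd

# Small illustrative frame (values mirror the numeric student columns)
df = pd.DataFrame({
    "Age": [22, 23, 14, None, 23],
    "Grade": [81, 92, None, 75, 92],
})

# Count the missing values per column before treating them
missing_counts = df.isna().sum()

# Replace each column's missing entries with that column's median
df_filled_median = df.fillna(df.median(numeric_only=True))

print(missing_counts)
print(df_filled_median)
```

The median is less sensitive to outliers than the mean, so it may be the safer choice when a column contains a few extreme values.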
PROGRAM:
import pandas as pd
import numpy as np
# Step 1: Create a sample DataFrame with missing values
data={
'Name': ['siri', 'sony', None, 'Dini', 'Bhuvi'],
'Age': [22, 23, 14, None, 23],
'Grade': [81, 92, None, 75, 92]
}
df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
print(df)
# Step 2: Different methods to treat missing values
# Method 1: Fill missing values with a default value
df_filled = df.fillna('Unknown')
print("\nDataFrame after filling missing values with 'Unknown':")
print(df_filled)
# Method 2: Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_rows)
# Method 3: Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:")
print(df_dropped_columns)
# Method 4: Fill missing values with mean (for numerical columns)
df_filled_mean = df.copy()
df_filled_mean['Age'] = df_filled_mean['Age'].fillna(df_filled_mean['Age'].mean())
df_filled_mean['Grade'] = df_filled_mean['Grade'].fillna(df_filled_mean['Grade'].mean())
print("\nDataFrame after filling missing numerical values with mean:")
print(df_filled_mean)
# Method 5: Forward fill (use previous value to fill missing values)
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
df_forward_fill = df.ffill()
print("\nDataFrame after forward filling missing values:")
print(df_forward_fill)
# Method 6: Backward fill (use next value to fill missing values)
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas
df_backward_fill = df.bfill()
print("\nDataFrame after backward filling missing values:")
print(df_backward_fill)
OUTPUT:
Original DataFrame with Missing Values:
Name Age Grade
0 siri 22.0 81.0
1 sony 23.0 92.0
2 None 14.0 NaN
3 Dini NaN 75.0
4 Bhuvi 23.0 92.0
DataFrame after filling missing values with 'Unknown':
Name Age Grade
0 siri 22.0 81.0
1 sony 23.0 92.0
2 Unknown 14.0 Unknown
3 Dini Unknown 75.0
4 Bhuvi 23.0 92.0
DataFrame after dropping rows with missing values:
Name Age Grade
0 siri 22.0 81.0
1 sony 23.0 92.0
4 Bhuvi 23.0 92.0
DataFrame after dropping columns with missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
DataFrame after filling missing numerical values with mean:
Name Age Grade
0 siri 22.0 81.0
1 sony 23.0 92.0
2 None 14.0 85.0
3 Dini 20.5 75.0
4 Bhuvi 23.0 92.0
DataFrame after forward filling missing values:
Name Age Grade
0 siri 22.0 81.0
1 sony 23.0 92.0
2 sony 14.0 92.0
3 Dini 14.0 75.0
4 Bhuvi 23.0 92.0
DataFrame after backward filling missing values:
Name Age Grade
0 siri 22.0 81.0
1 sony 23.0 92.0
2 Dini 14.0 75.0
3 Dini 23.0 75.0
4 Bhuvi 23.0 92.0
EXPERIMENT NO:2
IMPLEMENT DATA
WRANGLING (CONCATENATE, MERGE, GROUP) AND DATA
AGGREGATION
AIM:
Implement Data Wrangling (Concatenate, Merge, Group) and Data
Aggregation.
DESCRIPTION:
Data Aggregation: This involves summarizing data, like calculating averages
or totals.
a. Concatenation: pd.concat combines df1 and df2. ignore_index=True
resets the index in the resulting DataFrame.
b. Grouping: groupby('Age') groups the data by the 'Age' column.
size() then counts the number of students in each age group.
c. Merging: pd.merge combines concatenated_df and grades_df using
the 'Student_ID' column as the key.
d. Data Aggregation: mean() calculates the average age of the students
in the concatenated DataFrame.
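The program in this experiment demonstrates only concatenation. The four steps described above can be sketched together as follows. This is an illustration under assumptions: the contents of df1, df2 and grades_df are invented here, and only the names and the 'Student_ID' and 'Age' columns come from the description.

```python
import pandas as pd

# Hypothetical student batches and grades (values are illustrative only)
df1 = pd.DataFrame({"Student_ID": [1, 2], "Name": ["siri", "sony"], "Age": [22, 23]})
df2 = pd.DataFrame({"Student_ID": [3, 4], "Name": ["Dini", "Bhuvi"], "Age": [22, 23]})
grades_df = pd.DataFrame({"Student_ID": [1, 2, 3, 4], "Grade": [81, 92, 75, 92]})

# a. Concatenation: stack the rows and renumber the index 0..n-1
concatenated_df = pd.concat([df1, df2], ignore_index=True)

# b. Grouping: number of students in each age group
age_counts = concatenated_df.groupby("Age").size()

# c. Merging: attach the grades using Student_ID as the key
merged_df = pd.merge(concatenated_df, grades_df, on="Student_ID")

# d. Aggregation: average age over all students
average_age = concatenated_df["Age"].mean()

print(concatenated_df)
print(age_counts)
print(merged_df)
print("Average age:", average_age)
```

Without ignore_index=True the concatenated frame would keep the original labels 0, 1, 0, 1, which makes label-based selection ambiguous.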
PROGRAM:
# To implement data wrangling (Concatenate) and data aggregation
import pandas as pd
sqrs = pd.DataFrame({
"nums":[15,16,17,18,19],
"sqrs":[i**2 for i in range(15,20)]
})
cubs = pd.DataFrame({
"nums":[45,46,47,48,49],
"cubs":[i**3 for i in range(45,50)]
})
print(sqrs.to_string(),"\n")
print(cubs.to_string(),"\n")
print("\nAfter concatenating:\n",pd.concat([sqrs,cubs]))
For .csv files:
a. Read: 'pandas.read_csv()' for reading data from a CSV file into a DataFrame.
b. Write: 'DataFrame.to_csv()' for writing a DataFrame to a CSV file.
For text(.txt) files:
a. Read: Standard Python functions like 'open()' with 'read()' or 'readlines()'
for reading text files.
b. Write: Using 'open()' with the 'write()' or 'writelines()' methods for writing
to text files.
For .xls (Excel) files:
a. Read: 'pandas.read_excel()' for reading data from an Excel file into a
DataFrame.
b. Write: 'DataFrame.to_excel()' for writing a DataFrame to an Excel file.
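The programs in this section go through pandas for every format. For plain text files, the standard functions named above can be sketched as follows. This is a minimal example under assumptions: the file name student_notes.txt and the sample lines are hypothetical, and a temporary directory is used so that no existing file is overwritten.

```python
import os
import tempfile

lines = ["Dini,201\n", "jain,204\n", "mary,208\n"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "student_notes.txt")  # hypothetical file name

    # Write: writelines() writes each string as given (it adds no newlines)
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)

    # Read: read() returns the whole file as one string
    with open(path, "r", encoding="utf-8") as f:
        whole_text = f.read()

    # Read: readlines() returns a list with one string per line
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.readlines()

print(whole_text)
print(read_back)
```

Opening the file in a with block closes it automatically, even if an error occurs while reading or writing.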
PROGRAM:
# Write a Computer Program to Conduct Read and Write Data into Files (.csv format)
import pandas as pd
df = pd.DataFrame({
"name":['Dini','jain','mary','rani'],
"pin":['201','204','208','209'],
})
df.to_csv('Student_data.csv',index=False)
stdata = pd.read_csv('Student_data.csv')
print("\n The Data Present in Student Data is:\n",stdata)
# Write a Computer Program to Conduct Read and Write Data into Files (.txt format)
import pandas as pd
df = pd.DataFrame({
"name":['Dini','jain','mary','rani'],
"pin":['201','204','208','209'],
})
df.to_csv('Student_data.txt',index=False)
# read_fwf expects fixed-width columns, so each comma-separated line is read
# back as a single column named "name,pin"
stdata = pd.read_fwf('Student_data.txt')
print("\n The Data Present in Student Data is:\n",stdata)
# Write a Computer Program to Conduct Read and Write Data into Files (.xlsx format)
import pandas as pd
df = pd.DataFrame({
"name":['Dini','jain','mary','rani'],
"pin":['201','204','208','209'],
})
# to_excel writes a real Excel workbook (requires the openpyxl package)
df.to_excel('Student_data.xlsx',index=False)
stdata = pd.read_excel('Student_data.xlsx')
print("\n The Data Present in Student Data is:\n",stdata)
OUTPUT:
# Output for Read and Write Data into Files(.csv format)
# Output for Read and Write Data into Files(.txt format)
The Data Present in Student Data is:
name,pin
0 Dini,201
1 jain,204
2 mary,208
3 rani,209
# Output for Read and Write Data into Files(.xlsx format)
The Data Present in Student Data is:
PROGRAM:
#Creating DataFrame using Dictionary
import pandas as pd
di = {
"nums" :[1,2,3,4,5,6],
"sqr" :[i**2 for i in range(1,7)]
}
df1 = pd.DataFrame(di)
df1
# Creating DataFrame using a List of Tuples
l = [("a",20),("b",10),("c",30)]
df = pd.DataFrame(l)
df
OUTPUT:
#Output for Creating DataFrame using Dictionary
#Output for Creating DataFrame using Tuple
EXPERIMENT NO:5
WRITE A PROGRAM TO CONDUCT INDEXING AND
SLICING
A) HEAD()
B) TAIL()
C) DESCRIBE()
D) SHAPE()
E) ITERROWS()
AIM:
A Program to Conduct Indexing and Slicing For:
a)Head()
b)Tail()
c)Describe()
d)Shape()
e)Iterrows()
DESCRIPTION:
a) Head(): Use head() to quickly peek at the first few rows of a DataFrame.
b) Tail(): Utilize tail() to glance at the last few rows of a DataFrame.
c) Describe(): Call describe() to get summary statistics of numerical
columns in a DataFrame.
d) Shape(): Access the shape of a DataFrame using shape attribute,
returning a tuple of (num_rows, num_columns).
e) Iterrows(): Iterate over rows of a DataFrame with iterrows(), returning
index and row data for each iteration.
These operations provide essential insights and information about the
structure and contents of a DataFrame in Python using Pandas.
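The five operations described above can be sketched on one small frame. This is an illustration, assuming df1 holds the same numbers and squares as in the DataFrame-creation program; the names first_rows, last_rows, summary and pairs are introduced here only for the example.

```python
import pandas as pd

# Rebuild df1 as defined in the DataFrame-creation program
df1 = pd.DataFrame({
    "nums": [1, 2, 3, 4, 5, 6],
    "sqr": [i**2 for i in range(1, 7)],
})

first_rows = df1.head(3)      # first 3 rows (head() alone returns 5)
last_rows = df1.tail(2)       # last 2 rows
summary = df1.describe()      # count, mean, std, min, quartiles, max
n_rows, n_cols = df1.shape    # shape is an attribute, so no parentheses

# iterrows() yields (index label, row as a Series) pairs
pairs = []
for index, row in df1.iterrows():
    pairs.append((index, row["nums"], row["sqr"]))

print(first_rows)
print(last_rows)
print(summary)
print(n_rows, n_cols)
print(pairs)
```

iterrows() is convenient for inspection but slow on large frames; column-wise operations are usually preferable for computation.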
PROGRAM:
df1
OUTPUT 1:
df1.head()
OUTPUT 2:
df1.head(3)
OUTPUT 3:
df1.tail()
OUTPUT 4:
EXPERIMENT NO:6
WRITE A COMPUTER PROGRAM TO COMPUTE LOC AND
ILOC FUNCTIONS IN PANDAS
AIM:
A Computer Program to Compute Loc and Iloc Functions in Pandas.
DESCRIPTION:
• Import Pandas:
Start by importing the Pandas library (import pandas as pd).
• Create DataFrame:
Define your dataset as a Pandas DataFrame.
• Using loc:
Use loc to select rows and columns by labels.
Syntax: df.loc[row_labels, column_labels].
Example: df.loc[2] selects the row with label 2.
• Using iloc:
Use iloc to select rows and columns by integer indices.
Syntax: df.iloc[row_indices, column_indices].
Example: df.iloc[2] selects the row at integer position 2 (the third row).
• Output: Print or use the selected data as needed.
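With the default index 0, 1, 2, ... loc and iloc return the same rows, which hides the difference between them. The following minimal sketch assumes a hypothetical three-row frame whose labels start at 1, so that label 2 and position 2 point at different rows.

```python
import pandas as pd

# Hypothetical frame: index labels 1..3 differ from positions 0..2
df = pd.DataFrame(
    {"Brand": ["Maruti", "Hyundai", "Tata"], "Year": [2012, 2014, 2011]},
    index=[1, 2, 3],
)

by_label = df.loc[2]        # row whose index label is 2 (second row)
by_position = df.iloc[2]    # row at position 2 (third row)

# Row and column selection together
year_by_label = df.loc[1, "Year"]    # label 1, column "Year"
year_by_position = df.iloc[0, 1]     # first row, second column

# loc slices include the end label; iloc slices exclude the end position
label_slice = df.loc[1:2]       # labels 1 and 2
position_slice = df.iloc[1:2]   # position 1 only

print(by_label["Brand"], by_position["Brand"])
print(len(label_slice), len(position_slice))
```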
PROGRAM:
import pandas as pd
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata','Mahindra',
'Maruti', 'Hyundai', 'Renault', 'Tata', 'Maruti'],
'Year': [2012, 2014, 2011, 2015, 2012, 2016, 2014, 2018,
2019],
'Kms Driven': [50000, 30000, 60000, 25000, 10000, 46000,
31000, 15000, 12000],
'City': ['Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai',
'Delhi', 'Mumbai', 'Chennai', 'Ghaziabad'],
'Mileage': [28, 27, 25, 26, 28, 29, 24, 21, 24]
})
print(data)
OUTPUT 1:
• Import Libraries: Import scikit-learn for machine learning and
data handling.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Sample data (Height in cm and Weight in kg)
height = np.array([150, 160, 170, 180, 190]).reshape(-1, 1) # Explanatory variable
weight = np.array([50, 60, 70, 80, 90]) # Dependent variable
# Create a linear regression model
model = LinearRegression()
# Fit the model to the data (training)
model.fit(height, weight)
# Make predictions
weight_pred = model.predict(height)
# Print the coefficients
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
# The mean squared error and coefficient of determination (R^2)
print(f"Mean squared error: {mean_squared_error(weight, weight_pred)}")
print(f"Coefficient of determination (R^2): {r2_score(weight, weight_pred)}")
# Plot outputs
plt.scatter(height, weight, color='black')
plt.plot(height, weight_pred, color='blue', linewidth=3)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()
OUTPUT:
Coefficients: [1.]
Intercept: -100.0
Mean squared error: 0.0
Coefficient of determination (R^2): 1.0
EXPERIMENT NO:9
IMPLEMENT LINEAR REGRESSION USING PYTHON SCRIPT
AND IDENTIFY EXPLANATORY VARIABLES
AIM:
To Implement Linear Regression using Python Script and Identify Explanatory
Variables.
DESCRIPTION:
In your Python script, you’ll:
• Load data using pandas.
• Split data using train_test_split from scikit-learn.
• Create a LinearRegression model.
• Fit the model to the training data.
• Extract and analyze coefficients.
• Evaluate the model's performance.
• Optionally, visualize the relationship between variables.
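The program in this experiment fits a single variable without a train/test split. The steps listed above can be sketched as follows. This is an illustration under assumptions: the data is generated synthetically instead of being loaded from a file, and the column names x1, x2 and y are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: y depends on x1 (slope 3) and not on x2
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 100), "x2": rng.uniform(0, 10, 100)})
df["y"] = 3.0 * df["x1"] + 5.0 + rng.normal(0, 0.5, 100)

X = df[["x1", "x2"]]
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# One coefficient per explanatory variable; the magnitudes are comparable
# here only because both columns share the same scale
coefficients = pd.Series(model.coef_, index=X.columns)
r2 = r2_score(y_test, model.predict(X_test))

print(coefficients)
print("Intercept:", model.intercept_)
print("R^2 on test data:", r2)
```

A coefficient close to zero, as expected for x2 here, suggests that the variable explains little of the target once the other variables are accounted for.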
PROGRAM:
import numpy as np
from sklearn.linear_model import LinearRegression
#sample data
x = np.array([1, 2, 3, 4, 5]).reshape((-1,1))
y = np.array([2, 4, 6, 8, 10])
#Create a linear regression model
model = LinearRegression()
#Fit the model to the data
model.fit(x,y)
#Get the coefficients (slope and intercept)
slope = model.coef_[0]
intercept = model.intercept_
#print the coefficients
print("Slope (Explanatory Variable):", slope)
print("Intercept:", intercept)
OUTPUT:
Slope (Explanatory Variable): 2.0
Intercept: 0.0
EXPERIMENT NO:10
IMPLEMENT THE CLUSTERING TECHNIQUE FOR A GIVEN
DATASET IN PYTHON
AIM:
To Implement the Clustering Technique for a given Dataset in Python.
DESCRIPTION:
• Import Libraries: Import pandas for data handling, scikit-learn for clustering
algorithms, and matplotlib/seaborn for visualization.
• Load and Prepare Data: Load your dataset into a DataFrame and preprocess it
if necessary.
• Choose a Clustering Algorithm: Select a clustering algorithm like K-Means,
DBSCAN, or Hierarchical Clustering based on your dataset characteristics
and requirements.
• Create and Fit the Model: Instantiate the chosen clustering algorithm and fit it
to your data.
• Predict Cluster Labels: If applicable, predict cluster labels for each data point.
• Visualize Clusters (Optional): Visualize the clustering results to understand
the data's structure better. Use scatter plots or other visualization techniques.
• Evaluate Clustering Performance (Optional): Evaluate the clustering with an
internal metric such as the silhouette score, or, if ground truth labels are
available, with an external metric such as the adjusted Rand index.
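The optional evaluation step can be sketched as follows, using the same six points as the program in this experiment. The reference labels in true_labels are an assumption made for illustration, and n_init and random_state are set explicitly only to make repeated runs reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

# Silhouette score needs only the data and the predicted labels (range -1 to 1)
silhouette = silhouette_score(data, labels)

# Adjusted Rand index compares against reference labels (assumed here)
true_labels = [0, 0, 0, 1, 1, 1]
ari = adjusted_rand_score(true_labels, labels)

print("Silhouette score:", silhouette)
print("Adjusted Rand index:", ari)
```

The adjusted Rand index ignores how the clusters are numbered, so it does not matter whether K-Means calls the left group 0 or 1.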
PROGRAM:
#Import necessary libraries
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
#example dataset: Two-dimensional data
data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
#convert the dataset to a DataFrame for better visualization
df = pd.DataFrame(data, columns=['x', 'y'])
#Initialize the KMeans algorithm with 2 clusters
kmeans = KMeans(n_clusters=2)
#Fit the model on the dataset
kmeans.fit(df)
#Predict the cluster for each instance in the dataset
df['cluster'] = kmeans.predict(df)
#output the clustering results
print(df)
OUTPUT:
To Implement the Naive Bayesian Classifier for a sample Training
Dataset stored as .csv File.
DESCRIPTION:
Follow these steps:
• Import Libraries: Import necessary libraries such as pandas for data
manipulation and scikit-learn for machine learning algorithms.
• Load and Prepare Data: Load your dataset into a pandas DataFrame.
Preprocess the data if needed, including handling missing values, encoding
categorical variables, and splitting the data into features (independent
variables) and target variable (dependent variable).
• Choose a Naive Bayes Algorithm: Select a Naive Bayes algorithm based on
your dataset and requirements. Common options include Gaussian Naive
Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes.
• Create and Train the Model: Instantiate the chosen Naive Bayes algorithm and
fit it to your training data.
• Make Predictions: Use the trained model to make predictions on new or
unseen data.
• Evaluate Model Performance: Assess the performance of the Naive Bayes
classifier using evaluation metrics such as accuracy, precision, recall, or
F1-score.
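The program in this experiment reports accuracy only. The other metrics named above can be computed in the same way. The following is a minimal sketch, assuming hypothetical two-class data drawn from two Gaussian blobs instead of a .csv file; the dictionary name metrics is introduced only for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: each class is a Gaussian blob around a different centre
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# stratify=y keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Precision, recall and F1-score become more informative than accuracy when the classes are imbalanced.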
PROGRAM:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
df = pd.read_csv('sample_data.csv')
x = df.iloc[:,:-1]
y = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)
gnb = GaussianNB()
gnb.fit(x_train,y_train)
y_pred = gnb.predict(x_test)
accuracy = accuracy_score(y_test,y_pred)
print(f'Accuracy:{accuracy:.2f}')
OUTPUT:
Accuracy:0.83
EXPERIMENT NO:12
WRITE A PROGRAM TO IMPLEMENT THE NAIVE BAYESIAN
CLASSIFIER FOR A SAMPLE TRAINING DATASET STORED AS
A .CSV FILE. COMPUTE THE ACCURACY OF THE
CLASSIFIER, CONSIDERING FEW TEST DATASETS.
AIM:
Program to Implement the Naive Bayesian Classifier for a Sample Training
Dataset stored as a .csv File. Compute the Accuracy of the Classifier,
Considering few Datasets.
DESCRIPTION:
• Import Libraries: Import pandas for data manipulation, scikit-learn for
machine learning algorithms, and any other necessary libraries.
• Load and Prepare Data: Load the training dataset from the .csv file into
a pandas DataFrame. Preprocess the data as needed, including handling
missing values and encoding categorical variables.
• Split Data: Split the dataset into features (independent variables) and
the target variable (dependent variable).
• Choose a Naive Bayes Algorithm: Select a Naive Bayes algorithm suitable for
your dataset. Common options include Gaussian Naive Bayes, Multinomial Naive
Bayes, and Bernoulli Naive Bayes.
• Create and Train the Model: Instantiate the chosen Naive Bayes
algorithm and fit it to the training data.
• Load Test Data: Load the test dataset from .csv files into a pandas
DataFrame.
• Make Predictions: Use the trained model to make predictions on the test
data.
• Evaluate Model Performance: Compare the predicted labels with the
actual labels in the test data to compute the accuracy of the classifier.
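The program in this experiment splits a single file with train_test_split. The variant described above, with a separate test file, can be sketched as follows. This is an illustration under assumptions: the two CSV files are replaced by in-memory strings, and the columns height, weight and label and all values are hypothetical.

```python
import io

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# In-memory stand-ins for a training file and a test file;
# with real files, pass the file paths to pd.read_csv instead
train_csv = io.StringIO(
    "height,weight,label\n"
    "150,50,small\n155,52,small\n160,55,small\n"
    "180,80,large\n185,85,large\n190,90,large\n"
)
test_csv = io.StringIO(
    "height,weight,label\n"
    "152,51,small\n188,88,large\n"
)

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)

feature_columns = ["height", "weight"]
clf = GaussianNB()
clf.fit(train_df[feature_columns], train_df["label"])

# Predict on the test file and compare with its actual labels
y_pred = clf.predict(test_df[feature_columns])
accuracy = accuracy_score(test_df["label"], y_pred)
print("Accuracy:", accuracy)
```

To consider several test datasets, the prediction and accuracy steps can be repeated for each test file while the trained classifier stays the same.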
PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
df = pd.read_csv('/content/iris.csv')
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species']
df.info()
print(df)
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica
[150 rows x 5 columns]
Accuracy: 1.0