School of Computer Engineering and Technology
Lab Assignment
Write a Python program to perform pre-processing
on a suitable dataset and illustrate various
visualization techniques on suitable sample data.
Analyze the results.
Index
⚫ Data Preprocessing steps in Python
⚫ Importing the libraries.
⚫ Importing the Dataset.
⚫ Handling of Missing Data.
⚫ Handling of Categorical Data.
⚫ Splitting the dataset into training and testing datasets.
⚫ Feature Scaling.
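The steps listed above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the sample data, the column names (Country, Age, Salary, Purchased), mean imputation, one-hot encoding and standardization are all choices made for this example, not a prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data; column names and values are invented for illustration.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, np.nan, 40.0, 35.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0, 58000.0, 52000.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
})

# Handling missing data: replace each NaN with the mean of its column.
numeric_cols = ["Age", "Salary"]
imputer = SimpleImputer(strategy="mean")
data[numeric_cols] = imputer.fit_transform(data[numeric_cols])

# Handling categorical data: one-hot encode the feature, map the label to 0/1.
X = pd.get_dummies(data[["Country", "Age", "Salary"]], columns=["Country"], dtype=float)
y = data["Purchased"].map({"No": 0, "Yes": 1})

# Splitting: hold out 2 of the 6 rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0
)

# Feature scaling: fit on the training rows only, then apply to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows alone keeps information from the test rows out of the preprocessing step.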
Step 1: Import Libraries
⚫ The following are the key libraries needed for this
assignment:
⚫ NumPy
⚫ SciPy
⚫ Pandas
⚫ scikit-learn
⚫ matplotlib
⚫ Seaborn
⚫ Bokeh
⚫ Altair
⚫ Plotly
⚫ ggplot
⚫ E.g.: import pandas as pd
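As a small sketch of the import step, the snippet below loads a few of the listed libraries under their conventional aliases and prints the installed versions. Which libraries to import is an assumption here; a given assignment may need only some of them.

```python
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

# Print the installed version of each library to confirm the environment is set up.
for module in (np, pd, matplotlib, sns, sklearn):
    print(module.__name__, module.__version__)
```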
Step 2: Import the Dataset
⚫ Datasets are commonly read from several different file
formats:
⚫ .csv
⚫ .xls
⚫ .txt
dataset = pd.read_excel('age_salary.xls')
dataset = pd.read_table('age_salary.txt')
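A minimal sketch of reading delimited text follows. To keep it runnable without any files on disk, it reads from in-memory strings; the Age and Salary contents are hypothetical stand-ins for a .csv and a .txt file. Reading a .xls file with pd.read_excel works the same way but additionally needs an Excel engine such as xlrd to be installed.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for a .csv file and a tab-separated .txt file.
csv_text = "Age,Salary\n25,30000\n32,45000\n41,62000\n"
txt_text = "Age\tSalary\n25\t30000\n32\t45000\n41\t62000\n"

# read_csv expects comma-separated values by default.
dataset_csv = pd.read_csv(io.StringIO(csv_text))

# read_table expects tab-separated values by default.
dataset_txt = pd.read_table(io.StringIO(txt_text))

print(dataset_csv)
print(dataset_csv.equals(dataset_txt))
```

With real files, the io.StringIO(...) argument is simply replaced by the file path.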
Methods for preprocessing data
⚫ .head()
⚫ .tail()
⚫ .columns (attribute, no parentheses)
⚫ .info()
⚫ .describe()
⚫ .dtypes (attribute, no parentheses)
⚫ .index (attribute, no parentheses)
⚫ .fillna()
⚫ .dropna()
⚫ .isnull()
⚫ .isna()
⚫ Demo Program
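A possible demo program for the inspection and missing-value methods listed above is sketched here. The DataFrame contents are hypothetical, and filling with the column mean is just one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 41.0, 38.0],
    "Salary": [30000.0, 45000.0, np.nan, 52000.0],
})

print(df.head(2))     # first two rows
print(df.tail(2))     # last two rows
print(df.columns)     # column labels (attribute)
print(df.dtypes)      # type of each column (attribute)
print(df.index)       # row labels (attribute)
df.info()             # prints a concise summary, including non-null counts
print(df.describe())  # descriptive statistics

# isnull() and isna() are aliases; both mark missing entries with True.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Two ways to handle the missing entries.
filled = df.fillna(df.mean())  # replace NaN with the column mean
dropped = df.dropna()          # drop every row that contains a NaN
print(filled)
print(dropped)
```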
Methods description
A DataFrame is a 2-dimensional data structure that stores data of different types
(such as strings, integers, floating-point values and categorical data) in columns.

df.attribute   description
dtypes         list the types of the columns
columns        list the column names
axes           list the row labels and column names
ndim           number of dimensions
size           number of elements
shape          return a tuple representing the dimensionality
values         NumPy representation of the data
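The attributes in the table can be tried out on a small DataFrame. The names and numbers below are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data: 3 rows, 3 columns of different types.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera"],
    "Age": [25, 32, 41],
    "Salary": [30000.0, 45000.0, 62000.0],
})

print(df.dtypes)   # type of each column
print(df.columns)  # column names
print(df.axes)     # [row labels, column names]
print(df.ndim)     # number of dimensions
print(df.size)     # number of elements (rows * columns)
print(df.shape)    # (rows, columns)
print(df.values)   # NumPy array holding the data
```

Note that none of these take parentheses, since they are attributes rather than methods.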
df.method()        description
head([n])          first n rows
tail([n])          last n rows
describe()         generate descriptive statistics (for numeric columns only, by default)
max(), min()       return max/min values for all numeric columns
mean(), median()   return mean/median values for all numeric columns
std()              standard deviation
dropna()           drop all the records with missing values
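A short sketch of the summary methods in the table, again on hypothetical data. By default these statistics skip missing values rather than returning NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing Age.
df = pd.DataFrame({
    "Age": [25.0, 32.0, np.nan, 41.0],
    "Salary": [30000.0, 45000.0, 52000.0, 62000.0],
})

summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary)

print(df.max())     # largest value in each column
print(df.min())     # smallest value in each column
print(df.mean())    # mean of each column, ignoring NaN
print(df.median())  # median of each column, ignoring NaN
print(df.std())     # sample standard deviation of each column

cleaned = df.dropna()  # keep only the rows with no missing values
print(cleaned)
```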
Introduction to Visualization
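seaborn function   description
distplot           histogram (deprecated in newer seaborn releases; use histplot or displot)
barplot            estimate of central tendency for a numeric variable
jointplot          scatterplot of two variables with their marginal distributions
regplot            regression plot
pairplot           grid of pairwise plots across variables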
References
⚫ https://data-flair.training/blogs/python-ml-data-preprocessing/
⚫ Python for Data Analysis, Research Computing Services, Katia Oleinik
⚫ https://blog.insightdatascience.com/data-visualization-in-python-advanced-functionality-in-seaborn-20d217f1a9a6