Unit III - Notes

Uploaded by

Kannan
VEL TECH HIGH TECH
Dr. RANGARAJAN Dr. SAKUNTHALA ENGINEERING COLLEGE
An Autonomous Institution
Approved by AICTE-New Delhi, Affiliated to Anna University, Chennai
Accredited by NBA, New Delhi & Accredited by NAAC with “A” Grade & CGPA of 3.27
Course code: 21AI35IT    Semester: III
Category: PROFESSIONAL CORE COURSE (PCC)    L T P C: 2 0 4 4
Course Title: DATA SCIENCE FOR ENGINEERS

COURSE OBJECTIVES:
• To describe the life cycle of Data Science and computational environments for data
scientists using Python.
• To describe the fundamentals for exploring and managing data with Python.
• To examine the various data analytics techniques for labeled/columnar data using Python.
• To demonstrate a flexible range of data visualization techniques in Python.
• To describe the various machine learning algorithms for data modeling with Python.

COURSE OUTCOMES:
On successful completion of this course, students will be able to:
CO No.    Course Outcome                        Bloom's level
C305.3    Understand the concepts of Pandas.    K2

UNIT III: INTRODUCTION TO PANDAS
Installing and Using Pandas, Introducing Pandas Objects, Data Indexing and Selection,
Operating on Data in Pandas, Handling Missing Data.

INTRODUCTION TO PANDAS
Pandas is a Python package written for data analysis and manipulation. It offers data
structures and operations for working with numerical tables and time series. Pandas is an
open-source library built on top of NumPy. It is widely used because it is productive to
work with, performs well, and makes importing and analyzing data much easier. Pandas
programs can be written in any plain text editor, such as Notepad or Notepad++, and saved
with a .py extension.

To install pandas, write pandas code, and perform useful operations with it, Python must
first be installed on the system.

Check if Python is Already Present
To check whether Python is installed, open the command line (on Windows, press Windows
key + R, type cmd in the Run dialog, and press Enter) and run the following command:
python --version

If Python is already installed, the command prints the installed Python version.
Otherwise, install Python first; see: How to Install Python on Windows or Linux and PIP.

Import Pandas in Python

Once pandas is installed on the system, it can be imported for use.

For this, go to a Jupyter Notebook or open a Python file, and write the following code:
import pandas as pd

Here, pd is an alias for pandas. It is a widely used convention that keeps code shorter;
it does not change how the code performs.
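
As a small illustration of what the alias means, the sketch below imports the module under both names and checks that they refer to the same object. The variable name same_module is introduced here only for the example.

```python
import pandas
import pandas as pd

# pd and pandas are two names for the same module object
same_module = pd is pandas
print(same_module)       # → True

# The installed version is available as a string
print(pd.__version__)
```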

How to Install or Download Python Pandas


Pandas can be installed in several ways on Windows, Linux, and macOS. The options are
listed below:

Install Pandas on Windows

Python Pandas can be installed on Windows in two ways:

Using pip

Using Anaconda

Install Pandas using pip

pip is a package management system used to install and manage software packages and
libraries written in Python. The packages are downloaded from a large online repository
called the Python Package Index (PyPI).

Step 1 : Launch Command Prompt

Press the Windows key or click the Start button to open the Start menu, type "cmd" in the
search bar, and click the Command Prompt app. Alternatively, press Windows key + R, enter
"cmd", and press Enter.

Step 2 : Run the Command

Install pandas with pip by running the following command in Command Prompt:
pip install pandas
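
One way to confirm that the installation worked is to ask Python whether it can locate the package. This is a minimal sketch using the standard-library importlib module; the printed messages are illustrative.

```python
import importlib.util

# find_spec returns None when the package cannot be located
spec = importlib.util.find_spec("pandas")

if spec is None:
    print("pandas is not installed; run: pip install pandas")
else:
    import pandas as pd
    print("pandas", pd.__version__, "is installed")
```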
Introduction to Pandas

• Pandas is a Python library used for data analysis and manipulation.
• It builds on NumPy and provides easy-to-use data structures for labeled data.

🧩 2. Core Pandas Data Structures

Pandas provides two primary objects:

🟦 A. Series – 1D Labeled Array

• A Series is a one-dimensional array-like object that can hold any data type.
• It includes an index which labels each element.

🔹 Syntax:
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)
🔹 Output:
a    10
b    20
c    30
dtype: int64
🔹 Key Properties:
• s.index – returns the index labels ('a', 'b', 'c') as an Index object
• s.values – returns the data (10, 20, 30) as a NumPy array
🔹 Use Cases:
• Time series data
• Single-column data
• Intermediate results in calculations
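
The properties above can be inspected directly. This is a minimal sketch that reuses the Series from the syntax example.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(list(s.index))        # → ['a', 'b', 'c']
print(s.values.tolist())    # → [10, 20, 30]
print(s['b'])               # → 20 (lookup by label)
print(len(s))               # → 3
```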

🟨 B. DataFrame – 2D Labeled Table

• A DataFrame is a 2-dimensional labeled data structure with rows and columns.
• Think of it like a table or spreadsheet.

🔹 Syntax:
data = {
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
}

df = pd.DataFrame(data)
print(df)
🔹 Output:
    Name  Age
0  Alice   25
1    Bob   30
🔹 Key Properties:
• df.columns – returns column labels (['Name', 'Age'])
• df.index – returns row index ([0, 1])
• df.values – returns 2D array of values

🔹 Accessing Data:
df['Name']   # Access column
df.loc[0]    # Access row by label
df.iloc[1]   # Access row by position
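
With the default index 0, 1, ... labels and positions coincide, so loc and iloc look alike. The sketch below uses hypothetical row labels 'x' and 'y' (not part of the notes) to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob'], 'Age': [25, 30]},
    index=['x', 'y'],   # hypothetical row labels, chosen for illustration
)

name_by_label = df.loc['y', 'Name']    # row labeled 'y'
age_by_position = df.iloc[0, 1]        # first row, second column

print(name_by_label)      # → Bob
print(age_by_position)    # → 25
```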

🧠 3. Differences Between Series and DataFrame


Feature           Series                DataFrame
Dimension         1D                    2D
Data structure    Array with index      Table with rows and columns
Use Case          One column of data    Tabular data

📌 4. Creating Pandas Objects from Various Data Types


Source Type      Constructor Used       Example
List or array    pd.Series()            pd.Series([1, 2, 3])
Dictionary       pd.DataFrame()         pd.DataFrame({'a':[1], 'b':[2]})
NumPy array      pd.DataFrame()         pd.DataFrame(np.random.rand(2,3))
CSV/Excel/SQL    pd.read_csv(), etc.    pd.read_csv('file.csv')
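
The constructors in the table can be tried together as follows. This is a sketch under one assumption: instead of a real 'file.csv' on disk, an in-memory io.StringIO buffer with made-up CSV text stands in for the file.

```python
import io

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])                        # from a list
df_dict = pd.DataFrame({'a': [1], 'b': [2]})    # from a dictionary of columns
df_arr = pd.DataFrame(np.random.rand(2, 3))     # from a 2x3 NumPy array

# io.StringIO stands in for a CSV file on disk (illustrative data)
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_arr.shape)            # → (2, 3)
print(list(df_csv.columns))    # → ['Name', 'Age']
print(len(df_csv))             # → 2
```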

✅ 5. Summary

• Series: 1D data with labels
• DataFrame: 2D data with row and column labels
• Pandas simplifies loading, transforming, and analyzing structured data

1. Indexing in Series
A Series in Pandas can be indexed by:

• Position (like lists)
• Label (like dictionaries)

✅ Example:
import pandas as pd
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

🔹 Accessing Elements:
s['a']      # 100 (by label)
s.iloc[1]   # 200 (by position; s[1] on a label-indexed Series is deprecated in recent pandas)

🔹 Slicing:
s['a':'c']    # Label slice: includes both start and end
s.iloc[0:2]   # Positional slice: end excluded, like list slicing
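
The difference between label slices and positional slices is easy to miss, so here is a minimal sketch using the explicit loc and iloc accessors. The variable names by_label and by_position are chosen for this example.

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

by_label = s.loc['a':'b']      # label slice: the end label 'b' is included
by_position = s.iloc[0:1]      # positional slice: position 1 is excluded

print(by_label.tolist())       # → [100, 200]
print(by_position.tolist())    # → [100]
```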

2. Indexing in DataFrame
A DataFrame supports:

• Column selection
• Row selection
• Element access
• Boolean indexing

✅ Example:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)

🟦 A. Column Selection
df['Name'] # Single column (as Series)
df[['Name', 'Age']] # Multiple columns (as DataFrame)

B. Row Selection
🔹 Using loc[] (Label-based):
df.loc[0] # First row
df.loc[0:1] # Rows 0 to 1 (inclusive)
df.loc[df['Age'] > 25] # Filtered rows
🔹 Using iloc[] (Integer-based):
df.iloc[1] # Second row
df.iloc[0:2] # First two rows

C. Accessing Individual Elements


df.loc[0, 'Name'] # Alice
df.iloc[1, 0] # Bob

3. Boolean Indexing / Conditional Selection


Used to filter rows based on condition(s):

df[df['Age'] > 25]
✅ Example:
# Output rows where Age > 25
print(df[df['Age'] > 25])

4. Using Conditions with Multiple Filters


• Use & for AND, | for OR
• Enclose each condition in parentheses

df[(df['Age'] > 25) & (df['Name'] != 'Bob')]
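
A short sketch of combined conditions on the same three-row DataFrame follows. The ~ (NOT) operator is an addition that the notes do not mention, and the variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

mask = df['Age'] > 25                                        # False, True, True
both = df[mask & (df['Name'] != 'Bob')]                      # AND
either = df[(df['Age'] < 30) | (df['Name'] == 'Charlie')]    # OR
negated = df[~mask]                                          # NOT

print(both['Name'].tolist())       # → ['Charlie']
print(either['Name'].tolist())     # → ['Alice', 'Charlie']
print(negated['Name'].tolist())    # → ['Alice']
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.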

📌 Summary Table
Method              Description                      Example
df['col']           Access a column                  df['Name']
df.loc[]            Label-based row/column access    df.loc[0, 'Age']
df.iloc[]           Integer-location based access    df.iloc[1, 0]
Boolean Indexing    Conditional row selection        df[df['Age'] > 30]
Slicing             Range of rows/columns            df[0:2], df.loc[1:2]

Operating on Data in Pandas


This topic covers how to perform arithmetic, statistical, and functional operations on
data stored in Pandas Series and DataFrame objects.

1. Element-wise Operations
Pandas supports element-wise arithmetic operations between:

• Series and Series
• DataFrame and DataFrame
• DataFrame and scalar value

Example:

import pandas as pd

df = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [5, 15, 25]
})

# Add 5 to each element
print(df + 5)

# Multiply column by 2
print(df['A'] * 2)

# Subtract columns
print(df['A'] - df['B'])
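
Arithmetic between two Series matches values by index label rather than by position, and labels present on only one side produce NaN. This is a minimal sketch with two illustrative Series, s1 and s2, which are not from the notes.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = s1 + s2                      # aligned by label, not by position
print(total['b'], total['c'])        # → 12.0 23.0
print(total.isna().tolist())         # → [True, False, False, True]

filled = s1.add(s2, fill_value=0)    # treat a missing side as 0
print(filled.tolist())               # → [1.0, 12.0, 23.0, 30.0]
```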

2. Statistical and Aggregation Functions


Pandas provides built-in functions to summarize and describe data.

Function      Description
sum()         Sum of values
mean()        Average/mean
median()      Median value
std()         Standard deviation
min()         Minimum value
max()         Maximum value
count()       Count non-null values
describe()    Summary statistics

Example:
df.sum() # Column-wise sum
df.mean() # Column-wise mean
df.describe() # Summary of statistics
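
To make the results concrete, this sketch applies a few of these functions to the two-column DataFrame from the element-wise example. The axis=1 call (row-wise aggregation) is an addition not shown in the notes.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 15, 25]})

col_sums = df.sum()          # one value per column
col_means = df.mean()
row_sums = df.sum(axis=1)    # one value per row

print(col_sums.tolist())     # → [60, 45]
print(col_means.tolist())    # → [20.0, 15.0]
print(row_sums.tolist())     # → [15, 35, 55]
print(df['A'].max(), df['B'].min())   # → 30 5
```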

3. Function Application
You can apply custom or built-in functions to:

• A column (Series)
• The entire DataFrame

Using apply():
# Square each value in column A
df['A'].apply(lambda x: x ** 2)

Using applymap() (for element-wise DataFrame ops; deprecated since pandas 2.1 in favor of
DataFrame.map()):
df.applymap(lambda x: x * 2)
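
The sketch below shows the three levels at which a function can be applied: per element of a Series, per column or row of a DataFrame, and per element of a DataFrame. Because the element-wise method is named map in pandas 2.1 and later and applymap in earlier versions, the example picks whichever exists; the helper name elementwise is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 15, 25]})

squares = df['A'].apply(lambda x: x ** 2)                       # per element of a Series
ranges = df.apply(lambda col: col.max() - col.min())            # per column
row_sums = df.apply(lambda row: row['A'] + row['B'], axis=1)    # per row

# Element-wise over the whole DataFrame: use DataFrame.map when available,
# otherwise fall back to applymap on older pandas versions.
elementwise = df.map if hasattr(df, 'map') else df.applymap
doubled = elementwise(lambda x: x * 2)

print(squares.tolist())         # → [100, 400, 900]
print(ranges.tolist())          # → [20, 20]
print(row_sums.tolist())        # → [15, 35, 55]
print(doubled['B'].tolist())    # → [10, 30, 50]
```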

4. Broadcasting
Broadcasting allows arithmetic between objects of different shapes:

• A scalar applies to every element
• A Series is aligned on labels: against the DataFrame's column labels by default, or
against its row index when axis=0 is passed

Example:
s = pd.Series([1, 2, 3])
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})

# Subtract s from every column, matching the index of s to the row index of df
df.sub(s, axis=0)
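
Here is the same subtraction with its result, followed by the default column-wise alignment for comparison. The offsets Series and its values are illustrative and not part of the notes.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})

# axis=0: s[i] is subtracted from row i of every column
by_rows = df.sub(s, axis=0)
print(by_rows['A'].tolist())     # → [9, 18, 27]
print(by_rows['B'].tolist())     # → [39, 48, 57]

# Default alignment is on column labels, so the Series index must name columns
offsets = pd.Series({'A': 10, 'B': 40})   # illustrative values
by_cols = df - offsets
print(by_cols.values.tolist())   # → [[0, 0], [10, 10], [20, 20]]
```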

🎯 5. Sorting and Ranking


🔹 Sorting:
df.sort_values(by='A') # Sort rows by column A
df.sort_index() # Sort by index

🔹 Ranking:
df['A'].rank() # Assign rank to values
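
A small sketch of sorting and ranking on a column that contains a tie. The data, the row labels, and the ascending and method arguments shown here are illustrative choices, not taken from the notes.

```python
import pandas as pd

df = pd.DataFrame({'A': [30, 10, 20, 10]}, index=['w', 'x', 'y', 'z'])

ascending_vals = df.sort_values(by='A')['A'].tolist()
descending_vals = df.sort_values(by='A', ascending=False)['A'].tolist()
reversed_index = df.sort_index(ascending=False).index.tolist()

print(ascending_vals)     # → [10, 10, 20, 30]
print(descending_vals)    # → [30, 20, 10, 10]
print(reversed_index)     # → ['z', 'y', 'x', 'w']

avg_rank = df['A'].rank()                # tied values share the average rank
min_rank = df['A'].rank(method='min')    # tied values share the lowest rank
print(avg_rank.tolist())    # → [4.0, 1.5, 3.0, 1.5]
print(min_rank.tolist())    # → [4.0, 1.0, 3.0, 1.0]
```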

🧪 6. Data Type Operations


You can check and convert data types using:

df.dtypes # Check column data types
df.astype(int) # Convert data type
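
astype() returns a converted copy and can also take a dictionary to convert selected columns only. This is a minimal sketch with made-up data in which column A holds numbers stored as text.

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': [1.5, 2.5, 3.5]})

print(df.dtypes['B'])                  # → float64

converted = df.astype({'A': int})      # convert only column A; df is unchanged
total_a = converted['A'].sum()
print(total_a)                         # → 6

truncated = df['B'].astype(int)        # fractional part is dropped
print(truncated.tolist())              # → [1, 2, 3]
```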

📌 Summary Table
Operation Type       Method / Function        Example
Arithmetic           +, -, *, /               df['A'] + 5
Aggregation          sum(), mean(), etc.      df.mean()
Apply function       apply(), applymap()      df.apply(lambda x: x*2)
Sorting & Ranking    sort_values(), rank()    df.sort_values('A')
Type Conversion      astype()                 df.astype('float')

Handling Missing Data in Pandas

Objective:

Learn how to identify, remove, and fill missing values (NaNs) in Series and DataFrame
objects using Pandas.

1. What is Missing Data?


• Missing data is represented as:
  ◦ NaN (Not a Number)
  ◦ None (Python's null value)
• Common causes:
  ◦ Incomplete data entries
  ◦ Failed data imports
  ◦ Data corruption

🔍 2. Detecting Missing Data


✅ Use isnull() and notnull()
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, np.nan, 30]
})

print(df.isnull())    # Returns True for missing values
print(df.notnull())   # Returns True for non-missing values
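
For reference, this sketch spells out the masks that the two calls produce on that DataFrame, column by column. The variable name missing is chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, np.nan, 30],
})

missing = df.isnull()
print(missing['Name'].tolist())       # → [False, False, True]
print(missing['Age'].tolist())        # → [False, True, False]

# notnull() is the exact opposite; summing it counts the valid entries per column
print(df.notnull().sum().tolist())    # → [2, 2]
```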

🧹 3. Dropping Missing Data


🔹 dropna() – Remove rows or columns with NaN values
dropna() returns a new DataFrame; the original is unchanged unless the result is assigned back.
✅ Drop rows with any missing value:
df.dropna()
✅ Drop columns with any missing value:
df.dropna(axis=1)
✅ Drop rows where all values are NaN:
df.dropna(how='all')
✅ Keep only rows with at least 2 non-NaN values (rows below that threshold are dropped):
df.dropna(thresh=2)
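
The four variants behave quite differently on the same data. This is a minimal sketch on a made-up four-row DataFrame; the City column and its values are added only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, None],
    'Age': [25, np.nan, 30, np.nan],
    'City': ['Chennai', 'Delhi', None, None],   # illustrative column
})

complete_rows = df.dropna()               # only row 0 has no gaps
not_all_missing = df.dropna(how='all')    # row 3 is entirely missing
at_least_two = df.dropna(thresh=2)        # rows 0 and 1 have >= 2 values
complete_cols = df.dropna(axis=1)         # every column has a gap

print(len(complete_rows))      # → 1
print(len(not_all_missing))    # → 3
print(len(at_least_two))       # → 2
print(complete_cols.shape)     # → (4, 0)
```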

🧴 4. Filling Missing Data


🔹 fillna() – Replace NaNs with a value
✅ Replace with a constant:
df.fillna(0)
df.fillna('Unknown')
✅ Forward Fill (propagate last valid value forward):
df.ffill()   # fillna(method='ffill') is deprecated since pandas 2.1
✅ Backward Fill:
df.bfill()   # fillna(method='bfill') is deprecated since pandas 2.1
✅ Fill using column mean/median:
df['Age'] = df['Age'].fillna(df['Age'].mean())   # assign back instead of inplace=True on a column
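
The filling strategies can be compared side by side on one small Series. This sketch uses illustrative data and the ffill() and bfill() methods, which are the current spelling of forward and backward fill.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

constant = s.fillna(0)
forward = s.ffill()             # carry the last valid value forward
backward = s.bfill()            # pull the next valid value backward
with_mean = s.fillna(s.mean())  # mean ignores NaN: (1 + 4) / 2

print(constant.tolist())     # → [1.0, 0.0, 0.0, 4.0]
print(forward.tolist())      # → [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())     # → [1.0, 4.0, 4.0, 4.0]
print(with_mean.tolist())    # → [1.0, 2.5, 2.5, 4.0]
```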

🧪 5. Interpolating Missing Data


Estimates missing numeric values from neighboring values (linear interpolation by default):

df.interpolate()
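
With the default linear method, each gap is filled with evenly spaced values between its neighbors. This is a minimal sketch on an illustrative numeric Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

estimated = s.interpolate()    # default method is linear
print(estimated.tolist())      # → [10.0, 20.0, 30.0, 40.0]
```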

🧠 6. Checking for Any Missing Data


df.isnull().sum()   # Count of missing values per column
df.isnull().any()   # True for each column that contains at least one missing value

📝 Example: Handling Missing Data


import pandas as pd

data = {'Name': ['Alice', None, 'Charlie'], 'Age': [25, None, 30]}
df = pd.DataFrame(data)

# Fill missing name with 'Unknown', and Age with average.
# Assign the result back: fillna(..., inplace=True) on a selected column
# may not update df in recent pandas versions.
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())
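
The sketch below runs this example end to end and shows the resulting values. It assigns the filled columns back to df rather than using inplace=True, which is a choice made here so the result does not depend on the pandas version.

```python
import pandas as pd

data = {'Name': ['Alice', None, 'Charlie'], 'Age': [25, None, 30]}
df = pd.DataFrame(data)

df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())   # mean of 25 and 30

remaining = int(df.isnull().sum().sum())         # total missing values left

print(df['Name'].tolist())    # → ['Alice', 'Unknown', 'Charlie']
print(df['Age'].tolist())     # → [25.0, 27.5, 30.0]
print(remaining)              # → 0
```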

📌 Summary Table
Task                           Method                 Example
Detect missing data            isnull(), notnull()    df.isnull()
Drop rows with missing data    dropna()               df.dropna()
Fill missing data              fillna(value)          df.fillna(0)
Forward/backward fill          ffill(), bfill()       df.ffill()
Interpolation                  interpolate()          df.interpolate()
Count missing values           isnull().sum()         df.isnull().sum()
