
BCA SEM-5 PYTHON Unit-4

How to generate a virtual environment in Python?

OR

How to manage packages with pip (Python Package Index) in Python?

 Python applications will often use packages and modules that don’t come
as part of the standard library.
 Applications will sometimes need a specific version of a library.
 This means it may not be possible for one Python installation to meet the
requirements of every application.
 If application A needs version 1.0 of a particular module but application B
needs version 2.0, then the requirements are in conflict and installing
either version 1.0 or 2.0 will leave one application unable to run.
 The solution for this problem is to create a virtual environment, a self-
contained directory tree that contains a Python installation for a particular
version of Python, plus a number of additional packages.
 Different applications can then use different virtual environments.
 To resolve the earlier example of conflicting requirements, application A
can have its own virtual environment with version 1.0 installed while
application B has another virtual environment with version 2.0.
 If application B requires a library be upgraded to version 3.0, this will not
affect application A’s environment.
 To create a virtual environment, decide upon a directory where you want
to place it, and run the venv module as a script with the directory path:

python3 -m venv tutorial-env

 This will create the tutorial-env directory if it doesn’t exist, and also create
directories inside it containing a copy of the Python interpreter, the
standard library, and various supporting files.


 Once you’ve created a virtual environment, you may activate it.

tutorial-env\Scripts\activate
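
On macOS or Linux, the activation script lives in the environment's bin directory instead; assuming the same environment name as above:

source tutorial-env/bin/activate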

Managing Packages with pip

 You can install, upgrade, and remove packages using a program called
pip. By default pip will install packages from the Python Package Index.
 pip has a number of subcommands: “search”, “install”, “uninstall”,
“freeze”, etc.
 You can install the latest version of a package by specifying a
package’s name:

(tutorial-env) $ pip install novas

 You can also install a specific version of a package by giving the package name followed by == and the version number:

(tutorial-env) $ pip install requests==2.6.0

 pip list will display all of the packages installed in the virtual environment:

(tutorial-env) $ pip list


novas (3.1.1.3)
numpy (1.9.2)
pip (7.0.3)
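
 pip can also upgrade or remove packages and record the installed set. A few illustrative commands (standard pip subcommands; the package names simply reuse the examples above):

(tutorial-env) $ pip install --upgrade requests
(tutorial-env) $ pip uninstall novas
(tutorial-env) $ pip freeze > requirements.txt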

How to install the MySQL connector and verify the installation?

 To communicate with a MySQL database from Python, we need a program that acts as an interface to the MySQL database server.
 MySQLdb is an interface between the MySQL server and Python programs.

Before you can access MySQL databases using Python, you must install one (or more) of the following packages in Python.


 MySQL-python:

 This package contains the MySQLdb module, which is written in C. It is one of the most commonly used Python packages for MySQL.

 mysql-connector-python:

 This package contains the mysql.connector module, which is written entirely in Python.

 PyMySQL:

 This package contains the pymysql module, which is written entirely in Python. It is designed to be a drop-in replacement for the MySQL-python package.

 All three of these packages use Python's portable SQL database API.
 This means that if you switch from one module to another, you can reuse almost all of your existing code.

 Type the command for the package you want to install:

 To install the MySQL-python package, type the following command:

pip install MySQL-python

 To install the mysql-connector-python package, type the following command:

pip install mysql-connector-python

 To install the pymysql package, type the following command:

pip install pymysql


 Verifying the MySQLdb Interface Installation

 If the MySQLdb interface for Python has been installed successfully, we can see a new module named 'MySQLdb' added to the existing modules in the Python library.
 Go to the Python IDLE shell window and type the following at the Python prompt:
>>> help('modules')
 As a result, it will display all the available modules of Python.
 We should be able to locate the 'MySQLdb' module among them.
 This indicates that the MySQLdb interface for Python has been successfully installed.
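
 An alternative quick check is to import the driver directly. A minimal sketch, assuming the mysql-connector-python package from above is installed:

import mysql.connector
print(mysql.connector.__version__)

If the import succeeds and a version number is printed, the connector is installed correctly.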

Using MySQL from Python, explain select, insert and delete with examples.

 Python can be used in database applications.
 One of the most popular databases is MySQL.

 You need to know the following detail of the MySQL server to perform the
connection from Python.

 Username – i.e., the username that you use to work with the MySQL server. The default username for the MySQL database is root.

 Password – the password given by the user at the time of installing the MySQL database. If you are using root, you won't need the password.
 Host Name – the server name or IP address on which MySQL is running. If you are running on localhost, you can use localhost or its IP, i.e. 127.0.0.1.
 Database Name – the database to which you want to connect. Here we are using the database named 'Emp' because we have already created it for our example.

Methods which are used in database connectivity with MySQL:

1. connect(): This method is used for creating a connection to our database. It takes four arguments:
   1. Host (server) name
   2. Database user name
   3. Password
   4. Database name

2. cursor(): This method creates a cursor object that is capable of executing SQL queries on the database.

3. execute(): This method is used for executing an SQL query on the database. It takes an SQL query (as a string) as an argument.

4. fetchone(): This method retrieves the next row of a query result set and returns a single sequence, or None if no more rows are available.

5. close(): This method closes the database connection.

Creating a Table

 To create a table in MySQL, use the "CREATE TABLE" statement.
 Make sure you define the name of the database when you create the connection, as in the sketch below.
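
A minimal sketch using mysql-connector-python, assuming the 'Emp' database from this example already exists and using the same emp(eid, name, sal) columns as the examples that follow:

import mysql.connector as mydb

pydb = mydb.connect(host="localhost", user="root", passwd="", database="Emp")
pycursor = pydb.cursor()
pycursor.execute("create table emp(eid int primary key, name varchar(30), sal float)")
print("Table created successfully")
pydb.close()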


 Inserting Record

 To fill a table in MySQL, use the "INSERT INTO" statement.

import mysql.connector as mydb

pydb = mydb.connect(host="localhost", user="root", passwd="", database="Emp")
pycursor = pydb.cursor()
pycursor.execute("insert into emp(eid,name,sal) values(10,'dixita',2000)")
pydb.commit()
print(pycursor.rowcount, "Row Inserted Successfully")
pydb.close()

 Delete Record
 You can delete records from an existing table by using the "DELETE FROM" statement:

import mysql.connector as mydb

pydb = mydb.connect(host="localhost", user="root", passwd="", database="Emp")
pycursor = pydb.cursor()
pycursor.execute("delete from emp where eid=10")
pydb.commit()
print(pycursor.rowcount, "Row deleted Successfully")
pydb.close()

 Select From a Table

 To select from a table in MySQL, use the "SELECT" statement:

import mysql.connector as mydb

pydb = mydb.connect(host="localhost", user="root", passwd="", database="Emp")
pycursor = pydb.cursor()
pycursor.execute("SELECT * FROM emp")
res = pycursor.fetchall()
for x in res:
    print(x)
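
 The fetchone() method listed earlier can be used instead of fetchall() to read the result set one row at a time. A small sketch, reusing the connection and cursor from the example above:

pycursor.execute("SELECT * FROM emp")
row = pycursor.fetchone()
while row is not None:
    print(row)
    row = pycursor.fetchone()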

 Data Analysis With Pandas

The process of installing and configuring Pandas, including setting up a virtual environment,
involves several steps:
1. Install PIP (if not already present):
PIP is typically included with Python installations from version 3.4 onwards. To verify if PIP
is installed and to check its version, open your terminal or command prompt and execute:
pip --version
If PIP is not found, you may need to reinstall Python or follow platform-specific instructions
to install PIP.
2. Create a Virtual Environment:
Navigate to your desired project directory in the terminal or command prompt and create a
virtual environment using the venv module:
python -m venv my_pandas_env

Replace my_pandas_env with your preferred name for the virtual environment.
3. Activate the Virtual Environment:
On Windows:
my_pandas_env\Scripts\activate
On macOS/Linux:
source my_pandas_env/bin/activate

Once activated, the name of your virtual environment will typically appear in your terminal
prompt, indicating that you are working within it.
4. Install Pandas 2.0:
With the virtual environment active, use PIP to install Pandas. To specifically install version
2.0, execute:
pip install pandas==2.0.0
(Note: Replace 2.0.0 with the specific patch version if required, e.g., 2.0.3.)
5. Verify Installation:
To confirm that Pandas has been installed correctly within your virtual environment, open a
Python interpreter or create a Python script and execute the following:

import pandas as pd
print(pd.__version__)

This will print the installed version of Pandas, confirming a successful installation.
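
When you are done working in the environment, you can leave it with the deactivate command that activation provides:

deactivate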

 Data Structures in Pandas

Pandas is an open-source Python library used for working with relational or labeled data
in an easy and intuitive way. It provides powerful data structures and a wide range of
operations for manipulating numerical data and time series. Pandas also offers tools for
cleaning, processing and analyzing data efficiently. It is one of the most popular libraries
for data analysis in Python and primarily supports two core data structures:
 Series
 DataFrame
Series
A Series is a one-dimensional array-like object that can store any data type such as
integers, strings, floats, or even Python objects. It comes with labels (called an index).
Syntax
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False)
Parameters:
 data: Array-like, dict or scalar – Input data.
 index (Optional): Labels for the axis.
 dtype (Optional): Data type of the Series.
 name (Optional): Name of the Series.
 copy (Bool): Copy data if True.
Returns: A pandas.Series object containing the provided data with an associated index.
Example 1: Series holding the char data type.

import pandas as pd

a = ['g', 'e', 'e', 'k', 's']
res = pd.Series(a)
print(res)

Output:

0    g
1    e
2    e
3    k
4    s
dtype: object

Explanation: We pass the list a into pd.Series(a), which converts it into a Series (a column-like structure) where each item gets a default index starting from 0, automatically assigned by Pandas.
Example 2: Series holding the Int data type.

import pandas as pd

a = [1, 2, 3, 4, 5]
res = pd.Series(a)
print(res)

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64

Explanation: We pass the list a into pd.Series(a), which converts it into a Series (a column-like structure) where each number gets a default index starting from 0, automatically assigned by Pandas.
Example 3: Series holding the dictionary.

import pandas as pd

a = {'Id': 1013, 'Name': 'Mohe', 'State': 'Manipur', 'Age': 24}
res = pd.Series(a)
print(res)

Output:

Id          1013
Name        Mohe
State    Manipur
Age           24
dtype: object

 Dataframe

A DataFrame is a two-dimensional, size-mutable and heterogeneous tabular data structure with labeled rows and columns, similar to a spreadsheet or SQL table. Each column in a DataFrame is a Pandas Series, allowing you to work with multiple types of data in one table.
Syntax:
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Parameters:
 data: Various forms of input data (e.g., lists, dict, ndarray, Series, another DataFrame).
 index( Optional): labels for rows.
 columns(Optional): labels for columns.
 dtype(Optional): Optional data type for all columns.
 copy(Optional): Boolean; whether to copy data or not.
Returns: A pandas.DataFrame object representing a 2D labeled data structure.
Example 1: Creating a DataFrame from a list

import pandas as pd

a = ['Python', 'Pandas', 'Numpy']
df = pd.DataFrame(a, columns=['Tech'])
print(df)

Output:

     Tech
0  Python
1  Pandas
2   Numpy

Explanation: We pass the list a into pd.DataFrame(a, columns=['Tech']), which converts it into a DataFrame with a single column named 'Tech'. Each item becomes a row and Pandas automatically assigns a default integer index starting from 0.
Example 2: Creating a DataFrame from a dictionary

import pandas as pd

a = {
    'Name': ['Tom', 'Nick', 'Krish', 'Jack'],
    'Age': [20, 21, 19, 18]
}
res = pd.DataFrame(a)
print(res)

Output:

    Name  Age
0    Tom   20
1   Nick   21
2  Krish   19
3   Jack   18

Explanation: We pass the dictionary a into pd.DataFrame(a), which converts it into a DataFrame where the dictionary keys become column names and the values (lists) become the column data. Pandas assigns a default integer index starting from 0 for the rows.
Example 3: Selecting columns and rows in a DataFrame

import pandas as pd

a = {
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32],
    'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(a)
print(df[['Name', 'Qualification']])

Output:

     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd

Explanation: We create a DataFrame df from the dictionary a, then select and print only
the columns 'Name' and 'Qualification' by passing their names in a list to df[]. This returns
a new DataFrame with just those two columns.
Accessing columns and rows in a DataFrame
A DataFrame in Pandas is a 2D tabular structure where you can easily access and
manipulate data by selecting specific columns or rows. You can extract one or more
columns using column names and filter rows using labels or conditions.
Example 1: We can access one or more columns in a DataFrame using square brackets.

import pandas as pd

a = {
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32],
    'City': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']
}
df = pd.DataFrame(a)

print(df['Name'])            # single column
print(df[['Name', 'City']])  # multiple columns

Output:

0       Jai
1    Princi
2    Gaurav
3      Anuj
Name: Name, dtype: object

     Name       City
0     Jai      Delhi
1  Princi     Kanpur
2  Gaurav  Allahabad
3    Anuj    Kannauj

Explanation:
 df['Name'] returns a Series containing values from the 'Name' column.
 df[['Name', 'City']] returns a new DataFrame containing only the specified columns.
Example 2: We can use .loc[] to access rows by index or filter them using conditions.

import pandas as pd

a = {
    'Name': ['Mohe', 'Shyni', 'Parul', 'Sam'],
    'ID': [12, 43, 54, 32],
    'City': ['Delhi', 'Kochi', 'Pune', 'Patna']
}
df = pd.DataFrame(a)
res = df.loc[df['Name'] == 'Mohe']
print(res)

Output:

   Name  ID   City
0  Mohe  12  Delhi

Explanation: df.loc[df['Name'] == 'Mohe'] filters and returns only the row(s) where the 'Name' column has the value 'Mohe'.

 Creating Data Frame with different sources

Creating a Pandas DataFrame in Python can be achieved from various data sources.
1. From an Excel Spreadsheet:
To create a DataFrame from an Excel file, the read_excel() function from the Pandas library
is used.

import pandas as pd

# Assuming 'data.xlsx' is your Excel file and 'Sheet1' is the sheet name
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df_excel)
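
Note: read_excel() depends on an external engine; for .xlsx files Pandas uses the openpyxl package, so it may need to be installed first:

pip install openpyxl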

2. From .csv Files:
To create a DataFrame from a CSV file, the read_csv() function from the Pandas library is used.

import pandas as pd

# Assuming 'data.csv' is your CSV file
df_csv = pd.read_csv('data.csv')
print(df_csv)

3. From a Python Dictionary:
A DataFrame can be created from a dictionary where keys represent column names and values are lists of data for each column.

import pandas as pd

data_dict = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}
df_dict = pd.DataFrame(data_dict)
print(df_dict)

4. From a List of Lists or Tuples:
A DataFrame can be created from a list of lists or a list of tuples, where each inner list or tuple represents a row. Column names can be specified separately.

import pandas as pd

data_list = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'London'],
    ['Charlie', 35, 'Paris']
]
df_list = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])
print(df_list)

data_tuples = [
    ('David', 40, 'Tokyo'),
    ('Eve', 28, 'Berlin')
]
df_tuples = pd.DataFrame(data_tuples, columns=['Name', 'Age', 'City'])
print(df_tuples)

 Operations on Data frames

Once we create a data frame, we can do various operations on it. These operations
help us in analyzing the data or manipulating the data. The reader is advised to refer
to the list of all operations available in pandas at the following link:

https://pandas.pydata.org/docs/reference/

First we will create a data frame from a .csv file using read_csv() function as shown
below. This data frame will be the basis for our operations.

>>> df = pd.read_csv("f:\\python\\PANDAS\\empdata.csv")
>>> df
OUTPUT:
   empid           ename       sal         doj
0   1001      Ganesh Rao  10000.00  10-10-2000
1   1002      Anil Kumar  23000.50    3-3-2002
2   1003    Gaurav Gupta  18000.33    3-3-2002
3   1004    Hema Chandra  16500.50    3-3-2002
4   1005  Laxmi Prasanna  12000.75   10-8-2000
5   1006       Anant Nag   9999.99    9-9-1999

 Knowing number of rows and columns

 Using the .shape attribute:

 The shape attribute of a DataFrame returns a tuple representing its dimensions, where the first element is the number of rows and the second is the number of columns.
import pandas as pd

# Create a sample DataFrame
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(data)

# Get the number of rows and columns
num_rows, num_cols = df.shape

print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

 Retrieving Rows from Data Frame

In Python, when working with Pandas DataFrames, the head() method is used to
retrieve the first n rows of the DataFrame. By default, if no argument is provided, head()
returns the first 5 rows.

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
        'Age': [25, 30, 35, 28, 42, 50, 22],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami', 'Boston', 'Seattle']}
df = pd.DataFrame(data)

# Retrieve the first 5 rows (default)
first_five_rows = df.head()
print("First 5 rows:\n", first_five_rows)

# Retrieve the first 3 rows
first_three_rows = df.head(3)
print("\nFirst 3 rows:\n", first_three_rows)

 Call tail() on your DataFrame without any arguments to get the last five rows.

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
        'Age': [25, 30, 35, 28, 42, 50, 22],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami', 'Boston', 'Seattle']}
df = pd.DataFrame(data)

last_five_rows = df.tail()
print(last_five_rows)

last_three_rows = df.tail(3)
print(last_three_rows)

 Retrieving a Range of Rows

We can treat the data frame as an object and retrieve rows from it using slicing. For example, df[1:4] returns the rows at positions 1 through 3 (position 4 is excluded).
import pandas as pd

data = {'col1': [1, 2, 3, 4, 5, 6],
        'col2': ['A', 'B', 'C', 'D', 'E', 'F']}
df = pd.DataFrame(data)

# Select rows from index 1 (inclusive) to 4 (exclusive)
selected_rows = df[1:4]
print(selected_rows)

 To Retrieve Column Data


To get the column data, we can mention the column name as a subscript. For example, df.empid will display all employee id numbers. This can also be done using df['empid'].

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
        'Age': [25, 30, 35, 28, 42, 50, 22],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami', 'Boston', 'Seattle']}
df = pd.DataFrame(data)
print(df)
print(df['Age'])

 Retrieving Data from Multiple Columns

To retrieve data from multiple columns, we can provide the list of column names as a subscript to the data frame object as df[[list of column names]]. For example, to display the employee ids and their names, we can write:
>>> df[['empid', 'ename']]

empid ename
0 1001 Ganesh Rao
1 1002 Anil Kumar
2 1003 Gaurav Gupta
3 1004 Hema Chandra
4 1005 Laxmi Prasanna
5 1006 Anant Nag

 Finding Maximum and Minimum Values


It is possible to find the highest value using the max() method and the least value using the min() method. These methods are applied to columns containing numerical data. For example, to know the highest salary and the least salary, we can write:
>>> df['sal'].max()
23000.5
>>> df['sal'].min()
9999.9899999999998

 Displaying Statistical Information


The describe() method displays very important information like the number of values, average, standard deviation, minimum, maximum and the 25%, 50% and 75% percentiles. This information is highly useful for statistical analysis.

>>> [Link]()
empid sal
count 6.000000 6.000000
mean 1003.500000 14917.011667
std 1.870829 5181.037711
min 1001.000000 9999.990000
25% 1002.250000 10500.187500
50% 1003.500000 14250.625000
75% 1004.750000 17625.372500
max 1006.000000 23000.500000

Performing Queries on Data


We can retrieve rows based on a query. The query should be given as a subscript in the data frame object. For example, to retrieve all the rows where salary is greater than Rs. 10000, we can write:
>>> df[df.sal > 10000]
OUTPUT:
   empid           ename       sal        doj
1   1002      Anil Kumar  23000.50   3-3-2002
2   1003    Gaurav Gupta  18000.33   3-3-2002
3   1004    Hema Chandra  16500.50   3-3-2002
4   1005  Laxmi Prasanna  12000.75  10-8-2000

To retrieve the row where salary is maximum, we can write:

>>> df[df.sal == df.sal.max()]

OUTPUT:

   empid       ename      sal       doj
1   1002  Anil Kumar  23000.5  3-3-2002

Suppose we want to show data from some columns based on a query; we can mention the list of columns and then the query as: df[[column names]][query]. For example, to display only id numbers and names where the salary is greater than Rs. 10000, we can write:
>>> df[['empid', 'ename']][df.sal > 10000]

OUTPUT:
empid ename
1 1002 Anil Kumar
2 1003 Gaurav Gupta
3 1004 Hema Chandra
4 1005 Laxmi Prasanna

 Knowing the index range


The first column is called index column and it is generated in the data frame
automatically. We can retrieve the index information using index attribute as:
>>> [Link]
RangeIndex(start=0, stop=6, step=1)

 Setting a column as index

We know the index column is automatically generated. If we do not want this column and
we want to set a column from our data as index column, that is possible using set_index()
method. The column with unique values can be set as index column. For example, to
make ‘empid’ column as index column for our data frame, we can write:
>>> df1 = df.set_index('empid')
The above statement creates another data frame ‘df1’ that uses ‘empid’ as index column.
We can verify this by displaying df1 as:
>>> df1
OUTPUT:

                ename       sal         doj
empid
1001       Ganesh Rao  10000.00  10-10-2000
1002       Anil Kumar  23000.50    3-3-2002
1003     Gaurav Gupta  18000.33    3-3-2002
1004     Hema Chandra  16500.50    3-3-2002
1005   Laxmi Prasanna  12000.75   10-8-2000
1006        Anant Nag   9999.99    9-9-1999
We can find the empid being used as index column in the new data frame ‘df1’ as above.
However, the original data frame ‘df’ in this case is not modified and it still uses
automatically generated index column. If we want to modify the original ‘df’ and set
empid as index column, we should add ‘inplace=True’ as shown below:
>>> df.set_index('empid', inplace=True)
>>> df
OUTPUT:

                ename       sal         doj
empid
1001       Ganesh Rao  10000.00  10-10-2000
1002       Anil Kumar  23000.50    3-3-2002
1003     Gaurav Gupta  18000.33    3-3-2002
1004     Hema Chandra  16500.50    3-3-2002
1005   Laxmi Prasanna  12000.75   10-8-2000
1006        Anant Nag   9999.99    9-9-1999
Once we set 'empid' as index, it is possible to locate the data of any employee by passing the employee id number to the loc attribute as:
>>> df.loc[1004]
OUTPUT:
ename    Hema Chandra
sal           16500.5
doj          3-3-2002
Name: 1004, dtype: object

Resetting the Index


To reset the index value from ‘empid’ back to auto-generated index number, we can use
reset_index() method with inplace=True option as:
>>> df.reset_index(inplace=True)
>>> df
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-2000
1 1002 Anil Kumar 23000.50 3-3-2002
2 1003 Gaurav Gupta 18000.33 3-3-2002
3 1004 Hema Chandra 16500.50 3-3-2002
4 1005 Laxmi Prasanna 12000.75 10-8-2000
5 1006 Anant Nag 9999.99 9-9-1999

 Sorting the data

To sort the data coming from a .csv file, first let us read the data from the file into a data
frame using read_csv() function as:
>>> df = pd.read_csv("f:\\python\\PANDAS\\empdata.csv", parse_dates=['doj'])
Here, we are loading the data from the empdata.csv file and telling pandas to treat 'doj' as a date-type field using the parse_dates option. Now let us display the data frame as:
>>> print(df)
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2002-03-03
4 1005 Laxmi Prasanna 12000.75 2000-08-10
5 1006 Anant Nag 9999.99 1999-09-09
To sort the rows on ‘doj’ column into ascending order, we can use sort_values() method
as:
>>> df1 = df.sort_values('doj')
>>> df1
OUTPUT:

empid ename sal doj


5 1006 Anant Nag 9999.99 1999-09-09
4 1005 Laxmi Prasanna 12000.75 2000-08-10
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-03

2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2002-03-03

To sort in descending order, we should use an additional option ‘ascending = False’ as:
>>> df1 = df.sort_values('doj', ascending=False)
Sorting on multiple columns is also possible using the 'by' option in the sort_values() method. For example, we may want to sort on 'doj' in descending order and, within that, on 'sal' in ascending order. That means when two employees have the same 'doj', their salaries will be sorted into ascending order.
>>> df1 = df.sort_values(by=['doj', 'sal'], ascending=[False, True])
>>> df1
OUTPUT:
empid ename sal doj
3 1004 Hema Chandra 16500.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
1 1002 Anil Kumar 23000.50 2002-03-03
0 1001 Ganesh Rao 10000.00 2000-10-10
4 1005 Laxmi Prasanna 12000.75 2000-08-10
5 1006 Anant Nag 9999.99 1999-09-09

 Handling missing data

In many cases, the data that we receive from various sources may not be perfect. That means there may be some missing data. For example, the empdata1.csv file used below contains data where the employee name is missing in one row and the salary and date of joining are missing in another row.
When we convert the data into a data frame, the missing data is represented by NaN (Not a
Number). NaN is a default marker for the missing value. Please observe the following
data frame:
>>> import pandas as pd
>>> df = pd.read_csv("f:\\python\\PANDAS\\empdata1.csv")
>>> df

OUTPUT:
   empid           ename       sal       doj
0   1001      Ganesh Rao  10000.00  10-10-00
1   1002      Anil Kumar  23000.50  03-03-02
2   1003             NaN  18000.33  03-03-02
3   1004    Hema Chandra       NaN       NaN
4   1005  Laxmi Prasanna  12000.75  10-08-00
5   1006       Anant Nag   9999.99  09-09-99

We can use fillna() method to replace the Na or NaN values by a specified value. For
example, to fill the NaN values by 0, we can use:
>>> df1 = df.fillna(0)
>>> df1

OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
2 1003 0 18000.33 03-03-02
3 1004 Hema Chandra 0.00 0
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
But this is not so useful as it is filling any type of column with zero. We can fill each
column with a different value by passing the column names and the value to be used to
fill in the column. For example, to fill ‘ename’ column with ‘Name missing’, ‘sal’ with 0.0
and ‘doj’ with ’00-00-00’, we should supply these values as a dictionary to fillna() method
as shown below:
>>> df1 = df.fillna({'ename': 'Name missing', 'sal': 0.0, 'doj': '00-00-00'})
>>> df1
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
2 1003 Name missing 18000.33 03-03-02
3 1004 Hema Chandra 0.00 00-00-00
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
If we do not want the missing data and want to remove the rows having Na or NaN values, then we can use the dropna() method as:
>>> df1 = df.dropna()
>>> df1
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
In this way, filling the necessary data or eliminating the missing data is called ‘data
cleansing’.
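
Before choosing between fillna() and dropna(), it is often useful to count the missing values per column. The isnull() method combined with sum() does this; the counts below follow from the data frame shown above:

>>> df.isnull().sum()
empid    0
ename    1
sal      1
doj      1
dtype: int64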

 Merging and Joining DataFrames:

 Merging DataFrames using merge() Function


The merge() function in Python, part of the Pandas library, is used to combine two
DataFrames based on common columns or indices, similar to SQL JOIN operations.
Syntax:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Key Parameters:
left: The left DataFrame to merge.
right: The right DataFrame to merge.

how: Specifies the type of merge (join):
'inner' (default): Returns only the rows where the merge key(s) are present in both DataFrames.
'left': Returns all rows from the left DataFrame, and matching rows from the right DataFrame. If no match, NaN is filled.
'right': Returns all rows from the right DataFrame, and matching rows from the left DataFrame. If no match, NaN is filled.
'outer': Returns all rows when there is a match in either the left or right DataFrame. If no match, NaN is filled.
on: Column or list of column names to join on. These must be present in both
DataFrames.
left_on: Column or list of column names from the left DataFrame to use as join keys.
right_on: Column or list of column names from the right DataFrame to use as join keys.
left_index: If True, use the index of the left DataFrame as its join key(s).
right_index: If True, use the index of the right DataFrame as its join key(s).
suffixes: A tuple of string suffixes to apply to overlapping column names (not join keys)
from the left and right DataFrames, respectively. Default is ('_x', '_y').

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})

# Perform an inner merge on the 'key' column
merged_df_inner = pd.merge(df1, df2, on='key', how='inner')
print("Inner Merge:\n", merged_df_inner)

# Perform a left merge
merged_df_left = pd.merge(df1, df2, on='key', how='left')
print("\nLeft Merge:\n", merged_df_left)

 Joining DataFrames using join() Method


The join() function in Pandas is used to combine columns of two or more DataFrames
based on their indices or a specified column. It is particularly useful when you need
to merge DataFrames where one or both have meaningful indices that serve as the
key for joining.

import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({'value1': [10, 20, 30]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [100, 200]}, index=['B', 'C'])

# Perform a left join (default behavior)
joined_df = df1.join(df2)
print("Left Join (default):")
print(joined_df)

# Perform an inner join
inner_joined_df = df1.join(df2, how='inner')
print("\nInner Join:")
print(inner_joined_df)

# Perform an outer join
outer_joined_df = df1.join(df2, how='outer')
print("\nOuter Join:")
print(outer_joined_df)

# Perform a right join
right_joined_df = df1.join(df2, how='right')
print("\nRight Join:")
print(right_joined_df)

# Joining on a column instead of index (requires setting index on one DataFrame)
df3 = pd.DataFrame({'key': ['X', 'Y', 'Z'], 'data1': [1, 2, 3]})
df4 = pd.DataFrame({'key': ['Y', 'Z', 'W'], 'data2': [10, 20, 30]})

# Set 'key' as index for df4 to join on it
df4_indexed = df4.set_index('key')

# Join df3 with df4_indexed on the 'key' column of df3 and the index of df4_indexed
joined_on_column = df3.join(df4_indexed, on='key')
print("\nJoin on a specific column:")
print(joined_on_column)

 Grouping and Aggregating Data

Grouping and aggregating data in Pandas is a fundamental process for data analysis
and involves the "split-apply-combine" strategy. This strategy allows for the summarization
and transformation of data based on specific criteria.
1. Grouping Data:
The primary method for grouping data in Pandas is the groupby() function. This
function splits a DataFrame into groups based on the unique values within one or more
specified columns.
import pandas as pd

data = {'Category': ['A', 'B', 'A', 'C', 'B'],
        'Value': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)

# Grouping by a single column
grouped_by_category = df.groupby('Category')

# Grouping by multiple columns (e.g., if you had another column 'Subcategory')
# grouped_by_multiple = df.groupby(['Category', 'Subcategory'])

The result of groupby() is a DataFrameGroupBy object, which is an intermediate object that holds the grouped data but does not yet display the aggregated results.

2. Aggregating Data:
After grouping, you can apply aggregation functions to each group to summarize the
data. Common aggregation functions include:
.sum(): Calculates the sum of values within each group.
.mean(): Calculates the average of values within each group.
.count(): Counts the number of non-null values within each group.
.min(): Finds the minimum value within each group.
.max(): Finds the maximum value within each group.
.median(): Calculates the median value within each group.
.std(): Calculates the standard deviation within each group.
.agg(): Allows applying multiple aggregation functions simultaneously or custom
functions.

import pandas as pd

data = {'Category': ['A', 'B', 'A', 'C', 'B'],
        'Value': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)

# Grouping by a single column
grouped_by_category = df.groupby('Category')

# Applying an aggregation function to the grouped object
sum_by_category = grouped_by_category['Value'].sum()
print("Sum by Category:\n", sum_by_category)

# Applying multiple aggregations using .agg()
multi_agg_by_category = grouped_by_category['Value'].agg(['sum', 'mean', 'count'])
print("\nMultiple Aggregations by Category:\n", multi_agg_by_category)

# Applying different aggregations to different columns (if applicable)
# df.groupby('Category').agg({'Value1': 'sum', 'Value2': 'mean'})
