Python Unit 4
Python applications will often use packages and modules that don’t come
as part of the standard library.
Applications will sometimes need a specific version of a library.
This means it may not be possible for one Python installation to meet the
requirements of every application.
If application A needs version 1.0 of a particular module but application B
needs version 2.0, then the requirements are in conflict and installing
either version 1.0 or 2.0 will leave one application unable to run.
The solution for this problem is to create a virtual environment, a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.
Different applications can then use different virtual environments.
To resolve the earlier example of conflicting requirements, application A
can have its own virtual environment with version 1.0 installed while
application B has another virtual environment with version 2.0.
If application B requires a library be upgraded to version 3.0, this will not
affect application A’s environment.
To create a virtual environment, decide upon a directory where you want
to place it, and run the venv module as a script with the directory path:

python -m venv tutorial-env

This will create the tutorial-env directory if it doesn't exist, and also create
directories inside it containing a copy of the Python interpreter, the
standard library, and various supporting files.
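The same creation step can also be exercised from Python itself. A minimal sketch using the stdlib venv module (the target directory below is a temporary one chosen for illustration):

```python
import os
import tempfile
import venv

# Create the environment under a temporary directory;
# with_pip=False skips bootstrapping pip, which keeps this fast.
target = os.path.join(tempfile.mkdtemp(), "tutorial-env")
venv.create(target, with_pip=False)

# The directory now contains pyvenv.cfg plus interpreter and support files.
print(os.path.exists(os.path.join(target, "pyvenv.cfg")))  # True
```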
BCA SEM-5 PYTHON Unit-4
Once the environment exists, activate it; on Windows, run:

tutorial-env\Scripts\activate
You can install, upgrade, and remove packages using a program called
pip. By default pip will install packages from the Python Package Index.
pip has a number of subcommands: “search”, “install”, “uninstall”,
“freeze”, etc.
You can install the latest version of a package by specifying the package's name, for example (using requests as an illustrative package):

pip install requests
pip list will display all of the packages installed in the virtual environment:
Before you can access MySQL databases from Python, you must install one (or more) of the following packages in your Python environment.
MySQL-python:
mysql-connector-python:
PyMySQL:
All three of these packages use Python's portable SQL database API (DB-API). This means that if you switch from one module to another, you can reuse almost all of your existing code (the code sample below demonstrates how to do this).
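Since the original sample is not reproduced in these notes, here is a hedged sketch of the idea using the stdlib sqlite3 module, which follows the same DB-API; with PyMySQL or mysql-connector-python only the connect() call (and the parameter placeholder style, `%s` instead of `?`) changes:

```python
# DB-API portability: the connect/cursor/execute/fetch pattern below is
# identical for MySQL drivers; only the connect() arguments differ.
import sqlite3

conn = sqlite3.connect(":memory:")  # swap for e.g. pymysql.connect(...)
cur = conn.cursor()
cur.execute("CREATE TABLE emp (eid INTEGER, name TEXT, sal REAL)")
cur.execute("INSERT INTO emp VALUES (?, ?, ?)", (10, "dixita", 2000.0))
conn.commit()

cur.execute("SELECT name, sal FROM emp WHERE eid = ?", (10,))
print(cur.fetchone())  # ('dixita', 2000.0)
conn.close()
```

Note the parameter style: sqlite3 uses `?` placeholders, while the MySQL drivers use `%s`; everything else in the calling code stays the same.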
You need to know the following details of the MySQL server to perform the connection from Python.
Username – i.e., the username that you use to work with MySQL Server. The default username for the MySQL database is root.
Password – the password given by the user at the time of installing the MySQL database. If your root account was set up without a password, this can be left empty.
Host Name – the server name or IP address on which MySQL is running. If you are running on localhost, then you can use localhost or its IP, i.e. 127.0.0.1.
Database Name – the database name to which you want to connect. Here we are using a database named 'Emp' because we have already created it for our example.
1. connect() : This method establishes a connection to the MySQL server and returns a connection object. It takes the host, user, password and database name as arguments.
2. cursor() : This method creates a cursor object that is capable of executing SQL queries on the database.
3. execute() : This method is used for executing an SQL query on the database. It takes an SQL query (as a string) as an argument.
4. fetchone() : This method retrieves the next row of a query result set and returns a single sequence, or None if no more rows are available.
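A minimal sketch of the four calls above, shown with the stdlib sqlite3 driver so it runs without a MySQL server; with a MySQL driver, only the connect() line changes:

```python
# The connect -> cursor -> execute -> fetchone sequence, demonstrated with
# sqlite3; a MySQL driver follows the same DB-API steps.
import sqlite3

conn = sqlite3.connect(":memory:")   # 1. connect()
cur = conn.cursor()                  # 2. cursor()
cur.execute("SELECT 1 + 1")          # 3. execute()
row = cur.fetchone()                 # 4. fetchone()
print(row)  # (2,)
conn.close()
```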
Creating a Table
Inserting Record
import mysql.connector
pydb = mysql.connector.connect(host="localhost", user="root", passwd="", database=db_name)
pycursor = pydb.cursor()
pycursor.execute("insert into emp(eid, name, sal) values(10, 'dixita', 2000)")
pydb.commit()
print(pycursor.rowcount, "Row Inserted Successfully")
pydb.close()
Delete Record
You can delete records from an existing table by using the "DELETE FROM" statement (the WHERE clause below is illustrative):

import mysql.connector
pydb = mysql.connector.connect(host="localhost", user="root", passwd="", database=db_name)
pycursor = pydb.cursor()
pycursor.execute("DELETE FROM emp WHERE eid = 10")
pydb.commit()
pydb.close()
Selecting Records

pydb = mysql.connector.connect(host="localhost", user="root", passwd="", database=db_name)
pycursor = pydb.cursor()
pycursor.execute("SELECT * FROM emp")
res = pycursor.fetchall()
for x in res:
    print(x)
The process of installing and configuring Pandas, including setting up a virtual environment,
involves several steps:
1. Install PIP (if not already present):
PIP is typically included with Python installations from version 3.4 onwards. To verify if PIP
is installed and to check its version, open your terminal or command prompt and execute:
pip --version
If PIP is not found, you may need to reinstall Python or follow platform-specific instructions
to install PIP.
2. Create a Virtual Environment:
Navigate to your desired project directory in the terminal or command prompt and create a
virtual environment using the venv module:
python -m venv my_pandas_env
Replace my_pandas_env with your preferred name for the virtual environment.
3. Activate the Virtual Environment:
On Windows:
my_pandas_env\Scripts\activate
On macOS/Linux:
source my_pandas_env/bin/activate
Once activated, the name of your virtual environment will typically appear in your terminal
prompt, indicating that you are working within it.
4. Install Pandas 2.0:
With the virtual environment active, use PIP to install Pandas. To specifically install version
2.0, execute:
pip install pandas==2.0.0
(Note: Replace 2.0.0 with the specific patch version if required, e.g., 2.0.3.)
5. Verify Installation:
To confirm that Pandas has been installed correctly within your virtual environment, open a
Python interpreter or create a Python script and execute the following:
import pandas as pd
print(pd.__version__)
This will print the installed version of Pandas, confirming a successful installation.
Pandas is an open-source Python library used for working with relational or labeled data
in an easy and intuitive way. It provides powerful data structures and a wide range of
operations for manipulating numerical data and time series. Pandas also offers tools for
cleaning, processing and analyzing data efficiently. It is one of the most popular libraries
for data analysis in Python and primarily supports two core data structures:
Series
DataFrame
Series
A Series is a one-dimensional array-like object that can store any data type such as
integers, strings, floats, or even Python objects. It comes with labels (called an index).
Syntax
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False)
Parameters:
data: Array-like, dict or scalar – Input data.
index (Optional): Labels for the axis.
dtype (Optional): Data type of the Series.
name (Optional): Name of the Series.
copy (Bool): Copy data if True.
Returns: A pandas.Series object containing the provided data with an associated index.
Example 1: Series holding the char data type.
import pandas as pd
a = ['g', 'e', 'e', 'k', 's']
res = pd.Series(a)
print(res)
Output:
0    g
1    e
2    e
3    k
4    s
dtype: object
Explanation: We pass the list a into pd.Series(a), which converts it into a Series (a column-like structure) where each item gets a default index starting from 0, automatically assigned by Pandas.
Example 2: Series holding the Int data type.
import pandas as pd
a = [1,2,3,4,5]
res = pd.Series(a)
print(res)
Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
Explanation: We pass the list a into pd.Series(a), which converts it into a Series (a column-like structure) where each number gets a default index starting from 0, automatically assigned by Pandas.
Example 3: Series holding a dictionary.
import pandas as pd
a = {'Id': 1013, 'Name': 'Mohe', 'State': 'Manipur', 'Age': 24}
res = pd.Series(a)
print(res)
Output:
DataFrame
A DataFrame is a two-dimensional, tabular data structure with labeled rows and columns.
Example 1: Creating a DataFrame from a list.
import pandas as pd
a = ['Python', 'Pandas', 'Numpy']
df = pd.DataFrame(a, columns=['Tech'])
print(df)
Output:
     Tech
0  Python
1  Pandas
2   Numpy

Example 2: Creating a DataFrame from a dictionary of lists.
a={
'Name': ['Tom', 'Nick', 'Krish', 'Jack'],
'Age': [20, 21, 19, 18]
}
res = pd.DataFrame(a)
print(res)
Output:
    Name  Age
0    Tom   20
1   Nick   21
2  Krish   19
3   Jack   18

Explanation: The dictionary keys become the column names and the values become the column data. Pandas assigns a default integer index starting from 0 for the rows.
Example 3: Selecting columns in a DataFrame.
import pandas as pd
a={
'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(a)
print(df[['Name', 'Qualification']])
Output:
Explanation: We create a DataFrame df from the dictionary a, then select and print only
the columns 'Name' and 'Qualification' by passing their names in a list to df[]. This returns
a new DataFrame with just those two columns.
Accessing columns and rows in a DataFrame
A DataFrame in Pandas is a 2D tabular structure where you can easily access and
manipulate data by selecting specific columns or rows. You can extract one or more
columns using column names and filter rows using labels or conditions.
Example 1: We can access one or more columns in a DataFrame using square brackets.
import pandas as pd
a={
'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'City': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']
}
df = pd.DataFrame(a)
print(df['Name'])
print(df[['Name', 'City']])
Output:
Explanation:
df['Name'] returns a Series containing values from the 'Name' column.
df[['Name', 'City']] returns a new DataFrame containing only the specified columns.
Example 2: We can use .loc[] to access rows by index or filter them using conditions.
import pandas as pd
a={
'Name': ['Mohe', 'Shyni', 'Parul', 'Sam'],
'ID': [12, 43, 54, 32],
'City': ['Delhi', 'Kochi', 'Pune', 'Patna']
}
df = pd.DataFrame(a)
res = df.loc[df['Name'] == 'Mohe']
print(res)
Output:
Explanation: df.loc[df['Name'] == 'Mohe'] filters and returns only the row(s) where the 'Name' column has the value 'Mohe'.
Creating a Pandas DataFrame in Python can be achieved from various data sources.
1. From an Excel Spreadsheet:
To create a DataFrame from an Excel file, the read_excel() function from the Pandas library
is used.
import pandas as pd
# Assuming 'data.xlsx' is your Excel file and 'Sheet1' is the sheet name
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df_excel)
2. From a Dictionary:
To create a DataFrame from a dictionary of lists, pass the dictionary to the DataFrame() constructor.
import pandas as pd
data_dict = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']
}
df_dict = pd.DataFrame(data_dict)
print(df_dict)
3. From a List of Lists:
import pandas as pd
data_list = [
['Alice', 25, 'New York'],
['Bob', 30, 'London'],
['Charlie', 35, 'Paris']
]
df_list = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])
print(df_list)
4. From a List of Tuples:
data_tuples = [
('David', 40, 'Tokyo'),
('Eve', 28, 'Berlin')
]
df_tuples = pd.DataFrame(data_tuples, columns=['Name', 'Age', 'City'])
print(df_tuples)
Operations on Data frames
Once we create a data frame, we can do various operations on it. These operations
help us in analyzing the data or manipulating the data. The reader is advised to refer to the list of all operations available in the official pandas API documentation.
First we will create a data frame from a .csv file using the read_csv() function as shown below (the file name emp.csv is illustrative). This data frame will be the basis for our operations.
>>> df = pd.read_csv("f:\\python\\PANDAS\\emp.csv")
>>> df
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
5 1006 Anant Nag 9999.99 09-09-99
We can find the number of rows and columns using the shape attribute of the data frame:
num_rows, num_cols = df.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")
In Python, when working with Pandas DataFrames, the head() method is used to
retrieve the first n rows of the DataFrame. By default, if no argument is provided, head()
returns the first 5 rows.
print(df.head())
The tail() method similarly returns the last n rows; call tail() on your DataFrame without any arguments to get the last five rows.
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
'Age': [25, 30, 35, 28, 42, 50, 22],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami', 'Boston',
'Seattle']}
df = [Link](data)
last_five_rows = df.tail()
print(last_five_rows)
last_three_rows = df.tail(3)
print(last_three_rows)
We can treat the data frame as an object and retrieve rows from it using slicing. For example, df[1:4] returns the rows at positions 1 through 3 (position 4 is excluded).
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
'Age': [25, 30, 35, 28, 42, 50, 22],
'City': ['New York', 'Los Angeles', 'Chicago',
'Houston', 'Miami', 'Boston', 'Seattle']}
df = pd.DataFrame(data)
print(df)
print(df[1:4])   # slicing: rows at positions 1 to 3
print(df['Age'])
We can also select multiple columns by passing a list of column names. For the emp data frame:
>>> df[['empid', 'ename']]
OUTPUT:
empid ename
0 1001 Ganesh Rao
1 1002 Anil Kumar
2 1003 Gaurav Gupta
3 1004 Hema Chandra
4 1005 Laxmi Prasanna
5 1006 Anant Nag
The describe() method displays summary statistics for the numeric columns:
>>> df.describe()
OUTPUT:
empid sal
count 6.000000 6.000000
mean 1003.500000 14917.011667
std 1.870829 5181.037711
min 1001.000000 9999.990000
25% 1002.250000 10500.187500
50% 1003.500000 14250.625000
75% 1004.750000 17625.372500
max 1006.000000 23000.500000
Suppose, we want to show data from some columns based on a query, we can mention
the list of columns and then the query as: df[[column names]][query].
For example, to
display only id numbers and names where the salary is greater than Rs. 10000, we can
write:
>>> df[['empid', 'ename']][df['sal'] > 10000]
OUTPUT:
empid ename
1 1002 Anil Kumar
2 1003 Gaurav Gupta
3 1004 Hema Chandra
4 1005 Laxmi Prasanna
We know the index column is automatically generated. If we do not want this column and
we want to set a column from our data as index column, that is possible using set_index()
method. A column with unique values can be set as the index column. For example, to make the 'empid' column the index column for our data frame, we can write:
>>> df1 = df.set_index('empid')
The above statement creates another data frame ‘df1’ that uses ‘empid’ as index column.
We can verify this by displaying df1 as:
>>> df1
OUTPUT:
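Since the df1 display is not reproduced above, here is a self-contained sketch of the same idea on a small in-memory frame (the values mirror the emp example):

```python
import pandas as pd

# Build a small frame mirroring the emp example, then promote 'empid' to the index.
df = pd.DataFrame({
    "empid": [1001, 1002, 1003],
    "ename": ["Ganesh Rao", "Anil Kumar", "Gaurav Gupta"],
})
df1 = df.set_index("empid")

# Rows can now be looked up directly by employee id.
print(df1.loc[1002, "ename"])  # Anil Kumar
print(df1.index.name)          # empid
```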
To sort the data coming from a .csv file, first let us read the data from the file into a data
frame using read_csv() function as:
>>> df = pd.read_csv("f:\\python\\PANDAS\\emp.csv", parse_dates=['doj'])
Here, we are loading the data from the emp.csv file and also telling pandas to treat 'doj' as a date-type field using the parse_dates option. Now let us display the data frame as:
>>> print(df)
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2002-03-03
4 1005 Laxmi Prasanna 12000.75 2000-08-10
5 1006 Anant Nag 9999.99 1999-09-09
To sort the rows on ‘doj’ column into ascending order, we can use sort_values() method
as:
>>> df1 = df.sort_values('doj')
>>> df1
OUTPUT:
2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2002-03-03
To sort in descending order, we should use an additional option ‘ascending = False’ as:
>>> df1 = df.sort_values('doj', ascending=False)
Sorting on multiple columns is also possible. This can be done using an option ‘by’ in the
sort_values() method. For example, suppose we want to sort on 'doj' in descending order and within that on 'sal' in ascending order. That means, when two employees have the same 'doj', their salaries will be sorted into ascending order.
>>> df1 = df.sort_values(by=['doj', 'sal'], ascending=[False, True])
>>> df1
OUTPUT:
empid ename sal doj
3 1004 Hema Chandra 16500.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
1 1002 Anil Kumar 23000.50 2002-03-03
0 1001 Ganesh Rao 10000.00 2000-10-10
4 1005 Laxmi Prasanna 12000.75 2000-08-10
5 1006 Anant Nag 9999.99 1999-09-09
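The multi-column sort above can be sketched on a small in-memory frame (the values mirror the emp example):

```python
import pandas as pd

df = pd.DataFrame({
    "ename": ["Hema Chandra", "Gaurav Gupta", "Anil Kumar", "Ganesh Rao"],
    "sal": [16500.50, 18000.33, 23000.50, 10000.00],
    "doj": pd.to_datetime(["2002-03-03", "2002-03-03", "2002-03-03", "2000-10-10"]),
})

# doj descending; within equal doj values, sal ascending.
df1 = df.sort_values(by=["doj", "sal"], ascending=[False, True])
print(df1["ename"].tolist())
# ['Hema Chandra', 'Gaurav Gupta', 'Anil Kumar', 'Ganesh Rao']
```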
In many cases, the data that we receive from various sources may not be perfect. That
means there may be some missing data. For example, the emp2.csv file (illustrative name) contains the following data, where the employee name is missing in one row and the salary and date of joining are missing in another row, as shown below:
When we convert the data into a data frame, the missing data is represented by NaN (Not a
Number). NaN is a default marker for the missing value. Please observe the following
data frame:
>>> import pandas as pd
>>> df = pd.read_csv("f:\\python\\PANDAS\\emp2.csv")
>>> df
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
2 1003 NaN 18000.33 03-03-02
3 1004 Hema Chandra NaN NaN
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
We can use fillna() method to replace the Na or NaN values by a specified value. For
example, to fill the NaN values by 0, we can use:
>>> df1 = df.fillna(0)
>>> df1
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
2 1003 0 18000.33 03-03-02
3 1004 Hema Chandra 0.00 0
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
But this is not very useful, as it fills every type of column with zero. We can fill each
column with a different value by passing the column names and the value to be used to
fill in the column. For example, to fill ‘ename’ column with ‘Name missing’, ‘sal’ with 0.0
and ‘doj’ with ’00-00-00’, we should supply these values as a dictionary to fillna() method
as shown below:
>>> df1 = df.fillna({'ename': 'Name missing', 'sal': 0.0, 'doj': '00-00-00'})
>>> df1
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
2 1003 Name missing 18000.33 03-03-02
3 1004 Hema Chandra 0.00 00-00-00
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
If we do not want the missing data and want to remove those rows having Na or NaN
values, then we can use dropna() method as:
>>> df1 = df.dropna()
>>> df1
OUTPUT:
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
In this way, filling the necessary data or eliminating the missing data is called ‘data
cleansing’.
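A runnable sketch of both cleansing approaches on a small in-memory frame (the values follow the example above; None marks the missing entries):

```python
import pandas as pd

df = pd.DataFrame({
    "empid": [1003, 1004, 1005],
    "ename": [None, "Hema Chandra", "Laxmi Prasanna"],
    "sal": [18000.33, None, 12000.75],
})

# fillna() with a dict fills each column's missing values with its own default.
filled = df.fillna({"ename": "Name missing", "sal": 0.0})
print(filled.loc[0, "ename"])  # Name missing

# dropna() removes every row that still has a missing value.
dropped = df.dropna()
print(len(dropped))  # 1
```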
Merging and Joining Data Frames
Pandas provides the merge() function to combine two DataFrames on common columns or indexes, and the join() method to combine them on their indexes.
Key Parameters:
left: The left DataFrame to merge.
right: The right DataFrame to merge.
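A minimal, hedged sketch of merge() with the left/right parameters above (the frame and column names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1"], "A": [1, 2]})
right = pd.DataFrame({"key": ["K0", "K1"], "B": [3, 4]})

# Inner merge on the shared 'key' column.
merged = pd.merge(left, right, on="key")
print(merged.columns.tolist())  # ['key', 'A', 'B']
print(len(merged))              # 2
```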
The join() method combines DataFrames on their indexes, or on a column of the calling frame. For example (df3 and df4_indexed are illustrative frames):

import pandas as pd

# Illustrative frames: df3 has a 'key' column; df4_indexed is indexed by key
df3 = pd.DataFrame({'key': ['K0', 'K1'], 'A': [1, 2]})
df4_indexed = pd.DataFrame({'B': [3, 4]}, index=['K0', 'K1'])

# Join df3 with df4_indexed on the 'key' column of df3 and the index of df4_indexed
joined_on_column = df3.join(df4_indexed, on='key')
print("\nJoin on a specific column:")
print(joined_on_column)
Grouping and aggregating data in Pandas is a fundamental process for data analysis
and involves the "split-apply-combine" strategy. This strategy allows for the summarization
and transformation of data based on specific criteria.
1. Grouping Data:
The primary method for grouping data in Pandas is the groupby() function. This
function splits a DataFrame into groups based on the unique values within one or more
specified columns.
import pandas as pd
data = {'Category': ['A', 'B', 'A', 'C', 'B'],
'Value': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)
grouped = df.groupby('Category')   # split the frame into groups by Category
2. Aggregating Data:
After grouping, you can apply aggregation functions to each group to summarize the
data. Common aggregation functions include:
.sum(): Calculates the sum of values within each group.
.mean(): Calculates the average of values within each group.
.count(): Counts the number of non-null values within each group.
.min(): Finds the minimum value within each group.
.max(): Finds the maximum value within each group.
.median(): Calculates the median value within each group.
.std(): Calculates the standard deviation within each group.
.agg(): Allows applying multiple aggregation functions simultaneously or custom
functions.
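The split-apply-combine steps above can be sketched with the same Category/Value data:

```python
import pandas as pd

data = {"Category": ["A", "B", "A", "C", "B"],
        "Value": [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)

# Split by Category, then apply an aggregation to each group.
sums = df.groupby("Category")["Value"].sum()
print(sums["A"])  # 30

# .agg() applies several aggregations at once.
stats = df.groupby("Category")["Value"].agg(["sum", "mean", "count"])
print(stats.loc["B", "mean"])  # 22.5
```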