Pandas Cheat Sheet
by Justin1209 (Justin1209) via [Link]/101982/cs/21202/
Import the Pandas Module

import pandas as pd


Create a DataFrame

# Method 1
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe'],
    'address': ['13 Main St.', '46 Maple Ave.'],
    'age': [34, 28]
})
# Method 2
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '9 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])


Loading and Saving CSVs

# Load a CSV File into a DataFrame
df = pd.read_csv('my-csv-file.csv')
# Save a DataFrame to a CSV File
df.to_csv('new-csv-file.csv')
# Load a DataFrame in Chunks (for large Datasets)
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)


Inspect a DataFrame

df.head(5)   First 5 rows
df.info()    Statistics of columns (row count, null values, datatype)


DataFrame for Select Columns / Rows

df = pd.DataFrame([
    ['January', 100, 100, 23, 100],
    ['February', 51, 45, 145, 45],
    ['March', 81, 96, 65, 96],
    ['April', 80, 80, 54, 180],
    ['May', 51, 54, 54, 154],
    ['June', 112, 109, 79, 129]],
    columns=['month', 'east', 'north', 'south', 'west']
)


Select Columns

# Select one Column
clinic_north = df.north
# Select multiple Columns
clinic_north_south = df[['north', 'south']]
Make sure that you have a double set of brackets [[ ]], or this command won't work!


Reshape (for Scikit)

nums = np.array(range(1, 11))
-> [ 1  2  3  4  5  6  7  8  9 10]
nums = nums.reshape(-1, 1)
-> [[ 1]
    [ 2]
    [ 3]
    [ 4]
    [ 5]
    [ 6]
    [ 7]
    [ 8]
    [ 9]
    [10]]
You can think of reshape() as rotating this array. Rather than one big row of numbers, nums is now a big column of numbers - there's one number in each row.
--> Reshape values for Scikit-learn: clinic_north.values.reshape(-1, 1)
(see the scikit-learn sketch after the Converting Datatypes entry below)


Converting Datatypes

# Convert argument to numeric type
pd.to_numeric(arg, errors="raise")
errors:
"raise"  -> raise an exception
"coerce" -> invalid parsing will be set as NaN
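A minimal sketch of the errors="coerce" behaviour, using a made-up Series (the values are only illustrative):

import pandas as pd

s = pd.Series(['1', '2', 'three', '4'])
# 'three' cannot be parsed, so it becomes NaN instead of raising
print(pd.to_numeric(s, errors='coerce'))
# 0    1.0
# 1    2.0
# 2    NaN
# 3    4.0
# dtype: float64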
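Tying the Reshape and Select Columns entries together, a self-contained sketch of feeding a reshaped column to scikit-learn. This assumes scikit-learn is installed; LinearRegression and the toy numbers are illustrative and not part of the original sheet.

import pandas as pd
from sklearn.linear_model import LinearRegression  # assumption: scikit-learn is available

df = pd.DataFrame({'north': [100, 45, 96], 'east': [100, 51, 81]})
X = df.north.values.reshape(-1, 1)   # 2-D feature array: one column, one sample per row
y = df.east.values                   # 1-D target array
model = LinearRegression().fit(X, y)
print(model.predict([[60]]))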
Select Rows

# Select one Row
march = df.iloc[2]
# Select multiple Rows
jan_feb_march = df.iloc[:3]
feb_march_april = df.iloc[1:4]
may_june = df.iloc[-2:]
# Select Rows with Logic
january = df[df.month == 'January']
-> <, >, <=, >=, !=, ==
march_april = df[(df.month == 'March') | (df.month == 'April')]
-> &, |
january_february_march = df[df.month.isin(['January', 'February', 'March'])]
-> column_name.isin([" ", " "])

Selecting a subset of a DataFrame often results in non-consecutive indices.
Using .reset_index() will create a new DataFrame and move the old indices into a new column called index.
Use .reset_index(drop=True) if you don't need the index column.
Use .reset_index(inplace=True) to prevent a new DataFrame from being created.
(see the reset_index() sketch further below)


Adding a Column

df = pd.DataFrame([
    [1, '3 inch screw', 0.5, 0.75],
    [2, '2 inch nail', 0.10, 0.25],
    [3, 'hammer', 3.00, 5.50],
    [4, 'screwdriver', 2.50, 3.00]
    ],
    columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)
# Add a Column with specified row-values
df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']
# Add a Column with the same value in every row
df['Is taxed?'] = 'Yes'
# Add a Column with a calculation
df['Revenue'] = df['Price'] - df['Cost to Manufacture']


Performing Column Operation

df = pd.DataFrame([
    ['JOHN SMITH', 'john.smith@gmail.com'],
    ['Jane Doe', 'jdoe@yahoo.com'],
    ['joe schmo', 'joeschmo@hotmail.com']
    ],
    columns=['Name', 'Email'])
# Changing a column with an Operation
df['Name'] = df.Name.apply(str.lower)
-> str.lower, str.upper
# Perform a lambda Operation on a Column
get_last_name = lambda x: x.split(" ")[-1]
df['last_name'] = df.Name.apply(get_last_name)


Performing an Operation on Multiple Columns

df = pd.DataFrame([
    ["Apple", 1.00, "No"],
    ["Milk", 4.20, "No"],
    ["Paper Towels", 5.00, "Yes"],
    ["Light Bulbs", 3.75, "Yes"]
    ],
    columns=["Item", "Price", "Is taxed?"])
# Lambda Function
df['Price with Tax'] = df.apply(lambda row:
    row['Price'] * 1.075
    if row['Is taxed?'] == 'Yes'
    else row['Price'],
    axis=1
)
We apply a lambda to rows, as opposed to columns, when we want to perform functionality that needs to access more than one column at a time.
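Not from the original sheet, but worth noting: the same multi-column logic can usually be vectorized with np.where, which tends to be faster than a row-wise apply on large frames. The data below is illustrative.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['Apple', 'Milk'],
                   'Price': [1.00, 4.20],
                   'Is taxed?': ['No', 'Yes']})
# np.where(condition, value if True, value if False)
df['Price with Tax'] = np.where(df['Is taxed?'] == 'Yes',
                                df['Price'] * 1.075,
                                df['Price'])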
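Picking up the .reset_index() notes from the Select Rows entry, a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'month': ['January', 'February', 'March', 'April']})
subset = df[df.month.isin(['February', 'April'])]
print(subset.index.tolist())                          # [1, 3]  <- non-consecutive
print(subset.reset_index().columns.tolist())          # ['index', 'month']  <- old indices kept as a column
print(subset.reset_index(drop=True).index.tolist())   # [0, 1]  <- old indices dropped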
Rename Columns

# Method 1
df.columns = ['NewName_1', 'NewName_2', 'NewName_3', '...']
# Method 2
df.rename(columns={
    'OldName_1': 'NewName_1',
    'OldName_2': 'NewName_2'
    }, inplace=True
)
Using inplace=True lets us edit the original DataFrame.


Column Statistics

Mean = Average          df.column_name.mean()
Median                  df.column_name.median()
Minimal Value           df.column_name.min()
Maximum Value           df.column_name.max()
Number of Values        df.column_name.count()
Unique Values           df.column_name.nunique()
Standard Deviation      df.column_name.std()
List of Unique Values   df.column_name.unique()


Calculating Aggregate Functions

# Group By
grouped = df.groupby(['col1', 'col2']).col3.measurement().reset_index()
# -> group by column1 and column2, calculate values of column3
# Percentile
high_earners = df.groupby('category').wage.apply(lambda x: np.percentile(x, 75)).reset_index()
# np.percentile can calculate any percentile over an array of values
Don't forget reset_index()
(a concrete sketch appears further below)


Pivot Tables

orders = pd.read_csv('orders.csv')
shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()
shoe_counts_pivot = shoe_counts.pivot(
    index = 'shoe_type',
    columns = 'shoe_color',
    values = 'id').reset_index()
Don't forget reset_index() at the end of a groupby operation.
We have to build a temporary table where we group by the columns we want to include in the pivot table.


Merge (Same Column Name)

sales = pd.read_csv('sales.csv')
targets = pd.read_csv('targets.csv')
men_women = pd.read_csv('men_women_sales.csv')
# Method 1
sales_targets = pd.merge(sales, targets, how=" ")
# how: "inner" (default), "outer", "left", "right"
# Method 2 (Method Chaining)
all_data = sales.merge(targets).merge(men_women)


Series vs. Dataframes

# Dataframe and Series
print(type(clinic_north))
# <class 'pandas.core.series.Series'>
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
print(type(clinic_north_south))
# <class 'pandas.core.frame.DataFrame'>

In Pandas
- a series is a one-dimensional object that contains any type of data.
- a dataframe is a two-dimensional object that can hold multiple columns of different types of data.
A single column of a dataframe is a series, and a dataframe is a container of two or more series objects.
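A quick way to see the difference (column names are illustrative): single brackets return a Series, double brackets return a DataFrame.

import pandas as pd

df = pd.DataFrame({'north': [100, 45], 'south': [23, 145]})
print(type(df['north']))     # <class 'pandas.core.series.Series'>
print(type(df[['north']]))   # <class 'pandas.core.frame.DataFrame'>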
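The .measurement() placeholder in Calculating Aggregate Functions stands for any aggregate (mean, max, count, ...). A concrete sketch with made-up columns:

import pandas as pd

df = pd.DataFrame({'shoe_type': ['boot', 'boot', 'sandal'],
                   'shoe_color': ['black', 'brown', 'red'],
                   'price': [80, 100, 30]})
# average price per shoe_type, back to a flat DataFrame
avg_price = df.groupby('shoe_type').price.mean().reset_index()
print(avg_price)
#   shoe_type  price
# 0      boot   90.0
# 1    sandal   30.0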
Inner Merge (Different Column Name)

orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')
# Method 1: Rename Columns
orders_products = pd.merge(orders,
    products.rename(columns={'id': 'product_id'}),
    how=" ").reset_index()
# how: "inner" (default), "outer", "left", "right"
# Method 2:
orders_products = pd.merge(orders, products,
    left_on="product_id",
    right_on="id",
    suffixes=["_orders", "_products"])

Method 2:
If we use this syntax, we'll end up with two columns called id. Pandas won't let you have two columns with the same name, so it will change them to id_x and id_y. We can help make them more useful by using the keyword suffixes.


Concatenate

bakery = pd.read_csv('bakery.csv')
ice_cream = pd.read_csv('ice_cream.csv')
menu = pd.concat([bakery, ice_cream])


Melt

pd.melt(DataFrame, id_vars, value_vars, var_name, value_name='value')
id_vars: Column(s) to use as identifier variables.
value_vars: Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name: Name to use for the 'variable' column.
value_name: Name to use for the 'value' column.
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
(see the worked example below)


Assert Statements

# Test if country is of type object
assert gapminder.country.dtypes == np.object
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64
# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()
# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()
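A worked example of the Melt signature above, reusing the month/region column names from the DataFrame for Select Columns / Rows entry for illustration:

import pandas as pd

df = pd.DataFrame([['January', 100, 23], ['February', 51, 145]],
                  columns=['month', 'east', 'south'])
long_df = pd.melt(df, id_vars='month', value_vars=['east', 'south'],
                  var_name='region', value_name='sales')
print(long_df)
#       month region  sales
# 0   January   east    100
# 1  February   east     51
# 2   January  south     23
# 3  February  south    145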
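One caveat for the Concatenate entry (not in the original sheet): pd.concat keeps each frame's original indices by default, so index values can repeat; ignore_index=True renumbers them. The frames below are illustrative.

import pandas as pd

bakery = pd.DataFrame({'item': ['bagel', 'croissant']})
ice_cream = pd.DataFrame({'item': ['vanilla', 'chocolate']})
print(pd.concat([bakery, ice_cream]).index.tolist())                     # [0, 1, 0, 1]
print(pd.concat([bakery, ice_cream], ignore_index=True).index.tolist())  # [0, 1, 2, 3]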