Pandas Cheat Sheet
Data Wrangling with pandas

Pandas API Reference · Pandas User Guide

Tidy data complements pandas's vectorized operations. In a tidy data set:
- Each variable is saved in its own column.
- Each observation is saved in its own row.
pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.
Creating DataFrames

   a  b  c
1  4  7  10
2  5  8  11
3  6  9  12

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])
    Specify values for each column.

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
    Specify values for each row.

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)],
        names=['n', 'v']))
    Create DataFrame with a MultiIndex.

Reshaping Data – Change layout, sorting, reindexing, renaming

pd.melt(df)
    Gather columns into rows.
df.pivot(columns='var', values='val')
    Spread rows into columns.
pd.concat([df1, df2])
    Append rows of DataFrames.
pd.concat([df1, df2], axis=1)
    Append columns of DataFrames.
df.sort_values('mpg')
    Order rows by values of a column (low to high).
df.sort_values('mpg', ascending=False)
    Order rows by values of a column (high to low).
df.rename(columns={'y': 'year'})
    Rename the columns of a DataFrame.
df.sort_index()
    Sort the index of a DataFrame.
df.reset_index()
    Reset index of DataFrame to row numbers, moving index to columns.
df.drop(columns=['Length', 'Height'])
    Drop columns from DataFrame.

Subset Observations - rows

df[df.Length > 7]
    Extract rows that meet logical criteria.
df.drop_duplicates()
    Remove duplicate rows (only considers columns).
df.sample(frac=0.5)
    Randomly select fraction of rows.
df.sample(n=10)
    Randomly select n rows.
df.nlargest(n, 'value')
    Select and order top n entries.
df.nsmallest(n, 'value')
    Select and order bottom n entries.
df.head(n)
    Select first n rows.
df.tail(n)
    Select last n rows.

Subset Variables - columns

df[['width', 'length', 'species']]
    Select multiple columns with specific names.
df['width'] or df.width
    Select single column with specific name.
df.filter(regex='regex')
    Select columns whose name matches regular expression regex.

Using query

query() allows Boolean expressions for filtering rows.
df.query('Length > 7')
df.query('Length > 7 and Width < 8')
df.query('col.str.startswith("abc")', engine="python")
    String methods can be used in query expressions with the python engine.

Subsets - rows and columns

Use df.loc[] and df.iloc[] to select only rows, only columns or both.
Use df.at[] and df.iat[] to access a single value by row and column.
First index selects rows, second index columns.

df.iloc[10:20]
    Select rows 10-20.
df.iloc[:, [1, 2, 5]]
    Select columns in positions 1, 2 and 5 (first column is 0).
df.loc[:, 'x2':'x4']
    Select all columns between x2 and x4 (inclusive).
df.loc[df['a'] > 10, ['a', 'c']]
    Select rows meeting logical condition, and only the specific columns.
df.iat[1, 2]
    Access single value by index position.
df.at[4, 'A']
    Access single value by label.
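A minimal runnable sketch of the selectors above, using a small made-up DataFrame (the column names `Length`, `Width`, `species` are illustrative):

```python
import pandas as pd

# Hypothetical example data for the selectors above.
df = pd.DataFrame(
    {"Length": [5, 8, 9],
     "Width": [2, 4, 6],
     "species": ["a", "b", "c"]},
    index=[1, 2, 3])

rows = df[df.Length > 7]            # boolean row filter: keeps rows 2 and 3
cols = df[["Length", "species"]]    # column subset by name
both = df.loc[df["Length"] > 7, ["Length"]]  # rows meeting condition, one column
single = df.at[2, "Width"]          # single value by label
pos = df.iat[0, 1]                  # single value by integer position
```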
Method Chaining

Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.

df = (pd.melt(df)
      .rename(columns={
          'variable': 'var',
          'value': 'val'})
      .query('val >= 200'))

Logic in Python (and pandas)

<    Less than                  !=  Not equal to
>    Greater than               df.column.isin(values)  Group membership
==   Equals                     pd.isnull(obj)  Is NaN
<=   Less than or equals        pd.notnull(obj)  Is not NaN
>=   Greater than or equals     &, |, ~, ^, df.any(), df.all()  Logical and, or, not, xor, any, all

regex (Regular Expressions) Examples

'\.'              Matches strings containing a period '.'
'Length$'         Matches strings ending with word 'Length'
'^Sepal'          Matches strings beginning with the word 'Sepal'
'^x[1-5]$'        Matches strings beginning with 'x' and ending with 1,2,3,4,5
'^(?!Species$).*' Matches strings except the string 'Species'
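The chain above can be run end to end; here is a minimal sketch with made-up data:

```python
import pandas as pd

# Two hypothetical columns of measurements.
df = pd.DataFrame({"A": [150, 250], "B": [300, 100]})

# melt -> rename -> query, as in the chaining example above.
out = (pd.melt(df)
       .rename(columns={"variable": "var", "value": "val"})
       .query("val >= 200"))
# out keeps only the melted rows whose value is at least 200
```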
Cheatsheet for pandas ([Link]) originally written by Irv Lustig, Princeton Consultants, inspired by the RStudio Data Wrangling Cheatsheet.
Group Data

df.groupby(by="col")
    Return a GroupBy object, grouped by values in column named "col".
df.groupby(level="ind")
    Return a GroupBy object, grouped by values in index level named "ind".

All of the summary functions listed above can be applied to a group.
Additional GroupBy functions:
size()
    Size of each group.
agg(function)
    Aggregate group using function.

The examples below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.

shift(1)
    Copy with values shifted by 1.
shift(-1)
    Copy with values lagged by 1.
rank(method='dense')
    Ranks with no gaps.
rank(method='min')
    Ranks. Ties get min rank.
rank(pct=True)
    Ranks rescaled to interval [0, 1].
rank(method='first')
    Ranks. Ties go to first value.
cumsum()
    Cumulative sum.
cummax()
    Cumulative max.
cummin()
    Cumulative min.
cumprod()
    Cumulative product.

Summarize Data

df['w'].value_counts()
    Count number of rows with each unique value of variable.
len(df)
    # of rows in DataFrame.
df.shape
    Tuple of # of rows, # of columns in DataFrame.
df['w'].nunique()
    # of distinct values in a column.
df.describe()
    Basic descriptive statistics for each column (or GroupBy).
df.info()
    Prints a concise summary of the DataFrame.
df.memory_usage()
    Returns the memory usage of each column in the DataFrame.
df.dtypes
    Series with the dtype of each column in the DataFrame.

Handling Missing Data

df.dropna()
    Drop rows with any column having NA/null data.
df.fillna(value)
    Replace all NA/null data with value.

Make New Columns

df.assign(Area=lambda df: df.Length * df.Height)
    Compute and append one or more new columns.
df['Volume'] = df.Length * df.Height * df.Depth
    Add single column.
pd.qcut(df.col, n, labels=False)
    Bin column into n buckets.

Combine Data Sets

adf           bdf
x1 x2         x1 x3
A  1          A  T
B  2          B  F
C  3          D  T

Standard Joins

pd.merge(adf, bdf, how='left', on='x1')
    Join matching rows from bdf to adf.
    x1 x2 x3
    A  1  T
    B  2  F
    C  3  NaN
pd.merge(adf, bdf, how='right', on='x1')
    Join matching rows from adf to bdf.
    x1 x2   x3
    A  1.0  T
    B  2.0  F
    D  NaN  T
pd.merge(adf, bdf, how='inner', on='x1')
    Join data. Retain only rows in both sets.
    x1 x2 x3
    A  1  T
    B  2  F
pd.merge(adf, bdf, how='outer', on='x1')
    Join data. Retain all values, all rows.
    x1 x2   x3
    A  1    T
    B  2    F
    C  3    NaN
    D  NaN  T

Filtering Joins

adf[adf.x1.isin(bdf.x1)]
    All rows in adf that have a match in bdf.
    x1 x2
    A  1
    B  2
adf[~adf.x1.isin(bdf.x1)]
    All rows in adf that do not have a match in bdf.
    x1 x2
    C  3
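The joins above can be reproduced directly; a minimal sketch using the `adf`/`bdf` tables shown:

```python
import pandas as pd

# The adf/bdf example tables from above.
adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": ["T", "F", "T"]})

left = pd.merge(adf, bdf, how="left", on="x1")    # keeps A, B, C; x3 is NaN for C
inner = pd.merge(adf, bdf, how="inner", on="x1")  # keeps only A and B
semi = adf[adf.x1.isin(bdf.x1)]                   # filtering join: adf rows with a match
anti = adf[~adf.x1.isin(bdf.x1)]                  # adf rows without a match
```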
Summary Functions

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling (see below)) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:

sum()
    Sum values of each object.
count()
    Count non-NA/null values of each object.
median()
    Median value of each object.
quantile([0.25, 0.75])
    Quantiles of each object.
apply(function)
    Apply function to each object.
min()
    Minimum value in each object.
max()
    Maximum value in each object.
mean()
    Mean value of each object.
var()
    Variance of each object.
std()
    Standard deviation of each object.

Vector Functions

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:

max(axis=1)
    Element-wise max.
min(axis=1)
    Element-wise min.
clip(lower=-10, upper=10)
    Trim values at input thresholds.
abs()
    Absolute value.

Set-like Operations

ydf           zdf
x1 x2         x1 x2
A  1          B  2
B  2          C  3
C  3          D  4

pd.merge(ydf, zdf)
    Rows that appear in both ydf and zdf (Intersection).
    x1 x2
    B  2
    C  3
pd.merge(ydf, zdf, how='outer')
    Rows that appear in either or both ydf and zdf (Union).
    x1 x2
    A  1
    B  2
    C  3
    D  4
(pd.merge(ydf, zdf, how='outer', indicator=True)
 .query('_merge == "left_only"')
 .drop(columns=['_merge']))
    Rows that appear in ydf but not zdf (Setdiff).
    x1 x2
    A  1

Windows

df.expanding()
    Return an Expanding object allowing summary functions to be applied cumulatively.
df.rolling(n)
    Return a Rolling object allowing summary functions to be applied to windows of length n.
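Expanding and rolling windows can be combined with any of the summary functions above; a minimal sketch on a made-up Series:

```python
import pandas as pd

# A small example Series for windowed summaries.
s = pd.Series([1, 2, 3, 4])

exp_sum = s.expanding().sum()   # cumulative sums: 1, 3, 6, 10
roll_sum = s.rolling(2).sum()   # sums over windows of length 2: NaN, 3, 5, 7
```

The first entry of the rolling result is NaN because a full window of length 2 is not yet available.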
Plotting

df.plot.line()
    Plot a line graph of the DataFrame.
df.plot.scatter(x='w', y='h')
    Plot a scatter graph of the DataFrame.
df.plot.hist()
    Plot a histogram of the DataFrame.
df.plot.pie()
    Plot a pie chart of the DataFrame.
df.plot.bar() / df.plot.barh()
    Plot a vertical / horizontal bar graph of the DataFrame.
df.plot.area()
    Plot an area graph of the DataFrame.
df.plot.hexbin(x='w', y='h')
    Plot a hexbin graph of the DataFrame.

df.plot.line(subplots=True)
    Separate into different graphs for each column in the DataFrame.
df.plot.hist(cumulative=True)
    Creates a cumulative plot.
df.plot.hist(bins=30)
    Set the number of bins into which data is grouped (histograms).
df.plot.bar(stacked=True)
    Stacks the data for the columns on top of each other (bar, barh and area only).
df.plot.area(alpha=0.5)
    Sets the transparency of the plot to 50%.
df.plot.line(title="Graph of A against B")
    Sets the title of the graph.

Arguments can be combined for more flexibility when graphing. For example,
df.plot.line(subplots=True, title=['col1', 'col2', 'col3'])
would plot a separate line graph for each column of a 3-column DataFrame. The first string in the list of titles applies to the graph of the left-most column.

Frequently Used Options

pandas offers some 'options' to globally control how pandas behaves, display etc. Options can be queried and set via pd.options.<category>.option_name (where option_name is the name of an option). For example:

pd.options.display.max_rows = 20
    Set the display.max_rows option to 20.

Functions

pd.get_option(option)
    Fetch the value of the given option.
pd.set_option(option, value)
    Set the value of the given option.
pd.reset_option(option)
    Reset the given option(s) to default settings.
pd.describe_option(option)
    Print descriptions of given options.
pd.option_context(option, value, ...)
    Execute code with temporary option settings that revert to prior settings after execution.
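A minimal sketch of the option functions above, including the automatic revert behaviour of option_context:

```python
import pandas as pd

# Remember the current value so it can be restored at the end.
default_rows = pd.get_option("display.max_rows")

pd.set_option("display.max_rows", 20)
assert pd.options.display.max_rows == 20  # same option, attribute-style access

# Inside the context the option is temporarily overridden...
with pd.option_context("display.max_rows", 5):
    assert pd.get_option("display.max_rows") == 5
# ...and reverts to the prior setting afterwards.
assert pd.get_option("display.max_rows") == 20

pd.reset_option("display.max_rows")  # back to the pandas default
```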
Changing Type

pd.to_numeric(data)
    Convert non-numeric types to numeric.
pd.to_datetime(data)
    Convert non-datetime types to datetime type.
pd.to_timedelta(data)
    Convert non-timedelta types to timedelta.
df.astype(type)
    Convert data to (almost) any given type, including categorical.
df.infer_objects()
    Attempts to infer a better type for object-type data.
df.convert_dtypes()
    Convert columns to the best possible dtypes.

Series String Operations

Similar to Python string operations, except these are vectorized to apply to the entire Series efficiently.

s.str.count(pattern)
    Returns a series with the integer counts of the pattern in each element.
s.str.get(index)
    Returns a series with the data at the given index for each element.
s.str.cat(sep=sep)
    Concatenate elements into a single string.
s.str.join(sep)
    Returns a series where each element has been concatenated with the separator.
s.str.partition(sep)
    Splits the string on the first instance of the separator.
s.str.slice(start, stop, step)
    Slices each string.
s.str.title()
    Converts the first character of each word to a capital.
s.str.replace(pat, rep, regex=True)
    Use regex to replace patterns in each string.
s.str.len()
    Returns a series with the lengths of each element.
s.str.isalnum()
    Checks whether each element is alpha-numeric.

Datetime

With a Series containing data of type datetime, the dt accessor is used to get various components of the datetime values:

s.dt.year
    Extract the year.
s.dt.month
    Extract the month as an integer.
s.dt.day
    Extract the day (int) from the date.
s.dt.hour
    Extract the hour.
s.dt.minute
    Extract the minute.
s.dt.second
    Extract the second.
s.dt.quarter
    Find which quarter the date lies in.

Mapping

Apply a mapping to every element in a DataFrame or Series, useful for recategorizing or transforming data.

s.map(lambda x: 2 * x)
    Returns a copy of the series where every entry is doubled.
df.apply(lambda s: s.max() - s.min(), axis=1)
    Returns a Series with the difference of the maximum and minimum values of each row of the DataFrame.

Input/Output

Common file types for data input include CSV, JSON and HTML, which are human-readable, while the common output types are usually more optimized for performance and scalability, such as feather, parquet and HDF.

df = pd.read_csv(filepath)
    Read data from a csv file.
df = pd.read_html(filepath)
    Read tables from an html file (returns a list of DataFrames).
df = pd.read_excel(filepath)
    Read data from xls (and related) files.
df = pd.read_sql(query, connection)
    Read data from a SQL database.
df = pd.read_clipboard()
    Read data from the system clipboard.
df.to_parquet(filepath)
    Write data to a parquet file.
df.to_feather(filepath)
    Write data to a feather file.
df.to_hdf(filepath, key)
    Write data to an HDF file.
df.to_clipboard()
    Copy object to the system clipboard.

Display Options

display.max_rows
    The maximum number of rows displayed in pretty-print.
display.max_columns
    The maximum number of columns displayed in pretty-print.
display.expand_frame_repr
    Controls whether the DataFrame representation stretches across pages.
display.large_repr
    Controls whether a DataFrame that exceeds maximum rows/columns is truncated or summarized.
display.precision
    The output display precision in decimal places.
display.max_colwidth
    The maximum width of columns; longer cells will be truncated.
display.max_info_columns
    The maximum number of columns displayed after calling info().
display.chop_threshold
    Sets the rounding threshold to zero when displaying a Series/DataFrame.
display.colheader_justify
    Controls how column headers are justified.
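A minimal sketch tying together the str accessor, the dt accessor and map, on made-up data:

```python
import pandas as pd

# String operations are vectorized over the whole Series.
s = pd.Series(["hello world", "pandas cheat sheet"])
titled = s.str.title()   # capitalize the first character of each word
lengths = s.str.len()    # length of each element

# The dt accessor extracts components of datetime values.
dates = pd.to_datetime(pd.Series(["2021-03-15", "2022-11-02"]))
years = dates.dt.year
quarters = dates.dt.quarter

# map applies a function to every element of a Series.
nums = pd.Series([1, 2, 3])
doubled = nums.map(lambda x: 2 * x)
```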