Cheat Sheet: The pandas DataFrame Object
Preliminaries Get your data into a DataFrame
Always start by importing these Python modules Instantiate an empty DataFrame
import numpy as np df = DataFrame()
import matplotlib.pyplot as plt
import pandas as pd Load a DataFrame from a CSV file
from pandas import DataFrame, Series df = pd.read_csv('file.csv') # often works
Note: these are the recommended import aliases df = pd.read_csv('file.csv', header=0,
Note: you can put these into a PYTHONSTARTUP file index_col=0, quotechar='"', sep=':',
na_values = ['na', '-', '.', ''])
Note: refer to pandas docs for all arguments
Cheat sheet conventions Get data from inline CSV text to a DataFrame
from io import StringIO
Code examples data = """, Animal, Cuteness, Desirable
# Code examples are found in yellow boxes row-1, dog, 8.7, True
row-2, cat, 9.5, True
In the code examples, typically I use: row-3, bat, 2.6, False"""
s to represent a pandas Series object; df = pd.read_csv(StringIO(data), header=0,
index_col=0, skipinitialspace=True)
df to represent a pandas DataFrame object;
idx to represent a pandas Index object. Note: skipinitialspace=True allows for a pretty layout
Also: t – tuple, l – list, b – Boolean, i – integer,
a – numpy array, st – string, d – dictionary, etc. Load DataFrames from a Microsoft Excel file
# Each Excel sheet in a Python dictionary
workbook = pd.ExcelFile('file.xlsx')
d = {} # start with an empty dictionary
The conceptual model for sheet_name in workbook.sheet_names:
df = workbook.parse(sheet_name)
d[sheet_name] = df
DataFrame object: is a two-dimensional table of data
with column and row indexes (something like a spread Note: the parse() method takes many arguments like
sheet). The columns are made up of Series objects. read_csv() above. Refer to the pandas documentation.
Column index (df.columns) Load a DataFrame from a MySQL database
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://'
Series of data
Series of data
Series of data
Series of data
Series of data
Series of data
Series of data
(df.index)
+'USER:PASSWORD@HOST/DATABASE')
Row index
df = pd.read_sql_table('table', engine)
Data in Series then combine into a DataFrame
# Example 1 ...
s1 = Series(range(6))
s2 = s1 * s1
A DataFrame has two Indexes: s2.index = s2.index + 2 # misalign indexes
Typically, the column index (df.columns) is a list of df = pd.concat([s1, s2], axis=1)
strings (variable names) or (less commonly) integers
Typically, the row index (df.index) might be: # Example 2 ...
o Integers - for case or row numbers; s3 = Series({'Tom':1, 'Dick':4, 'Har':9})
o Strings – for case names; or s4 = Series({'Tom':3, 'Dick':2, 'Mar':5})
o DatetimeIndex or PeriodIndex – for time series df = pd.concat({'A':s3, 'B':s4 }, axis=1)
Note: 1st method has in integer column labels
Series object: an ordered, one-dimensional array of Note: 2nd method does not guarantee col order
data with an index. All the data in a Series is of the
same data type. Series arithmetic is vectorised after first Get a DataFrame from a Python dictionary
aligning the Series index for each of the operands. # default --- assume data is in columns
df = DataFrame({
s1 = Series(range(0,4)) # -> 0, 1, 2, 3 'col0' : [1.0, 2.0, 3.0, 4.0],
s2 = Series(range(1,5)) # -> 1, 2, 3, 4 'col1' : [100, 200, 300, 400]
s3 = s1 + s2 # -> 1, 3, 5, 7 })
Version 30 April 2017 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
1