Working with missing and non-finite data

Working with missing data
Pandas uses the not-a-number construct (np.nan and float('nan')) to indicate missing data. The Python None can arise in data as well. It is also treated as missing data, as is the pandas not-a-time construct (pd.NaT).

Missing data in a Series
s = Series([8, None, float('nan'), np.nan])
# [8, NaN, NaN, NaN]
s.isnull()   # [False, True, True, True]
s.notnull()  # [True, False, False, False]
s.fillna(0)  # [8, 0, 0, 0]

Missing data in a DataFrame
df = df.dropna()           # drop all rows with NaN
df = df.dropna(axis=1)     # same for cols
df = df.dropna(how='all')  # drop rows that are all NaN
df = df.dropna(thresh=2)   # keep rows with 2+ non-NaN values
# only drop row if NaN in a specified col
df = df.dropna(subset=['col'])

Recoding missing data
df.fillna(0, inplace=True)  # np.nan -> 0
s = df['col'].fillna(0)     # np.nan -> 0
# empty or whitespace-only strings -> np.nan
df = df.replace(r'^\s*$', np.nan, regex=True)

Non-finite numbers
With floating point numbers, pandas provides for positive and negative infinity.
s = Series([float('inf'), float('-inf'), np.inf, -np.inf])
Pandas treats integer comparisons with plus or minus infinity as expected.

Testing for finite numbers
(using the data from the previous example)
b = np.isfinite(s)  # [False, False, False, False]

Working with Categorical Data

Categorical data
The pandas Series has an R factors-like data type for encoding categorical data.
s = Series(['a','b','a','c','b','d','a'], dtype='category')
df['B'] = df['A'].astype('category')
Note: the key here is to specify the "category" data type.
Note: categories are listed in sorted order on creation if they are sortable, but the categorical itself is unordered unless ordering is requested. See ordering below.

Convert back to the original data type
s = Series(['a','b','a','c','b','d','a'], dtype='category')
s = s.astype(str)

Ordering, reordering and sorting
s = Series(list('abc'), dtype='category')
print(s.cat.ordered)  # False
s = s.cat.reorder_categories(['b','c','a'])
s = s.sort_values()
s = s.cat.as_unordered()
Trap: sorting follows the category order, not alphabetical order; the category must be ordered for min(), max() and comparisons such as < to work

Renaming categories
s = Series(list('abc'), dtype='category')
s = s.cat.rename_categories([1, 2, 3])
s = s.cat.rename_categories([4, 5, 6])
# using a comprehension ...
s = s.cat.rename_categories(['Group ' + str(i) for i in s.cat.categories])
Trap: categories must be uniquely named

Adding new categories
s = s.cat.add_categories([4])

Removing categories
s = s.cat.remove_categories([4])
s = s.cat.remove_unused_categories()  # returns a new Series
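
As a minimal sketch of the categorical operations on this page (default ordering, reordering, sorting, adding and removing categories), the following uses a hypothetical sizes Series. The labels and variable names are assumptions for illustration only and do not come from the cheat sheet itself.

```python
import pandas as pd

# Hypothetical data: the size labels are illustrative only.
sizes = pd.Series(['medium', 'small', 'large', 'small'], dtype='category')

# Categories are stored in sorted (alphabetical) order, and the
# categorical is unordered unless an order is requested.
default_categories = list(sizes.cat.categories)
is_ordered_by_default = bool(sizes.cat.ordered)

# Impose a meaningful order, then sort by it rather than alphabetically.
sizes = sizes.cat.reorder_categories(['small', 'medium', 'large'], ordered=True)
sorted_sizes = list(sizes.sort_values())
largest = sizes.max()  # max() requires an ordered categorical

# Add a category, then drop it again because no value uses it.
sizes = sizes.cat.add_categories(['huge'])
with_extra = list(sizes.cat.categories)
sizes = sizes.cat.remove_unused_categories()
after_cleanup = list(sizes.cat.categories)
```

Each .cat method here returns a new Series rather than modifying the existing one, which is why every call is assigned back to sizes.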
Version 30 April 2017 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]