LECTURE 2
PANDAS
Working With Missing Data
Values considered “missing”
pandas uses different sentinel values to represent a missing value (also referred to as NA), depending on the
data type.
numpy.nan is used for NumPy data types. The disadvantage of using NumPy data types is that the original data
type will be coerced to np.float64 or object.
Inserting Missing Data
You can insert missing values by simply assigning to a Series
or DataFrame. The missing value sentinel used will be chosen
based on the dtype.
Example
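A minimal sketch of how the sentinel depends on the dtype. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Float Series: assigning None stores NaN.
ser = pd.Series([1.0, 2.0, 3.0])
ser.loc[0] = None
print(ser)

# Datetime Series: the same kind of assignment stores NaT instead.
dates = pd.Series([pd.Timestamp("2021-01-01"), pd.Timestamp("2021-01-02")])
dates.iloc[0] = np.nan
print(dates)
```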
Calculations With Missing Data
Missing values propagate through arithmetic operations
between pandas objects.
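A short illustration of propagation, using two hypothetical Series: any position where either operand is NA is NA in the result.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

# Positions 1 and 2 each have one missing operand.
total = a + b
print(total)
```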
When summing data, NA values or empty data will be treated
as zero.
When taking the product, NA values or empty data will be
treated as 1.
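The two rules above can be checked on a small sample Series (illustrative data) and on an empty one:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
empty = pd.Series([], dtype="float64")

s_sum = s.sum()          # NaN counts as 0
s_prod = s.prod()        # NaN counts as 1
empty_sum = empty.sum()
empty_prod = empty.prod()
print(s_sum, s_prod, empty_sum, empty_prod)
```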
Cumulative methods like cumsum() and cumprod() ignore NA
values by default but preserve them in the result. This behavior
can be changed with skipna=False, which makes every value
from the first NA onward NA.
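A small sketch comparing the default behavior with skipna=False, on made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Default: NA positions stay NA, but the running total skips over them.
default_cumsum = s.cumsum()

# skipna=False: everything from the first NA onward becomes NA.
strict_cumsum = s.cumsum(skipna=False)

print(default_cumsum)
print(strict_cumsum)
```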
Dropping Missing Data
dropna() drops rows or columns with missing data.
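For illustration, a hypothetical two-column frame with one missing value; the axis argument chooses between dropping rows (the default) and dropping columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

rows_kept = df.dropna()        # drops row 1, which contains an NA
cols_kept = df.dropna(axis=1)  # drops column "A", which contains an NA

print(rows_kept)
print(cols_kept)
```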
Filling Missing Data
Filling by value
fillna() replaces NA values with non-NA data.
Replace NA with a scalar value
Fill gaps forward or backward
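A brief sketch of the three filling styles on sample data. It uses the ffill() and bfill() methods for forward and backward filling; the frame itself is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan, 4.0]})

filled_scalar = df.fillna(0)  # replace NA with a scalar
filled_forward = df.ffill()   # carry the last valid value forward
filled_backward = df.bfill()  # pull the next valid value backward

print(filled_scalar, filled_forward, filled_backward, sep="\n")
```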
Boolean Indexing
Select rows where df.A is greater than 0.
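One way this selection might look, assuming a small made-up frame with a numeric column A:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 0.5, 2.0], "B": ["x", "y", "z"]})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
positive = df[df["A"] > 0]
print(positive)
```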
Using isin() method for filtering:
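A minimal sketch of isin() filtering. The column name E and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"E": ["one", "two", "three", "four"], "V": [1, 2, 3, 4]})

# Keep only rows whose E value appears in the given list.
selected = df[df["E"].isin(["two", "four"])]
print(selected)
```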
Stats
Operations in general exclude missing data.
Calculate the mean value for each column:
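As an illustration on made-up data, note that the NA in column A is excluded from that column's mean rather than making it NA:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [2.0, 4.0, 6.0]})

col_means = df.mean()        # one value per column, NA skipped
row_means = df.mean(axis=1)  # one value per row

print(col_means)
print(row_means)
```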
User Defined Functions
DataFrame.agg() and DataFrame.transform() apply a
user-defined function that reduces or broadcasts its result,
respectively.
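A small sketch of the difference, with illustrative lambdas: agg() collapses each column to one value, while transform() returns an object of the same shape as the input.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

# Reduce: each column becomes a single number.
reduced = df.agg(lambda col: col.mean() * 2)

# Broadcast: the function is applied and the shape is preserved.
broadcast = df.transform(lambda col: col * 10)

print(reduced)
print(broadcast)
```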
Summarizing Data: describe
There is a convenient describe() function which computes a
variety of summary statistics about a Series or the columns of
a DataFrame (excluding NAs of course):
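For example, on a hypothetical Series with one missing value, the count reported by describe() reflects only the non-NA entries:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])

# Returns count, mean, std, min, quartiles, and max as a Series.
summary = s.describe()
print(summary)
```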
Index Of Min/Max Values
The idxmin() and idxmax() functions on Series and DataFrame
compute the index labels with the minimum and maximum
corresponding values:
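A short sketch with made-up labels. On a Series the result is a single label; on a DataFrame it is one label per column.

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0], index=["a", "b", "c"])
min_label = s.idxmin()
max_label = s.idxmax()

df = pd.DataFrame({"X": [3, 1, 2], "Y": [0, 5, 4]}, index=["a", "b", "c"])
max_labels = df.idxmax()  # label of the maximum in each column

print(min_label, max_label)
print(max_labels)
```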
Value Counts (Histogramming) / Mode
The value_counts() Series method computes a histogram of a 1D
array of values. To use it on a regular array, wrap the array in
a Series first; the top-level pd.value_counts() function is
deprecated in recent pandas versions:
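A minimal sketch on illustrative data, counting values in a NumPy array by wrapping it in a Series, and using mode() for the most frequent value:

```python
import numpy as np
import pandas as pd

values = np.array([1, 1, 2, 3, 3, 3])
s = pd.Series(values)

counts = s.value_counts()  # sorted by frequency, most common first
most_common = s.mode()     # may hold several values if there is a tie

print(counts)
print(most_common)
```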
The value_counts() method can be used to count combinations
across multiple columns.
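One way to count combinations is DataFrame.value_counts(), sketched here on a hypothetical two-column frame. The result is indexed by the distinct (a, b) pairs.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 1], "b": ["x", "x", "y", "x", "x"]})

# Each distinct row (combination of a and b) gets its own count.
combos = df.value_counts()
print(combos)
```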
String Methods
Series and Index are equipped with a set of string processing
methods that make it easy to operate on each element of the
array. Perhaps most importantly, these methods exclude
missing/NA values automatically. These are accessed via the
str attribute and generally have names matching the
equivalent (scalar) built-in string methods:
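A brief illustration with sample strings: the missing entry passes through as NA instead of raising an error.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "b", np.nan, "CaT"])

lowered = s.str.lower()  # mirrors the built-in str.lower()
lengths = s.str.len()    # element-wise length, NA stays NA

print(lowered)
print(lengths)
```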
Merge, Join, Concatenate
• concat():
The concat() function concatenates an arbitrary number of
Series or DataFrame objects along an axis while
performing optional set logic (union or intersection) of the
indexes on the other axes. Like numpy.concatenate, concat()
takes a list or dict of homogeneously-typed objects and
concatenates them.
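A minimal sketch of row-wise concatenation with two made-up frames. By default the original index labels are kept; ignore_index=True renumbers them.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

stacked = pd.concat([df1, df2])                         # index 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)   # index 0, 1, 2, 3

print(stacked)
print(renumbered)
```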
• Joining logic of the resulting axis
The join keyword specifies how to handle axis values that don’t
exist in the first DataFrame.
join='outer' (the default) takes the union of all axis values.
join='inner' takes the intersection of the axis values.
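To illustrate, two hypothetical frames with partly overlapping indexes are concatenated column-wise with each join setting:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"B": [4, 5, 6]}, index=[1, 2, 3])

# Union of row labels; cells with no source value are NA.
outer = pd.concat([df1, df2], axis=1, join="outer")

# Intersection of row labels; only labels present in both frames survive.
inner = pd.concat([df1, df2], axis=1, join="inner")

print(outer)
print(inner)
```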
• merge():
merge() performs join operations similar to those of relational databases like SQL,
and implements the common SQL-style joins:
one-to-one: joining two DataFrame objects on their indexes which must
contain unique values.
many-to-one: joining a unique index to one or more columns in a
different DataFrame.
many-to-many: joining columns on columns.
EXAMPLE
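A small sketch of merging on a shared column. The frames and the column name key are made up for illustration; the how argument selects the join type, as in SQL.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "B": [4, 5, 6]})

# Inner join (the default): only keys present in both frames.
inner = pd.merge(left, right, on="key")

# Left join: every key from the left frame, NA where right has no match.
left_join = pd.merge(left, right, on="key", how="left")

print(inner)
print(left_join)
```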
DataFrame.join()
DataFrame.join() combines the columns of multiple, potentially differently-indexed
DataFrame objects into a single result DataFrame.
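A minimal sketch with two hypothetical frames whose indexes only partly overlap. join() aligns on the index and defaults to a left join; how="outer" keeps every label.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2]}, index=["K0", "K1"])
right = pd.DataFrame({"B": [3, 4]}, index=["K0", "K2"])

joined = left.join(right)                     # keeps the left index: K0, K1
joined_outer = left.join(right, how="outer")  # union of both indexes

print(joined)
print(joined_outer)
```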