Pandas: Python Package | Fast, flexible and powerful data manipulation tool
Series: 1 Dimensional labelled array of any data type
>Bros = pd.Series(data = [‘Soum’ , ’Ayush’], index = [1,2], dtype = ’string’) # Using List for creating Series
Note: When Index are not specified Pandas assign a sequence of integers starting from zero. This is a built-in Pandas
object called ‘RangeIndex’
Method & Attributes:
Method – A function bound to the object (Example: List1.equals(List2)
Attribute – A variable bound to the object (Example: List1.dtype, Bros.name = ‘Cool Dudes!’)
Note: Series can have names and those names can be used as column names in DataFrame. Similarly, the Series Index
can also have its own name
Add_prefix() & Add_suffix() Method: (Doesn’t modify the original Series labels)
>Alphabet.add_prefix(‘_prefix_label’)
>Alphabet.add_suffix(‘_suffix_label’)
Boolean mask and the .loc Indexer:
Boolean Mask - Used to index select items at scale | work with [] and .loc | need to be of same length as Series
>Alphabet.loc[[True if i%2==0 else False for i in range(26)]]
.loc – The prototypical way of doing label based extraction
.iloc – Way of doing integer / location based extraction
Example:
>Alphabet = pd.Series(data = [‘A’,’B’,’C’,’D’], index = [1,2,3,4], dtype=’string’)
>Alphabet.loc[1] ---> Output: A
>Alphabet.iloc[1] ---> Output: B
Selecting with .get():
-‘get()’ works for both ‘index based on labels’ as well as ‘location based extraction’
>Alphabet.get(‘label_A’) ---> Output: A
>Alphabet.get(0) ---> Output: A
>Alphabet.get(‘Anything’, default=’Couldn’t find anything!’) ---> Output: Couldn’t find anything!
Note: default of .get() is set to ‘None’
Series Methods and Handling:
pd.read_csv – Read a comma-separated values (csv) file into the DataFrame
>Alcohol = pd.read_csv(‘https://andybek.com/drinks’, usecols = [‘country’,’wine_servings’], index_col = [‘country’])
>type(Alcohol) ---> pandas.core.frame.DataFrame
If the parsed data only contains one column then setting squeeze parameter as True will return a Series
>Alcohol = pd.read_csv(‘https://andybek.com/drinks’, usecols = [‘country’,’wine_servings’], index_col = [‘country’],
squeeze = True)
>type(Alcohol) ---> pandas.core.series.Series
Size & Shape:
.size - number if elements in the Series
.shape - tuple of the dimension for a Series
Unique values and Series Monotonicity:
.is_unique - Simply checks if the Series contains a sequence of Unique values (True/Values)
.nunique() - Gives exact counts of unique values in a Series (excluding NA) (#)
.nunique(dropna = False) - Gives counts of unique values in a Series (including NA)
.unique() - Gives a list of unique values in the Series
.is_monotonic / .is_monotonic_increasing - Checks if Series is increasing / stagnation
.is_monotonic_decreasing - Checks if Series is decreasing
The count() Method:
Series.count() - Return number of non-NA / Null observations in the Series
.isna().sum() - Return uber of Null elements in the Series
Dropping and Filling NAs:
>Alcohol.fillna(100, inplace = False)
>Alcohol.dropna(inplace = False) - Exclude NAs from the Series
Note: Both methods return a copy of the Series unless the ‘inplace’ parameter is set to True
Descriptive Statistics:
>Alcohol.sum() - Excludes NA’s
>Alcohol.mean()
>Alcohol.median()
>Alcohol.quantile(q=0.5)
IQR (Interquantile Range) -> Alcohol.quantile(0.75) – Alcohol.qunatile(0.25)
>Alcohol.min()
>Alcohol.max()
>Alcohol.std()
>Alcohol.var()
Note: Alcohol.std()**2 = Alcohol.var() | Mode - Item with highest frequency
Describe() Method:
Gives an overall statistical description of the dataset
>Alcohol.describe(percentile = [0.79,0.19], include = float, exclude = object)
Value_counts():
A sorted series containing unique values and their counts
>Alcohol.value_counts(sort = True, ascending = False, dropna = True, normalize = False)
Note: Normalize provides relative frequency
idxmax() & idxmin():
idxmax() - Returns the label of the row with minimum value
idxmin() - Returns the label of the row with minimum value
Note: If ‘multiple’ min/max values are present then only the first label is returned
nlargest() & nsmallest():
>Alcohol.nlargest(n = 10) - Shows largest 10 alcohol consuming countries / values
>Alcohol.nsmallest(n = 10) - Shows smallest 10 alcohol consuming countries / values
Sorting with sort_values():
Returns a new Series, sorted by values
>Alcohol.sort_values(ascending = False, na_position = ‘last’, kind = ‘quicksort’, inplace = False)
Note: Have other sorting options such as ‘mergesort’ or ‘heapsort’, default is ‘quicksort’
Sorting wit sort_index():
Returns a new Series, sorted by index labels
>Alcochol.sort_index(ascending = True, na_postion = ‘last’, inplace = False)
Note: NaN won’t show up in the sorting if not present in the index label
Series Arithmetic and fill_value():
Use to add / subtract / divide / multiply two Series
>Alcohol.add(More_Drinks, fill_value = 0)
>Alcohol.subtract(More_Drinks, fill_value = 0)
>Alcohol.divide(More_Drinks, fill_value = 1)
>Alcohol.multiply(More_Drinks, fill_value = 1)
Note: ‘fill_value’ is set to 0 / 1 so that we don’t end up losing data due to Series misalignment
Calculating Variance and Standard Deviation:
Variance - The average of squared differences from the mean
>(Alcohol.subtract(Alcohol.mean())**2).sum()/(Alcohol.count() -1)
Note: Standard Deviation is square of Variance
Cumulative Operations:
>Alcohol.cumsum(skipna =True) - Calculate a progressive / cumulative sum (of the values preceding in the Series)
>Alcohol.cumprod()
>Alcohol.cummin()
>Alcohol.cummax()
Note: ‘NaN’ are skipped (i.e., ‘skipna’ parameter is set to ‘True’ by default) as sum of any number with ‘NaN’ is ‘NaN’)
Pairwise Difference with Diff():
Used to calculate the discrete difference for pair of elements in a Series
>Ser.diff(periods = 1)
Series Iteration:
>for i in Alcohol:
print(i)
-Prints value / alcohol consumption without labels
>for i in Alcohol.index:
print(i)
-Prints labels without values
>for i in Alcohol.index:
print(i, Alcohol[i])
-Prints both label and value
>for i in Alcohol.items(): / for i in Alcohol.iteritems():
print(i)
-Returns a lazy evaluated iterable of tuple | Behind the scene used ‘zip’ method
Filtering: filter(), where() and mask()
>Alcohol.filter(regex = ‘^V’) - Will filter countries starting with letter ‘V’
>Alcohol.filter(like = ‘stan’) - Will filter countries having ‘stan’ in them
Note: Filtering is done along index labels and not on values on using ‘filter()’ function. For filtering for values, use square
brackets (Example: Alcohol[Alcohol > 200])
pandas.Series.where() - Replace values with ‘NaN / set value’ where the condition is False
>Alcohol.where(lambda x: x > 200, np.nan).dropna()
pandas.Series.mask() - Replace values with ‘NaN / set value’ where the condition is True
>Alcohol.mask(lambda x: x > 200, np.nan).dropna()
Note: Default replacing value is ‘NaN’
Transforming with update(), apply() & map():
update() - Update series with newly added values
>Alcohol.update(pd.Series(data = [200,20], index = [‘Albania’, ‘Algeria’]))
apply() - Applies a transformation to each and every element in a series
>Alcohol.apply(lambda x : x**2) | Alcohol.apply(np.square)
map() - Map values of Series according to input correspondence | Used for substituting each value in a Series with
another value, that may be derived from a function, a dictionary Series
>Ser.map({‘old value’ : ‘new value’})
>Alcohol.map(lambda x : x**2)
DataFrame:
A table of data that contains a collection of rows and columns.
Key Aspects -
1. DataFrames have two dimensions: labeled indices and columns
2. Each column in a DataFrame is a Series and each column must be of same size
3. Unlike Series, DataFrames could be heterogenous (i.e., have multiple data types)
>names = [‘Olga’, ‘Andre’, ‘Brian’]
>ages = [29, 21, 45]
>married = [False, True, True]
>df = pd.DataFrame({‘name’ : names, ‘age’ : ages, ‘married’ : married})
Need to specify index position as well as column position / name to fetch specific value from DataFrame
> df.iloc[2,0] --> Output: ‘Brian’
.dtypes: Shows columns name and their datatypes
Note: ‘.dtype’ will throw an error as DataFrame can have multiple data types
.ndim & .shape:
.ndim - Direct way of comparing dimension of the dataset
.shape - Gives shape of the dataset
More ways for DataFrame:
1. Dict of Tuples (Column-wise)
>tuple_name = tuple(names)
>tuple_ages = tuple(ages)
>tuple_married = tuple(married)
>pd.DataFrame({‘name’ : tuple_name, ‘age’ : tuple_ages, ‘married’ : tuple_married})
2. Dict of Dicts (Column-wise)
>series_name = pd.Series(names)
>series_ages = pd.Series(ages)
>series_married = pd.Series(married)
>pd.DataFrame({‘name’ : series_name, ‘age’ : series_ages, ‘married’ : series_married})
3. Dict of Series (Column-wise)
>dict_names = {k:v for k,v in enumerate(names)}
>dict_ages = {k:v for k,v in enumerate(ages)}
>dict_married = {k:v for k,v in enumerate(married)}
>pd.DataFrame({‘name’ : dict_name, ‘age’ : dict_ages, ‘married’ : dict_married})
4. List of Dicts (Row-wise)
>rowwise = [{‘name’ : name, ‘age’ : ages, ‘married’ : married} for name, ages, married in zip(names, ages, married)]
The info() Method:
Gives overview about the dataset. Only works for DataFrames
>df.info(verbose = True, max_cols = 4, memory_usage = ‘deep)
Note: verbose default is set to True. Setting it to False will remove column specific information
DataFrame:
>Nutrition = pd.DataFrame(‘http://andybek.com/pandas-nutrition’, index_col = [0])
.drop():
Removes specific rows / columns from the DataFrame
>Nutrition.drop(‘Unnamed : 0’, axis = 1)
.set_index():
Set specified column as index of the DataFrame
>Nutrition.set_index(‘Unnamed : 0’)