Introduction to Python
List addition:
# Create lists first and second
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]
#Paste together first and second: full
full = first + second
List Sorting:
# Sort full in descending order: full_sorted
full_sorted = sorted(full, reverse=True)
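The two snippets above can be run end to end; a minimal sketch using the sample values from the notes:

```python
# Paste the lists together, then sort in descending order
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]

full = first + second
full_sorted = sorted(full, reverse=True)
print(full_sorted)  # [20.0, 18.0, 11.25, 10.75, 9.5]
```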
NumPy Arrays:
import numpy as np
sample_array = np.array(sample_list)
Arithmetic operators (+, *, -, etc.) work element-wise on arrays.
Boolean arrays:
sample_boolean_array = sample_array > 50
Returns an array of True/False values
Can be used for indexing, as in: sample_array[sample_boolean_array] # keeps only the elements where the mask is True
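A runnable sketch of the boolean masking described above (the array values are made up for illustration):

```python
import numpy as np

sample_array = np.array([10, 60, 30, 80])
sample_boolean_array = sample_array > 50   # [False, True, False, True]
print(sample_array[sample_boolean_array])  # [60 80]
```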
2D NumPy Arrays:
Structured as a list of lists:
sample_2darray = np.array(sample_list_of_lists)
Can be indexed similarly to lists:
sample_2darray[x, y]
x and y can be ":" or other slices such as 2:5
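A small example of the 2D indexing described above (made-up numbers):

```python
import numpy as np

sample_2darray = np.array([[1, 2, 3],
                           [4, 5, 6],
                           [7, 8, 9]])

print(sample_2darray[0, 2])    # 3: row 0, column 2
print(sample_2darray[:, 1])    # [2 5 8]: every row, column 1
print(sample_2darray[1:3, 0])  # [4 7]: rows 1-2, column 0
```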
Also can be indexed with boolean filters, among different arrays:
# Convert positions and heights to numpy arrays: np_positions, np_heights
np_positions = np.array(positions)
np_heights = np.array(heights)
# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions == 'GK']
2D Array Multiplication:
A y*x array can be multiplied by a 1*x array: the Ath column of the 2D array is multiplied by the Ath value of the 1*x array (NumPy broadcasting).
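A minimal broadcasting sketch (made-up numbers):

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])       # shape (2, 3)
scale = np.array([10, 100, 1000])  # shape (3,)

# Column A of grid is multiplied by element A of scale
print(grid * scale)
# [[  10  200 3000]
#  [  40  500 6000]]
```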
2D Array Basic Statistics:
np.mean, np.median, np.std, np.corrcoef
Intermediate Python:
import matplotlib.pyplot as plt
Linear and scatter plots can be made from two lists of equal length, x and y:
plt.plot(x, y) OR plt.scatter(x, y)
Histograms: (from a list)
plt.hist(sample_list, bins=int)
# the optional bins argument specifies the number of bins
matplotlib.pyplot function samples:
plt.xscale('log') # logarithmic x-axis
plt.xlabel(xlab) # title for the x-axis
plt.ylabel(ylab)
plt.title(plot_title)
plt.xticks([numbers_for_the_axis], [labels_for_each_number])
# The second list is optional. The two lists must have the same length.
plt.yticks(y_axis_ticks)
plt.text(x_coordinate, y_coordinate, 'text_string')
# The first two arguments are floats
sample_list.index('element_in_list') # returns the index of the specified element (a list method)
pyplot used as a method (not a function) accomplishes similar results with slightly different syntax:
avocados.plot(kind='scatter', x='nb_sold', y='avg_price', title='Number of avocados sold vs. average price')
avocados[avocados['type']=='conventional']['avg_price'].hist()
legend:
# Add a legend
plt.legend(['conventional', 'organic'])
the following snippet can be used to plot the number of NaN values in a DF’s columns:
df.isna().sum().plot(kind='bar')
the .isna() method returns a boolean for every value in the DataFrame; .sum() then counts the True values per column.
the .dropna() method can be used to delete all rows containing NaN values:
avocados_complete = avocados_2016.dropna()
Dictionaries:
sample_dict.keys() # returns the keys in the dictionary, can be printed
Delete:
del sample_dict['sample_key']
Dictionaries of dictionaries can be double-indexed to get to the values in the secondary
dictionaries.
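A small sketch with a hypothetical nested dictionary:

```python
# Hypothetical nested dictionary (made-up data)
europe = {'spain': {'capital': 'madrid', 'population': 46.77},
          'france': {'capital': 'paris', 'population': 66.03}}

# Double-index to reach a value in a secondary dictionary
print(europe['france']['capital'])  # paris

# del is a statement, not a function
del europe['spain']
```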
Pandas DataFrames
import pandas as pd
Can be constructed from a dictionary of lists (the values) with the keys as the column names:
cars = pd.DataFrame(dict_of_lists_as_values)
The row IDs can be specified:
sample_dataframe.index = sample_list_of_strings
Open files:
cars = pd.read_csv('sample_file.csv', index_col=0)
The optional index_col argument specifies the ID column; pandas uses integers by default for the index.
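A self-contained sketch of read_csv with index_col; io.StringIO stands in for a real file here, and the CSV content is made up:

```python
import io
import pandas as pd

# In-memory CSV stands in for 'sample_file.csv' (illustration only)
csv_text = "code,country,drives_right\nUS,United States,True\nJPN,Japan,False\n"

# index_col=0 uses the first column ('code') as the row index
cars = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(cars.loc['JPN', 'drives_right'])  # False
```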
Indexing (for returning or printing):
# One column as a Series:
sample_dataframe['column_title']
# One or more columns as a DataFrame; use double brackets (a list of column names):
sample_dataframe[['column_title']]
sample_dataframe[['column_title', 'second_column_title']]
# Note: single brackets select columns, not rows; selecting rows needs slices or loc/iloc (below).
#Indexing using loc or iloc:
# Print out observation for Japan
print(cars.loc['JPN'])
# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])
# Print out drives_right value of Morocco
print(cars.loc[['MOR'], ['drives_right']])
# Print sub-DataFrame
print(cars.loc[['RU', 'MOR'], ['country', 'drives_right']])
# Print out drives_right column as DataFrame
print(cars.loc[:, ['drives_right']])
Comparisons:
Boolean operations on NumPy arrays:
import numpy as np
np.logical_and()
np.logical_or()
np.logical_not()
#example:
np.logical_and(sample_array > 21, sample_array < 22)
#returns an array of boolean values for all the elements between 21 and 22
bmi[np.logical_and(bmi>21, bmi<22)]
#returns an array of all the elements between 21 and 22
Filtering pandas DataFrames:
#DataFrame filtering:
sample_filter = sample_dataframe["column_name"] > x
print(sample_dataframe[sample_filter])
# can also use loc or column selection to further customize the filtering
NumPy boolean functions can also be used on DataFrames, since pandas is built on NumPy:
sample_dataframe[np.logical_and(sample_dataframe["column_name"] > x, sample_dataframe["column_name"] < y)]
# returns all the DataFrame rows with x < column_name < y
Lists:
for i, v in enumerate(sample_list): # enumerate yields (index, value) tuples
    print(str(i) + str(v))
For loop with NumPy arrays:
the np.nditer function can be used to loop over every element in the array, regardless of its shape.
import numpy as np
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])
total = np.array([array1, array2])
print("basic for-loop print:")
for i in total: print(i)
print("for-loop with nditer print:")
for i in np.nditer(total): print(i)
print("simple print:")
print(total)
Printing pandas DataFrames with for loops:
for row_label, row in sample_dataframe.iterrows():
    print(row_label)
    print(row)
indexing:
for row_label, row in sample_dataframe.iterrows():
    print(row_label + ": " + row["column_title"])
Adding a column to a DataFrame:
for row_label, row in sample_dataframe.iterrows():
    sample_dataframe.loc[row_label, "new_column"] = len(row["specified_column"])
the apply method is a much better way to do this:
sample_dataframe["new_column"] = sample_dataframe["specified_column"].apply(len) # or any other function instead of len
Numpy.random
import numpy.random as npr
npr.seed(123) # or any other number, if you want reproducibility
for i in range(1, 10):
    print(npr.rand(), end="\t")
for i in range(1, 10):
    print(npr.randint(1, 10), end="\t")
Selecting min and max rows of a column from a pd.DataFrame:
# Display the biggest blowout(s) and the closest game(s)
display(super_bowls[super_bowls['difference_pts'] == super_bowls['difference_pts'].max()])
display(super_bowls[super_bowls['difference_pts'] == super_bowls['difference_pts'].min()])
Data Manipulation with pandas:
specific dataframe attributes:
# Print the values of df
print(df.values)
# Print the column index of df
print(df.columns)
# Print the row index of df
print(df.index)
homelessness_ind = homelessness.sort_values("individuals")
#can have an argument for descending sort:
homelessness_ind = homelessness.sort_values("individuals", ascending=False)
basic column summary statistics: head, info, mean, median, max, cumsum, and cummax methods:
# Print the head of the sales DataFrame
print(sales.head())
# Print the info about the sales DataFrame
print(sales.info())
# Print the mean of weekly_sales
print(sales["weekly_sales"].mean())
# Print the median of weekly_sales
print(sales["weekly_sales"].median())
and also custom function(s) using the agg method (pass a list of functions or function names for multiple functions):
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
# Print IQR of the temperature column
print(sales["temperature_c"].agg(iqr))
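Passing a list to .agg() computes several statistics at once; a toy example with made-up temperatures:

```python
import pandas as pd

# Toy table standing in for the course's sales DataFrame
sales = pd.DataFrame({"temperature_c": [10.0, 20.0, 30.0, 40.0]})

def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# A list mixing a custom function and a built-in name
stats = sales["temperature_c"].agg([iqr, "median"])
print(stats)
```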
Counting:
drop duplicate rows, based on one column or a combination of columns, using the drop_duplicates method with the column name or a list of column names:
store_depts = sales.drop_duplicates(["store", "department"])
count the number of rows belonging to each category using the value_counts() method:
store_counts = stores["store_type"].value_counts()
value_counts can also take normalize and sort arguments:
# Get the proportion of stores of each type
store_props = stores["store_type"].value_counts(normalize=True)
# Count the number of each department number and sort
dept_counts_sorted = departments["department_num"].value_counts(sort=True)
summary stats can be calculated for a column for different categories of rows using the groupby()
method:
sales_by_type = sales.groupby("type")["weekly_sales"].sum()
combining this with the agg method:
sales_stats = sales.groupby("type")["weekly_sales"].agg([np.max, np.min, np.mean, np.median])
Pivot tables:
make data queries much simpler. pivot_table uses the mean function by default.
index: the column used to categorize the data.
values: the quantitative column to calculate a descriptive stat for.
columns: a second level of categorization, resulting in a tabular pivot table.
margins: descriptive stats for the rows and columns of 2D pivot tables.
fill_value: default value for null values.
aggfunc: the function performed on the values.
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(index="type", values="weekly_sales")
Summary stats can be calculated for pivot tables. the mean() method returns the mean of each column by default, and the mean of each row with axis="columns".
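A toy pivot_table run using the arguments described above (made-up sales numbers):

```python
import pandas as pd

sales = pd.DataFrame({
    "type":         ["A", "A", "B", "B", "A"],
    "is_holiday":   [False, True, False, True, False],
    "weekly_sales": [100.0, 50.0, 200.0, 80.0, 120.0],
})

pivot = sales.pivot_table(index="type",          # rows: one per store type
                          columns="is_holiday",  # second categorization level
                          values="weekly_sales", # stat computed on this column
                          fill_value=0,          # replaces missing combinations
                          margins=True)          # adds "All" row/column means
print(pivot)
```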
DataFrame indexing:
setting an index is an optional design choice.
you can set an index with the set_index() method, with one or several columns as the index:
temperatures_ind = temperatures.set_index("city")
you can also remove the indexing. the drop argument is for deleting or keeping the index data:
# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True))
the main feature of indexes is the loc and iloc accessors, which give a much simpler way of filtering data than [] subsetting:
# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])
#vs:
# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])
multilevel indexes use a list of columns for set_index(), and a list of tuples for loc:
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country", "city"])
# Subset for rows to keep
print(temperatures_ind.loc[[("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]])
the sort_index() method can be used to sort a dataframe based on single- or multi-level indexes:
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())
# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level="city"))
# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False]))
Multi-level index slicing is done using tuples:
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")])
example of column-and-row indexing:
# Subset in both directions at once
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"),
"date":"avg_temp_c"])
Note that indexing using loc is inclusive on both ends, unlike list and iloc indexing, which exclude the end.
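A quick demonstration of the inclusivity difference (hypothetical index labels):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

print(df.loc["b":"d"])  # rows b, c AND d: both ends included
print(df.iloc[1:3])     # rows b and c only: end excluded, like lists
```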
Creating DataFrames:
DataFrames can be created from a dictionary of lists (each list being a column) or a list of dictionaries (each dict being a row), using the pd.DataFrame() function. The .to_csv("csv_file_name") method does the reverse of this.
A common way of creating DFs is the read_csv() function, which needs the name or path of the CSV file as a string. It can take several arguments, such as index_col="column_name", parse_dates=True or parse_dates=["column_name"], etc. Reindexing is done afterwards with the .reindex(list_of_desired_index_values) method; chaining .reindex(...).ffill() forward-fills NaNs.
Several methods are available after creating a DF, such as .sort_values(), .dropna(), etc.
Joining DataFrames
The .append() method and the concat() function accomplish this. concat is more flexible.
The .append() method joins two DataFrames (or Series):
df3 = df1.append(df2) # will retain the previous index values
df3 = df1.append(df2).reset_index() # will reset the indexes to have unique values
concat takes a list of DataFrames (or Series):
df3 = pd.concat([df1, df2]) # will retain the previous index values
df3 = pd.concat([df1, df2]).reset_index() # will reset the indexes to have unique values
The optional axis=0 ('index') and axis=1 ('columns') arguments can be used to specify the concatenation axis.
Horizontal concatenating is actually an outer join.
If inner joining is desired, the optional join=’inner’ argument can be used.
If the DFs have repeating index values, the optional keys= argument (a list) can be used to specify an outer index value related to each DF, to avoid ambiguity:
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)
The keys argument actually works with both vertical and horizontal concatenating.
The concat function can also be performed on a dictionary of DFs! Here, the dictionary keys will act
as the keys argument for a list concat. #fascinating.
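A sketch of concat on a dictionary of DFs, with made-up rainfall numbers:

```python
import pandas as pd

rain2013 = pd.DataFrame({"rain_mm": [30, 40]})
rain2014 = pd.DataFrame({"rain_mm": [25, 55]})

# The dict keys act like the keys= argument: they become
# the outer level of the resulting MultiIndex
rain = pd.concat({2013: rain2013, 2014: rain2014})
print(rain.loc[2014])
```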
Merging DataFrames
The pd.merge() function performs an inner join on two DFs by default. the on= argument takes a column (or list of columns) to join on, just like SQL joining:
mergedDF = pd.merge(DF1, DF2, on=['col1', 'col2'])
If, for example, 'col1' is in both DFs and listed in on=, the merged DataFrame will have only one copy of that column; otherwise, the suffixes _x and _y specify the parent DF for each shared column. The suffixes can be changed using the suffixes= argument:
mergedDF = pd.merge(DF1, DF2, on=['col1', 'col2'], suffixes=['_DF1', '_DF2'])
The on= argument works only when the desired joining column has the same name in both DFs. If they differ in name, two arguments, left_on='DF1_column_name' and right_on='DF2_column_name', must be used.
pd.merge() can also perform a left join by using the argument how='left'. how='inner' is the default behavior. how='right' and how='outer' are also possible.
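A small left-join sketch with suffixes (made-up keys and values):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b"], "val": [10, 20]})

# how='left' keeps every row of the left DF; unmatched rows get NaN
merged = pd.merge(left, right, on="key", how="left",
                  suffixes=("_left", "_right"))
print(merged)
```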
Joining DFs
Another solution is the merged_DF = DF1.join(DF2) method. join, too, can take a how argument.
4. Data visualization with Matplotlib
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()
This creates a framework to work with for plotting. Multiple datasets can be drawn on the same Axes with repeated ax.plot() calls.
The ax.plot() method can have several optional arguments besides the necessary X and Y axis,
including color=, marker=, and linestyle=:
# Plot Seattle data, setting data appearance
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], color='b', marker='o', linestyle='--')
The default behavior of fig, ax = plt.subplots() is fig, ax = plt.subplots(1, 1).
Passing numbers other than 1 (for the number of rows and columns, in order) allows us to work
with an array of plots. Indexing each subplot is similar to array indexing:
# In the top right (index 0,1), plot month and Seattle temperatures
ax[0, 1].plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
If the array is one-dimensional, only one number is needed for indexing, e.g. ax[2].
Time-series data with Matplotlib
Typically, we start with something like this:
climate_change = pd.read_csv('climate_change.csv', parse_dates=["date"], index_col="date")
The whole process can be summarized with a function:
# Define a function called plot_timeseries
def plot_timeseries(axes, x, y, color, xlabel, ylabel):
    # Plot the inputs x,y in the provided color
    axes.plot(x, y, color=color)
    # Set the x-axis label
    axes.set_xlabel(xlabel)
    # Set the y-axis label
    axes.set_ylabel(ylabel, color=color)
    # Set the color of the y-axis tick params
    axes.tick_params('y', colors=color)
Annotation example:
ax.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1))
which has the following options:
ax2.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1), xytext=(pd.Timestamp('2008-10-06'), -0.2), arrowprops={'arrowstyle':'->', 'color':'gray'})
Quantitative (bar-plots, histograms):
# Plot a bar-chart of gold medals as a function of country
ax.bar(medals.index, medals['Gold'])
# Set the x-axis tick labels to the country names
ax.set_xticklabels(medals.index, rotation = 90)
# Set the y-axis label
ax.set_ylabel("Number of medals")
plt.show()
Stacked bar-plot:
# Add bars for "Gold" with the label "Gold"
ax.bar(medals.index, medals['Gold'], label='Gold')
# Stack bars for "Silver" on top with label "Silver"
ax.bar(medals.index, medals['Silver'], bottom=medals['Gold'], label = 'Silver')
# Stack bars for "Bronze" on top of that with label "Bronze"
ax.bar(medals.index, medals['Bronze'], bottom=medals['Gold'] + medals['Silver'], label='Bronze')
Histogram:
fig, ax = plt.subplots()
# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"], histtype='step', label="Rowing", bins=5)
# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"], histtype='step', label="Gymnastics", bins=5)
ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")
Error bars:
Can be added with the argument yerr=DF['column_name'].std() to the ax.bar() method; or, for a time plot, with the ax.errorbar(x, y, yerr=DF['y_sd']) method.
ax.bar("Rowing", mens_rowing["Height"].mean(), yerr=mens_rowing["Height"].std())
Box-plots:
Take a list of DF columns:
ax.boxplot([mens_rowing["Height"], mens_gymnastics["Height"]])
Labeling is also done using a list:
ax.set_xticklabels(["Rowing", "Gymnastics"])
Scatter plots can use the c= and s= arguments to show additional data with color and size,
respectively.
# Add data: "co2", "relative_temp" as x-y, index as color
ax.scatter(climate_change['co2'], climate_change['relative_temp'], c=climate_change.index)
PLT exporting
plt.style.use('style_name') # e.g. 'ggplot'
fig.savefig('file_name.png', dpi=300)
fig.set_size_inches([5, 3])
Seaborn
import seaborn as sns
we start with sns.countplot() and sns.scatterplot():
sns.countplot(y=region) # region is a list or Series of values
you can use a pandas DataFrame in seaborn with the data= argument:
sns.countplot(x='Spiders', data=df)
x and y then take DataFrame column names.
Additional information can be shown using the hue= argument, which will take another dataframe
column or a list as an input:
sns.scatterplot(x='absences', y='G3', data=student_data, hue='location')
When working with hue, options such as palette= and hue_order=[] are available:
# Create a dictionary mapping subgroup values to colors
palette_colors = {'Rural': "green", 'Urban': "blue"}
# Create a count plot of school with location subgroups
sns.countplot(x='school', data=student_data, hue='location', palette=palette_colors)
relplot() is the more sophisticated way of working with seaborn, since it allows multiple subplots. Here you'll need a kind="" argument to specify the type of plot you want. You'll also need the col= and row= arguments to use multiple subplots. More specific subplot allocation uses the col_wrap= and col_order= arguments.
sns.relplot(x="absences", y="G3",
data=student_data,
kind="scatter")
sns.relplot(x="G1", y="G3",
data=student_data,
kind="scatter",
col="schoolsup",
col_order=["yes", "no"],
row='famsup',
row_order=['yes', 'no'])
Further customizations:
size=, alpha=, style=:
sns.relplot(x="horsepower", y="mpg",
data=mpg, kind="scatter",
size="cylinders",
hue='cylinders')
sns.relplot(kind='scatter', data=mpg,
x='acceleration', y='mpg',
style='origin', hue='origin')
Line plots:
sns.relplot(x="model_year", y="horsepower",
data=mpg, kind="line",
ci=None, style="origin",
hue="origin",
dashes=False,
markers=True)
Catplot()
Has the same col= and row= parameters as relplot()
# Create column subplots based on age category
sns.catplot(y="Internet usage", col='Age Category', data=survey_data, kind="count")
So far we have seen kind=’bar’ and kind=’count’.
Boxplot: kind=’box’
# Create a box plot with subgroups and omit the outliers
sns.catplot(data=student_data, kind='box',
x='internet', y='G3',
hue='location')
Pointplot: kind=’point’
sns.catplot(x="romantic", y="absences",
data=student_data,
kind="point",
hue="school",
ci=None)
Style and other customizations
sns.set_style():
'white', 'dark', 'whitegrid', 'darkgrid' and 'ticks' are the 5 default styles.
sns.set_palette():
several palette styles are available, including diverging (for showing dichotomies) and sequential
(for showing a possible trend).
Two examples are 'Purples' and 'RdBu'.
sns.set_context():
Sets the scale of the plot elements for use in different contexts, including 'paper', 'notebook', 'talk', 'poster'.
Facet Grid and subplots
catplot() and relplot() create FacetGrid objects, which can be assigned to a variable. Setting a title on the FacetGrid tells seaborn we want a title for the whole plot, not an individual subplot. If we want a title for a subplot, we should assign it to an AxesSubplot object.
FacetGrid title:
# Create scatter plot
g = sns.relplot(x="weight",
y="horsepower",
data=mpg,
kind="scatter")
# Add a title "Car Weight vs. Horsepower"
g.fig.suptitle("Car Weight vs. Horsepower")
AxesSubplot:
# Create box plot (sns.boxplot returns an AxesSubplot; it has no kind= argument)
g = sns.boxplot(x="weight",
y="horsepower",
data=mpg)
# Add a title "Car Weight vs. Horsepower"
g.set_title("Car Weight vs. Horsepower", y=1.03) # the y= argument sets the height of the title
Tick and label options:
plt.xticks():
# Create point plot
sns.catplot(x="origin",
y="acceleration",
data=mpg,
kind="point",
join=False,
capsize=0.1)
# Rotate x-tick labels
plt.xticks(rotation=90)
Labels:
# Create line plot
g = sns.lineplot(x="model_year", y="mpg_mean",
data=mpg_mean,
hue="origin")
# Add a title "Average MPG Over Time"
g.set_title("Average MPG Over Time")
# Add x-axis and y-axis labels
g.set(xlabel='Car Model Year', ylabel='Average MPG')
Examples putting it all together:
# Set palette to "Blues"
sns.set_palette("Blues")
# Adjust to add subgroups based on "Interested in Pets"
g = sns.catplot(x="Gender",
y="Age", data=survey_data,
kind="box", hue='Interested in Pets')
# Set title to "Age of Those Interested in Pets vs. Not"
g.fig.suptitle("Age of Those Interested in Pets vs. Not")
# Show plot
plt.show()
# Set the figure style to "dark"
sns.set_style("dark")
# Adjust to add subplots per gender
g = sns.catplot(x="Village - town", y="Likes Techno",
data=survey_data, kind="bar",
col='Gender')
# Add title and axis labels
g.fig.suptitle("Percentage of Young People Who Like Techno", y=1.02)
g.set(xlabel="Location of Residence",
ylabel="% Who Like Techno")
# Show plot
plt.show()