Lecture 2 : Data wrangling
2025 Spring, Intro to Data Science
Congratulations!!!
You have collected or have been given
a box of data.
What does this "data" actually look like?
How will you work with it?
Data Scientists Love Tabular Data
• Tabular data = data in a table.
• Typically:
• A row represents one observation (here, a single person running for president in a particular year).
• A column represents some characteristic, or feature, of that observation (here, the political party of
that person).
Standard Python Data Science Tool: pandas
Using pandas, we can:
• Arrange data in a tabular format.
• Extract useful information filtered by specific conditions.
• Operate on data to gain new insights.
• Apply NumPy functions to our data.
Stands for "panel data"
DataFrames
• In the "language" of pandas, we call a table a DataFrame.
• We think of DataFrames as collections of named columns, called Series.
Series - Custom Index
• We can provide index labels for items in a Series by passing an index list.
• A Series index can also be changed.
Selection in Series
• We can select a single value or a set of values in a Series using:
• A single label
• A list of labels
• A filtering condition
Selection in Series
• Say we want to select values in the Series that satisfy a particular condition:
1. Apply a boolean condition to the Series. This creates a new Series of boolean values.
2. Index into our Series using this boolean condition. pandas will select only the entries in the
Series that satisfy the condition.
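The two steps above can be sketched with a toy Series (the values here are illustrative):

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# Step 1: applying a boolean condition produces a new Series of booleans.
mask = s > 0  # a → True, b → False, c → False, d → True

# Step 2: indexing with that boolean Series keeps only the True entries.
print(s[mask])  # keeps 4 (label "a") and 6 (label "d")
```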
DataFrames of Series!
• Typically, we will work with Series using the perspective that they are columns in a DataFrame.
• We can think of a DataFrame as a collection of Series that all share the same Index.
Creating a DataFrame
• The syntax for creating a DataFrame is:
pandas.DataFrame(data, index, columns)
• Many approaches exist for creating a DataFrame. Here, we will go over the
most popular ones.
• From a CSV file.
• Using a list and column name(s).
• From a dictionary.
• From a Series.
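A quick sketch of all four approaches; the file path, column names, and values below are hypothetical:

```python
import pandas as pd

# 1. From a CSV file (hypothetical path):
# elections = pd.read_csv("elections.csv")

# 2. From a list plus column name(s):
df_list = pd.DataFrame([1, 2, 3], columns=["Number"])

# 3. From a dictionary mapping column names to column values:
df_dict = pd.DataFrame({"Fruit": ["Strawberry", "Orange"],
                        "Price": [5.49, 3.99]})

# 4. From a Series (a named Series becomes a one-column DataFrame):
s = pd.Series(["a", "b"], name="Letter")
df_series = s.to_frame()
```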
Indices Are Not Necessarily Row Numbers
An Index (a.k.a. row labels) can also:
• Be non-numeric.
• Have a name, e.g. "Candidate".
Indices Are Not Necessarily Unique
The row labels that constitute an index do not have to be unique.
• Left: The index values are all unique and numeric, acting as a row number.
• Right: The index values are named and non-unique.
Modifying Indices
• We can select a new column and set it as the index of the DataFrame.
• Example: Setting the index to the "Party" column.
Resetting the Index
• We can change our mind and reset the Index back to the default list of integers.
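A sketch of setting and resetting the index, on a small stand-in for the lecture's elections table (the rows here are invented):

```python
import pandas as pd

elections = pd.DataFrame({"Candidate": ["Carter", "Reagan"],
                          "Party": ["Democratic", "Republican"]})

# Set the "Party" column as the index...
by_party = elections.set_index("Party")
print(by_party.index.tolist())  # ['Democratic', 'Republican']

# ...then change our mind and reset to the default integer index.
restored = by_party.reset_index()
print(restored.index.tolist())  # [0, 1]
```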
Column Names Are Usually Unique!
• Column names in pandas are almost always unique.
• Example: You really shouldn’t have two columns named "Candidate".
Retrieving the Index, Columns, and shape
• Sometimes you'll want to extract the list of row and column labels.
• For row labels, use DataFrame.index:
• For column labels, use DataFrame.columns:
• For the shape of the DataFrame, use DataFrame.shape:
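All three attributes on a tiny illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Candidate": ["Carter", "Reagan"],
                   "Year": [1976, 1980]})

print(df.index)    # RangeIndex(start=0, stop=2, step=1)
print(df.columns)  # Index(['Candidate', 'Year'], dtype='object')
print(df.shape)    # (2, 2) → (number of rows, number of columns)
```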
The Relationship Between DataFrames, Series, and Indices
• We can think of a DataFrame as a collection of Series that all share the
same Index.
• Candidate, Party, %, Year, and Result Series all share an Index from 0 to
5.
Data Science Lifecycle
Recall the Data Science Lifecycle
Data wrangling / visualization
Intro to Data Wrangling
• Raw data is often messy, incomplete, and inconsistent
• Data wrangling is the process of cleaning, transforming, and organizing data into a
structured and usable format for analysis.
• The purpose is to ensure that data is accurate, consistent, and ready for modeling and
visualization.
Intro to Data Wrangling
• Hadley Wickham says that five verbs solve 90% of data-manipulation challenges:*
• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapses a data frame to a single row;
• We will cover essential tools for implementing these verbs using pandas.
* Grammar of Data Manipulation with R (2020)
1. Extracting Data (filter, select)
• One of the most basic tasks for manipulating a DataFrame is to extract rows and columns of
interest.
1. Extracting Data (filter, select)
• Common ways we may want to extract data:
• Grab the first or last k rows in the DataFrame.
• Grab data with a certain label.
• Grab data at a certain position.
• We'll find that all three of these methods are useful to us in data
manipulation tasks.
.head and .tail
• The simplest scenarios: We want to extract the first or last n rows from the
DataFrame.
• df.head(k) will return the first k rows of the DataFrame df.
• df.tail(k) will return the last k rows.
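A minimal illustration (the DataFrame here is a toy with 10 rows):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

print(df.head(3))  # first 3 rows: x = 0, 1, 2
print(df.tail(2))  # last 2 rows:  x = 8, 9
```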
Label-based Extraction: .loc
• A more complex task: We want to extract data with specific column or index labels.
• The .loc accessor allows us to specify the labels of rows and columns we wish to extract.
• We describe "labels" as the bolded text at the top and left of a DataFrame.
Label-based Extraction: .loc
• Arguments to .loc can be:
• A list.
• A slice (syntax is inclusive of the right hand side of the slice).
• A single value.
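The three argument forms in one sketch, on a small stand-in for the elections DataFrame (rows invented):

```python
import pandas as pd

elections = pd.DataFrame(
    {"Candidate": ["Carter", "Reagan", "Mondale", "Bush"],
     "Year": [1976, 1980, 1984, 1988]})

# A list of row labels and a list of column labels:
print(elections.loc[[0, 2], ["Candidate"]])

# A slice — note .loc slices INCLUDE the right endpoint:
print(elections.loc[0:2, "Candidate":"Year"])  # rows 0, 1, AND 2

# A single row label and column label → a single value:
print(elections.loc[1, "Candidate"])  # 'Reagan'
```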
• A single row label returns a Series; a single row and column label returns a single value (here, a string).
Integer-based Extraction: .iloc
• Arguments to .iloc can be:
• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.
Integer-based Extraction: .iloc
• Arguments to .iloc can be:
• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.
elections.iloc[[1, 2, 3], [0, 1, 2]]
Integer-based Extraction: .iloc
• Arguments to .iloc can be:
• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.
elections.iloc[[1, 2, 3], 0:3]
Integer-based Extraction: .iloc
• Arguments to .iloc can be:
• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.
elections.iloc[:, 0:3]
Integer-based Extraction: .iloc
• Arguments to .iloc can be:
• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.
elections.iloc[[1, 2, 3], 1]
elections.iloc[0, 1]
.loc vs .iloc
• Remember:
• .loc performs label-based extraction
• .iloc performs integer-based extraction
• When choosing between .loc and .iloc, you'll usually choose .loc.
• Safer: If the order of data gets shuffled in a public database, your code still works.
• Readable: Easier to understand what elections.loc[:, ["Year", "Candidate", "Result"]]
means than elections.iloc[:, [0, 1, 4]]
• .iloc can still be useful.
• Example: If you have a DataFrame of movie earnings sorted by earnings, can use .iloc to get the
median earnings for a given year (index into the middle).
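The median-earnings idea can be sketched as follows; the movie data is hypothetical:

```python
import pandas as pd

# Hypothetical movie earnings, sorted by earnings.
movies = pd.DataFrame({"Title": ["A", "B", "C", "D", "E"],
                       "Earnings": [10, 40, 55, 80, 120]}).sort_values("Earnings")

# With an odd number of rows, the middle integer position holds the median.
middle = len(movies) // 2
print(movies.iloc[middle]["Earnings"])  # 55
```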
Context-dependent Extraction: [ ]
• Selection operators:
• .loc selects items by label. First argument is rows, second argument is columns.
• .iloc selects items by integer. First argument is rows, second argument is columns.
• [] only takes one argument, which may be:
• A slice of row numbers.
• A list of column labels.
• A single column label.
• That is, [] is context sensitive.
Context-dependent Extraction: [ ]
•[] only takes one argument, which may be:
• A slice of row integers.
• A list of column labels.
• A single column label.
elections[3:7]
Context-dependent Extraction: [ ]
•[] only takes one argument, which may be:
• A slice of row integers.
• A list of column labels.
• A single column label.
elections[["Year", "Candidate", "Result"]]
Context-dependent Extraction: [ ]
•[] only takes one argument, which may be:
• A slice of row integers.
• A list of column labels.
• A single column label.
elections["Candidate"]
Why Use []?
• In short: [ ] can be much more concise than .loc or .iloc
• Consider the case where we wish to extract the "Candidate" column. It is far simpler to write
elections["Candidate"] than it is to write elections.loc[:, "Candidate"]
• In practice, [ ] is often used over .iloc and .loc in data science work. Typing time adds up!
Boolean Array Input for .loc and [ ]
• We learned to extract data according to its integer position (.iloc) or its label (.loc)
• What if we want to extract rows that satisfy a given condition?
• .loc and [ ] also accept boolean arrays as input.
• Rows corresponding to True are extracted; rows corresponding to False are not.
babynames_first_10_rows = babynames.loc[:9, :]
Boolean Array Input for .loc and [ ]
We can perform the same operation using .loc.
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
Boolean Array Input
Useful because boolean arrays can be generated by using logical operators on Series.
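A sketch of generating the boolean array with a comparison, on toy babynames-style data:

```python
import pandas as pd

babynames = pd.DataFrame({"Name": ["Ann", "Bob", "Ann"],
                          "Count": [10, 5, 7]})

# Comparing a Series to a value produces a boolean Series...
mask = babynames["Name"] == "Ann"  # [True, False, True]

# ...which [ ] (or .loc) uses to keep only the True rows.
print(babynames[mask])  # rows 0 and 2
```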
Boolean Array Input
Can also use .loc.
Boolean Array Input
• Boolean Series can be combined using various operators, allowing filtering of results by
multiple criteria.
• The & operator keeps rows where condition_1 and condition_2 both hold.
• The | operator keeps rows where condition_1 or condition_2 holds.
babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]
Bitwise Operators
• & and | are examples of bitwise operators. They allow us to apply multiple logical conditions.
• If p and q are boolean arrays or Series:
• Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions.
babynames[(babynames["Name"] == "Bella") | (babynames["Name"] == "Alex") |
(babynames["Name"] == "Narges") | (babynames["Name"] == "Lisa")]
• pandas provides many alternatives, for example:
• .isin
names = ["Bella", "Alex", "Narges", "Lisa"]
babynames[babynames["Name"].isin(names)]
• .str.startswith
babynames[babynames["Name"].str.startswith("N")]
2. Adding Data
• Remember the five verbs
• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapses a data frame to a single row;
Syntax for Adding a Column
•Adding a column is easy:
1. Use [ ] to reference the desired new column.
2. Assign this column to a Series or array of the appropriate length.
Syntax for Modifying a Column
• Modifying a column is very similar to adding a column.
1.Use [ ] to reference the existing column.
2.Assign this column to a new Series or array of the appropriate length.
# Modify the "name_lengths" column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"]-1
Syntax for Renaming a Column
• Rename a column using the (creatively named) .rename() method.
•.rename() takes in a dictionary that maps old column names to their new ones.
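A one-line sketch of the dictionary-based rename:

```python
import pandas as pd

babynames = pd.DataFrame({"name_lengths": [8, 3]})

# The dictionary maps old column names to new ones.
babynames = babynames.rename(columns={"name_lengths": "Length"})
print(babynames.columns.tolist())  # ['Length']
```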
Syntax for Dropping a Column (or Row)
•Remove columns using the (also creatively named) .drop method.
•The .drop() method assumes you're dropping a row by default. Use axis="columns" to drop a column
instead.
babynames = babynames.drop("Length", axis="columns")
An Important Note: DataFrame Copies
• Notice that we re-assigned babynames to an updated value on the previous slide.
babynames = babynames.drop("Length", axis="columns")
• By default, pandas methods create a copy of the DataFrame, without changing the original
DataFrame at all.
•To apply our changes, we must update our DataFrame to this new, modified copy.
3. Arrange rows
• Remember the five verbs
• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapses a data frame to a single row;
.sort_values()
• The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).
• Series.sort_values( ) will automatically sort all values in the Series.
• DataFrame.sort_values(column_name) must specify the name of the column to be used for
sorting.
babynames["Name"].sort_values()
babynames.sort_values(by="Count", ascending=False)
Sorting By Length
• Let’s try to solve the sorting problem with different approaches
Approach 1: Create a Temporary Column and Sort Based on the New Column
• Sorting the DataFrame as usual
Approach 2: Sorting Using the key Argument
Approach 3: Create a Temporary Column Using .map and Sort on It
• Suppose we want to sort by the number of occurrences of "dr" and "ea"s.
• Use the Series.map method.
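A sketch of this approach on toy names; the counting function and temporary column name are assumptions:

```python
import pandas as pd

babynames = pd.DataFrame({"Name": ["Deandra", "Bob", "Drew"]})

# Count occurrences of "dr" and "ea" in each (lower-cased) name.
def dr_ea_count(name):
    name = name.lower()
    return name.count("dr") + name.count("ea")

# Temporary column via Series.map, sort on it, then drop it.
babynames["dr_ea"] = babynames["Name"].map(dr_ea_count)
babynames = (babynames.sort_values("dr_ea", ascending=False)
                      .drop("dr_ea", axis="columns"))
print(babynames["Name"].tolist())  # ['Deandra', 'Drew', 'Bob']
```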
4. Summarize Data
• Remember the five verbs
• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapses a data frame to a single row;
Why Group?
Our goal:
• Group together rows that fall under the same category.
• For example, group together all rows from the same year.
• Perform an operation that aggregates across all rows in the category.
• For example, sum up the total number of babies born in that year.
Why Group?
.groupby()
• A .groupby() operation involves some combination of splitting the object, applying a
function, and combining the results.
• Calling .groupby() generates DataFrameGroupBy objects → "mini" sub-DataFrames
• Each subframe contains all rows that correspond to the same group (here, a particular year)
.groupby().agg()
• We cannot work directly with DataFrameGroupBy objects! The diagram below is to help understand what goes on
conceptually – in reality, we can't "see" the result of calling .groupby.
• Instead, we transform a DataFrameGroupBy object back into a DataFrame using .agg
• .agg is how we apply an aggregation operation to the data.
Putting it all together
dataframe.groupby(column_name).agg(aggregation_function)
•babynames[["Year", "Count"]].groupby("Year").agg(sum) computes the total number of
babies born in each year.
Alternatives …
• Now, we create groups for each year.
babynames.groupby("Year")[["Count"]].agg(sum)
or
babynames.groupby("Year")[["Count"]].sum()
or
babynames.groupby("Year").sum(numeric_only=True)
Concluding groupby.agg
• A groupby operation involves some combination of splitting the object, applying a function, and combining the
results.
• So far, we've seen that df.groupby("Year").agg(sum):
• Split df into sub-DataFrames based on Year.
• Apply the sum function to each column of each sub-DataFrame.
• Combine the results of sum into a single DataFrame, indexed by Year.
Groupby Review Question
Answer
Case Study: Name "Popularity"
• Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.
f_babynames = babynames[babynames["Sex"]=="F"]
f_babynames = f_babynames.sort_values(["Year"])
jenn_counts_series = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]
Number of Jennifers Born in California Per Year.
What Is "Popularity"?
Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.
How do we define "fallen in popularity?"
• Let’s create a metric: "Ratio to Peak" (RTP).
• The RTP is the ratio of babies born with a given name in 2022 to the maximum number of babies born
with that name in any year.
Example for "Jennifer":
• In 1972, we hit peak Jennifer. 6,065 Jennifers were born.
• In 2022, there were only 114 Jennifers.
• RTP is 114 / 6065 = 0.018796372629843364.
Calculating RTP
max_jenn = max(f_babynames[f_babynames["Name"]=="Jennifer"]["Count"])
6065
curr_jenn = f_babynames[f_babynames["Name"]=="Jennifer"]["Count"].iloc[-1]
114
Remember: f_babynames is sorted by year, so .iloc[-1] means "grab the latest year."
rtp = curr_jenn / max_jenn
0.018796372629843364
def ratio_to_peak(series):
return series.iloc[-1] / max(series)
jenn_counts_ser = f_babynames[f_babynames["Name"]=="Jennifer"]["Count"]
ratio_to_peak(jenn_counts_ser)
0.018796372629843364
Calculating RTP Using .groupby()
• .groupby() makes it easy to compute the RTP for all names at once!
rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
Renaming Columns After Grouping
• By default, .groupby will not rename any aggregated columns (the column is still named "Count", even
though it now represents the RTP).
• For better readability, we may wish to rename "Count" to "Count RTP".
rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns={"Count":"Count RTP"})
Some Data Science Payoff
• By sorting rtp_table we can see the names whose popularity has decreased the most.
rtp_table.sort_values("Count RTP")
Some Data Science Payoff
• We can get the list of the top 10 names and then plot popularity with:
top10 = rtp_table.sort_values("Count RTP").head(10).index
Raw GroupBy Objects and Other Methods
•The result of a groupby operation applied to a DataFrame is a DataFrameGroupBy object.
•It is not a DataFrame!
grouped_by_year = elections.groupby("Year")
type(grouped_by_year)
• Given a DataFrameGroupBy object, we can use various functions to generate DataFrames (or Series);
agg is only one choice:
groupby.size() and groupby.count()
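A sketch of the difference between the two on toy data with a missing value:

```python
import pandas as pd

df = pd.DataFrame({"Year": [2020, 2020, 2021],
                   "Count": [10, None, 7]})

# .size() → number of rows in each group, including rows with NaN.
print(df.groupby("Year").size())   # 2020 → 2, 2021 → 1

# .count() → number of NON-null entries per column in each group.
print(df.groupby("Year").count())  # Count: 2020 → 1, 2021 → 1
```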
Filtering by Group
•Another common use for groups is to filter data.
• groupby.filter takes an argument func.
• func is a function that:
• Takes a DataFrame as input.
• Returns either True or False.
• filter applies func to each group/sub-DataFrame:
• If func returns True for a group, then all rows belonging to the group are preserved.
• If func returns False for a group, then all rows belonging to that group are filtered out.
• Notes:
• Filtering is done per group, not per row. Different from boolean filtering.
• Unlike agg(), the column we grouped on does NOT become the index!
groupby.filter()
Filtering Elections Dataset
• Going back to the elections dataset.
• Let's keep only election year results where the max '%' is less than 45%.
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)
groupby Quiz
• We want to know the best election by each party.
groupby Quiz
• We want to know the best election by each party.
• Best election: The election with the highest % of votes.
• For example, Democrat’s best election was in 1964, with candidate Lyndon Johnson
winning 61.3% of votes.
Attempt #1
• Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?
elections.groupby("Party").max().head(10)
Problem with Attempt #1
• Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?
• Every column is calculated independently! Among Democrats:
• Last year they ran: 2020.
• Alphabetically the latest candidate name: Woodrow Wilson.
• Highest % of vote: 61.34%.
Attempt #2: Motivation
• We want to preserve entire rows, so we need an aggregate function that does that.
Attempt #2: Solution
Attempt #2: Solution
• First sort the DataFrame so that rows are in descending order of %.
• Then group by Party and take the first item of each sub-DataFrame.
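The sort-then-first idea, sketched on a few invented election rows:

```python
import pandas as pd

elections = pd.DataFrame({
    "Party": ["Democratic", "Democratic", "Republican"],
    "Candidate": ["Johnson", "Carter", "Reagan"],
    "%": [61.3, 50.1, 50.7]})

# Sort so the highest % in each party comes first, then keep that whole row.
best = elections.sort_values("%", ascending=False).groupby("Party").first()
print(best.loc["Democratic", "Candidate"])  # 'Johnson'
```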
Grouping by Multiple Columns
• Suppose we want to build a table showing the total number of babies born of each sex in each year.
• One way is to groupby using both columns of interest:
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)
Pivot function
# Note: DataFrame.pivot() does not aggregate; since (Year, Sex) pairs repeat,
# we use pivot_table(), which accepts an aggfunc.
babynames_pivot = babynames.pivot_table(
    index = "Year",     # rows (turned into index)
    columns = "Sex",    # column values
    values = ["Count"], # field(s) to process in each group
    aggfunc = np.sum,   # group operation
)
babynames_pivot.head(6)
groupby(["Year", "Sex"]) vs. pivot
• The pivot() output more naturally represents our data.
Pivot output
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)
Pivot Tables with Multiple Values
• pivot_table() works like pivot(), but it can aggregate and take multiple columns as values.
babynames_pivot = babynames.pivot_table(
    index = "Year",             # rows (turned into index)
    columns = "Sex",            # column values
    values = ["Count", "Name"],
    aggfunc = np.max,           # group operation
)
babynames_pivot.head(6)
Pivot Table Mechanics
Where are we?
• 5 verbs for data manipulation
• Joining tables
• Tidy data
Joining Tables
• Suppose we want to know the popularity of presidential candidates' names in 2022.
• Example: Dwight Eisenhower's name Dwight is not popular today, with only 5 babies born
with this name in California in 2022.
• To solve this problem, we’ll have to join tables.
Creating Table 1: Babynames in 2022
• Let's set aside names in California from 2022 first:
babynames_2022 = babynames[babynames["Year"] == 2022]
babynames_2022
Creating Table 2: Presidents with First Names
•To join our table, we’ll also need to set aside the first names of each candidate.
elections["First Name"] = elections["Candidate"].str.split().str[0]
Joining Our Tables: Two Options
merged = pd.merge(left = elections, right = babynames_2022, left_on = "First Name", right_on = "Name")
merged = elections.merge(right = babynames_2022, left_on = "First Name", right_on = "Name")
Tidy data
• You can represent the same underlying data in multiple ways.
• Let’s see an example table that records: country, year, population, and number of
documented cases of TB (tuberculosis).
• Which one is the best representation?
Another example
<Table 1>
<Table 2>
Tidy data
• Tidy data is data where:
• 1. Each variable is in a column.
• 2. Each observation is a row.
• 3. Each value is a cell.
Tidy data
Tidy data –cont.
• Why tidy data?
• 1. You can easily add observations
• 2. You can easily add columns
• 3. Many python packages are designed with tidy data in mind
Two verbs for tidy data
<pivot>
• It makes "long" data wider.
• Useful when observations are spread over multiple rows.
<melt>
• It makes "wide" data longer.
• Useful when multiple observations are stored in one row.
melt()
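A sketch of melt on a small "wide" TB-style table (the counts here are illustrative):

```python
import pandas as pd

# A "wide" table: one row per country, one column per year.
wide = pd.DataFrame({"country": ["Afghanistan", "Brazil"],
                     "1999": [745, 37737],
                     "2000": [2666, 80488]})

# melt turns the year columns into (year, cases) rows — longer, tidier data.
long = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(long)  # 4 rows: one per (country, year) observation
```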