|
|
From Pandas Workout by Reuven Lerner This article discusses cleaning data to use with Pandas. |
Take 35% off Pandas Workout by entering fcclerner2 into the discount code box at checkout at manning.com.
In the late 1980s, I worked for a company that wanted to know how much rain had fallen in a large number of US cities. Their solution? They gave me a list of cities and phone numbers, and asked me to call each of them in sequence, recording the previous day’s rainfall in an Excel spreadsheet. Nowadays, getting that sort of information - and many other types of information — is pretty easy. Not only do many governments provide data sets for free, but numerous companies make data available for a price. No matter what topic you’re researching, data is almost certainly available. The only questions are where you can get it, how much it’ll cost, and what format the data comes in.
Actually, you should ask another question, too: How accurate is the data you’re using?
All too often, we assume that if we’re downloading a CSV file from an official-looking Web site, the data it contains is good. But all too often, the data that we download has problems. That shouldn’t surprise us, given that the data comes from people (who can make a variety of types of mistakes) and machines (which make different types of mistakes). Maybe someone accidentally misnamed a file, or entered data into the wrong field. And maybe the automatic sensors whose inputs were used in collecting the data were broken, or offline. Maybe the servers were down for a day, or someone misconfigured the XML feed-reading system, or the routers were being rebooted, or a backhoe cut the Internet line.
All of this assumes that there was actually data to begin with. Often we’ll have missing data because there wasn’t any data to record.
This is why I’ve often heard data scientists say that 80 percent of their job involves cleaning data. What does it mean to “clean data”? Here is a partial list:
- rename columns
- rename the index
- remove irrelevant columns
- split one column into two
- combine two or more columns into one
- remove non-data rows
- remove repeated rows
- remove rows with missing data (aka
NaN) - replace
NaNdata with a single value - replace
NaNdata via interpolation - standardize strings
- fix typos in strings
- remove whitespace from strings
- correct the types used for columns
- identify and remove outliers
The importance of cleaning your data, and thus ensuring that your analysis is as accurate is possible, cannot be overstated.
In this article, we’ll thus be looking at a few pandas techniques for cleaning our data. First, we’ll look at a few ways in which we can handle NaN values.
Useful references
Table 1. What you need to know
|
Concept |
What is it? |
Example |
To learn more |
|
|
Returns a boolean series indicating where there are null (typically |
|
|
|
|
Returns a boolean data frame indicating where there are null (typically |
|
|
|
|
Replace values in one or more columns with other values |
|
|
|
|
Apply a function to each element of a series, returning the result of that application on each element |
|
|
|
|
Replace |
|
|
|
|
Remove rows with |
|
|
|
|
Working with textual data |
|
|
|
|
Reorder the rows of a data frame based on the values in its index, in ascending order |
|
|
|
|
Create a data frame based on an Excel spreadsheet |
|
|
|
|
returns a sorted (descending frequency) series counting how many times each value appears in |
|
|
|
|
returns a series with the unique (i.e., distinct) values in |
|
How much is missing?
We’ve already seen, on a number of occasions, that data frames (and series) can contain NaN values. One question we often want to answer is: How many NaN values are there in a given column? Or, for that matter, in a data frame?
One solution is to calculate things yourself. There is a count method you can run on a series, which returns the number of non-null values in the series. That, combined with the shape of the series, can tell you how many NaN values there are:
s.shape[0] - s.count() ❶
❶ Returns an integer, the number of null elements
This is tedious and annoying. And besides, shouldn’t pandas provide us with a way to do this? Indeed it does, in the form of the isnull method. If you call isnull on a column, it returns a boolean series, one that has True where there is a NaN value, and False in other places. You can then apply the sum method to the series, which will return the number of True values, thanks to the fact that Python’s boolean values inherit from integers, and can be in place of 1 (True) and 0 (False) if you need:
s.isnull().sum() ❶
❶ Calculate the number of NaN values in s
If you run isnull on a data frame, then you will get a new data frame back, with True and False values indicating whether there is a null value in that particular row-column combination. And of course, then you can run sum on the resulting data frame, finding how many NaN values there are in each column:
df.isnull().sum() ❶
❶ Calculate the number of NaN values in each column
Instead of summing the results of a call to isnull, you can also use the any and all methods, both of which return boolean values. any will return True for each row in which at least one of the values is True, and all will return True for each row in which all of the values are True. You can thus do the following:
df[df.isnull().all()] ❶
❶ Show only the rows without NaN
Finally, the df.info method returns a wealth of information about the data frame on which it is run, including the name and type of each column, a summary of how many columns there are of each type, and the estimated memory usage. If the data frame is small enough, then it’ll also show you how many null values there are in each column. However, this calculation can take some time. Thus, the df.info will only count null values below a certain threshold. If you’re above that threshold (the pd.options.display.max_info_columns option), then you’ll need to tell pandas explicitly to count, by passing show_counts=True:
df.info(show_counts=True) ❶
❶ Get full information about the data frame df, including the number of null values in each column
pandas defines both isna and isnull for both series and data frames. What’s the difference between them? Actually, there is no difference. If you look at the pandas documentation, you’ll find that they’re identical except for the name of the method being called. I mostly use isnull, but if you prefer to go with isna, then be my guest.
Note that both of these are different from np.isnan, a method defined in NumPy, on top of which pandas is defined. I try to stick with the methods that pandas defines, which integrate better into the rest of the system, in my experience.
Rather than using ~, which pandas uses to invert boolean series and data frames, you can often use the notnull methods, for both series and data frame.
That’s all for this article. If you want to see more, check out the book on Manning’s liveBook platform here.
