Cleaning Data

Description: https://images.manning.com/360/480/resize/book/0/e848891-d7dd-47e6-b200-d943af5a91d1/Lerner-MEAP-HI.png

From Pandas Workout by Reuven Lerner

This article discusses cleaning data to use with Pandas.

Take 35% off Pandas Workout by entering fcclerner2 into the discount code box at checkout at manning.com.

In the late 1980s, I worked for a company that wanted to know how much rain had fallen in a large number of US cities. Their solution? They gave me a list of cities and phone numbers, and asked me to call each of them in sequence, recording the previous day’s rainfall in an Excel spreadsheet. Nowadays, getting that sort of information - and many other types of information — is pretty easy. Not only do many governments provide data sets for free, but numerous companies make data available for a price. No matter what topic you’re researching, data is almost certainly available. The only questions are where you can get it, how much it’ll cost, and what format the data comes in.

Actually, you should ask another question, too: How accurate is the data you’re using?

All too often, we assume that if we’re downloading a CSV file from an official-looking Web site, the data it contains is good. But all too often, the data that we download has problems. That shouldn’t surprise us, given that the data comes from people (who can make a variety of types of mistakes) and machines (which make different types of mistakes). Maybe someone accidentally misnamed a file, or entered data into the wrong field. And maybe the automatic sensors whose inputs were used in collecting the data were broken, or offline. Maybe the servers were down for a day, or someone misconfigured the XML feed-reading system, or the routers were being rebooted, or a backhoe cut the Internet line.

All of this assumes that there was actually data to begin with. Often we’ll have missing data because there wasn’t any data to record.

This is why I’ve often heard data scientists say that 80 percent of their job involves cleaning data. What does it mean to “clean data”? Here is a partial list:

rename columns
rename the index
remove irrelevant columns
split one column into two
combine two or more columns into one
remove non-data rows
remove repeated rows
remove rows with missing data (aka NaN)
replace NaN data with a single value
replace NaN data via interpolation
standardize strings
fix typos in strings
remove whitespace from strings
correct the types used for columns
identify and remove outliers

The importance of cleaning your data, and thus ensuring that your analysis is as accurate is possible, cannot be overstated.

In this article, we’ll thus be looking at a few pandas techniques for cleaning our data. First, we’ll look at a few ways in which we can handle NaN values.

Useful references

Table 1. What you need to know

Concept	What is it?	Example	To learn more
`s.isnull`	Returns a boolean series indicating where there are null (typically `NaN`) values in the series `s`	`s.isnull()`	link
`df.isnull`	Returns a boolean data frame indicating where there are null (typically `NaN`) values in the data frame `df`	`df.isnull()`	link
`df.replace`	Replace values in one or more columns with other values	`df.replace('a':{'b':'c'), 'd')`	link
`s.map`	Apply a function to each element of a series, returning the result of that application on each element	`s.map(lambda x: x**2)	link
`df.fillna`	Replace `NaN` with other values	`df.fillna(10)`	link
`df.dropna`	Remove rows with `NaN` values	`df = df.dropna()`	link
`s.str`	Working with textual data	`df['colname'].str`	link
`df.sort_index`	Reorder the rows of a data frame based on the values in its index, in ascending order	`df = df.sort_index()`	link
`pd.read_excel`	Create a data frame based on an Excel spreadsheet	`df = pd.read_excel('myfile.xlsx')`	link
`s.value_counts`	returns a sorted (descending frequency) series counting how many times each value appears in `s`	`s.value_counts()`	link
`s.unique`	returns a series with the unique (i.e., distinct) values in `s`, including `NaN` (if it occurs in `s`)	`s.unique()`	link

How much is missing?

We’ve already seen, on a number of occasions, that data frames (and series) can contain NaN values. One question we often want to answer is: How many NaN values are there in a given column? Or, for that matter, in a data frame?

One solution is to calculate things yourself. There is a count method you can run on a series, which returns the number of non-null values in the series. That, combined with the shape of the series, can tell you how many NaN values there are:

 
 s.shape[0] - s.count() ❶

❶ Returns an integer, the number of null elements

This is tedious and annoying. And besides, shouldn’t pandas provide us with a way to do this? Indeed it does, in the form of the isnull method. If you call isnull on a column, it returns a boolean series, one that has True where there is a NaN value, and False in other places. You can then apply the sum method to the series, which will return the number of True values, thanks to the fact that Python’s boolean values inherit from integers, and can be in place of 1 (True) and 0 (False) if you need:

 
 s.isnull().sum() ❶

❶ Calculate the number of NaN values in s

If you run isnull on a data frame, then you will get a new data frame back, with True and False values indicating whether there is a null value in that particular row-column combination. And of course, then you can run sum on the resulting data frame, finding how many NaN values there are in each column:

 
 df.isnull().sum() ❶

❶ Calculate the number of NaN values in each column

Instead of summing the results of a call to isnull, you can also use the any and all methods, both of which return boolean values. any will return True for each row in which at least one of the values is True, and all will return True for each row in which all of the values are True. You can thus do the following:

 
 df[df.isnull().all()] ❶

❶ Show only the rows without NaN

Finally, the df.info method returns a wealth of information about the data frame on which it is run, including the name and type of each column, a summary of how many columns there are of each type, and the estimated memory usage. If the data frame is small enough, then it’ll also show you how many null values there are in each column. However, this calculation can take some time. Thus, the df.info will only count null values below a certain threshold. If you’re above that threshold (the pd.options.display.max_info_columns option), then you’ll need to tell pandas explicitly to count, by passing show_counts=True:

 
 df.info(show_counts=True) ❶

❶ Get full information about the data frame df, including the number of null values in each column

pandas defines both isna and isnull for both series and data frames. What’s the difference between them? Actually, there is no difference. If you look at the pandas documentation, you’ll find that they’re identical except for the name of the method being called. I mostly use isnull, but if you prefer to go with isna, then be my guest.

Note that both of these are different from np.isnan, a method defined in NumPy, on top of which pandas is defined. I try to stick with the methods that pandas defines, which integrate better into the rest of the system, in my experience.

Rather than using ~, which pandas uses to invert boolean series and data frames, you can often use the notnull methods, for both series and data frame.

That’s all for this article. If you want to see more, check out the book on Manning’s liveBook platform here.

Manning Publications

TAGS

Search form

Latest posts

Featured Content