Process Data From Dirty To Clean
Process Data From Dirty To Clean
Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to
the problem you are trying to solve. This reading summarizes:
Duplicate data
Description Possible causes Potential harm to businesses
Any data record that Manual data entry, Skewed metrics or analyses, inflated or
shows up more than batch data imports, or inaccurate counts or predictions, or
once data migration confusion during data retrieval
Outdated data
Potential harm to
Description Possible causes
businesses
Any data that is old which should People changing roles or Inaccurate insights,
Potential harm to
Description Possible causes
businesses
be replaced with newer and companies, or software and decision-making, and
more accurate information systems becoming obsolete analytics
Incomplete data
Description Possible causes Potential harm to businesses
Any data that is Improper data Decreased productivity, inaccurate
missing important collection or incorrect insights, or inability to complete
fields data entry essential services
Incorrect/inaccurate data
Description Possible causes Potential harm to businesses
Any data that is Human error inserted during Inaccurate insights or decision-
complete but data input, fake information, making based on bad information
inaccurate or mock data resulting in revenue loss
Inconsistent data
Description Possible causes Potential harm to businesses
Any data that uses different Data stored incorrectly or Contradictory data points leading to
formats to represent the errors inserted during confusion or inability to classify or
same thing data transfer segment customers
Additional resources
Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid the
most common mistakes:
Top ten ways to clean your data: Review an orderly guide to data cleaning in Microsoft Excel.
10 Google Workspace tips to clean up data: Learn best practices for data cleaning in Google
Sheets.
Activity overview
You’ve learned about cleaning data and its importance in meeting good data science
standards. In this activity, you’ll do some data cleaning with spreadsheets, then transpose
the data.
By the time you complete this activity, you will be able to perform some basic cleaning
methods in spreadsheets. This will enable you to clean and transpose data, which is
important for making data more specific and accurate in your career as a data analyst.
To use the spreadsheet for this course item, click the link below and select “Use Template.”
If you don’t have a Google account, you can download the template directly from the
attachment below.
The first technique we’ll use is to select and eliminate rows containing blank cells by using
filters. To eliminate rows with blank cells:
1. Highlight all cells in the spreadsheet. You can highlight Columns A-H by clicking on the
header of Column A, holding Shift, and clicking on the header of Column H.
2. Click on the Data tab and pick the Create a filter option. In Microsoft Excel, this is called
Filter.
Excel:
3. Every column now shows a green triangle in the first row next to the column title. Click the
green triangle in Column B to access a new menu.
4. On that new menu, click Filter by condition and open the dropdown menu to select Is
empty. Click OK.
In Excel, click the dropdown, then Filter... then make sure only (Blanks) is checked. Click OK.
Excel:
You can then review a list of all the rows with blank cells in that column.
5. Select all these cells and delete the rows except the row of column headers.
6. Return to the Filter by condition and return it to None. In Excel, click Clear Filter from
‘Column’.
Note: You will now notice that any row that had an empty cell in Column A will be
removed (including the extra empty rows after the data).
7. Repeat this for Columns B-H.
All the rows that had blank cells are now removed from the spreadsheet.
The second technique you will practice will help you convert the data from the current long
format (more rows than columns) to the wide format (more columns than rows). This action
is called transposing. To transpose your data:
1. Highlight and copy the data that you want to transpose including the column labels. You
can do this by highlighting Columns A-H. In Excel, highlight only the relevant cells (A1-H45)
instead of the headers.
2. Right-click on cell I1. This is where you want the transposed data to start.
3. Hover over Paste Special from the right-click menu. Select the Transposed option. In Excel,
select the Transpose icon under the paste options.
Excel:
You should now find the data transformed into the new wide format. At this point, you should
remove the original long data from the spreadsheet.
4. Delete the previous long data. The easiest way to do this is to click on Column A, so the
entire column is highlighted. Then, hold down the Shift key and click on Column H. You
should find these columns highlighted. Right-click on the highlighted area and select Delete
Columns A - H.
Your screen should now appear like this:
Now that you have transposed the data, eliminate the extra spaces in the values of the cells.
2. Click on the Data tab, then hover over Data cleanup and select Trim whitespace.
In Excel, you can use the TRIM command to get rid of white spaces. In any space beneath your
data (such as cell A10), type =TRIM(A1). Then, drag the bottom right corner of the cell to the
bottom right to call the data without the white spaces.
Now all the extra spaces in the cells have been removed.
Next, you’ll process string data. The easiest way to clean up string data will depend on the
spreadsheet program you are using. If you are using Excel, you’ll use a simple formula. If you
are using Google Sheets, you can use an Add-On to do this with a few clicks. Follow the steps
in the relevant section below.
Microsoft Excel
If you are using Microsoft Excel, this documentation explains how to use a formula to change
the case of a text string. Follow these instructions to clean the string text and then move on to
the confirmation and reflection section of this activity.
Google sheets
If you’re completing this exercise using Google Sheets, you’ll need to install an add-in that
will give you the functionality needed to easily clean string data and change cases.
Once you have installed the add-on successfully, you can access it by clicking on the Add-ons
menu again.
Now, you can change the case of text data that shows up. To change the text in Column C to
all uppercase:
1. Click on Column C. Be sure to deselect the column header, unless you want to change the
case of that as well (which you don't).
2. Click on the Add-Ons tab and select ChangeCase. Select the option All uppercase. Notice
the other options that you could have chosen if needed.
If you want to clear the formatting for any or all cells, you can find the command in the
Format tab. To clear formatting:
1. Select the data for which you want to delete the formatting. In this case, highlight all the
data in the spreadsheet by clicking and dragging over Rows 1-8.
2. Click the Format tab and select the Clear Formatting option.
In Excel, go to the Home tab, then hover over Clear and select Clear Formats.
You will notice that all the cells have had their formatting removed.
Activity overview
By now, you’ve been introduced to some useful techniques for cleaning spreadsheet data,
such as sorting and filtering. In this activity, you'll continue to develop your data-cleaning
skills by using spreadsheet functions.
Imagine you are a data analyst working for a marketing agency based in San Francisco. The
marketing agency wants to contact local boba tea shops to inquire about a potential
collaboration for a new marketing campaign. The agency plans to visit the top-rated shops
within a 10-mile radius of the center of their target area. To assist with planning, the agency
asks your team to review external data related to ratings and locations of boba tea shops in
San Francisco. One of your teammates has created a spreadsheet from an online source.
However, the data is not in the greatest shape.
Your assignment is to identify the dirty elements in the dataset and clean them up.
By the time you complete this activity, you will be able to identify dirty elements in a dataset,
remove duplicate data, and use the COUNTIF and SPLIT functions to help clean data.
Or, if you don’t have a Google account, you may download the dataset directly from the
attachment below:
CSV File
As a data analyst, your job is to present data that is readable, accurate, and visually
appealing. Cleaning your data helps you achieve this goal. The first step is to identify the dirty
elements in your data.
1. Rename your spreadsheet. Click Untitled Spreadsheet and enter a new name. You can
use the name sf_boba_tea_shop_data or a similar name that describes the data your
spreadsheet contains.
2. If you want to get a better view of your data, you can make the columns wider by
dragging the right boundary of the column heading. This may apply to the name (B),
address (D), and lat-long (F) columns.
3. Now, review your data and consider any problems you may need to address. The
following are examples of errors that you can quickly identify and fix. This is not a
comprehensive list of every potential problem, but is a great starting point for data
cleaning.
First, there is at least one duplicate line (rows 20 and 21) in your dataset.
Second, all Yelp ratings should fall between 0 and 5. However, at least one rating (in
cell C8) falls outside of that range.
Finally, the data for latitude and longitude is contained in a single column (F). In order
for someone to be able to use this data for analysis, the two values should be in
separate columns.
Now you know what issues to focus your attention on during the cleaning process.
Your goal is to fix these errors and help create a clean dataset for analysis. You can address
each issue in turn.
Remove duplicates
The first step is to eliminate any duplicate entries from your dataset. As a best practice,
duplicates should be removed even if they are not readily apparent.
5. If done correctly, 3 duplicate rows will be found and removed and 604 rows will remain.
1. The COUNTIF function quickly counts how many items in a range of cells meet a
given criterion. In cell I2, type =COUNTIF(C:C,">5"). The first entry (C:C) refers to
the range where you are counting the data. In this case, the range is the entire rating
column (C), which contains the Yelp ratings. The second entry refers to the criterion
(>5), and tells the function to count all the values greater than 5.
2. Press Enter. You’ll notice that the function returns a value of 9. This tells you that your
dataset contains 9 entries that have a rating greater than 5.
As a data analyst, it's your job to decide what to do with incorrect values or to ask the dataset
owner for advice if you’re unsure. In this case, one effective approach would be to search on
Yelp for the actual ratings. For this activity, you can just replace the incorrect ratings with the
number 5. An efficient way to replace the ratings is to sort the data numerically from largest
to smallest rating.
4. Then, from the menu bar, choose Data, then Sort range and Advanced range sorting
options.
5. In the pop-up window, check the box next to Data has header row. Sort by rating from Z
→A. This way, the highest ratings will be listed first.
6. Click Sort. Check out your spreadsheet. At the start of the rating column, you should now
find the 9 rows that have incorrect values (rating > 5).
7. Next, select the range of cells C2:C10. Press delete to delete the values that are greater
than 5.
8. Replace all the values with the number 5. In cell C2, type 5. Then, drag the fill handle down
to cell C10 to fill the remaining cells with 5.
9. After replacing the incorrect ratings with the number 5, you may notice that the new value
in cell I2 is 0. The output of the COUNTIF function now reflects the changes in your dataset.
This confirms that the rating column no longer contains any values greater than 5.
10. FInally, delete the formula from cell I2 since you don’t need this information anymore.
Next, clean up the latitude and longitude data by placing each value in a separate column.
You can use the SPLIT function to accomplish this task.
1. The SPLIT function divides text around a specified character or string, and puts each
fragment of text into a separate cell in the row. The SPLIT function will split the single lat-
long column into two separate columns, one for latitude and the other for longitude. In cell
G2, type =SPLIT(F2,"-"). The first entry (F2) refers to the cell where the text is located. The
second entry (“-”) refers to the fact that you are dividing the text based on the minus sign.
2. Press Enter. The result shows each fragment of text in a different cell.
3. Select cell G2 again. In cell G2, double-click on the fill handle to split all the remaining lat-
long entries.
4. Now add column headers to the two new columns (G and H). In cell G1, type lat. In cell H1,
type long.
5. Next, replace the original lat-long data in column F with the new split entries in columns G
and H. Select columns G and H, right-click, and choose Copy.
6. Then, select Column F, right-click, and choose Paste special and Paste values only.
7. Now the new lat column is column F, and the new long column is column G. Adjust the
width of the lat column (F) to fit the data by dragging the right boundary of the column
heading.
9. Finally, the longitude values should be negative so that they are accurate coordinates for
mapping. To make the values in the long column negative, multiply them by -1. In cell H2,
type =G2*-1. The asterisk is the operator for multiplication. Press Enter.
10. Still in cell H2, double-click on the fill handle to fill in the rest of the values.
12. Now, replace the longitude data in column G with the new data in column H. Select
column H, right-click, and choose Copy.
13. Select Column G, right-click, and choose Paste special and Paste values only.
In this case, the data is stored in a database, so they will have to work with SQL. And this data
analyst knows they could get the same results with a single SQL query:
In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a range of cells in the
other spreadsheet to duplicate in the spreadsheet you are working in. You must allow access to the
spreadsheet containing the data the first time you import the data.
The URL shown below is for syntax purposes only. Don't enter it in your own spreadsheet. Replace it
with a URL to a spreadsheet you have created so you can control access to it by clicking the Allow access
button.
Refer to the Google support page for IMPORTRANGE for the sample usage and syntax.
On Tuesday, they use the following to import the donor names and matched amounts:
On Wednesday, another 500 transactions were processed. They increase the range used by 500 to easily
include the latest transactions when importing the data to the individual donor spreadsheet:
Note: The above examples are for illustrative purposes only. Don't copy and paste them into your
spreadsheet. To try it out yourself, you will need to substitute your own URL (and sheet name if you
have multiple tabs) along with the range of cells in the spreadsheet that you have populated with data.
The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet by name and the range of
data that you want to query from, and then use the SQL SELECT command to select the specific columns.
You can also add specific criteria after the SELECT statement by including a WHERE statement. But
remember, all of the SQL code you use has to be placed between the quotes!
Google Sheets run the Google Visualization API Query Language across the data. Excel spreadsheets use a
query wizard to guide you through the steps to connect to a data source and select the tables. In either
case, you are able to be sure that the data imported is verified and clean based on the criteria in the query.
The FILTER function might run faster than the QUERY function. But keep in mind, the QUERY function
can be combined with other functions for more complex calculations. For example, the QUERY function
can be used with other functions like SUM and COUNT to summarize data, but the FILTER function
can't.