THE DHS PROGRAM HEALTH
DATA MAPPING WORKSHOP
Standard Operation
Procedures for Cleaning Data
in Excel
Standard Operation Procedures for Cleaning Data in Excel 1
Standard Operation Procedures for Cleaning Data in Excel
Microsoft Excel is the primary tool for cleaning and preparing data that will be used for analysis or
for import into a GIS. Storing data in a standard format allows a computer program to search,
aggregate, share, and link data more easily. It is good practice to save the original or raw data and
clean and format with a copy of the original. This ensures that the original file is available if needed.
Below are some standard procedures to create useable, linkable data.
1. Spreadsheet Formatting
2. Text to Columns Tool
3. Excel Formulas for Cleaning Data
4. Find and Replace Tool
5. Filter and Sort Tools
1. Spreadsheet Formatting
A spreadsheet table should contain one row item per record without using merged cells or blank
"spacer" rows. The first row should contain column headers for each column in one cell without
using merged cells. Merged cells may be a nice way to display data for a human but a computer
needs logical and predictable formatting (no merged cells).
If there are merged cells in the spreadsheet, select all cells by clicking the box in upper left corner
(between the header and row labels). If the Merge & Center button is selected (orange in color,
found in the Home ribbon in the Alignment section) click it to change it to Unmerge Cells.
Delete extra, blank rows and columns so that the data column headers begin in A1. Row 1 should
now contain the column name for the data in each column. Some file formats like dBase and
shapefiles limit the number of characters allowed for column headings. The limit for these two
formats are 10 characters and cannot include special characters like () % / # @ * or spaces. If the
shortened column names require additional explanation, include this in a separate codebook or
metadata.
Each separate piece of information should be stored in a separate column. For example, province,
district, and commune should be stored in three separate columns rather than in one column. Text
should be stored as text without preceding or trailing spaces. Numbers should be stored as values
with the appropriate number of decimal places.
2. Text to Columns Tool
There are several helpful formulas that can be used in Excel to assist in the data cleaning
process. Text to Columns can be used to split one column into several. This tool is found
in the Data ribbon in the Data Tools section. This tools can also be used to change the
format of numbers stored as text. If converting a number column stored as text you can
click finish in the first step.
Standard Operation Procedures for Cleaning Data in Excel 2
To use Text to Columns, select one column at a time and follow the wizard steps:
1. In the first window in the Convert Text to Columns Wizard, select Delimited if the data are
separate by a comma, colon, or tab. If the data are separated by a Fixed width, select this
option. Click Next.
2. In the second window, choose the Delimiter if using delimited data or manually adjust the
column width if using Fixed width. Click Next.
3. In the third window, select General under Column data format to store numbers as values
and text as text.
3. Excel Formulas for Cleaning Data
Sometimes a column will contain multiple pieces of information without a delimiter. It is best to
store these data in separate columns. The LEFT function, i.e. =LEFT(text,# of characters), RIGHT, i.e.
=RIGHT(text,# of characters), and MID, i.e. =MID(text,starting place,# of characters), formulas allow
you to extract these data systematically. These formulas allow you extract only the information you
need into a different column while keeping the original data.
If data are copied from another source, like Word, there could be special, non-printable characters
like bullets (•). The CLEAN formula, i.e. =CLEAN(cell reference), removes these characters leaving
only printable characters. Text in cells should not contain "extra" spaces. This means that there
should be one space between words and no spaces before the first word or after that last word. The
TRIM formula, i.e. =TRIM(cell reference), removes these "extra" spaces. The CLEAN formula should
be used first because it may leave spaces that should be trimmed.
4. Find and Replace Tool
Spreadsheets often contain data that are not necessary. An easy way to remove extraneous
information is through Find and Replace. Use this tool if you want to remove or replace entire
words, phrases, or characters. It’s located in the Home ribbon in the Editing section. This tool is
particularly useful for replacing spaces “ ” with underscores “_” in column headers (a common
requirement for GIS data and tools).
5. Filter and Sort Tools
Filtering and sorting data, especially numbers, can help identify outliers and inconsistencies in the
data. A filter is best applied to data with only one column header and no blank rows between the
data. When sorting, be sure that the sorting is applied to all the data in the table so that the sorting
does not change only some of the information in a row.
Standard Operation Procedures for Cleaning Data in Excel 3