IDS Assignment 2

Assignment: OpenRefine Overview

OpenRefine, formerly Google Refine and before that Freebase Gridworks, is an open-source tool
that lets users load data, clean it quickly and accurately, transform it, and even geocode it.
Its main use is processing data and transforming it into other formats. In addition, every
action performed on a dataset is stored in the project and can be replayed on another
dataset.
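
The idea of recording actions once and replaying them on another dataset can be sketched in plain Python. This is only an illustration of the concept, not OpenRefine's own history format: the `history` list, the `replay` function, and the sample rows are all hypothetical names and data chosen for the example.

```python
# Hypothetical recorded history: each step names a column and a cleaning
# function to apply to it. OpenRefine stores its own history differently.
history = [
    ("city", str.strip),  # remove leading and trailing whitespace
    ("city", str.title),  # normalize capitalization
]


def replay(rows, steps):
    """Apply each recorded step, in order, to a copy of every row."""
    result = [dict(row) for row in rows]  # copy so the input stays unchanged
    for column, func in steps:
        for row in result:
            row[column] = func(row[column])
    return result


# The same recorded steps are applied to two different (made-up) datasets.
first = replay([{"city": " boston "}], history)
second = replay([{"city": "new york  "}, {"city": "CHICAGO"}], history)
```

Because the steps live separately from the data, the same cleanup can be reused on any dataset that has the same column.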

Why Use OpenRefine?


• Simple installation
• Extensive documentation
• Many import formats: TSV, CSV, XML, RDF Triples, JSON, Google Sheets, Excel
• Upload from local drive or import from URL
• Many export formats: TSV, CSV, Excel, HTML table
• Works with fairly large datasets (on the order of 100,000 rows)
• Memory allocation can be adjusted to accommodate larger datasets
• Data remains on your computer, so nothing is shared until you choose to share it
• Useful extensions: geoXtension, Opentree for phylogenetic trees from Open Tree of Life, and
many more
• Active development community

FACETS WITH OPENREFINE


Faceting is one of the most powerful operations that OpenRefine offers. A facet on a given
column shows every unique entry in that column together with its frequency.
You can use this to get a feel for how consistent your data is. You can also use facets to select
a subset of rows that you want to change in bulk.
Facet information always appears in the left-hand panel of the OpenRefine interface. In addition
to the basic text facet, there are:

• Numeric facets
• Timeline facets (for dates)
• Custom facets
• Scatterplot facets

Some of the default custom facets are:


• Word facet: breaks down text into words and counts occurrences.
• Duplicates facet: produces a binary facet (‘true’ or ‘false’) that flags duplicate values.
• Text length facet: creates a numeric facet based on the length of the text in each cell.
• Facet by blank: identifies rows with missing data in a column.
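
The four custom facets listed above can each be described as a small computation over one column. The following Python sketch shows roughly what each one derives; it is an approximation under stated assumptions, not OpenRefine's implementation. The `names` column is made up, `None` and empty strings stand in for blank cells, and blank cells are assumed never to count as duplicates.

```python
from collections import Counter

# Hypothetical column values; None and "" stand in for blank cells.
names = ["red apple", "green pear", "red apple", "", None, "red plum"]


def word_facet(values):
    """Count occurrences of each whitespace-separated word across all cells."""
    words = Counter()
    for v in values:
        if v:
            words.update(v.split())
    return words


def duplicates_facet(values):
    """Per row: True if the non-blank value occurs more than once (assumption:
    blank cells are never flagged as duplicates in this sketch)."""
    counts = Counter(v for v in values if v)
    return [bool(v) and counts[v] > 1 for v in values]


def text_length_facet(values):
    """Per row: length of the text, with blank cells counted as 0."""
    return [len(v) if v else 0 for v in values]


def blank_facet(values):
    """Per row: True if the cell is missing or contains only whitespace."""
    return [v is None or v.strip() == "" for v in values]


words = word_facet(names)
dupes = duplicates_facet(names)
lengths = text_length_facet(names)
blanks = blank_facet(names)
```

Each function returns one value per row (or one count per word), which is the information a facet would then group and display.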
