Cleaning Data with OpenRefine
Objectives
By the end of this tutorial, you will be able to:
- Facet (count & tabulate) messy data
- Filter and export a specific portion of your dataset
What You Need
- Computer with a web browser and internet connection
- A dataset - ideally something on the larger size (i.e. tens or hundreds of thousands of data points)
- We altered a dataset from the Western Pennsylvania Regional Data Center for this tutorial. Access our “messy” dataset here.
- Right click the “messy” dataset link above and select “Open in New Tab”
- On the webpage that opens, right click and select “Save As”
- Save the .csv file somewhere you can easily access it again when it’s time to upload it
- Your own data can be uploaded as a file in many different formats (csv, json, wiki, xls, etc.), pulled from a webpage, or even copy/pasted as text
- We altered a dataset from the Western Pennsylvania Regional Data Center for this tutorial. Access our “messy” dataset here.
About OpenRefine
OpenRefine (previously Google Refine) is a powerful, opensource tool for working with messy data. Key features include cleaning up small to large mistakes en masse, converting data from one format to another, and adding to a dataset by pulling data from an online source into your dataset. OpenRefine is an app that is downloaded to your personal computer that then uses your internet browser to interact with it. While it might look as though you are uploading your data to the internet through your browser, the data is in fact still being locally hosted on your hard drive.
What Can We Do With OpenRefine?
- Clean up messy data
- Combine data from a website with an outside dataset
- Convert a dataset from one format to another
Getting Started
To get started, there is no account sign up. All you need to do is visit the OpenRefine Downloads Page and pick the download kit that is applicable to your operating system.
NOTE: If using a Mac, you may get a security warning that your computer doesn’t recognize the developer and won’t let you open the application. If that is the case, then go to your System Settings > Security & Privacy > Click the lock icon in the bottom left corner of the window > Click the “Open Anyway” button in the “Allow apps downloaded from” section of your General Settings. Open it again and you should be good to go!
Once you have the app downloaded and installed, launch the app, upload your data, and start working. As a reminder, when you launch the OpenRefine app, your default internet browser will open. This is how you interact with the application even though your data is still being hosted locally on your personal drive. So, no need to worry about the security of your data!
Uploading Your Data
- Click the “Choose Files” button, select your dataset, and click “Next”. For this workshop, we altered a dataset from the WRPDC: “messy” Allegheny County Dog Licenses.csv. If you already have some data that you know is messy, feel free to use that and follow along with the same steps! Just be prepared for us to use different values and have different “mistakes” within our provided data than what you might have.
- Once uploaded, your data will appear in a “Configure Parsing Options” page. There are several options available at this point, but we can stick to the default settings for our purposes. So, just click the “Create Project” button in the top right corner to continue.
- Once your project is created, you will see the first ten rows of your data as a new OpenRefine project.
Faceting
Facets are the bread and butter of OpenRefine when it comes to cleaning up messy data as well as sorting and filtering it. Essentially, they are a collection of each value in each cell of your data, grouped together with a total count for each unique datapoint. When you create a “text facet” or you “facet text” OpenRefine automatically separates and groups them based on text and characters. You will also notice that the fine folks at OpenRefine have provided a link to some great screencasts on using filters and facets. Much of this tutorial is based on these videos, so please feel free to reference those as well.
- In the Breed column of our data, click the dropdown arrow in the column header and select
Facet > Text facet
. - In the left sidebar, you will see a list of all of the different breeds in your dataset grouped together with a total count for each. Take a look through these carefully and you will notice that there are some that appear to be duplicates.
- You will see some minor typos, find trailing spaces, etc. OpenRefine has built-in features to help rid your data of these quickly. The first of which lives in your facet widget in the left sidebar. Click the “Cluster” button in the top right corner of the widget. Here you will see a list of any data points that OpenRefine thinks are supposed to be the same. The “Values in Cluster” column shows you what values are up for being combined and the “New Cell Value” column shows you what OpenRefine wants to change them to. For our dataset, OpenRefine got them all right, so click the checkbox in each row in the “Merge” column (or click the “Select All” button under the table) then click “Merge Selected & Close”.
- Take another look at your Breed facets. Scroll down until you find the “AM PIT BUL TERRIER” and “AM PIT BULL MIX” values. You will notice that they each only have one entry and right under them there are “AM PIT BULL TERRIER” AND “AM PITT BULL MIX” values with hundreds of entries each. This inconsistency might be important for analysis, but it also might just be something that needs to be cleaned. Let’s assume that latter. Hover over each value and click the “Edit” button that appears to the right of them. This will bring up a text box where you can manually edit the typo from “PIT BUL” to “PIT BULL”, etc.
-
Once you click “Apply” you will see your typo’ed value disappear and the count of the correctly spelled value increase by one.
- Listed here are some other minor errors in the provided dataset that you can use the above steps to locate and correct. Scrolling through your list of facets sorted by name is a quick way to identify these by noticing lower than usual counts and duplicates for values.
- Edit the “BAGLE” value to “BEAGLE”
- Edit the “BOXTER” value to “BOXER”
- Edit the “MATLEESE” value to “MALTESE”
- Edit the “MIX” value to “MIXED”
Other Quick Edits
Eliminate Extra Spaces
A common issue with large datasets is trailing or leading spaces. Meaning there is a space either right before or right after the value in the cell. OpenRefine makes trimming these off of your data super easy!
Again in the “Breed” column of your dataset, click the dropdown arrow by the column header and hover over Edit Cells > Common Transformations > Trim leading and trailing whitespace
. You can repeat this for each column just to be sure that you’ve cleared up any trailing spaces throughout your data.
Split Columns with Multiple Values
-
Look at the “ValidDate” column of our data. You will see that the values have both a date and a time in the same cell. We want a separate column for “ValidTime”. In the header of the “ValidDate” column, click the dropdown arrow and hover over
Edit column > Split into several columns
. Notice how the dates and times in the “ValidDate” cells are separated by a “T”. Enter “T” in the “Separator” box in the Split column window and click “OK”. -
Now you have your dates and times in separate columns! Notice how OpenRefine automatically names the headers of these columns based on the original column’s name. Simply edit these by clicking the dropdown next to the header name and select
Edit column > Rename this column
.
Add Text To Existing Values
Just for clarity’s sake, we want all of our pups that are designated as “MIXED” to read as “MIXED BREED”. So let’s add the word “BREED” to any occurrence of the word “MIXED” in the “Breeds” column.
- Back in your Facets widget, scroll down to the “MIXED” value and select it. The facet will be bolded and change to orange to signify that it’s selected.
- With the facet selected, click the dropdown arrow in the “Breed” column header and select
Edit cells > Transform…
- Here you will use what OpenRefine calls General Refine Expression Language or GREL to tell OpenRefine how you want to transform your cells. For this example, in the “Expression” box, use the following command to add your text to the existing values:
value+” BREED”
Luckily, the Google Refine Expression Language is pretty simple and there tons of resources available on using it if you run into a more complicated need for it, so it’s not really necessary to learn it.
You will see a preview of your handy work in the table under the “Expression” box. Click “OK” when it looks the way you want it to.
Filtering and Exporting Data
Now that your data is all cleaned up and ready for use, you can use facets to filter out what you don’t need, sort everything by a specific value, or a combination of both! Filtering is about as easy as it gets.
- Click the dropdown of the “LicenceType” column header. Select
Facet > Text Facet
. Let’s say we want to find out how many of our registered pups are seniors. From your facet widget in the left sidebar, hover over each of the values that include “Senior”. As you hover over each, you will see the option to either “edit” or “include”. Click “include” for each value. As you do this for each facet, they will all become bold and turn orange, signifying that they are selected. You will also notice the number of matching rows at the top of the page adjusts to show you how many rows contain our senior pups. - From here, you can Export your filtered data by clicking the “Export” dropdown in the top right corner of the screen and selecting your desired format. For this example, we will choose “Comma-separated value”.
You have now exported a filtered portion of your dataset to manipulate further or send away, but your OpenRefine project remains right where you left off! Now you can jump back into OpenRefine and remove your filters, using the “remove all” button in the top corner of the left sidebar to restart, and export your data again however else you might need to.
Other Resources and References
- OpenRefine Video Tutorials
- Miriam Posner’s “Get started with OpenRefine”
- General Refine Expression Language Guide
- OpenRefine on Github
This tutorial was adapted and written by Jane Thaler in 2020.