An interesting data cleaning software from Google at
https://code.google.com/p/google-refine/
From the page at
https://code.google.com/p/google-refine/wiki/UserGuide
The Basics
First, although Google Refine might start out looking like a spreadsheet program (Microsoft Excel, Google Spreadsheets, etc.), don’t expect it to work like a spreadsheet program. That’s almost like expecting a database to work like a text editor.
Google Refine is NOT for entering new data one cell at a time. It is NOT for doing accounting.
Google Refine is for applying transformations over many existing cells in bulk, for the purpose of cleaning up the data, extending it with more data from other sources, and getting it to some form that other tools can consume.
To use Google Refine, think in big patterns. For example, to spot errors, think
- Show me every row where the string length of the customer’s name is longer than 50 characters (because I suspect that the customer’s address is mistakenly included in the name field)
- Show me every row where the contract fee is less than 1 (because I suspect the fee was entered in unit of thousand dollars rather than dollars)
- Show me every row where the description field (scraped from some web site) contains “&” (because I suspect it wasn’t decoded properly)
To edit data, think
- For every row where the contract fee is less than 1, multiply the fee by 1000.
- For every row where the customer name contains a comma (it has been entered as “last_name, first_name”), split the name by the comma, reverse the array, and join it back with a space (producing “first_name last_name”)
To specify patterns, use filters and facets. Typically, you create a filter or facet on a particular column. For example, you can create a numeric facet on the “contract fee” column and adjust its range selector to select values less than 1. If the default facet doesn’t do what you want, you can configure it (by clicking “change” on the facet’s header). For example, you can create a text facet with on the same “contract fee” column with this expression:
value < 1
It will show 2 choices: true and false. Just select true. Then, invoke the Transform command on that same column and enter the expression
value * 1000
That Transform command affects only rows where the “contract fee” cell contains a value less than 1.
You can use several filters and facets together. Only rows that are selected by all facets and filters will be shown in the data table. For example, say you have two text facets, one on the “contract fee” column with the expression
value < 1
and another on the “state” column (with the default expression). If you select “true” in the first facet and “Nevada” in the second, then you will only see rows for contracts in Nevada with fees less than 1.
Analogies
Databases
If you have programmed databases before (performing SQL queries), then what Google Refine works should be quite familiar to you. Creating filters and facets and selecting something in them is like performing this SELECT statement:
SELECT * WHERE ... constraints determined by selection in facets and filters ...
And invoking the Transform command on a column while having some filters and facets selected is like performing this UPDATE statement
UPDATE whole_table SET column_X = ... expression ... WHERE ... constraints determined by selection in facets and filters ...
The difference between Google Refine and databases is that the facets show you choices that you can select, whereas databases assume that you already know what’s in the data.
Related Articles
- Transforming spreadsheets into SKOS with Google Refine (semantic-web.at)
- Adding geographical information to a spreadsheet based on postcodes – Google Refine and APIs (onlinejournalismblog.com)
- Chapter 1. Using Google Refine to Clean Messy Data – ProPublica (propublica.org)