Tuesday, July 16, 2013

Notes on data wrangling from Tony Hirst

Illustration from Tony Hirst showing how to use Refine to clean data.
As you might recall from last week's challenge, once you'd found the data on world literacy rates, the next step in the challenge was to transform it into a useful format for your spreadsheet, your infoviz tool, or whatever you were going to do to analyze it. 

I just found that Tony Hirst has written a very nice blog post that gives ALL of the details about that data wrangling step.  

I admit that I gave it a little bit of short shrift.  I said something like "... once you have that page, you can copy/paste the entire table into a text editor, and with a few passes, you can convert it into a nice CSV for importing into your favorite spreadsheet or visualization tool for analysis."  That's a little glib, especially since that's what took all of my time. 

So I'm really happy to give you the link to Tony's writeups on this.  

Actually, he did TWO writeups. 

   1.  How to wrangle the data using Google Spreadsheets

   2. How to wrangle the data using OpenRefine. 

Tony shows you a couple of techniques that I know about, but opted to NOT write about in the post.  (My post was getting too long as it was.)  For instance, he shows how to use: 

=ImportHTML(, "table",

as a way to wrangle a table from a page and into Google Spreadsheets.  

But he ALSO shows how to use regular expressions in Google Spreadsheets (a technical way to do find-and-replace operations with sophisticated pattern-matching).  I didn't know you could do that! Tip of the hat to Tony!  

And it's worth reading Tony's second article about how to use OpenRefine.  It's a powerful tool for transforming data from one form to another.  I'll write about that myself one day.  (Or maybe I'll just point you to Tony's writeup!) 


  1. Thanks Dr. Russell and Tony Hirst.

    It is very interesting and useful. I normally use copy/paste and now I am learning new tools and new ways to work with data. I'll practice them.

  2. unicode, nondeterministic finite automaton (NFA) to be transformed into a deterministic finite automaton (DFA)
    … Dan, see those tiny little nubby green things in the distance in your rearview mirror? — those are the weeds you passed through to arrive at this post… at least for me. I thought wrangling data was something they did at Bluffdale. My fuzzy string matching is getting frayed and I'm headed to my cell.

    Weed on!

    next you may have Vlad run a guest MOOC while you take the sub for a spin.
    C-Explorer 5
    maybe I was confusing Tony with Damien… also a wrangler of sorts.
    dougha in Doha

  3. One of the very few purposes I actually use Internet Explorer (IE) for is when I need to collect data from a table on some website. In fact, if you have MS Office, you may right-click on any table on IE and choose Export to Microsoft Excel. Afterwords, most of the times, the necessary tweaking will be minimal. (I used this for the previous challenge and just had to delete one row and use Replace All to substitute commas for points in the numeric columns, because in Portugal the decimal separator is a comma.)