Monday, November 24, 2014

Answer: How snowy is it this week?


I've been enjoying watching as SearchResearchers tackle this Challenge.  It's a bit trickier than I thought, so let me walk you through my (relatively) simple solution.  

To begin with, I searched for a government organization that I thought would have the snowfall data for November.  My query was: 

     [ NOAA snow depth data ]   -- I added “daily” to make sure I found complete data.  

This led me to the snow-tracking site.  

Which leads to:  

And from there to the data file.  This is just a plain text file, so I opened it in my favorite boring text editor (I use TextWrangler because it has lots of nice features, such as grep, for data wrangling).  I pulled out the data for Buffalo, Rochester, and Oswego for November 2014.  (Actually, I pulled out the data for Pulaski, which is very near Oswego; the other rows for Oswego had too many missing values, indicated by -9999.00.) 
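Pulling out the rows for a few stations and skipping rows with the -9999.00 missing-value sentinel is a simple filtering job. Here's a minimal sketch in Python; the station names and column layout are hypothetical, so check the actual NOAA file's header before relying on field positions.

```python
# Sketch: keep rows for chosen stations, skip rows with missing values.
# The line format here (station name, then daily values) is an assumption.
MISSING = -9999.00

def parse_rows(lines, stations):
    """Return (station, values) pairs, dropping rows with missing data."""
    kept = []
    for line in lines:
        fields = line.split()
        station, values = fields[0], fields[1:]
        if station not in stations:
            continue
        depths = [float(v) for v in values]
        if MISSING in depths:          # row has gaps; skip it entirely
            continue
        kept.append((station, depths))
    return kept

sample = [
    "BUFFALO 0.0 2.5 -9999.00",
    "PULASKI 1.0 3.0 4.5",
]
print(parse_rows(sample, {"PULASKI"}))
```

A grep in TextWrangler does the station filtering just as well; the code version only helps once the file gets big.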

For Canada data, I did the simple query: 

     [ Canada snow data daily ]

Which led to the official government site,

On this interface to their data I selected "daily" data values for "November," then downloaded the London International Airport data and the downtown Toronto data.  Here's what the London International Airport data looks like.  

Again, I copied and pasted this into my text file and then converted it all into a CSV file.  

I uploaded that CSV into my spreadsheet and did a quick conversion of the US data (from inches into cm).  
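The unit conversion is the one place where a silent mistake would quietly skew every chart, so it's worth spelling out: 1 inch = 2.54 cm exactly. A one-line sketch (the Buffalo values below are made up for illustration):

```python
# Sketch: convert US snowfall values from inches to centimeters so every
# row in the spreadsheet is in the same unit.  1 inch = 2.54 cm exactly.
INCH_TO_CM = 2.54

def inches_to_cm(row):
    return [round(v * INCH_TO_CM, 1) for v in row]

buffalo_inches = [0.0, 5.0, 12.0]    # made-up values for illustration
print(inches_to_cm(buffalo_inches))  # [0.0, 12.7, 30.5]
```

In a spreadsheet, the equivalent is a fill-down formula like `=A2*2.54`.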

And I now have a spreadsheet that looks like this.  Each column is one day (from Nov 1 to Nov 22), and each cell is the amount of snowfall on that day (in cm).  

Now, I need to visualize it.  There are lots of things one could do here, but first I did the simplest thing I could imagine.  I computed a line graph for each of the cities over time.  It looks like this:  

While that shows what happened and when, it's not all that great for showing spatial distribution.  So here's another simple version of the chart (remember that you can click on the chart to see it at full size).  

I didn't do anything fancy here--I just took a screenshot of the map, then laid the chart for each location on top of it.  This is a visualization technique known as "small multiples" (that is, a set of small, repeated charts, all of the same type)--but one of the things to know is that they all have to show the same thing on the same axes, or all bets are off.  

Note that I DID have to set the max-grid values to all be 19.  If left to their own devices, the Y-axis ranges will be different on each chart.  I wanted them to be comparable, so I had to manually set them all to 19.  If I were producing hundreds of charts, I would have written a piece of code to do this... but this data set was small enough that you can do it by hand.  
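If you did want to script it, the shared axis maximum is just the largest value across every series. A small sketch, with hypothetical snowfall values in cm:

```python
# Sketch: pick one shared y-axis maximum for every small-multiple chart
# by scanning all the series, instead of hand-setting each chart.
series = {                      # hypothetical daily snowfall in cm
    "Buffalo": [0.0, 12.7, 30.5],
    "Pulaski": [1.0, 3.0, 4.5],
}
shared_max = max(max(vals) for vals in series.values())
print(shared_max)   # 30.5 -- use this as the y-max on every chart
```

Whatever charting tool you use, the point is the same: compute one maximum, then apply it to every chart's axis.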

As I worked through this data collection and plotting task, the biggest challenge for me was just keeping track of where my data came from, and how it got transformed from one step to the next.  As I worked, I kept backtracking and checking my data.  (This is where having a buddy to work with is a great idea--looking at you, Ann & Debbie.)  

I also really like your comments about wanting to create contour maps (or heat maps, although maybe we should call them cold maps).  

If I get the chance, I'll write that up tomorrow.  

Search/Sensemaking Lessons:  In the meantime, it's useful to see that sometimes the simplest approach is best.  (See my queries above.)  

The hard part is knowing which of the many data sources will work out.  

My approach is always to grab a small data set (3 or 4 cells) and then work through the whole process, from initial data download to visualization.  It's a mistake to try to do the entire data manipulation process on the full data set at the beginning.  Use a small sample and work UP to the full dataset.  Trust me, I've wasted many hours wrangling data, only to find out, when I got to the end, that the whole thing was broken.  Better to find that out on a small subset of the data than on the entire thing.  
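The "small sample first" habit can be sketched in code, too: run the entire pipeline on a handful of rows, check the result by hand, and only then run the full data set. The parse and convert steps below are stand-ins for whatever wrangling your data actually needs.

```python
# Sketch of the small-sample-first workflow: the pipeline drops rows with
# the -9999.00 missing-value sentinel, then converts inches to cm.
def pipeline(rows):
    cleaned = [r for r in rows if -9999.00 not in r]           # drop bad rows
    return [[round(v * 2.54, 1) for v in r] for r in cleaned]  # inches -> cm

full_data = [[0.0, 1.0], [2.0, -9999.00], [3.0, 4.0], [5.0, 6.0]]
sample = full_data[:2]                     # a few cells, as described above
assert pipeline(sample) == [[0.0, 2.5]]    # check the small run by hand
result = pipeline(full_data)               # only now run everything
print(result)
```

If the assertion on the sample fails, you've found the break after seconds of work instead of hours.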

Because sometimes you'll get halfway through an analysis and realize that everything you're doing is wrong.  Or that the data doesn't make sense.  Or it's too full of errors, or missing data points. 

More comments tomorrow... but this was my quick answer for today.  

We'll be talking more about this in the future! 

In the meantime, Keep Searching On! 


  1. Dan, none of the imagery shows up, only the placeholders.


  2. Good Morning, Dr. Russell and everyone. I wish you all a happy Thanksgiving, even if you don't live in the United States.

    It was, as usual, a great source of information, and also lots of fun and new learning.

    "Small multiples" is new for me and it is very interesting. When you say you had to set the max-grid values to all be 19: you did that while editing the chart, right?

    This is the TextWrangler you mentioned? I tried it with the same data, and with a NOAA data set that only shows, for example, Buffalo with all the state data. However, I couldn't find a way to work with the data in a spreadsheet. In the past, I used "text to column," so I tried that this time too, searching for similar online tools and in Google Spreadsheets. No luck making it work properly.

    TextWrangler reminded me of data wrangling. I also learned that if I search in your blog, I need the exact word: [TextWrangler] gives different results in your blog than [text wrangler].

    Heatmaps are very cool to make and look at. I tried doing one in Google Spreadsheets, but in my case the data was not enough. Still, they are easy to do there.

    Dr. Russell, how does living in a lake-effect area compare with California?

    Now, time to re-read the answer and my peers' comments.

    Have an amazing day.

  3. Did you remove all rows with -9999.00, or, if a row had only one or two, did you keep it and convert those values to 0.0?