Thursday, November 8, 2012

Answer: How much does it rain in Northern California?


The quick answer is this:  No.  The rainfall for October 2012 was pretty ordinary—in fact, much less than just 3 years ago, when the rain fell heaviest in October 2009.  (Which goes to show you how imperfect human memory is:  I don’t remember October 2009 as being especially rainy, but when I look back at my journal from that year, it’s clear it rained a LOT.) 

As I showed you in my graphics from yesterday, I really wanted to get a single, simple chart showing the rainfall across the past decade of Octobers.  It’s often important to have a clear goal in mind, especially when you start wading into the sea of data that’s out there. It’s really easy to get sidetracked by all of the other beautiful charts out there and lose track of what you’re seeking.  (Which is why I hand-sketched the chart: To make it clear what my goal was.) 

Like many readers, I knew about a couple of sources off the top of my head.  Wunderground is well-known, and is a composite of professional and amateur weather stations.  The data quality is variable, but they often have data where there’s no other source available.  (For instance, there’s a Wunderground reporting station in my neighborhood.  Awfully handy.)  And of course, you could have done the obvious search:  [ rainfall SFO data ]  or, if you've done this kind of thing before, [ precipitation SFO ] 

I also knew that NOAA (the National Oceanic and Atmospheric Administration, the official government weather data collection group) would have data.  The problem with NOAA is usually digging down deep enough through their reams of data to find the one data set you want. 

Then someone happened to mention WeatherSpark.com, which has a wonderful set of weather visualizations and the ability to drill down and select more-or-less what you want. 

So I did what any discerning data junkie would do… I got data from all three so I could compare them.

Data collection:  Getting data from Wunderground is pretty easy.  Click on “Local Weather” and find the “History Data” tab.  That will get you to data for a given date.  Once you find the “Monthly” button, you can find the cumulative precipitation for that location (KSFO).  

If you then notice the format of the URL for the monthly data report:
   http://www.wunderground.com/history/airport/KSFO/2012/10/31/MonthlyHistory.html

You can then change the year to 2011, 2010, etc., and collect the data.
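If you'd rather not edit the URL by hand ten times, the pattern is easy to generate. Here is a minimal Python sketch that only builds the URLs, following the format quoted above. The function name and the choice of the years 2003 through 2012 are my own illustration, and the sketch does not check whether the site still serves pages at these addresses.

```python
def monthly_history_url(station: str, year: int, month: int, day: int) -> str:
    """Build a monthly-history URL in the format quoted in the post."""
    return (
        "http://www.wunderground.com/history/airport/"
        f"{station}/{year}/{month}/{day}/MonthlyHistory.html"
    )


# One URL per October for an illustrative ten-year span (2003 through 2012).
october_urls = {
    year: monthly_history_url("KSFO", year, 10, 31)
    for year in range(2003, 2013)
}

for year, url in sorted(october_urls.items()):
    print(year, url)
```

Each URL still has to be opened and the monthly precipitation total read off the page; this only saves the retyping.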

Pop that into your favorite spreadsheet, and plot the graph.  (Here's the link to my spreadsheet version of these charts.)  


That took me only a couple of minutes.  Of course, it helped that I knew about Wunderground to start with. 

Comparison with NOAA:  To get “ground truth” data from the authority, I started my search at NOAA.gov – it didn’t take me too long to click through their site (which is really the best way to do it—you have to learn what they call things by reading their pages—they often use very technical terms that you have to learn along the way). 

But it was only a couple more minutes for me to get to the National Climatic Data Center (NCDC) and find that I could ORDER (for delivery via email) the data set for SFO.  I ended up on their data-set order page and got an email from them a few moments later with a link to the SFO data set for 2000 – 2012! 

I downloaded the data (which has a ton of values and cryptic notation), poured it into my spreadsheet, opened it up and… realized I needed to go read the documentation.  This is professional data, so I really needed to understand what things like “Daily HGCN” meant and what that number in the HPCP column meant.  (Turns out they measure rainfall in 1/100ths of an inch, so a 25 in that column is really 0.25 inches.) 

Fine.  I’m a data guy, so I opened up THAT spreadsheet, filtered out all of the non-October values, added up the numbers and got the values for a decade of Octobers.  (The one thing I tripped over was that one of the months had a HUGE rainfall—well over 10,000 inches!  What happened?  Turns out they use a number of 9999 to denote “rainfall not measured,” so I had to go back and clean up the data a bit. Not a big deal, but lesson learned—when there’s something funny in the graph, go check it out.)
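The same cleanup can be sketched in Python instead of a spreadsheet. This is only an illustration: the row layout (a YYYYMMDD date string plus an integer HPCP value) and the function name are assumptions, not the actual NCDC file format, and the sample rows are made up. The two rules it applies are the ones described above: skip the 9999 "not measured" code, and convert 1/100ths of an inch to inches.

```python
MISSING = 9999  # the "rainfall not measured" code described above


def october_totals(rows):
    """Sum October precipitation per year, in inches.

    `rows` is an iterable of (date, hpcp) pairs, where date is a
    'YYYYMMDD' string and hpcp is in hundredths of an inch
    (an assumed layout, for illustration only).
    """
    hundredths = {}
    for date, hpcp in rows:
        year, month = int(date[:4]), int(date[4:6])
        if month != 10:
            continue  # keep only October readings
        hundredths.setdefault(year, 0)  # a dry October still shows up as 0
        if hpcp == MISSING:
            continue  # skip the sentinel instead of adding it to the total
        hundredths[year] += hpcp
    return {year: total / 100 for year, total in hundredths.items()}


# Made-up sample rows, not real SFO data.
sample = [
    ("20091013", 250),
    ("20091014", 9999),  # not measured
    ("20091019", 25),
    ("20090915", 40),    # September, ignored
    ("20121022", 50),
]
print(october_totals(sample))
```

Summing in whole hundredths and dividing once at the end avoids accumulating floating-point rounding across many small readings.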


Note how similar these graphs are.  This makes me feel good.  Especially that both graphs have 0 inches of rainfall in 2002 and 2003.  

Comparison with WeatherSpark:  Using the WeatherSpark UI (check it out if you’ve never used it) it’s pretty easy to select a decade’s worth of rainfall data from SFO.  I put it into my cart and then went to checkout.  Surprise!  This data costs money!  Given that I already had 2 data sets (including one from the government), I was a little reluctant to spend $15 to get the data, but I figured I’d be willing to do it for the blog. 

So, one credit card transaction later I had the data.  Dropped into my spreadsheet and… guess what… it’s the same. 


Not too surprising I guess, but it’s another lesson learned.   

But the good news is that I feel pretty confident that these graphs represent what has happened over the past ten years.  Including telling me that this past October was just ordinary. 

Lessons learned:  There was a lot to learn here.

1.  Triangulation is a best practice.  Get your data from multiple places.  In this case, WeatherSpark just replicated the NOAA data, BUT they also cleaned it up.  In this case, it wasn’t a big deal—but it could have easily been worth the money for their slicing/dicing and cleaning of the data. 

2.  Check your data.  That error code (9999) might have been easy to miss if the numbers had been larger.  Look for obvious errors (in this case, a giant spike in the data), but also just eyeball it. Sometimes you’ll see things that would escape plotting.  (Example:  Suppose they’d used a -1 to indicate an error.  You’d never see that in the plot because it would look just like a 0 at these scales, but you could visually pick it out of the data.) 

3.  Some sources turn out to be less useful.  Several readers suggested sites like Wolfram Alpha and ClimateStations.com.  Both are great for what they do, but I was not able to download the data from either site, which left me unable to do my own analysis.  While their plots are nice, I wanted to do the data checking myself, and for that you need download ability.  (Note that Alpha says they allow downloads, but it’s only of the chart, not the data.)  

Addendum (11/9/12):   

4.  Make sure you're answering the right question.  A few of our loyal readers answered the question... kind-of...  When I was talking with people about this particular problem, it was clear that often they'd slip slightly off the rails and find data about "San Francisco" and never notice that they'd started by searching for data from SFO.  "San Francisco" usually refers to the city (a place that's noted for its very different weather patterns than those at SFO).  (Hat tip to reader Goon for reminding me of this.)
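Lessons 1 and 2 are also easy to script once the data is in hand. The sketch below is a rough Python illustration, not the process used for the charts above: the function names, the list of sentinel codes, the tolerance, and every number in the example are made up for demonstration.

```python
def flag_suspicious(readings, sentinels=(9999, -1)):
    """Return (index, value) pairs that look like error codes.

    Catches known sentinel values and any negative reading, including
    ones like -1 that would be invisible on a plot.
    """
    return [
        (i, value)
        for i, value in enumerate(readings)
        if value in sentinels or value < 0
    ]


def disagreements(source_a, source_b, tolerance=0.05):
    """Compare two {year: inches} tables and report years that differ.

    Only years present in both sources are compared.
    """
    shared_years = source_a.keys() & source_b.keys()
    return {
        year: (source_a[year], source_b[year])
        for year in sorted(shared_years)
        if abs(source_a[year] - source_b[year]) > tolerance
    }


# Made-up values, purely to show the calls.
print(flag_suspicious([0, 12, -1, 9999, 3]))
print(disagreements({2009: 2.75, 2012: 0.5}, {2009: 2.74, 2012: 1.2}))
```

A script like this complements eyeballing the data; it only catches the kinds of errors you thought to look for.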



Search on!  (And stay dry…) 


4 comments:

  1. Another fun challenge as usual and interesting lessons to take from it.

    I would also add another note that you should keep in mind what you're searching for. I noticed a few people providing stats for San Francisco rather than just SFO and for annual rainfall rather than October rainfall.

    Replies
    1. That's right. I'll add this as a comment / update. Good catch.

  2. Hi Dan, I enjoyed your Keynote at the last GAFE New England Summit.
    Perhaps you can Help a friend sleep :)
    Here is a great search question
    https://twitter.com/courosa/statuses/256238512108621824
    I think I will add this to my next Internet search exam
