Friday, February 5, 2010

Searching for data and datasets

One of the glorious things about the internet is that it has made it very practical to share source datasets.  Among other things, this means you can reproduce analyses from data that is of historical interest (such as Galileo’s observations) or of deep personal interest to you.

The simplest way to start looking for datasets to analyze is to add the term ‘dataset’ to your query.  If you’re interested in getting your hands dirty with statistical analysis, search for [ online dataset ] and you’ll have more content than you’ll know what to do with.  For instance: - has datasets and stories about statistical analysis, with nice datasets ranging across a number of topics from archaeology (ancient Egyptian skull development over 4000 years) to nature (acorn size as a function of location) to zoology (wild horse population statistics).  You can also find draft lottery data, Galileo’s experimental data, Old Faithful eruptions, world population statistics…

If I were teaching math and statistics in high school or college, this wealth of data would be a major game-changer.  You can demonstrate methods and analyses on tiny, demonstration datasets, and then let the students loose to look at moderately sized (and very interesting) data.  With the amount of data content online, every day can be a new science fair!

Let me illustrate this with a story.

It’s been raining in Palo Alto a lot recently.  And I noticed (for about the thousandth time) that… 

Question:  The rain in Palo Alto seems to come in pulses, averaging 10 to 15 minutes per pulse, then a period of quiet, then another pulse.  Is that true? 

Answer:  To test this hypothesis, I grabbed two datasets for comparable rainy days in January 2010.

Seattle data is from:

Palo Alto data is from:

Both weather stations record rainfall at 5-minute intervals.

To find this data, I did a search for [ rainfall data palo alto ] and found that one of the first results points to, which features links to local weather stations, nearly all of which provide data feeds in a CSV format.  It was quick to find a Palo Alto weather station that had the right kind of rainfall data. And from there, it was a simple step to do the same for Seattle.  (I did other cities as well, but that’s another story.)

So why didn’t I search for “dataset”?  Because a dataset is usually a curated collection of data; that is, it’s been collected, usually cleaned up and ready for use.  I chose to search for ‘data’  because I wanted a raw feed of data from an instrument.  The National Weather Service has datasets, but I wanted the data straight from the provider, hence the subtle shift in my query.

I downloaded the CSV data for each of my two cities (Jan 11, 2010 for Seattle, and Jan 19, 2010 for Palo Alto).

I then imported both CSVs into a Google Spreadsheet, and made a moving average of 3 samples (to smooth the curves a bit and get around sampling uncertainties). 
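If you'd rather do the smoothing step in code than in a spreadsheet, here is a minimal sketch in Python. The column name `rain_in`, the sample readings, and the choice of a trailing (rather than centered) window are all assumptions made for illustration; a real weather station CSV will have its own column names, and the numbers below are made up, not the Seattle or Palo Alto data.

```python
import csv
import io

def moving_average(values, window=3):
    """Trailing moving average over `window` samples.

    The first window-1 outputs average over however many samples
    are available so far, so the result has the same length as the input.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def read_rainfall(csv_text, column="rain_in"):
    """Pull one numeric column out of CSV text (column name is hypothetical)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader]

# Made-up 5-minute readings, for illustration only.
sample_csv = """time,rain_in
10:00,0.00
10:05,0.03
10:10,0.06
10:15,0.00
10:20,0.00
"""

rain = read_rainfall(sample_csv)
smoothed = moving_average(rain, window=3)
for raw, avg in zip(rain, smoothed):
    print(f"{raw:.2f} -> {avg:.3f}")
```

A 3-sample window over 5-minute readings averages across 15 minutes, which takes the edge off single-sample spikes while still leaving pulses of that length visible.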

The observation is pretty clear:  Palo Alto weather tends to come in bursts much more than Seattle weather. 

Note that the blue (Seattle) line is VERY different from the red (Palo Alto) line, with its obvious pulses.

Once it starts raining in Seattle, it tends to keep raining.  The stereotype of Seattle rain is true! 

By contrast, if it's raining in Palo Alto, you can wait 15 minutes until it clears.

(At least on these two days.  For a real test of statistical significance, you'd have to do a bunch more days, testing the average behavior, etc etc.  For advanced classes, this is a great way to talk about hypothesis testing, measurements of significance, etc.)
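One way to put a number on "pulses" over many days is to measure how long each unbroken run of rainy samples lasts. The sketch below is just one possible approach, not the method used for the chart above: the function name `pulse_lengths`, the zero threshold, and the sample readings are all assumptions for illustration.

```python
def pulse_lengths(rain, minutes_per_sample=5, threshold=0.0):
    """Return the duration in minutes of each run of consecutive
    samples whose rainfall is above `threshold`."""
    lengths = []
    run = 0
    for amount in rain:
        if amount > threshold:
            run += 1          # still inside a pulse
        elif run:
            lengths.append(run * minutes_per_sample)
            run = 0           # pulse ended, back to quiet
    if run:                   # data ended while it was still raining
        lengths.append(run * minutes_per_sample)
    return lengths

# Made-up readings: a 2-sample pulse, a quiet gap, then a 3-sample pulse.
readings = [0, 0.02, 0.05, 0, 0, 0.01, 0.03, 0.04, 0]
lengths = pulse_lengths(readings)
print(lengths)
print(sum(lengths) / len(lengths))  # average pulse length in minutes
```

Running this over many rainy days for each city and comparing the distributions of pulse lengths would be a natural starting point for the hypothesis-testing discussion.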

The point remains:  As a culture, we now have easy access to more data than ever before.  If you search for it, you can often test your own ideas about what’s going on—you don’t need to wait for the intermediaries to give you their version of the story.  Check it out yourself.

Search on!
