Wednesday, July 31, 2013

Wednesday search challenge (7/31/13): What's going on here?

This is a remarkable photograph of an unusual atmospheric phenomenon.  When I saw this photo in an article I was reading, I couldn't believe it.  Was this real?  Or was it some kind of manipulated photograph that was supposed to create a feeling of fear and dread--could it be the backdrop for a beach scene in a zombie movie?  

After a little bit of searching around, I found that not only is this a real photo, but these kinds of clouds form on a regular basis in a few places around the world.   Supposedly, these clouds go in both directions as far as the eye can see.  

And there's one place in the southern hemisphere where they take shape on a regular basis.

I'm completely captivated by these gigantic, otherworldly clouds.  Suppose I want to go see them for myself?  That leads to today's search challenges: 

1.  (easy)  What kind of a cloud is this?  (And what's the other name that they're often called?)  
2.  (medium)  If I only have 2 weeks for my vacation, where do you recommend I travel so I'll have a high chance of seeing them?  When should I go there? 
3.  (harder)  Can you find out if these cloud formations are happening now?  Prove it with some current imagery!  (Meaning, pictures from this week.) 

As always, be sure to tell us how long it took you to find the answers--AND how you found them.  Teach us your search strategy! 

Search on! 

Monday, July 29, 2013

Postscript: Have the Tour de France riders really gotten faster?

In his comment on last week's Challenge, regular reader Unknown (aka Jon) made the point that: 
Statistical Analysis is tricky. Over at the exact same chart was already posted Jan 2012 in order to prove just the opposite of this Challenge, namely to show that correlation does not equal causality.

And remember that Rosemary made a similar point:  "There are many factors at work..."  

To clarify a bit.  

I was asking a fairly simple question.  (Just "have the riders gotten faster over the years.")  

There are a million ways to analyze this, but I *meant* to ask the simple version:  "Is the average speed faster now than in previous years..."  

Note that I wasn't really looking for anything more subtle than that. I was just curious (and my pedagogical goal was to show how straightforward it is to do this kind of straightforward analysis--find the data, clean it up, start analyzing).  

So I really wasn't trying to answer the more sophisticated question of "are riders now faster than they were back then, IF you account for doping, course difficulty, technology... etc. etc."  

That's a fascinating topic, but it leads into a much more complex analysis.  (See the link above--Jon is right--they have an excellent in-depth discussion on )  

And to answer Jon's question--I wasn't proposing that this correlation equals causality.  (I'm not sure what the causal link would be anyway!)  It's simply an observation: the average speed now is higher than it was 10, 20, 30...100 years ago.  

I certainly enjoyed reading the StackExchange article.  And for fans of data analysis, the Python code in one of the answers is really interesting.  

One of the authors had it right--the dominant term in these equations these days is aerodynamic drag.  As you speed up, the drag forces increase as a square of the speed based on the frontal surface area.  In other words, to get a speedup of 2X (to go from 10kph to 20kph), you need to use 4X more power (or use other tricks to get around the resistance--e.g., making your front surface smaller).  

And from where we stand today, it seems as though we're starting to see the beginning of the asymptote of human performance around 41 kph.  I'm sure there will be a Tour that is relatively flat, or short, or wind-aided that will be faster than 44... but all things being equal, it seems VERY unlikely that humans will beat 45 kph over the length of the Tour.  

Nice question and comments!  

Search lesson:  When doing an analysis of this kind, ALWAYS check for communities of experts who might well have already done the analysis, or at very least, will have discussions of the topic (which might well have issues and factors that you might not have considered).    

I've found the communitities to be pretty high quality.  They have a LOT of communities that have extensive discussions about lots of topics.  Example list:  software, mathematics, home repair, Wordpress, physics, homebrewing, etc.  Perhaps most interesting from our perspective, there's a community about OpenData, which discusses public data sets and the tools you might use to work with them.  

Thursday, July 25, 2013

Answer: Have the Tour riders really gotten faster over the years?

Yesterday's search challenge was fairly straightfoward: 
1.  Has the Tour de France been getting faster over the past century?  
Naturally, what we first want to do is to get reliable data.  This turns out to NOT be the hardest step.  A simple search for: 

     [ Tour de France data speed time] 
And you'll find several. and the official Tour website.  
For what follows, I used the Official Tour website data.  
There are many ways to do this, but I'll show you what I did with Google Spreadsheets.  If you're curious, here's the link to my original spreadsheet
Step 1.  Import the data.  
Here the key step is to grab the data from the source.  If you look at the spreadsheet, all I did was to use the function:  =importHMTL( ) in the A1 cell. 
Top several rows of my spreadsheet. Note that I put the data source link into the comment on cell A1. I highly HIGHLY recommend that you do this for all your analyses.  Always note where / when the data comes from. 
I just plugged in: 
 =importHTML("", "table", 1)
The first parameter is the URL, the second parameter tells Spreadsheets what kind of thing to import (a "table") and the last number tells Spreadsheets which table to import (in case there are more than 1 tables.)  
And in a couple of seconds, Spreadsheets copied in the data from the first table on that web page, in this case, all of the Tour data I care about.  
Unfortunately, as you can see, the data isn't quite in the format I need.  So I selected the entire contents of the spreadsheet and copied it.  (CMD+C, or Control-C)

Step 2.  Clean the data. 
Now that I have the data (tab labeled "Imported Data"), I created a new tab on the spreadsheet (look at the bottom of the page).  
I did an Edit>Paste-special>Paste-values-only.  This has the effect of copying (and freezing) the data from the first tab.  This simplifies later editing, and if the website goes away, I still have the data to work with.  

To clean this data, I mostly just wrote small functions to pick apart the data entries and convert them into numbers I could use.  
For instance, the time is given in this format:  86h12'22"  (86 hours, 12 minutes, 22 seconds).  
One of the tricks I use when converting things like this is to do it in steps.  
The data comes like this:  
The imported data. Note that I cleaned up the misaligned column headers.
But I need the distance numbers without the spaces in them, and I need the time in decimal format.  (That is, not 86h12'22" but 86.20.)  

In column F you can see I broke out JUST the hours using the formula =left(D2, find("h", D2) -1) 
I wrote one column for hours, one for minutes, then combined them into column H for time(decimal).  
Why do it this way?  Yes, I could have written one big function to convert from the 86h12'22" into 86.20, but I've found that breaking it down into steps (and putting them into columns) makes debugging much simpler.
Then column J is just dividing column I by column H to give kph.  

Step 3. Charting.   
I then created another tab, "Charting" and copy/pasted my data from the cleaning tab into this new sheet (again, just to keep things simple and clear).  
And then used the Google Spreadsheets "Insert Chart" function to create the following two charts.  

This pretty clearly answers the question--YES!--overall Tour speeds have increased since the race began. 
Sidenote:  Why show both charts?  Because the first chart shows the big rise in speeds during the 1930s and the stagnation of speeds during the 1920s.  But the second chart nicely shows the MISSING DATA from the war years. (1940s)

2.  Offer a data-driven explanation for the "pecularities" in the Tour race time data?  
What I first noticed when I did this analysis was the relatively slow speeds during the first years of the race.  What's up with that?
It's not hard to create a chart that shows the effect of distance on race times.  And yes, longer races are slower.  
I'll let you figure out how to create that chart in Google Spreadsheets (it's not hard), but I wanted to include this diagram from my friend Steve, who proposed this question in the beginning.  He used the statistical package R to make this beautiful chart that nicely answers the question.  
Chart produced in R by Steve Bagley (Stanford). 

As you can see, longer races are slower overall... and yet the effect of improved equipment, training, and nutrition add up over the years so that later years at the same distance (look at the vertical column of 2700 miles and note the change over time). 

Search.. and analyze... on! 

Postscript:  Regular reader Rosemary M posted a link to her analysis that shows a nice inverse relationship between course length and average speed.  It was so great that I have to give her credit and a link.  Here is the chart she generated from the data.  

Original figure by Rosemary M showing the inverse relationship
between TdF race length and average speed in kph.

Wednesday, July 24, 2013

Wednesday search challenge (7/24/13): Have the Tour riders really gotten faster over the years?

The Tour de France is an amazing spectacle, no matter how you see it.  A couple hundred riders in waves of color racing over the French countryside for three weeks in July.  You might see them as insufferably French, with tight racing shorts and inscrutable European-style rules for the race leader, but you can't deny that it's a beautiful thing that involves the entire country (and much of the rest of the world) in a remarkable feat of athleticism.  Imagine riding your bike that far, that fast, that long--it's still mind-boggling. 

The Tour has been around for 100 years--this year, 2013, was the centennial ride.  It seems as though the riders have gotten stronger and faster over the years.  It got my friend Steve and I to talking about this--is it really true? 

It's certainly the only race I know of where the racers drink champagne on the last lap (just outside of Paris, naturally, before the grand finish on the Champs Elysee).  And while the sport of bike racing has been tainted (as have nearly all sports) by recent findings of chemically augmented performances, we were still curious about the overall trend for the past 100 years.  In the early days (see above), the teams didn't have much by way of support (carry your own tires!) or health and nutrition information (bring your own cigarettes on the tour too!).  So... 

That leads to today's search challenge: 
1.  Has the Tour de France been getting faster over the past century?  
Of course, you'll find articles that say "yes," but we want proof!  The best answer for SearchResearch will be a reliable data source AND some kind of analysis from you showing the answer.  

I realize that this is yet another data analytic search challenge, but it's that kind of data summer.  This isn't big data (only 100 samples with only a couple of columns), but you get the idea.  Can you produce a chart and/or some kind of analysis showing that the race has been getting faster?  

And, if you're curious about this kind of thing, I noticed that there are some peculiarities in the data.  (I'll let you look at the data on your own to make your own observations.)  

For extra credit, can you: 
2.  Offer a data-driven explanation for the "pecularities" in the Tour race time data?  
Search (chart, and analyze) on! 

Friday, July 19, 2013

Answer: What are the connections between the flag and the rest of US history?

Our search challenge… was about California's state flag, which first flew over Sonoma in the summer of 1846.  Let's look at these one at a time.  

1.  Who did the vexillographic design of the flag? (And why is he a surprise?)  
First things first... What does the word "vexillographic" mean?  The easiest way to find this one out is with the query: 

     [ define vexillographic ] 

To find out that it refers to the visual and graphical design of the flag.  So, knowing that, it makes sense to ask the query: 

     [ California flag designer ]   

to find out that the Wikipedia article says "The first Bear Flag was designed by William L. Todd, a nephew of Mary Todd Lincoln."  That's interesting and surprising... The California flag was designed by a nephew of Mrs. Lincoln?  
Take Note:  Of course, for something this extraordinary, you'd check it and look in other places, looking for some places that are widely credible.  (But always be cautious:, normally a very credible site, has a web page about the California flag that claims it wasn't supposed to be a bear on the flag, but a pear.  If you click on the link "More information about this page" you'll see that they put this hoax on their own site to remind you that you should ALWAYS doublecheck... even  
But this checks out:  It's easy to find multiple books that all corroborate the connection with the Lincoln family and that it was Willam L. Todd.  
2.  What unusual ink did he use for the star? 
The obvious query would be something like: 

     [ California flag ink Todd ] 

I'm using his name in the query because I want only articles that mention him (a relatively uncommon name), the flag, and something about the ink.  I found that in the California Historical Society Quarterly (Vol. XXXI, No. 3, September 1952) John Hussey published his article, "New Light on the Original Bear Flag" with the comment that the star was painted with a mixture of “blackberry juice, brick dust & oil.” 

Photo of original California state flag (taken in 1890). The original flag was destroyed in the
San Francisco earthquake of 1932.  Image link from Wikimedia.

3.  In a flag with only two words on it, which word was (temporarily) misspelled? 
In the Wikipedia article, William Todd is quoted as having written in a letter:  
"... Underneath the bear and star were printed with a pen the words 'California Republic' in Roman letters. In painting the words I first lined out the letters with a pen, leaving out the letter 'i' and putting 'c' where 'i' should have been, and afterwards the 'i' over the 'c'. It was made with ink, and we had nothing to remove the marks."  

That sounds good, but when I checked the link to the source article (you DID check the reference link, yes?), that link is long since dead.  Suspicious, I started looking around.  Unfortunately,  most of the hits I could find for this are mere copies of the Wikipedia article.  (This is a problem we've talked about before.  Once that Wikipedia content starts being copied around everywhere, it's hard to keep things up to date.)  

So I selected some text from Todd's quotation and searched for that: 

     [ "Underneath the bear and star were printed with a pen" ] 

I figured that would be long enough to be unique, but the words are unlikely to be misspelled themselves.  

I was able to quickly find a reputable source:  The Works of Hubert Howe Bancroft: History of California. 1884-90, Volume 5; Volume 22.  This book quotes the original letter by Todd to the Napa Register, July 6, 1872.  

Interestingly, the photo shown above doesn't LOOK like it was misspelled.  But I'll give Todd the benefit of the doubt (since it was his flag), and assume the error was on the other side of the two-sided flag.  

4.  What’s the connection between the flag and a famous American patriot and dentist? 

As I was reading the articles about the flag's history, I happened to notice that the name of James Revere kept appearing.

I know that states love to document their own creation stories, so I did a simple search in Google Books for:

     [ California state flag Todd Revere ] 

Why this?  I wanted some documentation that puts both Todd and Revere in the same scene.  One of the first hits was the record of the Senate of the State of California that records both Todd and Revere at the flag raising. In fact, Joseph Revere delivered the American Stars-and-Stripes from Commodore Sloat's warship, Portsmouth, that was anchored near Monterey (about 200 miles away overland). 
Joseph Revere, grandson of Paul.
 Image from Wikimedia.

A little more reading shows that Joseph Revere was in fact the grandson of Paul Revere.. the well-known Boston patriot, silversmith, and (in his spare time when things were slow) dentist!  

5.  For extra credit (and fun!), when the flag was raised at the start of the revolt, it signaled that the Commodore should move his United States troops onto land to support the annexation of California to the US.  In nearby San Francisco, there is a monument to this Commodore.  Where is it?
As I said above, the flag came from Commodore Sloat's ship.  So... a simple search should reveal the monument: 

     [ monument Commodore Sloat ] 

You can imagine my surprise when the results came back.  There IS no monument to Commodore Sloat in San Francisco (even though there's a major east/west street named Sloat)!  
The closest monument I could find to SF is in Monterey, where his ship was anchored during the Bear Flag Revolt of 1846.  It's in the lower part of the historic Presidio there.   
This wasn’t an especially hard search challenge, but it’s good fun to find these connections between people and events that you didn’t know were there.

Search lessons:  (1) Be wary of baked-in assumptions.   The extra credit question remakes a point we've discussed before... That not all questions are going to be accurate in all their details.  Sometimes, there's an assumption built into the question itself ("where's the monument in San Francisco?") that you have to let go.  I was sure there was a Sloat monument in the City, but I was wrong.  Luckily, good search practice is to go with what works.  QUESTION your assumptions constantly!  
(2) Watch out for dead links.  As we saw in the Wikipedia reference, web pages often go stale and need a refresher.  Do not accept a dead link as a good reference for your work.  Chase after it. If it's a decent result, you'll almost certainly find it somewhere else (and somewhere that's live).  
(3) Check your sources. The fake "it was supposed to be a PEAR" story is a great reminder to always check your sources... even ones that you trust.  

Afterword:  The bear that's NOW on the flag was drawn from a live grizzly, namely "Monarch"--one of the last wild grizzlies seen in California, near Ventura.  Read more about his sad, strange story at 

Wednesday, July 17, 2013

Wednesday search challenge (7/17/13): What are the connections between the flag and the rest of US history?

The Bear Flag Revolt ends when the US flag is run up the flagpole in Sonoma, CA (1846)
It sometimes happens that there are convergences in history that seem remarkable.  Today’s search challenge is to find one that surprised me when I discovered it last week. 

It seems that the history of California is full of surprises.  In particular, when California seceded from Mexico it became for a short while (26 days) an independent republic. 

This was proclaimed by creating a new flag to mark the start of the republic, and it was ceremoniously run up the flag pole in the city of Sonoma, now the center of a robust wine-growing region.

But that flag has a history all its own, and leads to our search challenge…

1.  Who did the vexillographic design of the flag? (And why is he a surprise?)  
2.  What unusual ink did he use for the star? 
3.  In a flag with only two words on it, which word was (temporarily) misspelled? 
4.  What’s the connection between the flag and a famous American patriot and dentist? 
5.  For extra credit (and fun!), when the flag was raised at the start of the revolt, it signaled that the Commodore should move his United States troops onto land to support the annexation of California to the US.  In nearby San Francisco, there is a monument to this Commodore.  Where is it?

This isn’t an especially hard search challenge, but it’s good fun to find these connections between people and events that you didn’t know where there.

Search on!

Tuesday, July 16, 2013

Notes on data wrangling from Tony Hirst

Illustration from Tony Hirst showing how to use Refine to clean data.
As you might recall from last week's challenge, once you'd found the data on world literacy rates, the next step in the challenge was to transform it into a useful format for your spreadsheet, your infoviz tool, or whatever you were going to do to analyze it. 

I just found that Tony Hirst has written a very nice blog post that gives ALL of the details about that data wrangling step.  

I admit that I gave it a little bit of short shrift.  I said something like "... once you have that page, you can copy/paste the entire table into a text editor, and with a few passes, you can convert it into a nice CSV for importing into your favorite spreadsheet or visualization tool for analysis."  That's a little glib, especially since that's what took all of my time. 

So I'm really happy to give you the link to Tony's writeups on this.  

Actually, he did TWO writeups. 

   1.  How to wrangle the data using Google Spreadsheets

   2. How to wrangle the data using OpenRefine. 

Tony shows you a couple of techniques that I know about, but opted to NOT write about in the post.  (My post was getting too long as it was.)  For instance, he shows how to use: 

=ImportHTML(, "table",

as a way to wrangle a table from a page and into Google Spreadsheets.  

But he ALSO shows how to use regular expressions in Google Spreadsheets (a technical way to do find-and-replace operations with sophisticated pattern-matching).  I didn't know you could do that! Tip of the hat to Tony!  

And it's worth reading Tony's second article about how to use OpenRefine.  It's a powerful tool for transforming data from one form to another.  I'll write about that myself one day.  (Or maybe I'll just point you to Tony's writeup!) 

Thursday, July 11, 2013

Answer: How can you map world literacy rates?

Yesterday's challenge was simple:  

     Can you quickly create a visual representation
    of world literacy rates by country?  

I know that some of you will find this an easy problem--but in my classes, I find that lots of people don't even begin an analysis like this because they have no idea how long it will take.  So the theme of this week's challenge is How to get started in visual data analysis!   

This is an interactive world map I generated in a couple of minutes.  (Obviously, this is just a static image of it--but when you click on any of those dots, you'll get the pop-up with literacy rate by country, male, and female literacy rates as well, if available.)  Note that I'm NOT recommending this as a great visualization method--I'm including it here just to give a suggestion of the things you might do.  I'm sure you can do a better job!  

There are many ways to get this data: I'm going to quickly outline 4 different methods. 

1.  Get the data yourself, then convert it.  The quick and easy search: 

   [ literacy rate by country ] 

leads to the Wikipedia page List of countries by literacy rate.  Given Wikipedia's coverage, I sort of figured it would be there.  

And once you have that page, you can copy/paste the entire table into a text editor, and with a few passes, you can convert it into a nice CSV for importing into your favorite spreadsheet or visualization tool for analysis.  

You might also notice that the source of the Wikipedia data is largely the World Bank data set. So you could easily go to that site and download their CSV files about literacy rates and work from there. 

2.  Work done by others.  If instead you know suspect that other people would have done this analysis before, you could search for:    

   [ literacy rate analysis ] 

And you'll get a bunch of Fusion Tables put together by various people.  Including one that looks identical to the map I produced yesterday!  Here's the working version mapping literacy rates onto an interactive map visualization. 

Another way to think about this is to search for Google Fusion tables specifically.

I assume you know about Google Table Search?  Never heard of it?  The simplest way to find it is to search for:

   [ Google Table Search ] 

It will take you to the Google search that specializes on data tables (especially Fusion Tables).  When you get there, it looks like this:

As you can see, it's pretty open-ended.  But if you do a search for:

    [ literacy rate ] 

you'll find a lot of interesting tables including many with other visualizations.

3. Google Public Data Explorer.  Yet another option is to look for a resource that has collected together lots of data from other sources (and provides a set of visualization tools as well).

Often, Google Public Data Explorer will be triggered when you search for data.  It's triggered whenever you do a search like:

   [ unemployment rate California ] 

4.  Search for Images of infographics about literacy.  An image search that includes the context term "infographic" or "visualization" will often find charts and graphs for you.  (But note that you STILL need to check that they used a reputable data source and that they did all of the steps leading up to the visual display!)

Okay... so now we've got the worldwide literacy data.  What can we do with it? 

This isn't really a blogpost about visualization or data mining, but I can't help but point out a couple of things to you.  

Visual analytics: 

You could just pour all of this data into a Fusion Table and create a chloropleth map.  That's what I've done here--just imported the country names into one column, and then literacy rates, then selected "Chart>Map" and I get this: 
Overall world literacy rates.  Darker green is more literate (98%+).  White is NA.  The darker the red, the lower the literacy rate in that country.  

I then made a simple chart showing the difference in literacy rate by gender (I just subtracted female literacy rate from male literacy rate since I had both of those data values / country).  Here's that chart.  (From this spreadsheet.)

Yeah... I know it's too wide to show accurately.  (If you want to the whole chart, click on the image above or go see the original spreadsheet I made.) 

But what struck me was the tail on the far right side.  See that?  There are some places on earth where the female literacy rate is higher than the male literacy rate.  


I just used this data (maleLiteracy - femaleLiteracy) and mapped it onto the countries of the world. 

Literacy rates.  Red are countries where female literacy is well below male literacy.  Dark green are countries with the same literacy rates.  Light green is where female literacy is slightly higher than male.  

Fascinating.  I leave it to you to interpret why that part of the world has more literate males than females (adult population, 15+).  

Better yet, I leave it to you to find even more interesting relationships in the data!  

Search on!