Thursday, July 25, 2013

Answer: Have the Tour riders really gotten faster over the years?

Yesterday's search challenge was fairly straightfoward: 
1.  Has the Tour de France been getting faster over the past century?  
Naturally, what we first want to do is to get reliable data.  This turns out to NOT be the hardest step.  A simple search for: 

     [ Tour de France data speed time] 
And you'll find several.  BikeRaceInfo.com and the official Tour website.  
For what follows, I used the Official Tour website data.  
There are many ways to do this, but I'll show you what I did with Google Spreadsheets.  If you're curious, here's the link to my original spreadsheet
Step 1.  Import the data.  
Here the key step is to grab the data from the source.  If you look at the spreadsheet, all I did was to use the function:  =importHMTL( ) in the A1 cell. 
Top several rows of my spreadsheet. Note that I put the data source link into the comment on cell A1. I highly HIGHLY recommend that you do this for all your analyses.  Always note where / when the data comes from. 
I just plugged in: 
 =importHTML("http://www.letour.fr/2012/TDF/HISTO/us/palmares.html", "table", 1)
The first parameter is the URL, the second parameter tells Spreadsheets what kind of thing to import (a "table") and the last number tells Spreadsheets which table to import (in case there are more than 1 tables.)  
And in a couple of seconds, Spreadsheets copied in the data from the first table on that web page, in this case, all of the Tour data I care about.  
Unfortunately, as you can see, the data isn't quite in the format I need.  So I selected the entire contents of the spreadsheet and copied it.  (CMD+C, or Control-C)

Step 2.  Clean the data. 
Now that I have the data (tab labeled "Imported Data"), I created a new tab on the spreadsheet (look at the bottom of the page).  
I did an Edit>Paste-special>Paste-values-only.  This has the effect of copying (and freezing) the data from the first tab.  This simplifies later editing, and if the website goes away, I still have the data to work with.  

To clean this data, I mostly just wrote small functions to pick apart the data entries and convert them into numbers I could use.  
For instance, the time is given in this format:  86h12'22"  (86 hours, 12 minutes, 22 seconds).  
One of the tricks I use when converting things like this is to do it in steps.  
The data comes like this:  
The imported data. Note that I cleaned up the misaligned column headers.
But I need the distance numbers without the spaces in them, and I need the time in decimal format.  (That is, not 86h12'22" but 86.20.)  


In column F you can see I broke out JUST the hours using the formula =left(D2, find("h", D2) -1) 
I wrote one column for hours, one for minutes, then combined them into column H for time(decimal).  
Why do it this way?  Yes, I could have written one big function to convert from the 86h12'22" into 86.20, but I've found that breaking it down into steps (and putting them into columns) makes debugging much simpler.
Then column J is just dividing column I by column H to give kph.  

Step 3. Charting.   
I then created another tab, "Charting" and copy/pasted my data from the cleaning tab into this new sheet (again, just to keep things simple and clear).  
And then used the Google Spreadsheets "Insert Chart" function to create the following two charts.  


This pretty clearly answers the question--YES!--overall Tour speeds have increased since the race began. 
Sidenote:  Why show both charts?  Because the first chart shows the big rise in speeds during the 1930s and the stagnation of speeds during the 1920s.  But the second chart nicely shows the MISSING DATA from the war years. (1940s)

2.  Offer a data-driven explanation for the "pecularities" in the Tour race time data?  
What I first noticed when I did this analysis was the relatively slow speeds during the first years of the race.  What's up with that?
It's not hard to create a chart that shows the effect of distance on race times.  And yes, longer races are slower.  
I'll let you figure out how to create that chart in Google Spreadsheets (it's not hard), but I wanted to include this diagram from my friend Steve, who proposed this question in the beginning.  He used the statistical package R to make this beautiful chart that nicely answers the question.  
Chart produced in R by Steve Bagley (Stanford). 

As you can see, longer races are slower overall... and yet the effect of improved equipment, training, and nutrition add up over the years so that later years at the same distance (look at the vertical column of 2700 miles and note the change over time). 

Search.. and analyze... on! 

Postscript:  Regular reader Rosemary M posted a link to her analysis that shows a nice inverse relationship between course length and average speed.  It was so great that I have to give her credit and a link.  Here is the chart she generated from the data.  

Original figure by Rosemary M showing the inverse relationship
between TdF race length and average speed in kph.

3 comments:

  1. RoseMary and Dr. Russell, it is a great answer and improved post since the morning.

    Congratulations and thanks for sharing.

    ReplyDelete
  2. Atatistical Analysis is tricky. Over at http://bicycles.stackexchange.com/questions/7661/why-arent-tour-de-france-riders-going-any-faster

    The exact same chart was already posted Jan 2012 in order to prove just the opposite of this Challenge namely to show that correlation does not equal causality.

    I would appreciate your insights DR R. I am baffled.

    Cheers

    jon on sunny Vancouver Island

    ReplyDelete
  3. Forget to specify the 2 graphs: Rosemary's and the already posted one in the link I gave

    ReplyDelete