Tuesday, December 9, 2014

Answer--part 2: What's going on in this file?

As you might have figured, I've been traveling, and now that I'm back, I'm trying to catch up.  

I'm not going to give the final answer(s) to the Challenge yet, but I do want to show you another way to look at the data.  

If you open the KML file with a plain text editor, it looks like this: 

As we've already discussed, this is just a KML file.  If you look at the definition of a KML file, you'll see that this is a TRACK piece of data which is just an AltitudeMode (in this case, "clampToGround") followed by a list of timestamps followed by a coord.  Each coord is longitude, latitude, altitude.  (In that order, with the altitude being optional.)  

Since I wanted to really do something interesting with the data, I first opened it in a text editor (again, my choice was TextWrangler), and then pulled out all of the timestamp, long, lat, altitude data.  (notice... this is the opposite of the "lat/long" order that is often used elsewhere).  

Given this, it was pretty straightforward to extract the data and create a data file consisting of four columns, which I then imported into a spreadsheet.  

Based on this, in the spreadsheet I was able to easily compute the additional columns for "Distance in degrees"  "miles traveled" "elapsed time (seconds)" and then ultimately, "speed (mph)." 
 (note:  to compute "distance in degrees" I just used the Pythagorean formula, A^2 + B^2 = C^2 -- that gives me distance in degrees.  See the formula I used in Column F of the spreadsheet.  I looked up that at this latitude, 1 degree of latitude is roughly 55.3 miles.  That's not exactly right, but it's a good enough approximation for this evaluation since all of the location points are more-or-less at the same latitude.)  

I also wrote a couple of spreadsheet functions to extract the day (where 11/22/2014 is day 0, 11/23/2014 is day 1, etc), and the time from the data.  Using those two pieces of information, I could compute the "timevalue," which is the number-of-the-day + fraction of the day as a decimal value.  That sounds complicated, but think of it this way:  noon on the 0th day is 0 + 0.5  (noon is halfway through the day--6PM on the second day is 1 + 0.75, that is, day 1 + 3/4ths of the day, or 1.75).  

Why do that?  I created the "timevalue" so we could easily make a scatter plot of the timevalue (which handily enough goes from 0 - 6) vs. the speed.  

If you look at the second tab (at the bottom of the spreadsheet--the one labeled "Charts of speed and time"), you'll see this: 

See columns A and B?  I just copied those from the main chart above (x-value and speed).  Use copy and then "paste special" in order to just paste the values of the spreadsheet.  

The scatterplot above is interesting, and odd.  You can see big gaps in the data.  Obviously, the phone is turned off during those intervals--that's when I turned my phone off at night.  (Even if it's out of the service area, it will keep recording data.) 

Handily, those correspond pretty nicely to the night times.  

The odd thing is that one spike on day 4.5 (that is, around noon on day 4).  If that's REALLY my speed, then I've moved really fast (close to 3000 mph!!).  Obviously, this bears some closer inspection.  

But zooming in on this data is kind of a pain in Google Spreadsheets, so I went looking for another tool that lets me zoom in more quickly and easily.  

I exported this data to a tab-separated value (TSV) file (just plain text, with each value separated by a tab) and then imported it into a Plot.ly data set.   Here's what the data looks like imported into a Plot.ly scatterplot diagram. Click here to see this data plot live.  

Now the inter-day gaps are really visible, and it's clear that the huge speed spike at day 4.5 has to be spurious.  

And, when you zoom in (which you can do with the live version above), you can see this on Day 0 (that is, timevalues from 0 - 1.0). 

These values look more reasonable.  It's clear that between 0.1 and 0.2 I was driving somewhere, and driving at speeds up to 75 mph.  (That is, on the freeway.)  The phone was turned off between 0.2 and 0.41, and then more freeway driving ensued.  Same story at 0.65 - 0.7.  

Given this kind of view of the data, can you now figure how how the movements (in space) correlated with speeds?  

As another clue, you can look at this plot: 

This is me traveling right around the middle of the day at speeds between 5 and 18 mph from 5.57 until 5.61 (or, to convert that back into regular clock time, that's 1:50PM - 2:35PM).  That's too fast for walking, too fast for running, and too slow for driving in a car.  What could I have been doing during that time?  

More details revealed tomorrow!  We'll talk about the 3,000 mph data spike.  

Search on.  (And now, with spreadsheets and data visualizations!)  


  1. Thanks for sharing Plot.ly. A lot of resources for education.

  2. Good Day, Dr. Russell. Glad you are traveling and sharing your knowledge with more of us in many sites.

    Very interesting answer. I still need to find a good Text Editor that works either online or in Windows. I will review your answer to learn and understand it more.

    About, what could you have being doing at speeds between 5 and 18 mph, I think you were biking. [bicycle average speed] answer says that is possible.

  3. Just a quick question, why did you start with day 0. Is that the normal procedure to do analysis or is just something that was only your choice?


    1. It's just a programmer's habit. Most arrays start with 0, and if you read the http://en.wikipedia.org/wiki/Zero-based_numbering article, you'll see that it argues (among other things) that zero-based indexing tends to reduce the number of off-by-one errors. But in truth, I think it's just convention...

    2. Good Morning, Dr. Russell. Thanks for your comments and for the link. Zeroth is very interesting and that lead to other articles all of them helpful and with lots of connections. Before reading your comment, I was reading El secreto de la vida…otra vez there author talks about Jeremy England and thermodynamics and later the article you share talks again about thermodynamics. That is amazing, don't you think. Twice in same hour!

      About the 3,000 mph spike. I re-read your file and saw the plot. Changes are very fast. Also show a pattern. Finally, data shows change in altitude in both positive and negative. At first, thought it was in an elevator [Speed elevators mph] they get 45 mph. Then [3000 mph devices]. Finally, maybe it is some kind of Roller Coaster or something else like that.

    3. Hello Dr. Russell. If possible can you share with us how you pulled out all of the timestamp, long, lat, and altitude data with textwrangler? I mean in general way. Did you converted it in different columns to delete unwanted data?


    4. I'll post this as a One-Minute Morceau. With luck, later today.

    5. Thank you, Dr. Russell. I enjoyed and learned with the video.

      One more tool and one more lesson for the Search Challenge.

    6. I did it exactly as you. And finally make it work!

      I used NoteTab Light. To delete \n, needed other symbol as you mentioned in the video. First time didn't work because I forgot to add the (,). After that, everything worked great. Also tried with some apps from Chrome Store that can be useful too.

      Thanks, again, Dr. Russell.

      A tip for those who don't know it. When you subscribe to Dr. Russell YouTube Channel or any other channel, we have the option to receive an email when a new video is uploaded. This is great to never lost a video.

  4. This is the not very pretty map I managed to do using Google My Maps: https://www.google.com/maps/d/edit?mid=zyFpL3FhwlpE.kpatRnQZicZk.

    My Maps doesn't have a "Bicycling" option but if you open a map of the same place on Google Maps, you can see that the cyclist used mainly but not exclusively bicycle lanes.

    Maybe I could do better on Strava but I couldn't find a simple and fast way of converting .csv or .xlsx data to any of their formats (.tcx, .fit or .gpx).

  5. Immediately after sending my previous comment, I decided to search [ cycling map imported from csv | xls file ]. Among the results on the first page, there was Veloroutes. Using it to upload my .csv file was pretty straightforward but then it gets stucked on "Loading elevation…" and I can't save it. I can, however, export it even while it is "loading elevation". I exported it to .kml. I'm adding this .kml as a second layer for the map on My Maps I had done.

    But it's much nicer to open that .kml on Google Earth and then both clicking on the "Play Tour" button and right-click on the path title to "Show elevation". The path is not accurate because there are fewer points than needed to follow the roads smoothly. Consequently, the elevation profile is not accurate either, showing in particular a hill at mid-path that was most likely circled around following the road. Anyway, it's always possible to make an educated guess on where was the real track used.

    Note: my .csv file was created from the original challenge's kml, extracting only the relevant information.

    1. This is beautifully done. Thanks! (You're right -- the "play tour" button is a great way to see this data.)

  6. I wanted to share what I've done so far. Later I hope to try Plot.ly but in the meantime I have been able to use a text editor to get the data. That was no small task for me. After many trials and errors I got the data into Google Sheets and create a chart. It looks very similar to Dr. Dan's. I learned a lot by experimenting with the text editor (I chose Notepad++ since I now have Windows 8.1). I learned to make sure your data is clean as we've been told. Record what steps you've taken in case you need to redo some steps. It would be nice to learn more especially for someone like me who has never worked with a text editor before.


  7. This comment has been removed by the author.

  8. This comment has been removed by the author.

  9. This comment has been removed by the author.