So..... THAT was interesting!
As you remember, before the break I posted a little story with a few "simple" questions to answer. And it's taken me this long to get back to you with my commentary...
Because this story is so interesting, I'm going to break my answer up into 3 separate posts. (That way, you won't be overwhelmed with the long, slightly complicated story.) Read my posts today, tomorrow, and Friday, and I hope you'll learn a lot along the way--just as I did.
Little Rob was found dead of unknown causes in his apartment. Near his body were scattered the contents of the rucksack he was packing, clearly anticipating a trip to the mountains. In the pack was a vial of Acetazolamide pills, lots of fishing gear, and a road map of northern California. For food he only had breakfast and lunch, so the suspicion for the murder falls on his hiking partner, Big Jim, who dropped one packet of freeze-dried dinner on his way out the front door. There is also a list of larger lakes and reservoirs that have recently been stocked, with several circled as places to go fishing. Unfortunately, the piece of paper was torn in half, with most of the information missing.
Your job is to try and find Big Jim. Given what you know, you decide to start your search on high-altitude lakes in California that have been stocked with fish.
1. Can you find all of the lakes and reservoirs in California that have been stocked with fish in 2014, in Northern California, and are above 8,000 feet in elevation?
2. Since you know they're interested in fishing in larger lakes, can you find those Californian lakes more than 8,000 feet high that are also greater than 500 acres in surface area?
3. How do you know that Little Rob and Big Jim were planning on fishing at high altitude?
This is a Challenge that requires you to find several different data sources.
For #1, you need a (a) list of all lakes and reservoirs in California, (b) a list of all lakes and reservoirs in California that have been stocked with fish, (c) the latitude of all the lakes in California, and (d) the altitude of the lakes.
You might ask, "isn't (b) overlapping with (a)" (that is, wouldn't a list of stocked lakes be sufficient?), but I like to get multiple data sources for something like this. Lists of things (like lakes) differ in their completeness, and sometimes you'll learn from the overlaps or items that are missing on one list, but present on the other.
For #2, you need a (e) list of the surface area of lakes.
IF you got incredibly lucky, you might find a single table that gives all of these features (lake-name, stocking-date, lat-long, altitude, surface-area), but that seems unlikely.
I started my search with the obvious:
[ list of lakes in California ]
Which leads to the Wikipedia result. The article says there are ~3,000 named bodies of water in CA, and this is a good list to start with, but it's only around 420 lakes long.
And I didn't really want to deal with working out the lat-long from the GNIS Id(s). I thought I could do a little better by searching for a table from a slightly more authoritative source.
So... I did another query to find the lake:
[ USGS database California lakes reservoirs ]
and that led me to the Geographic Names Information Service (what GNIS stands for!).
They have an interface that looks like this, making it easy to download a data file with lots of information about the lakes and reservoirs of California.
I can grab that file (which, as you can see, uses vertical bars--aka "pipes"--to separate the values).
This is getting pretty close. This file has 2000 lakes in California, with the lat-long broken out, AND the elevation.
I saved this file to my drive and then imported into a Fusion Table.
(Why a Fusion Table? A: Because I know I'm going to have to combine multiple data sets together--such this one above with another one that has fish-stocking information and surface areas. A FT is the right tool to combine--or fuse--tables of information together.)
In this case, I just created the FT (in Google Drive) and when it asked for the delimiting character, I said | (the vertical bar, or "pipe") character.
That FT is Lakes in CA, which looks like this:
|Fig 4. Note that the attribution link is filled out telling where the data is from (USGS/GNIS system)|
Now we're starting to get somewhere. This table is 2000 rows long (which is a suspicious number--exactly 2000 rows?).
Let's now try to find the master list of stocked lakes in California.
Now, since I live in California I happen to know that the agency here is called "California Department of Fish and Wildlife" (but you could easily figure that out on your own). So my first query was:
[ California Department of Fish and Wildlife fish planting schedule ]
Interesting. Why did I say "planting" rather than "stocking"? Truth is, I tried the term "stocking" first, and quickly learned that the preferred term of art was "planting." Fine. I can learn quickly, so I did this query which took me to the CDFW schedule at nrm.dfg.ca.gov/FishPlants/
(You can also see their map of planted stocks, although I didn't use this map in my solution. It's very slow to load.)From this data source, I selected the "Export" button, which gave me a big CSV of all the lakes by planting (i.e., "stocking") date. Here, it's the Fish Planting spreadsheet.
Note that this list in the Fish Planting spreadsheet also includes rivers and such, but I don't care at this point because we're going to fuse this table with the previous table, and all the rivers won't match anything, so they'll just drop away. So I just made a NEW Fusion Table with this data about fish planting in it, giving me the Fish Planting2 Fusion Table.
So now I have two Fusion Tables: (1) Lakes in CA, and (2) FishPlantingSchedule2.
As you can see, they both have a "Name" column in common between the two. I'm going to fuse the tables on that column. In essence, the two tables will be joined together, with any row that shares the same name (such as "Avocado Lake") will be merged together.
I do this by using the "Merge tables" option in the Lakes in CA FT.
Which gives me, the Merge of Lakes in CA and FishPlantingSchedule2
If you scroll down a bit, you'll see a couple of the lakes that were found to be in common across the two tables:
Note that Avocado Lake was actually planted twice--once on 12/21/14 and again at 1/11/2015, that's why it shows up in 2 separate rows.
To see JUST the lakes that have been planted (out of the 2000 rows), I create a Filter to show me any row that has a date between Jan-1-2014 and Jan-31-2015.
(To do that, just click on the column--Announce--and then click on the big blue Filter button. Add the dates you want to filter on...)
Now I can sort by Elevation ("Ele (ft)") to see which lakes are highest...
It's easy to see that there are 9 lakes (that have been planted) that are above 8,000 feet in elevation.
But notice that Little Lake appears three times, while Horseshoe Lake and Pyramid Lake appear twice (they were all planted on separate days). So there are REALLY only 4 lakes here.
Now, which ones of these are in Northern California?
The easiest way to figure this out is to map these lakes onto Google's My Maps. So I did that... exporting the top lakes from the Fusion Table to a Google spreadsheet, and then IMPORTING that sheet into a new Google My Map. (Here's how to import the data.)
This is great, right? My map of the top 9 lakes above 8,000 feet now looks like this:
There's just one little problem... None of these lakes are high altitude! What went wrong?
In the map above the "Horseshoe Lake" shown is in Alameda county, at an altitude of about 10 feet. Not 10,000 feet--just 10 feet. (It's actually an old quarry near San Francisco Bay, just barely above sea level.)
If you zoom in on the map, you'll see it there: Horseshoe Lake, part of a big park in Alameda County. Obviously, it was planted with fish at some point. So what happened?
We need to go deeper....
Clearly, something is wonky with our data. But what?
Now I have to track my data really carefully. Where did each item come from, and how was it handled?
As I've said in previous posts, you have to track your metadata. Whenever you create a data table (either as a spreadsheet or a Fusion Table), be sure to write down where it came from and what you did to it.
To figure this out, I decided to track down the Horseshoe Lake example.
First, I went back to my original FT, Lakes in CA, which has all of the lakes and lat/longs from USGS.
I then filtered that data set to see ALL of the lakes with "horse" in the name. (Just click on "Filter" and enter the text "horse.") As you can see, here they all are...
There are 6 lakes names "Horseshoe Lake" (and 1 named "Horseshoes Lake"), and as you can see, several of them are above 8,000 feet.
Oddly, none of them are in Alameda county! What gives?
Let's go look at the other datasource--the Fish Planting spreadsheet and do the same thing, look for all lakes with the string "horse" in it. When I do that, I see:
Okay.. this is interesting. The lat/long of this lake is 37.57, -122.0014 -- which is what we see in the map above. BUT it's in Alameda! And more strangely, this Horseshoe Lake doesn't appear in the master list of California lakes!
So I thought I'd plot out all of the lakes in the list shown in Figure 12... And--guess what--they all seem wrong! That is, the lat/longs don't seem to point to the lakes they list. That is, the lat/long for Horseshoe Lake in Sierra county (393912N, 1203930W) is close, but not EXACTLY on the lake.
I'm going to stop here for today, but let's look at what we need to figure out.
1. Why are the lat/longs for the lakes in USGS (FT, Lakes in CA, Fig 12) wrong? That doesn't make sense--they're the people who provided the original data!
2. Why are the Horseshoe lakes plotted on the map in Fig 11 all in the wrong place?
3. (Important clue) Why is the lat/long of the Horseshoe Lake in Alameda county not the same as any lat/long in the master list?
4. (Another important clue) Why do all of the Horseshoe lakes in the Fusion Table of Fig 10 have the same lat/long??
All these questions will be answered in TOMORROW's installment of SearchResearch.
The quest continues...