## Wednesday, January 21, 2015

### Answer: Where's the lake... (Part 3 of 3)

I realize that this is a kind of long process, but stick with me until the end.  It's got a twist ending.

A quick recap of what we've done over the past couple of days...

Post #1 we started to figure out everything that's needed.  Here's what we did:

1. Find a list of lakes in CA.  Find the USGS GNIS system, and use it to...
2. Download the GNIS list of lakes, import into a Fusion Table (“Lakes in CA”), convert their LAT, LONG into decimal lat/long.
3. Find a list of the stocked lakes in CA.  Find the CDFW list.
5. Merge the “Fishplanting” table with “Lakes in CA” to make “Merge of Lakes…” FT
6. Check the data.  Create a map of all the Horseshoe Lakes.  Notice that the Horseshoe Lakes are all in the wrong places.  Why?  Make a plan for the next post...

Post #2 we've got errors--we need to fix them up and find the answer.

7. Work backwards to discover that the lat/long conversion in step 2 was wrong.  Fix that up and re-merge the tables together.
8. Map the Horseshoe Lakes again, and check that they’re correct.  (They are now.)
9. Check ALL of the Horseshoe Lakes in the table and find that the one in Alameda county is missing—why?
10.  Discover that it’s actually a reservoir, and not in the master list of lakes from GNIS (step 2).
11.  Go back to GNIS and import a list of all the reservoirs in CA, and re-merge that table into the master list of “CA lakes and reservoirs with extracted lat/longs”  and re-merge everything together.
12. Now, in the newly re-merged table, notice that all of the Horseshoe Lakes have the same lat/long.  Now what?  Why is this happening?
13. By looking over the data tables, find that there’s only one Horseshoe Lake that has been planted, the one in Alameda county.  Figure out that the fusing step overwrites ALL of the lakes names “Horseshoe Lake” with the same lat/long value because the lake names are not unique.

That gets us up to today’s post.

Post #3 in which we solve those last problems and find the answer!

Here's a sketch of what we need to do today.

Step A. Create the unique IDs by fixing up the table to have a new column of Lake+County name.

Step B. Re-fuse the tables together to make a new, correct table.

Step C. Filter to see the lakes that have been planted, and in Northern California, AND above 8,000 feet in elevation.

Step D.  Figure out the surface area of the lakes, and find the lakes that are > 500 acres, above 8,000 feet, AND have been stocked with fish in 2014.

Step A.  Fix up FT to have Lake+County IDs.

First, I opened the original spreadsheet for the "Fish Planting" (because you can’t manipulate columns easily in a Fusion Table).  Then, in that spreadsheet I...

- added a new column (G) called “LakeCounty” which is column C (“Lake name”) concatenated with column B (“County”).   This gives us a new column "LakeCounty" which is unique, and will let us fuse the tables together.  It looks like this:

Check out the new column on the far right.  That will come in handy in a bit.

I'll want to fuse this with the fish plantings FT in a minute, but first I need to convert this spreadsheet into a FT first.  Create a NEW FT “Fish Planting Schedule 3” with the newly updated “Fish Planting” spreadsheet.

Now I need to go back to the “CA lakes with extracted lat/longs” spreadsheet  and add a new column (T) that is called “LakeCounty” which is column A (“Lake name”) and concatenate it with column D (“County”).   This will give us a key (that is, the column “LakeCounty”) over which we can fuse our data sets.

Now, create a new FT from that spreadsheet, and give it the same name “CA lakes with extracted lat/longs”  (There's no confusion here because the first one is a spreadsheet, while this one is a Fusion Table (FT).

Step B.  Fuse the tables together.  We need to fuse these two FTs -- “Fish planting schedule 3” with the “CA lakes with extracted lat/longs”.  This will give us the planted lakes fused on the key “LakeCounty.”

With me so far?  This table looks like this:

 Note that I've hidden a lot of the extra columns that we don't care about.

NOW we’ve got the Fusion Table we want.  It’s got unique IDs for each lake (“LakeCounty”), along with the correct lat/long, the elevation, and the fish-planting date.  The blank rows on the right (where the yellow marks are) are lakes that were never planted.  Those columns ("Announce Date" etc) all come from the Fish Planting data that was fused with the lakes table.)

First we look for the lakes/reservoirs in Northern California.  To play it safe, let’s filter on all lakes above latitude 39.0 (that’s roughly the centerline of California).  Click on the Filter button, and let's look at lakes between 39 degrees and 90 degrees of latitude.  (Remember that the correctly extracted latitude is in column LAT1.)

Then sort by elevation (“Ele(ft)”) and we see this list sorted by elevation.  (Note that I’ve hidden many of the columns that aren’t useful to us now.  Here we see only the relevant columns.)

Step C.  Filter to see the planted lakes.  And now, in the next-to-last step, we can filter by which lakes have been planted.  Just add a new filter (by clicking on the blue Filter button) and add a filter by FishType = ‘trout’

WHAT?

There ARE no lakes that have been planted with fish in California that are above 8,000 feet in elevation!

As you can see, there are only 2 planted lakes in northern California, and only 1 of them is even reasonably high (Lake Almanor, at 4505 feet).

Step D.  Check on the sizes of the lakes.

Let’s check to see how big these lakes are.

A quick search for [ Lake Almanor ] tells us that it’s 43.75 sq miles (113.3 km²), or well over 500 acres.  Likewise, a search for [ Baum Lake surface area ] tells us that it’s only 89 acres in size.  It’s nowhere near 500 acres we were looking for.  (And a quick look on the map reveals it’s long and skinny—not a biggish lake at all.)

So... after all this analysis, we find that Little Rob and Big Jim probably don't actually need the Acetazolamide pills.  (I'm sure you did a search on Acetazolamide and found that it's a commonly used prescription drug for high altitude sickness.)

But they're clearly headed for the largest, highest lake that's been planted with trout... and that's probably Lake Almanor in Plumas county.

 Map of Northern California showing Lake Almanor between Lassen and Tahoe National Forests

Search Lessons:

There are a LOT of lessons here, and I'll discuss some of them in more detail tomorrow.

But the two biggest lessons are the ones I hope you take away, even if you didn't go through all of the Fusion Table details.

1.  Sometimes the results aren't quite what you thought.  In this case, because we took the "download all the data and analyze it" approach, we were able to show pretty conclusively that there ARE NO lakes over 8000 feet that are planted with fish.  Just doing a set of queries for this kind of Challenge just doesn't cut it.  This is definitely one of the hardest things to find--a negative result.  But we were able to do it, even if (especially if) the result isn't what you expected.

2.  Complex search is often a multi-step process.  As you saw, I took a lot of back-and-forth steps along the way, making one table, and then having to go back and modify it to include more information (e.g., adding a unique ID for lake names).  This was a slightly more involved search than usual, but it's not that uncommon a process for sensemaking and data science.  Welcome to data wrangling 101!

And... to top it all off, while this solution works just fine, it wasn't what I thought you'd do!  (I'm showing you the Fusion Table solution, but there's another way to solve the problem that involves building a database... but we'll tackle that in another month or so.)

Tomorrow... Reflections on the thought process to solve this Challenge, and my comments on the Discussion group for this problem

Searching on!