Monday, January 19, 2015

Answer: Where's the lake... (part 2 of 3)

Preface:  I know this is kind of a longwinded solution to the problem, but I wrote down all of the steps I took along the way.  "But Dan," I hear you cry, "why not just give us the answer?"  Because I wanted to show all of the steps and mistakes that I made along the way.  This is all to illustrate that sometimes a SearchResearch process takes multiple steps, sometimes going forward, and sometimes backing up to fix errors.  We'll wrap this all up tomorrow with the big conclusion and a set of invaluable search lessons!  

Last week I left off in mid-solution, trying to figure out 4 questions that came up during my research.  Here they are again... 

1.  Why are the lat/longs for the lakes in USGS (FT, Lakes in CA, Fig 12) wrong?  That doesn't make sense--USGS are the people who provided the original data!  
2.  Why are the Horseshoe lakes plotted on the map in Fig 11 all in the wrong place? 
3.  (Important clue) Why is the lat/long of the Horseshoe Lake in Alameda county not the same as any lat/long in the master list?  
4.  (Another important clue) Why do all of the Horseshoe lakes in the Fusion Table of Fig 10 have the same lat/long??  

I'll answer these questions today, and then, once we've got all of our ducks (or fish) in a row, we'll get to the final answer.  

Let's pull up these figures again. Here's figure 12 again.   

Fig 12. List of Lakes in CA in a Fusion Table, filtered by "horse" in the name.

I asked "why are the lat/longs wrong here"?

Here's what I thought was going on.  

I knew that California is mostly at the 37th latitude, and the longitude was mostly around -119 (you can look that up on any map--just drop 37, -119 into a Google Map and you'll end up more-or-less in the middle of California).  

So I wrote a small spreadsheet function to pull off the first 2 digits of the strings in the Latitude column, and the first 3 digits of the Longitude entries, and then use the rest of each string as the decimal part of the lat/long.  (That is, 1190983W would become -119.0983 )

But if you do that, then the coordinates don't actually point to the place they describe!  In other words, the entry for Horse Lake in Sierra county (393912N, 1203930W) would become 39.3912, -120.3930.  If you put that into Google Maps, it's about 59 miles south of the real Horse Lake in Sierra county.  

This puzzled me for a while until I had the bright idea of actually searching for the definition of the lat/long as defined by GNIS.  I did the query: 

     [ GNIS latitude longitude definition ] 

and found the page for "Geonames File Formats."  

I had committed a classic mistake:  I thought I knew what the format was.  And I based my analysis on what I thought.  And I was wrong.  This isn't the format of those numbers at all!

On this page, it says that the lat/long when stored as a "primary latitude" (or longitude) is stored in the DDMMSS format.  (That is, the first two characters are the Degrees, the next two are the Minutes, the last two are the Seconds.)  

I'd assumed they were in decimal lat/long format, not degrees/minutes/seconds format!   

I went back to my spreadsheet and wrote a small converter function to go from degrees/minutes/seconds to decimal.  (It sounds hard, but it's not.  Just extract the minutes, convert to a number and divide by 60.  Same thing for the seconds, but divide by 3600.  Add it all up and you've got your decimal lat/long, suitable for Google Maps. For details, see CA Lakes with extracted lat/longs.  Look at columns K through R.  You can see how I pulled out the middle characters, converted them to numbers, divided by 60 (or 3600) and added them together to give the correct lat/long).  

Now I've got the correct lat/longs.  And when I plot the lakes with the name "Horse" in them, everything looks correct. 

Here's a My Map I made with the lakes that have "horse" in their name.  

So now we've corrected the lat/longs for all the lakes.  And this answers the first two questions: What are the lat/longs "wrong"?  And "Why are the Horseshoe Lakes in Fig 11 in the wrong places?"  

Answer: I messed up the data conversion.  

But now, what's going on in Question 3 (from above)?   (Repeat:  "Why is the lat/long of Horseshoe Lake in Alameda Country not the same as any of the Horseshoe Lakes in the master list?") 

Ah... It's true.  Horseshoe Lake (Alameda) doesn't appear anywhere in the master list of lakes in California.  Why?  

There's a hint in my original question:  "Can you find all of the lakes and reservoirs in California...?"  

A quick search for [ Horseshoe Lake Alameda County ] quickly reveals that it's a reservoir, and the "master list of lakes" doesn't contain all of the reservoirs.  

Okay, so we have to backtrack and find a list of reservoirs and that THOSE into the master list.  

Back to the GNIS system, and this time ask for reservoirs!  

We can get that list of reservoirs, and append it to the bottom of our Lakes in CA spreadsheet and create a new spreadsheet CA lakes and reservoirs with extracted lat/longs.  

There it is--the full list of all Horseshoe Lakes, including Alameda County's reservoir.  

(FWIW, that spreadsheet is 3379 rows long.  A fair number of lakes! But I don't want to think about the number of lakes in Minnesota.)  

And now we're ready to answer the last of our questions for today:  "Why do all of the Horseshoe Lakes in the Fusion Table of Fig 10 have the same lat/long??"  

Fig 10.  Faulty Fusion Table with wrong lat/longs for Horseshoe Lake
As you can see, all of the Horseshoe Lakes are set to 37.57295200, -122.00143000 -- what's up with that?  

What happened is that we fused two tables with non-unique keys.  

That is, we took "Lakes in CA" and fused it with "FishPlantingSchedule2."  

But what happened is that there are (as we see from above) a LOT of lakes named "Horseshoe Lake."  But there's only ONE entry for Horseshoe Lake, and it's the one in Alameda!  

Here, I'm filtering the FishPlantingSchedule2 FT to show me all lakes named "Horseshoe" -- and you can see there's only one, and it's with that lat/long (37.57295200, -122.00143000).  This is kind of whacky.  What's going on? 

Let's step back for a second to see what happened.  

When you fuse two tables together, you're merging them based on a "key" -- in our case, we used the name of the lake as the key.  

Here are the column headings for the "Lakes in CA" FT.  

And here are the column headings for the "FishPlantingSchedule2" FT.  

When you fuse the tables, you have to pick a column to "join."  I chose to fuse (or join) on the "Name" column.  

In essence, the fuse process goes through every row in the "Lakes" table, and looks for a row in the "FishPlanting" table that has a matching name.  

When it does this, it finds all 11 of the "Horseshoe Lake" rows in the "Lakes" table, and joins them with the ONE ROW in the "FishPlanting" table.  In other words, it copies the contents of the "FishPlanting" Horseshoe Lake row into the fused table.  

And it copies the same lat/long into the new table as well.  

Ah ha!  That means that the lat/lon from those columns in the FishPlantingSchedule2 table gets copied into every row of the new fusion table.  

So what we need to do instead is to join on a column that has a unique name for every lake/reservoir.  

See that column labeled "ID" in the Lakes table?  That would be perfect!  That's a unique ID for every feature that GNIS knows about. It would be a perfect column to fuse on because every lake is guaranteed to have a different, unique number. 

Except we don't have the IDs for the lakes mentioned in the Fish Planting table.  Arrgh!  

But don't panic, we can make up our own unique IDs by appending the County name to the name of the lake/reservoir and then join on that unique column.  

BUT.. I'll do that tomorrow.  

Take heart, we're almost done.  

All that's left to do now is: 

1. Create the unique IDs by fixing up the table to have a new column of Lake+County name.  

2. Re-fuse the tables together to make a new, correct table. 

3. Pull the lakes that have been planted, and in Northern California, AND above 8,000 feet in elevation.  

And then we'll tackle the surface area.  (Believe me, after all this, the surface area will be a piece of cake.)  

Hang in there.  Almost to the end!  


  1. Good Morning, Dr. Russell. I think the three part answer is the best way. And, the time you provide between answers gives me the time to read, try and learn. Besides, it is like you say, only one part answer could make that we lost valuable information, tools and learnings.

    Reservoirs was a surprise. I thought you will choose to download both lakes and reservoirs in just one visit to GNIS.

    Thanks for sharing so many details with us, Dr. Russell.

    Have an incredible day.

    1. I thought I could get Lakes AND Reservoirs together in one data file. But I couldn't get the GNIS system to do that (even though the instructions say you can).

  2. I was wondering a quick and silly question. How we define "Northern California"?

    1. See today's answer. (It's a little arbitrary, but will make sense...)

    2. The best definition for "Northern California" can be found in the Wikipedia article on "Southern California":

      "Southern California" is not a formal geographic designation, and definitions of what constitutes Southern California vary. Geographically, California's north-south midway point lies at exactly 37° 9' 58.23" latitude, around 11 miles (18 km) south of San Jose; however, this does not coincide with popular use of the term. When the state is divided into two areas (northern and southern California), the term "Southern California" usually refers to the ten southern-most counties of the state. This definition coincides neatly with the county lines at 35° 47′ 28″ north latitude, which form the northern borders of San Luis Obispo, Kern, and San Bernardino counties. Another definition for southern California uses Point Conception and the Tehachapi Mountains as the northern boundary. Though there is no official definition for the northern boundary of southern California, such a division has existed from the time when Mexico ruled California, and political disputes raged between the Californios of Monterey in the upper part and Los Angeles in the lower part of Alta California. Following the acquisition of California by the United States, the division continued as part of the attempt by several pro-slavery politicians to arrange the division of Alta California at 36 degrees, 30 minutes, the line of the Missouri Compromise. Instead, the passing of the Compromise of 1850 enabled California to be admitted to the Union as a free state, preventing southern California from becoming its own separate slave state. Subsequently, Californios (dissatisfied with inequitable taxes and land laws) and pro-slavery southerners in the lightly populated "Cow Counties" of southern California attempted three times in the 1850s to achieve a separate statehood or territorial status separate from Northern California. The last attempt, the Pico Act of 1859, was passed by the California State Legislature and signed by the State governor John B. Weller. It was approved overwhelmingly by nearly 75% of voters in the proposed Territory of Colorado. This territory was to include all the counties up to the then much larger Tulare County (that included what is now Kings, most of Kern, and part of Inyo counties) and San Luis Obispo County. The proposal was sent to Washington, D.C. with a strong advocate in Senator Milton Latham. However, the secession crisis following the election of Abraham Lincoln in 1860 led to the proposal never coming to a vote. In 1900, the Los Angeles Times defined southern California as including "the seven counties of Los Angeles, San Bernardino, Orange, Riverside, San Diego, Ventura and Santa Barbara." In 1999, the Times added a newer county—Imperial—to that list.

    3. Interesting... I grew up in LA and now live in Northern CA (and so I'm partial to the definition of "anything north of the Tehachapi Pass"). But by all definitions, the two lakes we're discussing are in Northern CA.

    4. Yes, indeed very interesting. I found that also Chess divides California. With all those definitions, the way that Dr. Russell selected Northern California is the best way to select and find our lakes and reservoirs.