Friday, January 23, 2015

A quick reflection on the "Where's the lake...?" Challenge

I don't know about you, but for me the "Where's the lake...?" Challenge was probably the toughest Challenge in SearchResearch history.

But it didn't play out quite the way I thought it would!  I had 3 big surprises in this Challenge.  

1.  SearchResearchers fairly quickly agreed on using Fusion Tables as the main tool for pulling all the data together.  In the discussion group (which was quite active, and very fun to read), the consensus emerged fairly quickly. That makes sense, and it's the way I wrote my solution--following along with what everyone was doing.    

What I thought was going to happen was that you'd create a MySQL database in the cloud, load up the data, then run your query against it!  

In other words, I tried to write the Challenge so that a "regular" database was needed.  I didn't think that Fusion Tables would be a good solution--but I was wrong.  Obviously, it's quite possible to search a Fusion Table for "All lakes above 8000 feet that have been planted with trout in Northern California."  It was just a matter of applying filters to the data table once it was created by fusing different pieces together. 

I didn't expect that.  But it's a great solution.  Nicely done, team.  
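For the curious, here's a minimal sketch of the "regular database" route I'd expected, using SQLite in Python as a stand-in for a cloud MySQL instance. The schema, table names, and lake data below are made up for illustration--your merged dataset would have its own columns--but the query itself is the Challenge restated in SQL, using latitude 39 as the Northern California cutoff, as in my write-up.

```python
import sqlite3

# Hypothetical schema: one table of lakes, one table of fish plantings.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE lakes (name TEXT, elevation_ft REAL, lat REAL, lng REAL)")
cur.execute("CREATE TABLE plantings (lake_name TEXT, species TEXT)")

# Made-up sample rows, just to exercise the query.
cur.executemany("INSERT INTO lakes VALUES (?, ?, ?, ?)", [
    ("Caribou Lake", 6850, 40.5, -121.2),   # too low
    ("Gold Lake",    8200, 39.7, -120.6),   # fits all criteria
    ("Lower Lake",   8400, 38.2, -119.9),   # south of the lat-39 cutoff
])
cur.executemany("INSERT INTO plantings VALUES (?, ?)", [
    ("Gold Lake", "Rainbow trout"),
    ("Lower Lake", "Brook trout"),
    ("Caribou Lake", "Rainbow trout"),
])

# The Challenge as one query: lakes above 8000 feet, planted with
# trout, north of latitude 39.
cur.execute("""
    SELECT DISTINCT l.name
    FROM lakes l JOIN plantings p ON p.lake_name = l.name
    WHERE l.elevation_ft > 8000
      AND p.species LIKE '%trout%'
      AND l.lat >= 39.0
""")
print([row[0] for row in cur.fetchall()])  # -> ['Gold Lake']
```

The Fusion Tables filters do exactly what the WHERE clause does here--which is why they turned out to be enough.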


2.  The second thing I didn't expect was that there wouldn't BE ANY lakes that fit all of the criteria.  What do you know?!   (I mean, I knew there were a lot of lakes way high in the California Sierras, and I know that many of them were planted, and when I scanned the list I thought I saw some that I thought fit the bill.  Turns out I misread the dataset when I was creating the problem!) 

As I wrote in the previous blog post, that's part of the reason that you want to use complete data sets (and then put them into Fusion Tables or MySQL so you can query them with those tools).  This is exactly the kind of thing that search engines are NOT very good at--the very fine grain analysis of data.  A search engine can help you find the data, but then you have to process it a bit yourself, with your own tools.  Metaphorically speaking, the search engine can find you the cow, but you have to make your own sausage.  


3.  I didn't expect all of the back-and-forth steps during my solution.  I realize that my writeup (in three parts) was long and complicated, but I hope you took away one lasting lesson from this:  Even experts have to do a lot of iteration to get the data right.  

In effect, a lot of what I wrote down were all of the "I forgot to include this, let me back up and do it again with this new data" steps.  Normally in classes, the teachers don't show you these steps because they're slightly boring and show what an idiot you are.  (Remember doing proofs in your high school math class?  I realized after a while that nobody really does proofs like that.  Real mathematicians take a lot of forward-and-back steps to figure it out. Everyone goofs.)  

But when I wrote up my solution, I wanted to document all of those intermediate steps as well.  Real data scientists do this all the time, which is why I wanted to write it down, to show you the inner steps that I wish my teachers would have shown me.  

Overall this was a toughie, no doubt about it.  But I hope the search lessons are clear.  If I were to summarize them, I'd say: 

A.  Keep track of your data sources; keep your metadata with the data.  With all of the updates and recasting of the data, it was essential to know where a particular set of data originated.  Keep track of that stuff!  (In FT it's easy--there's a spot for it in the header.  In Spreadsheets, I always add a comment to cell A1 with the metadata.)  
B.  When something takes multiple days to solve, leave yourself a note at the end of the day so you know what you're doing and what's next.  That's why I summarized what the key questions were in each of my posts.  That's basically what I wrote down to keep track of the whole process.  
C.  Check your data.  Check your data.  Check your data.  As you saw, a couple of times I found errors in transcription, or data getting clobbered by accident. (Such as when the lat/longs on Horseshoe Lake were wrong.)  I like to try to view the data in a different way--such as plotting the locations on a map--to see what I can spot.  Spreadsheet computations are often a source of error, so constantly check to make sure that each time you touch the data, you're not accidentally messing it up.  
D.  Keep trying.  This was really a multiple step problem.  Sometimes you just have to stick with it.  
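Lesson C can even be partly automated. Here's a tiny sketch of the kind of sanity check I mean: flag any row whose coordinates can't plausibly be in California (which would have caught the Horseshoe Lake lat/long error). The field names and the bounding box are illustrative, not taken from the Challenge data.

```python
# Rough bounding box for California (approximate borders).
CA_LAT = (32.5, 42.0)
CA_LNG = (-124.5, -114.0)

# Made-up rows; the second has its lat/lng accidentally swapped.
rows = [
    {"name": "Horseshoe Lake", "lat": 37.6,  "lng": -119.0},
    {"name": "Mystery Lake",   "lat": 119.0, "lng": -37.6},
]

def suspicious(row):
    """True if the point can't plausibly be in California."""
    return not (CA_LAT[0] <= row["lat"] <= CA_LAT[1]
                and CA_LNG[0] <= row["lng"] <= CA_LNG[1])

print([r["name"] for r in rows if suspicious(r)])  # -> ['Mystery Lake']
```

Plotting the points on a map is the visual version of the same check--anything in the ocean (or in Asia) jumps right out.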

Thanks again to everyone for sticking with the problem.  It was great to see everyone pitching in and contributing.  

Search on! 




10 comments:

  1. Good Morning, Dr. Russell.

    Yes, this was one of the tough Challenges. I think others could be even harder. At least with this one we had some ideas about how to work through it.

    I never thought about using a database. That will be something I need to try.

    I really like that you always write all the "forward-and-back steps to figure it out" that helps a lot and also gives us all the steps and ability to re-do it ourselves.

    About there being no lakes that match the criteria: I think the issue could be the time frame. When searching approved sites, other possible lakes were found. I think if we try this Challenge again, changing only the fish-planting months, the answer will be totally different, and that is great. Your Challenges are timeless.

    Thanks to you for doing SearchResearch Challenges with multiple grades of difficulty, and especially for this one, where we changed the answer and you had to make two Challenges instead of one.

    Have a great weekend.

  2. This solution was very curious, but I see now what you were doing.

    Rosemary and the rest of us used lat 37 as the southern boundary of the northern area. Would you please run your solution again using 37 instead of 39? I am curious to see if using 37 makes "our" lakes/reservoirs show up.

    Thanks for this

    jontU

    Replies
    1. Hi Jon tU, I am working on your request. Thanks to Dr. Russell, we can "play" with his FTs and try different scenarios without deleting his work. And, if we save a copy, we can save our work too.

    2. Ramón is right--you can just do the filter on the "Merge of CA lakes with extracted lat/longs and Fish planting schedule 3" (The link is: https://www.google.com/fusiontables/data?docid=1ujcorasCYmIUlZzXxLCzS1tZwniVa6eH0Faz-L7n#rows:id=1 ) Do "your" lakes show up there?

  3. cheers to all that persevered through multiple data genuflects… it was way over my brain pan - after a month, I was flummoxed.

    it did make me think in pictures though (different data), rather than spreads, tables, data bases and tools… sigh
    "Little Rob was found dead of unknown causes in his apartment." … not foul play after all.

    ahhh, Little Rob, we hardly knew ya…
    Big Jim, catching rare NoCAL Frilled Chlamydoselachus anguineus @ Lake Almanor
    big data
    actionable analytics
    information in the wrong hands
    tools
    of dubious use, but I still lust
    the freeze dried meal allegedly dropped by Big Jim

    fwiw, Mr. Holmes told me Little Rob mis-administered his Diamox pills he was using for epilepsy and idiopathic intracranial hypertension and he had a seizure and passed on — after slipping and falling on a freeze dried dinner package he had dropped on the floor… for his cats, who actually prefer fish… so he never met up with Big Jim at the lake as they had planned - which was just as well because Jim went to a different lake, which wasn't on their list. Little Rob's mother, who resides in Maryland, was still heartbroken.
    will re-check data and premise later if I can find my notes. ;)

  4. The fact that Google Sheets and Fusion Tables provided us with the tools to do the Challenge was great, because I for one am not familiar with MySQL. And perhaps there are in-the-cloud databases free for our use. But our tools worked.
    My takeaways from this challenge are more about the process: checking data and managing the process, and how recording the source [metadata, as I learned] gives us a trail of breadcrumbs. The biggest lesson for all of us seems to be the one Dr. Dan himself experienced: sometimes the answer comes through two steps forward, one step back.
    In addition, I liked how the group kept expanding its search when the criteria didn't give us good results. We gathered information through another dataset, use of the Wayback Machine, and a longer timeframe. I learned a lot from the challenge. It was difficult and certainly the longest project we have tackled. Managing a challenge/project over time was challenging, since normally we're done in 1 or 2 days. All of it was a learning opportunity.

  5. When I was deciding how to approach this challenge, I decided that instead of using Fusion Tables (or SQL), I would use the VLOOKUP function in Google Sheets. However, the result was that I couldn't find any lakes that fit the criteria, and I assumed I must have done something wrong, so I didn't post earlier. But reading this, I thought I would just highlight that VLOOKUP has some of the capabilities of Fusion Tables and SQL, and for small datasets like the ones we were dealing with, it is one way to go.

    Replies
    1. Hello Dr. Russell and everyone.

      Tim, can you please share more about how you worked with VLOOKUP?

      I tried other ways to merge, and searched for other lakes, elevations, and different values with Dr. Russell's FTs. And I found this guide very useful for learning more about Google Fusion Tables.

    2. This actually is a pretty interesting approach. You can read more about VLOOKUP here: https://support.google.com/docs/answer/3093318 It basically lets you select all the values that match within a given column.

      Note that you could also use QUERY to do something similar (see: https://support.google.com/docs/answer/3093343 ), although it's a bit more complicated to use.
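      If it helps to see what VLOOKUP is doing under the hood, here's a rough Python sketch of the same idea: for each key, find the first matching row in another table and pull out one of its columns. The data here is made up for illustration.

```python
# Made-up lookup table: (lake name, species planted).
plantings = [
    ("Gold Lake", "Rainbow trout"),
    ("Silver Lake", "Brook trout"),
]

def vlookup(key, table, col):
    """Return column `col` of the first row whose first cell equals `key`."""
    for row in table:
        if row[0] == key:
            return row[col]
    return None  # where Sheets would show #N/A

print(vlookup("Silver Lake", plantings, 1))  # -> Brook trout
```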

    3. Thank you, Dr. Russell.

      I'll try to get the answer with that approach too, and also with QUERY. I was thinking about copying/pasting all the data into one single spreadsheet, with only the columns that we need. I didn't think that approach could work, and therefore never tried it.
