Friday, September 26, 2014

Answer: Should I be worried about this fish?

In our previous episode... 

WHILE diving in the Somosomo Strait on September 8th of this year, I found this fish down at 10m, busily picking up chunks of coral and moving them from place to place.  

The question for this week was "Should I be worried about this fish?" 

You SearchResearchers did an excellent job of answering the questions.  

1.  What IS this fish, and should I worry about it being aggressive? 
2.  If so, WHEN should I worry?  Every day?  Or just sometimes?
3.  Should I have been worried on the day I took the photo? 

1.  How can we identify a random fish like this?  

As we've discussed before, you could spend a lot of time looking at photos of fish. But there are literally a lot of fish in the sea, and it could take a while.  A much better approach is to use some kind of key to identify the category of fish, and then zoom in to photos once you know a bit more.  (This is really the best, most general method of identifying any kind of plant or animal. Use a key.)  

My first query was to figure out where the "Somosomo Strait" is--that's not hard--it's a strait in the Fiji archipelago that's known for great scuba diving.  (Makes sense. Of course that's where I'd go!)  

So now I want to find a good fish ID key.  My query was: 

     [ fish identification key ] 

which led me to a site that has a really extensive index AND a great key system.  They're pretty serious about their fish.  

Here's a piece of their home page.  I immediately noticed the "Quick Identification" link and clicked on that.  

Once there, you'll see a set of options.  Each is a category of fish--click on it, and you'll go to the subcategory, etc etc, until you reach the fish family you're interested in examining.  Here's the top of their visual key: 

In this case, I'm going to click on the fish that looks most like the one in the image.  So here I click on "Ray-finned fishes."   That takes me to the next choice point in the key: 

Now at this point, I MIGHT click on "Puffers and Filefishes" (the shape of the mystery fish looks pretty much like the fish on the far left of that category), or I might scroll down and click on "Dories" farther down the page.  But "Dories" are, as the note down there says, "most are deep sea."  So I'll click on "Puffers and Filefishes" and see what's there. 

At that page, I see something that looks a LOT like the mystery fish.  See the "Balistidae (Triggerfishes)"?  I suspect that's our kind of fish.  An Image search for: 

     [ Fiji triggerfish ] 

quickly brings up a bunch of triggerfish, including one that's a perfect match.  Clicking on that then tells me it's a Titan triggerfish (Balistoides viridescens).  A quick Image search on the Latin name [ Balistoides viridescens ] gives me yet another confirmation: 

So we've identified our fish.  

Now, is it aggressive?  Let's do this search to see if we can find anything out about the behavior of the Titan triggerfish.  

     [ behavior Balistoides viridescens ] 

Note that I didn't search for "aggressive" here--that's just ASKING for confirmation.  Instead, I searched for "behavior" because maybe it's perfectly passive 99% of the time, and rarely aggressive.  If I searched for "aggressive," I'd be sure to find every single page talking about its aggressive tendencies.  It wouldn't be a fair research question.  

When I did this I was slightly surprised to learn that (despite me trying to be fair), the Titan triggerfish HAS been observed being fairly aggressive to other fish (and humans) who enter their territory.   

The Wikipedia Titan triggerfish article says that "...The titan triggerfish is usually wary of divers and snorkelers, but during the reproduction season the female guards its nest, which is placed in a flat sandy area, vigorously against any intruders. The territory around the nest is roughly cone-shaped and divers who accidentally enter it may be attacked. Divers should swim horizontally away from the nest rather than upwards which would only take them further into the territory. Although bites are not venomous, the strong teeth can inflict serious injury that may require medical attention..."  

Zounds!  As you can see from the photos, any fish with teeth like that (one that bites coral!) would have an impressive bite.  

Just to double check this, I used Google Scholar on that same query and found several articles documenting aggressive behavior of the Titans.  Interestingly, several of the articles (e.g. "Lek-like spawning, parental care and mating periodicity of the triggerfish Pseudobalistes flavimarginatus" [1]) point out that the males set up a mating ground (a "lek") where they establish, and defend, territories to which the females come and deposit their eggs. Both parents care for the eggs, although the female is "confined to the nest by the male."  Mating was semi-lunar, several days before the new and full moons on days when high tide occurred near sunset.  (Note that this paper is about a related fish, Pseudobalistes, but at the end of the paper, the authors say that this is true for the Titans as well.)  
Okay. So the question NOW is "was Sept 8 near a new or full moon when a high tide was near sunset?"  

Phases of the moon are easy to figure out:  

     [ phase of the moon calendar ] 

And yes, Sept 8 WAS a full moon according to   

What about the tides?  My query was: 

     [ high tide Fiji September 8 2014 ] 

(I gave the date because I wanted the historical record, not this week's tides.)  

The second and fourth columns are the high tide times.  Holy cow!  5:46PM was the high tide AND sunset was just 14 minutes later at 6:00PM FJT!  

So YES... I should be careful!  
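For anyone who'd rather check the moon and tide conditions with a few lines of code than with two searches, here is a minimal sketch. It is not how the answer above was found. The lunar part uses the mean length of the lunar cycle and a commonly used reference new moon, so it can easily be off by half a day or more; the function names and the choice of local noon in Fiji (00:00 UTC) are my own assumptions.

```python
from datetime import datetime, timezone

SYNODIC_MONTH_DAYS = 29.530588853  # mean length of one lunar cycle, in days
# A commonly used reference new moon (assumption: mean-cycle arithmetic only).
REFERENCE_NEW_MOON = datetime(2000, 1, 6, 18, 14, tzinfo=timezone.utc)


def moon_age_days(when):
    """Approximate number of days since the most recent new moon."""
    elapsed_days = (when - REFERENCE_NEW_MOON).total_seconds() / 86400.0
    return elapsed_days % SYNODIC_MONTH_DAYS


def days_from_new_or_full(when):
    """Approximate distance, in days, to the nearest new or full moon."""
    age = moon_age_days(when)
    half_cycle = SYNODIC_MONTH_DAYS / 2.0
    return min(age, abs(age - half_cycle), SYNODIC_MONTH_DAYS - age)


def minutes_between(time_a, time_b, fmt="%I:%M%p"):
    """Absolute gap in minutes between two same-day clock times."""
    a = datetime.strptime(time_a, fmt)
    b = datetime.strptime(time_b, fmt)
    return abs((b - a).total_seconds()) / 60.0


# Local noon in Fiji (UTC+12) on September 8, 2014 is 00:00 UTC that day.
dive_day = datetime(2014, 9, 8, 0, 0, tzinfo=timezone.utc)
print("days from nearest new/full moon:", round(days_from_new_or_full(dive_day), 1))
print("minutes between high tide and sunset:", minutes_between("5:46PM", "6:00PM"))
```

A small value for both numbers is the pattern the Gladstone paper describes. For anything that matters, use a real almanac or tide table rather than this approximation.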

Search Lessons:  

1.  Know the geography.  In this case, fish look very much alike, but can be different worldwide.  Knowing that Somosomo Strait is in Fiji really helps. 

2. Use an identification key.  There are keys for fish, plants, animals, insects, fungi, flowers, etc etc.  Know that a good key is almost always the best approach for identifying something.  

3.  Do not bias the results by including "leading terms" in your query.  In this case, the Titan triggerfish really IS aggressive, but don't search for trouble to begin with.  Let the data guide you to that interpretation--don't overlimit your search results to only those with evidence that confirms your already existing biases.  

Note:  Rosemary made a great observation about using Search-By-Image for this Challenge.  It's such an interesting finding that I'll write a separate post about that.  

[1] Gladstone, William. "Lek-like spawning, parental care and mating periodicity of the triggerfish Pseudobalistes flavimarginatus (Balistidae)." Environmental Biology of Fishes 39.3 (1994): 249-257.

Wednesday, September 24, 2014

Search Challenge (9/24/14): Should I be worried about this fish?

WHILE diving in the Somosomo Strait on September 8th of this year, I found this fish down at 10m, busily picking up chunks of coral and moving them from place to place.  

It's a pretty big fish, around 70 cm / 27 inches tip to tail (and the chunks of coral it was moving were the size of my dive buddy's fist).  The teeth on this thing are also impressive, and seeing what it could do to coral makes me think that I'd prefer to not tangle with this fish. 

And that's today's Search Challenge: 

1.  What IS this fish, and should I worry about it being aggressive? 
2.  If so, WHEN should I worry?  Every day?  Or just sometimes?
3.  Should I have been worried on the day I took the photo? 

Even though it sounds crazy-hard, this isn't that hard of a problem, but it requires linking together a few different resources.  Can you figure it out?  (Ideally, we should find authoritative resources to answer this.  Can you find them?)  

As always, be sure to tell us what you did to answer the Challenge, and how you figured it out.  

Search on! 

P.S.  I'll get back to the Twain place-names tomorrow.  It's been an overly busy week, unfortunately.  

Friday, September 19, 2014

Answer (delayed)...

Sorry folks... this really hasn't been my week at all.  After coming back from my dive trip, I'm still working through the tail-end of my cold/flu thing, while simultaneously trying to get a bunch of unexpected things done at work.  Usually I can just get up earlier in the morning and get my SRS writing done, but this week I just couldn't quite pull it off.  

Luckily, the weekend is coming, and I'll catch up on this long conversation on Monday.  More maps, more analysis, more information.  

Have a great weekend!  See you then. 

-- Dan 

Wednesday, September 17, 2014

Answer (Part 1) to: Can you find the places Twain mentions in "Around the Equator"?

I have to start off by saying that this really is a complicated and difficult challenge.  But the SRSers rose to the challenge.

Answering this is slightly complicated as well, so I'm going to write this up in two (or three) parts.  

Here's installment #1, which is really a story of how to keep digging in, learning things along the way, and finally coming up with something that works.  

Entity identification in arbitrary text.    

When I sat down to do this Challenge I had an advantage--I already knew about the idea of "entity identification" (aka "named-entity recognition").  The idea is that your computer can scan a text (say, "Around the Equator") and automatically identify named entities--the names of cities, rivers, states, countries, mountain ranges, villages, etc.    

Just knowing that this kind of thing exists is a huge help.  All I figured I'd need to do is to find such a service and then use it to pull out all of the entities from the text. 

My plan at this point was just to filter them by kind, merge duplicates, clean the data a bit, and I'd be done.  

But things are never quite this easy.  

My first query was for: 

     [ geo name text entity extraction ] 

which leads to a number of online services that will run an entity extractor over the text.  

The one I tried first, Alchemy, looks like this: 

You can see that I downloaded the full text from Gutenberg onto my personal web server and handed that link to Alchemy.  

I thought that this would be it--that I'd be done in just a few moments.  But no.  Turns out that you can't just hand Alchemy a giant blob of text (like the entire book), but you have to do it in 50K chunks.  

That is, I would have to split up the entire book (Twain-full-text-Equator-book.txt) into a bunch of smaller files, and run those one at a time.  

Since the entire book is 1.1MB, that means I'd have to create 22 separate files, each with no more than 49,999 bytes.  

I happen to know that Unix has a command called split that will do that.  I used the split command to break it up into 22 files and I moved those all back out to my server.  
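On a Unix machine, something like `split -b 49999 Twain-full-text-Equator-book.txt Twain-part-` produces files named Twain-part-aa, Twain-part-ab, and so on. If you don't have split handy, here is a minimal Python sketch of the same idea. The function name and defaults are mine, and note that cutting at a fixed byte count can slice a word (or a multi-byte character) in half at a chunk boundary.

```python
from pathlib import Path
from string import ascii_lowercase


def split_file(source, chunk_bytes=49_999, prefix="Twain-part-"):
    """Write fixed-size byte chunks next to `source`, named like Unix split.

    Supports up to 26 * 26 chunks (suffixes aa through zz).
    """
    source = Path(source)
    data = source.read_bytes()
    written = []
    for start in range(0, len(data), chunk_bytes):
        index = start // chunk_bytes
        # Two-letter suffix: aa, ab, ..., az, ba, ...
        suffix = ascii_lowercase[index // 26] + ascii_lowercase[index % 26]
        target = source.parent / f"{prefix}{suffix}"
        target.write_bytes(data[start:start + chunk_bytes])
        written.append(target)
    return written


# Example (assumes the book's text file is in the current directory):
# parts = split_file("Twain-full-text-Equator-book.txt")
# print(len(parts), "files written")
```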

At this point my natural inclination would be to write a program to call the Alchemy API.  The program would basically be something like: 

for each file in Twain-Docs: 
     entities =  Alchemy-Api-Extract-Entities( file ) 
     append entities to end of entitiesListFile 

Which would give me a big file with all of the entities in it.  But I didn't want this to turn into a programming problem, so I looked for a Spreadsheet solution.  
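For readers who do want the programming route, the loop above might look roughly like this in Python. This is a sketch under assumptions: the extractor is passed in as a function, because the details of the Alchemy call aren't shown here, and `toy_extract_entities` is a crude stand-in (it just grabs capitalized words), not real entity recognition.

```python
import re
from pathlib import Path


def toy_extract_entities(path):
    """Hypothetical stand-in for a real entity-extraction service."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def collect_entities(chunk_dir, extract_entities, out_file="entitiesListFile.txt"):
    """Run the extractor over every chunk and append results to one file."""
    chunk_paths = sorted(Path(chunk_dir).glob("Twain-part-*"))
    with open(out_file, "a", encoding="utf-8") as out:
        for path in chunk_paths:
            for entity in extract_entities(path):
                out.write(f"{entity}\n")
    return len(chunk_paths)


# Example (assumes the chunks live in a folder called Twain-Docs):
# collect_entities("Twain-Docs", toy_extract_entities)
```

Swapping in a real service means replacing only the function you pass as `extract_entities`; the loop stays the same.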

Turns out that Google Spreadsheets has a function that lets you do exactly this.  You can write this into your spreadsheet cell:  

     =ImportXML (url, xpath)  

where url is the URL of the AlchemyAPI and xpath is an expression that says what you're looking for from the result.  

Basically, the url has three parts.  (I learned all this by reading the Alchemy documentation.)

Let's decrypt this a bit.... 

The first part:

tells Alchemy that I want for it to pull out all of the "RankedNamedEntities" in the text file that follows. 

The second part:  apikey=XXXXXXXXXX

tells Alchemy what my secret APIKey is.  (Note that XXXXXXXXXX is not my API key.  You have to fill out the form on Alchemy to get your own.  It's free, but it's how they track how many queries you've done.)  

The third part:  &url=

is the name of the file (neatly less than 50K bytes long) that I want it to analyze.  

Now, I make a spreadsheet with 22 of these =ImportXML(longURL, xpath) calls.

Here's my spreadsheet (but note that I've hidden my APIKey here).  

You can see the "Alchemy base url:" which is the basic part of the call to Alchemy. 

The "Composed URL" is the thing we hand to ImportXML.  That is, it's basically the: 

     AlchemyBase + analysisAction + APIkey + baseTextFile

Remember that the spreadsheet function ImportXML takes two arguments--the first is the URL to call Alchemy (which has the link to the file built into it) and the second is an XPath expression. 
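The composition step can be sketched in Python like this. The parameter names `apikey` and `url` come from the description above; the base address, the action name, and the server address in the example are placeholders I made up, so substitute the real ones from the service's documentation.

```python
from string import ascii_lowercase
from urllib.parse import urlencode


def compose_url(base, action, api_key, text_url):
    """Join base + action, then add the API key and file link as a query string."""
    query = urlencode({"apikey": api_key, "url": text_url})
    return f"{base}/{action}?{query}"


# Suffixes aa, ab, ..., av: one per chunk, 22 in all.
suffixes = ["a" + letter for letter in ascii_lowercase[:22]]

composed = [
    compose_url(
        "https://api.example.com",        # placeholder base URL
        "RankedNamedEntities",            # placeholder action name
        "XXXXXXXXXX",                     # your own API key goes here
        f"http://example.com/Twain-part-{suffix}",  # placeholder file link
    )
    for suffix in suffixes
]
print(composed[0])
```

Note that `urlencode` percent-encodes the inner link, which keeps its own slashes and colons from being confused with the outer URL.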

What's XPath?  I did the obvious search to find out.... 

     [ xpath tutorial ] 

and found a nice little intro to XPath.  Turns out that it's a kind of language for reaching into XML data and pulling out the parts that you want.  (It took me about 15 minutes to read up about XPath, and then figure out that all *I* wanted was to pull out the entities from the XML that's being imported.)  In short, all I needed was the XPath expression:  "//entity" as the second argument.  
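To see what "//entity" does, here is a small Python sketch using the standard library. The XML below is an invented sample in roughly the shape you'd expect from an entity extractor; it is not a real Alchemy response. One wrinkle: Python's ElementTree wants the relative form ".//entity" when you search from an element, which selects the same nodes here.

```python
import xml.etree.ElementTree as ET

# Invented sample data (assumption: each entity has a type and a text field).
SAMPLE_XML = """
<results>
  <entities>
    <entity><type>City</type><text>Paris</text></entity>
    <entity><type>Country</type><text>America</text></entity>
  </entities>
</results>
"""

root = ET.fromstring(SAMPLE_XML)

# ".//entity" finds every <entity> element at any depth below the root.
pairs = [
    (entity.findtext("type"), entity.findtext("text"))
    for entity in root.findall(".//entity")
]
print(pairs)  # prints [('City', 'Paris'), ('Country', 'America')]
```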

Then, for each of the 22 files I split up from the original text, I created a separate spreadsheet, where cell A1 gets the magic ImportXML function.  In this case, A1 on spreadsheet A has the ImportXML function that looks like this: 

   = ImportXML ("
            Twain-part-aa", "//entity")  

Here's what the sheets look like after the ImportXML function runs.  This is the Alchemy analysis of Twain-part-aa (that is, the first 50K bytes of the book):  

Looks pretty good, eh? 

I did this same thing 22 times, one analysis for each of the 22 sections of the book (Twain-part-aa through Twain-part-av).  

Then I copy/pasted all of the results into a single (new) tab of the spreadsheet.  I used paste-special>values so I could then do whatever I wanted with them.  

That new page of the spreadsheet looks like this.  

Remember that Alchemy is searching for MANY different kinds of entities (as you can see: HealthCondition, Person, Organization...) 

What we want is just the geographic entities.  This means I can now use the spreadsheet Filter operation.  (Click on cell A1, then click on Data>Filter.  It will pop up a menu with all of the values you can filter on.) 

Here you can see that I've already deselected "Crime"-- so all of the "Crime" entities will be filtered out of the list.  

Once I've filtered the list, I'm nearly done.  I can selectively filter for only the geographic entities I care about (City, Country, GeographicFeature, StateOrCountry...).  And my spreadsheet now looks like this: 

This list now has 567 placenames in it, many of which are duplicates.  To create a new list of only the unique names, I'll use the =Unique (range) function to create another tab in my spreadsheet with the unique names.

This gives me a sheet that looks like this: 

Now we have 283 unique entities. 
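The filter-then-Unique steps have a direct equivalent in code. A minimal sketch, assuming the results have been collected as (type, name) pairs; the sample rows are invented, and the type names are the ones listed above.

```python
GEO_TYPES = {"City", "Country", "GeographicFeature", "StateOrCountry"}


def unique_places(rows, geo_types=GEO_TYPES):
    """Keep only geographic entities, drop duplicates, sort alphabetically."""
    seen = set()
    for entity_type, name in rows:
        if entity_type in geo_types:
            seen.add(name)
    return sorted(seen)


# Invented sample rows for illustration.
rows = [
    ("City", "Paris"),
    ("Person", "Mark Twain"),
    ("Country", "America"),
    ("City", "Paris"),
    ("HealthCondition", "carbuncle"),
]
print(unique_places(rows))  # prints ['America', 'Paris']
```

Like the spreadsheet's Unique function, this treats names as different if they differ at all, so "Bombay" and "Bombay " (with a trailing space) would both survive unless you clean them first.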

This column (which I sorted into alphabetic order) looks pretty good, although there are a few oddities in it.  ("Ballarat Fly" is an express train to the Australian town of Ballarat. And "Bunder Rao Ram Chunder Clam Chowder" isn't a place name, it's just a funny expression that Alchemy Analytics thinks is a place. "Ornithorhynchus" isn't a place, it's the Latin name for a platypus...)  

So we still have some data cleaning to do.

But this is the point at which we need to do some spot checking to see how accurate the process has been.  As is clear, it has included a few extra "place names" that aren't quite right.  Each of these is called a "false positive."  By my count, the false positive rate is around 3% (that is, out of the 283, I found 8 clear mistakes).  

And that makes me wonder, how many "false negatives" are there?  That is, how many place names does Alchemy miss?

There's no good way to do this other than by sampling.  So I chose a section out of the middle of the text (Twain-part-ak, if you're curious) and manually checked for place names. 

I found about a 5% false negative rate as well... (including cities that should have been straightforward, like "Goa").  So this approach could be off by as much as 9 or 10%.  
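The arithmetic behind these rates is simple enough to write down. In this sketch the 8-out-of-283 figure comes from the count above; the sample counts for the false negative check are hypothetical, since the actual numbers from Twain-part-ak aren't given.

```python
def rate_percent(errors, total):
    """Errors as a percentage of the total."""
    return 100.0 * errors / total


# False positives: 8 clear mistakes among the 283 unique names.
print(f"false positive rate: {rate_percent(8, 283):.1f}%")  # prints 2.8%

# False negatives: hypothetical counts for a hand-checked sample.
found_in_sample = 38   # hypothetical: place names the tool did find
missed_in_sample = 2   # hypothetical: place names found only by hand
total_in_sample = found_in_sample + missed_in_sample
print(f"false negative rate: {rate_percent(missed_in_sample, total_in_sample):.1f}%")
```

Keep in mind that one sampled section is a small basis for an estimate; checking a second or third section would show how much the rate varies across the book.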

Still, this isn't bad for a first approximation.  But there's more work to be done. 

In tomorrow's installment, I'll talk about some of the other approaches people used in the Groups discussion.  There are always tradeoffs to make in these kinds of situations, and I'll talk about some of those tomorrow as well.  Creating a map with all this data?  That's Friday's discussion.  See you then. 

Part 2... tomorrow! 

Search on! 

Tuesday, September 16, 2014

I'm teaching a class on Google Books next week (Sept 23rd, 2014)

Want to be a Google Books wizard? If you're in Mountain View on 9/23/14, you can take my (free) class at the Plex.  

Register by clicking on this link.  

It starts at 6PM, runs till 7:30, with dinner (free!) to follow.  

You should already be a Books user (but I suspect that most of you reading this are)... 

Feel free to pass around to folks you know in the Mountain View / Palo Alto / Santa Clara / Menlo Park area that might have an interest in this. 

See you then. 

- Dan 

I'm back from vacation... and trying to catch up with your work!

Hi folks.  

I'm now back at home, reading through all of the comments and ideas, all of the back-and-forth everyone's been posting since I left.  Two quick comments spring to mind... 

1.  You guys worked really hard on this!  I see lots of evidence of people putting out ideas, other people testing them, and then other people doing some work, and generally building atop each other's investigations.  This is superb, and exceeds my wildest expectations.  Thanks.  

2.  This is a really hard problem.  I started working on it yesterday, and it's taken me about 4 hours thus far.  (Including dead ends.)  But I see the end, and it will come in at around 5 hours to complete.  (Not counting the writeup.)  It won't be perfect--there will be places mentioned in the text that will be missed--but we should be able to get pretty good accuracy.  Details tomorrow. 

To slightly complicate things, I had a great time on my vacation.  Turns out the resort did have Wifi, but it was a bit spotty; trying to do any real work would have been crazy-making.  

The good news is that the South Pacific was fantastic.  

And the bad news is that the moment I got home I was hammered with a bad case of the flu, so I'm barely functioning.  My solution won't be as clean and beautiful as I would have liked, but it'll be there. 

More tomorrow.  I'll post a few comments in the group today, but the answer will be on Wednesday.  (With no new challenge this week.  You've worked hard enough.  Take a week off yourself!)  

Dan enjoying the surface interval between dives.

Wednesday, September 3, 2014

Search challenge (9/3/14): Can you find the places Twain mentions in "Around the Equator"?

As I mentioned in my last post, I'm about to head out for a few days of SCUBA diving in an exotic, tropical (and undisclosed) location.  Who knows?  I might want to use some of the things I pick up there as future Search Challenges! 

This week's Challenge is one that I've wanted to do for a while, but never quite had the time (or nerve) to post it as a Challenge.  

It's fairly tricky, and will require some new skills on the part of Search Researchers.  But I'm confident that you can do this.

Here's the Search Challenge for today: 

Background:  I remember reading Mark Twain's Following the Equator as a schoolboy and completely enjoying the story.  I was also amazed at all of the places he visited.  I know he made it to Hawai'i and Australia, but he also seemed to visit much of the world... and in 1895.  By ship.  Suppose I want to do his trip over again.  Where all would I have to go?  
Challenge 1:  Can you figure out all of the place names he mentions in the book?  The link above is to the Gutenberg Project's plain-text version of his book.  Can you figure out some way to determine ALL of the place names he mentions? 

Example: The first two paragraphs of the book are... 

"The starting point of this lecturing-trip around the world was Paris, where we had been living a year or two.
We sailed for America, and there made certain preparations.  This took but little time.  Two members of my family elected to go with me.  Also a carbuncle.  The dictionary says a carbuncle is a kind of jewel.  Humor is out of place in a dictionary." 

In these paragraphs he mentions "Paris" and "America."  Those should be the first two entries in your list of placenames.  

Now, can you figure out ALL of the OTHER places he mentions in the course of the text?  

(And yes, I know he mentions a lot of places he doesn't actually visit; that's okay, for our list let's include every place he writes about and not worry about whether or not he actually visited there.)  

Obviously, you don't want to do this by hand.  So the question really is, can you find a way to solve this problem using SearchResearch methods? 

Challenge 2:  In case anyone finishes this early... Can you then create a set of Placemarks on Google Earth to show all of the places mentioned in your list of placenames?  Ideally, you should give us a link to your KML file with all of the places Twain mentions in the book.  
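If you haven't seen a KML file before, here is a minimal sketch of what building one in Python could look like. It's one possible shape for the output, not a required method. The function name is mine, the coordinates are approximate and typed in by hand for illustration, and turning place names into coordinates (geocoding) is a separate step this sketch doesn't cover.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(places):
    """Return a KML document with one Placemark per (name, lat, lon) tuple."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon in places:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML puts longitude first, then latitude, then altitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")


# Approximate, hand-entered coordinates (illustration only).
sample_places = [("Paris", 48.8566, 2.3522)]
kml_text = build_kml(sample_places)
print(kml_text)
```

Saving `kml_text` to a file ending in .kml gives you something Google Earth can open.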

This is probably the most sophisticated Challenge I've issued--which is why I'll write up my answer in about 2 weeks.  (Note that I haven't yet solved this myself; but I'm confident that I can.)  

As mentioned, I'll be out-of-town for the next 10 days, so we won't have a Challenge next week (Sept 10).  Instead, I'll write up my solution on Wednesday, Sept 17th.  

I'm also going to be off-the-grid (mostly), so I won't be able to approve your posts to the blog after Thursday.  (Well.. probably.  I will try to check in; but I'm not sure about Wifi coverage where I'm going.)  

So I set up a Google Group for everyone to discuss this Challenge.  For this problem, we can have our discussion in SRS Discusses Around The Equator.  (Click on that link to join the group.)  This way, I won't need to manually approve every comment to the blog (which is what I do now).  

As I said in the Welcome message for the group, this is a no-holds-barred Search Challenge.  If you want to work together, be my guest. You can set up Hangouts to meet and chat about possible solutions, you can swap ideas about how to solve it... Whatever works for you.  

It's a two week Challenge.  Are you up for it?  Can Team SearchResearch do it?  

Search on!