Wednesday, October 26, 2022

SearchResearch Challenge (10/26/22): A missing building in the park?

 Blank spaces on maps intrigue me, 

What was once here? 

... as do blank spaces on the land.  More than a few SRS Challenges have centered on finding out what once stood in a particular place, and this week's is in that category:  what happened here? 

In this case, I was walking through a local park and happened upon an area (see the image above) at 37.885832, -122.261408 that was fairly blank.  Since it was on an otherwise hilly area, it looked very much as if a building had been there at one time.  There were a few pieces of flat concrete that looked suspiciously like former building foundations, and even more mysteriously, there's a flagpole on the western edge of the flat spot.  

Well... huh.  What's up? After a bit of SRS I found the answer AND another interesting story about the place. This leads to today's Challenges: 

1. What was once at 37.885832, -122.261408?  Can you figure out the story given just that lat/long and a keen desire to figure out the past?  Was there a building there?  If so, who built it and why?

2. The architect of that building is also the architect of a few other pieces nearby.  Can you figure out which one is closest to this spot?  

As I mentioned last time, I'm deep in the middle of running my class at Stanford, so I'm going to keep the SRS Challenges on the interesting/fun but not-difficult side for the next 2 months.  Hope you enjoy these just as much! 

Search on! 

Wednesday, October 19, 2022

Answer 2: Can you find characters from Moby Dick in other places?


Let's wrap this up... 

 Side comment:  Why so long between posts?  Answer: In addition to my regular gig at Google as a research scientist, I'm ALSO teaching a class at Stanford University on "Human-Computer Interaction & AI/ML" with my friend and colleague Peter Norvig (syllabus).  It's a wonderful experience, but it's also taking a LOT of time.  I always forget how much effort it takes to create a new university-level course from scratch, especially one that's full of content-rich lectures.  Last week (and this week, to be honest) are very full of me writing the course, creating tests, making slides, and organizing the material.  Well, it got busy last week, and as you'll see below, writing up the Wikidata method wasn't straightforward.  Hope you'll bear with me for the next 7 weeks as we work through the course and write SRS posts.  I think I'll make the next few posts somewhat simpler questions--still very fun--but they shouldn't take me as much time to write up the answer.  The last day of the class is December 15th.  I'll have more time to post more advanced Challenges after that.  


As I mentioned last time, there are multiple ways to think about answering this question.  Let me show you the Wikidata approach and then summarize.    

Reminder: Our Challenge was... 

1. Can you find a way to identify other major works of fiction (leaving out fan-fiction for the moment) in which the names of "Starbuck" and "Queequeg" appear (either independently or together)?  

Last time we looked at using queries like:    [ site:wikipedia.org "starbuck" -starbucks ]  to search for all mentions of the word in ALL of Wikipedia.  (Note the use of the minus symbol to remove any mentions of that coffee company.)  

My plan was to write up a long post here about how to use Wikidata to search the data underlying Wikipedia to find all mentions of Starbuck or Queequeg in any literary object (books, movies, cartoons, etc.).  

So... I spent several hours learning the SPARQL query language for Wikidata and figuring out how to write those queries.  Here's what they look like in the Wikidata SPARQL editor: 

Yeah.  In this example the term p:P1441 stands for "is present in work" and wd:Q3414055 stands for "Queequeg."  Roughly, this query translates to "search for everything that has Queequeg present in the work."
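For readers who want to try something like this themselves, here is a minimal sketch in Python that builds such a query and a URL for the public Wikidata Query Service.  It is not the exact query I ran:  it assumes the simpler direct-claim prefix wdt:P1441 instead of p:P1441, the helper names are made up for illustration, and the Queequeg identifier is just the one quoted above.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def works_featuring(character_qid: str) -> str:
    """Build a SPARQL query for works the character is 'present in' (P1441).

    Assumption: uses the direct-claim prefix wdt: rather than p:.
    """
    return f"""
SELECT ?work ?workLabel WHERE {{
  wd:{character_qid} wdt:P1441 ?work .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def query_url(sparql: str) -> str:
    """Return a URL that asks the query service to run the query as JSON."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


QUEEQUEG = "Q3414055"  # identifier as quoted in this post; not re-verified here
query = works_featuring(QUEEQUEG)
print(query)
print(query_url(query))
```

You can paste the printed query into the Wikidata SPARQL editor, or open the printed URL to get the raw results.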

You have to know that "in the work" means, specifically, 

"this (fictional or fictionalized) entity or person appears in that work as part of the narration (use P2860 for works citing other works, P361/P1433 for works being part of other works, P1343 for entities described in non-fictional accounts)"

The SPARQL language is very powerful--you can ask nearly anything.  If you'd like to learn more about it, here's the SPARQL tutorial that I used.  Using this, you can ask questions like "Who are the grandchildren of Johann Sebastian Bach?" and get answers: 

As it happens, I know (because I'm a fan of early music) that JS Bach had 4 grandchildren, not 3: Anna Philippiana Friederica Bach, Wilhelm Friedrich Ernst Bach, Christina Luise Bach, and Johann Sebastian Altnickol.  That's a fact you can check in many ways, but notably by looking at the Wikipedia page about Bach's descendants.  And this illustrates a problem with Wikidata:  it has entries for everything that's a first-class object (e.g., famous people), but not everything that's a piece of text in Wikipedia has an entry in Wikidata.  Thus, if you run the SPARQL query for works that contain Queequeg, you'll get only "first-class items," such as well-known books, movies, etc... but not all of them.  If a book has Queequeg as a character, but Wikidata doesn't have an entry for Queequeg in that book, you won't find it.  
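The grandchildren question is a nice illustration of how a SPARQL query chains relationships.  Here is a sketch, assuming Q1339 for Johann Sebastian Bach and P40 for the "child" property (the function name is hypothetical).  The query follows "child" twice, so it can only return grandchildren who have their own Wikidata entries, which is exactly the coverage limit described above.

```python
def grandchildren_query(person_qid: str) -> str:
    """Build a SPARQL query that follows the 'child' property (P40) twice."""
    return f"""
SELECT ?grandChild ?grandChildLabel WHERE {{
  wd:{person_qid} wdt:P40 ?child .
  ?child wdt:P40 ?grandChild .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


BACH = "Q1339"  # Johann Sebastian Bach
bach_query = grandchildren_query(BACH)
print(bach_query)
```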

That's not really surprising--every database has a coverage issue.  (That is, the database contains only certain types and amounts of information; this is called its coverage.)  The coverage of Wikidata is less than the full text of Wikipedia.  

It took me a while to figure this out.  I was hoping that Wikidata would be more extensive, and allow me to find new entities that simple text search would not--but it didn't work out that way.  

And, in truth, it took me quite a while to learn how to use SPARQL.  It's sufficiently complex that unless you're going to use it every day, it's probably not worth the time to learn it.  (It's a great project, but not quite ready for the ordinary SRSer who just wants to look things up.)  

SearchResearch Lessons

As I said last week,  

1. When searching for literary resources, no single source is going to give us everything we want.  For complex search tasks like this, the best you can do is to assemble data from multiple sources.  It's going to be really really hard to get a complete list of all the uses of Starbuck or Queequeg in literary works with just a single search.  The sources are too many, too diverse, and with very different interfaces. Just realize this when you set out on your journey.  Some SRS Challenges are still really hard. 

2. Use SITE: on Wikipedia.  As a first repository of cultural knowledge, Wikipedia is pretty good.  Just searching for your target on Wikipedia (using the MINUS operator judiciously as needed) can get you pretty far.  In a previous post we found a LOT of hits this way.  

3. Consider searching other special collections.  Think about searching on Google Books (remembering that the Hathi Collection is mostly, but not quite the same as Google Books).  Searching in Books will get you a bunch of additional hits, but also search in places with other cultural resources such as movies (IMDb, to find "Age of the Dragons," where Captain Ahab is in search of the Great White Dragon, with trusty Starbuck and Queequeg on hand in the crew) or recorded music (e.g. Spotify, to find the song Queequeg by Quarteto Minimo).  
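If you keep notes as you go, the "assemble data from multiple sources" step in these lessons can be as simple as merging hit lists and remembering where each hit came from.  Here is a rough sketch of that idea.  The function name and the way the sample lists are split up are just for illustration, using a few of the hits from these posts.

```python
from collections import defaultdict


def merge_hits(hits_by_source):
    """Combine per-source hit lists into one mapping of title -> sources.

    Titles are compared case-insensitively, so the same work found in two
    places is listed once with both sources.
    """
    merged = defaultdict(set)
    for source, titles in hits_by_source.items():
        for title in titles:
            merged[title.strip().casefold()].add(source)
    return {title: sorted(sources) for title, sources in sorted(merged.items())}


# Illustrative sample only: a handful of hits mentioned in these posts.
hits = {
    "wikipedia": ["The Grim Grotto", "The Sea Beast", "Age of the Dragons"],
    "imdb": ["Age of the Dragons"],
    "spotify": ["Queequeg (Quarteto Minimo)"],
}

merged = merge_hits(hits)
for title, sources in merged.items():
    print(title, "<-", ", ".join(sources))
```

A real list would need smarter matching (translated titles, for example), but even this much makes it obvious which hits only one source knows about.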

Search on! 

Wednesday, October 5, 2022

Answer 1: Can you find characters from Moby Dick in other places?


Another way... 

 As I mentioned last time, there are multiple ways to think about answering this question.  I'll give one way today, and then ANOTHER answer later this week.  (I promise that next week we'll move to another Challenge.)  

Reminder: Our Challenge was... 

1. Can you find a way to identify other major works of fiction (leaving out fan-fiction for the moment) in which the names of "Starbuck" and "Queequeg" appear (either independently or together)?  

Last time we showed how to use Wikipedia to find mentions of Starbuck or Queequeg by looking at the Wikidata entry for each entity.   

Here's that page for the character Starbuck: 

We noticed the right hand side is a column with all of the Wikipedia articles about Starbuck in different languages. But look at the Wikidata entry for Queequeg! 

Note that there seem to be entries only in English and French (looking at the right hand column). BUT if you look at the main table (in the middle), you can see there are entries for English, Spanish, and NOT French.  What gives?  

In the previous post I asked the question for you to consider:  Is there another way to identify the Wiki pages that DO mention Starbuck, rather than relying on Wikipedia's own search function?  

Well, yes, of course there is.  Let's try it this way: 

     [ site:wikipedia.org "starbuck" -starbucks ] 

Notice that this will search all of the Wikipedia languages.  (To search just within the French Wikipedia, you'd search with: 

     [ site:fr.wikipedia.org "starbuck" -starbucks ]  ) 

If you search both the French and Spanish Wikipedias, you'll see that they both have multiple hits for the name Starbuck AND for the name Queequeg.  For instance, both French and Spanish Wikipedias mention Queequeg in connection with the story La Grotte Gorgone (FR), aka La cueva oscura (ES), which is aka The Grim Grotto (EN), the eleventh book in the series The Disastrous Adventures of the Baudelaire Orphans by Lemony Snicket (aka A Series of Unfortunate Events).  When I browsed through the list of hits, I also found another mention of Queequeg in Italian, etc.  
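If you want to run the same search across several language editions, the queries are easy to generate.  This is a small sketch, assuming the usual site: operator with language subdomains such as fr.wikipedia.org and the standard Google search URL.  The helper names are made up for illustration.

```python
from urllib.parse import quote_plus


def wikipedia_site_query(term, exclude=(), lang=None):
    """Build a site:-restricted query for a quoted term.

    lang=None searches every Wikipedia language; lang="fr" restricts the
    search to the French edition, and so on.
    """
    site = f"site:{lang}.wikipedia.org" if lang else "site:wikipedia.org"
    parts = [site, f'"{term}"'] + [f"-{word}" for word in exclude]
    return " ".join(parts)


def google_url(query):
    """Turn a query string into a Google search URL."""
    return "https://www.google.com/search?q=" + quote_plus(query)


for lang in (None, "fr", "es"):
    q = wikipedia_site_query("starbuck", exclude=["starbucks"], lang=lang)
    print(q)
    print(google_url(q))
```

Swapping "starbuck" for "queequeg" (and dropping the exclusion) gives the matching Queequeg searches.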

Interesting that the Wikidata page on Queequeg doesn't cover all this. This tells us that the Wikidata might not be complete over the sweep of the Wikipedia landscape.  

But the site: search should get most of the hits.  The search page for this site: query should look like this: 

This says there are about 498 results (although only 304 of them can actually be read).  By clicking on each link and opening in parallel, it's pretty simple to quickly assess what each of these hits actually represents.  

Quickly scanning the top 100 or so results leads me to this list.   (I'm only looking for literary references and ignoring fascinating asides like the insect from Guinea, West Africa, Queequeg flavibasalis.)  

In La Grotte Gorgone (French), Lemony Snicket book #11,  Queequeg is a submarine.  

The 2013 film "A Spell to Ward off the Darkness" features a Norwegian metal band named Queequeg.   

In the television show, The X Files, the character Dana Scully has a dog named Queequeg.  

Queequeg is the name of a shapeshifter, a kind of super-villain in the DC Comics universe who works with his buddy Ishmael.  Strangely, both of them work for Tobias Whale...  

A 1926 silent film, The Sea Beast, gets the whole crew together for a slightly different telling of Moby-Dick.  (Does this really count as a different work?  Maybe.  Unlike Moby-Dick, it has a happy ending.)               

In 2011, a fantasy film, Age of the Dragons, has the crew chasing dragons on land.  Queequeg is along for the ride with Danny Glover as Ahab.  

And in 1977 the book Queequeg's Odyssey came out, telling the (true!) story of the trimaran named for the harpooner that was built in Illinois, floated to the ocean, and used to sail around 

I could go on, but you see my point. A simple site: search on Wikipedia can reveal ALL of the pages that mention a given person's name.  

My point:  (and I do have one) 

... is that no single source is going to give us everything we want in a single search.  For complex search tasks like this, the best you can do is to assemble data from multiple sources.  It's going to be really really hard to get a complete list of all the uses of Starbuck or Queequeg in literary works.  

On Friday of this week I'll write one more post that pulls everything together along with a way to do a query of Wikidata using a tool that might surprise you.  

Search on!