Thursday, August 16, 2018

Answer: How to find difficult to find web pages? (Part 1)


There are many reasons...   


... why a particular page might be difficult to find.  Sometimes your memory is just plain wrong; sometimes your memory is so generic that you can't recall anything that would let you pick the right page out of thousands of similar pages; sometimes the page is really missing (that is, you get a 404 "page not found" error).  
The Challenges from last week are interesting examples of Difficult-Web-Pages.  Let's talk about how to find these, and why they're tough.  

1.  A while ago I read an article about a famous US author who was having some difficulty editing his own Wikipedia page.  As crazy as it sounds, the Wikipedia editors wanted independent verification of what he was saying.  I found myself wanting to re-read that article so I could refer to it in my writing.  I needed to find it to confirm details.  This was my Challenge:    
Who was the famous US author that was involved in a dispute with Wikipedia over the accuracy of the entry describing his novel?   

I had a very clear memory of this article, but I couldn't remember where I read it.  You might think that the obvious query is something like: 

     [ Wikipedia famous author article challenged ] 

(or something similar).  The problem is that the query isn't specific enough.  There are a LOT of web pages with these terms, so the results are pretty scattered, and they didn't help me find what I was looking for.  We have to find a way to change the query.  


A big part of the problem here is that many of the results are FROM Wikipedia.  (Makes sense, since Wikipedia is a search term.)  In fact, the first 40 results are all from Wikipedia.org.  
What would happen if we excluded the Wikipedia results?  Would that improve our accuracy?
  
As you know, site: lets you search just within a single site (e.g.,  [ site:Wikipedia.org ] ).  
But how can we exclude a site?  That's easy: Use the minus symbol like this:  
     [ blah blah blah -site:Wikipedia.org ] 

Notice that small MINUS sign (aka a hyphen) in front of the site: operator.  That means to search everywhere on the web, but NOT on this site.  
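
If you build queries like this often, the pattern is easy to sketch in code.  Here's a minimal Python sketch, nothing official: the function names build_query and search_url are made up for illustration, and I'm assuming the familiar https://www.google.com/search?q=... URL form.  As an example, it takes the terms from my first query and adds the -site: exclusion.

```python
from urllib.parse import urlencode


def build_query(terms, include_site=None, exclude_sites=()):
    """Combine plain search terms with optional site: / -site: operators.

    Hypothetical helper for illustration; not part of any search API.
    """
    parts = [terms.strip()]
    if include_site:
        # site: restricts results to one site
        parts.append("site:" + include_site)
    for site in exclude_sites:
        # -site: (ASCII hyphen, no space after it) removes a site from the results
        parts.append("-site:" + site)
    return " ".join(parts)


def search_url(query):
    """Turn a query string into a search URL (assumed URL form)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query(
    "Wikipedia famous author article challenged",
    exclude_sites=["wikipedia.org"],
)
print(query)
print(search_url(query))
```

One detail worth noticing: the operator needs a plain keyboard hyphen directly in front of site:, with no space between them.  Word processors sometimes swap in a longer dash, which is not the same character.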

When I do this search, my SERP looks like this:  




See that 4th result?  (The one at the bottom of this image.)  This is exactly what I was looking for--a famous US author (Philip Roth) who was in a dispute with Wikipedia over the accuracy of the entry describing one of his novels.  As Roth wrote in his open letter to Wikipedia,  he was told that "...I, Roth, was not a credible source: 'I understand your point that the author is the greatest authority on their own work,' writes the Wikipedia Administrator – 'but we require secondary sources.'"  This dispute went on for a while, and to their credit, Wikipedia repaired the entry, and it stands as an accurate source of information about Roth and his work.  

Other approaches work too.  Reader SpiritualLadder found the answer with the query: 

     [ dispute with author over book Wikipedia entry ] 

And D. Lazar found it with: 

     [ author dispute wikipedia ] 

I tried a few other queries like this (that is, without the -site: operator) and found them to be pretty hit or miss.  If you managed to guess the right words, you'd find the article.  Using the -site: operator gets you to the result pretty quickly. 

Several readers also found their way to the Wikipedia List of Controversies page (which is pretty interesting reading), and then found the article about Philip Roth on that page.  


A black racer rising up out of the grass.
Thanks & P/C Continis on Flickr.
2. See that image above?  That's a black racer snake.  I happened to see one the other day, and I remembered from previous reading that the state of New Jersey had a few articles about snakes in their state, including one about black racers in particular. 
Can you help my fading memory and find an article about the black racer snake that’s published by the state of New Jersey as part of their educational outreach program? 
This was my first query.  


I'm not proud of it, but I wanted to show you that even practiced searchers make mistakes.  

What's wrong here? 

What I was trying to do was search only on websites in New Jersey.  I know that the abbreviation for New Jersey is NJ, so I used .nj as the target of the site: operator.  

But I got zero results.  Why?  

Once I calmed down after the shock of zero results, I realized that no matter what you think of New Jersey, it doesn't have its own top-level-domain name.  (A top-level-domain name is the code at the very end of a site's domain name, e.g., .GOV .MIL  .EDU  .INFO etc.)  

An important lesson for searchers is to look at what's going on and try to debug your process.  What can you learn from this for next time?  In this case, I needed to know what the REAL web name is for the state of New Jersey.  

A quick search for [ official website New Jersey ] tells you that they're part of .GOV, and their site names all end in .NJ.GOV!  
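
That debugging step can be sketched as a quick sanity check on the target of a site: operator.  This is only an illustration: SAMPLE_TLDS is a small hand-picked set (the codes mentioned above plus a few common ones), not the real list of top-level domains, and the function name ends_in_known_tld is made up.

```python
# Illustrative sample only. The authoritative list of top-level
# domains is maintained by IANA and is far longer than this.
SAMPLE_TLDS = {"gov", "mil", "edu", "info", "com", "org", "net"}


def ends_in_known_tld(site_target):
    """Return True if the last dot-separated label is in our sample set."""
    cleaned = site_target.strip().strip(".").lower()
    last_label = cleaned.rsplit(".", 1)[-1]
    return last_label in SAMPLE_TLDS


for target in [".nj", ".nj.gov"]:
    print(target, ends_in_known_tld(target))
```

With this sample set, .nj fails the check and .nj.gov passes, which matches what happened with my two queries.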

Let's redo that query with the correct site specifier (and a better query).  


This looks more like what I'm seeking.  It's on New Jersey's educational web site, and it's about the black racer snake (Coluber constrictor).  

Now, to find all of the educational content at the NJ.GOV site, I just truncated the URL of the first result.  That is, I went to: https://www.nj.gov/pinelands/infor/educational/   and found a great page full of results...
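
The URL-truncation trick is mechanical enough to sketch in Python.  This is a rough sketch under a couple of assumptions: the function name parent_urls is made up, and the page name example-page.html is a placeholder I invented (it is not the actual first result).  The idea is just to chop one path segment off the end at a time.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return progressively shorter versions of url, one directory at a time."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop the last segment repeatedly until only the site root remains.
    for end in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:end])
        if not path.endswith("/"):
            path += "/"
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


# Hypothetical page URL, used only to demonstrate the truncation.
page = "https://www.nj.gov/pinelands/infor/educational/example-page.html"
for parent in parent_urls(page):
    print(parent)
```

Not every truncated URL will lead to a real page, so treat each one as a place worth trying rather than a guaranteed hit.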





Search Lessons 

In this post, I really wanted to emphasize the way that site: operates.  There are two big lessons here. 

1.  You can use -site: as a way to remove invasive results from your search.  In this case, because we were searching for something about Wikipedia (but not necessarily ON Wikipedia), we used the -site: operator as a way to get rid of the annoying results that were all on Wikipedia.  Use this trick anytime you want to remove an entire site from consideration.  Usually, this happens with super popular sites that tend to dominate the results. 

2.  SITE: can take any site specifier, including subdomains and directories.  In this example we just used the subdomain + top-level-domain  .NJ.GOV  -- but we could also do a site: with a directory as well.  Here's an example showing that the Pinelands part of the official web site has around two thousand pages covering a broad range of topics.  (And you can see that Educational content is part of their much larger mission.)   
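
To make the "site plus directory" idea concrete, here's a small Python sketch that turns a URL into a site: specifier at whatever depth you like.  The function name site_specifier and its depth parameter are my own inventions for illustration.  Dropping the leading "www." is a choice I made so the specifier isn't tied to that one host name; keep it if you want to stay on that exact host.

```python
from urllib.parse import urlsplit


def site_specifier(url, depth=None):
    """Build a site: specifier from a URL.

    depth=0 keeps only the host, depth=1 keeps the first directory,
    and depth=None keeps the whole path. Hypothetical helper.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # design choice: do not tie the search to "www"
    segments = [s for s in parts.path.split("/") if s]
    if depth is not None:
        segments = segments[:depth]
    return "site:" + "/".join([host] + segments)


url = "https://www.nj.gov/pinelands/infor/educational/"
print(site_specifier(url, depth=0))
print(site_specifier(url, depth=1))
print(site_specifier(url))
```

Each step down narrows the search: the whole state site first, then just the Pinelands section, then only the educational pages.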





As I mentioned, this is Part 1 of a series of "Difficult Web Page" search Challenges.  This one wasn't too difficult--Part 2 will be more challenging.  

During the next week or so I'll be doing an occasional additional post about topics that I think you'll be interested in reading.  See you here soon. 

Search on! 
