Wednesday, August 29, 2018

Answer: How to find difficult web pages? (Part 2)


What makes a page difficult to find?  (Part 2)

I was impressed by how well (and how quickly!) SRS readers were able to figure these out.  Some of the search paths were lovely and inspired.  Nice work, Readers!  


Here's what I did.  Let me repeat the two Challenges and then tell you what I did to answer them.  


1.  This happens to me more often than I would like:  Images in my blog will sometimes go missing in action.  This happens when a website disappears, leaving my nice link to their image with a gaping hole.  Perhaps you've seen it on other web sites--the hole looks like this: 

A broken image link leaves behind a hole-in-the-page.  I want to find a replacement image.  One that looks the same as this missing image!
How can I find a replacement image for this hole in my blog?  In other words, can you find this missing image?  This hole-in-the-blog comes from the SRS post of December 14, 2011 and shows a particular remote-control glider.  (In fact, it's one that I built back in the late 1990s.)  
The Challenge for skilled SRS-ers is to (a) figure out what that image looked like, and (b) find that image somewhere else on the internet.  Can you? 

My solution:  I tried opening this image by Control-clicking (right-click on Windows) on the image-hole and then "Open Image in New Tab"--like this: 

  
I wanted to get the URL of the image.  (And yes, I could have done "Copy Link Address," but you'll see why I did it this way in a second...)   The URL for this image is: 

www.carlgoldbergproducts.com/airplanes/gpma0960_01_bg.jpg


This is what you might see if you open this link in a new tab: 


This is a classic "page not found" error.  

If you recall from a few weeks ago, I mentioned that it's handy to use the Wayback Machine browser extension.  This is a Chrome (or Firefox) extension that pops up when you hit a missing page (or file).  So my display really looked like this: 


If you "click here," it takes you to the Wayback Machine, and if you follow the obvious links forward, you'll get to this page: 



Now I see that the image is from an old site about remote-control gliders.  That makes sense, and it's going to be one of those images, but which one?  

I just went back to the Wayback Machine and put in that image URL (the carlgoldbergproducts.com address shown above).  Here's what I get from the Wayback Machine: 


Great!  It looks like the image was last saved on Feb 8, 2018. But if you click on that, you get another "missing" image.  Truth is, sometimes you have to work your way back along the timeline to find a real version of this image.  I jumped back to Mar 12, 2014 and found this: 
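Stepping back along the timeline can also be done by typing addresses directly, because Wayback Machine captures follow a predictable pattern.  Here's a minimal sketch of that pattern in Python.  The function names are made up for illustration, and the two dates are simply the ones mentioned above; the archive redirects to whatever capture is closest to the date you ask for, so the page you land on may carry a different timestamp.

```python
from urllib.parse import urlencode

# The image URL from the post.
IMAGE_URL = "www.carlgoldbergproducts.com/airplanes/gpma0960_01_bg.jpg"


def wayback_snapshot_url(url: str, timestamp: str) -> str:
    """Build a Wayback Machine address for the capture closest to `timestamp`.

    `timestamp` is YYYYMMDD (a longer YYYYMMDDhhmmss form also works).
    """
    return f"https://web.archive.org/web/{timestamp}/{url}"


def availability_query(url: str, timestamp: str) -> str:
    """Build a request to the archive's availability endpoint, which
    describes the closest capture as JSON."""
    return "https://archive.org/wayback/available?" + urlencode(
        {"url": url, "timestamp": timestamp}
    )


# Walk back along the timeline: the most recent capture first, then an older one.
for stamp in ("20180208", "20140312"):
    print(wayback_snapshot_url(IMAGE_URL, stamp))
```

Paste either printed address into a browser.  If the capture turns out to be another "missing" page, try an earlier date.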


But I wasn't quite done yet.  I was wondering if that image had been used somewhere else.  Did this particular glider move from the Carl Goldberg company to some other place?  

To test this out, I did this query, looking for another use of this image name elsewhere on the web: 

     [ inurl:gpma0960_01_bg.jpg ] 

As you know, the inurl: operator searches for any string inside of a URL.  In this case, I was searching for that particular file name.  (Why?  Because I know that people are lazy and usually don't rename images.)  

Unfortunately, that gave me zero results.  

Now what?  

Let's look at the file name in detail.  It's: 

      gpma0960_01_bg.jpg

To me, this looks like a product code ("gpma0960") with a number (01) and a code indicating that it was used in the background (bg).  

What would happen if we just did an inurl: search for the product code name?  I'd expect to find all kinds of things with that code in the URL.  Here's my next search: 

     [ inurl:gpma0960 ] 

And... we hit the mother lode!  Here's the SERP for this query.  See how the product code appears in all of the URLs.  


This inurl: trick is incredibly useful for finding products, especially those that are no longer in production!  
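The narrowing-down above (full file name first, then just the product code) can be sketched as a small Python helper.  This is only an illustration: the function name is hypothetical, and splitting on underscores is an assumption that happens to fit this file name.  It also produces an in-between query (gpma0960_01) that I didn't try.

```python
from pathlib import PurePosixPath


def inurl_queries(image_url: str) -> list[str]:
    """Return inurl: queries, from most specific to least specific."""
    filename = PurePosixPath(image_url).name      # "gpma0960_01_bg.jpg"
    stem = PurePosixPath(filename).stem           # "gpma0960_01_bg"
    parts = stem.split("_")                       # assumes "_" separates the pieces

    queries = [f"inurl:{filename}"]
    # Drop trailing pieces one at a time until only the first piece is left.
    for end in range(len(parts) - 1, 0, -1):
        queries.append("inurl:" + "_".join(parts[:end]))
    return queries


for query in inurl_queries(
    "www.carlgoldbergproducts.com/airplanes/gpma0960_01_bg.jpg"
):
    print(f"[ {query} ]")
```

Try the queries in order and stop at the first one that returns useful results.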


2.  A while ago I was having dinner at a hole-in-the-wall Turkish restaurant somewhere in Europe and had a fantastic dessert.  It was rich, creamy, simple and wonderful.  I wrote down the name, kaymak, so I could find it again at a place closer to home.  My Challenge was to find a place near me (that is, in Mountain View, California) that sells kaymak.  Can you find a place in Mountain View, CA that sells this fantastic dessert?  
(Note that I do not want clotted cream, nor do I want to buy it online.  I want real kaymak that I can eat today!!)  
For extra credit (and this is the difficult part)--How much does this place in Mountain View sell it for?  


My solution started by searching for: 

     [ kaymak near me ] 

But if you're not in Mountain View, CA (as I am), you could do the equivalent thing with this query: 

     [ kaymak near Mountain View, CA ] 

In this case I included the city and state because there are multiple cities that share our name.  I wanted to be sure to get the right one.  Here's what I see: 


Notice that the first result is a Yelp page that lists places that sell "clotted cream."  That's close, but not quite what I wanted.  I want kaymak!  In this case, I want to turn off the synonyms, so I quote the term to get exactly that (and only that).  Note the difference between these two SERPs.  


This looks great!  

But oddly, when I open the Olympus Caffe & Bakery web site, I can't find the word kaymak on the page.  This is a case where my Control-F skills didn't pan out.  

Now what?  As you can see, it's not on the page!  


I'm confident that kaymak is here, somewhere.  Where?  

I could start clicking on all of the buttons (e.g. "Cakes/Desserts"), but I went with a more hacker approach, a method that's sometimes handy.  

I went ahead and did a View Source.  It's an option that you can get to like this:  


This will show you the raw HTML, which can be scary, but you can then search for kaymak... Here, I've highlighted the line, which happens to include the price:  $4.50 


If you read HTML, you can see it appears under the "Turkish Breakfast" menu item, which would have taken me a long time to find by clicking on all of the options.  
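What View Source plus Control-F does here can be sketched in a few lines of Python: search the raw HTML rather than the rendered page, so text tucked away in collapsed menus still turns up.  The markup below is invented for illustration (it is not the cafe's real page), and the function name is hypothetical.

```python
def find_in_source(html: str, term: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs for each raw HTML line mentioning `term`.

    The match is case-insensitive, like a Control-F search.
    """
    needle = term.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(html.splitlines(), start=1)
        if needle in line.lower()
    ]


# Hypothetical markup: a menu section that stays hidden until its button is clicked.
SAMPLE_HTML = """\
<div class="menu-section" style="display:none">
  <h3>Turkish Breakfast</h3>
  <p>Kaymak <span class="price">$4.50</span></p>
</div>
"""

for number, line in find_in_source(SAMPLE_HTML, "kaymak"):
    print(number, line)
```

The same function works on any page source you've saved to a file or copied out of the View Source window.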

Viewing the source of the page is often a useful method when the page is complex and has a lot of 


As I said, I was impressed by some of the answers in the comments this week.  Well done, team!  


Search Lessons 


1.  Remember the Internet Archive / Wayback Machine when looking for lost pages or images!  It doesn't cover absolutely everything, but it is an invaluable service to the community. 

2. Using INURL: to find other pages with the same text in the URL is often a great way to track down pages that share content with what-you're-seeking.  Don't underestimate the power of inertia:  Webmasters often prefer to keep the URLs of previously existing images and pages when they move (or copy) content.  As a side-effect of this, you can often find content that would otherwise go missing.  

3.  Developer>View Source  ... it gives you access to the ground truth for many pages.   In this case, I was able to find the kaymak entry very quickly, without all of that annoying clicking around in the menus to figure out which category of thing it was hidden under.  

Search on! 

Wednesday, August 22, 2018

SearchResearch Challenge (8/22/18): How to find difficult to find web pages? (Part 2)


What makes a page difficult to find?  (Part 2)

As we saw last week, sometimes you remember the page, but have difficulty figuring out the exact words for your query.  In one case, I remembered seeing an article on a topic (the American author disputing a Wikipedia article), but the results were filled with Wikipedia results, which in this case, really didn't help.  So we used the site: operator to exclude those results.  

The other example from last week was to find an article about a black racer snake from the New Jersey government educational site.  There, we had to figure out the right domain name (NJ.gov) and then restrict the search to that web site with site:NJ.gov.  

This week, I have two other "difficult to find" problems that I hope you can solve.  These are both a bit more tricky than last week's and require a bit more sophisticated search knowledge, so I hope you're up to the Challenge!  


1.  This happens to me more often than I would like:  Images in my blog (THIS blog!) will sometimes go missing in action.  This happens when a website disappears, leaving my nice link to their image with a gaping hole.  Perhaps you've seen it--the hole looks like this: 

A broken image link leaves behind a hole-in-the-page.  I want to find a replacement image–one that looks the same as this missing image!
Arrgh!  This is frustrating, but an inevitable consequence of having companies go out of business.  This causes link-rot and that makes the target of the link (in this case, the image of a remote-control glider) go missing.  It shows a broken image icon instead.     
How can I find a replacement image for this hole in my blog?  In other words, can you find this missing image?  This hole-in-the-blog comes from the SRS post of December 14, 2011 and shows a particular remote-control glider.  (In fact, it's one that I built back in the late 1990s.)  
The Challenge for skilled SRS-ers is to (a) figure out what that image looked like, and (b) find that image somewhere else on the internet.  Can you? 

A related difficult-to-find web page relies on a different technique... but it's also a toughie.  Can you answer this dessert-related Challenge?  

2.  A while ago I was having dinner at a hole-in-the-wall Turkish restaurant somewhere in Europe and had a fantastic dessert.  It was rich, creamy, simple and wonderful.  I wrote down the name, kaymak, so I could find it again at a place closer to home.  My Challenge was to find a place near me (that is, in Mountain View, California) that sells kaymak.  Can you find a place in Mountain View, CA that sells this fantastic dessert?  
(Note that I do not want clotted cream, nor do I want to buy it online.  I want real kaymak that I can eat today!!)  
For extra credit (and this is the difficult part)--How much does this place in Mountain View sell it for?  



These two Challenges need very different and fairly advanced techniques.  If you can solve both of these, you can rate yourself as a Jedi-level SearchResearcher!  

Please let us know how you solved the Challenges--and be as clear as possible in HOW you did it.  (For these Challenges, you need more than a clever query and the use of site:) 

Search on! 


Thursday, August 16, 2018

Answer: How to find difficult to find web pages? (Part 1)


There are many reasons...   


... why a particular page might be difficult to find.  Sometimes your memory is just plain wrong; sometimes your memory is so generic as to not remember anything that would let you pick the right page out of thousands of similar pages; sometimes the page is really missing (that is, a 404 error--web page missing).  
The Challenges from last week are interesting examples of Difficult-Web-Pages.  Let's talk about how to find these, and why they're tough.  

1.  A while ago I read an article about a famous US author who was having some difficulty editing his own Wikipedia page.  As crazy as it sounds, Wikipedia wanted independent verification of what he was saying.  I found myself wanting to re-read that article so I could refer to it in my writing.  I needed to find it to confirm details.  This was my Challenge:    
Who was the famous US author that was involved in a dispute with Wikipedia over the accuracy of the entry describing his novel?   

I had a very clear memory of this article, but I couldn't remember where I read it.  You might think that the obvious query is something like: 

     [ Wikipedia famous author article challenged ] 

(or something similar).  The problem is that the query isn't specific enough--there are a LOT of web pages with these terms, so the results are pretty scattered and don't help me find what I'm looking for.  We have to find a way to change the query:  


A big part of the problem here is that many of the results are FROM Wikipedia.  (Makes sense, since Wikipedia is a search term.)  In fact, the first 40 results are all from Wikipedia.org.  
What would happen if we excluded the Wikipedia results?  Would that improve our accuracy?
  
As you know, site: lets you search just within that site (e.g.,  [ site:Wikipedia.org ] )  
But how can we exclude a site?  That's easy: Use the minus symbol like this:  
     [ blah blah blah -site:Wikipedia.org ] 

Notice that small MINUS sign (aka a hyphen) in front of the site: operator.  That means to search everywhere on the web, but NOT on this site.  
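If you build queries in a script, the one detail to get right is that the minus must be a plain keyboard hyphen, not a typographic dash pasted in from a document.  Here's a minimal sketch in Python.  The helper names are hypothetical, and pairing the earlier query terms with the exclusion is just one plausible combination.

```python
from urllib.parse import urlencode


def exclude_sites(terms: str, *sites: str) -> str:
    """Append one -site: exclusion per site, using a plain ASCII hyphen."""
    exclusions = " ".join(f"-site:{site}" for site in sites)
    return f"{terms} {exclusions}".strip()


def search_url(query: str) -> str:
    """Encode the query as the q parameter of a Google search address."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = exclude_sites("Wikipedia famous author article challenged", "wikipedia.org")
print(query)
print(search_url(query))
```

Passing more than one site name excludes each of them, which is handy when two or three big sites crowd the results.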

When I do this search, my SERP looks like this:  




See that 4th result?  (The one at the bottom of this image.)  This is exactly what I was looking for--a famous US author (Philip Roth) who was in a dispute with Wikipedia over the accuracy of the entry describing one of his novels.  As Roth wrote in his open letter to Wikipedia,  he was told that "...I, Roth, was not a credible source: 'I understand your point that the author is the greatest authority on their own work,' writes the Wikipedia Administrator – 'but we require secondary sources.'"  This dispute went on for a while, and to their credit, Wikipedia repaired the entry, and it stands as an accurate source of information about Roth and his work.  

Other approaches work too.  Reader SpiritualLadder found the answer with the query: 

     [ dispute with author over book Wikipedia entry ] 

And D. Lazar found it with: 

     [ author dispute wikipedia ] 

I tried a few other queries like this (that is, without the -site: operator) and found it to be pretty hit or miss.  If you managed to guess the right words, you'd find the article.  Using the -site: operator gets you to the result pretty quickly. 

Several readers also found their way to the Wikipedia List of Controversies page (which is pretty interesting reading), and then found the article about Philip Roth on that page.  


A black racer rising up out of the grass.
Thanks & P/C Continis on Flickr.
2. See that image above?  That's a black racer snake.  I happened to see one the other day, and I remembered from previous reading that the state of New Jersey had a few articles about snakes in their state, including one about black racers in particular. 
Can you help my fading memory and find an article about the black racer snake that’s published by the state of New Jersey as part of their educational outreach program? 
This was my first query.  


I'm not proud of it, but I wanted to show you that even practiced searchers also make mistakes.  

What's wrong here? 

What I'm trying to do is to search only on websites in New Jersey.  I know that the postal abbreviation for New Jersey is NJ, so I used .nj as the target of the site: operator.  

But I got zero results.  Why?  

Once I recovered from the shock of zero results, I realized that no matter what you think of New Jersey, it doesn't have its own top-level-domain name.  (A top-level-domain name is the code at the very end of a domain name--e.g., .GOV .MIL  .EDU  .INFO etc.)  

An important lesson for searchers is to look at what's going on and try to debug your process.  What can you learn from this for next time?  In this case, I needed to know what the REAL web name is for the state of New Jersey.  

A quick search for [ official website New Jersey ] tells you that they're part of .GOV -- and their URLs all end in .NJ.GOV!  

Let's redo that query with the correct site specifier (and a better query).  


This looks more like what I'm seeking.  It's on New Jersey's educational web site, and it's about the black racer snake (Coluber constrictor).  

Now, to find all of the educational content at the NJ.GOV site, I just truncated the URL of the first result.  That is, I went to: https://www.nj.gov/pinelands/infor/educational/   and found a great page full of results...
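Truncating a URL this way is mechanical enough to sketch in Python.  The page address below is hypothetical (only the directory it sits in comes from this post), and the function name is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_directories(url: str) -> list[str]:
    """Return the directories that contain `url`, nearest first, ending at the site root.

    If `url` already ends in "/", it is treated as a directory and listed first.
    """
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if not parts.path.endswith("/"):
        segments = segments[:-1]  # drop the page's own file name

    results = []
    for depth in range(len(segments), -1, -1):
        path = "/" + "".join(segment + "/" for segment in segments[:depth])
        results.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return results


# Hypothetical page inside the educational directory.
page = "https://www.nj.gov/pinelands/infor/educational/example-page.html"
for directory in parent_directories(page):
    print(directory)
```

Not every directory on a web site has a browsable index page, so some of these addresses may come back as errors.  It costs nothing to try them.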





Search Lessons 

In this post, I really wanted to emphasize the way that site: operates.  There are two big lessons here. 

1.  You can use -site: as a way to remove invasive results from your search.  In this case, because we were searching for something about Wikipedia (but not necessarily ON Wikipedia), we used the -site: operator as a way to get rid of the annoying results that were all on Wikipedia.  Use this trick anytime you want to remove an entire site from consideration.  Usually, this happens with super popular sites that tend to dominate the results. 

2.  SITE: can take any site specifier, including subdomains and directories.  In this example we just used the subdomain + top-level-domain  .NJ.GOV  -- but we could also do a site: with a directory as well.  Here's an example showing that the Pinelands part of the official web site has around two thousand pages covering a broad range of topics.  (And you can see that Educational content is part of their much larger mission.)   
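The directory form of site: described in lesson 2 can be sketched in Python as follows.  The helper names are hypothetical, and the search terms are only an example; the idea is that a site: specifier is the host name plus an optional directory path, with no "https://" in front.

```python
from urllib.parse import urlsplit


def site_specifier(url: str) -> str:
    """Turn a URL into a site: specifier (host plus directory path, no scheme)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")  # a trailing slash isn't needed
    return f"site:{parts.netloc}{path}"


def site_query(terms: str, url: str) -> str:
    """Combine search terms with a site: restriction."""
    return f"{terms} {site_specifier(url)}".strip()


print(site_query("black racer", "https://nj.gov"))
print(site_query("black racer", "https://www.nj.gov/pinelands/"))
```

The first query covers the whole NJ.GOV site; the second covers only the Pinelands directory.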





As I mentioned, this is Part 1 of a series of "Difficult Web Page" search Challenges.  This one wasn't too difficult--Part 2 will be more challenging.  

During the next week or so I'll be doing an occasional additional post about topics that I think you'll be interested in reading.  See you here soon. 

Search on! 

Wednesday, August 8, 2018

SearchResearch Challenge (8/8/18): How to find difficult to find web pages? (Part 1)



Every so often you know a web page exists, but it's tough to put your finger on it.  


This last week I had several search Challenges pop up in my work.  Here are a couple of questions I found myself asking, and was ultimately able to resolve.  Can you?  

This is Part 1 of a series of "Difficult Web Page" search Challenges.  These first two aren't so hard--Part 2 will be more challenging.  Each of these Difficult Web Page SearchResearch Challenges is intended to highlight one particular method for doing your web searches with precision and skill.  


A black racer rising up out of the grass. 
Thanks & P/C Continis on Flickr.


1.  A while ago I read an article about a famous US author who was having some difficulty editing his own Wikipedia page.  As crazy as it sounds, Wikipedia wanted independent verification of what he was saying.  I found myself wanting to re-read that article so I could refer to it in my writing.  I needed to find it to confirm details.  This was my Challenge:    
Who was the famous US author that was involved in a dispute with Wikipedia over the accuracy of the entry describing his novel?   

2. See that image above?  That's a black racer snake.  I happened to see one the other day, and I remembered that the state of New Jersey had a few articles about snakes in their state, including one about black racers in particular. 
Can you help my fading memory and find an article about the black racer snake that’s published by the state of New Jersey as part of their educational outreach program? 

Answering these requires a bit of Search Engine Jedi-level skill.  Can you answer both of these Challenges?  

When you do, be sure to tell us how you did it in the comments!  

Good luck in your quest.  

Search on!