Tuesday, January 10, 2012

What is AND about, really?

One question I hear quite a bit is this: Don't we have to teach the basics of Boolean search to our students? 

The answer, from a Google perspective, is this: We teach Boolean searching only for use with traditional database systems.

Here's why.  

Google queries let you use OR freely between terms.  It's basically a way for you to control your own synonyms.  A search like [ "mountain lion" OR cougar habitat ]  will look for [ "mountain lion" habitat ] OR [ cougar habitat ]  and then rank-order the results.  In effect, your query pulls the synonyms you named up to a higher place in the results.  Other synonyms, like "puma" or "painter", can still show up, but they'll be farther down the list.
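To make the "control your own synonyms" idea concrete, here is a small Python sketch of how an OR query can be read as a set of alternative queries. It is only an illustration of the reading described above, not how Google actually processes a query; the function name expand_or_groups and the list-of-lists input format are made up for this example.

```python
from itertools import product

def expand_or_groups(groups):
    """Return every query you get by picking one term from each group.

    `groups` is a hypothetical format: each inner list holds terms
    that were joined by OR, i.e. your hand-picked synonyms.
    """
    return [" ".join(choice) for choice in product(*groups)]

# [ "mountain lion" OR cougar habitat ], already split into groups
groups = [['"mountain lion"', "cougar"], ["habitat"]]
for query in expand_or_groups(groups):
    print("[", query, "]")
```

Running it prints the two alternatives from the example, [ "mountain lion" habitat ] and [ cougar habitat ].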

So, OR basically tells Google that these terms are synonyms.  Note also that parentheses are dropped from the query.  What this means is that you group your OR terms together (e.g., [ lungs OR pleural OR respiration systems ]) without using complicated sets of parens.  Terms that are in an OR list (e.g., [ a OR b OR c w OR x OR y OR z ]) are all synonyms for the other terms within that list.  Thus, a b c are synonyms for each other, while w x y z form a second, separate set of synonyms.
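The grouping rule can be sketched in a few lines of Python. This is a toy reading of the rule described above, not Google's parser; group_or_terms is a hypothetical helper, and treating only uppercase OR as the operator is an assumption of the sketch.

```python
import shlex

def group_or_terms(query):
    """Split a query into synonym groups: terms chained by OR share a group."""
    tokens = shlex.split(query)  # keeps a "quoted phrase" as one token
    groups = []
    join_next = False
    for tok in tokens:
        if tok == "OR":
            # The next term joins the previous group (if there is one).
            join_next = bool(groups)
            continue
        if join_next:
            groups[-1].append(tok)
        else:
            groups.append([tok])
        join_next = False
    return groups

print(group_or_terms("a OR b OR c w OR x OR y OR z"))
print(group_or_terms('"mountain lion" OR cougar habitat'))
```

The first call yields two groups, a b c and w x y z, with no parentheses needed. The second puts "mountain lion" and cougar in one group and leaves habitat on its own.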

Realistically, I use ORs when I have particular synonyms in mind that I want Google to use.  For example: [ "high pile" OR fleece OR polarfleece jacket ] asks for 3 different synonyms for the same concept... the ones I want Google to use.  

What if I don't use OR?  

Then you're implicitly ANDing the search terms together.  Except it's not really an AND.   

So then, what is AND?  

For Google, AND is basically a no-op.  That is, it's just another word that you can search for--it doesn't affect the way the query is handled at all.  You can see this for yourself.  Compare the differences in the results of these queries in the image below: 

#1 [ screening injury ]   #2 [ screening and injury ]   #3 [ screening AND injury ]

You can see that #1 (with no AND or and) is searching for documents that have both of the terms in them.  That's an implicit AND.  If it were a Boolean AND, then both terms would HAVE to be in the document.  The thing is, for other searches (say, [ xeric redemption plangent VXII ] ), you'll get pages that may not have all of the search terms on them, but might have synonyms or other variants of the terms.  If you want "Verbatim Search," you can get that (see my post on Verbatim).
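The difference between a strict Boolean AND and the softer implicit AND can be shown with a toy Python model. To be clear, this is not Google's ranking; the scoring here (count how many query terms a document contains) is an assumption chosen purely to show the contrast, and the sample documents are invented.

```python
def boolean_and(docs, terms):
    """Strict Boolean AND: keep only documents containing every term."""
    return [d for d in docs if all(t in d.lower().split() for t in terms)]

def soft_rank(docs, terms):
    """Toy 'implicit AND': rank by how many terms match, drop zero matches."""
    scored = [(sum(t in d.lower().split() for t in terms), d) for d in docs]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable sort
    return [d for score, d in ranked if score > 0]

docs = [
    "screening for injury in athletes",
    "injury prevention tips",
    "movie screening tonight",
    "gardening notes",
]
terms = ["screening", "injury"]

print(boolean_and(docs, terms))
print(soft_rank(docs, terms))
```

The strict version returns only the one document that has both terms. The soft version puts that document first but still returns the two partial matches after it.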

And if you compare results #2 and #3 above, you'll see that the term 'and' is just another search term in the query.  That's why it's bolded in the 3rd result of panels #2 and #3.  

Make sense?   

To summarize:  OR gives you specific control over the synonyms that are being searched for; everything else is implicitly ANDed together.  Google will try its best to find documents that have all of the search terms in your query, but it will try synonyms and spell-corrections in an effort to do what you really meant (but only after everything else has failed).  

I don't know about you, but this "trying other queries after everything else has failed" approach has saved me on multiple occasions.  Google's synonymization is pretty extensive; it's part of what makes the search results so robust.  

Search on! 

1 comment:

  1. As you note, the most important thing is that terms implicitly ANDed together don't require that both of the terms show up in the search, but rather simply weight the search based on both of the terms. AFAIK, there isn't a way of searching that actually _requires_ multiple terms. That said, it's somewhat rare to want such a search (the last time I recall wanting it was on some prior art patent research, where conjunction of concepts was important), whereas the "give me the best from off of these two terms" search I want probably three times a day.