Tuesday, August 23, 2011

Control-F and other tools for reading online

I seem to have caused a minor disturbance in the Force with my comments to Alexis Madrigal last week about search.  


We were having a conversation about how I go about doing my research at Google (especially "search anthropology") when I mentioned the results of my Control-F study.  The key result (which I wrote about a while ago in SearchResearch) is that around 90% of all English-speaking US internet users do NOT know how to find a specified string on a web page using anything other than visual search.  That is, they don't know about the Edit > Find menu command, or how to use Control-F (or CMD-F)... or any other means to determine that the word (or string) is there.  


In other words, 90% of internet searchers don't know how to jump to the location of a desired string on a web page.  What's more, they also cannot prove that a particular string does NOT appear on a web page.  


This is important because this one tool--Control-F, find--changes the way you read long documents. 


If you're searching for any occurrence of a word (say, "iceberg") in a web document, you can quickly find out where in the document that word appears, and how often.  
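

For readers who like to see the mechanics, here is a minimal sketch of what a find command does under the hood, assuming the page has already been reduced to plain text. The function name `find_all` and the sample sentence are made up for illustration; real browsers do more (highlighting, scrolling, whole-word options).

```python
import re

def find_all(text: str, term: str) -> list[int]:
    """Return the character offset of every case-insensitive match of `term`."""
    # re.escape treats the term as a literal string, not a regex pattern.
    return [m.start() for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE)]

# Hypothetical sample text, not taken from the article discussed here.
sample = "An iceberg calved. The Iceberg drifted north; no other icebergs followed."

hits = find_all(sample, "iceberg")
print(len(hits), hits)              # 3 [3, 23, 55]
print(find_all(sample, "tsunami"))  # [] -> the term does not appear at all
```

An empty result is the useful part: it is how you show that a string does NOT appear in the text.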


Suppose I were reading an interesting article about the effect of tsunamis on iceberg creation.  (Apparently when the recent Japanese tsunami hit the Antarctic ice shelf, some massive icebergs were made as a side effect!)  Here's a good article on this:  http://www.igsoc.org/journal/current/205/j11j073.pdf  snapshotted at the moment when I did a Control-F find for "tsunami."  


This is the Chrome browser, which has a pretty nice feature in its Find function.  Note the yellow occurrence lines in the scroll bar.  They show you where the hits for your find term are in the body of the document.  It's easy to see here that there are a bunch of hits:  46 to be exact, and the find function has highlighted number 19.  


Looking at the pattern of hits can tell you a good deal about the document.  Is it rich with hits?  Do the hits all congregate just at the beginning or the end of the document?  (Often the case when you're searching for an author's name, or key words that are used only in introductory text.)  
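

The scroll-bar marks amount to a histogram of hit positions. As a rough sketch (my own illustration, not how Chrome actually implements it), you can split the text into equal-sized slices and count the hits that start in each one. The name `hit_profile` and the `buckets` parameter are assumptions made for this example.

```python
import re

def hit_profile(text: str, term: str, buckets: int = 10) -> list[int]:
    """Count how many matches of `term` start in each equal-sized slice of `text`."""
    counts = [0] * buckets
    if not text:
        return counts
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        # Map the character offset to a slice index; clamp to the last slice.
        slice_index = min(m.start() * buckets // len(text), buckets - 1)
        counts[slice_index] += 1
    return counts

# Hypothetical document: a name that appears only at the very start and very end.
doc = "Smith " + "x " * 50 + "Smith"
print(hit_profile(doc, "smith", buckets=4))  # [1, 0, 0, 1]
```

A profile that is all zeros in the middle is the "author's name only at the beginning and end" pattern described above.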


I've written more about the Subtleties of finding text in a document (basically, choose the shortest unique substring).  


It's interesting to consider what the presence of a tool like Control-F does for our ability to read.  As mentioned, you can now prove that a given word doesn't appear in a document.  Handy when you're on a very long document and don't want to waste a lot of time.  You can discover relationships in long documents that are difficult to perceive otherwise.  


I remember reading Jurassic Park on my Mac Duo, back in the days when entire books could be purchased on floppy disks (Voyager, 1992)... 


Using their find function, I was able to see where place names in the text were used, and I found that Cabo Blanco was mentioned at the very opening, and at the very end... only.  Seemed to me like the perfect device for a sequel.  I'm not sure I would have noticed otherwise.  


That's the simplest case I can think of: Using a find command to see structure in the text that's otherwise very difficult to notice.  


More importantly, what other tools should we have, and how would they affect our ability to read?  


A few ideas spring to mind.  This is a short list of some of the reading functions I'd like to see in my editor/browser.  


1.  Concordance function.  Able to count words in the current selection and sort by frequency.  Would be handy to get a gist or summary of what language is being used. 


2.  Nearby repeated words flagger.  One of the most common errors *I* make as a writer is to write something, then return to that section and edit it, using nearly identical words.  It's a big oops when you read back through it.  I'd like my editor to automatically note when I'm using a low-frequency word within some distance of an earlier use of the same word. 


3.  Statistically improbable phrases.  Given a text (e.g., a long report or a magazine article), I want my browser to be able to highlight the phrases that are low-frequency with respect to other writings.  If someone writes a really wacky phrase in the middle of a text, I'd like to be able to see it highlighted.  (Idea derived from the SIP section of Amazon's UI.  Knowing that the phrase "tyrannosaur roared" is in the book tells me something interesting about the book.)  
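

The first two ideas on this list are simple enough to sketch in a few lines. What follows is only an illustration under stated assumptions: the function names, the `window` parameter, and the tiny `COMMON` word list are all made up here, and the word list is a crude stand-in for real word-frequency data.

```python
import re
from collections import Counter

# Lowercase words, allowing one internal apostrophe (e.g. "don't").
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")

# Assumption: a tiny hand-picked list standing in for "high-frequency words".
COMMON = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "was", "it"})

def tokenize(text: str) -> list[str]:
    return WORD.findall(text.lower())

def concordance(text: str, top: int = 10) -> list[tuple[str, int]]:
    """Idea 1: count words in a selection, sorted by frequency."""
    return Counter(tokenize(text)).most_common(top)

def nearby_repeats(text: str, window: int = 30) -> list[tuple[str, int, int]]:
    """Idea 2: flag uncommon words reused within `window` words of their last use."""
    last_seen: dict[str, int] = {}
    flagged = []
    for i, word in enumerate(tokenize(text)):
        if word in COMMON:
            continue
        prev = last_seen.get(word)
        if prev is not None and i - prev <= window:
            flagged.append((word, prev, i))  # (word, earlier index, later index)
        last_seen[word] = i
    return flagged

# Hypothetical sample text.
text = "The tsunami hit the shelf. The tsunami was massive."
print(concordance(text, top=2))         # [('the', 3), ('tsunami', 2)]
print(nearby_repeats(text, window=10))  # [('tsunami', 1, 6)]
```

The third idea needs a large reference corpus to compare against, so it doesn't reduce to a toy example as easily.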


What other kinds of reading-behavior tools would YOU like to see? 
Or, what other kinds of tools do you use NOW to help your online reading?  


Search on! 





4 comments:

  1. I like the idea of a 'concordance function' - the most used word can be highlighted on demand. I would also add an element of socialness in 'reading tools': how about the system automatically highlights those passages or sentences that most other people found to represent the 'gist' of what is being conveyed.

  2. I think there is an obvious solution to people not knowing how to search inside a web page: web browsers should have a search box for searching inside the page (just like almost every website nowadays contains a search box). It is a bit ironic that browsers make it easier to search the web than to search inside a page.

  3. Greg, I agree that most users don't use the find function because it's not in front of their face. However, they also tend to get confused when they see more than one text box ("address" bar and "find" box). I frequently teach CAD software to designers who wouldn't know how to search a page or document, and no matter how easy the program becomes they always need training.

    Dan, this is a great blog, I'm happy to have found it!

  4. I use command-F frequently (I have a Mac, not a PC, so using "control" keys is rare—perhaps the blog should not be so Windows-centric).

    I sometimes find it handy when a search highlights every occurrence of a word being searched for, but not when it highlights every word individually of a multi-word search.

    I doubt that I would find concordances very useful for web pages (and they are easy to create when needed from documents), but nearby repeated words would be a handy browser tool.

    Most often, I need more specific searching than the rather vague bag of words search that Google and Bing use. Even quoted phrases and negation don't help much. Try looking for info about "Santa Cruz" and only get California, for example.
