Tuesday, December 15, 2015

Answer: How often do we write... this?

If you want to know about how often a word... 

... occurs in our language, you need to find some data--usually the bigger, the better.  

This week's Challenge is all about comparing how often people write about certain ideas.  How often do they mention LA vs. NYC vs. London?  How often do they write about different beverages?  And so on.

As I mentioned, once you have a great data set you can process it to find answers to questions like "is fly used more often as a verb, or a noun?"  
But before you do the analysis yourself, a great approach is to do what I always say:  When in doubt, search it out.  In this case, you'd save yourself a TON of time.  
I did a search for: 
     [ word frequency database ] 
where "frequency" is the special term that language people use to talk about how often a word is used.  (A "low frequency" word is one that's fairly rare--it is an infrequent word in common writing; that's a word like "peruse."  A "high frequency" word is one that's pretty common, such as "the" or "is" or "bank.") 
When you do this query you'll find a bunch of specialty databases about words.  There's the BYU Corpus.  With this resource you can choose from 440 million words of full-text data for Contemporary American English (190,000 texts), 385 million words from COHA (115,000 texts), or 1.8 billion words for Global Web English (1,800,000 texts). You can either download the data ($), or use the web interface to do analysis. The data comes in three formats: relational database, word/lemma/PoS (vertical format), or text (linear format).
There are others, but the obvious one for our Challenge is the Google NGRAM viewer.  This is a database and web-interface to query the dataset.  You can read lots of details about the NGRAM viewer, including the resources it is derived from.  
Now that we have a tool to help out, let's try to answer these Challenges.  (Remember that you can click on the images to see them at full size.)  

1. When people write about world cities, which do they write about most often?  Los Angeles, London, Berlin, or Beijing? 
In NGRAMs this is pretty straightforward: you just enter these terms, separated by commas, as your search:  
     [ Los Angeles, London, Berlin, Beijing ] 
It's clear from this chart that London outscores the others.  (Notice that I didn't include New York in this comparison set.  Why?  Because you can't tell the difference between "New York" as a reference to the city and "New York" as a reference to the state.)  Just for comparison purposes, here's that set of cities WITH New York, where New York takes over from London right around 1910, just as it becomes an important commercial and cultural center.  
(I'm willing to bet that most of those references are to New York as a city, but I don't know how to automate it.  If I really wanted to know, I'd just take a sample set and then break up the task of reading the text snippets and classifying them as "city reference" or "state reference" with Mechanical Turk.  But that's a different blogpost.)   
2. If you look at what people write about what they drink (as a beverage), what do they write about?  (Water? Wine? Beer? Coffee? Root beer?)  Which is the most commonly written-about beverage? 
Let's do this one in the obvious way.  
The slightly non-obvious thing I've done here is to click on the "coffee" label on the right to highlight the orange line for coffee.  If you toggle back and forth between wine and coffee, you'll see that coffee comes right up to wine in frequency starting around 1920, but it never quite becomes more frequent.  
For SearchResearchers, it's worth noticing that at the bottom of the chart is a small instruction line that says:  "(click on line/label for focus)"
In general, sophisticated researchers pay attention to stuff like this.  There's often great capability hidden in the user-interface.  For instance, here are some additional features that you should notice:  
1. Note that you can change the date over which the search is run.  Here I've zoomed into the time between 1910 and 1950 to get a better view of the data about "wine" and "coffee."  (It's true, wine is ALWAYS more commonly written about than coffee.)  
2. This is a pull-down menu that lets you change which corpus is being used.  You can choose different data: American English, British English, German, Russian, Chinese, etc.  
3. You can change the "smoothing" (that is, how many data points in a row does it average together to create a smooth graph--without the smoothing, the graph sometimes looks pretty choppy). 
4. "case-insensitive" You might care if the terms are capitalized or not. For these terms, we wanted the case to NOT matter, so I clicked it off. 
5. If you move your mouse over the chart, you'll see a pop-up that gives year-by-year detailed data.  
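
To make the "smoothing" setting in item 3 concrete, here is a minimal sketch of a centered moving average.  It assumes that a smoothing of n averages each year with up to n neighbors on either side; the way the edges are handled (using whatever neighbors exist) is my assumption, and the function name smooth is hypothetical, not part of any NGRAM tool.

```python
def smooth(values, smoothing):
    """Centered moving average over a list of yearly frequencies.

    A smoothing of n averages each point with up to n neighbors on each
    side. Near the edges, only the neighbors that exist are used
    (an assumption made for this sketch).
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)                 # first index in the window
        hi = min(len(values), i + smoothing + 1)   # one past the last index
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up numbers, just to show the effect of the setting.
raw = [1, 2, 3, 4, 5]
print(smooth(raw, 0))  # no smoothing: each value is its own average
print(smooth(raw, 1))  # each value averaged with one neighbor per side
```

With a smoothing of 0 you get the raw (choppy) data back; larger values trade detail for a steadier line.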

3. Is the word "fly" used more often as a noun, or as a verb?  
Now how can we do this using NGRAMs?  It's really not obvious.  If we can't tell New York city apart from New York state, how can we figure out "fly" as a verb vs. "fly" as a noun?  
The fact that I'm even asking this question is a big hint to you that it's probably possible!  
Since there's no obvious way to do this with the NGRAM UI, you'll have to search a bit deeper.  This time, we need to do: 
     [ Google ngram advanced search ] 
to find the keys to this particular kingdom.  Sure enough we find the advanced search page, which tells us that we can modify the query terms to indicate which word sense to use (such as "verb" "noun" "adjective" etc.).  All together, NGRAMs lets you do wildcard search, inflection search, case insensitive search, part-of-speech tags and ngram compositions.  
Here we just care about verb vs. noun.  
The documentation tells us that we just change the search term by appending the category name, like this: 
     [ fly_VERB, fly_NOUN ] 

I'll leave it to your imagination about why we talk about the act of flying more than we talk about a "fly" as a noun.  That's also another blog post.  (Think also about why there's that big hump in "fly_VERB" in the 1940s.)  
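
If you want to compare several words this way, the tagged search terms can be built mechanically.  This is a small sketch, not an official API: the helper names tag_term and build_query are hypothetical, and I've only included three part-of-speech tags (NOUN, VERB, ADJ) to keep it short.

```python
# Part-of-speech suffixes used in this sketch (a deliberately short list).
KNOWN_TAGS = {"NOUN", "VERB", "ADJ"}


def tag_term(word, pos):
    """Append a part-of-speech tag to a word, e.g. fly -> fly_VERB."""
    pos = pos.upper()
    if pos not in KNOWN_TAGS:
        raise ValueError(f"unsupported tag in this sketch: {pos}")
    return f"{word}_{pos}"


def build_query(word, tags):
    """Join the tagged forms with commas, ready to paste into the search box."""
    return ",".join(tag_term(word, pos) for pos in tags)


print(build_query("fly", ["verb", "noun"]))
```

The resulting string is what you would paste into the NGRAM search box by hand.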

4. Speaking of polysemous words (words with more than one meaning), can you find any words that USED to be used more frequently as nouns, that are now usually used as verbs?  (Or vice-versa? Words that were once verbs, but are now thought of as nouns?)  
Unfortunately, there's no way (that I know of) to find this out without writing a program to test LOTS of verb/noun pairs.  Here's one I found with a fascinating cross-over.  

Ramon found that "now" has switched relatively recently from verb to noun (although I rather suspect this tagging--in the phrase "where to now?" the "now" would be categorized as a verb, when this is in fact a sentence fragment).  
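
The kind of program mentioned above, one that tests verb/noun pairs for a cross-over, could be sketched like this.  It assumes you already have yearly frequencies for the noun and verb senses of a word (for example, from the downloadable raw data); the function name find_crossovers is hypothetical and the numbers are made up for illustration.

```python
def find_crossovers(years, noun_freq, verb_freq):
    """Return (year, new_leader) for each year where the more frequent
    sense switches between noun and verb. Years where the two are tied
    are skipped."""
    crossovers = []
    prev_sign = 0  # +1: noun ahead, -1: verb ahead, 0: nothing seen yet
    for year, noun, verb in zip(years, noun_freq, verb_freq):
        diff = noun - verb
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue  # a tie does not change the leader
        if prev_sign != 0 and sign != prev_sign:
            crossovers.append((year, "noun" if sign > 0 else "verb"))
        prev_sign = sign
    return crossovers


# Made-up frequencies for an imaginary word, NOT real NGRAM data.
years = [1900, 1920, 1940, 1960]
noun = [5, 4, 2, 1]
verb = [1, 2, 3, 4]
print(find_crossovers(years, noun, verb))
```

Run over a long list of words, a check like this would flag candidates to look at by hand, which matters because (as with "now") the tagging itself can be suspect.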
As Rosemary found in a Forbes article about large data sets, there are lots of issues with using any large data set.  (I recommend that article for people interested in the details of how you use multiple different data resources to cross-check and validate your findings.)  Among other things, that article points to the New York Times NGRAM corpus as drawn from their newspaper articles over the past 150 years.  Check out their own NGRAM chart drawing tool.  
As an example, here's a contrasting graph between the NYTimes and Google NGRAMs for the word "percent": 

I tried to align the years as closely as I could (by hand).  You can see that the word "percent" wasn't very common pre-1920 in English books, and shows an incredible rise in the press around 1975.  (In fact, that rise is so sudden, that if I were writing an article about this, I'd try to figure out why "percent" was being used in books in the 40s and 50s, but not in the press.  What changed to make such a dramatic rise?)  

Search Lessons 

There are a few that I hadn't planned on for this Challenge, but ones that I welcome!  
1. Search for the tool / database before trying to do this on your own.  SearchResearchers know this, but it's the kind of thing that is often overlooked by students.  Ten minutes of Google search can save you weeks of writing code to do it on your own.  (Note:  Sometimes writing your own code is exactly the right thing to do, especially if you want to replicate a study or gain a deeper understanding of the data yourself.)  
2. Pay attention to the UI.  Learn to scan the interface looking for things it can do for you.  You might not need all of that capability right now, but if you see it once, you stand an excellent chance of remembering that when you need it most farther down the road.  
3. Sometimes you need to search for advanced features.  You had a clue that you could search for multiple word senses (verbs vs. nouns vs...), but how could you figure it out?  Searching for [ application advanced search ] or [ application advanced use ] will often lead you to the documentation you really need.  
4. Question what you find. Always always always keep a skeptical mind about what you discover.  This is true for everything, including this blog post!  But in particular, if you can, try to cross-validate your findings.  And if you find an extraordinary bit of data, remember that it probably needs an extraordinary argument.  

Search on! 


  1. Good Morning, Dr. Russell.

    This Challenge has lots of lessons and many questions and, as you said, we can invest a lot of time finding words and results.

    [ word frequency database ] is a great query. I decided to try Ngram because you wrote " in a large collection of written English text". That was a big clue.

    What do you mean by "Mechanical Turk"? I tried [define Mechanical Turk]. Wikipedia gives two possible answers, but I'd better ask you.

    In Q2, I found that clicking "water" divided it into many subgroups (I don't know how to do this again). In this case, is water counted as a beverage because Ngram looks at context, or is water like New York state vs. city, with all kinds of water lumped together?

    In the additional features, at the bottom it says: "Run your own experiment! Raw data is available for download." That sounds interesting and complicated.

    Thanks, Dr. Russell. A new SearchResearch tool for us.

  2. Interesting post, thank you!

    For #1, I wonder if London dominates because so many English-language books and other texts were published there, and therefore London appears on the books' title pages with the publisher's name.

    For #2, can the search be limited to those liquids *as* beverages, not as topics of industry or nutritional research (or coffee beans or water buffalo or beer fries)?

    It's really hard to nail down the context of the usage of a word.