Friday, April 18, 2014

Answer: What was the major news story?

As mentioned, this wasn't an especially easy or simple Challenge. Here are the questions, and my solution.  

1.  What was the biggest news story of something that happened in a particular part of Andhra Pradesh during the past 100 years?   (As measured by coverage in the international English language press. For people who want to use other languages, you should find the same result, but I don't know how to solve this in anything but English.)  
2.  In that same year, what was the biggest audio technology story? 
3.  Again, in that same year, what was the biggest story (same basis as before) about events going on in Germany?  

Let's break this first question down.

What and Where is Andhra Pradesh?

If you do the obvious search, you'll quickly learn that Andhra Pradesh (AP) was formed in 1956, when the Telugu-speaking part of the former princely state of Hyderabad was merged with Andhra State.  AP is now composed of the regions of Coastal Andhra, Telangana and Rayalaseema, and it surrounds the small enclave of Yanam (which belongs to Puducherry). The city of Hyderabad (formerly the capital of the princely state of Hyderabad) is now the capital of AP.



So finding the "biggest news story" that happened in a particular part of AP means checking news stories in AP since 1956 AND checking news stories about Hyderabad, Coastal Andhra, Telangana, and Rayalaseema as well.

Now... How can we check for "biggest news story" about a particular topic over the past 100 years?

It's pretty clear that (a) we need an archive of news stories, and (b) we need some way to count the frequency of news stories on a topic over a given time period.

Think about it this way:  What would be the ideal tool to answer a question like this?  When I thought about it, I realized that what I'd like is some way to create a histogram of the "number of news stories over the past century on a given topic."  Make sense?  How are we going to find that?
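To make the "ideal tool" concrete, here is a minimal sketch of that histogram in Python. It assumes you already have an archive as a list of (date, text) pairs; the three-article archive and the function name `stories_per_year` are made up for illustration, and a real archive would need much smarter matching than a substring test.

```python
from collections import Counter
from datetime import date


def stories_per_year(articles, terms):
    """Count articles per year whose text mentions any of the search terms."""
    lowered = [t.lower() for t in terms]
    counts = Counter()
    for published, text in articles:
        body = text.lower()
        if any(term in body for term in lowered):
            counts[published.year] += 1
    return counts


# Hypothetical three-article archive with invented headlines, for illustration only.
archive = [
    (date(1948, 9, 14), "Indian troops enter Hyderabad"),
    (date(1948, 9, 18), "Hyderabad forces surrender"),
    (date(1956, 11, 1), "Andhra Pradesh state is created"),
]

histogram = stories_per_year(archive, ["Hyderabad", "Andhra Pradesh"])
for year in sorted(histogram):
    # One '#' per matching story gives a crude text histogram.
    print(year, "#" * histogram[year])
```

Counting by year is the whole trick: once the counts exist, the "biggest story" is just the tallest bar.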

I remembered that once upon a time, Google News had this capability, but it's long gone.  There are various APIs to let one write code to access archival content (such as the NYTimes Article Search API, but that archive only goes back to 1981).

There are a number of newspaper archives out there, and the first one I checked, Newspapers.com, just happened to show a histogram feature to indicate the number of hits on a topic over the past century. This is exactly what we need to answer this question. 

Here, I've done the obvious query about AP.  And it's easy to see that Andhra Pradesh didn't register at all before it was founded in 1956.  Notice that the UI shows the number of hits in red above the histogram.  






By doing several queries:

  [ Andhra Pradesh ] 
  [ Hyderabad ] 
  [ Telangana ] 
  [ Rayalaseema ] 

and so on, it didn't take me more than a minute to discover that the major event in the region of AP was the annexation of the state of Hyderabad by India.  That event, called "Operation Polo" (September 1948), was the takeover of the princely state of Hyderabad and its incorporation into the Indian Union. Clicking through to the news articles made it clear that this generated a LOT of news coverage, as you can see by searching just for Hyderabad and zooming in on 1948.  



As an event on the world stage, it was fairly major, causing a great deal of news coverage at the time.  (It was, after all, the overthrow of one of the world's last principalities, and merged 18 million people into India. During the fighting, roughly 30,000 people were killed.)



Now we have the answer:  The biggest news story in the AP region was the annexation of Hyderabad into India, with at least 18,746 stories.  Even the creation of Andhra Pradesh state in 1956 didn't generate quite as many news stories (at least 1,000 fewer).  
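The comparison across queries can be sketched in a few lines of Python. This assumes you have copied the per-year hit counts for each query into a dictionary by hand; the numbers below are small placeholders, NOT the real Newspapers.com figures, and `peak` is a hypothetical helper name.

```python
def peak(hits_by_year):
    """Return (year, hits) for the year with the most hits."""
    year = max(hits_by_year, key=hits_by_year.get)
    return year, hits_by_year[year]


# Placeholder counts for illustration only, not real archive data.
hits = {
    "Andhra Pradesh": {1948: 0, 1956: 30, 1977: 12},
    "Hyderabad": {1948: 40, 1956: 9, 1977: 3},
    "Telangana": {1948: 2, 1956: 4, 1977: 1},
}

# Find each query's tallest bar, then the tallest bar overall.
peaks = {term: peak(by_year) for term, by_year in hits.items()}
best_term = max(peaks, key=lambda term: peaks[term][1])
best_year, best_hits = peaks[best_term]
print(f"Biggest spike: [{best_term}] in {best_year} ({best_hits} hits)")
```

Note that the sketch compares per-query peaks rather than summing the queries: a single article can mention both Hyderabad and Telangana, so adding the counts together would count it twice.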

Next question?  Now that we know the year (1948), the next two questions are relatively simple.  We could do more searching in the newspaper archive.  BUT... there's a simpler search: 

     [ major news stories 1948 ] 

will bring up a large number of "What Happened in the Year 1948?" sites.  Scanning through several of them, you'll quickly discover that they agree: the 33⅓ rpm LP (long-playing) record format was announced in 1948 by Columbia Records.  Reading through some of that content revealed that this was a major advance over the previous format (78 rpm records), with a much longer play time (20 minutes / side) and an improved signal-to-noise ratio.  

Doing the same kind of search for Germany in the "Top Stories of 1948" shows that, from the English-language perspective, the biggest story was the Berlin Blockade and the subsequent Berlin Airlift, which provided essentials to the city during the blockade by the Soviet Union.  Airplanes and crews from the United States Air Force, British Royal Air Force, Royal Canadian Air Force, Royal Australian Air Force, Royal New Zealand Air Force, and the South African Air Force flew over 200,000 flights in one year, delivering up to 4,700 tons of necessities each day to the beleaguered city.   This story dominated headlines and coverage for much of the year.  


Search Lessons:  In this case, the hard part was figuring out HOW to measure "biggest news coverage."  In my case, I wasn't sure how to do this, but I started with what I already knew--to wit, that there were news archives.  By doing a bit of exploring there, I found that at least one of them (Newspapers.com) offered exactly the tool I needed--the histogram of hits by year.  Lesson:  Even when you don't think you know how to solve the problem, at least go visit the content--you never know what will turn up.  

Also remember that there are LOTS of people with deep interests in history out there on the web.  Although the quality of web content sometimes varies quite a bit, they DO provide useful aggregations of information... including "top stories" by year.  When in doubt, consider searching for those aggregations and visit several of them, looking for agreement (repetitions) about what were the top stories in any given time period.  
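"Looking for agreement" is easy to sketch in Python as well. This assumes you have typed up the top-story lists from a few "what happened in 1948" sites; the three lists below are hypothetical stand-ins rather than copies of any real site, and `agreement` is an illustrative name.

```python
from collections import Counter


def agreement(lists):
    """Count how many lists mention each story (each list votes at most once)."""
    votes = Counter()
    for stories in lists:
        # Normalize case and whitespace, and use a set so repeats within
        # one list do not inflate that list's vote.
        votes.update({s.strip().lower() for s in stories})
    return votes


# Hypothetical lists for illustration; not taken from any particular site.
site_lists = [
    ["Berlin Airlift", "LP record introduced", "Gandhi assassinated"],
    ["Berlin airlift", "Gandhi assassinated", "Truman wins election"],
    ["LP record introduced", "Berlin Airlift"],
]

votes = agreement(site_lists)
for story, n in votes.most_common():
    print(f"{n}/{len(site_lists)}  {story}")
```

Real lists will phrase the same event in different ways ("Berlin Blockade" vs. "Berlin Airlift"), so in practice you'd still need to group the variants by eye before counting.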

Finally, remember that many things change names over time. The state of AP didn't even exist before 1956.  So checking for news on the state of AP really meant checking for several previously existing regions (with totally different names) that were there before.  This continues in more recent times: You won't find any news stories about the country of Eritrea before 1993.  (There are stories about the region, but not the country.  You've got to be careful...)   
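One way to keep yourself honest about name changes is to write them down before searching. Here is a small Python sketch, assuming a hand-built lookup table; the two dates and the AP region names come from this post, while the table layout and the `queries_for` helper are hypothetical.

```python
# Year each name came into use (dates as given in this post).
NAME_SINCE = {"Andhra Pradesh": 1956, "Eritrea": 1993}

# Earlier or component names to search as well (AP regions from this post).
EARLIER_NAMES = {
    "Andhra Pradesh": ["Hyderabad", "Coastal Andhra", "Telangana", "Rayalaseema"],
}


def queries_for(place, start_year):
    """Return the search terms needed to cover `place` back to `start_year`."""
    terms = [place]
    # If the time window starts before the name existed, add the older names.
    if start_year < NAME_SINCE.get(place, start_year):
        terms += EARLIER_NAMES.get(place, [])
    return terms


print(queries_for("Andhra Pradesh", 1914))
print(queries_for("Andhra Pradesh", 1960))
```

For a 100-year window the helper returns all five queries; for a window that starts after 1956 it returns only the current name.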

Search on! 



15 comments:

  1. Hello Dr. Russell and everyone!

    Thanks for the Coordinates post. It is very helpful.

    When I did this Challenge, finding the "correct" year was not so hard, because you gave us 3 questions and that narrowed the options.

    The key, for me, was the press coverage. I tried many things and honestly I couldn't think of how to find the measuring tool. I searched, as Sarah George mentioned, for something similar to the Ngram Viewer, and found nothing.

    I had already tried the site that you mention. I had some trouble finding the same results that you found, but finally I did it.

    I have new knowledge and I like that very much.

    Have a great weekend, and an excellent Easter!

  2. whew!… always a learning curve for me with sRs…regarding the tool…
    if you are feeling "charty or chart-ie" (in search of histogram meaning & use)… must confess, it makes my head ache…
    but Naomi explains histograms without making me a hysterical tool
    A Histogram is NOT a Bar Chart
    Comparing Distributions with Box Plots
    Naomi Robbins archive
    GNA postmortem
    a final notion/slant on the histogram:
    Western Union

    re: Berlin ✈ - "rhythm, on a beat as constant as a jungle drum."
    65 years ago, almost to the day… start vs highpoint dates…
    "The high point of the Berlin Airlift came on April 6, 1949, when the participating countries mounted what was known as the “Easter Parade.”
    as you scuba, may the rabbit not seek you below with Google colored eggs…
    soggy bunny

  3. Dr. Russell, one question. Was the query you used to find Newspapers.com [newspapers archives], or did you try others? If histograms weren't available on the site, what other way could we use to measure coverage?

    Thanks!

  4. Great challenge. It's one of those "once you know it seems so easy". Interesting that Ramón and Passager both got the right articles even if it was achieved by long hand. Well done. I had a look around to see what other sites might have on histograms and found this that looks more like art than data.

    http://www.smashingmagazine.com/2007/08/02/data-visualization-modern-approaches/

  5. Perhaps others like myself do a "post game analysis" to see where I went wrong. I have tried using keywords to get a search result showing newspapers.com and the only reference I've found is on Wikipedia "List_of_online_newspaper_archives".
    Searching Wikipedia I find this somewhat hidden under the Sub Heading Multistate near the bottom -
    "Newspapers.com owned by Ancestry.com (complete list of papers on site, poorly presented)Pay".
    I would have likely passed over this site based on the "poorly presented" and "pay" comment. I can't find the right keywords in Google Search so I certainly bookmarked it. Keywords?

    Replies
    1. Hi RoseMary. I found it with just the word [newspapers]. I'd like to know more keywords, and also what other ways could help to measure media coverage if, for example, the site didn't show the histogram.

      Searching for keywords, found:

      [past newspapers]: "How to Find Old Newspaper Articles Online" by Amit Agarwal gives some links about this.

    2. Thanks Ramón, I can't believe I didn't try just newspapers. I was adding keywords such as "historical, archive, timeline etc." Sometimes the answer is simple. How do you say it? Egg on my face!

      I have been unsuccessful finding keywords or websites that provide us with such a simple solution as Dr. Russell's. It may be that one doesn't exist. I did find one website that may be a gateway to producing a histogram or dataset, but when I tried running it on my Chromebook it didn't work. See what you think. I like to find more than one source. I also like to create bread crumbs so that, if needed in the future, I can retrace my steps. It would be nice if we could automate the steps as well.

      http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

      http://acube.di.unipi.it/tmn-dataset/

    3. Rosemary - Thanks for the pointer to the AG_corpus. I thought about that a bit, but I noticed that the catalog of included newspapers was a 404 (file not found), so I more-or-less gave up at that point.

      But your idea is a good one. Worth pursuing (if I had the time). Does anyone else want to do a bit of coding to create a "count topics by date" function over a decent news corpus?

      -- Dan

    4. Ramón - I hear Mexico had a big earthquake which was felt in Mexico City. I hope there are no injuries and no major aftershocks. Be well this Easter weekend.

      Remmij. Thanks for the info on Vinyl Records day ( I didn't know about it); maybe my vinyl collection is worth something.

      Dr. Dan - Our group could contribute by putting together such a catalog. Easy for me to say, but the only coding I ever did was with punch cards. If you don't know what that is, don't worry; it's from the last century. Sounds worthwhile & we could make it available to other researchers.

  6. My post game analysis:

    I laboured under the impression that "the event" occurred in AP. But it did not. It was in Hyderabad which is NOW in AP.

    Now, curiously I made another goofy booboo which could have turned out with me getting 2 of 3 answers. I wrote out my answer prior to sending it in. And as usual I checked it over to make sure I had enough truthiness in it. Well, my discovery which I attributed to 1948 (!) was for something that happened in New Delhi. But I had carried on and easily got the last two answers correctly (!) So I did not submit it.

    I could not resolve my confusion which I perceived in the Challenge.

    Never thought of Histogram.

    Now, Newspapers.com: I cannot get into that without leaving a credit card #. Am I also confused in that I have thought we were only using Free stuff for solving the Challenges? I know the site says it's a free 7 days, but if it's as tricky to cancel as with other sites, well, I don't want to find out.

    CHeers

    jon

    Replies
    1. jon - Your analysis is correct. We all have to be careful about the assumptions we make about countryhood (AP didn't even exist for much of the last 100 years). Just as a cautionary note: There are other Search Challenges in the queue that have this property!

      WRT Newspapers.com, I forgot that it was a paywall site. I have a subscription that I set up a while ago, so I just didn't even notice it when I clicked through. Good point.

      On the other hand, we shouldn't restrict ourselves completely to free sites. In general, I like the idea of open/free access, but not everything is that way. Until that glorious day, we'll keep doing the best we can.

      (And it's worth noting that many public libraries have a subscription to Newspapers.com (at least in the US). You can go there to get free access. I know my local public library does this.)

      -- Dan

    2. Hi Dr. Russell and Jon. Good day to all and Happy Easter.

      Yes, Newspapers.com is a paywall site. However, the data that you posted, Dr. Russell, is free.

      I had problems finding the information too, Jon. This is how I did it.

      Newspapers.com. Then Browse, finally, search for the data.

      Here is the result with Andhra Pradesh: 1,736 matches
