Tuesday, March 27, 2012

A do-over on Languages above and below the equator

As several readers pointed out in the comments, my analysis didn’t REALLY answer the question I’d asked.

They were correct.  Thanks for picking up on this. 

I found the languages by COUNTRY, then added up ALL the languages spoken in the country.  That’s not what I asked for!  I wanted the TOTAL number of unique languages spoken in each hemisphere. 

Okay.  Let’s do this again.  Better this time. 

This time I pulled the list of languages from the CIA World Factbook page on languages of the world.

I copied it from the web page and pasted it as plain text into a .TXT file.  This gave me a simple spreadsheet with country names in the left column and columns of languages to the right. 
From CIA Fact Book (see link above)
HOWEVER… I now had to do some data cleaning.  Many of the language entries had comments (e.g., percentages of speakers of that language in the country) scattered throughout the listing.  They’re useful for readers of the language list, but not helpful for the data analyst.

So I spent about 45 minutes cleaning up the data: removing percentages and deciding which small languages to keep. (Mostly I didn’t keep them; I just lumped them all together as “other.” Sorry, speakers of regional dialects.) I also tried to canonicalize all of the variant spellings of different African languages, but I’m sure I didn’t get them all right.  (So there’s probably a slight overcount of African languages.) 
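For anyone who would rather script part of this cleanup than do it by hand, here is a minimal Python sketch. It is not what I actually did (I cleaned the data manually in a spreadsheet). The function name `clean_languages` and the sample entry are made up for illustration, and it assumes the comments show up as parenthetical notes or percentages, with languages separated by commas or semicolons. It does not attempt the harder jobs of merging variant spellings or deciding what counts as "other."

```python
import re

def clean_languages(raw):
    """Split one raw language entry into a list of bare language names."""
    # Drop parenthetical comments such as "(official)".
    text = re.sub(r"\([^)]*\)", "", raw)
    # Drop percentages such as "60.7%" or "10 %".
    text = re.sub(r"\d+(?:\.\d+)?\s*%", "", text)
    names = []
    for part in re.split(r"[,;]", text):
        name = " ".join(part.split())  # collapse stray whitespace
        if name and name not in names:
            names.append(name)
    return names

# Hypothetical entry in the style of the raw listing (numbers are invented).
print(clean_languages("Spanish (official) 60.7%, Quechua 21.2%, Aymara 14.6%"))
```

A script like this only handles the mechanical part; the judgment calls still need a human pass.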

After about 1 hour total, I had a spreadsheet of countries with a list of languages for each.  (I’ve shared these spreadsheets with you so you can see what I did.) 

I turned this into a Fusion Table of countries and languages spoken (so I could fuse it with my already existing Fusion Table, Country-LatLong).

Then I exported THAT Fusion Table as a regular spreadsheet so I could easily select the rows that are below the equator (that is, rows whose latitude has an “S” in it). 
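The merge-then-select step can be sketched in a few lines of Python, as a rough stand-in for the Fusion Tables merge rather than a description of how that tool works. The names `hemisphere`, `join_tables`, and `southern_rows` are hypothetical, the two dictionaries are tiny illustrative samples, and the sketch assumes latitudes are strings that end in "N" or "S".

```python
def hemisphere(latitude):
    """Return 'S' if the latitude string contains an S, otherwise 'N'."""
    return "S" if "S" in latitude.upper() else "N"

def join_tables(languages_by_country, latitude_by_country):
    """Join the two tables on country name, tagging each row with a hemisphere."""
    rows = []
    for country, languages in languages_by_country.items():
        latitude = latitude_by_country.get(country)
        if latitude is None:
            continue  # no coordinates for this country, so skip it
        rows.append((country, latitude, hemisphere(latitude), languages))
    return rows

def southern_rows(rows):
    """Keep only the rows that lie below the equator."""
    return [row for row in rows if row[2] == "S"]

# Illustrative sample data only, not the full tables.
languages = {"Bolivia": ["Spanish", "Quechua", "Aymara"], "France": ["French"]}
latitudes = {"Bolivia": "17 00 S", "France": "46 00 N"}
print(southern_rows(join_tables(languages, latitudes)))
```

Note that this assigns each country to one hemisphere based on a single latitude, which is the same simplification the spreadsheet approach makes for countries that straddle the equator.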

Now we’re getting someplace!

Next I sorted by latitude, then copied all of the languages BELOW the equator into a simple text file (which I then imported into a spreadsheet) so I could remove duplicates easily with a spreadsheet function.  In Google Spreadsheets, that’s the UNIQUE function.  This step gave me the Master List of Southern Languages.

Did the same thing with the languages ABOVE the equator and created a spreadsheet Master List of Northern Languages.  

Once again I used the unique function to identify the unique languages. 
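If you don’t have a spreadsheet handy, the de-duplication step is easy to approximate in Python. This is only a sketch: the helper name `unique` and the sample list are made up, and stripping stray whitespace before comparing is an extra assumption of mine, not something the spreadsheet function is being claimed to do.

```python
def unique(values):
    """Return the distinct non-empty values, in order of first appearance."""
    cleaned = (value.strip() for value in values)
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    return list(dict.fromkeys(value for value in cleaned if value))

# Illustrative sample, not the real master list.
southern = ["Spanish", "Portuguese", "Spanish", "Quechua ", "Quechua"]
southern_unique = unique(southern)
print(southern_unique, len(southern_unique))
```

The count for each hemisphere is then just the length of its de-duplicated list. Like the spreadsheet version, this treats two spellings of the same language as two different languages, so the earlier spelling cleanup still matters.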

Now I can read off the numbers:  the Southern Hemisphere has 90 unique languages, while the Northern Hemisphere has 230 unique languages. 

Lesson learned:  Doing these kinds of analyses can sometimes be tricky, even when you write the question!


  1. Daniel

    I appreciated the do-over, but I am still unclear about the original question. In the original post, it is phrased as:

    A big part of answering questions like this is making your terms very clear. For this question, let's consider only Official languages (that is, languages recognized by the country) and not worry about relative sizes of speakers. We're just interested in whether or not countries below the Equator have more languages than those above the Equator.

    In that case, does % of speakers matter? What if you run into a situation where more people speak one language, but it is not recognized by the state or country?

    For example, 20% of people in America speak Spanish, but it is not recognized within the 50 US states.

  2. As I mentioned, I'm only interested in the number of "OFFICIAL" languages. (You might be interested in something different, depending on your research goals.) But no, % of speakers doesn't matter. That would be interesting, but that's a slightly different bit of research.