As several readers pointed out in the comments, my analysis
didn’t REALLY answer the question I’d asked.
They were correct.
Thanks for picking up on this.
I found the languages by COUNTRY, then added up ALL the languages
spoken in the country. That’s not what I
asked for! I wanted the TOTAL number of
unique languages spoken in each hemisphere.
Okay. Let’s do this
again. Better this time.
This time I pulled the list of languages from the CIA World Factbook page on languages of the world.
I copied it from the web page and pasted it as plain text
into a .TXT file. This gave me a simple spreadsheet
with country names in the left column and columns of languages to the
right.
From the CIA World Factbook (see link above)
HOWEVER… I now had to do some data cleaning. Many of the language entries had comments
(e.g., percentages of speakers of that language in the country) scattered
throughout the listing. They’re useful
for readers of the language list, but not helpful for the data analyst.
So I spent about 45 minutes cleaning up the data: removing
percentages, deciding which small languages to keep (I mostly didn’t; I just
called them all “other.” Sorry, speakers of regional dialects). I also tried
to canonicalize all of the variant spellings of different African languages,
but I’m sure I didn’t get them all right.
(So there’s a slight overcount of African languages.)
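For readers who would rather script this cleanup than do it by hand, here is a minimal Python sketch of the same idea. It is not what I actually did (I cleaned the data manually in a spreadsheet), and it assumes each country's languages arrive as one comma-separated string. The ALIASES table and the clean_row function are hypothetical names made up for illustration.

```python
import re

# Hypothetical alias table: variant spellings mapped to one canonical name.
# A real table would need many more entries.
ALIASES = {"kiswahili": "Swahili"}

def clean_row(raw: str) -> list[str]:
    """Turn one country's raw language listing into a list of clean names."""
    text = re.sub(r"\([^)]*\)", "", raw)           # drop parenthetical comments
    text = re.sub(r"\d+(?:\.\d+)?\s*%", "", text)  # drop percentages like "58.1%"
    # Split on commas and collapse any leftover whitespace in each entry.
    names = [" ".join(part.split()) for part in text.split(",")]
    # Skip empty entries and map known variants to their canonical spelling.
    return [ALIASES.get(name.lower(), name) for name in names if name]

print(clean_row("English 58.1%, Kiswahili (official, 10%), French 2%"))
```

The comments are stripped before splitting on commas so that a comma inside parentheses doesn't create a bogus language entry.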
After about 1 hour total, I had a spreadsheet of countries with a list of languages. (I’ve shared these
spreadsheets with you so you can see what I did.)
I turned this into a Fusion Table of countries and languages spoken (so I could fuse it with my
already existing Fusion Table,
Country-LatLong).
Then I exported THAT Fusion Table as a regular spreadsheet so
I could easily select the rows that are below the equator (that is, that have a
latitude with an “S” in it).
This gave me: https://docs.google.com/spreadsheet/ccc?key=0AhlpTzK9iG-2dDRDbE43SndrN1hOSW5jZ2FrVVEydVE#gid=0
Now we’re getting someplace!
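The "select the rows with an S in the latitude" step can also be sketched in a few lines of Python. This assumes each row is a (country, latitude, languages) triple and that the latitude is text ending in N or S, such as "27 00 S". That format and the function names are my assumptions for illustration; check them against the exported spreadsheet.

```python
def is_southern(latitude: str) -> bool:
    """True when the latitude text carries an 'S' marker (assumed format)."""
    return "S" in latitude.upper()

def split_by_hemisphere(rows):
    """Split (country, latitude, languages) rows into southern and northern lists."""
    south, north = [], []
    for country, latitude, languages in rows:
        target = south if is_southern(latitude) else north
        target.append((country, languages))
    return south, north

# Made-up sample rows, just to show the shape of the data.
sample = [
    ("Australia", "27 00 S", ["English"]),
    ("France", "46 00 N", ["French"]),
]
south, north = split_by_hemisphere(sample)
print(south)
print(north)
```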
Next I sorted by latitude, then copied all of the languages
BELOW the equator into a simple text file (which I then imported into a
spreadsheet) so I could remove duplicates easily with a spreadsheet
function. In Google Spreadsheets, that’s
the UNIQUE function. This step gave me the Master list of Southern Languages.
Did the same thing with the languages ABOVE the equator and
created a spreadsheet Master List of Northern Languages.
Once again I used the unique function to identify the unique
languages.
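If you aren't working in a spreadsheet, the UNIQUE step amounts to collecting every language into a set. Here is a small Python sketch of that, assuming rows of (country, languages) pairs. The function name and the sample data are hypothetical.

```python
def unique_languages(rows):
    """Return the sorted list of distinct languages across all rows."""
    seen = set()
    for _country, languages in rows:
        seen.update(languages)  # a set ignores languages it has already seen
    return sorted(seen)

# Made-up sample rows; "French" appears twice but is counted once.
sample = [
    ("Country A", ["English", "French"]),
    ("Country B", ["French", "Zulu"]),
]
master_list = unique_languages(sample)
print(len(master_list), master_list)
```

Note that this only merges exact matches, so any spelling variants that survived the cleaning step will still be counted separately, just as with UNIQUE.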
Now I can read off the numbers: the Southern Hemisphere has 90 unique
languages, while the Northern Hemisphere has 230 unique languages.
Lesson learned: Doing
these kinds of analyses can sometimes be tricky, even when you write the
question!
Daniel
I appreciated the do-over, but I am still unclear about the original question. In the original post, it is phrased as:
A big part of answering questions like this is making your terms very clear. For this question, let's consider only Official languages (that is, languages recognized by the country) and not worry about relative sizes of speakers. We're just interested in whether or not countries below the Equator have more languages than those above the Equator.
In that case, does % of speakers matter? What if you run into a situation where more people speak one language but it is not recognized by the state or country?
For example, 20% of people in America speak Spanish, but it is not recognized within the 50 US states.
As I mentioned, I'm only interested in the number of "OFFICIAL" languages. (You might be interested in something different, depending on your research goals.) But no, % of speakers doesn't matter. That would be interesting, but that's a slightly different bit of research.