Friday, March 6, 2015

Answer: Finding and getting to know an obscure island



This was fun, eh? 


I'm glad you also enjoy learning about obscure places on the planet, and picking up a few lessons about how to search for information along the way.  

Finding this island was interesting.  But let's revisit our questions...  


1.  Can you find the island?  What's its name, and where exactly is it?  Is it an independent country, or is it part of another country?  
2.  The history of this island is fascinating--at the crossroads of history, but often not really a part of it.  Recently a cave was discovered on this island that has some rather old graffiti. What kind of graffiti is it?  Can you find a picture of the inscriptions?   
3.  Why is there a military tank on the island?  Who would bring such a thing there? 

Finding the island wasn't that hard, but there were a few hiccups along the way.  


My first search was (like many of you): 

     [ cave graffiti Arabian sea island ] 


I chose these terms as they seemed as though they would appear on any description of the island.  Location is always an important term, but not knowing ahead of time anything else about it, I chose to include the terms "cave graffiti" as that seemed pretty unusual--at least unusual enough to let me find the island.  

Sure enough, the results quickly led me to the island of Socotra (after a few moments being diverted by reading about cave paintings recently found in Petra--but that's not an island, so I stopped reading that quickly).  

Then, a quick search for just the island's name: 

     [ Socotra ] 

leads to a wealth of information, with the Wikipedia article being the most prominent. 

Socotra is just off the coast of Yemen and Somalia, solidly in the western Arabian Sea, at 12°30′36″N 53°55′12″E, nearly due west of Goa, south of Oman.  
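If you want to paste those coordinates into a mapping tool, decimal degrees are usually easier to work with than degrees/minutes/seconds. Here is a minimal sketch of the conversion; the function name `dms_to_decimal` is just something I made up for illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # By convention, South and West are negative.
    return -value if hemisphere in ("S", "W") else value


# Socotra's coordinates as quoted above: 12°30′36″N 53°55′12″E
lat = dms_to_decimal(12, 30, 36, "N")
lon = dms_to_decimal(53, 55, 12, "E")
print(round(lat, 4), round(lon, 4))
```

That works out to roughly 12.51 degrees north and 53.92 degrees east.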

I first looked it up on Google Maps: 




And then zoomed in to take a look in Satellite view:



In this view, you can see it's a chain of 4 islands, with the largest, Socotra, being mostly undeveloped, and pretty arid.  

It has fascinating flora (the Dragon's Blood tree, and a traditional frankincense source), and is also home to some remarkable caves.  

To determine the government, I read the Wikipedia entry, which says that Socotra is a governorate of Yemen, and then cross-checked with the CIA World Factbook site (just:  [ site:CIA.gov Socotra ]).  (They tend to be pretty up to date about governmental issues, and they agree.)  

Because I want to be careful about these things, I also checked the Arabic language version of the Wikipedia page (available by clicking on the Arabic language link on the left-hand side of the Wikipedia page--it looks like this:  العربية).

That source (which is not just a copy of the English Wikipedia article) agrees.  It's part of Yemen.  

Finally I tried one last thing, just for completeness. 

I know that this is such an interesting place that a magazine like National Geographic MUST have something about it.  So I tested this out by doing a quick site: search like this:  

     [  Socotra site:nationalgeographic.com  ]

Which leads to several marvelous articles on the National Geographic site.
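If you find yourself doing the same site: search over and over for different sites, you can build the search URL with a few lines of code. This is only a sketch: the function name `site_search_url` is hypothetical, and it assumes the familiar `https://www.google.com/search?q=...` URL pattern.

```python
from urllib.parse import urlencode


def site_search_url(terms, site):
    """Build a Google search URL that restricts results to one site."""
    query = f"{terms} site:{site}"
    # urlencode takes care of escaping the space and the colon.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("Socotra", "nationalgeographic.com"))
```

Swap in a different site name (say, nytimes.com) to run the same query against another well-known source.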

And then, just because I was really interested in Socotra, I decided to do a related: search as well, using one of the best articles from National Geographic as my seed text.  

      [related:ngm.nationalgeographic.com/2012/06/socotra/white-text ]

Leads to lots of related things… Really a great way to browse around and find great content.  




Graffiti:  Finding pictures of the Socotran cave art wasn't hard, but I really wanted some good context AND an authoritative source.  

I had noticed that there was a reference to the book "Foreign Sailors on Socotra. The inscriptions and drawings from the cave Hoq" in the references on the Wikipedia article, so I had to search for it. 

A search for the book title: 

     [Foreign Sailors on Socotra. The inscriptions and drawings from the cave ]

Leads to the Academia.edu article Socotra Island, which is a great resource for all things academic that have been written about Socotra.  

Reading through those results then led me to the article, Les vestiges antiques de la grotte de Hôq (Suqutra, Yémen).   In: Comptes rendus des séances de l'Académie des Inscriptions et Belles-Lettres, 146e année, N. 2, 2002. pp. 409-445


Which has the following images (and much more): 



Images from the Academia.edu article.  


This is a marvelous book, written by archaeologists who have spent considerable time in the Hoq cave on Socotra.  Even if you don't read French, this is well worth looking at.  

Note how the first image (Fig 10) looks much like an Indian script, not an Arabic one.  There are a large number of inscriptions, drawings and archaeological objects. The majority seem to have been left by sailors who visited the island between the 1st c. BC and the 6th c. AD. The majority of the texts are written in the Indian Brahmi script, but there are also inscriptions in South-Arabian, Ethiopian, Greek, Palmyrene, and Bactrian scripts.  



A Tank?  It's really easy to check if there's a tank there.  A simple image search works well: 

     [ Socotra tank ] 


And clicking through on a few of those tells the same story over and over... "geez... we were on Socotra, and look at this tank we found."  

Question is:  How and why is it there? 

One clue I picked up was in the Arabic language Wikipedia article on Socotra:  "...Socotra [was] in the Convenant of the Soviet naval base military..for battleships and Walosatil..working until the unification of Yemen in 1990."  ("Walosatil" is the "Soviet Navy" transliterated.)  

I assume this means that the former Soviet Union found Socotra to be strategically interesting, and so had some kind of deal with them.  That hint suggested I search for: 

     [ Soviet tank Socotra ] 

That worked.  It led to multiple documents, including several firsthand reports by war machine fans, who verified that this is a T-34/85 (a popular Soviet tank).  

Just to double check that as well, I decided to check on the US "paper of record" with the query: 

     [ site:nytimes.com Socotra Soviet ] 

By looking at a few of those articles, it's pretty clear that the Soviets had a military relationship with Socotra (although they now seem to have moved away, and few traces of their existence there remain to be seen in Google Earth). 

However, according to one article on Socotra.info, "Soviet military ships preferred rather to anchor off Yemeni Island Socotra’s coast than in the Berbera port [on the Somalia coast]....Socotra had neither a port nor a mooring..."  (It's worth noticing that Socotra.info is a web site run out of Moscow, according to its WHOIS information.)  

The future of Socotra is clearly going to be interesting...  




Search Lessons:  

1.  Check multiple sources. As you can see, Socotra is an "in-between" topic.  There are references to it scattered across multiple sites, in multiple languages.  As Luís mentioned in his comments, many of the best references are in Russian (they were there for quite a while, doing extensive scientific studies).  We have to learn to check multiple kinds of citations! 


2.  Think about checking sources that are on the general topic, but might not show up in the top 10 (or 20) of the SERP.  As you saw, my search in National Geographic reveals a host of related and really interesting articles.  Unfortunately, you have to know to look in order to find them.  Keep in mind that sometimes a well-known site might have to be searched separately.  

3.  Check different kinds of resources.  That's why I looked at Images and Books, as well as the usual web search results.  


______________________________


Postscript:  About that "at least one million years.." reference at the top of my Challenge

As Luís found out, the Wikipedia article mentions the Oldowan culture in the very beginning, without much fanfare.  But if you follow that link, "The Oldowan, sometimes spelled Olduwan, is the archaeological term used to refer to the earliest stone tool industry in prehistory. Oldowan tools were used during the Lower Paleolithic period, 2.6 million years ago up until 1.7 million years ago, by ancient hominids."  Apparently some Russian archaeologists have found Oldowan tools on Socotra, made from local stones.  So... it's  been occupied for quite some time... 




Wednesday, March 4, 2015

Search Challenge (3/4/15): Finding and getting to know an obscure island

Caution: While beautiful, this looks nothing like the island we seek in today's Challenge. 

I have to admit that I have an unusual hobby...
I collect islands.  The more obscure, the more remote, the more unusual, the better.  

I've been known to scour maps to find places that might have islands, and after a while, you get to see the patterns of islands--both along the coasts, and in the center of the sea.  

That's how I came to find (but not yet visit, alas!) a rather obscure island in the Arabian Sea that has had a population for at least a million years, and is named not with an Arabic term, but with a name that's Indian in origin.  Can you find it and answer a few questions about this remarkable place? 

This island is known for its unusual plants and animals, many of which are endemic--that is, they occur no place else on earth. 

1.  Can you find the island?  What's its name, and where exactly is it?  Is it an independent country, or is it part of another country?  
2.  The history of this island is fascinating--at the crossroads of history, but often not really a part of it.  Recently a cave was discovered on this island that has some rather old graffiti.  What kind of graffiti is it?  Can you find a picture of the inscriptions?   
3.  Why is there a military tank on the island?  Who would bring such a thing there? 

Let us know how you found the answers!  My answers will come out on Friday.  

Until then...

Search on! 




Friday, February 27, 2015

Answer: Finding things with additional property limits (beginning web scraping)

In this week's Challenge ... 

I asked two questions that seem different, but are really both examples of web scraping.  When I realized they could both be done using the same tool... well, that's too good a chance to demonstrate how to use web scraping in your day-to-day SearchResearch! 

I asked these questions which are both "can you grab this data from a web site, and then make these charts?"





1.  Can you find (or create) a table of 50 summer internship positions in cities that are in Silicon Valley, and not in San Francisco?  Ideally, you'd make an interactive map (like the one above), where you can click on the red button and read about the internship.  



2.  Can you make a chart like this one showing the price distribution of all the sofas in the Ikea catalog?  (With the current catalog data.)  
 





In the comments, Ramón got us off to a good start by pointing out that both of these problems required grabbing data from websites, which he pointed out is called "web scraping."  That's how I started this Challenge solution as well, by searching for web scraping tools.  (I'll write another post about web scraping generally... but this is what I did for this Challenge.)

     [ web scraping tool ] 

As you can see, there are many, many tools to do this--I just chose the first one, Import.io, as a handy way to scrape the internship data and the Ikea catalog.

Web scraping is just having a program extract data from a web page.  In the case of Import.io, you just hand it a URL and it pretty much just hands back a CSV file with the data in it.
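To make "a program extracts data from a web page" a bit more concrete, here is a minimal sketch using only Python's standard library. I don't know how Import.io works internally, so treat this as a stand-in, not a description of that tool. The sample HTML, the class name `TableScraper`, and the function `scrape_to_csv` are all made up for illustration; a real page would have messier markup.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up sample page; a real scraper would download the HTML first.
SAMPLE_HTML = """
<table>
  <tr><td>Data Intern</td><td>Mountain View, CA</td></tr>
  <tr><td>Analytics Intern</td><td>Palo Alto, CA</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_to_csv(html):
    """Parse the HTML and return the table rows as CSV text."""
    parser = TableScraper()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()


print(scrape_to_csv(SAMPLE_HTML))
```

The CSV text that comes out is the kind of thing you can import straight into a spreadsheet.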


So, for the first Challenge (finding cities with internships about "big data" that are in the Bay Area, but not in SF), we first have to find a web site that has internships listed.  My query to find them was:

     [ summer internships Mountain View ] 

Why Mountain View in the query?  Because it's pretty centrally located (and it's where I go to work every day, so I have a local's interest), and I know there are a lot of large companies there with internship possibilities.

Looking at the list of results, I found Indeed.com, Jobs.com,  InternMatch.com (and many others).

I decided to scrape the LinkedIn.com listings--again, because they're just down the street, and I happen to know they have a large interest in "big data" topics.

Their jobs site is:  www.linkedin.com/job/  -- I just filled out the form for internships near Mountain View on the topic of "big data" and found a nice set of results.



Note that I set the "Location" filter to be "Mountain View" (I could have set it to any city in the Valley).  

To scrape this data, I just grab the URL from the page:

https://www.linkedin.com/job/intern-%22big-data%22-jobs-mountain-view-ca/?sort=relevance&page_num=2&trk=jserp_pagination_2

And drop that into Import.IO -- and let it scrape out the data.



At the bottom of the Import.IO page there's a "Download..." option.  I saved this data to a CSV file, and then imported that into a Google Spreadsheet, which gave me: 





As you probably noticed, this is only 25 positions.  I can easily go to the "next page" of results on the LinkedIn site, copy the URL, drop that into Import.IO and repeat.  I did this, and appended the results to the bottom of the spreadsheet.  (Shared spreadsheet link.) 

In the spreadsheet you can see that I copied all of the city names and put them into Sheet 3 ("Cities").  I did a quick cleanup there, and created a second column "CityName" so I could then pull this spreadsheet into a new My Map.  But FIRST I copied all of the Cities and CityNames into another spreadsheet  with JUST the names of the cities in it.  (Why?  Because My Maps doesn't like to import spreadsheets with special characters in the columns.  A fast workaround is to just make a new sheet with just the data--the city names--that I care about.)  
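I did this cleanup by hand in the spreadsheet, but the same step is easy to script. Here is a rough sketch of the idea: strip each location down to a plain city name, then count the duplicates. The raw values below are hypothetical stand-ins for the scraped location column, and `clean_city` is a name I made up.

```python
import re
from collections import Counter

# Hypothetical raw values, standing in for a scraped "location" column.
raw_locations = [
    "Mountain View, CA, US",
    "Palo Alto, CA, US",
    "  Mountain View, CA, US ",
    "San José, CA (US)",
]


def clean_city(raw):
    """Keep only the city name: drop everything after the first comma."""
    city = raw.split(",")[0]
    # Collapse stray whitespace left over from the scrape.
    return re.sub(r"\s+", " ", city).strip()


city_counts = Counter(clean_city(loc) for loc in raw_locations)
for city, count in city_counts.items():
    print(city, count)
```

Counting as you clean also shows how many listings collapse onto the same city, which matters once you put them on a map.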

So.. my new (city names only) spreadsheet looks like this: 




And when you import that into a My Map, you get this map




Note that there are many more cities in the spreadsheet than are shown here--there are a lot of duplicates.  (I guess we could have looked-up the street addresses for each business, but that's too much like real work.)  



And now that you know this method....  

doing the Ikea sofas is just as straightforward.  

Go to the Ikea page and search for sofas.  Grab the URL and paste into Import.IO -- that gives you another data table.  Their URL looks like this:  

     http://www.ikea.com/us/en/search/?query=sofa&pageNumber=1



Notice that this is page 1 out of 36.  It's kind of a hassle to get all 36 (but I'll write up how to do that in another post!), so let's get a few more and do the same "save as CSV."  

Luckily, Import.IO has a "Save 5 pages" option which automatically grabs the next 5 pages of Ikea data (just by changing the &pageNumber=1 argument in the URL above).  
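The page-by-page trick is simple enough to sketch yourself: keep the URL the same and change only the pageNumber argument. The function name `page_urls` is mine, and the URL pattern is just the one shown above, so it may well have changed since this was written.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.ikea.com/us/en/search/"


def page_urls(query, first_page, last_page):
    """Return one search URL per results page, inclusive of both ends."""
    return [
        BASE_URL + "?" + urlencode({"query": query, "pageNumber": page})
        for page in range(first_page, last_page + 1)
    ]


for url in page_urls("sofa", 1, 5):
    print(url)
```

Each URL in that list can then be handed to whatever scraping tool you're using.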

So by the time you save the CSV and import it into your spreadsheet, you'll have 120 rows of data.  (For completeness, you could go to page 6, and then import the next 5 pages... but I'm happy with 120 samples of Ikea line of sofas.)  Here's my spreadsheet:  




And then you can easily copy out column L (prodprice_price) and do whatever kind of visualization you'd like, including the chart like that above, the chart below, or others: 




Such as a histogram of prices (in $150 / bucket price-ranges).  You can see here where the bulk of their product lies.  Ikea has a target audience in mind, but also carries a few rather expensive items as well.  
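I made my histogram with the spreadsheet's charting tools, but the bucketing itself is only a few lines of code. Here is a sketch with made-up prices standing in for the scraped price column; `bucket_counts` is a hypothetical helper name.

```python
from collections import Counter

BUCKET_WIDTH = 150  # dollars per bucket, matching the histogram above

# Made-up prices standing in for the scraped price column.
prices = [99.00, 149.00, 179.00, 299.00, 399.00, 449.00, 899.00]


def bucket_counts(values, width):
    """Count how many values fall into each [low, low + width) range."""
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Print a crude text histogram, one row per non-empty bucket.
for low, count in bucket_counts(prices, BUCKET_WIDTH).items():
    print(f"${low}-${low + BUCKET_WIDTH - 1}: {'#' * count}")
```

Changing BUCKET_WIDTH lets you see how much the shape of the distribution depends on the bucket size you pick.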






Search Lessons:  

1.  Find the right tool.  As I've said many times before, often the best way to start a complex project is to figure out what operation you're actually doing.  (This particular task is called "web scraping.")  And then find a tool that will help you out.  In this case, Import.IO is perfect for the task.  Be aware that there are many such tools--some of which might be better matches for the task you're doing.  (It all depends on the details of what you're trying to do.) 

2.  Linking tools together.  To solve this Challenge, you needed to not just extract the data, but also load it into your favorite spreadsheet, do a bit of cleaning, and then either visualize it with your spreadsheet charting tools, or export/import your data to My Maps (or whatever your intended end goal is).  


Hope you enjoyed this Challenge as much as I did! 


Search on! 




______ 
Addendum:  In response to a couple of questions from readers, I went back and made the map of "big data internships" truly interactive.  Now if you click on a pin, you'll see the city, the company with the position, and a link to the job posting.  


Wednesday, February 25, 2015

Search Challenge (2/25/15): Finding things with additional property limits


THIS WEEK.... 
I had two questions that came up that superficially look very different, but upon reflection, I realized it's the same search challenge in both cases.  

Earlier this week the son of a friend asked if I could help them find a summer internship that would involve working on the topic of Big Data "somewhere in Silicon Valley, but not in San Francisco."  He went on to ask if it could be within an easy commute of Redwood City (since that's where he's going to live this summer).  

I thought about it for a while, and was able to fairly quickly make a map that looks like this: 


where each red pin shows a possible summer internship position working on "big data."  (Interestingly enough, this is pretty much the map of the cities of Silicon Valley...)  

1.  Can you find (or create) a table of 50 summer internship positions in cities that are in Silicon Valley, and not in San Francisco?  Ideally, you'd make an interactive map (like the one above), where you can click on the red button and read about the internship.  




And then..... the very next day, a different friend said she was frustrated looking at  the Ikea catalog.  She's trying to buy a sofa, and found the range of options pretty overwhelming.  It's a great asset to have many things to choose among, but it's sometimes kind of a lot. 

She wondered to me,  "I just want to know the range of prices of Ikea sofas!"  In talking with her, it became clear that she was also really interested in what the distribution of prices is.  (That is, she wanted to know if all Ikea sofas are expensive, or if they have just as many economy-priced sofas as well.)  

I fairly quickly whipped up a chart like this one (not the actual chart, but it looks a lot like this): 



Here the X axis is just different model numbers, and the Y axis is the price.  So you can immediately see that about 25% of all their models fall in the $200 - $400 price range, with a bit more than half being priced below $200.  Obviously, the chart for sofas will be different (everything is probably more expensive).  


2.  Can you make a chart like this one showing the price distribution of all the sofas in the Ikea catalog?  (With the current catalog.)  

Finding the prices isn't hard.  (Ikea.com)  The question is how do you extract the prices (or internship position descriptions) and then do something with THAT data?  

Big Tips:  I know this seems like a crazy hard problem, but it's really not. You just have to know the right tools.   You should NOT spend much time (if any) copying and pasting data from the online catalogs of jobs or sofas.  You should be able to find a tool to help you do the automatic extraction of data from a web page.  

(If nobody's figured out how to do this by the EOD tomorrow, I'll give you another big hint on Thursday.)  

A bit o' philosophy:  This is yet-another of Dan's "find the data and massage it" Search Challenges.  As I've said before, this is a blog about Search and Sensemaking.  Although "sensemaking" is typically a larger, longer behavior pattern, these "find the data / massage it" kinds of questions are typical of the kinds of sensemaking questions that professional analysts have to solve all the time.  Because we're trying to have fun AND learn something, my Challenges don't go on for weeks or months, but try to give you the sense of what the larger skill set is like.  
So I hope you enjoy these "find & massage" search data challenges as much as I do.  In truth, I'm having a good time creating these Challenges that teach a very particular skill, and sometimes give a bit of insight at the same time.  


Search on! 


Friday, February 20, 2015

Answer: A couple of odd questions...

One week, two curious questions.   Two interesting answers... 


1.  Those equations are really interesting, but WHAT do they mean?  And why would Woody Paul put them on his concert performance gear?  (For extra points: Where did Woody go to college?)  

Woody Paul's awesome shirt with mysterious equations.

It's not hard to figure out from a quick query who Woody Paul really is: 

     [ Woody Paul "Riders in the Sky" ] 

Here I quoted the name of the band just to make sure I didn't get anything spurious.  (It turns out not to matter too much.)  

Woody Paul is his stage name.  Real name:  Paul Chrisman. My favorite article was from MIT's Technology Review magazine which points out that he got his PhD in nuclear engineering from MIT in 1976.  When he graduated there were two job offers waiting--an assistant professorship at Columbia University or a recording gig in California.  (Other versions of this story have him heading to Nashville.) It's clear he chose music in either case. 

A little more poking around finds that his thesis was "Inertial, Viscous, and Finite-Beta Effects in a Resistive, Time Dependent Tokamak Discharge", Thesis Nuc. Eng. 1976, PhD, supervised by James E. McCune. 

This is all relevant because it gives us a clue about what these symbols on his shirt might be. 

If you already recognize the symbols involved, you've got a headstart.  But suppose you DON'T recognize any of these symbols, how do you start?  

You might remember that earlier I've written about "Symbol Search" using either the Shape Catcher web app, or the Google Docs symbol search tool.  

In both cases, you draw the character and look through the list of recognized symbols.  Here's my screenshot from using ShapeCatcher.com  (it works with Google Docs symbol recognition as well)... 


 The trick here is to notice (look carefully!) that the symbol on Woody's shirt has a special, thickened left side on the downward pointing triangle.  That's what makes it a "nabla" symbol.  (Note:  It's not a delta--that's a triangle that points up, not down!) 

Once you know that, you can do a query for: 

     [ nabla ] 

and learn that it is the name of the symbol that (quoting Wikipedia) "... is used in mathematics to denote the del operator, a differential operator that indicates taking gradient, divergence, or curl..."  

That might look scary, but hold on a second--stay with me here.  The key idea here is that nabla isn't the big thing, that's just the character's name (it's a bit like saying "virgule" for the slash, or divide, character).  The important term to notice here is that it's the del operator.  

Okay, so what's del?  

If you [ define del ] you'll find it's an operator used in vector mathematics.  

So now I just searched for the most obvious natural language translation of this, which I put down as: 

     [ del dot b equals 0  ]

Sure enough, once you do that, you land in the land of physics discussions.  



It only takes a minute or so of looking around there (in PhysicsForums.com) to find out that his "del dot E" is one of Maxwell's equations.  A quick search for that takes us to the Wikipedia entry (or any of a thousand textbooks, all with exactly the same information): 

Adapted from Wikipedia entry on Maxwell's Equations

As you can see, Woody has Gauss's law in his collar, and Gauss's law of magnetism and Faraday's law of induction embroidered into his western-style, fringed shirt.  And as Luis cleverly spotted in an image I hadn't seen before, I assume Ampère's circuital law is on the back of the yoke.  (You can just see it riding up over his right shoulder here.)  

Excerpt of image from SCVhistory.com
  

Maxwell's equations are four equations that form the foundation of electrodynamics, classical optics, and electric circuits. They describe how electric and magnetic fields are generated and altered by each other and by charges and currents. These equations are named after the Scottish physicist and mathematician James Clerk Maxwell, who published an early form of those equations between 1861 and 1862.
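For reference, here are the four equations written out in the usual differential form, in SI units. This is the standard textbook version; the notation on Woody's shirt may differ (for example, in its choice of units). In order, they are Gauss's law, Gauss's law for magnetism, Faraday's law of induction, and Ampère's circuital law with Maxwell's addition.

```latex
\begin{align}
\nabla \cdot \mathbf{E}  &= \frac{\rho}{\varepsilon_0} \\
\nabla \cdot \mathbf{B}  &= 0 \\
\nabla \times \mathbf{E} &= -\frac{\partial \mathbf{B}}{\partial t} \\
\nabla \times \mathbf{B} &= \mu_0 \left( \mathbf{J} + \varepsilon_0 \frac{\partial \mathbf{E}}{\partial t} \right)
\end{align}
```

Here E and B are the electric and magnetic fields, ρ is the charge density, J is the current density, and ε₀ and μ₀ are the electric and magnetic constants. Each line starts with the nabla symbol, which is why finding its name was the key to the whole search.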

They are, of course, equations that Woody would have used extensively in his thesis writings.  A Tokamak reactor is one that uses a strong torus-shaped magnetic field to contain a plasma.  It was (and still is, in some circles) the best approach for possibly generating a viable fusion reactor. 




2.  Have crows become suddenly much more common in the Bay Area?  Is it just crows, or have ravens also turned into frequent guests?  


You know, sometimes things don't work out the way you'd expect.  I'd fully expected that the only good way to answer this would have been to look at data pulled from the annual Audubon bird counts.  (Such as this entry for the 2014 counts of crows.) 

But people other than me have noticed this crow / raven trend.  Consequently, the first searches you might do:  

     [ crow population statistics san francisco bay area ] 

end up pretty much answering the question.  The lesson for me is obvious--pre-test the questions!  

Several people pointed to one of the three articles about crow population expansion.  

SFGate (a local media service) wrote an article in 2012 about "Why ravens, crows are more common now in the Bay Area" quoting an ornithologist from Cornell, and giving crow counts from the Audubon census of 1991 (17 crows and 54 ravens in San Francisco; 60 crows and 23 ravens in Oakland), and 2011 (SF tallied 566 crows and 599 ravens;  while Oakland had 1,152 crows and 193 ravens).  

They also pointed out that crows were "relatively rare" back in 1927, so this really is a recent phenomenon.  
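To get a feel for how big the change is, you can turn the SFGate counts above into growth factors. This is just simple arithmetic on the numbers quoted in that article; the `growth_factor` helper is my own naming.

```python
# Counts quoted in the SFGate article: (1991 count, 2011 count).
counts = {
    ("San Francisco", "crows"): (17, 566),
    ("San Francisco", "ravens"): (54, 599),
    ("Oakland", "crows"): (60, 1152),
    ("Oakland", "ravens"): (23, 193),
}


def growth_factor(before, after):
    """How many times larger the later count is than the earlier one."""
    return after / before


for (city, bird), (y1991, y2011) in counts.items():
    factor = growth_factor(y1991, y2011)
    print(f"{city} {bird}: {y1991} -> {y2011} (about {factor:.0f}x)")
```

By that measure, San Francisco's crows grew about 33-fold in twenty years and Oakland's about 19-fold, with ravens up roughly 11-fold and 8-fold respectively.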

The San Jose Mercury News also had an article that same year, giving similar numbers for crow growth on the Peninsula (where I live) and in the South Bay (where San José is).  Their article, "Counting crows: Number of black birds on the rise in Bay Area" also has a lovely chart showing the growth rate over time.  

Chart from SJ Mercury story on crows, 2012, by Aaron Kinney.  


As you can see, the growth rate in the South Bay (San Jose) has been nothing short of spectacular.  

But just this past month, the Mercury wrote another article about crows.  "They're everywhere! Crows, ravens overrun Bay Area" (Nicholas Wieler, Santa Cruz Sentinel, 2/14/15)  which repeats the same basic data, but adds a new graph showing crows v. ravens.  

Graph from SJ Merc article.  

The findings here are clear enough:  Yes, both crow and raven populations have been skyrocketing--with crows more common in the South Bay, and ravens more common in San Francisco.  


Search Lessons: 

1.  Remember the search tools you know about!  Finding the nabla is easy, IF you know about the symbol finding tool.  (And if you don't, try describing it in the simplest way possible-- something like:  [ downward pointing triangle symbol ]  )  
2.  Once you've found the symbol name, try searching for the way it's commonly used.  In this case, it was to discover that the symbol was also called "del" and under that name, it's easy to find in the physics literature. (Which is why it was helpful to know about Woody Paul's previous writings...)  
3.  When searching for analyses (e.g., crows population over time), always search for a completed report.  You never know (I certainly didn't!) when someone will have already done the analysis for you.  Double check the data and the sources, but all of these articles refer to data from Audubon Society and/or the Cornell Lab of Ornithology--both extremely respected resources in the birding world.  


Search on!