Wednesday, February 25, 2015

Search Challenge (2/25/15): Finding things with additional property limits

I had two questions that came up that superficially look very different, but upon reflection, I realized it's the same search challenge in both cases.  

Earlier this week the son of a friend asked if I could help them find a summer internship that would involve working on the topic of Big Data "somewhere in Silicon Valley, but not in San Francisco."  He went on to ask if could be within an easy commute of Redwood City (since that's where he's going to live this summer).  

I thought about it for a while, and was able to fairly quickly make a map that looks like this: 

where each red pin shows a possible summer internship position working on "big data."  (Interestingly enough, this is pretty much the map of the cities of Silicon Valley...)  

1.  Can you find (or create) a table of 50 summer internship positions in cities that are in Silicon Valley, and not in San Francisco?  Ideally, you'd make an interactive map (like the one above), where you can click on the red button and read about the internship.  

And then..... the very next day, a different friend said she was frustrated looking at  the Ikea catalog.  She's trying to buy a sofa, and found the range of options pretty overwhelming.  It's a great asset to have many things to choose among, but it's sometimes kind of a lot. 

She wondered to me,  "I just want to know the range of prices of Ikea sofas!"  In talking with her, it became clear that she was also really interested in what the distribution of prices is.  (That is, she wanted to know if all Ikea sofas are expensive, or if they have just as many economy-priced sofas as well.)  

I fairly quickly whipped up a chart like this one (not the actual chart, but it looks a lot like this): 

Here the X axis is just different model numbers, and the Y axis is the price.  So you can immediately see that about 25% of all their models fall in the $200 - $400 price range, with a bit more than half being priced below $200.  Obviously, the chart for sofas will be different (everything is probably more expensive).  

2.  Can you make a chart like this one showing the price distribution of all the sofas in the Ikea catalog?  (With the current catalog.)  

Finding the prices isn't hard.  (  The question is how do you extract the prices (or internship position descriptions) and then do something with THAT data?  

Big Tips:  I know this seems like a crazy hard problem, but it's really not. You just have to know the right tools.   You should NOT spend much time (if any) copying and pasting data from the online catalogs of jobs or sofas.  You should be able to find a tool to help you do the automatic extraction of data from a web page.  

(If nobody's figured out how to do this by the EOD tomorrow, I'll give you another big hint on Thursday.)  

A bit o' philosophy:  This is yet-another of Dan's "find the data and massage it" Search Challenges.  As I've said before, this is a blog about Search and Sensemaking.  Although "sensemaking" is typically a larger, longer behavior pattern, these "find the data / massage it" kinds of questions are typical of the kinds of sensemaking questions that professional analysts have to solve all the time.  Because we're trying to have fun AND learn something, my Challenges don't go on for weeks or months, but try to give you the sense of what the larger skill set is like.  
So I hope you enjoy these "find & massage" search data challenges as much as I do.  In truth, I'm having a good time creating these Challenges that teach a very particular skill, and sometimes give a bit of insight at the same time.  

Search on! 


  1. Good day, Dr. Russell, fellow SearchResearchers


    [extract data website]
    [extract data from website tool]
    [web scraping tool]

    Web Pages into Data

    Here searched Ikea. There is an example, so changed query to "Sofa" downloaded csv.

    Work with it on Google Spreadsheet and got the results. Need to verify because Dr. Russell has 100 models and my results only shows 25.

    Later, more!

    1. In order to get more page that can be downloaded. Changed price parameter to get sofas in USD range [1 to 1200]

      1.Deleted no necessary columns.
      2.Downloaded the CSV.
      3.Upload to Google Spreadsheets.
      3.1 Filter Z to A and non white values.
      4.Created Chart.

      Chart 1

      For question 1.

      [Summer internship position "big data"]

      [Summer internship position "big data" -"San Francisco"] Search tools past month. No good results yet.

      Also tried similar queries. Need to work more with this.

      A month or so, wanted something similar to this SearchResearch Challenge. I wanted to know how many podcasts my friends have created in almost 10 years.

      How many podcasts in 10 years.

      I couldn't find an automatic way and did it with Copy/Paste in Google Spreadsheets and found the result. Today, I tried with this method and I still don't understand why in some queries we can only download 1 page.

      More, later!

  2. [tool to extract data] found many choices. I got It looks marvellously simple, so simple in fact I can't quite make it work. I'll need a demo. It has a link to plotly which nicely explains how to mine linkedin job descriptions which should do for the internships.

    Will try to get back to this but other obligations are in the way

    jon tU

    1. THANK YOU JON!!! I hope by shouting it that it makes up for my mistake. It was that helped me grab the data. I apologize for the error.

  3. Question 2

    Quick response. Downloaded data from Ikea to website “MagicImport” & got 127 results.

    Uploaded csv file into Google Sheets.

    Selected prices only column & uploaded to Google Fusion.

    Created chart showing price distribution.

  4. Thanks Ramón! I spent hours this afternoon trying all sorts of import functions in Sheets. I even went down the XML and Xpath path only to end up with nothing working. Your linked helped.

    I haven't stated on Ikea yet, but could you get everything listed for a living room and then take out things like chairs. My thinking is Ikea has items that can pass as sofas but might be labeled couch, futon, etc.
    Thanks again my friend.

  5. Hello searchers,

    For the question 1 here's what I did:
    - [internship silicon valley] > > advanced search: "all of these words": internship summer, "With none of these words": "san francisco", display: 50 results per page

    The magic starts here. I create a sheet in "Google Sheets" and put the big url of the indeed result page in column 1 row 2:

    Then downloaded that sheet and create a new project in OpenRefine (you know how I love this app) by importing this sheet. I got only one cell with the indeed url.
    Then "Add column by fetching urls" gives a huge cell with the html of the indeed result page. The rest is just cooking in Refine: create columns by parsing the html (having a quick at the source code of the indeed result page helps a lot) with job title, location, company and description. Export to Google Fusion Table, geocode and create a map.

    Et voilà… Here's the map:

    Silicon Valley Summer 2015 interships

    I'm not sure it's the simplest way to get the result (let's wait for Dan's answer) but it took me less than 1 hour to get the map.

  6. The issue for question 1 seems to be " How can you get data for this type of position from so many websites?" I've checked several sites that list "summer internships" for "big data" in "Silicon Valley” but this can't be how we would search i.e. one site then another. Too many positions could be overlooked. I have searched through public databases as well. If you could "scrape" websites with this type of data creating a database as you went this would be most comprehensive. So Question 2 seemed a lot easier because we were provided with the data i.e. all the data was from one site.

    1. I guess I should have said that "you need only do one jobs website." Of course, doing more than one is even better -- but I'd consider that to be extra credit!

    2. OK. I quickly setup a CSE but that is as far as I got. Used SERP results, copied bulk urls. I am not sure this would have worked anyway and if I could I would have explored it further.

    3. Here is a map with two websites (red & green icons) listing summer internships. Click on icons for details. Note that has 50 results & the problem is that most cities were labelled “virtual”. I have provided the Fusions Table to show the listings not on the maps.
      My Map

      Fusion Table

  7. I did similar to passager. Queried [ summer internships silicon valley 2015 ] to the site. Used their advanced search to find internships in Silicon Valley -francisco and listing 50.
    Search for Summer internships

    Used Ramón's tool find to extract it to a csv file then put it in Google Sheets and gave the data a "rub down".

    Then to My Maps for Summer Internships Map

    1. Good Morning, Fred and everyone.

      Glad you found helpful my link, Fred.

      About the maps. I like the way you and Passager worked it. I tried some other queries like:
      [summer internship 2015 intext:California intext:"Big Data"]
      [Summer internships 2015] looked good as well as Until now, have not found the way to select the internships that we are looking for. Even so, these links are very helpful to know and learn about internships.

      Looking both of your maps and also Dr. Russell's found that Menlo Park only is on Dr. Russell's. Therefore, his source is other than us.

      Then tried [Summer internships intext:"Big Data" intext:"Menlo Park"] search tools past month.

      Still working on question 1. And, at least, we find a way to create the map, Dr. Russell asked for.

      It has been a great challenge, lots of fun, new tools, new knowledge.

    2. is one site; there are others. As I said above, I really only expected people to use one site. Of course, YMMV if you use a different site. (FWIW, I didn't use -- although it, indeed, gives good data as well as the other sites.)

  8. I did use as well with Magic Import and got a .part filetype which may have been because I increase the pages to 20. I didn’t know how to extract to a csv. I suppose for this challenge using one website to extract 50 positions works but in the real world I would expect to source positions from across many websites because the key question seemed to be location. I don’t know how to do that or how many results that would produce. If it could be filtered down to say easy commutes from Redwood City that would meet the criteria in the original question posed.


  9. Just thought of another issue I had and that was how I would zero in on Big Data internships and Summer Internships. Perhaps in the industry all positions are labelled Big Data but when looking at this wasn’t so clear to me.

  10. good maps passager & Fred, using different Goo formats - found the differences in results interesting, but ended up with 15 apiece (why that #?) - were there ~35 more? I remain mapless, chartless & graphless…

    …while chatting with Anna Ikea (AOA) about this challenge she suggested a couple things…
    first change the pronoun usage; e.g., "She went on to ask if could be within an easy commute of Redwood City" &
    "He wanted to know if all Ikea sofas are expensive…"
    and pointed out that part of the inturn experience is finding the internship — not having it found for you…so, perhaps instead of commuting from Redwood City to Deadwood, it would be more efficient to stay on the ~$200 pouffes (eating cheesy poofs) and access a MOOC —
    MIT example - big DADA

    The most important thing seems to be the transition from a full backpack (school) to a sling pack (internship) — thus avoiding this common school embarrassment:
    and corvidaes
    an internship is all about learning the value of teamwork…
    why so serious?
    unscrewing/rescrewing IKEA with Uncle Bob…Look Like a Million Bucks

    IKEA husky data usage:
    video explanation
    Karen Lowe/IBM
    "Understanding their needs is not a trivial task. Ikea, for example, sends thousands of researchers to people’s homes near their stores to see how people live and the types of products they might use. In developing nations where Ikea prices are not considered low, the store markets itself as an “international lifestyle” chain. In developed nations, it’s marketed as a low-price, mass market retailer."

    1. I thought about using the neuter pronoun to avoid gender bias, but I just flipped a coin, and this was the way it came out. ;-)

    2. das - S/he - Hir - Dan, VIVE LADIFFÉRENCE… though waning in usage… c'est la vie. ,´=•P
      …guess i could intone, "What difference – at this point, what difference does it make?" 1984 The Smiths
      the purple circle
      purple 6.5

  11. Has anyone attempted to build a Google Custom Search Engine to enable multiple websites searches?

    1. But sites like already are aggregators. You mean building an aggregator of aggregators?

  12. If you want to have an idea of how tedious the ikea challenge is, paste this URL in your browser:

    (the trick is to add department=living room found on the lefthand side on the result page of the search for "sofa" on the ikea site, that narrows the search, and then add "&pageNumber=0" in the address bar to list all the results (removing paging) )
    You get 272 results and you find that sorting by prices is not really relevant (some sofas are cheaper than covers for other ones!)

    So again, Open Refine, some hand cleaning then Google Sheets and the graph is here :

    Sofas by price (chart)

    The sheet containing the 91 sofas with descriptions and even a link to an image for each is here :

    Ikea Sofas (List)

  13. Looking at the Ikea website - and specifically their Canadian website these were the steps I took:

    1) clicked on the section for sofas and then went through each sofa and hit "add to list", but only clicking on things that were in fact sofas, some other things showed up like a $60 armrest.
    2) clicked "my shopping list" to get a table with 5 columns.
    3) I manually changed any number in the 'Quantity' column to "1" and hit update. This is because I accidentally added some things twice.
    4) I drag my mouse over the whole table, copied, and pasted it into a google sheet.
    5) I sorted them by their cost descending, and created a chart.

  14. Well I thought I had posted for Anne and me but must not have published our answer :( Too much to re-create so we'll join in next week! We did find a very interesting article on Big Data jobs. We needed to look that up as we weren't really sure what Big Data jobs entailed. We got to the Indeed site and used Ramon's site for "scraping" the data (a term we didn't know). The article we found was in Forbes - The Ikea challenge was a little easier. We put a limiter on the prices because we found the lower priced items were mostly parts for sofas. So we used a lower limit of $100. Well that is about all I can remember so we'll see you next Wed.!