Monday, May 29, 2017

Answer: Finding tweets from a particular place

Geolocating Tweets... 

People tweet about darn near everything.  That's handy for you, if you're a SearchResearcher.   


Using geolocated tweets, I figured out that my friend's house was NOT on fire.
 

This week's Challenges:  

1.  When the Google I/O event happened last week, I wanted to see what kinds of things were being tweeted about.  Unfortunately, not everyone adds the #GoogleIO hash tag to their tweet, and sometimes people add the hashtag when they weren't really there.  Can you find tweets that were posted from INSIDE of Google I/O 2017?  
We want to search for tweets that are geo-located to a particular place.  In this case, the location of Google I/O 2017.  

And where was that? 

     [ Google I/O 2017 ] 

quickly tells you that it was held at the Shoreline Amphitheatre, right by the Googleplex.  (lat/long: 37.4263042,-122.080828)   


You can search for Twitter's Advanced Search UI (by doing this search): 

      [ Twitter advanced search ] 

but you'll find that you can search for towns, but not specific geolocations.  

I next did a search for: 

     [ Twitter geocode search ] 

and quickly learned from a number of sources (e.g., ResearchGate and ThoughtFaucet) that you can use the regular Twitter search box and drop a geocode in there following the pattern:  geocode:LAT,LONG,RADIUS -- where you enter the latitude, longitude, and radius with a unit (e.g., 10km) appended.  ... and no spaces!  That is, the modified query is now: 




But note also that Twitter decides which tweets fall within your given radius by using its own internal methods to geolocate tweets. It uses a combination of device/GPS coordinates (but ONLY if the user has opted into providing their location) and the user-provided profile location.  Usually, "user-provided profile location" will be a city, and not often an amphitheatre location.  
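To make the geocode:LAT,LONG,RADIUS pattern concrete, here is a minimal Python sketch that assembles the search term with no spaces. The function name is made up for illustration, and accepting "mi" as a unit alongside "km" is my assumption; the post itself only shows km.

```python
def geocode_operator(lat: float, lon: float, radius: float, unit: str = "km") -> str:
    """Build a geocode:LAT,LONG,RADIUS search term (hypothetical helper)."""
    # Assumption: "km" comes from the post; "mi" is assumed to be accepted too.
    if unit not in ("km", "mi"):
        raise ValueError("unit must be 'km' or 'mi'")
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude/longitude out of range")
    # repr() keeps the coordinates at full precision; ':g' writes 5.0 as '5'.
    # No spaces anywhere, as the pattern requires.
    return f"geocode:{lat!r},{lon!r},{radius:g}{unit}"


print(geocode_operator(37.4220041, -122.0862515, 5))
# geocode:37.4220041,-122.0862515,5km
```

The printed term is what you would paste into the regular Twitter search box.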


2.  Obviously, you'd like to be able to restrict your tweet lookup by time and date.  Can you find those INSIDE tweets from Google I/O that were posted only during the days of the event?  (May 17 - 19, 2017)  

Let's go back to Twitter's advanced search page.   Here they have the ability to search for Tweets by date range.  When we fill out that date range, we see that it creates a query to Twitter with this in the middle of the URL: 

      %20since%3A2017-05-17%20until%3A2017-05-19

That's the date-range specification for a Twitter query.  (For the curious: %3A is the URL encoding for the character ":" (colon) while %20 means " " (space).)  
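If you'd rather not decode those %-codes by hand, Python's standard library will do it. This is just a small sketch using urllib.parse on the fragment shown above; the variable names are mine.

```python
from urllib.parse import quote, unquote

fragment = "%20since%3A2017-05-17%20until%3A2017-05-19"

# Decode: %20 becomes a space, %3A becomes a colon.
decoded = unquote(fragment)
print(repr(decoded))  # ' since:2017-05-17 until:2017-05-19'

# Encode again; safe="" forces ':' to be escaped as well.
encoded = quote(decoded, safe="")
print(encoded == fragment)  # True
```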

This means the whole query really is: 

         geocode:37.4220041,-122.0862515,5km since:2017-05-17 until:2017-05-19

This is generally a good trick to know:  Many search engines pass their arguments through the URL.  If you decode the URL into its parts, you can often replace one of those parts with your own value and get exactly the results you want; in this case, we limit the results of the lat/long geocode: query to ALSO include the date limits.
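Here is a hedged sketch of that URL surgery in Python. The base address https://twitter.com/search and the q parameter name are my assumptions (the post only shows the middle of the URL), and build_search_url is a made-up helper.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Assumption: search page address and "q" parameter; not given in the post.
SEARCH_URL = "https://twitter.com/search"


def build_search_url(geocode: str, since: str, until: str) -> str:
    """Combine a geocode: term with since:/until: limits into one URL."""
    query = f"{geocode} since:{since} until:{until}"
    # safe="" percent-encodes ':' and ',' as well as spaces.
    return f"{SEARCH_URL}?q={quote(query, safe='')}"


url = build_search_url(
    "geocode:37.4220041,-122.0862515,5km", "2017-05-17", "2017-05-19"
)
print(url)

# Decoding the URL again recovers the plain-text query.
print(parse_qs(urlsplit(url).query)["q"][0])
```

The second print is a quick sanity check: whatever you encode into the URL should decode back to the query you meant to send.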

If you edit the URL for the geocoded query (see above), and add to the end of the URL the date-range limiter shown above, you'll get a result like this (I put the dashed oval around the date-range restricted part of it).  



Obviously, if you were NOT doing a geocode query, you could just do this with the advanced search UI.  

OR if limiting the search results by city was okay, then you could do both the city and the date restriction by the advanced UI.  That's actually not a bad way to do it.  Yes, you get MORE than just Tweets from Shoreline Amphitheatre, but they're all pretty good. 


3.  Since you can find the location of a tweet, is it possible to make a map of tweets that are posted from the city of San Francisco during a single day? How would you do this?  (I'll post the best maps in the blog next week.)  
There are many ways to do this, but let's start with the simplest:  Find an already existing solution. 

My first query was: 

     [ tweet map ] 

This led me to MapD's Tweetmap demo.  Here's an image of the map for tweets with the hashtag #GoogleIO, zoomed into San Francisco.  


Here you can see (on the very bottom) the tweet volume by day.  The big peak on the right hand side is May 17, 18, and 19.  


But I was hoping for a way to search for tweets (where I could guide the tweet selection), and then put them onto a map that I could then edit as I wanted.  

To do that, I knew I'd need a tool that could collect the tweets (given my query), and extract the geocode, and then put them onto a map in some way.  

In truth, I poked around for quite a while, trying out different options, finally settling on this.  

I figured that someone probably built a Google Spreadsheets add-on that would grab the tweets.  So my search was: 

     [ Google spreadsheets tweet collection ] 

There are several.  After testing a few, I found that Twitter Archiver seems to fill the bill.  It's an add-on to Google Spreadsheets that lets you write a Twitter query (including using geocode:) and then it runs every hour and updates the sheet.  But as I found out (after a couple hours of debugging), it won't let you search for tweets from a specific place OR a specific location.  

But it will search for recent tweets and drop them into your spreadsheet... from which you can then do your own processing.  Here's what I did with it.   

You add the Twitter Archiver to your spreadsheet, and then create a "Search Rule" to specify what you want to download into the spreadsheet.  




The add-on then runs this query every hour and adds new tweets into the spreadsheet.  It then looks like this (notice that the dates on the far left are from today, NOT from the dates of the conference--if I'd run this last Wednesday, I'd have all of those tweets!) I've blurred out people's names and IDs. 

Don't be mistaken: this is NOT a regionally restricted search (no matter what it says).  The results are coming from all over.   



This is pretty good, but we now need one more step to put these out onto a map.  Luckily, I remember that Google's My Maps lets me import a CSV, including lat/longs and content.  

I created a map and then imported the spreadsheet into my map, selecting the "place" column for the pin to drop.  
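If your tweets are in a program rather than a spreadsheet, a CSV with latitude and longitude columns is easy to produce. The sketch below is only an illustration: the rows are made-up placeholders (not real tweets), and the column names are my own choice rather than anything My Maps requires.

```python
import csv
import io

# Placeholder rows for illustration only; these are not real tweets.
tweets = [
    {"text": "Keynote is starting!", "latitude": 37.4263042, "longitude": -122.080828},
    {"text": "Long line for coffee, but worth it", "latitude": 37.4220041, "longitude": -122.0862515},
]


def tweets_to_csv(rows) -> str:
    """Serialize tweet dicts to CSV text with one header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["text", "latitude", "longitude"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # csv quotes any text containing commas
    return buf.getvalue()


print(tweets_to_csv(tweets))
```

Save that output to a .csv file, and on import pick the latitude/longitude columns for the pin position and the text column for the label.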

Interestingly, here's that map: 


This is close, but not quite what we had in mind.  FWIW, I tried out a few other scraping tools and found that none of them (at least of the ones I tried) quite have the ability to search by specific geocode.  Clearly Twitter has the ability--but the results aren't making it into the scrapers.  

This probably means that I'll need to go write some real code in order to make this work.  I'll try this later this week and report back to you. 

And we now know how to make a map with pins for all of the tweets... once we get them.  

But at least we found a commercial tool that's nearly there!    


4.  For completeness, it's useful to know how many tweets come with a geocode. Can you estimate the fraction of tweets that are geocoded?  (I ask because you need to know what fraction of tweets you're NOT seeing when you do a search for tweets posted from a particular location.  If you're only picking up 5%, then that's a very different story than if you're seeing 95% of all tweets.)  
A search for: 

     [ fraction geocoded tweets ] 

led me to a Quora article by Ryan Gomba claiming that around 1% of tweets are geocoded. Other sources give similar numbers:  "less than 2% of tweets are geotagged" (Laylavi et al., 2016), and it's clear that the number is growing over time, but is still quite small.  In an arXiv.org paper from 2014, three researchers from IBM give the percentage as 0.7%, but also note that they have developed an algorithm that can infer tweeters' home location fairly accurately by looking for other contextual information (e.g., explicit references to locations over time, noting time of day, and correlating posts with other, similar posts, in other social media).  

So it's not a large fraction of all the posts.  Still potentially useful, but it's not going to be anything like a statistically useful sample for most purposes.  
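To get a feel for what those percentages mean, here is a quick back-of-the-envelope calculation. The 10,000-tweet total is a hypothetical round number chosen for illustration; the fractions are the ones quoted above.

```python
# Fractions quoted above; the sample size below is hypothetical.
estimates = {
    "IBM researchers (2014)": 0.007,
    "Quora answer": 0.01,
    "upper bound (Laylavi et al., 2016)": 0.02,
}


def visible_tweets(total: int, fraction: float) -> int:
    """How many of `total` tweets would show up in a geocode search."""
    return round(total * fraction)


TOTAL = 10_000  # hypothetical number of tweets sent from a place
for source, fraction in estimates.items():
    seen = visible_tweets(TOTAL, fraction)
    print(f"{source}: about {seen} visible, {TOTAL - seen} invisible")
```

Even at the most generous estimate, roughly 98 out of every 100 tweets from a place never show up in a geocode: search.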


Search Lessons 


Well, this was a tough one--I spent WAY too many hours on this trying to get various Twitter scraping tools to work.  

1.  Remember that many web services have their own advanced search capabilities.  This is a really useful thing to know: you can often do a lot by using a company's search tool, rather than counting on Google to figure out all of their internals (like geocoding or time/date filtering).  Search for it! 

2.  If you read the URLs carefully, you can often modify them to suit your own needs.  As we see above, with a little sleuthing (and URL surgery), you can get their search tool to do what you want (such as combining a geocode: search with a since: and until: date restriction). 

3. Sometimes tools don't do quite what you want--test them!  The big moral of my multi-hour scraper testing is that you have to test your tools very, very carefully to make sure that you're getting what you THINK you should be getting.  (I spent a couple of hours trying to figure out why the maps weren't correct... only later did I realize that although I gave the scraper my geocode query, it wasn't actually returning those results.  Moral:  Debug your system with data you can easily validate!)  
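As one way to act on Lesson 3, here is a rough sketch of a sanity check you could run on whatever a scraper returns: compute the great-circle (haversine) distance from each result to the search center and flag anything outside the radius. This is not what any of the tools above do internally; the function names are made up, and the San Francisco coordinates are an approximate city-center value I've added for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def outside_radius(points, center, radius_km):
    """Return the (lat, lon) points farther than radius_km from center."""
    return [
        p for p in points
        if haversine_km(p[0], p[1], center[0], center[1]) > radius_km
    ]


center = (37.4220041, -122.0862515)   # geocode: center used in the query above
shoreline = (37.4263042, -122.080828)  # Shoreline Amphitheatre, from the post
san_francisco = (37.7749, -122.4194)   # approximate; added for illustration

# Anything printed here should NOT have come back from a 5km geocode search.
print(outside_radius([shoreline, san_francisco], center, 5))
```

If a supposedly geocoded result set contains lots of points like the San Francisco one, the tool probably ignored the geocode: part of your query.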

Okay... Onward to the next Challenge later this week.  (It will be much easier, I assure you.)  

Thanks to everyone who wrote in.  This was tricky!  

Search on! 

1 comment:

  1. Hello Dr. Russell!

    Yes, it was tricky and at same time fun and helpful to learn and practice new things.

    I like a lot this week Search Lessons.

    I have to say, that never thought about the Google Spreadsheets add-on. That is an awesome idea.

    Also new for me and glad to know it is the "For the curious" part. When wanted to copy/paste and found those encodings, thought something was bad so I modified that part to make it look normal. So now, I know something else.
