Tuesday, July 24, 2018

SearchResearch Challenge (7/25/18): The Mystery of the Salzburg Stream--does it flow uphill?

I've been in Salzburg, Austria
for the past week... 

It's just as lovely as you might imagine--the Alps surround a large river valley, it's endlessly green, and the Altstadt ("old town") is full of medieval streets, lined with churches and shops that have been around for hundreds of years.  And yes, the hills are alive with the sound of music--at least I went to enough concerts to convince me of this.  

The surrounding area has a web of bike/foot paths and streams that seem to run every possible way.  They're lovely, but it's a little confusing when you first arrive, although you learn the paths quickly.  That's when you start to notice the little things--the things that make you say "What?"  

My first big surprise (after I got to know the place a bit) was the peculiar behavior of a stream that I passed often on my walks.  As I walked to the foot of a pathway that climbed up to the castle wall, I had to pass over a small stream.  It was a beautifully flowing stream, just about the platonic ideal of a stream. 

One time while I walked this path I paused for a moment, looking down at the water and noticed that it was flowing from south to north.  That was so unexpected that I dropped a leaf into the water to verify that what I thought I was seeing was actually what was happening.  Why was it so unexpected?  

It looks like the water is flowing uphill!  What?  

The little bridge I was standing on is clearly downhill from the where the stream starts.  It looks for all the world like the stream is running up the hill towards the castle!  

Here's the place (from Google Maps, in 3D mode).  In reality, it's clear that the left part of the picture is uphill from the right part.  

The white building at the top of the image is the old city castle, the Festung Hohensalzburg (that is, the "high Salzburg fortress").  It sits atop the Festungsberg, a hill that rises up over the historic center of town.

Just below the fortress is a patch of forest, and below that is a meadow.  In the middle of the meadow is a stream that ends in a clump of trees and brush.  

Actually, that's what I noticed the first time I walked past--this looks like the source of the stream, the spring from which the stream flows.  As I passed I thought to myself, "Oh, there's a spring, the source of one of the many brooks..."  (Since I grew up in LA, I've always thought of springs as being somehow magical.  I never saw one until I was in my mid-20s, so I notice them.)  

Here's the map of that place.  See the blue line of the stream?  That dotted line is the pathway where I'd walk into the Altstadt.  This all makes sense--it follows exactly parallel to the path, and you can see the little bridge on Brunnhausgasse where I crossed it.  The map suggests that this is the spring source.  

Here's the same scene in the satellite view.   It sure looks like the source of a spring.  

As I walked past, I took a quick photo, just so I'd remember this for SRS. 

Location: https://www.google.com/maps/@47.7945965,13.0444718,18.33z 

But... as you see in this map, the water is flowing from south to north!  How is that possible?  The slope of the hill is increasing from south to north.  

This leads me to this week's SearchResearch Challenge.

How is it that this stream in Salzburg is apparently flowing uphill?  What's the real story behind this gentling flowing brook? How is this possible?  

Once you figure it out, be sure to let us know WHAT you did to come to this understanding.  (I have to admit I was rather surprised when I discovered the answer.) 

Hope you enjoy this Challenge!  

Search on! 

Monday, July 23, 2018

How to get SearchResearch posts by email...

Getting your SearchResearch
posts by email... 

IF you want to get these SearchResearch posts by email (rather than by checking in every few days), here's how to do this.  

 (1) Log in with your email account (your-name@gmail.com ...or whatever it is)

(2)  go to the Google Group page to sign up for the email version 

(3) Click on the sign-up button as shown below: 

And, if you want to unsubscribe, just reverse this.  (That is, come back to this page, and click unsub.)  

Hope many more of you join the email group!  

-- Dan 

Sunday, July 22, 2018

Answer: How do you plot out data by region? The case of regional boundaries.

Our goal is to plot data by region.. 

.. an immensely handy thing to be able to do if you want to do basic analysis of geo-localized data.  For instance, here's a map of all the counties in the US, with color mapping out the median income for each county: 

Figure 1. Median household income by county (2012). P/C Wikimedia.

Our Challenge for this week was to do something very similar--let me remind you of our Challenge:  

1.  Can you make a map of the median household income for each of the MSAs in the United States?  (Or equivalent statistical areas, if you're from another country.)  

As I pointed out, we can break down this problem into a few pieces: 
1. Find a data set for the median household income that's organized by MSAs.   
2. FInd a data set for the MSAs shapes.  
3. Find a visualization application that can ingest both the median income data and the shape of the MSA and create a map similar to that above.  
Here's what I did for each of the three steps.  

1.  Find the data set.

My query to find the MSA income data:

     [ median income united states data 2017 intext:msa ]  

This query led me fairly quickly to the FFIEC  data tables (Federal Financial Instutions Examination Council).  
Remember that the intext:  operator requires that the term "MSA" be part of the document.  That's exactly what I needed to do in order to find this.   
Figure 2. The data for median family income by MSA.

In particular, it wasn't hard to find that https://www.ffiec.gov/Medianincome.htm  had an Excel spreadsheet with median incomes where each row has the income and the code for one of the MSA regions.  

This gives us the following spreadsheet (I downloaded the EXCEL file in row 1 above): 

Figure 3.  The data table opened up to show the first 11 values of median income. Note the NSA/MD FIPS CODE NO.  (That's the code for the particular MSA region.)  

 I opened this XLS, then deleted row 1 (to get rid of the not-very-useful heading information) and then deleted column C (to get rid of the data from 2015 that I don't want to show).  

Then, I opened that Excel file in Fusion Tables because I happen to know that Fusion Tables has a way to import both Excel data files (like the one above) AND shape-files (like those that define the MSAs).  

That Fusion Table now looks like this with just the GEOID (that is, the MSA number), the MSA area name, and the FFEIC estimated median family income (which is what we wanted to plot): 

Figure 4.  The data from the XLS spreadsheet imported into a Fusion Table. 

Now I need to find the MSA boundaries.  How can I do that?  Easy.  I did a search for KML files (which I know Fusion Tables can read) with this query:  

     [ census metropolitan cartographic boundary KML files ] 

On that page, I found the  Core Based Statistical Areas (CBSAs in KML format):  cb_2017_us_cbsa_500k.zip  This is a ZIP file (compressed), so you have to download it and decompress it, which gives you a bunch of   .KML  and .XML files.  Here, I'm just interested in the KML files.  They're the MSA "shapefiles," each of which outlines each of the MSAs.  

Now I want to somehow grab all of these MSA shapes and line them up with each MSA median household data.  To do this, I'm going to import these shapes into a second Fusion Table, then merge the two tables together.  I'll make sure that both tables have rows that can line up and can be merged together because they have the same name and same values. 

Here's how I did it:  

We've already got one Fusion Table with all of the median incomes and a GEOID code telling us what MSA area that data represents.  Now, make a second Fusion Table to hold all of the KML regions, each of which has a code that's the same as the GEOID code.  

Figure 5. The KML (shapefiles) for each of the MSAs in the US.  

You can import the KML into it by just opening the KML file.  This should have a nice map with all 462 MSA regions on a map of the US. 

Once you open the cb_2017_us_cbsa_500K.kml file in Fusion Tables, you should see this: 

Figure 6: Fusion Table after I loaded the KML file with all of the MSA shapes. 
This shows all of the MSA shapes in the KML file.  Notice anything odd? 
If you click on the Rows1 tab, you'll see each row that defines one of those MSA shapes.  (The actual KML is in the far right-hand column.)  

Figure 7.  The MSA Fusion Table with the GEOID in the 4th column,
 and the geometry of each MSA in the rightmost column

 Now, let's step back for a second. 

We have 2 Fusion Tables.  One with the median income and an MSA region (both with the name and the GEOID for that region).  The other table has the KML shapes and a GEOID.  

Now I want to MERGE the MSAs map shapes into the MSA income data where the rows have the same GEOID (that is, the same MSA region).  

Notice that the column labeled GEOID has a unique identifier for each MSA region.  Notice that the Median income data table also has a column labeled “MSA/MD FIPS CODE NO.” – that’s the same thing. 

Now, START with the Income Fusion Table.  Then select File>Find a Table to merge into this one.  The idea is that we'll pull in the MSA shapes (from the MSAs (2017) Fusion Table) and line up the matching rows with the same GEOIDs.    

Figure 8.  How to merge the household income Fusion Table with another table.
(That is, to "fuse" them together.)  

Select a table (with matching columns):

Figure 9.  Select the corresponding spreadsheet (in this case, the first one with the income data). 

When you do this, it will align the rows with the same data in the GEOIDs field, and give you something like this.  

Figure 10.  The default MSA median income data shown as a color of green.  (This is the default coloring pattern.)

There are a couple of problems with this map. 

(1) The colors are all kind of the same.  The default color scheme is a gradient of greens, but that doesn't quite distinguish the different values very well.  We can fix this by changing the colors assigned to each income range.  To do this, you have to edit the colors assigned to each income range.  To do this, you have to change the settings for the "map feature style"

Figure 11.  How to change the polygon fill pattern to show
different income ranges as differing colors. 

What you'll want to do is to change the range / color assignment.  To do this, just click on the "Buckets" tab.  That menu looks like this: 

Figure 11.  Changing the buckets and the corresponding colors. 

This menu option lets you change the bucket sizes (and notice that you have to scroll to get to all of the buckets).  For each bucket, you can pick your own colors.  Here's another version of the median income map.  You can easily change the colors to emphasize (or de-emphasize) whatever data features you'd like.  

Figure 12.  The new MSA map with a somewhat better color choice.  


(2) The biggest surprise to me was how many MSAs seem to be missing, including some close to where I live.  (See the map below...)  In particular, how is it possible that the greater San Francisco area is missing?   

Figure 13.  A closeup of the map.  Some MSA regions seem to be missing!  

What happened to San Francisco??  

There's some kind of missing data.  This was so striking, I had to check.  

First, look up above at the map that shows all of the MSAs ("MSAs (2017)" with all of the red blobs).  There are a LOT of open spaces in the map.  Surprise!  The MSAs do not cover the entire United States.  I didn't realize this when we started this Challenge, but the MSAs are an incomplete cover of the US landscape.  That's observation 1.

Second, it looks like some MSAs are just plain missing in the original spreadsheets.  (See Figures 3 and 4.  We can see that the MSA region is in the dataset for San Francisco, but when we look in the median income database, we see it's in there.  

I puzzled over this for a long time (like... 4 hours!) before I figured out what happened.  The GEOID code for the San Francisco MSA was wrong in the map file.  I noticed this because I finally pulled out the records for both the MSA spreadsheet and the one line of the MSAs Regions KML file.  In the "median income" file, the GEOID was 41884, while in the "KML regions" file, the code was 42034!  

Basically, a Fusion Table takes two (or more) data tables that have a column in common.  In this case, I used the GEOID code to line up the data sets.  As you saw, it worked pretty well.

But if a code is wrong (as it was in the MSA KML mapping file), then only half of the data transfers--in this case, the KML region doesn't get copied over.  Damn.  THAT took a while!  

To fix it, I just opened the "incomes" spreadsheet, and fixed the GEOID number so it would line up with the KML regions file.  And, voila!  The closeup of the San Francisco region now looks like this (notice that I changed the colors once again... I was experimenting with different settings):  

And yes, the Median Family Income for the Bay Area in 2017 was $113,100 / year.  

Naturally, after I did all this, as I was cleaning up my tabs I found a really great example (with videos!) on how to do this.  If you're interested in this kind of data understanding, check out Ken Blake's instructional web page and videos.   (It's really done quite well.)  

SearchResearch Lessons 

Well, we didn't get a lot of comments this past week.  I don't know if it's the summer slump, or if people just didn't find this kind of demographic analysis interesting!  (Let me know in the comments.  If nobody speaks up, I won't do many of these in the future.)  

There were a couple of things to take away from this lesson: 

1.  You have to know what your tools can do.  In this case, I had a headstart because I knew that Fusion Tables could do something close to this (especially the mapping part).  Of course, one of the SRS skills is toolfinding.  (When you find yourself trying to do something complicated, check for a tool that can help you out.)  

2. Remember how to use intext: to require that a particular key search term is on the page.  When you really need that term to be there, intext: can really, really help. 

3.  Not all data sets are error free!  Yes, I know this data set came from the government... but they have typos too.  And remember that as data sets grow larger, it gets increasingly difficult to spot typo-type errors.  An excellent practice is to spot check a number of different locations (and to check with a filter, if you can).  I ended up finding several more coding errors in that data set. 

Postscript:  Sorry that this week's SRS post was a little delayed.  I was traveling to a wonderful event--the Salzburg Global Seminar--and was busy giving talks, and talking with a bunch of very smart and completely fascinating people.  I'm actually flying on Wednesday, but will try to post the Challenge before we lift off.  (But if I don't get it out, you'll know what happened.)  

Wednesday, July 11, 2018

SearchResearch Challenge (6/11/18): How do you plot out data by region? The case of regional boundaries.

It's time for a Challenging Challenge!  

As you know, every so often I like to mix up the SRS Challenge with something that's a bit more in-depth.  (And if this is overwhelming, just take the week off--I'll be back next week with an easier one.) 

The Setup:  If you read the news these days you'll see all kinds of claims about various kinds of data.  In an earlier SRS post we talked about immigration rates, and found that the data is a bit complicated, but you can figure it out.  

One of the things you'll see in the news are charts like this one: 

.. by COUNTY (not MSA or CSA).

This is the "Median household income in 2012 by county."  This chart is from Wikimedia and shows the median income by county in the US.  Of course, counties are sometimes just arbitrary boundaries.  They may or may-not make sense.  (For instance, Los Angeles County has around 10M souls living inside the county, while only 600K people live in Providence county, Rhode Island.  That's a factor of 16X difference in size.) 

There are many ways to draw regional boundaries that make some kind of sense. For instance, gerrymandering is the practice of drawing political boundaries to give a particular party more (or less) voting power.  

There are commercial regional boundaries (such as the "Designated Market Areas," aka DMAs, define by the polling / survey company Nielsen).  These regions correspond to media markets.  
More often, though, people who are looking at data use either "Metropolitan Statistical Areas" (MSA).  An MSA is “is a geographical region with a relatively high population density at its core and close economic ties throughout the area.”
For instance, the San Francisco-Oakland-Hayward Metropolitan Statistical Area (with a population of 4.5 million) and the larger San Jose-San Francisco-Oakland Combined Statistical Area (8.4 million) are both near where I live in Silicon Valley.  
A slightly different version of the MSA is the "Combined Statistical Area" (CSA), whi is composed of "adjacent metropolitan (MSA) and micropolitan (╬╝SA) regions in the United States and Puerto Rico that can demonstrate economic or social linkage."  (This is primarily defined by commuting patterns.)  
A map of the combined metropolitan and micropolitan statistical areas of the US looks like this: 

I'm telling you all of this background because it leads to today's Challenge.  
1.  Can you make a map of the median household income for each of the MSAs in the United States?  (Or equivalent statistical areas, if you're from another country.)  
That is, you'll need to: 
A. Find a source of recent data that's organized by MSAs.  2017 would be best, but you should look for the most recent data. 
B. Find a visualization application that can ingest both the median income data and the shape of the MSA.   
C. Figure out a way to create a visualization of the US MSAs that color-codes the income.  It should look a bit like the above example, except with the income level determining the color of the MSA region.   
This is a bit of a Challenge, but it doesn't require programming.  (If you want to program, be my guest, but this doesn't really need it.)  
And, if you really don't like MSAs as the boundaries of map regions... find a different one, and tell us why you like yours better.  
Once you figure out how to do this, you'll have the means to do your own analysis, looking at data in your own way.  
Search on!  

P.S. This is the kind of thing that Data Scientists do all the time.  With this Challenge, I'm hoping to instill some of the skills and values that Data Scientists bring to the job every day.  Hope you have fun with it.  I'm looking forward to your comments!