Wednesday, August 3, 2022

Answer: What's a large US city with very low population density?

Let's work backwards...   


... and answer the last SRS Challenge first... you remember, the one about finding the largest US city with a very low population density!

I'll answer the previous Challenge ("rusty thing I found in the woods") before next week and get us back on track.  

If you recall, our Challenge was:  "what large US city has the lowest population density as of 2020?"  

There are tables one can find that will tell you one answer, but I'd like you to solve this Challenge in a more direct way--a way that will teach you how to download data directly into a table and then manipulate it yourself.  

Can you do this hands-on data manipulation Challenge? 

Here's what I want you to do:  

1.  Search for a table of the largest US cities by population.  You'll want to find a table with at least 330 entires in it.  

2. Download that table into a spreadsheet. 

3. Compute the population density (if you need to... it might be a column in the data set).  

4. Sort the table by density, and then tell us what the city name is!  

Your table should look like the one above (hint: I got it from Wikipedia, but you can find your own source if you'd like--the diversity of data sources might be interesting).  

Here's what I did... 

First search for the Wikipedia table.  My query was: 

     [ wikipedia table largest US cities population ] 

which led me to the Wikipedia table List of US cities by population.  It's a classic Wikipedia entry with all the standard data disclaimers and information (e.g., This table lists the 331 incorporated places in the United States (excluding the U.S. territories) with a population of at least 100,000 on July 1, 2021, as estimated by the United States Census Bureau.  The table displays: (a) The city rank by population as of July 1, 2021, as estimated by the United States Census Bureau, (b) The city name, etc etc etc.)  

If you look carefully, you'll see that the columns can be sorted by clicking on the sorting widget at the top of each column in the table.  (You need to recognize that that's what these widgets do--it's part of your SRS visual literacy: recognize and understand what UI widgets are and what they do.)  


If you click on that widget, you can sort by population density: 


And voila, you've got the answer.  That's the fast and easy way to get to the answer.  

BUT... the point of the Challenge is to get you to figure out how to download this table and then manipulate it to get to the same answer.  (That is, the pedagogical point of this Challenge is to learn how to download data tables from the web and then how to work with them.)  

Do you know how to pull data tables off the internet and into your favorite spreadsheet?  

Well, the obvious SRS way to find out is with a search: 

     [ how to import data tables into Google sheets ] 

which will lead you to a number of sources, including this well-written and extensive post by Parul Pandey about Importing HTML tables into Google Sheets or this YouTube video from Teacher's Tech about How to Import Data from Webpages into Google SheetsThese are excellent resources, and for full details about how to do this, I recommend those page.  

For our purposes, I'll cut to the chase and point you to my Google Sheet with the population data in it.  This sheet looks like this: 

There are a couple of things to note here.  First, cell A1 has the magic function in it: 

     =ImportHTML (URL, "table", 5) 

which says to import the 5th table of that web page (the Wikipedia link) into the sheet as a table. I had to experiment a little to figure out that it was the 5th table, but I guessed it was #5 on the second try.  It's easy to just keep trying until you get the right table.  (Look at the Wiki page and count down from the top of the page.)  

This imports the 5th table into that location.  If you look back and forth, you'll see it's a complete copy of that data table.  

That's pretty straight forward.  



Next thing to notice: I put that table into Tab 1 of the sheet ("Datatable import"), and then did all my manipulations on a COPY of the sheet that I made in Tab 2.  If you look at Tab 2 ("Cleanedup data"), you'll see that it's where I did my cleaning up.  This is a good practice to follow--don't muck up your original data set as you're exploring.  

Note that when the data is imported, it's often imported as TEXT data, and perhaps not the numeric data you're seeking.  

In particular, column J is the 2020 population density in people/km2, and it's a text field, not a number.  So I initially wrote this formula to extract the number from column J.  


And that LOOKS right.  But as we know, appearances can be deceiving.  I thought it was right, but when I sorted the column, I noticed that the sequence was bizarre.  I saw patterns in the data that looked like this: 


... which is clearly very wrong.  The problem is that column O is all TEXT... and sorting text like this gives you a sort where 7,681 precedes 706. 

Once I realized that, I fixed up the extraction formula to give a real numeric value.  This is that formula (with the extra  =value(...)  in it):  


NOW I can sort by Column L ("Density Value") and see the right names of cities appear at the top: 



So... there's a quick and easy way (just use the built-in sort on the Wikipedia page), and also a more sophisticated way to download the data to your own sheet.  Of course, this then allows you to do more things with it--say, create a chart like this: 



SearchResearch Lessons


Two key points from today.  

1. Use the ImportHTML function to pull data tables from web pages into Sheets.  Incredibly handy when you need to get your hands on the actual data. 

2. ALWAYS check that your data is what you think it is.  In this story, I mistook what LOOKED like numeric data for numbers, but noticed that the sort order was all messed up.  When I converted those text numbers to actual numbers, sorting suddenly started working.  

I guess the summary is, as always, pay attention to what you're doing.  Keep asking yourself, does this make sense?  And when it doesn't, dive into full-on SRS mode and figure out what's happening.  

Search on!  





P.S.  Fairness requires that I point out that Microsoft Excel has a VERY nice import data function built into it.  It even lets you browse through the tables in the source web page rather than trying to figure out the number of the table to import.  See Importing Data into Excel from the Web.  It's actually very handy.  

2 comments:

  1. Out of topic. And, connected with previous SRS Challenges

    Today, I found this article about Chankillo in Peru. Article has beautiful photos. A link to YouTube video and mentions the site is a WHS. I thought it was from 2022. However, the article is new. The award is from 2021

    https://tectonica.archi/articles/el-complejo-arqueoastronomico-chankillo-patrimonio-de-la-humanidad/

    Searched World Heritage Convention. UNESCO

    And found this list. Latest 2021

    https://whc.unesco.org/es/list/

    Ctrl-F Mexico and found:

    Primeros monasterios del siglo XVI en las laderas del Popocatepetl. Not exactly in the Popocatépetl volcano, as laderas could make people think.

    https://whc.unesco.org/es/list/702#top

    Searched for them and found a YouTube video from INAH. They are very beautiful.

    Monasterios en las laderas del Popocatépetl

    https://m.youtube.com/watch?v=OV69n3eFuzc



    ReplyDelete