Friday, February 14, 2014

Answer: How many students, how many years of school?

The challenge here comes in two parts.  

First, find the data.  Second, create a plot of it in the graphical form of choice.  

To create this chart: 

The number of people in the US with this amount of education, by year.  (In thousands.)  

I did a relatively simple search: 

     [ education data table by year ] 

In my SERP, the 3rd result was the link to the website.  A quick click took me to the data tables for "Educational Attainment," a phrase that I learned was the correct description of what I was searching.  (I did also check out the website "National Center for Education Statistics," but couldn't figure out how to download decades worth of attainment data.  I gave up on that pretty quickly and turned to 

The link labeled "Time series data" looked pretty reasonable to me, and when you click through, you see their graphs and links to their data sets.

You can see there's a graph (on the left) that's pretty much what I asked for (and it would have been okay to stop there).  But I like to get the data for myself, just so I can check what other have done, and possibly re-analyze the data in my own way.  In my case, I summarized all of the post-graduate numbers together.

So I right-clicked on the CSV file and downloaded it to my desktop.  (It's control+click on a Mac.  If you're an Excel user, right-clicking on the XLS file is what you want to do to download it.) 

This copies the CSV to my desktop, and then I can create a Google Spreadsheet and import the data into it.  

I choose to create a spreadsheet rather than a Fusion Table (as Rosemary did) because I didn't think I'd need any of the advanced features of FT.  (It's also easy to import a spreadsheet into a Fusion Table, so it wouldn't be hard to change in any case.)  

I just left all of the defaults about "Import action" and "separator character"to be what they are at the start.  The import process usually gets it right, and in this case, you can see a preview on the right hand side.  

Then, when you click the blue "Import" button, your spreadsheet will look like this.  

My educational data spreadsheet (if you'd like to look at it).  

And at this point, you just select the data (from cell C11 to I69) and click on Insert>Chart 

Note that I've highlighted the "Use row 11 as headers"  AND the "Use column C as labels" buttons.  This is what sets up the chart to plot properly from left to right and create the colors index on the right side of the chart.  

Then all that's left is to modify the labels, and you've got the chart.  

Extra credit: Remember that I'd hoped to find the students with post-graduate years of study as well? 

I was able to find data tables of "degrees granted" by year, but not number in the same format as in the previous data set.  Remember:  those numbers in the chart above are the TOTAL number of people, by year, with that level of educational attainment.

(Yes, I could have looked up the US population for each year and divided... But I took the lazy way out, since I was really curious about the actual numbers of post-graduate degrees attained in any given year.) 

And here's where I went down an interesting rat-hole.  

I did the query: 

     [ number of doctorates awarded US ] 

and found what looked like the perfect answer at the National Science Foundation.  They have a data table "Graduates and post-graduates in science and engineering" (but which includes data on humanities).  They have numbers from the 1940s, so I thought it would be a great resource.  However, when I plot all of the numbers (from both Census and NSF), I see a plot like this:  

It LOOKS like NSF is underreporting (by a factor of 3) OR that the Census is over-estimating the number of degrees awarded each year.  

What's going on? 

I got the NSF data from their data site: 

And I got the Census data from their data site.

I've read both of these data tables for a while now, and haven't yet figured out why they're so different.  

The NSF data comes from their survey of graduating PhDs, but they claim a 98% return rate on the surveys.  That missing 2% won't make up the difference in the data.  

Search lesson:  If you want ALL of the data in a time series, search for "time series"--that's the coded language for data collected over time.  

Perhaps most importantly, be prepared to spend time reading over the metadata (that is, the information about the data set) that describes what the data is and how it was collected.  

And when you find discrepancies in the data, embrace the differences and try to figure out what's going on.  (I put both data sets onto a single chart, just to see the differences.)  

Status:  I'll keep working on this over the weekend, but if you figure out why the data is SO different, post a comment.  We'll figure this out together!  

Searching on.... 


  1. This comment has been removed by the author.

  2. If you follow the link from NSF
    (bad link) but I went to and searched Doctorate Degrees. It looks like NORC does the survey under the name SED. Here's the result of my search

    I will have a look around this site but thought it looked interesting for our purposes at least perhaps a starting point.

  3. Status: diffīdent data — trying to embrace… but
    "And here's where I went down an interesting rat-hole. "
    … and that's where I got derailed with the phraseology… I'm lacking search focus - need to find a master teacher along the lines of Po or Kahn (not the *Academy* one;
    although, that might be a start.)
    rathole v rabbithole
    urban's take -
    alt book rabbit
    in the wild:
    ID holes
    I'm a fan of Rat

    …must work on discipline, focus & tenacity, like RM… =) searching on… but drowning in the sea of data… oh look, a fish graph -- a little like trying to refloat the Garden City with data pontoons.

    fwiw: a 2007 Dr. (dissertation, Student Retention at the Community College: Meeting Students' Needs): Dr.JJ-B

  4. Hello, Dr. Russell.

    It is very interesting your post and how you create chart. "Time Series" is a great tip, too.

    I just want to give a tip to create spreadsheets in Google Docs. Instead of downloading (CSV) and then uploading it again; we can use =importdata("url")

    I am trying to find more information about the data you are looking. I have some issues due to definitions; for example:
    - doctorate recipients and others that I lost searching. Maybe they mean something different in USA in the way they add in a group.

    [Survey of Earned Doctorates statistics]

    Other links that maybe can help are:

    Data Tables this link to next.

    TABLE 1. Doctorate recipients from U.S. colleges and universities: 1957–2012

    Doctorates Awards

    Hope it helps

    1. Thanks for the =importdata("url") tip. I've used that before, but on complex pages (with many tables) it will sometimes grab the wrong one. Still, an incredibly useful thing to know. (I just wanted to teach a method that would always work for everyone.)

  5. 1957 2012 Doctorate recipients
    Sampling Data Monthly 50,000 households
    Difference between Educational Statistics Fact Sheet

    Are the numbers from the Census Bureau a monthly sampling of households? I'm thinking it is. 50,000 per month x 12
    The NSF numbers source is from the students through surveys which I would think is more accurate.

    I wanted to bounce this off you to see if I am heading in the right direction.

  6. Okay here's a NSF report for 2012 giving lots of details-

    The numbers at Census Bureau is where the issue lies. I will look at that site to try and understand the numbers better.

  7. This comment has been removed by the author.

    1. Removed comment because sampling was irrelevant.

      NFS/NCES would suggest that current annual doctorates obtained are about 50,000 in the last reporting year. I have found several references to this number. But the Census Bureau has that number around 170,000 in their table. Just to add confusion I just found another table sourced from NCES reporting in 2009 Doctorate degrees at 67000 (compared to 2009 to CB at 158,000). Links below.

      However the following statement gave me a clue as to what might be happening here from this website link

      “The population eligible for the 2012 survey consisted of all individuals who received a research doctorate from a U.S. academic institution in the 12-month period ending 30 June 2012. The total universe consisted of 51,008 persons in 418 institutions that conferred research doctorates in 2012.”

      That number 50,000 is those eligible for the survey. It isn’t the # of degrees achieved.

      So if I have this right NSF contracted NORC to survey via SED to obtain data from eligible graduates.

      This exercise certainly points out the need for not only noting sources for all your data, but as well understanding the purpose and the population. The Census Bureau I can only assume at the moment included all doctorate recipients. We would need a breakdown of the 170,000 and learn how the data was obtained. I hope we're a step closer.

    2. Good tracking. This is what I was working on as well (who actually DID the data collection, and how many copies of it are out there). I think your *provenance* is correct.

  8. You spurred me on to try one more time. Have a look at this pdf.
    Specifically Table 46.1 . As we try to breakdown 170,000 I mentioned in my previous post about 158,000 reported in 2009. Well this table confirms this number. It is made up of Public Institutions and Private Institutions (non-profit and profit). I would venture to say that if we had 2012 numbers they would be our 170,000. The note also mentions those included were institutions that participated in specific program so we still don't have a total number. Would you agree?