Saturday, February 15, 2014

The next day... How many students?

After thinking about this last night, I realized that there was probably a simple solution to the discrepancy... if only I could figure it out.  

The basic problem is that the Census numbers of post-graduates were VERY different from the NSF numbers.  For example:  

     1986 - Census reports that the US graduated 100,000 doctoral students.  

     1986 - NSF reports that the US graduated 30,000 doctoral students.  

That's quite a gap.  Clearly, there's something different in the way they're both counting doctors.  

When I'm stuck on a problem like this, I will consciously try to take a very different look at the problem.  This morning I thought "Let's try Google Books!"  

So I did this query on Books: 

     [ number of doctorates awarded in US ] 

The number 1 hit was for a book called "OECD Science, Technology and Industry Scoreboard 2007."  This seems like a pretty credible source (when you look up OECD, the "About this site" link tells you that it's "The Organisation for Economic Co-operation and Development.. an international economic organisation of 34 countries founded in 1961 to stimulate economic progress and world trade...")  

Interestingly, the page that I read included the phrase "the definition of postdoctorates differs among academic disciplines, universities and sectors.  For the US NSF, postdoctorates include 'individuals with science and engineering degrees  Ph.D.'s, M.D's, D.D.S's, or D.V.M's (including foreign degrees equivalent to US doctorates).'"  It goes on to point out that this includes the natural sciences, mathematics, social/behavioral sciences...  

Maybe the issue is just with the definition of post-graduate degrees.  

To simplify things, I closed all of my open tabs and spreadsheets and started from scratch.  

Aside:  Because I thought this challenge would be pretty simple, I didn't bother writing down the provenance--that is, where the data came from--because I knew where each dataset had originated.  Overnight, though, all that temporary knowledge was lost.  It was a dumb move on my part.  Note to self:  ALWAYS write down where your data comes from!

I refound the Census data and the NSF data and for comparison, I found data from the NCES (National Center for Education Statistics).   

Census data set.  (See table 815 "Doctorates awarded by field of study and year of doctorate")  

NSF data set.   (See table 2.  "Institutions and doctorate recipients per institution: 1972–2012")  

NCES data set.  (See Table 318.20. "Bachelor's, master's, and doctor's degrees conferred by postsecondary institutions, by field of study: Selected years, 1970-71 through 2011-12") 

But NOW I had an idea:  Was there some way to subtract out some of the non-science-and-engineering disciplines to make the Census data match the NSF data?  

This time around I read the metadata more carefully.  See the note \5\ on the NCES data set?  It says: 

\5\ Includes Agriculture and natural resources; Architecture and related services; Communication, journalism, and related programs; Communications technologies; Family and consumer sciences/human sciences; Health professions and related programs; Homeland security, law enforcement, and firefighting; Legal professions and studies; Library science; Military technologies and applied sciences; Parks, recreation, leisure, and fitness studies; Precision production; Public administration and social services; Transportation and materials moving; and Not classified by field of study.

Thank heavens for good metadata that's inserted into the spreadsheet.  (Note to search-researchers:  This is an excellent practice that all of us should adopt.  ALWAYS add metadata!)  

That matches up (more or less) with what I'd read about in the OECD's report.  

I poured these three data sets into my spreadsheet in the tab labeled "Postgraduate numbers."  Note that I excluded Master's degrees and focused solely on PhDs just to debug this problem.  

That description above (of \5\ ) is the Other I'd been looking for.  

Now subtract all of the PhDs labeled as "Other" in the spreadsheet.  (See column J in my spreadsheet: here I've done the subtraction and then copied these values over into the NCES column on the other tab.)  

Then, another quick chart to compare values and... 

Voila!  The green dots and line are the adjusted NCES data, subtracting out the "Other" category.  

Now you can see it:  The Census data is a count of ALL the PhD degrees issued in the US in any year... of all flavors and kinds.  The NSF data is more-or-less the same as the NCES count of PhDs MINUS all of the PhDs in the "Other" category.  

There still is a slight variation, but it's close enough that I'll accept these numbers as the full count of all PhDs (the Census data) and the science-and-engineering PhDs (the NSF or adjusted NCES numbers).  
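For readers who would rather script this check than do it in a spreadsheet, here is a minimal Python sketch of the same subtraction.  The counts below are made-up placeholders, NOT the real Census, NCES, or NSF figures, and the names `adjust` and `relative_gap` are hypothetical helpers invented for this illustration.  

```python
# Placeholder counts for illustration only; NOT the real agency figures.
nces_total = {2010: 1000, 2011: 1100, 2012: 1200}  # all doctorates (hypothetical)
nces_other = {2010: 700, 2011: 780, 2012: 850}     # note \5\ "Other" fields (hypothetical)
nsf_count = {2010: 310, 2011: 315, 2012: 360}      # NSF doctorates (hypothetical)


def adjust(total, other):
    """Subtract the 'Other' category from the total, year by year."""
    return {year: total[year] - other[year] for year in total}


def relative_gap(adjusted, reference):
    """Fractional difference between the adjusted series and a reference series."""
    return {
        year: (adjusted[year] - reference[year]) / reference[year]
        for year in adjusted
    }


adjusted = adjust(nces_total, nces_other)
gaps = relative_gap(adjusted, nsf_count)

# Print year, adjusted count, reference count, and the gap as a signed percentage.
for year in sorted(adjusted):
    print(year, adjusted[year], nsf_count[year], f"{gaps[year]:+.1%}")
```

If the gaps stay within a few percent across the years, the two series are probably counting the same thing; a large or growing gap would suggest the category definitions still differ.  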

Bottom line:  For 2012, around 47K science or engineering PhDs were awarded in the US.  That contrasts with around 170K PhDs for ALL studies combined.  

(If you're interested in this kind of topic, I highly recommend the NSF's Report on PhDs in the US.  It's full of fascinating details including analysis of ethnic identity of the students, an analysis on their parents' level of educational attainment, and how they fund their studies.)  

Search lesson:  There are many here, but I'll just point out three of them.  

1.  Don't do data analysis when you're tired.  (That was the genesis of my error, and of my inability to see what was going on with the data.) 

2.  Read the metadata carefully.  Learn to love reading all those little footnotes and marginalia.  They're often key to understanding the data as a whole. 

3.  When you create data (or even temporary intermediate data sets) be sure to add in your own metadata.  Do NOT let the metadata become separated from the data.  Use the note trick (seen above) to put the metadata into the data tables themselves.  If I had done this while making my temporary worksheets, I wouldn't have had to shut everything down and restart.  C'est la vie.  
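Lesson 3 also applies if you keep intermediate data in plain files rather than a spreadsheet.  The sketch below shows one possible way to keep provenance notes inside the same CSV text as the data, as `#`-prefixed lines at the top.  The function names, the note wording, and the single data value are all hypothetical placeholders, and the `#` comment convention is an assumption that not every CSV reader understands.  

```python
import csv
import io


def write_with_provenance(rows, header, notes):
    """Return CSV text whose first lines are '#'-prefixed provenance notes."""
    buffer = io.StringIO()
    for note in notes:
        buffer.write(f"# {note}\n")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def read_with_provenance(text):
    """Split CSV text back into (notes, header, rows)."""
    lines = text.splitlines()
    notes = [line[2:] for line in lines if line.startswith("# ")]
    data = [line for line in lines if not line.startswith("#")]
    parsed = list(csv.reader(data))
    return notes, parsed[0], parsed[1:]


text = write_with_provenance(
    rows=[[2012, 123]],  # placeholder value, not a real count
    header=["year", "doctorates"],
    notes=[
        "source: (name of the agency and table goes here)",
        "retrieved: 2014-02-15",
        "adjustment: 'Other' category subtracted",
    ],
)
notes, header, rows = read_with_provenance(text)
```

Because the notes travel inside the file itself, closing every tab overnight no longer loses the record of where the numbers came from.  Note that values come back from `csv.reader` as strings, so convert them before doing arithmetic.  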

And overall... 

Keep searching! 


  1. regarding numbers, a little broader snapshot:
    PhDs globally
    a side note from looking at the NSF site:
    Computational science
    meditating on metadata
    thanks for the thorough exercise and explanation.

    1. Hi Remmij, I like your video. It is very useful, clear and helps to understand better Metadata. Thanks for sharing.

      Dr. Russell, this question is dumb, I know. What metadata do you recommend we add to our projects? I saw your notes in your spreadsheet. Any other tips?

      About =importdata("url") function. I learned with your SearchResearch Challenge : Answer: Have the Tour riders really gotten faster over the years? After reading it, searched and found this other function. Thanks for the advice of its use in more complex works. I have been trying only in small ones,

      Excellent week!