Monday, August 11, 2014

Answer: Earthquake data?

I asked a fairly simple question--or at least a question that seems simple when stated like this:  

1.  Can you find data about when and where earthquakes happen and then chart it so we can see location and number over time?  Ideally, you would create (or find) a chart showing number of earthquakes by US state over the past 10 years.  Even better, if that chart could show month-by-month, then we should be able to spot both any seasonality effects AND resolve the question about Oklahoma.  

As several Regular Readers pointed out, the obvious source of earthquake data is the USGS (United States Geological Survey).  But if you don't know that ahead of time, how do you find out? 

The discovery query: 
   
     [ earthquake data ] 

shows us that 8 out of the top 10 results are all links to various web pages in USGS.GOV.  A little drilling into those links shows that the USGS really IS a great source of earthquake data.  (And reading around a little bit reveals that they collect data worldwide; not just for the US or just the Americas.)  

It doesn't take long to figure out that USGS provides a very nice earthquake database that you can pull data from with a single query. 

  http://earthquake.usgs.gov/earthquakes/search/ is the go-to source for this stuff.  

http://earthquake.usgs.gov/earthquakes/search/

I set the date start / end range to 2004-08-01 and 2014-08-01, and the magnitude to be 3, and the geographic region to be West = -128, North = 49, East = -62, South = 20  (that's roughly from the coast off Vancouver to the ocean just east and a bit south of Cuba).  

I asked for the result to be delivered as a CSV file, that way I knew I could then pull it into Google Sheets and do whatever I wanted with it.   (To do that, just click on the radio button that says "CSV" in the upper right.) 

The USGS then runs the query against their backend (which is NOT indexed by Google), and downloads a 1.2Mb file onto your computer.  

I opened up that CSV file in Google Sheets and see something like this:   



Each earthquake is one row of data, time & date in column 1, lat/long in columns 2 and 3, etc.  

Conveniently, in column N there's a place name that looks like this: 

20km ESE of Anza, California

it's just a city, or a location (20 km ESE of...) AND a state designation. (Or, if you look through the data, you'll see lots of "Mexico" in there as well, along with the closest city names.)

For SearchResearchers, a great practice to follow is to ALWAYS provide the metadata with the Sheet. My common practice is to write a comment and insert it on cell A1 (upper left). That tells me when, where, and how I downloaded the original data.


My metadata comment in the Sheet. ALWAYS do this with your data,
even if you think you won't come back to it.  Maybe ESPECIALLY if you
don't think you'll come back--that's when you'll need it the most. 


Then, as you clean, filter, and munge the data, you NEVER work on the original data. Always make a copy.


As you can see, I made several tabs across the bottom.  ("cleaned up data" "NV" "OK" "CA" "Summary")  



If you look at the "cleaned up data," it's just a copy of the original data (all 7,386 rows) with a couple of extra columns.  The most important for us is that I wrote a little function that would pull the name of the state out of Column N.  (Remember, the name of the state is the text following the last comma--that was the magic spreadsheet function I wrote--check out Column F in the "cleaned up data" for the state names).  

NOW I'm ready to do my state-by-state comparison.  

I decided to look at just the top 3 states with the most earthquakes over the past 10 years.  That's California (CA), Nevada (NV), and Oklahoma (OK).  

I filtered the "cleaned up data" by California (click "Filter," then click on the little triangle at the top of the column to select the values, e.g., "California," that you want to filter-by).  

I copied the entire spreadsheet (now filtered to show only California earthquakes) into a new tab (the "CA" tab) using "Paste Special>Values" which then gives me a sheet of just the California earthquakes.  

Repeat that process for the Nevada quakes and the Oklahoma quakes. 

Now I'm set to pull the data for each of the states by year.  So I made a little table, one row / year, and wrote a function to count the number of earthquakes in each table by year.  


That countif(C2:C4000, "2014") function just counts all of the cells in the C column (which is the year of the earthquake). 

I did that in each of the three states sheets, and then copied all of the state's data into the Summary sheet. 



This answers our original question ("Is Oklahoma having more earthquakes?") with an emphatic YES!  In the early 2000's, Oklahoma had only a tiny number of quakes > 3.0 (onesies and twosies).  Starting in 2009, the number went way up (by a factor of 10X), and has been climbing ever since.  Oklahoma now has more earthquakes / year than California (which was the previous leader).  

And note that the chart only goes to August 1 for 2014: the year's not over yet, so expect both CA and OK numbers to grow.  

As you look at the chart, another question leaps out:  What happened in California during 2010?   Short answer:  I don't know.  Another Search Challenge? 

But the consensus seems to be that fracking in Oklahoma seems to be causally connected to the earthquakes that the region is experiencing.  By doing a search in Scholar for [Oklahoma earthquakes] I found a pretty compelling article in the journal Science, "Sharp increase in central Oklahoma seismicity since 2008 induced by massive wastewater injection" that makes the case that the two are connected.  (Of course, we can't infer that just from correlation, but the Science article goes into great detail about what the causal connection seems to be.)  

Just to double-check, I also poured this same data into Google Fusion Tables.  Why?  I know they have a handy "heat map" display that shows the number of earthquakes as a colored overlay on the map.  (My Fusion Table heatmap for US Earthquakes 2013-2014.)  


(Note that "only" the first 1,000 earthquakes are shown here. Those are the earthquakes for 2013 and 2014, up to August 1 of 2014.)  

As is evident, Oklahoma is feeling more than its "fair share" of quakes.  


Search lessons:  Often, searching out the answer is something that can't be done just by looking up a single number or article.  Sometimes you actually have to find the original data and then wrestle IT to the ground by doing some manipulations.  

In this case, I used spreadsheets to transform the data from a purely CSV format into something I could easily use, both as a line graph and as a heatmap.  

And, of course, always track your metadata.  (I say this through harsh personal experience.  I can't tell you how many data sets I've lost because I couldn't figure out how they'd been derived.  Don't let this happen to you!) 


Special Offer:  I'll make a short video showing HOW I did it.  (If enough people write in with comments asking for more details on "how did you do that spreadsheet trick with the USGS data?")  

Search on! 


19 comments:

  1. How did you do that spreadsheet trick with the USGS data?

    Specifically, how would you pull out and graph the months that the earthquakes happen? Would we see the correlation of more earthquakes during the summer months?

    ReplyDelete
    Replies
    1. See my video. The short answer is that I'd use a spreadsheet function (something like =mid(n3, 7, 12) ) to pull out the date. You can then use that same method to pull out month, etc.

      I haven't yet done the summer/month graph. Will get around to it soon.

      Delete
  2. Hello Dr. Russell. Good Morning!

    I tried the way you did it. I always had problems inserting the coordinates. Then looked for another way. I'd love to see the video of how you did it. Your explanation is very clear and a video is a way to keep it in mind easy and to verify if I understood right.

    In your heat map, just a curiosity. I thought Mexico City and Guerrero will had more color in the heat map. Never thought that in those years Chihuahua and Monterrey could had the color.

    I am sure that in this year (2014) heat map will show center of Mexico in red. Too many earthquakes and not fracking...yet. How Fracking Causes Earthquakes

    Also related to that. I recently read that places that had big hurricanes or big amount of water flood; the next year most likely will have an earthquake. [hurricane AROUND(3) earthquakes] results have more information.

    Thanks for the tip of the metadata and of course for the SearchResearch Challenge and lesson.

    ReplyDelete
    Replies
    1. I think that I just didn't include ALL of Mexico in my data pull from USGS. You should give it the proper coordinates for Mexico, re-run the analysis, and tell us what you find!

      Delete
    2. Hello Dr. Russell. I just did a Heat map with the same coordinates. Just changed South from 20 to 18. I need to verify data because no earthquakes are shown for Guerrero or Michoacan in 2013 and 2014 in the USGS data. I have from 2004 to 2014 (filtering shows only until 2012) in those two states only 357 earthquakes.

      Heat map south 18 instead of 20

      I'll keep you posted.


      Delete
    3. I changed coordinates. I used West = -128, North = 30, East = -62, South = 16. I did that to cover all Mexico and not just part as in my previous post in which I did a mistake. Total earthquakes 1946.

      Heat Map

      Histogram

      Delete
  3. Please, show us more details on "how did you do that spreadsheet trick with the USGS data?"

    ReplyDelete
  4. Of course I would like to see how you did your magic. I created a copy of your table & explored the funciton & filter aspects but I don't know how you extracted the state which you may recall is where I got stuck with the challenge. My chromebook is not powerful enough to handle a file that size I discovered. When I downloaded the data originally from USGS I was limited to 2000 rows (not sure what was not included in the limited data & what was excluded). For anyone else facing limitations when I copy Dr. Dan's table to Google Drive it overwhelmed my Chromebook & the only way I found to delete that file was reverting back to the old Google Drive. While the old Google Drive seems to better handle the file, I still don't know how to do the magic.

    The question I have relates to what I was trying to accomplish. Could a scatter chart (or perhaps trend chart) be created showing seasonality ie. # of earthquakes, month of occurrence, over a ten year period and what would it look like? For example we see the numbers in California in the past ten years are significant especially in 2010. Maybe in Fusion Tables, a heatmap using "buckets" to show what month & where. I'd give it a try but because I had issues with the file size in Google Drive I will hold off.

    And as a side note this weekend we had an earthquake & its the first over 4.0 since 2001. Not huge but definitely unusual for our area. http://goo.gl/5TSPK4 http://goo.gl/dS8T5J Interesting timing!

    ReplyDelete
    Replies
    1. If you use Google Spreadsheets, you shouldn't have any "row limit" problems. I've got spreadsheets with more than 10K rows. (And Fusion Tables can be gigantic!)

      Delete
    2. The issue appears to be the USGS application doesn't work well with Chromebook. It won't download more than 2000 on "my device". In fact I am unable to download directly from the CSV option. The only way I got it to work previously is by using the "map & list" option & restricting it to 2000 rows.

      I've discovered the issue with Google Drive is the 'new' google drive is not working well on my chromebook. I can work with the table using the old Drive format. I will get it sorted.

      Delete
  5. Just got this link to "merge values" I am guessing we need to "unmerge". How you can look in a cell and separate with a comma the data, amazing to me. http://goo.gl/GlmPf3

    ReplyDelete
  6. PUBLISH failed yet again..................this is PREVIEW

    Dan, Sign me up to see just what you did with the CSV data. I got that far myself but then had no idea how to proceed from there.

    Did I get this right ? It's not the fracking causing the EQs but the massive waste water injections. Four in particular in OK.

    Good pertinent topic for me.

    jon tU

    ReplyDelete
    Replies
    1. You're probably correct--it's the injection that's causing the issues. (But where does that waste liquid come from? Fracking...)

      Delete
  7. I went ahead and pulled out the month data because I was curious. I made a copy of your sheets.
    For each of the state data I used the SPLIT function to separate Year-Month-Day into separate columns.
    Used COUNTIF for each month.

    Changed the Summary page to graph the number of quakes per month.

    Copy of Dr. Russell's Earthquake Data

    ReplyDelete
    Replies
    1. Thanks Fred! Looks great.

      I was just thinking about Split and Concatenate Functions. Dr. Russell's way to separate data from columns is better and also more not basic. It is great to learn new tricks and functions.

      Rosemary you can use the Split function as Fred mentions to un-merge data from columns. You need =Split (A2, ";") for example in the cell and the necessary clear and free columns next to separate the values. You can change " " for space, period, or other. Here is an example

      Also I learned why I couldn't make it work the LAT/LONG rectangle and why Dr. Russell's heat map doesn't show Mexico. In the first case I entered wrong the coordinates. I didn't add the correct set of values. We need in some Lat and others Long, I just took one. In the second. Mexico City is at 19.43 N.

      Delete
    2. Excellent. That's a nice way to do it as well. (And better than what I suggest above!)

      The SPLIT function is much more robust than using string manipulation functions.

      Note, though , that I wasn't able to figure out a way to use SPLIT to get to the state in the "location" column. (I could do that in AppScript, but I was trying to stay out of that....)

      Delete
    3. I just tried =split(E2,",") on the "cleaned up data" on my copy of your data. It seemed to work, but then began giving errors where the place is just listed as Northern California or just Wyoming when there was no comma to act as the delimiter.

      Delete
  8. Hi Dan -- I would love to learn ""how (you did) that spreadsheet trick with the USGS data" by watching your video.

    -- Mike

    ReplyDelete