Wednesday, September 2, 2015

Search Challenge (9/2/15): How to search in a scanned document?


If your research is like mine... 

... you fairly frequently find a document that's from another era.  It doesn't even have to be that long ago before you find yourself dealing with infernally annoying crufty docs.  

For instance, when I'm searching, I fairly often find a document that was scanned as an image. It's great to have the document in the first place, but as a scan, it's often less than completely useful.  

Here's an example.  A document I found in one of my research studies was this excellent paper that's available only in a scanned PDF format.  (Here's the LINK to the paper.)  When you open it up, you'll see sections that appear like this:  


Of course, our usual Control-F / CMD-F tricks don't work on this kind of scanned doc, and since this is a long paper, that makes it much harder to read.  In particular, what I WANT is this--something I CAN use Control-F on:  



Our SearchResearch Challenge for this week is meant to give you an additional powerful tool for importing scanned documents and making them findable.  

1.  How can you transform this document (LINK) into something that you can search within? 
2.  Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper?  (This was my original search task--finding interesting papers about how people read multiple documents at the same reading session. That's how I found this paper.)  
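
For the counting half of the Challenge, here is a minimal sketch in Python, assuming you have already gotten the paper's text out of the scan and into a plain string. The function name `count_phrase` and the sample string are made up for illustration; they are not from the paper. Extracted text tends to break phrases across lines and hyphenate words at line ends, so the sketch rejoins those before counting.

```python
import re


def count_phrase(text: str, phrase: str) -> int:
    """Count case-insensitive occurrences of `phrase` in extracted text."""
    # Rejoin words hyphenated across a line break: "docu-\nments" -> "documents".
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse newlines and runs of spaces so a phrase split over two lines matches.
    joined = re.sub(r"\s+", " ", joined)
    # Match the phrase's words in order, on word boundaries.
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
    return len(re.findall(pattern, joined, flags=re.IGNORECASE))


# Invented sample text, only to exercise the function.
sample = (
    "Readers of multiple\ndocuments behave differently. "
    "Multiple docu-\nments are hard."
)
print(count_phrase(sample, "multiple documents"))
```

One caveat: rejoining at line-end hyphens will also glue together words that really are hyphenated (a "well-" / "known" split becomes "wellknown"), so treat the count as a starting point and spot-check it against the page images.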

So this Challenge is really about "tool finding" -- can you figure out how to convert from a scanned document into a readable / findable / searchable one?  

(Big hint: It's much easier than you think.)  

Let us know how you found out how to do the magic process!  

Search on! 


16 comments:

  1. Cool challenge, and timely. Thank you, Sir!

  2. Since it's a PDF file, I just use the OCR tools built into Adobe Acrobat Pro. I simply opened the text recognition table and clicked "In This File" and let it run. Not everyone has access to Acrobat Pro, however, so you might need a different OCR tool.

    Knowing the text recognition isn't perfect, I then selected "Find all Suspects" and scrolled through the document looking for any highlighted instance of the word "multiple" or "document" to make sure it was recognized correctly. There were three instances of "multiple" and none of "document", and only one of the suspects was in the phrase "multiple documents". However, in each case, it actually had the correct spelling. If I were going to be using this document for a lot of things, I might play with the settings a bit to try to get better recognition, then go through it and fix all the suspects (there were a large number), but for the challenge I didn't need to do that.

    Then searching through the document I found 6 instances of "multiple documents" in the text and one in the references (which I didn't check for errors).
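
Because recognition errors can hide a phrase from an exact search, a tolerant search is one way to surface suspects automatically. This is only a sketch under assumptions: it uses Python's standard `difflib`, the helper name `near_matches` and the 0.8 similarity threshold are illustrative choices, and the sample text is invented rather than taken from the paper or from Acrobat's suspect list.

```python
import re
from difflib import SequenceMatcher


def near_matches(ocr_text, phrase, threshold=0.8):
    """Return (window, similarity) pairs for word windows that resemble `phrase`."""
    words = re.sub(r"\s+", " ", ocr_text).strip().split(" ")
    n = len(phrase.split())
    target = phrase.lower()
    hits = []
    # Slide a window of n words over the text and score each against the phrase.
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        # Ignore surrounding punctuation and case when comparing.
        candidate = window.lower().strip(".,;:()\"'")
        ratio = SequenceMatcher(None, candidate, target).ratio()
        if ratio >= threshold:
            hits.append((window, ratio))
    return hits


# Invented sample: "rnultiple docurnents" mimics a common m -> rn misreading.
sample = (
    "Reading multiple documents is hard. "
    "Students use rnultiple docurnents often. "
    "A single document is easy."
)
for window, ratio in near_matches(sample, "multiple documents"):
    print(f"{ratio:.2f}  {window}")
```

An exact search would find only the first of the two hits here. A lower threshold catches more damaged text but also more false positives, so each hit still needs a look at the page image.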

  3. Looked it up in Google Books, searched within it, and found that the search term "multiple documents" is mentioned 7 times in the book, but 5 times in the selection you chose.

    Tried to get Drive to convert it, with no success.

    jon tU

  4. If I understand the challenge correctly, this tool is something I have been using for a while: Kami, formerly known as Notable PDF. I uploaded your document to Google Drive, where I have Kami linked. I then used the OCR function in Kami, and here is the scanned document.

    https://web.kamihq.com/web/viewer.html?file=https://notabletemporarydownloads.s3.amazonaws.com/Notable%2520PDF%2520Export%2520-%2520LHvAVw3OvM-T_ciCP82_RA.pdf

    or http://bit.ly/srs_Sept_2_2015

    Multiple documents showed up 5 times.

    Replies
    1. I just checked my links and it looks like this link won't work, sorry. If you want to try Kami, download the Kami app and link it to Google Drive.

      I have tried uploading scanned/online articles to Google Docs in Google Drive, but I find that a lot of editing is required. In some cases, despite the editing, it is worth the bother. For example, if you want to use the text in a presentation or add your own comments, using Google Docs for OCR documents can be very beneficial. It works well in a class environment where students are working together on exercises.

      I tend to use Kami for language learning. I highlight unknown phrases/words. Also, when working with online books or scanned books that have exercises within the document, it's easy to fill in answers. Very handy.

  5. How did I find out? Out of necessity I went in search of such a tool. One particular use I needed it for is my online book collection. I have even asked the company to expand their search abilities to incorporate more Google Search functions.

  6. …am guessing this is headed in a Google/Drive-centric direction?? a small detour
    the key seemed to be finding the term [optical character recognition] - there seem to be a number available… for $$ like Adobe Acrobat, some like what is offered by Google in Drive (About Optical Character Recognition in Google Drive) and a number of free online offerings - which is the way I went as an alternative -
    this required registration to really do the job, but seemed to work fine - had it convert to Word… on line ocr

    "2. Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper? (This was my original search task--finding interesting papers about how people read multiple documents at the same reading session. That's how I found this paper.) "
    I'm going with 7 times, including in Notes out of the 24 pages… (saw where Drive only does 10 pages… don't know if that is current)

  7. Good day everyone.

    After reading your comments, I tried [Ocr online]; results are good for images and smaller files.

    I found, as mentioned before, 5 instances of "multiple documents" in the text and 1 in the references. That is a total of 6. Terl mentions 6 in the text and 1 in the references; I cannot find the one I am missing.

    I'd like to know in the answer, if possible, how Dr. Russell selected "multiple documents" as the phrase to search for.

    Yes, Remmij, 10 pages in Drive is still the current limit.

  8. Hello everyone,

    1. I placed the PDF in my Google Drive. Right-clicked on the file and selected OPEN WITH > Google Docs. The converted file has each page as an image and then the text of that page below. This is helpful if you are doing translations or fixing OCR that Google Drive does. Answer - use Google Drive. https://get.google.com/tips/#!/tips/hack-photos-and-PDFs-to-say-what-you-want?category=learn-better

    2. Used Command-F to find that they mention [ multiple documents ] twice.

    Next question - would they consider reading the PDF page image and the OCR text below it in the Google Doc (for translation) as multiple documents in one, or is it still just one document?

    Replies
    1. Hello, Fred. Thanks for the URL about Google Tips. Your way is simpler, but it only looks at the first 10 pages, not the whole document.

      OCR is great.

    2. Hi Ramón, if it is more than ten pages sorry, but TLDR. ;-)

    3. Hello Fred, good day :)

      I didn't know what "TLDR" meant. Now I know. I tried to read the whole document and lost count of the "multiple documents" mentions. In any case, it was a very interesting read.

      As I mentioned, I like your path. I tried different ways and got the same result, but with more steps.

      Looking forward to learning Dr. Russell's path. I know Terl's approach, and RRR's too, though that app is new to me. Mine is different but also works.

      About apps or extensions, how can we know whether one is trustworthy and safe to add to our Drive?

      Enjoy Sunday!

  9. And now for something completely different: mix a snip from the 1st, another from the 2nd, e.g. ["reading comprehension strategies" "strategies are developmental"]; get the (text-) cached version; mark the 6+1 finds at http://ow.ly/RQEar. Voilà.

    Replies
    1. This is great. But how did you find all 6 + 1 instances? Regular control-F only finds 6. What did you do to find the 7th?

    2. I'm not sure how A.M. got there (where are the snips coming from?) or why you are not seeing the 7 with ⌘f… or am I missing the question altogether?
      ⌘f on Aui Maisi's link example… shows 7

      OK, now I see where the article was cached & it makes more sense to see - nicely done, A.M. - pretty clever approach getting around OCR on a scanned document, works for items on the net that Google has indexed - but if someone had directly sent a long scanned pdf that wasn't on the web, OCR would still be needed… right?
      used the full text version - ⌘f shows 7
      using the cached version

  10. I don't know what happened to my first comment. I think my connection has some issues. Here it is again.

    Good day, Dr. Russell, fellow SearchResearchers

    Searched:

    I know that Optical Character Recognition (OCR) helps for this Challenge and that Google Drive works. I wasn't sure it works with large documents, so I searched for that.

    [search scan document google drive]

    Text is extracted only from the first 10 pages.

    Did you know that you can search through text in scanned documents? Or convert them to Docs so you can make edits?

    [search scan document] [search scan document online] [search scan document OR PDF online]

    These give some options. Some require an email address; others only accept small files. Therefore, I needed to find other ways.
    [pdf searchable][pdf searchable online]

    Convert Document

    Answers

    1. How can you transform this document (LINK) into something that you can search within?
    A: I did it this way, to practice with Google Drive (first 10 pages only):
    1. Uploaded the file to Google Drive.
    2. Opened it with Google Docs.
    3. Ctrl-F [multiple documents].
    4. This finds "multiple documents" 2 times.

    Using the Convert Document source above, I changed the file to docx and opened it in Chrome.

    2. Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper? (This was my original search task--finding interesting papers about how people read multiple documents at the same reading session. That's how I found this paper.)
    A. Six times.
