SearchReSearch: Search Challenge (9/2/15): How to search in a scanned document?

Wednesday, September 2, 2015

If your research is like mine...

... you fairly frequently find a document that's from another era. It doesn't even have to be that long ago before you find yourself dealing with infernally annoying crufty docs.

For instance, when I'm searching, I fairly often find a document that was scanned as an image. It's great to have the document in the first place, but as a scan, it's often less than completely useful.

Here's an example. A document I found in one of my research studies was this excellent paper that's available only in a scanned PDF format. (Here's the LINK to the paper.) When you open it up, you'll see sections that appear like this:

Of course, our usual Control-F / CMD-F tricks don't work on this kind of scanned doc, and since this is a long paper, that makes it much harder to read. In particular, what I WANT is this--something I CAN use Control-F on:

Our SearchResearch Challenge for this week is meant to give you an additional powerful tool for importing scanned documents and making them findable.

1. How can you transform this document (LINK) into something that you can search within?

2. Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper? (This was my original search task--finding interesting papers about how people read multiple documents at the same reading session. That's how I found this paper.)

So this Challenge is really about "tool finding" -- can you figure out how to convert from a scanned document into a readable / findable / searchable one?

(Big hint: It's much easier than you think.)

Let us know how you found out how to do the magic process!

Search on!

16 comments:

Anonymous, September 2, 2015 at 8:32 AM
Cool challenge, and timely. Thank you, Sir!
Tom Stephens, September 2, 2015 at 11:30 AM
Since it's a PDF file, I just use the OCR tools built into Adobe Acrobat Pro. I simply opened the text recognition table and clicked "In This File" and let it run. Not everyone has access to Acrobat Pro, however, so you might need a different OCR tool.

Knowing the text recognition isn't perfect, I then selected "Find all Suspects" and scrolled through the document looking for any highlighted instance of the word "multiple" or "document" to make sure it had been recognized correctly. There were three suspect instances of "multiple" and none of "document", and only one of the suspects was in the phrase "multiple documents". However, in each case, the OCR actually had the correct spelling. If I were going to use this document for a lot of things, I might play with the settings a bit to try to get better recognition, then go through it and fix all the suspects (there were a large number), but for the challenge I didn't need to do that.

Then searching through the document I found 6 instances of "multiple documents" in the text and one in the references (which I didn't check for errors).
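
For anyone who would rather script the count than use Find, here is a minimal sketch. It assumes the OCR output has already been saved as plain text; the function name `count_phrase` and the sample string are made up for illustration and are not taken from the paper. It rejoins words split by a hyphen at a line break and ignores case and line wraps, two of the ways OCR output can hide a phrase from an exact-match search.

```python
import re

def count_phrase(ocr_text: str, phrase: str) -> int:
    """Count case-insensitive occurrences of `phrase` in OCR output."""
    # Rejoin words hyphenated across a line break ("docu-\nments" -> "documents").
    # Caveat: this also merges genuine hyphenated compounds that wrap lines.
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", ocr_text)
    # Collapse every run of whitespace (including newlines) to one space.
    normalized = re.sub(r"\s+", " ", joined).lower()
    # Allow any whitespace between the words of the phrase.
    words = [re.escape(w) for w in phrase.lower().split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, normalized))

# Made-up sample text, NOT taken from the paper.
sample = (
    "Readers often consult multiple\n"
    "documents in one session. Reading Multiple docu-\n"
    "ments is hard. A single document is easier."
)
print(count_phrase(sample, "multiple documents"))
```

A script like this cannot catch characters the OCR misread, so checking the suspects by eye is still worthwhile.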
jon, September 2, 2015 at 12:11 PM
Looked it up in Google Books and searched within it; the search term "multiple documents" is mentioned 7 times in the book but 5 times in the selection you chose.

Tried to get Drive to convert it, with no success.

jon tU
Rosemary M, September 2, 2015 at 5:27 PM
If I understand the challenge correctly, this tool is something I have been using for a while: Kami, formerly known as Notable PDF. I uploaded your document to Google Drive, where I have linked Kami with my account. I then used the OCR function in Kami, and here is the scanned document:

https://web.kamihq.com/web/viewer.html?file=https://notabletemporarydownloads.s3.amazonaws.com/Notable%2520PDF%2520Export%2520-%2520LHvAVw3OvM-T_ciCP82_RA.pdf

or http://bit.ly/srs_Sept_2_2015

"Multiple documents" showed up 5 times.
Rosemary M, September 2, 2015 at 5:35 PM
How did I find out? Out of necessity, I went in search of such a tool. One particular use I needed it for is my online book collection. I have even asked the company to expand their search abilities to incorporate more Google Search functions.
remmij, September 2, 2015 at 6:25 PM
…am guessing this is headed in a Google/Drive-centric direction?? a small detour
the key seemed to be finding the term [optical character recognition] - there seem to be a number available… for $$ like Adobe Acrobat, some like what is offered by Google in Drive (About Optical Character Recognition in Google Drive) and a number of free on line offerings - which is the way I went as an alternative - this required registration to really do the job, but seemed to work fine - had it convert to Word… on line ocr

"2. Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper? (This was my original search task--finding interesting papers about how people read multiple documents at the same reading session. That's how I found this paper.) "
I'm going with 7 times, including in Notes out of the 24 pages… (saw where Drive only does 10 pages… don't know if that is current)
Ramon Gonzalez, September 3, 2015 at 8:01 AM
Good day everyone.

After reading your comments, I tried [Ocr online]; the results are good for images and smaller files.

As mentioned before, I found 5 instances of "multiple documents" in the text and 1 in the references. That is 6 in total. Terl mentions 6 in the text and 1 in the references; I cannot find the one I missed.

If possible, I'd like the answer to explain how Dr. Russell selected "multiple documents" as the phrase to search for.

Yes, Remmij, 10 pages in Drive is still the current limit.
krossbow, September 5, 2015 at 9:57 AM
Hello everyone,

1. I placed the PDF in my Google Drive, right-clicked on the file, and selected OPEN WITH > Google Docs. The converted file has each page as an image, with the text of that page below it. This is helpful if you are doing translations or fixing the OCR that Google Drive does. Answer - use Google Drive. https://get.google.com/tips/#!/tips/hack-photos-and-PDFs-to-say-what-you-want?category=learn-better

2. Used Command-F to find that they mention [ multiple documents ] twice.

Next question - would they consider reading the PDF page image and the OCR text below it in the Google Doc (for translation) as multiple documents in one, or is it still just one document?
Anonymous, September 6, 2015 at 3:10 AM
And now for something completely different: mix a snip from the 1st, another from the 2nd, eg:["reading comprehension strategies" "strategies are developmental"]; get the (text-) cached version; markk' the 6+1 finds at http://ow.ly/RQEar. Voilà.
Ramon Gonzalez, September 6, 2015 at 2:05 PM
I don't know what happened to my first comment. I think my connection has some issues. Here it is again.

Good day, Dr. Russell, fellow SearchResearchers

Searched:

I know that Optical Character Recognition (OCR) helps with this Challenge and that Google Drive works. I was not sure it works with large documents, so I searched for that.

[search scan document google drive]

Text is extracted only from the first 10 pages.

Did you know that you can search through text in scanned documents? Or convert them to Docs so you can make edits?

[search scan document] [search scan document online] [search scan document OR PDF online]

These give some options. Some need an email address; others accept only small files. Therefore, I needed to find other ways.
[pdf searchable] [pdf searchable online]

Convert Document

Answers

1. How can you transform this document (LINK) into something that you can search within?
A: I did it this way, to practice with Google Drive (first 10 pages only):
1. Uploaded the file to Google Drive.
2. Opened it with Google Docs.
3. Ctrl-F [multiple documents]
4. This finds "multiple documents" 2 times.

With the Convert Document source, I changed the file to docx and opened it in Chrome.

2. Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper? (This was my original search task--finding interesting papers about how people read multiple documents at the same reading session. That's how I found this paper.)
A. Six times.