There are many ways to search in a scanned PDF for some text.

Let's review: the SearchResearch Challenge for this week is meant to give you an additional powerful tool for importing scanned documents and making them findable.

1. How can you transform this document (LINK) into something that you can search within?
2. Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper? (This was my original search task: finding interesting papers about how people read multiple documents in the same reading session.)

So this Challenge is really about "tool finding": can you figure out how to convert a scanned document into a readable / findable / searchable one?

As we've talked about before, taking a scanned document and converting the scan into recognizable text is called "Optical Character Recognition," or OCR, so I'm going to use that in my query.

I also remembered that Google Docs had some OCR capability, so my first query was:

[Image: the search query]

Which led me to a lovely Help Center article about how to import a PDF file into your Google Drive, then open it with Docs.

That's when I noticed that much of the first page of text had NOT been recognized! Huh.

[Image: a Control-F search for the paper's title in the Docs OCR output, showing zero hits]

As you can see in the above image, you can't even Control-F for the title of the document: there are zero hits for the title. If the OCR process had been accurate, it certainly would have located the title of the paper (which is just a few lines below).

Okay, I know that OCR is a difficult process. Many OCR systems have errors, and I just found one here in the Docs OCR. But that didn't explain the "extra" instances of the phrase "multiple documents" I found in the printed-out version of the paper.

When there are strange boxes on the page, Docs OCR might skip over a chunk of the text.
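
Once a scan has been converted to text, the counting step of the Challenge can be checked with a few lines of Python rather than by eye. What follows is only a minimal sketch, not the method used in this post: it assumes the OCR output has already been saved or copied as plain text, and the names `normalize`, `count_phrase`, and `sample` are made up for illustration. It shows why a plain search can undercount. OCR text often breaks a phrase across lines, hyphenates a word at a line end, or changes capitalization, so the text is normalized before counting.

```python
import re


def normalize(text: str) -> str:
    """Prepare OCR text for phrase matching.

    Rejoins words hyphenated across a line break, collapses every run of
    whitespace (including newlines) into one space, and lowercases.
    Caveat: a genuinely hyphenated word that happens to wrap at the
    hyphen will be joined too.
    """
    text = re.sub(r"-\s*\n\s*", "", text)  # "docu-\nments" -> "documents"
    text = re.sub(r"\s+", " ", text)       # newlines and tabs -> single space
    return text.lower().strip()


def count_phrase(ocr_text: str, phrase: str) -> int:
    """Count non-overlapping occurrences of phrase in normalized text."""
    return normalize(ocr_text).count(normalize(phrase))


# Hypothetical OCR output: the phrase is split across a line, capitalized
# differently, and hyphenated at a line end.
sample = """Readers often consult multiple
documents in one session. Reading Multiple Documents is hard;
comparing multiple docu-
ments is harder."""

if __name__ == "__main__":
    print("naive count:     ", sample.count("multiple documents"))
    print("normalized count:", count_phrase(sample, "multiple documents"))
```

The same function doubles as a sanity check on the OCR itself: if `count_phrase` returns zero for the paper's own title, part of the page was probably never recognized, and any count taken from that text should be treated as a lower bound.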