Question

OCRed or not OCRed, that is the question

asked on July 14, 2015

I am working on a workflow to ensure that the whole repository is thoroughly OCRed, but I have a question about the search I am using after reading its definition in the manual:

 

https://www.laserfiche.com/support/webhelp/Laserfiche/9.1/en-US/UserGuide/Laserfiche_Client.htm#Search/Advanced/OCRed_Documents_Search_Syntax.htm%3FTocPath%3DSearch%7CAdvanced%2520Searches%7C_____18

 

This states:

Advanced search syntax can be used to search for documents by their OCR status (whether searchable text has been generated for them) using the following syntax. All advanced search types can be customized with advanced search operators and wildcards.

  • {LF:OCR=All/Some/None}

This seems to indicate that Laserfiche keeps track of which pages, across all documents, it has applied its own OCR process to. In other words, it is not just looking for pages that have ASCII/plain-text metadata associated with them, which of course could have come from another page scanning and OCR application (like EzeScan).

Is the search looking for pages that have been OCRed by Laserfiche's scheduled or manual OCR process, or is it only identifying OCRed pages by whether all, some, or none of the pages have text present?

If it is truly a flag that reliably reports whether Laserfiche successfully OCRed the pages of a document, it would remove the need for the workflow to set its own flags marking documents as OCRed or not. That matters because a document with text on only some pages may in fact be fully OCRed: it may contain blank or photographic pages that genuinely have no text, so OCR could have run successfully on ALL pages and still produced TEXT on only SOME.

Best 

W


Answer

SELECTED ANSWER
replied on July 16, 2015

Yes, if the entire document has been processed through OCR, then it will be flagged as "All" even if some pages have an empty text file. That was specifically requested by other customers in the past for exactly this purpose: knowing that the document has gone through OCR without errors.

If OCR crashes on a page, we restart the process and try again. There is a default timeout of 15 minutes, after which OCR gives up on a page that has not finished. The timeout can be adjusted.

You could have Workflow tag the documents it sends to DCC so you can distinguish between the ones you never processed and the ones that you attempted to process.
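The tagging idea above can be sketched as a small triage step. This is only an illustration of the decision logic in Python, not Laserfiche Workflow or SDK code: the `Doc` class, the `sent_to_dcc` field, and the bucket names are hypothetical, and the OCR status is assumed to come from the OCR search described in the question.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Doc:
    entry_id: int
    ocr_status: str    # "All", "Some", or "None", as reported by the OCR search
    sent_to_dcc: bool  # hypothetical tag applied by Workflow when it queues the entry


def triage(docs: Iterable[Doc]) -> Dict[str, List[int]]:
    """Split entries into finished, never-processed, and attempted-but-incomplete."""
    buckets: Dict[str, List[int]] = {"done": [], "queue": [], "review": []}
    for doc in docs:
        if doc.ocr_status == "All":
            # Every page has a text page (possibly blank): OCR completed.
            buckets["done"].append(doc.entry_id)
        elif doc.sent_to_dcc:
            # Already attempted but not complete: likely interrupted or timed out.
            buckets["review"].append(doc.entry_id)
        else:
            # Never sent for processing: safe to queue.
            buckets["queue"].append(doc.entry_id)
    return buckets
```

With this split, a nightly run would only queue the "queue" bucket, and the "review" bucket would hold the entries that were attempted but did not finish.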


Replies

replied on July 15, 2015

"OCR" is not specific to OCRing the document in the Laserfiche Client or any other LF product. An image page is considered "OCRed" if it has an associated text page. So, technically, it is a "has text" flag.

As far as blank pages go, OCRing through Laserfiche products will still set a (blank) text file for the page, so the page will be considered OCRed. I don't know if EZEScan behaves the same way.
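As a rough model of the "has text" flag described above, the following Python sketch classifies a document from its pages. It is an assumption-laden illustration, not how the Laserfiche Server actually stores or computes the flag: each page is represented by its text, where `None` means no text page exists and an empty string means a blank text page exists.

```python
from typing import Optional, Sequence


def ocr_status(page_texts: Sequence[Optional[str]]) -> str:
    """Return "All", "Some", or "None" for a document.

    None  -> the image page has no associated text page.
    ""    -> the page has a blank text page, which still counts as OCRed.
    """
    with_text_page = sum(1 for text in page_texts if text is not None)
    if with_text_page == 0:
        return "None"
    if with_text_page == len(page_texts):
        return "All"
    return "Some"
```

Under this model a document with one page of text and two blank text pages is "All", while a document where any page lacks a text page entirely is "Some" or "None".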

replied on July 15, 2015

OK, so further to the blank pages instance: a page OCRed by Laserfiche that contains only a photograph, in which no text can be detected, would get the same (blank) text file. So a document with text on some pages, blank pages, and photographic pages with no text will be considered OCRed "ALL", provided the OCR process was not interrupted midstream. An interruption could happen if, say, the DCC Worker process were killed because an extremely long document ran over into a time when the server (or worker node) needed to be available for office hours or maintenance.

My problem is creating a workflow that leverages the DCC to ensure that all documents in the repository have been completely OCRed, but some documents are taking a very long time to complete, even with the settings on "speed" and no clean-up. These documents may only partially contain text, so when the workflow goes out and does a search on Text=Some or None but NOT All, I need to know that I am not pulling in documents that were previously sent to the DCC and completed with text on only some pages, as opposed to documents that have not been OCRed yet.

 

I hope this makes sense.  

 

I kind of want the workflow to wait until the DCC hands back a success token for each EntryID that gets queued, but I don't think that is easily possible, and I certainly don't know how to approach it with scripts, PowerShell, etc. if it is. But the client wants to ensure that the entire repository is methodically and efficiently OCRed at night, as soon as possible, if that sounds reasonable.
