Question

Distributed Computing Cluster: Set SkipPagesThatAlreadyHaveText to True

asked on February 21, 2019

In the Distributed Computing Cluster, how do you set SkipPagesThatAlreadyHaveText to True? In the event logs it appears under OcrEngineOptions as False.

 

OcrEngineOptions:
    Decolumnize: False
    LanguageTag: en
    OptimizationMode: Balance
    OcrEntriesInSubFolders: False
    AutoOrient: False
    PerformImageCleanup: True
    SkipPagesThatAlreadyHaveText: False

ImageCleanupOptions:
    Deskew: True
    Despeckle: True
    SpeckleSizeInPixels: 2
    Rotate: False
    RotationAmountInDegrees: 0
    HorizontalLineRemoval: True
    VerticalLineRemoval: True
    LineRemovalCharProtection: True

Answer

SELECTED ANSWER
replied on March 7, 2019

The option to skip pages with text using DCC will depend on which application you are using to call DCC. If you are using the Web Client, this option is something you can toggle when you are scheduling OCR on a document. When you have a document selected, you should be able to find the option to “Generate Searchable Text” from the toolbar. Choosing this option will open a dialog box with the option to generate text for all image pages, specific pages, or only on pages without text.

 

If you are using Workflow, the option to skip pages that already have text is not currently available for configuration from the Schedule OCR activity, but the Laserfiche development team is looking into adding it in the future. In the meantime, you may be able to work around this using searches to locate the documents you need to OCR. For example, if you would like to find all documents with pages but no text associated, you can use the following query: {LF:AssociatedPages="Y"}&{LF:OCR="none"}.
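As a small illustration of the workaround above, the search string can be assembled in code before being passed to whatever search mechanism you use. This is only a sketch: the function name `build_ocr_search_query` and its parameters are hypothetical, and the only values confirmed by this post are `"Y"` and `"none"`; any other value is an assumption to check against the Laserfiche search syntax documentation.

```python
def build_ocr_search_query(has_pages: str = "Y", ocr_status: str = "none") -> str:
    """Compose the advanced-search string quoted in the answer above.

    Hypothetical helper: only the defaults ("Y", "none") come from the post;
    other argument values are unverified assumptions.
    """
    return '{LF:AssociatedPages="%s"}&{LF:OCR="%s"}' % (has_pages, ocr_status)


# With the defaults, this reproduces the query from the answer verbatim.
query = build_ocr_search_query()
print(query)
```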


Replies

replied on March 12, 2019

I have implemented a workflow (WF) that sends documents to OCR through DCC when they were created or modified that day and their OCR status is either NONE or SOME.

In the OCR settings in the Windows Client, there is an option to OCR "All image pages without text". However, I don't see that option among the DCC options in Workflow, so any time a new page is added to a document, the whole document gets OCRed again.

The main issue is that many documents with hundreds of pages get completely OCRed again and again, which slows down the OCR process for nothing. The same document keeps going back to OCR because these documents have new pages added many times over a few months, so OCR needs to be run often on the same document.

Is there an option I have not seen, or a hidden option? Or does this require Laserfiche to add it as a feature, as was mentioned earlier in this post?
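To make the cost difference in this reply concrete, here is a minimal model of the two behaviors: OCRing every page versus skipping pages that already have text. It is not Laserfiche or DCC code; the `Page` class and `pages_needing_ocr` function are hypothetical names used purely to illustrate the idea behind a setting like SkipPagesThatAlreadyHaveText.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    number: int
    has_text: bool  # True if the page already has text associated with it


def pages_needing_ocr(pages: List[Page], skip_pages_with_text: bool = True) -> List[int]:
    """Return the page numbers that would be sent to OCR.

    Hypothetical model: with skip_pages_with_text=False every page is
    reprocessed; with True only pages lacking text are selected.
    """
    if not skip_pages_with_text:
        return [p.number for p in pages]
    return [p.number for p in pages if not p.has_text]


# A document whose first two pages were OCRed earlier, plus one newly added page.
doc = [Page(1, True), Page(2, True), Page(3, False)]
print(pages_needing_ocr(doc))                               # only the new page
print(pages_needing_ocr(doc, skip_pages_with_text=False))   # every page again
```

For a document with hundreds of existing pages and one new page, the first call selects a single page while the second selects all of them, which is the repeated work described above.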
