Distributed Computing Cluster options and cleanup

asked on April 24, 2014

I've set up LFDCC to handle OCR processing for Workflow and Web Access. The installation seems to have gone very well. One thing I noticed in the WebAdmin DCC job reporting pages:



    Decolumnize: False
    LanguageTag: en
    OptimizationMode: Balance
    OcrEntriesInSubFolders: False
    AutoOrient: False
    PerformImageCleanup: True
    SkipPagesThatAlreadyHaveText: False

    Deskew: True
    Despeckle: True
    SpeckleSizeInPixels: 2
    Rotate: False
    RotationAmountInDegrees: 0
    HorizontalLineRemoval: True
    VerticalLineRemoval: True
    LineRemovalCharProtection: True



Is the OCR engine performing the deskew/despeckle as part of image cleanup? Where are these OCR engine and image cleanup options coming from?





replied on April 24, 2014

These options are being pulled from the settings of the sending application.


For Workflow:

OCR Settings are configurable for each Schedule OCR activity.


For Web Access:

OCR Settings for each user are under the Generate Text section of the Settings dialog.


For the currently released version of DCC, the 9.1 preview, the ImageCleanupOptions are applied if PerformImageCleanup is True.
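As a rough illustration of that gating, the sketch below parses the "Key: Value" lines from the job report quoted in the question and lists which cleanup options would take effect. It is plain text parsing only, not a Laserfiche or DCC API; the function names `parse_job_options` and `active_cleanup_steps` are hypothetical, and the order of the returned steps is not meant to reflect the order the engine applies them in.

```python
def parse_job_options(report_text):
    """Parse 'Key: Value' lines from a job report into a dict.

    Hypothetical helper: it only reads the text shown on the report page.
    """
    options = {}
    for line in report_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a colon
        key, value = key.strip(), value.strip()
        if value in ("True", "False"):
            options[key] = value == "True"
        elif value.isdigit():
            options[key] = int(value)
        else:
            options[key] = value
    return options


def active_cleanup_steps(options):
    """List the cleanup options that are enabled.

    Mirrors the rule stated above: the individual cleanup options only
    matter when PerformImageCleanup is True.
    """
    if not options.get("PerformImageCleanup", False):
        return []
    flags = ["Deskew", "Despeckle", "Rotate",
             "HorizontalLineRemoval", "VerticalLineRemoval"]
    return [name for name in flags if options.get(name)]


REPORT = """
PerformImageCleanup: True
SkipPagesThatAlreadyHaveText: False

Deskew: True
Despeckle: True
SpeckleSizeInPixels: 2
Rotate: False
RotationAmountInDegrees: 0
HorizontalLineRemoval: True
VerticalLineRemoval: True
LineRemovalCharProtection: True
"""

options = parse_job_options(REPORT)
print(active_cleanup_steps(options))
```

With the settings from the question, this reports Deskew, Despeckle, HorizontalLineRemoval, and VerticalLineRemoval as active; setting PerformImageCleanup to False would yield an empty list regardless of the other flags.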



Since this was selected as the answer, I wanted to emphasize what Matthew mentioned below:

Note that the image cleanup is applied only to the image used for OCR; the cleaned-up image is not saved to the repository.



replied on April 24, 2014

Yes, image cleanup is being invoked on the document prior to OCR. Based on those settings, you are despeckling the image (removing speckles of 2 pixels or smaller), removing horizontal and vertical lines, and deskewing the image. These options are set in:


  1. The Workflow Designer. If you select the Schedule OCR activity, you will have Image Cleanup Options available under the Additional Options pane.
  2. Web Access under Settings->Generate Text/OCR Settings


OCR settings in Web Access are per user and are stored as user attributes. OCR settings in Workflow are unique to each Schedule OCR activity in the workflow.


Note that the image cleanup is applied only to the image used for OCR; the cleaned-up image is not saved to the repository.

replied on April 24, 2014

Thank you both. Excellent! 

replied on November 9, 2018

We are using the Workflow Schedule OCR activity. The OCR setting "SkipPagesThatAlreadyHaveText" defaults to "False", and this option is nowhere to be seen in the OCR settings. How can we change it to "True"? We have many documents that were only partially OCRed in the past, or that have had new pages added that still need to be OCRed.
