Help please with a tricky situation I find with Schedule OCR and a large repository

asked on July 13, 2015

Hi all,

I find myself in a tricky situation. A large repository is being populated with many files of varying length - some running to many hundreds or even a thousand pages (300dpi tiffs from MFDs and bulk scanning sources) - that we need to OCR for deep text search.

I will ultimately be able to run two worker nodes and the scheduler on the server but for now I need to build a workflow that will search the repository and return the documents that need OCRing.

Problems seem to be multiplying.

First, the workflow very quickly returns the search results for documents with more than 0 pages and NO text, hands those entry IDs off to the scheduler, and completes - so the workflow is not really in control of the jobs as such. Consequently, starting and stopping the OCRing has to be handled as a worker script, and it is possible I will have to interrupt an OCR task midway if the service is still working when the night shift ends and the server needs to be back to high availability for users, and thus not OCRing.

As a result there will be a build-up of jobs that no longer satisfy the condition of having pages but no text - some will have some text and still need OCRing. It is also likely (is it not?) that successful/completed OCR sessions will leave documents with more pages than pages containing text (i.e. docs with text plus full-image pages in which no text was found to recognise).

These documents will also, one way or another, need to be flagged somehow so they are not reprocessed.

So I need to identify:

- documents that definitely need OCRing from scratch;

- documents that have only been partially OCRed, to complete them;

- and exclude documents that are only partially text anyway...
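For what it's worth, that three-way split can be sketched in plain code. This is only an illustration - `page_count`, `text_page_count` and the "done" tag are hypothetical stand-ins for whatever the workflow or SDK script would actually read from each entry, not real Laserfiche fields:

```python
# Hypothetical sketch: classify repository entries for OCR scheduling.
# page_count / text_page_count / tagged_done are assumed properties,
# not actual Laserfiche API fields.

def classify(page_count, text_page_count, tagged_done):
    """Decide what to do with a document based on its page/text counts."""
    if tagged_done:
        return "skip"            # already processed cleanly
    if text_page_count == 0 and page_count > 0:
        return "ocr_full"        # no text at all: OCR from scratch
    if 0 < text_page_count < page_count:
        # Could be an interrupted OCR run, or a doc that is genuinely
        # part image-only; the "done" tag is what tells those apart.
        return "ocr_partial"
    return "skip"                # every page already has text

docs = [
    (300, 0, False),    # scanned tiff, never OCRed
    (300, 120, False),  # interrupted mid-run
    (300, 300, False),  # fully text-searchable
]
print([classify(*d) for d in docs])  # ['ocr_full', 'ocr_partial', 'skip']
```

The key point the sketch makes: page counts alone cannot separate "interrupted OCR" from "legitimately part-image document", which is why the tag is needed.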

Do I do this with a recursive workflow that takes a document and calls a new workflow on it for each iteration, setting a tag and not tagging it as "done" until it exits cleanly?

How do I get the workflow to stay in charge of progress when it seems to hand everything off? Again by nested calls to iterative workflows, setting tags and perhaps handling one entry ID at a time so we know which have been handled?
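The pattern described above (claim an entry, mark it "done" only on a clean exit, one entry ID at a time) might look like this in outline. Again purely illustrative: `ocr_entry` and the `tags` store are hypothetical placeholders for the real OCR call and the repository tags:

```python
# Hypothetical sketch of the one-entry-at-a-time, tag-on-completion pattern.
# `ocr_entry` stands in for the real OCR call and may raise if interrupted.

def run_queue(entry_ids, tags, ocr_entry):
    """Process entries one at a time; tag 'done' only on a clean exit."""
    for entry_id in entry_ids:
        if tags.get(entry_id) == "done":
            continue                      # already handled: never reprocess
        tags[entry_id] = "in_progress"    # claim the entry before working
        try:
            ocr_entry(entry_id)
        except InterruptedError:
            # Night shift ended mid-document: leave it 'in_progress'
            # so the next run picks it up and finishes it.
            break
        tags[entry_id] = "done"           # clean exit only

tags = {101: "done"}
run_queue([101, 102, 103], tags, ocr_entry=lambda e: None)
print(tags)  # {101: 'done', 102: 'done', 103: 'done'}
```

Because the "done" tag is only written after a clean exit, an interrupted document stays eligible for the next overnight run, which is exactly the restart behaviour the question is after.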

Getting so confused, and the client needs reassuring that this is not some kind of time sink that they will never recoup...





replied on July 13, 2015

Hi Will,


I wholeheartedly agree that the control aspects of the distributed processing need refining, and that a better way to halt the OCR is needed if it runs overnight and is still going at the start of the next day.

Having said that, to get round this:-

Why don't you limit the search results to batches of a few hundred and have the workflow run multiple times overnight? That way it's easier to keep a handle on what Workflow gives to the workers. Obviously you will need to fine-tune the number in each iteration to match the performance of the workers. Alternatively, add more workers for more processing power if you want it to go faster.
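A rough sketch of that batching idea follows. The names are hypothetical (`search_untexted` stands in for the repository search for pages-but-no-text documents, `dispatch` for handing entry IDs to the workers); the batch size is the knob to tune against worker throughput:

```python
# Hypothetical sketch: hand the scheduler small batches instead of the
# whole result set, so each overnight iteration stays controllable.

def run_in_batches(search_untexted, dispatch, batch_size=200,
                   within_window=lambda: True):
    """Repeatedly fetch a limited batch and hand it to the workers."""
    while within_window():
        batch = search_untexted(limit=batch_size)  # pages > 0, no text
        if not batch:
            break                                  # nothing left to OCR
        dispatch(batch)                            # entry IDs to workers

# Toy demo: a shrinking queue stands in for the repository search.
queue = list(range(1, 501))
sent = []

def search_untexted(limit):
    return queue[:limit]

def dispatch(batch):
    sent.extend(batch)
    del queue[:len(batch)]     # dispatched docs drop out of the search

run_in_batches(search_untexted, dispatch, batch_size=200)
print(len(sent))  # 500 entries dispatched, 200 at a time
```

The `within_window` hook is where an end-of-night-shift check would go, so the loop simply stops fetching new batches rather than being killed mid-job.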
