Question

Schedule OCR with workflow

asked on March 27, 2014

Hi,

 

We need a workflow that schedules OCR for documents with some or no pages. Using DCC and Workflow 9.1.1, we are able to accomplish this. However, we can't stop the process once it has been running for too long, which we need to do so that we don't overload the machines running the OCR process.

 

Is there a different way to do this workflow, or is there a way to make the OCR process stop at a certain time?

 

Regards,

 

Answer

SELECTED ANSWER
replied on March 27, 2014

Devin is on the right track here.

 

Workflow doesn't wait for DCC to complete its OCR jobs; it only schedules the job and then moves on. So from the Workflow end, you'll need to batch your documents into appropriate chunks for DCC to work with.

 

Internally, DCC will keep track of the jobs sent to it and schedule them to be done on its worker nodes. Note that the way that you send documents to DCC from Workflow will make a big difference in how well DCC will handle the work.

 

If you send each document individually in its own Schedule OCR activity, then a full new DCC job will be created for each of them. With enough documents, you will hit the cap on active jobs in DCC (5,000), and you'll pay extra execution-time overhead for each document.

 

A better approach is to group all of the documents and send them in a single Schedule OCR activity, by using a collection of Document IDs in that activity. This allows DCC to create a single job for all of the documents and then split the documents internally into tasks. This helps avoid the 5,000 active job limit, and the overhead for tasks is much smaller than the overhead for jobs. This is probably the best method if you are running OCR from a search result.
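
To make the job-count difference concrete, here is a minimal Python sketch of the batching idea. It is only an illustration of the arithmetic, not Workflow or DCC code: the function names, the made-up entry IDs, and the batch size of 2,000 are all assumptions. The only figure taken from this thread is the 5,000 active job cap.

```python
from typing import Iterator, List, Sequence

# Active-job limit quoted in the answer above.
DCC_ACTIVE_JOB_CAP = 5000


def batches(doc_ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of doc_ids, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(doc_ids), batch_size):
        yield list(doc_ids[start:start + batch_size])


def jobs_needed(doc_count: int, batch_size: int) -> int:
    """Number of jobs created if every batch is submitted as one job."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return -(-doc_count // batch_size)  # ceiling division


if __name__ == "__main__":
    doc_ids = list(range(1, 12001))  # 12,000 made-up entry IDs

    per_document = jobs_needed(len(doc_ids), 1)
    grouped = jobs_needed(len(doc_ids), 2000)  # hypothetical batch size

    print(per_document, per_document > DCC_ACTIVE_JOB_CAP)
    print(grouped, grouped > DCC_ACTIVE_JOB_CAP)
```

With one document per job, the job count equals the document count, so a large search result can exceed the cap on its own. Grouping keeps the job count small regardless of how many documents the search returns.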

 

The absolute best way to send the documents, at least in terms of DCC resource usage, is to put all the documents to be OCRed into a single folder and select that folder in the Schedule OCR activity. DCC will find all of the documents in that folder and work on OCRing them, and so it won't need to keep as much information in memory while it works, and it can schedule more effectively by restricting the number of tasks it has waiting at any given time.

 

No matter which method you go with for submitting the documents, Devin's suggestion is still a good idea. Try to figure out roughly how many documents your cluster can OCR during some time period (on the order of one to several hours), and submit that many documents at a time. You can either be conservative and submit the next batch well after the cluster should have finished processing, or get clever with PowerShell: cancel outstanding jobs, or wait until all jobs are complete and then do something to trigger the Workflow.
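
One rough way to pick that batch size is to time a trial run and scale it to the window you want to fill. The Python sketch below assumes you have measured how many documents the cluster finished in a known number of minutes; the function name, the 80 percent safety margin, and the sample numbers are assumptions for illustration, not measurements from any real cluster.

```python
def docs_per_window(measured_docs: int,
                    measured_minutes: int,
                    window_minutes: int,
                    safety_percent: int = 80) -> int:
    """Estimate how many documents to submit for one processing window.

    Scales a measured throughput to the target window, then keeps only
    safety_percent of it so the cluster should finish with time to spare.
    Integer arithmetic is used so the result always rounds down.
    """
    if measured_minutes < 1 or window_minutes < 1:
        raise ValueError("durations must be at least one minute")
    if not 0 < safety_percent <= 100:
        raise ValueError("safety_percent must be in 1..100")
    return (measured_docs * window_minutes * safety_percent) // (measured_minutes * 100)


if __name__ == "__main__":
    # Hypothetical trial: 500 documents finished in 30 minutes.
    # Target: a 2-hour window, using 80% of the estimated capacity.
    print(docs_per_window(500, 30, 120))  # 1600
```

Because page counts and image quality vary between documents, an estimate like this will drift, which is why a margin (or a cleanup step that cancels leftover jobs) is still worth having.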


Replies

replied on March 27, 2014

Are you using a search to find these documents? You could limit the number of search hits returned by the Search Repository activity so you send smaller batches to DCC.

replied on March 27, 2014

Yes, I am using a search in the entire repository. Thank you for the suggestion. 

 

However, I'd still like to know if it'd be possible to build a workflow like the one I'm describing. 

 

Thank you.

replied on March 27, 2014

The best way is to send only an appropriate number of documents to DCC at each execution. However, due to many different factors, there are bound to be inaccuracies in doing that.

 

I've got some tasks scheduled in Windows that execute a PowerShell script each hour. Each hour, the script cancels outstanding jobs so that the next run of the Workflow can start fresh without building up a backlog. Also, each evening a script runs that tells DCC which machines it can use and how many cores on each. The same thing happens in the morning. The effect is that each night, several machines spin up and are completely dedicated to DCC. In the morning, they go back to their normal duties.
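
The evening and morning switch comes down to a time-window check that wraps past midnight. The Python sketch below shows only that decision logic; it does not call DCC, and the 19:00 to 07:00 window, the machine names, and the core counts are hypothetical placeholders rather than the setup described in this reply.

```python
from datetime import time
from typing import Dict


def in_night_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end); the window may wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def cores_for_dcc(now: time,
                  start: time,
                  end: time,
                  night_cores: Dict[str, int],
                  day_cores: Dict[str, int]) -> Dict[str, int]:
    """Pick the machine-to-core allocation that applies at the given time."""
    return night_cores if in_night_window(now, start, end) else day_cores


if __name__ == "__main__":
    # Hypothetical allocations: extra machines are handed to DCC overnight.
    night = {"OCR-01": 8, "DESKTOP-A": 4, "DESKTOP-B": 4}
    day = {"OCR-01": 8}

    print(cores_for_dcc(time(23, 0), time(19, 0), time(7, 0), night, day))
    print(cores_for_dcc(time(12, 0), time(19, 0), time(7, 0), night, day))
```

A scheduled task would run something equivalent at the start and end of the window and then apply the chosen allocation with whatever administration tooling your DCC version provides.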

replied on March 27, 2014

Thank you, everyone, for your responses!
