Question

Questions about DCC and Workflow in OCR of whole Repository

asked on July 3, 2015

Hi all,

 

I have successfully set up my DCC with the main server as both scheduler and worker, and have run some test workflows that find documents in the repository with no OCR text (tested on one folder so far) and perform scheduled OCR on them. Pic of workflow attached.

The thing is, the server is set to perform maintenance and backups at night, and there is a backlog of files for the initial OCR: the site has just completed a relocation and done about 1500 documents, ranging from 1 or 2 pages up to 1000 pages, which now need to be digested.

So I suggested that, rather than competing for capacity at night, we schedule the OCR to run between 6am and 6pm on Saturday and Sunday this first weekend and see how much gets done.

So the workflow, scheduled to start 6am on Sat and Sun, is running within the conditional:

If %(Time) is less than 06:00:00 PM

then End Workflow

And it acts on the output of the search:

({LF:AssociatedPages="Y"} & {LF:OCR=none})

 

Something that has occurred to a colleague and me: the DCC is configured to hand the OCR task to the worker process, and a search result of 1500 documents is returned in milliseconds. Once that queue has been handed over to the worker process, is it not possible that the workflow will end more or less immediately while the OCR continues to run for days? (Presumably not, but at any rate you get my drift.)

 

Can anyone enlighten me about how these interactions are governed, and what the best practice would be for containing the load of a long backlog OCR like this alongside routine maintenance and backups that must be done at pre-determined times?

I considered running the workflow in batches of 10 or 100 documents, with loops that somehow look at average execution time, but it all got too complicated given that I don't know how the backend processes interact anyway.

Also, when 6pm ticks over, how will a conditional end a process like this? Is there a distinction between a graceful completion of the currently queued OCR task and a kill -9 that may be problematic for the subsequent OCR when the workflow runs again?

Any suggestions would be very much appreciated,

 

Best regards,

Will


Replies

replied on July 3, 2015

Workflow does not wait for OCR to complete; it just hands the documents over to DCC. I would set the workflow schedule to start at 6 AM on Sat and Sun and run for 12 hours, which would make it stop sending more documents after 6 PM. During this interval, set it to repeat every few minutes, based on the average execution time for a batch of documents. I'd probably try it with 100 documents at a time per DCC node.
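As a rough illustration of the batching idea above, here is a minimal Python sketch. It is not Laserfiche Workflow code: the function names `in_window` and `next_batch`, the list of entry IDs, and the batch size constant are all assumptions made up for the example. It only shows the logic of "submit at most 100 documents per run, and nothing outside the 6 AM to 6 PM weekend window".

```python
from datetime import datetime, time

WINDOW_START = time(6, 0)    # 6 AM
WINDOW_END = time(18, 0)     # 6 PM
WEEKEND = {5, 6}             # datetime.weekday(): Saturday is 5, Sunday is 6
BATCH_SIZE = 100             # documents handed over per run (assumed value)


def in_window(now: datetime) -> bool:
    """True on Sat/Sun from 6 AM (inclusive) to 6 PM (exclusive)."""
    return now.weekday() in WEEKEND and WINDOW_START <= now.time() < WINDOW_END


def next_batch(pending_ids, now, batch_size=BATCH_SIZE):
    """Return the IDs to submit on this run; an empty list outside the window."""
    if not in_window(now):
        return []
    return pending_ids[:batch_size]


# Hypothetical backlog of 1500 entry IDs.
pending = list(range(1, 1501))
print(len(next_batch(pending, datetime(2015, 7, 4, 9, 30))))   # Saturday 9:30 AM -> 100
print(len(next_batch(pending, datetime(2015, 7, 4, 18, 5))))   # Saturday 6:05 PM -> 0
```

Note that this only limits what gets submitted. Since Workflow does not wait for OCR, documents already handed to DCC before 6 PM may still be processing after the window closes.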

replied on July 3, 2015

Thank you, that clarifies a lot.

 

What governs the cycle and ability to schedule downtime for the worker application so that it is not processing during the maintenance/backup periods for the server?

Especially as there is only one worker node - the scheduler itself.

 

And could you suggest an efficient way to send 100 documents at a time please? 

 

Cheers,

Will

replied on July 3, 2015

I expect this is what I am looking for:

https://www.laserfiche.com/support/webhelp/Laserfiche/9.1/en-US/AdminGuide/LFAdmin.htm#DCC/SchedulingDCC.htm%3FTocPath%3DLaserfiche%2520Administration%2520Guide%7CLaserfiche%2520Distributed%2520Computing%2520Cluster%7C_____4

I need to add a scheduler on the server:

 

"Each task you create in Task Scheduler will run a script. If you're scheduling DCC to run during off-peak hours, for example, you'll need scripts to enable and disable DCC workers at appropriate times each day."

 

Disable all workers other than the scheduler.

Import-Module LfDccClusterAdmin
Get-DccWorker | where { -not $_.IsScheduler } | Disable-DccTaskExecution

replied on July 4, 2015

Before I go ahead and try to script this: what is the best way to stop the current DCC OCR queue? Every time I kill a task, the next one in the queue picks up, because the worker is configured to run 2 concurrent tasks. It has now been running for 16 hours and is at 49 of 940.

 

Will

 

edit: I have gone into the Jobs of the DCC in web admin and cancelled both yesterday's and today's workflow-launched "jobs". They now show as cancelled, but the CPU is still tracking at 100%.

Would this be because it is completing the last task, so that if the average task completion time is 10 min I need to wait for it to finish before seeing the load drop?
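For a sense of scale, the progress figures above (49 of 940 documents after 16 hours) can be turned into a rough estimate of the remaining time. This is a minimal Python sketch with a made-up helper name, `estimate_remaining_hours`, and it assumes throughput stays constant, which may well not hold when documents vary from 1 to 1000 pages.

```python
def estimate_remaining_hours(done, total, elapsed_hours):
    """Linear estimate of hours left, assuming constant documents-per-hour."""
    if done <= 0 or elapsed_hours <= 0:
        raise ValueError("need some completed work to estimate a rate")
    rate = done / elapsed_hours          # documents per hour so far
    return (total - done) / rate


remaining = estimate_remaining_hours(done=49, total=940, elapsed_hours=16)
print(f"{remaining:.0f} hours, about {remaining / 24:.1f} days")
# prints: 291 hours, about 12.1 days
```

At that rate a single 12-hour weekend window would cover only a small part of the backlog, which is worth knowing before settling on a schedule.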

replied on July 6, 2015

To prevent the task from running for more than a certain number of hours, I believe you can change this in the Task Scheduler of the machine that is the worker. The default in Task Scheduler is to run the DCC for 3 days. I have seen this issue before and have reduced it to 4-hour intervals. The only drawback is that a document may end up with only some of its pages OCRed when the cut-off hits. Also change the number of concurrent tasks to one or two fewer than the number of cores on the machine. I have seen it lock up at times when the number of jobs equals the number of cores.
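The "cores minus one or two" rule of thumb above amounts to the following. This is a minimal Python sketch with a hypothetical helper name, `suggested_concurrent_tasks`; it only computes a number to type into the worker's configuration and does not touch any DCC setting itself.

```python
import os


def suggested_concurrent_tasks(cores, reserve=1):
    """Leave `reserve` cores free for the OS and other services, never below 1 task."""
    return max(1, cores - reserve)


cores = os.cpu_count() or 1   # os.cpu_count() can return None
print(f"{cores} cores -> {suggested_concurrent_tasks(cores)} concurrent tasks")
```

On a machine that also acts as scheduler and hosts other services, reserving two cores (`reserve=2`) may be the safer starting point.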
