Question

Questions about Troubleshooting Distributed Computing / long running jobs

Distributed Computing Cluster

Updated April 30, 2014

asked on April 29, 2014

I am not sure why an OCR job on a single document would take over 11 days. When we run into this, how do we see the status of the job?

Is it waiting on input somewhere?

Also, are there any logs that show whether errors occurred? The application does not appear to post to the local Event Viewer on either the scheduler machine or the worker machine.

For example, if we get a status of "error", what would the next troubleshooting step be?

Answer

SELECTED ANSWER

replied on April 29, 2014

DCC logs to the Applications and Services Logs section of Event Viewer.

replied on April 29, 2014

Aha! That will be very helpful. I was looking under the Application log. I will check this straight away.

Replies

replied on April 29, 2014

You can use the Web Administration Console to view DCC jobs. Clicking on a specific job ID will give you additional details and even show you the current task progress. If you see that it's stuck on a task, then check the "Machines" node in Web Admin and confirm that there are workers available that can perform tasks.

As for where to find additional information about jobs, open Event Viewer and look at the DCC logs. DCC has its own set of event logs under Applications and Services Logs > Laserfiche > Distributed Computing Cluster > DCC Service, where you will see three logs: Admin, Developer, and Operational.

For an issue with the jobs themselves, you'll want to look at the Operational and maybe Developer logs.
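
If you would rather pull these events from PowerShell than click through Event Viewer, something like the following may work. Treat it as a sketch only: Get-WinEvent is a standard Windows cmdlet, but the '*Laserfiche*' wildcard is an assumption about how the DCC log names are registered on your machine, so check what the first command returns before relying on the rest.

```powershell
# Sketch: find the DCC event logs and show their recent errors and warnings.
# Assumption: the DCC log names contain "Laserfiche"; adjust the wildcard if not.
$logs = Get-WinEvent -ListLog '*Laserfiche*' -ErrorAction SilentlyContinue

# Show which logs were found and how many records each holds.
$logs | Format-Table LogName, RecordCount -AutoSize

foreach ($log in $logs) {
    # Level 2 = Error, Level 3 = Warning.
    # SilentlyContinue suppresses the "no events found" error for empty logs
    # and skips logs that cannot be queried this way.
    Get-WinEvent -FilterHashtable @{ LogName = $log.LogName; Level = 2, 3 } `
                 -MaxEvents 50 -ErrorAction SilentlyContinue |
        Format-Table TimeCreated, LevelDisplayName, Id, Message -AutoSize -Wrap
}
```

Run it from an elevated PowerShell prompt on the scheduler or a worker. If the first command lists nothing, the wildcard guess is wrong for your install, and Event Viewer is the safer route.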

replied on April 29, 2014

Hey Chad,

Issues like this tend to happen when the scheduler and the worker disagree on the execution status of a job. See if running the following PowerShell commands on the scheduler causes the stalled job to change its status:

Import-Module LfDccClusterAdmin
Get-DccWorker | Disable-DccTaskExecution
Get-DccWorker | Enable-DccTaskExecution

This is essentially a DCC reboot. It gets all of your worker machines and disables their task execution, then gets them all again and re-enables it. If you have task execution disabled on your scheduler (a good practice in most cases), change the last line to

Get-DccWorker | where {-not $_.IsScheduler} | Enable-DccTaskExecution

The where command filters out your scheduler so that task execution is not enabled on it.

If this does cause a change in the task execution status of the job, the job will likely fail immediately because its token to access Laserfiche will have expired. Nonetheless, I'd be interested to see if this works for you.

Note that the above PowerShell commands can be run on an active cluster with no issues. Tasks that are currently executing will continue without interruption; at worst, you will get a few seconds of idle time per worker.

replied on April 29, 2014

Thank you, this is good to know.

Is there a document with the complete list of commands for the LfDccClusterAdmin and LfDccLocalNodeAdmin modules?

replied on April 29, 2014

Full listings of the PowerShell cmdlets for DCC are in the Administration Guide, in the section on Administering Distributed Computing Cluster through Windows PowerShell.

replied on April 30, 2014

Thanks, perfect!

replied on April 29, 2014

It's probably best just to terminate the job. There were some issues with the preview that would cause it to hang on certain tasks.

I just removed it so I could rebuild the cluster, but there are events related to DCC in the event logs.
