Question

distributed computing cluster job failed

Workflow Distributed Computing Cluster Laserfiche

Updated June 2, 2015

asked on May 21, 2014

I have created a simple workflow that uses the Schedule OCR activity to take the results from a Search Repository activity and send them to LFDCC.

My LFDCC consists of 3 nodes (one scheduler/worker and 2 workers). The job started and ran for a while, then went into the Failed status.

Job ID
```
2
```
Job Status
```
Failed
```

Job Description

OCR Job
-Repository: xxx
-Entry ID's: 13, 15, 56, 59, 64, 75, 81, 90, 91, 92 ...

OcrEngineOptions:
    Decolumnize: False
    LanguageTag: 
    Language: English
    OptimizationMode: Accuracy
    OcrEntriesInSubFolders: False
    AutoOrient: True
    PerformImageCleanup: False
    SkipPagesThatAlreadyHaveText: False

ImageCleanupOptions:
    Deskew: False
    Despeckle: False
    SpeckleSizeInPixels: 0
    Rotate: False
    RotationAmountInDegrees: 0
    HorizontalLineRemoval: False
    VerticalLineRemoval: False
    LineRemovalCharProtection: False

Start Time
```
5/21/2014 3:11:58 PM
```
Duration
```
1 h 11 m 47 s 
```
Source Application
```
Laserfiche Workflow Server 9.1
```
Source Machine
```
xxx
```
Task Progress
```
169/611
```

Machines

| | node #1 | node #2 | node #3 |
|---|---|---|---|
| Service Start Date | 5/16/2014 2:56:10 PM | 5/16/2014 2:56:45 PM | 5/16/2014 2:55:05 PM |
| Assigned Tasks | 0 | 0 | 0 |
| Successful Tasks | 85 | 134 | 113 |
| Task with Terminating Errors | 0 | 0 | 5 |
| Average Task Duration (Total) | 4 minutes 28.1 seconds | 7 minutes 14.5 seconds | 8 minutes 1.6 seconds |
| Average Task Duration (Last 50 Tasks) | 5 minutes 35.2 seconds | 6 minutes 19.4 seconds | 7 minutes 1.7 seconds |

I have looked in the LFDCC logs for all 3 machines as per this thread.

https://answers.laserfiche.com/questions/55177/Questions-about-Troubleshooting-Distributed-Computing--long-running-jobs

But I couldn't find any errors in the operational log on any of the machines.

Interestingly, the 3rd machine still had some OCR processes running and using the CPU, and both non-scheduler machines had tons of idle NuanceLS.exe processes.

Replies

replied on May 21, 2014

Node #3 has this in its DCC Service Developer log:

Task execution has been halted by the system because its task executor reported that it was no longer making progress.

replied on May 22, 2014

That particular error is caused when OCRing a single page takes more than 10 minutes. Do your documents contain any particularly large pages?
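As a rough illustration of what a per-page progress timeout means, here is a minimal Python sketch. It is not Laserfiche's implementation; the function name `first_stalled_page` and the timestamp inputs are hypothetical, and only the 10-minute figure comes from the reply above.

```python
from datetime import datetime, timedelta

# The 10-minute limit is the figure quoted in this thread.
PAGE_TIMEOUT = timedelta(minutes=10)

def first_stalled_page(page_done_times, task_start, now, timeout=PAGE_TIMEOUT):
    """Return the 1-based number of the first page that exceeded the timeout.

    page_done_times: completion timestamps of the pages finished so far, in order.
    Returns None if no page (including the one in progress) has exceeded it.
    """
    previous = task_start
    for page_number, done in enumerate(page_done_times, start=1):
        if done - previous > timeout:
            return page_number
        previous = done
    # The page currently being processed has no completion time yet.
    if now - previous > timeout:
        return len(page_done_times) + 1
    return None

start = datetime(2014, 5, 21, 15, 0)
finished = [datetime(2014, 5, 21, 15, 1), datetime(2014, 5, 21, 15, 3)]
print(first_stalled_page(finished, start, datetime(2014, 5, 21, 15, 14)))  # 3
```

Under this model the total page count does not matter: a single page that stops making progress is enough to halt the task.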

replied on May 23, 2014

I don't think so; they are all letter size. I have switched the optimization mode to standard and it still happens.

I disabled node #3; it went well for a few small jobs, and then the same thing happened with node #2.

Would it be better to use schedulers only for now?

replied on May 23, 2014

Your tasks seem to be taking a long time to complete. Do the documents have a lot of pages? And what kinds of machines are you using? How many cores? Also, how many documents are included in your Workflow job?

We're investigating the issue with NuanceLS.exe, though it does not seem to be causing any issues that would prevent the cluster from continuing to run. And we are aware of the issue with hanging OCR processes and are planning to include a fix for it in the next release. In the meantime, you may want to periodically end any long-running LfOmniOCR.exe or NuanceLS.exe processes to make sure that the machines have resources.
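The periodic cleanup suggested above could be sketched along these lines in Python. This is only an assumption-laden outline: the function `select_stale`, the 30-minute cutoff, and the `(pid, name, started_at)` snapshot format are all hypothetical, and the sketch only selects candidates. Obtaining the process list and actually ending the processes (for example with Task Manager or `taskkill`) is left to the administrator.

```python
from datetime import datetime, timedelta

# Process names are taken from this thread; compared case-insensitively.
OCR_PROCESS_NAMES = {"lfomniocr.exe", "nuancels.exe"}

def select_stale(processes, now, max_age=timedelta(minutes=30)):
    """Return the PIDs of OCR helper processes that started more than max_age ago.

    processes: iterable of (pid, name, started_at) tuples (hypothetical format).
    max_age: assumed cutoff; choose a value well above a normal task duration.
    """
    stale = []
    for pid, name, started_at in processes:
        if name.lower() in OCR_PROCESS_NAMES and now - started_at > max_age:
            stale.append(pid)
    return stale

now = datetime(2014, 5, 21, 16, 30)
snapshot = [
    (101, "NuanceLS.exe", datetime(2014, 5, 21, 15, 12)),   # 78 minutes old
    (102, "LfOmniOCR.exe", datetime(2014, 5, 21, 16, 25)),  # 5 minutes old
    (103, "explorer.exe", datetime(2014, 5, 16, 14, 55)),   # unrelated process
]
print(select_stale(snapshot, now))  # [101]
```

Keeping the selection separate from the termination step makes it easy to review which processes would be ended before doing anything destructive on a worker node.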

replied on May 23, 2014

In node #3, do you have any BPSession81 processes running? I recall having a similar problem with QF BarCode and QF Agent. Within a few hours it ended up with many BPSession81 processes, tons of NuanceLS.exe processes, and a few OCR processes running. The problem was that BPSession81 crashed when processing a lot of pages, and each subsequent scheduled OCR task just kept adding processes, driving the CPU to 100% utilization.

replied on May 23, 2014

BPSession81 is only used by Quick Fields.

replied on May 23, 2014

To follow up on my previous post, that problem was resolved with BarCode 9.1.1. It may not be the same problem, but some of the processes, like NuanceLS.exe, are the same, so I thought it might help.

replied on June 2, 2015

Was any progress made on the NuanceLS.exe issue? We have an app that OCRs through SDK calls, and NuanceLS.exe is being left open as a result.

replied on June 2, 2015

No, the problem still exists. It's more stable if you ensure that only one OCR job runs at a time on the computer.

replied on July 23, 2014

I am getting this issue too and can't find any way around it. Why are there inquiries about how many pages the job contains if the timeout is based on a single page? If it takes over 10 minutes to OCR a single page, then it wouldn't matter whether it was page 1 of 1 or page 1 of 100.

So all we know is that on some page the NuanceLS.exe process stopped responding. Since NuanceLS.exe doesn't appear to write to the event log, we don't know what to look at next.
