Question

distributed computing cluster job failed

asked on May 21, 2014

 

I have created a simple workflow that uses the Schedule OCR activity to send the results of a Search Repository activity to LFDCC.

My LFDCC consists of 3 nodes (one scheduler/worker and 2 workers). The job started and ran for a while, then went into the Failed status.

 

  • 2
  • Failed
  • OCR Job
    -Repository: xxx
    -Entry ID's: 13, 15, 56, 59, 64, 75, 81, 90, 91, 92 ...
    
    OcrEngineOptions:
        Decolumnize: False
        LanguageTag: 
        Language: English
        OptimizationMode: Accuracy
        OcrEntriesInSubFolders: False
        AutoOrient: True
        PerformImageCleanup: False
        SkipPagesThatAlreadyHaveText: False
    
    ImageCleanupOptions:
        Deskew: False
        Despeckle: False
        SpeckleSizeInPixels: 0
        Rotate: False
        RotationAmountInDegrees: 0
        HorizontalLineRemoval: False
        VerticalLineRemoval: False
        LineRemovalCharProtection: False
    
  • 5/21/2014 3:11:58 PM
  • 1 h 11 m 47 s 
  • Laserfiche Workflow Server 9.1
  • xxx
  • 169/611

 

Machines

 

node #1

  • 5/16/2014 2:56:10 PM
  • 0
  • 85
  • 0
  • 4 minutes 28.1 seconds
  • 5 minutes 35.2 seconds

 

node #2

  • 5/16/2014 2:56:45 PM
  • 0
  • 134
  • 0
  • 7 minutes 14.5 seconds
  • 6 minutes 19.4 seconds 

 

node #3

  • 5/16/2014 2:55:05 PM
  • 0
  • 113
  • 5
  • 8 minutes 1.6 seconds
  • 7 minutes 1.7 seconds 

 

 

I have looked in the LFDCC logs on all 3 machines, as suggested in this thread:

 

https://answers.laserfiche.com/questions/55177/Questions-about-Troubleshooting-Distributed-Computing--long-running-jobs

 

 

However, I couldn't find any errors in the operational log on any of the machines.

 

Interestingly, the third machine still had some OCR processes running and using CPU, and both non-scheduler machines had a large number of idle NuanceLS.exe processes.

Replies

replied on May 21, 2014

Node #3 has this in its DCC Service Developer log:

Task execution has been halted by the system because its task executor reported that it was no longer making progress.

replied on May 22, 2014

That particular error is caused when OCRing a single page takes more than 10 minutes. Do your documents contain any particularly large pages?
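To illustrate why only the slowest single page matters for this error, here is a minimal Python sketch of a progress watchdog. It is not LFDCC's implementation: `ProgressWatchdog`, `first_stalled_page`, and the simulated per-page durations are hypothetical, and the only detail taken from this thread is the 10-minute limit. The clock resets each time a page finishes, so a document trips the watchdog only if one page exceeds the limit, regardless of how many pages the document has.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed per-page limit, taken from the reply above; the real LFDCC
# value and mechanism may differ.
PAGE_TIMEOUT_SECONDS = 10 * 60


@dataclass
class ProgressWatchdog:
    """Flags a task as stalled when no progress is reported within a window.

    Times are passed in explicitly (in seconds) so the logic is easy to test.
    """
    timeout: float = PAGE_TIMEOUT_SECONDS
    last_progress: float = 0.0

    def start(self, now: float) -> None:
        self.last_progress = now

    def report_progress(self, now: float) -> None:
        # Called each time a page finishes; this resets the clock.
        self.last_progress = now

    def is_stalled(self, now: float) -> bool:
        return now - self.last_progress > self.timeout


def first_stalled_page(
    page_durations: Iterable[float], timeout: float = PAGE_TIMEOUT_SECONDS
) -> Optional[int]:
    """Return the 1-based number of the first page whose OCR time exceeds
    the timeout, or None. The total number of pages plays no role."""
    watchdog = ProgressWatchdog(timeout=timeout)
    now = 0.0
    watchdog.start(now)
    for page_number, duration in enumerate(page_durations, start=1):
        now += duration
        if watchdog.is_stalled(now):
            return page_number
        watchdog.report_progress(now)
    return None
```

For example, `first_stalled_page([30] * 100)` returns `None` (100 quick pages never trip it), while `first_stalled_page([30, 30, 700])` returns `3`.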

replied on May 23, 2014

I don't think so; they are all letter size. I have also switched the optimization mode to Standard, and it still happens.

I disabled node #3. It went well for a few small jobs, and then the same thing happened with node #2.

Would it be better to use schedulers only for now?
replied on May 23, 2014

Your tasks seem to be taking a long time to complete. Do the documents have a lot of pages? And what kinds of machines are you using? How many cores? Also, how many documents are included in your Workflow job?

 

We're investigating the issue with NuanceLS.exe, though it does not seem to be causing any issues that would prevent the cluster from continuing to run. And we are aware of the issue with hanging OCR processes and are planning to include a fix for it in the next release. In the meantime, you may want to periodically end any long-running LfOmniOCR.exe or NuanceLS.exe processes to make sure that the machines have resources.
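The workaround of periodically ending long-running LfOmniOCR.exe or NuanceLS.exe processes could be scripted. The following Python sketch only selects candidates; the helper names and the 30-minute threshold in the usage note are assumptions, and `snapshot_processes` relies on the third-party psutil package. Ending a process this way may also abort an OCR task that was still legitimately working, so the threshold should sit well above normal task times.

```python
import time
from typing import Iterable, List, NamedTuple, Optional

# Process names mentioned in the thread; matched case-insensitively.
OCR_PROCESS_NAMES = {"lfomniocr.exe", "nuancels.exe"}


class ProcInfo(NamedTuple):
    pid: int
    name: str
    create_time: float  # seconds since the epoch


def find_stale_ocr_processes(
    processes: Iterable[ProcInfo],
    max_age_seconds: float,
    now: Optional[float] = None,
) -> List[ProcInfo]:
    """Return OCR-related processes running longer than max_age_seconds.

    Selection only; nothing is terminated here.
    """
    if now is None:
        now = time.time()
    return [
        p
        for p in processes
        if p.name.lower() in OCR_PROCESS_NAMES
        and now - p.create_time > max_age_seconds
    ]


def snapshot_processes() -> List[ProcInfo]:
    """Collect live process info using psutil (installed separately)."""
    import psutil  # imported lazily so the selection logic has no dependency

    snapshot = []
    for proc in psutil.process_iter(["pid", "name", "create_time"]):
        info = proc.info
        # Fields are None when access to a process is denied; skip those.
        if info["name"] and info["create_time"] is not None:
            snapshot.append(ProcInfo(info["pid"], info["name"], info["create_time"]))
    return snapshot
```

A scheduled script might then call `psutil.Process(p.pid).terminate()` for each entry returned by `find_stale_ocr_processes(snapshot_processes(), 30 * 60)`.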

replied on May 23, 2014

Do you have any BPSession81 processes running on node #3? I recall having a similar problem with QF BarCode and QF Agent. Within a few hours, the machine ended up with many BPSession81 processes, tons of NuanceLS.exe processes, and a few OCR processes running. The problem was that BPSession81 crashed when processing a lot of pages, and each subsequent scheduled OCR task kept adding processes, driving CPU utilization to 100%.

replied on May 23, 2014

BPSession81 is only used by Quick Fields.

replied on May 23, 2014

To follow up on my previous post, this problem was resolved in BarCode 9.1.1. It might not be the same problem, but some of the processes, like NuanceLS.exe, are the same, so I thought it might help.

replied on June 2, 2015

Was any progress made on the NuanceLS.exe issue? We have an app that OCRs through SDK calls, and NuanceLS.exe is being left open as a result.

replied on June 2, 2015

No, the problem still exists. It's more stable if you ensure that only one OCR job runs at a time on the computer.
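Limiting a machine to one OCR job at a time can be sketched with a bounded semaphore. This Python example is an illustration under assumptions, aimed at an application that starts OCR from its own code (such as the SDK app mentioned above): `run_ocr_serially` is a hypothetical wrapper, and it only serializes jobs started from within the same process. Coordinating separate processes would need an OS-level lock, such as a named mutex, instead.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# One permit: at most one OCR job runs at a time within this process.
_ocr_slot = threading.BoundedSemaphore(1)


def run_ocr_serially(job: Callable[[], T]) -> T:
    """Run a (hypothetical) OCR job callable, blocking until no other job
    started through this function is running."""
    with _ocr_slot:
        return job()
```

Each OCR call is then wrapped, e.g. `run_ocr_serially(lambda: ocr_document(entry_id))`, where `ocr_document` stands in for whatever SDK call the application actually makes.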

replied on July 23, 2014

I am also getting this issue and can't find any way around it. Why ask how many pages the job contains if the timeout is based on a single page? If it takes over 10 minutes to OCR a single page, it wouldn't matter whether it was page 1 of 1 or page 1 of 100.

So all we know is that on some page the NuanceLS.exe process stopped responding. Since NuanceLS.exe doesn't appear to write to the event log, we don't know what to look at next.