Question

DCC OCR taking 50+ minutes for a single page

asked on February 24, 2020

The OCR engine is overloaded and timing out at 50 minutes on a single document. The error in event viewer is:

Task ID: 1302.1302.2, Task Type: OCR.OCR, Host Name: HOST
Context Message: Running OCR exceeded its timeout of 600000 milliseconds.
Running OCR exceeded its timeout of 600000 milliseconds.

The documents that it's OCRing are difficult, with a repeated text watermark and a laurel border. The workflow which sends documents to the scheduler is set to OCR in standard mode, so I'm not sure if that's a part of it. The worker machine is allowed to run two tasks concurrently, and it looks like each task is only allowed to use 25% of the CPU.

The machine is running with 12 GB of memory and has 4x E5-2699 cores @ 2.30 GHz. It seems like that should be enough, but maybe not if each task is limited to 25%. Right now it's showing 75% CPU usage across three processes, and 4 GB of memory in use.

My questions are:

1) Is it possible to manually increase the limited CPU resources that a single OCR process can take?

2) Does this machine have enough resources to handle OCR tasks?

3) Is there anything I'm not thinking of that could be causing the OCR processes to time out?

Replies

replied on February 25, 2020

Hi Brian,

To your questions:

  1. No, you cannot manually increase the CPU resources allocated to a single OCR task.
    OCR tasks are single-threaded (at least on a per-page basis). The "25% of the CPU" you're seeing is 100% utilization of a single core. With your current cap of two concurrent tasks, DCC should never exceed 50% CPU utilization ((100% x 2 tasks) / 4 cores) on that machine. It's interesting that you see three processes using a total of 75%. I don't think that should be happening.
     
  2. Yes, that machine has plentiful resources. It especially has more memory than it needs, as you likely noticed. DCC is almost always compute-bound. I will note that I've experienced better throughput with two 2-core/4 GB RAM DCC Workers than one 4-core/8 GB RAM one. Seems DCC benefits from having more nodes to dynamically rebalance tasks across.
     
  3. Not off the top of my head. Here are a few things I would try:
    - If your difficult documents are multi-page, try splitting off a single page as a new entry and sending only that page to DCC. Does it still time out?
    - Try changing the OCR mode in the Workflow "Schedule OCR" activity from "Accuracy" to "Speed" (or vice versa). Does that make a difference?
    - Throw some "easy" test docs through, like a TIFF render of a text file. Do they still time out?
    - Check the event logs on the DCC machine for other messages that may give insight into the reason for the timeout. Check both the DCC application log and the Windows Application/System logs.
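
To make the CPU arithmetic from point 1 concrete, here is a minimal Python sketch. It assumes each OCR task saturates exactly one core and nothing else on the machine is using CPU; the function name is just for illustration and isn't part of DCC.

```python
def expected_cpu_percent(concurrent_tasks: int, cores: int) -> float:
    """Upper bound on total CPU %, assuming each task pins exactly one core."""
    # A single-threaded task can't use more than one core, and we can't
    # keep more cores busy than the machine has.
    busy_cores = min(concurrent_tasks, cores)
    return 100.0 * busy_cores / cores


print(expected_cpu_percent(1, 4))  # one task on 4 cores  -> 25.0
print(expected_cpu_percent(2, 4))  # configured cap of 2  -> 50.0
print(expected_cpu_percent(3, 4))  # what you're observing -> 75.0
```

Under those assumptions, 75% on a 4-core machine lines up with three single-threaded processes running at once, which is one more than the configured cap of two.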

 

Hope something there helps.

Cheers,

Sam

replied on February 26, 2020

Hi Sam,

Thanks for the input. I was hoping it was something obvious I could fix relatively easily but it looks like I'll have to open up a case for this one.

The document that it's failing on is a single page 8.5x11 image. The only difficult pieces are the laurel border and the watermark. The workflow is pushing the documents through on standard mode. I haven't tried any 'easy' documents yet but will be sure to do so. I've checked all the logs related to DCC and the only errors are the timeout errors, no clues as to why the timeouts are happening.

Thanks again,

Brian

replied on February 26, 2020

The OCR is a black box to us, unfortunately. I spoke with a colleague who mentioned they had seen a page like that before.

They said "it was a page with a background with a ton of tiny dots that OCR was trying to process individually, causing it to take almost an hour."

Based on that case, I strongly suspect your laurel border is the culprit. I'd try cropping it out and running the cropped page through as a test. If that works, you might have an option in using Quick Fields' Zone OCR and Image Enhancement/cleanup features.
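
If you want to script the crop test rather than do it by hand, a rough sketch of working out the crop rectangle might look like this. The 300 DPI scan resolution and the 0.75 inch border width are guesses on my part, not values from your document, and the helper name is made up for the example.

```python
def inner_crop_box(width_in: float, height_in: float, dpi: int, margin_in: float):
    """Return (left, upper, right, lower) in pixels, trimming margin_in from every side."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    margin_px = round(margin_in * dpi)
    # Refuse margins that would leave nothing of the page.
    if 2 * margin_px >= min(width_px, height_px):
        raise ValueError("margin is too large for this page size")
    return (margin_px, margin_px, width_px - margin_px, height_px - margin_px)


# Assumed values: 8.5x11 page scanned at 300 DPI with a 0.75 inch border.
box = inner_crop_box(8.5, 11, 300, 0.75)
print(box)  # (225, 225, 2325, 3075)
```

The tuple is in the (left, upper, right, lower) order that common imaging libraries such as Pillow's `Image.crop` expect, so you can pass it to whatever tool you use to produce the test image. Adjust the margin until the border is fully gone.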
