
Question


DCC OCR'ing PDFs

asked on October 25, 2016

What I am trying to do is drag and drop a PDF into a repository without selecting the 'generate searchable text' option. 'Generate pages' is still selected.

I then want to run a Workflow that schedules the OCR to run after hours. However, the PDF never seems to finish processing on the DCC. I let it run for 90 minutes before cancelling it. The PDF is only 10 pages. I have tried other PDFs of varying sizes and they don't process either. DOCX and TIFF files process through the same DCC workflow in minutes.

I have tried the same process on three different Laserfiche Servers running v10.1 Update 2.

Is this a feature, a fault, or am I doing something wrong?


Answer

SELECTED ANSWER
replied on November 1, 2016

@████████

I opened a support ticket. The answer is that this is a known issue with the DCC and colour PDFs. The workaround is to deselect the 'Despeckle' setting. Apparently, this will be fixed in a future release.

Jonathan


Replies

replied on October 26, 2016

Hi Jonathan,

Are these text PDFs or image PDFs? If they are image PDFs, you might find that they don't have a text layer and DCC cannot OCR them.

replied on October 26, 2016

DCC only OCRs images; it does not generate image pages for PDFs, nor does it extract text from PDFs. If the goal is to have these PDFs searchable in the repository, you can install the PDF IFilter on the Laserfiche Server (or whichever machine has the Laserfiche Full-Text Search Engine installed) and the search engine will index them based on their text. The caveat there is, as Chris said above, that the PDFs need to have a text layer.

Edit: Sorry, missed the part where you were generating pages. Do these PDFs actually get image pages on import? What is the image resolution? If you try to OCR one of these documents through the Laserfiche Client, do they behave the same way as when you're OCRing them through DCC?

replied on October 26, 2016

Hi @████████

I have tried all sorts of PDF files. It is quite easy to replicate. I used one of the Laserfiche PDF guides as a test. Drag and drop the file into the repository, then run a workflow on it that contains only the Schedule OCR activity. One issue is that even if the PDF can't be OCR'ed, the job never seems to time out either. The jobs just stay in the DCC queue as 'in progress' 1/2.

 

replied on November 1, 2016

At this point, it's probably best if you open a case with Tech Support so we can take a closer look at DCC's error messages.
