Question

PDF OCR via Import Agent and Search highlight in PDF?

Laserfiche

Updated April 3, 2014

asked on April 2, 2014

1. I thought Import Agent would OCR electronic documents, but it does not. I am using Laserfiche 9.1 and can probably use the new Workflow OCR function. Is there any other way to automate OCR for electronic documents, including PDFs?

3. When I search for words in a manually OCRed PDF, I don't see highlights on the words it finds. Is this typical?

Answer

SELECTED ANSWER

replied on April 3, 2014

DCC does not currently generate text or pages for electronic files such as PDFs; it only supports OCR of image files. We're working on adding that functionality, though. That's one of the reasons this is officially a DCC 'preview' - there are definitely more modules we want to add before we feel comfortable calling it a fully released product, but the modules that are there (and new ones as they are added) are suitable for production use.

Replies

replied on April 3, 2014

Have they definitely been indexed?

A colleague put together this table; it might be useful in this situation as well.

Server	Client	PDF Indexed
Adobe Reader 9	x	No
Adobe Reader 9 with ifilter 9	x	Yes
Adobe Reader 9 with ifilter 11	x	No
Adobe Reader XI	x	No
Adobe Reader XI with ifilter 9	x	Yes
Adobe Reader XI with ifilter 11	x	Yes
ifilter 9	Adobe Reader XI	Yes
ifilter 11	Adobe Reader XI	No

replied on April 3, 2014

There are a few things going on here.

First, Import Agent currently does not have the functionality to generate pages from PDFs. PDFs are not OCR'd in the same way that image files like TIFFs are - instead they are handled more like electronic files such as Word documents. If there is a text stream present and you run Generate Text, the Client will generate a text stream, but not images. If you run Generate Pages, the Client will generate both the image pages and (if selected) the text itself. We're looking into adding this functionality to Import Agent, but it's not there at present.

When you look at the document in the Client, is it showing up in the Image pane or in the Electronic File/PDF pane? You can generally tell by whether the pane has an Adobe Reader toolbar. If it's the latter, you don't actually have 'image pages' for this document in Laserfiche - instead the Laserfiche Client is just displaying the original PDF through an Adobe PDF frame. Without an image page, the Laserfiche Server cannot provide search highlights on the image.

In the Client, are you running Generate Text or Generate Pages? Try running Generate Pages; you should see actual image pages created and then get the search highlights.

As to OCR'ing through DCC, that only OCRs - it doesn't do text extraction. As such, it's not relevant to generating text from PDFs that don't already have associated image pages. That said, all Avante and Rio users currently have activation keys for the 9.1 Distributed Computing Component preview release with OCR, but you will need to be using Workflow 9.1 to use it.

replied on April 3, 2014

Hi Ben

We have set up Import Agent to bring the file in from a Windows folder. The PDF file arrives in the Laserfiche folder indexed but without being OCRed. (I would like OCR to be automated at this point.)

Then I manually OCR the document (thanks, I turned the ifilter on with the option to generate text and images without text stream). Now the search highlights the text in the OCRed text pane, not in the Adobe document image pane (on the actual page). I am not sure if this is the way it should be. The file is also indexed when it is brought in by Import Agent.

The other question I have: I need to automate the OCR portion. I think I need to use Laserfiche Distributed Processing. Are there any other options?

I have Workflow 9.0 running on the client side. Do I have to upgrade if I use Distributed Processing for Workflow OCR? Also, is Distributed Processing still in beta? And is a license required? (The client has Laserfiche Rio.)

replied on April 3, 2014

Thanks Justin, you covered most of my questions.

This time I generated pages first, then ran OCR on the PDF file. When I search for the word, I have the actual page, thumbnails, and OCRed text; only words in the Text pane (bottom right corner) are highlighted.

As per your question regarding the viewer: yes, the PDF file opens in the Laserfiche viewer (I have set it up in the Client's options as the default).

As for DCC, will it still OCR the PDF document even if the document is an electronic version? Will it treat image PDFs and electronic (searchable) PDFs the same way?

replied on April 3, 2014

What do you mean by image file? Will it still OCR static PDF files (non-searchable, e.g. scanned PDFs)?

Do you have an ETA for the Import Agent and DCC releases? Will a license be included with Rio once the final version of DCC is released?

Sorry for bombarding you with questions, Justin :-) ... I have to get back to the client with the answers.

replied on April 3, 2014

PDFs: When I say 'image' files I'm talking about types such as TIFF, BMP, JPG, PNG, etc. While static PDFs appear to be images, internally they are not and can't be directly OCR'd like the other types. That's why we have the native PDF page generation in the Client. So currently, no, DCC will not OCR any PDFs unless pages have already been generated for them. I believe page generation is currently possible in the Client and Quick Fields although, as noted, we're hoping to have it in Import Agent soon.
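
The rule described above can be written out as a small decision function. This is only an illustrative sketch in Python: `Document`, `has_image_pages`, and `dcc_can_ocr` are hypothetical names made up for the example and are not part of any Laserfiche API. File types the thread does not discuss are conservatively treated as not OCR-able, which is an assumption.

```python
from dataclasses import dataclass

# Image formats named in the reply above; illustrative, not an exhaustive list.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".bmp", ".jpg", ".jpeg", ".png"}


@dataclass
class Document:
    """Hypothetical stand-in for a repository entry (not a Laserfiche type)."""
    extension: str
    has_image_pages: bool = False  # True once Generate Pages has been run


def dcc_can_ocr(doc: Document) -> bool:
    """Return True if, per the rule described in this thread, DCC could OCR the document."""
    ext = doc.extension.lower()
    if ext in IMAGE_EXTENSIONS:
        # Image files are OCR'd directly.
        return True
    if ext == ".pdf":
        # PDFs are only OCR'd once image pages have been generated for them.
        return doc.has_image_pages
    # Assumption: other electronic file types are not covered by the thread.
    return False
```

For example, `dcc_can_ocr(Document(".pdf"))` is false for a freshly imported PDF, and becomes true only for `Document(".pdf", has_image_pages=True)`, which corresponds to running Generate Pages in the Client or Quick Fields first.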

Releases: I don't have time frames for either Import Agent or additional DCC modules, but I can say that we are actively working on them. DCC modules will come out as they are ready, although I don't know where PDF is in that list. I'm afraid I also can't say for sure what DCC licensing will look like for the full release, or where generating text/pages from PDFs fits in - it depends on the order of the various modules. We'll put information up on the product roadmap as we have specifics.

replied on April 3, 2014

Thanks Justin... Appreciate your help.
