Question

Keeping Import agent from extracting text - updated

Import Agent

Updated May 22, 2015

asked on May 12, 2015

I am trying to get Import Agent 9 to import PDFs so that we can generate pages without extracting the text.

I've tried multiple settings in Import Agent, but no matter what I do it always seems to extract the PDF's text information into the system. I want just plain TIFFs put in the system so that an OCR process in Quick Fields can then OCR the images (since Import Agent 9 will not generate pages AND OCR at the same time).

The issue is that the original PDFs have errors in their embedded text, and those errors cause problems all the time.

Also, if this is not possible, are there any suggested third-party solutions to do this outside of Laserfiche?

Answer

SELECTED ANSWER

replied on May 22, 2015

Import Agent is processing a PDF, generating pages from it, and storing those image pages along with text pages that were extracted from the embedded text of the PDF.

Quick Fields is configured to grab only the images, OCR them, and store the result as a new document.

The process itself is working fine. The real issue is that some of those image pages (originally generated by Import Agent's processing of the PDF) produced incorrect text when OCR'ed with the decolumnize option disabled. With decolumnize on, the text was generated fine. Also, if Import Agent was configured to generate pages from the PDF at a different DPI, e.g. 351, those problematic pages could be OCR'ed correctly with decolumnize off. This is the issue that we'll be looking into.

Replies

replied on May 12, 2015

Is there a reason why those documents can't still be OCR'ed in Quick Fields? The existing text should just get overwritten with the new OCR'ed text.

replied on May 13, 2015

I am actually OCRing those documents in Quick Fields afterwards. However, Quick Fields doesn't seem to OCR the document again if it already has text associated with it. It has the exact text that is shown on the original PDF. If I export the document as a TIFF and pull it back in again with Import Agent, it OCRs fine. Obviously this is an issue with quality, though.

I can't provide examples of this at the moment. The documents in question are sensitive, so I can't pass them along. However, it seems to be reproducible. If you add an image with legible text as a stamp onto an existing PDF with text in Acrobat and then use Import Agent to import it, you'll see the text minus the stamp. If you then OCR it with Quick Fields, you'll see the exact same text.

I'm almost wondering if there is a way in the SDK to tell it to drop all current OCR information so that Quick Fields actually OCRs it. That way I don't have to worry about issues that may crop up from exporting and reimporting.

replied on May 13, 2015

If you're not able to have Quick Fields re-OCR a document and replace existing text with new text, then it may be a configuration issue in your Quick Fields session.

Also, is Import Agent importing the PDF along with the generated text (instead of discarding the PDF and just keeping the images) and could Quick Fields be inadvertently configured to process the PDF rather than just the images?

To confirm that Quick Fields is able to overwrite existing text with new OCR'ed text, you can import one of the SAMPLE TIFF files from the Laserfiche Client program files folder. OCR that document in the Client, then edit the text in some fashion. Finally, have Quick Fields process the document and re-OCR it. When it stores the document back to the repository, you should see that the text has been overwritten.

replied on May 13, 2015

Quick Fields can overwrite the existing text pages for PDFs.

replied on May 13, 2015

I have made sure that the Laserfiche document only has a TIFF file associated with it. I have the Import Agent session set to generate pages from PDFs and to keep only those pages.

Inside Quick Fields, I have the "retrieve electronic document" field unchecked, so it shouldn't be getting the PDF from there.

I have not tried just importing the PDFs and then letting Quick Fields convert them. I'm hesitant to do that because occasionally some of these PDFs have security on them and fail (in which case they go into the IAerror folder in Import Agent). With Quick Fields I'd be worried that they would be lost.

I won't have access to the system until later this week or early next, but I can try some of these options and see what happens.

replied on May 15, 2015

Ok, so I've got some screenshots...

First, here are Import Agent's pertinent settings:

Second, here's Quick Fields:

So with these settings it does not reprocess the OCR information. It simply puts the exact same text that was extracted from the PDF onto the final text for that page after running through Laserfiche. It does NOT replace the text that's there.

However, I discovered quite by accident that it worked if I set the DPI in Import Agent. I had Import Agent set like this:

With that setting, 2 out of my 3 test documents failed exactly as outlined above, but the third did not. I was able to determine that this may have been because the third PDF itself was created at 400 DPI.

I then set the system to an odd DPI:

When I ran this through, all of the documents worked.

My biggest concern with just leaving it like this is odd issues that might arise if the customer ever needed to print these.
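
As a rough sanity check on the print concern: printed size is pixel count divided by DPI, so as long as the generated TIFF records the resolution it was rendered at, an odd value such as 351 should still print at close to the original page size. Here is a minimal Python sketch of that arithmetic. It assumes a US Letter page (8.5 x 11 in), and the helper names are hypothetical; nothing here is part of Import Agent or Quick Fields.

```python
def pixels_for(width_in, height_in, dpi):
    """Pixel dimensions when a page of the given size (inches) is rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def printed_size(width_px, height_px, dpi):
    """Physical size in inches when the image is printed at its stored DPI."""
    return width_px / dpi, height_px / dpi


if __name__ == "__main__":
    # Assumed page size: US Letter. The DPI values are the ones discussed in this thread
    # plus a common default for comparison.
    for dpi in (300, 351, 400):
        w_px, h_px = pixels_for(8.5, 11, dpi)
        w_in, h_in = printed_size(w_px, h_px, dpi)
        print(f"{dpi} DPI: {w_px} x {h_px} px -> {w_in:.2f} x {h_in:.2f} in")
```

The only difference an odd DPI makes in this calculation is sub-pixel rounding of the page width. Whether a given printer driver or viewer honors the stored resolution is a separate question that this sketch does not answer.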

replied on May 15, 2015

Could you open a support case and attach some sample images or PDFs?
