Advanced PDF Import Options

asked on February 25, 2016

A customer importing PDFs from an MFP discovered that they were unable to OCR these documents. Using the "Generate images and text for PDF files without a text stream" option without also choosing "an alternative method" to generate text still leaves you with no text data in Laserfiche. Of course, choosing to always use OCR to generate text, even on PDFs that contain a text stream, is not ideal, if for no other reason than performance. Native text extraction is the optimal method where a text stream is available. "Generate images and text" plus "Use native text extraction" is an inoperable combination for PDFs without a text stream: it will always result in no text being generated, even though the user has pressed a button whose sole function is generating text. It seems like it should be possible to drive the text source decision on whether or not the PDF has a text stream.

When no text stream is available, there are only two viable options: use an alternative method to generate text, or raise an error indicating that no text will be generated. Silently failing to generate text is an unfortunate default behavior.

Should this be an SCR?

replied on February 25, 2016

So basically, you are "generating images" and text after you've imported it into LF already.

When you set “Generate images and text for PDF files without a text stream” option in combination with "Use native text Extraction" you get "This page contains no text, but text may be added."

After that, can you right click on the document again and select Generate Text? That works right?

replied on February 25, 2016

"After that, can you right click on the document again and select Generate Text? That works right?"

No, it appears that Laserfiche continues to go back to the PDF and attempt native text extraction. The first time you run "Generate Searchable Text" the pages will be generated and the "Text" pane is set to empty/null. If you then run "Generate Searchable Text" again, the result is the same. If you edit the Text pane and add text, that text will be overwritten with empty/null when you run "Generate Searchable Text" with this combination of options.

This makes sense because Laserfiche does not modify the PDF. No text stream is added to the file itself. The “Generate images and text” option only appears to initialize pages (I assume this is required for the full text search system) and text for use with Laserfiche's native PDF support implementation for what would otherwise be an "electronic document" that Laserfiche does not handle natively.

replied on February 25, 2016

That works for me with a sample PDF. Please attach the PDF to the case I opened for you. We will continue there.

replied on February 25, 2016

I only have so much time to dedicate to improving Laserfiche's product. I've attached a PDF with no text stream to this post which can be used to recreate this scenario. At this point I'm just looking for a route to an SCR which has two elements:

1) Change the "Generate images and text for PDF files without a text stream" condition to also drive the text source selection. Because a document that does not have a text stream will never contain data for the "Use native text extraction" method to consume, using the two in combination is pointless. Likewise, forcing all PDFs through the OCR engine when native text is available is not ideal, but it is currently required to support files that do not contain a text stream. This should be a trivial change.

2) Ensure the default behavior does indeed generate text when the "Generate Text" command is used on PDFs that do not contain a text stream, unless the user chooses otherwise.
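
To make the two elements above concrete, here is a minimal Python sketch of the decision being requested. It is only an illustration: `PdfDocument`, `choose_text_source`, `generate_text`, and the `ocr_page` callable are hypothetical names invented for this example and are not part of any Laserfiche API. The assumption is that the software already knows whether a text stream is present, and that this same check can pick the text source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class TextSource(Enum):
    NATIVE = "native text extraction"
    OCR = "alternative method (OCR)"


class NoTextSourceError(Exception):
    """Raised instead of silently producing empty text."""


@dataclass
class PdfDocument:
    # Hypothetical stand-in for an imported PDF; not a Laserfiche type.
    name: str
    page_count: int
    text_stream: Optional[List[str]] = None  # one string per page, None if absent


def choose_text_source(pdf: PdfDocument, allow_ocr_fallback: bool = True) -> TextSource:
    """Pick the text source from the presence of a text stream (element 1)."""
    if pdf.text_stream is not None:
        return TextSource.NATIVE  # fastest path, no OCR needed
    if allow_ocr_fallback:
        return TextSource.OCR  # default: still produce text (element 2)
    # The user opted out of OCR: fail loudly rather than return nothing.
    raise NoTextSourceError(f"{pdf.name}: no text stream and OCR fallback is disabled")


def generate_text(
    pdf: PdfDocument,
    ocr_page: Callable[[int], str],
    allow_ocr_fallback: bool = True,
) -> List[str]:
    """Return one text string per page using the chosen source."""
    source = choose_text_source(pdf, allow_ocr_fallback)
    if source is TextSource.NATIVE:
        return list(pdf.text_stream)
    # ocr_page is an assumed callable that OCRs the rendered image of page i.
    return [ocr_page(i) for i in range(pdf.page_count)]
```

With this shape, only PDFs that lack a text stream ever reach the OCR engine, and the only way to end up with no text is an explicit error the user can see.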

test.pdf (114.18 KB)

replied on February 25, 2016

Based on your sample, it does not appear to be a problem with LF but with your PDF. The quality of the PDF is not very good.

replied on February 25, 2016

Only in that it does not contain a text stream. It's a valid PDF with no text stream; neither Acrobat nor Laserfiche indicates there is a problem with the file.

I'm not suggesting that there isn't a route to make this sort of work. Selecting "Generate images and text for PDF files without a text stream" in conjunction with "Use an alternative method to generate text" will generate searchable text for that file, but this also means that all PDFs will now go through the OCR engine, which is absurd.

The software is already doing 90% of what needs to occur, i.e., checking for the absence of a text stream and then bootstrapping pages and text for Laserfiche if required (so long as this functionality is enabled, and it should be by default, as this is far and away the most likely use case).

I've done what I can on this. Between initial diagnosis, testing and isolation, and now this, I've spent most of the day on what appears to be an oversight. I'm going to leave it up to you or the community to make development aware or not.

replied on February 25, 2016

Adam,

Thanks for bringing up this issue. We do agree that, for PDFs that don't have a text stream, when the Client's Tools > Options > Generate Text > Advanced Settings for PDFs is configured to "Generate images and text for PDF files without a text stream", generating text during PDF import and generating text for PDFs that already exist in the repository should behave the same way: generate the images first and then OCR the image pages. This should also be done without having to specify using an alternative method to generate text and OCR existing pages, since you wouldn't want that option to apply to PDFs that do have a text stream.

We'll look into addressing this in a future release.
