Question

Document is being OCR'd when handed off from Forms - Generate Searchable Text NOT selected in Options.

asked on April 3, 2019

I'm seeing an odd behavior that I cannot resolve, so I'm hoping someone can assist.  The Forms profile uses the local LF Admin account to process the form and send it to the repository.  I log into the LF Windows Client as the Admin account and deselect Generate Searchable Text and Generate LF Pages in the options.  When I submit a document, nothing is OCR'd and no pages are created.  Great, works as expected.  BUT when I check ONLY Generate LF Pages, pages are generated AND the document is OCR'd.  Using the same account, I drag and drop directly to the repository with ONLY Generate LF Pages checked.  The document is processed correctly: pages are generated, but it is not OCR'd.

My goal is to generate pages at the submission from Forms so I can OCR overnight.

Any insight or suggestions would be much appreciated.

Replies

replied on April 3, 2019

Just to be clear, do you have it set to save a form as TIFF or PDF in the Save To Repository task?

The reason I ask is that when Forms saves a document as a PDF, it actually attaches searchable text content to that PDF when it converts the form content.

The text content of a PDF like this can be extracted without running OCR when pages are generated, so the "OCR'ed Pages" title for that attribute is a bit misleading.

A good way to see this is to open one of the PDFs and highlight some text; this is almost certainly the content that ends up in the page text, not OCR data.

replied on April 4, 2019

Sorry, I should have been clearer.  I toss the TIFF and keep the attached PDF document.  I submitted the same PDF attachment once with both Generate Searchable Text and Generate LF Pages unchecked: no OCR and no pages.  Then I submitted it a second time with only Generate LF Pages checked: it was OCR'd and pages were created.

I've tried checking Generate Searchable Text, logging out, logging back in, and unchecking it.  I've also tried setting it from a different machine using a different client version as the Admin account.  No combination has created only pages.

replied on April 4, 2019

To clarify, what I'm saying is that it is not necessarily running the OCR process. If the PDF has a text component, like the ones generated by LF Forms, then it could be extracting the existing text rather than running the OCR engine.

Are you talking about the document generated by the form submission itself, or files uploaded to the form?

I have forms that save as TIFF pages from the Forms process, and they save with searchable text; the text is not the result of OCR, it is a text file Forms generates when it creates the page images because it already has the actual text available.

replied on April 4, 2019

I'm talking about a PDF attached to the form, not the form itself.

What I did notice after extensive testing is that the All value in the OCR'd column is not truly accurate.  I submitted the same PDF attachment in two separate form submissions.  Prior to running OCR, both documents had the same number of pages and an OCR'd value of All.  Then I ran the WF OCR activity on one of the documents and performed a text search on both.  The OCR'd document had 25 hits for the term I searched, while the non-OCR'd one had one.  It seems that when only Generate LF Pages is selected, text is generated on the first page but not on the subsequent ones.  It's not until you run the OCR activity that text is generated on all pages.  Does that make sense?

SELECTED ANSWER
replied on April 4, 2019

That doesn't really seem like expected behavior.

Take the original PDF you are attaching to the form submission and see if it has any searchable text before you even submit it to the form (try to highlight/select sections of the text, press Ctrl+F to search, etc.).

The results you're getting are extremely odd and it would make a lot more sense if we find that the PDFs you're testing with already had some pre-generated text in them before you upload them.

That would relate to the Advanced PDF Import Options I was discussing earlier.

replied on April 4, 2019

Great call!  The doc I've been testing with did have searchable text on the first two pages out of 88.  Now it makes sense.  Thanks!
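
For anyone who wants to run the same check without clicking through every page, here is a rough sketch in Python. It assumes the third-party pypdf package is installed (nothing in this thread uses it), and the function names and the file name are purely illustrative. It reports which pages of a PDF already carry an embedded text layer, which is the kind of text that can be picked up when pages are generated without any OCR taking place.

```python
from typing import List


def pages_with_text(page_texts: List[str], min_chars: int = 1) -> List[int]:
    """Return 1-based page numbers whose text has at least min_chars
    non-whitespace characters."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len("".join((text or "").split())) >= min_chars
    ]


def summarize(page_texts: List[str]) -> str:
    """Build a one-line report of how many pages carry embedded text."""
    found = pages_with_text(page_texts)
    return f"{len(found)} of {len(page_texts)} pages have embedded text: {found}"


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read each page's embedded text layer.

    Assumption: pypdf is installed. It is imported here so the helpers
    above can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    # In-memory stand-in for an 88-page PDF with text on the first two pages.
    sample = ["Some typed text", "More typed text"] + [""] * 86
    print(summarize(sample))
    # For a real file (hypothetical name):
    # print(summarize(extract_page_texts("attachment.pdf")))
```

A document that reports text on only a few pages would explain a low hit count in a text search until the OCR activity has been run across all pages.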

replied on April 4, 2019

Just as a follow-up, I tested with a document that had no searchable text, but after generating only pages, the value for OCR'd was still All.  So it does seem like a bug.

replied on June 2, 2020

Hi Thomas,

I'm seeing something similar in the latest 10.4 release: form TIF images are saved to the repository but are not being indexed under the FTS default setting 'Always index on document creation'.

I checked the DOCID in Workflow Subscriber to confirm that the document action was 'Entry Created'. However, the document isn't indexed, but the parent folder is.

I think that we are both seeing some buggy behaviours.

replied on June 2, 2020

Images showing that the folder is indexed but the image document "Job Application - Mike Jones" from Forms is not

replied on June 2, 2020

I think this is more of a quirk for Support to explain.

Why are folders being FTS indexed?

@Support, please move this to a new Answers post if you need to.  I only have access to reply to existing posts.

Thanks
