Question

How do I retrieve field values from a PDF document?

Workflow

Updated March 24, 2024

asked on March 18, 2024

Hello everyone.

I have a problem that I don't know how to solve.

My client generates various PDF files from a third-party application. Depending on the data entered in the application, the layout of the PDF changes. For customer 1, I can have fields A and B, and for customer 2, I can have fields A and C (or D and E, or F, G, H, I, ...). Please note that the missing fields are not hidden; they are simply not included when the PDF is generated.

I first used the occlusion and text recognition method to retrieve the field values.

But I now realize that a PDF document can have 50 pages. In that case I get the error message "The RegEx engine timeout expired while attempting to match a template with an input string. This problem can occur for many reasons, including large inputs or excessive traceback caused by nested quantifiers, back references and other factors." and the workflow never stops (in the screenshot, my workflow has been running for 2 days).

Is there any other way to retrieve field values?

Thanks in advance for your help.

Regards


Replies

replied on March 20, 2024

If it is a structured PDF that truly has fields (a flat PDF will not work), you can use the "Retrieve PDF Form Content" workflow action to get those values. You'll have to first upload a master PDF form that has examples of those fields, then choose the entry to retrieve the content from. Quick Fields can also do this.


replied on March 19, 2024

Hi Olivier,

That error isn't necessarily a Laserfiche error. It is more likely the regex engine itself exceeding its matching timeout.

You may be able to clean up the regex to create a more specific search. But I think with that amount of text the regex will not always be reliable.

A few things I might try if I were going to stay with the regex route: run the regex and the text externally on a site like regex101.com, or download and install a regex app (Expresso, maybe?). You will be able to run tests to see how long the regex takes, or how many iterations it uses, to find the match.
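If you would rather test locally, here is a minimal Python sketch of the same idea: time one search call for a pattern with nested quantifiers and for a flatter rewrite. Both patterns and both sample strings are made up for illustration, not taken from Olivier's workflow, and Python's engine is not the one Workflow uses, so treat the timings as a rough indication only. Python's re module has no match timeout, so the non-matching input is kept short on purpose.

```python
import re
import time


def time_search(pattern, text):
    """Return (match, elapsed_seconds) for a single re.search call."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    match = compiled.search(text)
    return match, time.perf_counter() - start


# Nested quantifiers: the same run of letters can be split in many ways,
# so a failed match forces the engine to try every split.
NESTED = r"(\w+\s?)+:"
# Roughly the same intent with one character class, so far less backtracking.
FLAT = r"[\w\s]+:"

# Hypothetical inputs: one line with a label, one with no colon at all.
WITH_LABEL = "Field A: 100"
NO_LABEL = "x" * 18  # short on purpose; every extra character roughly doubles the nested pattern's work

if __name__ == "__main__":
    for name, pattern in (("nested", NESTED), ("flat", FLAT)):
        for text in (WITH_LABEL, NO_LABEL):
            match, seconds = time_search(pattern, text)
            print(f"{name:6} matched={match is not None} {seconds:.6f}s")
```

The interesting case is the text that does NOT match: that is where the nested pattern has to exhaust every alternative before giving up.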

The other thing I have done when I have large amounts of text is that I will regex a regex. Meaning that if my document has certain markers of text, like a page number or a heading, I will create a regex that matches the text between those markers. Then I will run a second regex on THAT result to get my matches.
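As a rough illustration of that two-stage approach, here is a small Python sketch. The "Page N" marker lines and the "Field X: value" labels are assumptions made up for the example; the real markers depend on what the extracted text of your PDF looks like.

```python
import re

# Hypothetical extracted text; the markers and labels are assumptions.
SAMPLE_TEXT = """Page 1
Customer: ACME Corp
Field A: 100
Field B: 200
Page 2
Notes: nothing to extract here
Page 3
Field C: 300
"""

# Stage 1: a cheap pattern that only locates the marker lines.
PAGE_MARKER_RE = re.compile(r"^Page (\d+)\n", re.MULTILINE)
# Stage 2: a narrow pattern that is run on one small section at a time.
FIELD_RE = re.compile(r"^Field ([A-Z]): (\d+)$", re.MULTILINE)


def extract_fields(text):
    """Return {field_name: (page_number, value)} using two-stage matching."""
    results = {}
    markers = list(PAGE_MARKER_RE.finditer(text))
    for index, marker in enumerate(markers):
        # A section runs from the end of this marker to the start of the next.
        if index + 1 < len(markers):
            end = markers[index + 1].start()
        else:
            end = len(text)
        section = text[marker.end():end]
        page = int(marker.group(1))
        for field in FIELD_RE.finditer(section):
            results[field.group(1)] = (page, field.group(2))
    return results


if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

Because the second pattern only ever sees one section, a bad match can backtrack over a page of text at most rather than the whole 50-page document.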


replied on March 24, 2024

Hi Olivier,

1. Which version of Workflow are you using?

2. Have you tried testing the pattern match in the designer?

3. In the Retrieve Document Text activity, you can specify which pages to retrieve instead of retrieving all 50 pages of the PDF, so that the retrieved text stays small.

