Question

Pulling PDF Lines into Metadata

asked on December 18, 2023

Here's my data.

 

I don't have a way of knowing prior to receiving the document how many lines there are, but I want to pull data from each line into metadata so that I can use that metadata to create EDI documents. 

 

So far, I've tried using a dynamic anchor point and I tried to create a table to put it in but got confused and couldn't find documentation. 

 

Can anyone either explain how I'd accomplish this or point me to documentation where I can get a good enough understanding to figure it out myself?


Replies

replied on December 21, 2023

Hi Jack, it's possible this could be done using workflow. Can you provide the actual Document Text layer to review?
 

replied on December 21, 2023

Can you explain what you mean by Document Text layer? Is it this?

 

replied on December 21, 2023

Yes, that's what I was looking for. The text was pulled from the document in columnized form, so that would need to change. This is a rather large project. I'd be happy to jump on a call with you to discuss strategies if you would like. You can flip me an email at steve.knowlton@ricoh.ca

replied on December 21, 2023

Can you give me the high level of what would be needed? This is both for me and anyone else who reads this thread later.

replied on December 21, 2023

Hi Jack, the thought is to use Pattern Matching within Workflow to identify and break out the text for each Line # section on the pages into a multi-value token. Then, as you loop through each section's text in that token, use additional Pattern Matching to pull the specific information from that section's text and create the individual items in a table or metadata fields. There are lots of caveats to consider, such as a Line # section crossing pages, sections of differing length, etc. It would require a lot of testing and possibly exception handling to ensure reliability.
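
A rough sketch of the first step (breaking the text into one chunk per Line # section), written in Python only to show the regex idea. The sample text, the "Line # N" heading format, and the function name are assumptions, since the real document was only posted as a screenshot. In Workflow the same pattern would go into a Pattern Matching activity that returns all matches into a multi-value token, and the regex syntax there may differ slightly.

```python
import re

# Hypothetical OCR text; the real layout is not known from the thread.
sample_text = """Purchase Order 12345
Line # 1
Item: WIDGET-A
Qty: 10
Line # 2
Item: WIDGET-B
Qty: 4
"""

# Each match starts at a "Line # N" heading and runs lazily until
# the next heading or the end of the text.
SECTION_RE = re.compile(
    r"^Line # \d+.*?(?=^Line # \d+|\Z)",
    re.MULTILINE | re.DOTALL,
)


def split_sections(text):
    """Return one string per Line # section (the 'multi-value token')."""
    return [m.group(0).strip() for m in SECTION_RE.finditer(text)]


sections = split_sections(sample_text)
for section in sections:
    print(repr(section))
```

Anything before the first heading (the "Purchase Order" line here) is skipped. A section that continues onto the next page would still need the extra handling mentioned above.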

replied on December 30, 2023

Hi Jack, 

Piggybacking on what Steve said... If this is a standardized form, meaning it always follows the same format within each "Line #" section, you have two options: 1) pull the text from the document or 2) try Quick Fields.

If it were me, I would use option #1 and test it out. I would find several of these docs to make sure there isn't any nuance to the structure. Again, multiple Line # sections are OK, but each Line # section should be similar.

You could then read the text from the document into a workflow and parse the data with regex. Once you have the parsed data, you could drop it into your metadata fields.
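
To give a feel for the "parse the data with regex" step, here is a small Python sketch that pulls named values out of a single section. The field labels (Item, Qty, Unit Price) and the dictionary keys are made up for illustration; the real patterns depend on what the OCR text actually looks like. In Workflow, each pattern would be its own Pattern Matching token that is then assigned to a field.

```python
import re

# Hypothetical text for one Line # section.
section = "Line # 2\nItem: WIDGET-B\nQty: 4\nUnit Price: 12.50"

# One pattern per metadata field; group 1 captures the value.
FIELD_PATTERNS = {
    "line_number": re.compile(r"^Line # (\d+)", re.MULTILINE),
    "item": re.compile(r"^Item:\s*(\S+)", re.MULTILINE),
    "quantity": re.compile(r"^Qty:\s*(\d+)", re.MULTILINE),
    "unit_price": re.compile(r"^Unit Price:\s*([\d.]+)", re.MULTILINE),
}


def parse_section(text):
    """Return a dict of field name -> captured value (None if not found)."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else None
    return row


row = parse_section(section)
print(row)
```

Returning None for a missing value makes it easy to spot sections that did not match, which is where the testing against several real documents comes in.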

I have done this before with a similar task for a Justice Court. The doc was much like yours with multiple sections (traffic tickets). It's quite powerful!

replied on January 2, 2024

When you say "pull the text from the document", do you mean that I would just pull the entire area and parse it out from there?

replied on January 2, 2024

Yes - OCR the document. Make sure the text displays accurately and identify the pattern. Create a workflow.

***Here's the trick - you need to know regex. 

In the workflow you can use the "Retrieve Document Text" activity to get the text. Then use the "Pattern Matching" activity to parse the text.

 

This workflow:

  • Gets the document
  • Creates tokens (for pattern matching)
  • Gets the document text
  • Uses regex to isolate each page's body (in this case the pages had headers, and it was easier to work with the text between the header and footer)
  • Then, for each page: runs different pattern matches to get the data, drops the results into tokens, and assigns field values
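
The header/footer step can be sketched roughly like this in Python. The header text, the "Page N of M" footer, and the form-feed page separator are all assumptions for the example, not details of the actual workflow, which retrieves the text with the "Retrieve Document Text" activity.

```python
import re

# Hypothetical two-page text, pages separated by a form feed.
pages_text = (
    "ACME SUPPLY - ORDER DETAIL\nLine # 1\nQty: 10\nPage 1 of 2"
    "\f"
    "ACME SUPPLY - ORDER DETAIL\nLine # 2\nQty: 4\nPage 2 of 2"
)

# Capture only what sits between the header line and the footer line.
BODY_RE = re.compile(
    r"^ACME SUPPLY - ORDER DETAIL\n(?P<body>.*?)\n^Page \d+ of \d+$",
    re.MULTILINE | re.DOTALL,
)


def page_bodies(text):
    """Return the body of each page, falling back to the whole page."""
    bodies = []
    for page in text.split("\f"):
        match = BODY_RE.search(page)
        bodies.append(match.group("body") if match else page)
    return bodies


bodies = page_bodies(pages_text)

# Then, for each page body, run the field-level pattern matches.
quantities = [re.findall(r"^Qty:\s*(\d+)", b, re.MULTILINE) for b in bodies]
print(bodies)
print(quantities)
```

Trimming the header and footer first keeps the later patterns from accidentally matching page furniture.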

  

 
