Associating Data

replied on July 5, 2024

I would classify this as semi-structured data--it's not totally consistent but it does have some structure and pattern to it. That makes it challenging to deal with, but it may be possible to extract. In particular, I think regular expressions (via the Pattern Matching activity in Workflow) are going to be your best strategy here (along with using zones to only extract the data from parts of the image).

However, you need a larger sample size to build robust regular expressions--samples that include most/all of the possible variations.
Example questions:
Is "UPC" always present?
Is it always a number prefixed by the text "UPC:"?
Is there always at least one Batch?
Does it always have a "Batch:" label?
Will there always be a material number?
Will every material number have a description?
Are the column widths always the same?

Given what you've shown in your post, you can do something like the following:

Create a MV token containing the material numbers (each row corresponds to one value in the token)
Create a MV token containing the description (each row corresponds to one value in the token--and importantly, the same value as the material number)
Use additional regular expressions to extract each part of the description for each row. Using a For Each Value activity to iterate over each value in token created by Step 2 would probably be the easiest way to do this.

If this is done correctly, you can then correlate the values in the multi-value tokens by their index (i.e., the first value in each token will correspond to the first row's values, the second value to the second row's values, etc.).

See bottom of this post for more details about the regular expressions that might work.

Regarding:

I know that I can have a single field with multiple values, and I can extract the batch, description, qty, and UPC but I don't think that there's anything like a table field

Laserfiche has something called Multi-Value Field Groups that act sort of like "field-level records" and may be what you're looking for.

One additional challenge you're going to have to deal with is how correlate multiple batches to a single item (e.g., a comma-separated list might work since they are numbers) or, alternatively, to expand the batches into multiple rows that use the same material number (depending on what you want).

Note: The "split" token function can easily split apart comma-separated value lists and token indexing can easily recombine them.

Sample regular expressions that might help you get started, but again, aren't very robust because you only posted a very limited sample size.

1) To split the Description column into rows (assuming they have a UPC):

((?:.*\n)+?UPC:\s*\d+)

This captures everything before and including "UPC: <number>" (including line breaks).

2) To extract the batch numbers from a Description row (assuming they always end with CV):

(\d+\s+\d+)\s+CV

This captures the digits (and whitespace between the digits) before the letters CV.

3) To get the item name from the Description row (assuming they always have a batch number):

([^\n]*)\nBatch:

This captures all non-line break characters before the word "Batch:"

4) To get the UPC

UPC:\s+(\d+)

This captures all digits after the text "UPC:"

0 0

View 1 previous reply

replied on July 8, 2024

Several things...

The capture profile should have a zone for each column of data (i.e., one for material number and one for description)
You need a separate Pattern Matching activity with a regular expression token that breaks apart the Description column into rows--see the first regular expression in my post above. The Input for the regular expression token should be the Description token outputted by the Run Capture Profile activity.
You need a For Each Value activity that is configured to loop through the values of the token produced in Step 2.
The Pattern Matching activity you have needs to be inside the For Each Value activity.
The Input for each of the UPC/Batch/Description tokens should be the current value token outputted by the For Each Value activity. The regular expressions I provided are intended to work on a single row, not the entire column. The UPC and Description token should only return the first match since they'll only ever have one value. The Batch token should return all matches, but you're going to need to combine the matches using a token index when appending the value to the multivalue token (part of Step 6). If you don't do this, it's going to mess up the indexing in your multivalue tokens and the material number won't be correctly associated anymore! (though the workflow will run without an error)
The last Assign Token Values activity needs to be inside the For Each Value loop and should be configured to append the values from the UPC/Batch/Description tokens to the multivalue tokens you created in the first Assign Token Values activity.

Note: The Assign Field Values activity should be after (and not inside) the For Each Value activity.

0 0

replied on July 8, 2024

I'm not getting the same result as you for the first step. It's only picking up the first batch number and not the others where yours is getting all of them.

Here's the full data set

24/15 OZ KROGER SM LIMA BEANS
Batch:
0000287210-340CV
COO:USA
UPC:10011110803068
24/15 OZ KROGER PEAS & SLI CRTS
Batch:
0000295597-425CV
COO:USA
UPC:10011110812503
24/14.5 OZ KROGER CUT EX-GRN GR BN
Batch:
0000286058-340CV
COO:USA
UPC:10011110114362
12/14.5 OZ KROGER ITAL CUT GRN BNS
Batch:
0000287209 -170CV
COO:USA
0000279671 -510CV
COO:USA
UPC:10011110114447
12/14.5 OZ KROGER CUT WAX BEANS
Batch:
0000287207 -170CV
COO:USA
0000259275-170CV

0 0

replied on July 8, 2024

To be clear, I was trying to give you a starting point--you're going to have to adjust things based on your data set. As I pointed out in my first response, there's a lot of open questions (and thus I had to make a lot of assumptions) based on your initial post.

This is one of those cases: your batch numbers now have hyphens between the two numbers where they didn't in your original post, so you'll need to update the regular expression to account for that. Also, between the numbers and before the CV sometimes doesn't have a space, in this data, so I changed the \s+ (1 or more whitespace characters) to a \s* (0 or more whitespace characters) in two places.

Something like this should work better:

(\d+\s*-?\d+)\s*CV

Also, the regular expression tester in the Workflow Designer is invaluable for debugging these sorts of things.

0 0

Question

Question

Replies

Sign in to reply to this post.