You are viewing limited content. For full access, please sign in.

Question

Question

Needing some help with splitting a comma delimited string into tokens with regex in Workflow please

asked on May 28, 2018

Hi guys,

I need to extract individual entries from a comma delimited file via Workflow.  I've got a pattern matching activity that splits the rows into individual tokens and leaves me with a string of column delimited text e.g.

Employee,12345,Smith,Fred,01/05/2018,29/05/2018,Department,Location

I can run a second pattern matching activity splitting into individual tokens per delimiter e.g. to get the Employee Type token:

(.*),.*,.*,.*,.*,.*,.*,.*

... to get the Employee Number token:

.*,(.*),.*,.*,.*,.*,.*,.*

... etc

And that works really well, right up until there's a department with a comma in the name, so it's in quotes:

Employee,12345,Smith,Fred,01/05/2018,29/05/2018,"Other, Department",Location

Testing with various regex's I found online, the following will work to get the Employee token when there are quotes around the Department:

(.*),.*,.*,.*,.*,.*,\"(?:[^\"]+|\"\")*\"|[^,],.*

And that successfully extracts all the tokens from that string, i.e. with a department with the quotes around the name and one or more commas, but then all the other lines without quotes fail.

Any suggestions on how to successfully populate the pattern matching tokens regardless of whether the department entry has quotes or not?

Here's the Workflow:

 

And here's the structure of the pattern matching activity for CSV split:

 

Thanks very much,

Mike

0 0

Answer

SELECTED ANSWER
replied on May 29, 2018 Show version history

Sounds like you have things working Mike but I find it hard to pass up a chance to play with reg-exps  :)     Thought I would throw one more solution your way.

If your CSV line always has 8 csv values you could use one reg-exp that will match all the values into a multi-value token,  with or without a comma in the department:   ( All matches (as multi-value) token) )
(\".+?\"),|(.*?),|(.+)$
or if you don't want the quotes on the dept name
\"(.+?)\",|(.*?),|(.+)$

Individual token values could then be assigned by indexing the multi value token created in the pattern match:
Type = %(PatternMatching_Token 1#[1]#)
Code = %(PatternMatching_Token 1#[2]#) 
....
Location = %(PatternMatching_Token 1#[8]#)

EDIT: 
I misunderstood the data,  so sometimes the department may have a comma with quotes around the name, other times not.
In that case you could use the same reg-exp and check if the multi value token contains 8 or 9 values.
If it contains 9 values then:
Department = %(PatternMatching_Token 1#[7]#) %(PatternMatching_Token 1#[8]#)
Location = %(PatternMatching_Token 1#[9]#)

 

3 0

Replies

replied on May 29, 2018

For anyone else needing to do this, here's the outcome.  I went with Andrew's suggestion, and although Ben's regex worked as well, Andrew's second regex removed the quotes from the Department result which was handy.  Here's the final workflow:

And here's the setup of the "Assign Split Token Values" token:

Thanks again guys!

3 0
replied on May 28, 2018

I have a similar suggestion as Olivier.

Instead of getting comma-delimited text, you can separate everything with quotation and comma.

For example, if you can get:

"Employee","12345","Smith","Fred","01/05/2018","29/05/2018","Department","Location"

Then all you have to do is use something like "(.*)?" to grab whatever is inside the quotation.

Both of these solutions should work.

2 0
replied on May 29, 2018

There's this solution

(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)

from Stack Overflow, too

https://stackoverflow.com/questions/3776458/split-a-comma-separated-string-with-both-quoted-and-unquoted-strings

2 0
replied on May 29, 2018

Thanks very much for your input guys, I would definitely prefer to have a one-regex-for-all approach.  :o)   Ben, I saw and tested that string as well, but then struggled with the separation into separate token aspect, but I'll try that with Andrew's indexing suggestion and see how that goes.

Thanks again guys.

Cheers,

Mike

1 0
replied on May 28, 2018

Hi Mike,

I was the same difficulty and this is my solution (maybe not the best one, but it's working ^^)

1. Before do anything, using the workflow, search and change all comma in your text to something special (example "," => "-@@@@-").

2. Do your process,

3. Then at the end of your process, change back your special character to your comma by changing all "-@@@@-" by ",".

 

Hope this can help you.

Regards

0 0
replied on May 28, 2018

Hi guys,

Thanks very much for your suggestions, the problem Kentaro is that I don't have any control over the csv file as it is provided by a third-party system and uploaded to Laserfiche, so some department fields will have the quotes, others won't.

I'll try the text replace approach for any commas within quotes and see if that sorts it.

Thanks again,

Mike

0 0
replied on May 28, 2018 Show version history

In this case I think you need to treat the department separately.

I would create one pattern matching activity first, just to evaluate if " exists. 

Then create a conditional decision such that if " exists, pattern match the department separately by using something like "(.*)?" and grab the other values just like you did (for example: (.*),.*,.*,.*,.*,".*",.* ).

If " does not exist, then your original approach should work because the department does not have a comma.

To keep things organized, you can create an empty multi-value token at the beginning before the conditional and within each branch, keep appending to that empty token using values retrieved from each pattern matching activities. This way, you would end up with one multi-value token containing everything you need in a clean format. 

0 0
replied on May 28, 2018

Thanks Kentaro,

I was so focused on trying to work out the correct regex to use that I completely missed this approach.  Unless someone comes up with a magic regex that does the trick, I'll take this approach to at least get the Workflow in place.

Thanks very much,

Mike

1 0
SELECTED ANSWER
replied on May 29, 2018 Show version history

Sounds like you have things working Mike but I find it hard to pass up a chance to play with reg-exps  :)     Thought I would throw one more solution your way.

If your CSV line always has 8 csv values you could use one reg-exp that will match all the values into a multi-value token,  with or without a comma in the department:   ( All matches (as multi-value) token) )
(\".+?\"),|(.*?),|(.+)$
or if you don't want the quotes on the dept name
\"(.+?)\",|(.*?),|(.+)$

Individual token values could then be assigned by indexing the multi value token created in the pattern match:
Type = %(PatternMatching_Token 1#[1]#)
Code = %(PatternMatching_Token 1#[2]#) 
....
Location = %(PatternMatching_Token 1#[8]#)

EDIT: 
I misunderstood the data,  so sometimes the department may have a comma with quotes around the name, other times not.
In that case you could use the same reg-exp and check if the multi value token contains 8 or 9 values.
If it contains 9 values then:
Department = %(PatternMatching_Token 1#[7]#) %(PatternMatching_Token 1#[8]#)
Location = %(PatternMatching_Token 1#[9]#)

 

3 0
You are not allowed to follow up in this post.

Sign in to reply to this post.