
Question

Most efficient use of Workflow for batch processing

asked on January 27, 2016

I have batch documents (anywhere from 1 to 2,500 pages) being brought in via Import Agent. I need to split each page into its own document and file them appropriately. The metadata needs to come from another system. I'm curious how best to use Workflow without creating performance issues or unnecessary duration.

The way I see it, there are two options:

1. Purely use Workflow Activities:

This would use a repeat activity to remove pages from the batch document and apply pattern matching to the text to find the identifier. Then I'd use the batch number obtained from the document, along with the page identifier, to run a database query to retrieve the metadata. I could then apply it to the template and file the document.

2. Use an SDK Script Activity:

This would mean creating an SDK Script that would load all of the data for the batch from a web service, and then iterate through each page, creating documents as I go. I'd populate their template from the data that I'd previously loaded. From here I would likely allow another workflow to actually file it, to reduce the scope of the script.
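For what it's worth, here is a rough C# sketch of the data-access difference between the two options. Everything in it is hypothetical (the BatchMetadata table, its columns, and the method names are made up for illustration), and the Laserfiche-specific page and document operations are left as placeholder comments rather than real SDK calls.

    using System.Collections.Generic;
    using System.Data.SqlClient;

    // Hypothetical DTO for whatever the external system holds per page.
    class PageMetadata
    {
        public string CustomerName;
        public string AccountNumber;
    }

    class BatchSketch
    {
        // Option 1's data flow: one parameterized query per page.
        static PageMetadata LookUpSinglePage(string batchNumber, string pageId, string connStr)
        {
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "SELECT CustomerName, AccountNumber FROM dbo.BatchMetadata " +
                "WHERE BatchNumber = @batch AND PageIdentifier = @page", conn))
            {
                cmd.Parameters.AddWithValue("@batch", batchNumber);
                cmd.Parameters.AddWithValue("@page", pageId);
                conn.Open();
                using (SqlDataReader r = cmd.ExecuteReader())
                {
                    if (!r.Read()) return null;
                    return new PageMetadata { CustomerName = r.GetString(0), AccountNumber = r.GetString(1) };
                }
            }
        }

        // Option 2's data flow: load the whole batch once, keyed by page identifier,
        // then iterate. (The real source would be the web service; SQL stands in here.)
        static Dictionary<string, PageMetadata> LoadWholeBatch(string batchNumber, string connStr)
        {
            var result = new Dictionary<string, PageMetadata>();
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "SELECT PageIdentifier, CustomerName, AccountNumber FROM dbo.BatchMetadata " +
                "WHERE BatchNumber = @batch", conn))
            {
                cmd.Parameters.AddWithValue("@batch", batchNumber);
                conn.Open();
                using (SqlDataReader r = cmd.ExecuteReader())
                    while (r.Read())
                        result[r.GetString(0)] = new PageMetadata
                        {
                            CustomerName = r.GetString(1),
                            AccountNumber = r.GetString(2)
                        };
            }
            return result;
        }

        static void ProcessBatch(string batchNumber, int pageCount, string connStr)
        {
            Dictionary<string, PageMetadata> batchData = LoadWholeBatch(batchNumber, connStr);
            System.Console.WriteLine("Loaded metadata for {0} of {1} pages", batchData.Count, pageCount);

            for (int page = 1; page <= pageCount; page++)
            {
                // 1. Extract this page's identifier from its text (pattern matching).
                // 2. Create a new one-page document from it (Laserfiche SDK call, omitted here).
                // 3. Apply the template and fields from batchData[identifier].
                // 4. Hand the new document to a separate workflow for filing.
            }
        }
    }

Either way the per-page Laserfiche work is the same; the difference is purely how many metadata round trips you pay for.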

Any thoughts?

 


Answer

SELECTED ANSWER
replied on January 27, 2016

I guess Quick Fields is not an option here; it's definitely the kind of thing it was designed for and would probably have better performance than Workflow.

If it must be Workflow, I'd have one workflow that splits a page off the document and creates a new document, then hands that document over to another workflow for further processing through Invoke Workflow. That way, at least the processing part is multi-threaded.

As far as getting the pages out of the document, you have a few options:

  • Always move page 1 within the loop. This is the simple way, but it's got the worst performance impact because when you pull the page out, the server has to shuffle all the pages up by one.
  • Always move the last page. This avoids the performance hit of re-shuffling pages when you pull one out, but it requires calculating the last page from the document's initial page count and the current iteration (see the short sketch after this list).
  • Copy each page as you go. This will give you better performance at the expense of increased disk usage on the volume since you're essentially making a copy of the whole document. At the end you can either delete the original document as a whole or move it to a temporary holding folder for deletion later once you confirm nothing went wrong.
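For the second bullet, the arithmetic is trivial but easy to get off by one. In Workflow it would just be a token calculator expression; a minimal C# version of the same calculation, assuming 1-based page numbers and a 1-based loop counter, looks like this:

    class LastPageHelper
    {
        // initialPageCount is captured once, before the loop starts;
        // iteration is the 1-based loop counter.
        static int PageToMove(int initialPageCount, int iteration)
        {
            return initialPageCount - (iteration - 1);
        }

        // For a 2,500-page batch:
        //   PageToMove(2500, 1)    -> 2500  (the original last page)
        //   PageToMove(2500, 2)    -> 2499
        //   PageToMove(2500, 2500) -> 1
    }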
replied on January 27, 2016

I love Quick Fields. I really do. However, we've had some issues with large processes getting hung up in QF Agent, so I'm always cautious with large batches. Plus, we're right at the edge of the QF server having its schedule completely full. I'm willing to give it another try in this case.

One thing in particular that I'm curious about is the overhead of thousands of database calls, vs one. I'm not worried about the DB server, but does it cause any issues with WF?

If I were to go the QF route, would it be better to have QF do the lookup, or should I pull out each page's identifier and pass the rest off to Workflow?

Finally, is there any concern about doing the page splitting and template population in an SDK script as part of a workflow?

replied on January 27, 2016

It wouldn't really be thousands of calls to the database vs. one. It's more like thousands of sequential calls vs. thousands of calls made in concurrent groups, where each group is up to 4 times the number of CPUs on the Workflow server.

The first case is Quick Fields or a single workflow doing everything: you still get one call per page, but only one at a time.

The second is the invoked-workflow case. Since document creation is going to slow each instance down a bit, you would most likely see fewer than 4 times the number of CPUs running at any given time.
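To put rough numbers on that ceiling (the 8-CPU server below is just an assumed example; the 4 x CPUs figure is the one described above):

    class ConcurrencySketch
    {
        static void Main()
        {
            int cpus = 8;                      // assumed Workflow server size
            int maxConcurrent = 4 * cpus;      // ceiling of 32 invoked workflows at once
            int pages = 2500;                  // worst-case batch from the question

            // Upper bound: about 79 "waves" of concurrent lookups
            // instead of 2,500 strictly sequential ones.
            int waves = (pages + maxConcurrent - 1) / maxConcurrent;
            System.Console.WriteLine("{0} concurrent instances -> about {1} waves", maxConcurrent, waves);
        }
    }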

As far as a script goes, there isn't really a concern other than you are going to do your own error handling rather than relying on Workflow's built-in retrying mechanisms.

replied on January 28, 2016

As far as a script goes, there isn't really a concern other than you are going to do your own error handling rather than relying on Workflow's built-in retrying mechanisms.

 

This is probably the key point for me. I'll take a slightly slower process if it means Workflow is handling more of the low-level details. We've done much crazier things purely in Workflow for one-time tasks and haven't had any overall issues; this was just the first time we've tried to build this kind of thing into a production Workflow.

 

Thank you Miruna!

 


Replies

replied on January 27, 2016

I suspect option #1 is not the answer but am curious.

 

In my experience on a bulk migration, an exe written with the SDK was more than 10 times faster than using workflow activities and a loop. The task involved taking a tiff image, finding associated metadata, creating a Laserfiche entry in the desired file location, appending the associated pages in order, assigning the metadata, saving and unlocking the entry.

 

I'm not sure if significant overhead exists using an SDK Script Activity compared to an SDK application and look forward to the discussion.

 

 

 

replied on January 27, 2016

Interesting question!

The steps as I see them are: lock the source document, create a new target document, move/copy page 'n' from the source document to the target document, read and interpret the text, do a reverse lookup into SQL, assign the template and metadata, repeat until finished, then delete the source document.

I think the bottleneck is going to be the reverse lookups into SQL, so improving that step will give you the best performance. If you can grab all of the data necessary to process a source document with a single query, that is where you'll see your improvements. Also make sure the queries themselves are efficient, i.e., that appropriate indexes are available for SQL to use for the lookups, etc.
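On the indexing point, a covering index along these lines (the table and column names are hypothetical, matching the sketches earlier in the thread) would let SQL Server resolve the whole-batch lookup with a single seek. It is wrapped in ADO.NET only to keep the examples in one language; in practice you would just run the DDL once against the database:

    using System.Data.SqlClient;

    class LookupIndexSetup
    {
        static void EnsureLookupIndex(string connStr)
        {
            const string ddl =
                "IF NOT EXISTS (SELECT 1 FROM sys.indexes " +
                "WHERE name = 'IX_BatchMetadata_Lookup' " +
                "AND object_id = OBJECT_ID('dbo.BatchMetadata')) " +
                "CREATE NONCLUSTERED INDEX IX_BatchMetadata_Lookup " +
                "ON dbo.BatchMetadata (BatchNumber, PageIdentifier) " +
                "INCLUDE (CustomerName, AccountNumber);";

            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(ddl, conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }

With that index in place, the single "everything for this batch" query costs roughly one seek per source document instead of one lookup per page.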

As far as the SDK Script activity versus building assemblies or apps with the SDK goes, they are two completely different animals. If I understand the mechanics of the Script activities correctly, they are interpreted and compiled at runtime (much like the older interpreted languages), so they will always be inherently slower than an application or assembly built with the SDK. (Perhaps the engineers can confirm that?)

replied on January 27, 2016

I don't know at what stage it happens, but scripts are compiled and will be executed like any other .NET code. It seems logical that they would be compiled during publishing. However, the IL will still need to be JITed, which would probably happen at runtime. This is just general information on the typical path .NET code takes from compilation to execution.

 

 

replied on January 27, 2016

It would be interesting to find out when the compilation to MSIL actually occurs as my experience has been that the Script activities tend to run slower than a custom activity with the same functionality...
