
Question

Mass Field Update/Workflow Reprocessing in Repository

asked two days ago

Our Commissioner's Office is migrating from one revenue system to another in a few months. System A's account number is stored in a field that queries a SQL table to auto-populate other fields, and Workflow uses that account number field to auto-name documents, file documents, etc. The same concept will need to be applied using System B's account number after migration.

We could have a table where column A is the System A account number and column B is the System B account number. I've tested a workflow in our development environment that retrieves the current account number (A), queries a table of fictitious account numbers (B), and updates the field with the new account number. It worked on the small set of documents I tested.
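The lookup described above can be sketched in plain SQL terms. This is a minimal illustration only; the table and column names (`account_map`, `system_a_acct`, `system_b_acct`) and sample values are hypothetical, not from the actual system:

```python
import sqlite3

# Hypothetical two-column mapping table: System A account number -> System B.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account_map (
        system_a_acct TEXT PRIMARY KEY,
        system_b_acct TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO account_map VALUES (?, ?)",
    [("A-1001", "B-77001"), ("A-1002", "B-77002")],  # fictitious numbers
)

def lookup_new_account(old_acct):
    """Return the System B account number for a System A number, or None."""
    row = conn.execute(
        "SELECT system_b_acct FROM account_map WHERE system_a_acct = ?",
        (old_acct,),
    ).fetchone()
    return row[0] if row else None

print(lookup_new_account("A-1001"))  # → B-77001
```

The workflow's "query a table, then update the field" step is essentially this lookup applied per document; unmapped account numbers (the `None` case) are worth routing to an exception folder rather than leaving blank.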

After the account number fields are updated, documents will need to be reprocessed through the auto-filing workflow that, in addition to the naming and filing described above, assigns records management properties, creates shortcuts (in some cases multiple shortcuts), etc. There are a LOT of branches in play, many conditional. It is possible that by the time we reach migration we will have 120,000+ documents in this repository to reprocess.

Is there a more efficient way of updating a field en masse than what I have described? Has anyone here reprocessed an extremely large number of documents through a workflow with this many branches? Is there a "best practice" for this scenario?

Ultimately, we are looking for guidance on both field-update efficiency and process timing, ways of preventing Workflow from "locking up" by processing too many documents at once, and any other "gotchas".

I attached the conversion testing workflow and a file that shows the number of branches in the auto-filing workflow; you can't read the details, but it gives an idea of how many conditional steps exist.

COR AUTOFILE.png
Acct # Conversion Test.png

Replies

replied two days ago

Unfortunately the attached photo is too small to see on my end, but my recommendation would be:

  1. Split up the workflows
    1. You will be limited in parallel operations by each workflow definition, so if you have one workflow doing all of the work, you will only be able to process one entry at a time.
    2. It looks like you have some branches, so if you could have a unique workflow for some combination of record property or metadata field (i.e., these docs all do thing X, these other docs only do thing Y), you could loop over all of the found entries, kick off each of the unique workflows, and scale it out that way.
    3. I've also just duplicated a workflow and kicked off workflows 1-4 based on the entry ID being even/odd/etc.
  2. Do not use track tokens in any of the workflows; this will destroy performance and hammer your Workflow SQL server.
  3. Once you scale out horizontally like this you will hit the Workflow and SQL servers pretty hard, so I would recommend doing it off hours or taking the repo offline for however long it takes.
  4. 120k docs is not small, but it's not a large amount either. Just make sure you've covered all of the ways you need to process these docs before running it on everything (or break it up into chunks), because you won't want to have to run it multiple times.
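The even/odd partitioning idea in point 1.3 can be sketched as a modulo split over entry IDs. This is an illustration only; the entry IDs and the choice of 4 copies are assumptions, not values from the original post:

```python
# Hypothetical sketch: spread entry IDs across N duplicated workflow
# definitions by taking the ID modulo N, so the copies run in parallel.
NUM_WORKFLOW_COPIES = 4

def assign_workflow(entry_id, copies=NUM_WORKFLOW_COPIES):
    """Pick which duplicated workflow definition handles this entry."""
    return entry_id % copies

entry_ids = [101, 102, 103, 104, 105, 106, 107, 108]  # made-up IDs
buckets = {}
for eid in entry_ids:
    buckets.setdefault(assign_workflow(eid), []).append(eid)

for wf, ids in sorted(buckets.items()):
    print(f"workflow copy {wf}: {ids}")
```

Because entry IDs are assigned sequentially, a modulo split tends to balance the copies roughly evenly without needing any knowledge of the documents' metadata.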
replied two days ago

Thank you for these helpful tips!

We would absolutely do this outside of operating hours or have everyone out of the repository while processing. 

One good thing, if there is a good thing, is that everything that applies to the numbered branches you see is already in its own records folder within the repository. What you see with a red X is a branch we have disabled, as there is no longer a need for it.

For example, I may have 9,000 docs that meet the criteria for A-1 all together, 6,000 that meet the criteria for A-3, and so forth, all the way down to C-2, where I could have 25,000.

So let's say I was reprocessing the docs in the records folder associated with group C-2: I could disable primary branch A, primary branch B, and secondary branch C-1, then process the documents I know meet C-2 criteria, so that all the work follows a vertical, rather than horizontal, conditional path. Then follow suit with each remaining branch.

If the steps being taken in Workflow are all vertical, do I still have concerns with the volume of documents that would be selected to hit the workflow at any given time? 

COR AUTOFILE SECTIONS.png
replied two days ago

Because all of your branches/logic are constrained to a single workflow, you will hit a concurrency limit of 4 at a time. I can't quite tell what starts your colored screenshot, but I'm assuming it's a metadata conditional starting event. That isn't necessarily bad, and it would be needed if you also want a metadata change from the repository client to re-trigger the update workflow.

For your large batch, I would recommend copying the topmost branches that define A/B/C into the workflow that loops over the entries and queries the database for the new account numbers (I'll call this the main workflow). All you need is the logic that routes an entry into branch A, B, or C; you don't need the logic for branches 1-X. Then I'd make three copies of your colored workflow (I'll call this the filing workflow), so you'd have filing A, filing B, and filing C. You can keep them as-is: even though you already know whether it's A/B/C at the point each is invoked, the overhead is minimal and you don't have to worry about breaking the workflow by disabling things. Then in your main workflow you just invoke filing workflow A, B, or C from within the condition, making sure "wait for workflow" is not checked.

Now you should be able to get roughly 3x the speed. If A is still 2x the other quantities, it will be your slowest workflow to complete, but that should still be a smaller number than your original total.

 

Note: I'm assuming none of this filing touches the same entries, i.e., two workflows potentially running at the same time trying to create the same parent folder. If that's potentially the case and the workflow isn't built to handle it, I'd say just run it how you have it now (still preferring direct invocation over waiting for a field change) and let it run over a weekend.

replied two days ago

We have a lot of experience processing large volumes and breaking it into smaller branches and sub-workflows is definitely the way to go. My first LF project involved migrating 40 million documents from a legacy system, and after I got the workflows optimized, we finished 6 months ahead of schedule.

Another thing to keep in mind is the number of iterations within a particular workflow instance; with each iteration, workflow tracking data accumulates, and depending on how many activities you have, it can start to have a serious impact on performance after a few hundred iterations.

In general, I limit these types of workflows to about 250 documents per instance. I might have multiple instances in parallel, but the idea is that I don't want a single instance to run so long that it becomes inefficient (i.e., an activity that takes seconds in the beginning can take minutes by the end if you try to do too much in a single instance).
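The 250-per-instance rule above amounts to chunking the document set into fixed-size batches before dispatch. A minimal sketch, assuming sequential document IDs and the 120,000-document figure from the original post:

```python
# Sketch of batching a large document set into fixed-size chunks so no
# single workflow instance iterates long enough to degrade.
BATCH_SIZE = 250

def chunk(doc_ids, size=BATCH_SIZE):
    """Yield successive fixed-size batches of document IDs."""
    for i in range(0, len(doc_ids), size):
        yield doc_ids[i:i + size]

doc_ids = list(range(1, 120_001))  # e.g., 120,000 documents
batches = list(chunk(doc_ids))
print(len(batches))       # 480 instances of 250 docs each
print(len(batches[-1]))   # 250
```

At 250 documents per instance, 120,000 documents come out to 480 instances, which can then be fed to the parallel workflow copies in manageable waves.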

However, I'd be extra careful if you don't check the "wait for wf" option. If you go that route, you want to be very sure you're not bombarding the server with too many requests at once, because you could easily overwhelm it if you're not careful (think of having people exit a plane row by row vs. telling everyone to get up and go at once).

I tend to prefer creating parallel branches that wait so I can create a sort of flow regulator, but that does require an extra workflow to manage the parallel branches. I also agree that directly invoking the workflow is preferable to waiting for a field change because that gives you a lot more control.
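The "flow regulator" idea above behaves like a concurrency gate: only a fixed number of batches are in flight, and the dispatcher waits for a slot before releasing the next one. A rough sketch with a semaphore (the limit of 4 and the batch count are assumptions for illustration):

```python
import threading

# Sketch of a "flow regulator": cap how many batches run at once, analogous
# to parallel branches that wait before kicking off the next batch.
MAX_IN_FLIGHT = 4
gate = threading.Semaphore(MAX_IN_FLIGHT)
results = []
results_lock = threading.Lock()

def process_batch(batch_id):
    with gate:  # blocks while MAX_IN_FLIGHT batches are already running
        with results_lock:
            results.append(batch_id)  # stand-in for the real filing work

threads = [threading.Thread(target=process_batch, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # all 10 batches completed, at most 4 concurrently
```

This is the same row-by-row plane-exit idea from the previous reply: the semaphore meters the flow instead of letting every batch hit the server simultaneously.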

With enough resources, Laserfiche can handle some pretty serious volume, so it just depends on your environment and workflow design. For example, our main document processing workflows can handle about 5,000 documents per hour during business hours, and that's with extra steps like OCR and PDF generation.

replied two days ago

Thank you so much for your reply! 

replied two days ago

We do not OCR any documents in this workflow. Under normal circumstances, it is initially kicked off using a conditional starting rule to monitor for a document landing into a particular repository folder.

To enable this workflow to apply changes to documents that already exist in the repository (where someone might change a field on a shortcut that requires it to run again), there is a step at the beginning that checks whether the entry is a shortcut and deletes the original. From there the decisions begin: is it a business, individual, or veteran account, based on the template? After that, depending on document type, all sorts of things happen.

I am going to paste pictures of the segments of the workflow, start to finish, showing how one document might flow down through the business segment of the workflow. P.S. I can't recall why the Simple Sync Sequence was used, but it was for performance reasons and was added by our solution provider.

This next step needs to be cleaned up, as there is now no condition; it just needs to create a filing date for everything that hits that branch.

 

Next, based on the document type identified by a field in the repo, we determine which retention classification the document falls into, create date tokens, route it to a folder, and update records management properties accordingly.

Lastly, we create shortcuts of the records for one subset of documents and route them to a particular folder; ultimately, all documents have a shortcut created in an account-number-based folder structure where our employees work with the documents.
