Question

Workflow on 1 million documents / Recursive Workflow advice

asked on March 4, 2019

I have a simple task to perform, but on a large number of documents (more than 1 million). All I need to do is remove the hyphens from the SSN field in each of those 1 million+ documents. They all reside in a system of subfolders under the same root level folder. Here are the options I have considered so far.

 

1) Search Repository action that finds all documents under that root level folder. For each document, invoke an auxiliary workflow that updates its SSN field.

2) Search Repository action that finds all documents under that root level folder. For each document, update the SSN field.

3) Search Repository action that finds all folders under that root level folder. For each folder, invoke a workflow. That auxiliary workflow searches for all documents directly under the current folder (no subfolders) and updates the SSN on them. I received this general idea from Laserfiche Presales for another large job.

4) Find Entries under the root level folder. For each entry, if the entry is a document, update the SSN field, and if it is a folder, invoke this same workflow on it. (idea from https://answers.laserfiche.com/questions/144601/Update-metadata-in-all-documents-in-folder-and-subfolders#144607)

 

I am worried because I have been running option #1, and after 65 hours of progress it has only completed 340,000 documents. I expected it to be faster, since it is only one activity per document (Assign Field Values). Which of these four options would be best, and is there another strategy that would work better?

 

Thanks


Answer

SELECTED ANSWER
replied on March 4, 2019

Workflow has too much overhead (this isn't a bad thing, read on) for this kind of tight-loop maintenance operation. It keeps a lot of logs about each instance, its activities, the documents it touches, the tokens, etc. This is all great when troubleshooting day-to-day workflows, but it tends to slow things down when you are trying to push a large batch through Workflow.

This also exposes you to some potential issues. If the workflow fails for any reason, do you have a way to reliably restart from where you left off? You also risk bloating the Workflow database as it logs all of this activity; I imagine your SQL transaction logs are quite large right now. And, as you've discovered, it's slow.

The recommended solution for this kind of project is to use the SDK and write a standalone script that performs the operations you need. The SDK will allow you to perform searches, examine documents and their metadata, modify template fields, and save the changes.
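
Not Devin's actual code, just a minimal Python-flavored sketch of the shape such a standalone script can take. The real open/read/save calls belong to the Laserfiche SDK (a .NET library), so they appear only as a placeholder below; the function names are invented for illustration, and only the hyphen-stripping transform is concrete.

```python
import re


def strip_ssn_hyphens(raw_ssn: str) -> str:
    """Normalize an SSN such as '123-45-6789' to '123456789'."""
    return re.sub(r"-", "", raw_ssn or "")


def update_entry(entry_id: int) -> None:
    """Placeholder for the per-document work done through the SDK.

    A real implementation would open the document by its entry ID, read the
    SSN field from its template, write back strip_ssn_hyphens(value), and
    save. The name and signature here are illustrative, not actual SDK calls.
    """
    raise NotImplementedError("Replace with Laserfiche SDK calls")
```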

I like to add another layer for large batches. I'll query the repository database directly to identify the documents I want to modify, then store the entry IDs in a table. (We have a database that contains tables for lookup items and temporary data like this.) In addition to the entry ID column, the table also includes an IsDone (bit) column and possibly an ErrorMessage (varchar) column. The SDK script iterates through the rows, performs the appropriate operations against each entry, and marks the row as done. This provides a bit of resiliency and error tracking in case something happens.
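
Again not from the original reply, just a sketch of that staging-table pattern under some stated assumptions: a SQL Server staging table named EntryStaging with the columns described above (names invented here), reached through pyodbc, with the per-entry work delegated to the placeholder update_entry function sketched earlier.

```python
import pyodbc

# update_entry here refers to the SDK placeholder from the previous sketch.

# Example DDL for the staging table described above (names are illustrative).
STAGING_DDL = """
CREATE TABLE EntryStaging (
    EntryId      INT           NOT NULL PRIMARY KEY,
    IsDone       BIT           NOT NULL DEFAULT 0,
    ErrorMessage VARCHAR(1000) NULL
);
"""


def run_batch(conn_str: str) -> None:
    """Work through every row not yet marked done, recording errors per entry."""
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()

    # Pull only unprocessed rows so the script can be stopped and restarted
    # without losing its place.
    cursor.execute("SELECT EntryId FROM EntryStaging WHERE IsDone = 0")
    entry_ids = [row.EntryId for row in cursor.fetchall()]

    for entry_id in entry_ids:
        try:
            update_entry(entry_id)  # placeholder for the SDK call sketched above
            cursor.execute(
                "UPDATE EntryStaging SET IsDone = 1 WHERE EntryId = ?", entry_id)
        except Exception as exc:
            cursor.execute(
                "UPDATE EntryStaging SET ErrorMessage = ? WHERE EntryId = ?",
                str(exc)[:1000], entry_id)
        conn.commit()  # commit per entry so progress survives a crash

    conn.close()
```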

replied on March 7, 2019

Devin, I combined your idea of an SDK solution with Miruna's suggestion, plus Jason's advice on which option to use, and by your powers combined we went from:

340,000 documents in 65 hours (Workflow method)

to

1,066,311 documents in less than 6 hours (SDK method)

Thank you all!

replied on April 3, 2019

Devin, 

 

Can you provide a sample of an SDK script where you search the repository for documents to update, process them, and invoke another workflow for filing?


Replies

replied on March 4, 2019

I'd go with Devin's route if you don't need all the logging of Workflow.

As far as your options go, #4 should be fastest because you're retrieving entries directly. However, at some point you're still going to run into the fact that you can only set fields on 4 entries at once (per CPU on the Workflow server). #3 is a variation of #4, using search instead of Find Entries. #2 will be the slowest.

However, if you have a large number of entries and only a smaller percentage of them have the SSN field, then search might be more efficient since you'd be sifting through a smaller number of documents overall.

Processing search results usually goes better if you do it in batches of about 1,000 at a time. I'd have Search Repository or Find Entries retrieve the field and pass it to the invoked workflow as an input parameter. That cuts the number of calls to Laserfiche by one per document, because the field values come back as part of the call that retrieves the search results.
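
Not something from this thread, just a generic illustration of the batching idea in Python: slice a large result set into groups of roughly 1,000 and hand each group off, carrying the field value retrieved with the search so no extra lookup per document is needed. The result shape and handler are hypothetical.

```python
from typing import Iterator, List, Tuple


def batches(items: List, size: int = 1000) -> Iterator[List]:
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage: results holds (entry_id, ssn_value) pairs returned by
# the search, so the value travels with the ID instead of needing a lookup.
results: List[Tuple[int, str]] = []
for chunk in batches(results, 1000):
    for entry_id, ssn_value in chunk:
        pass  # hand each entry off to the per-entry update here
```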

 

replied on March 5, 2019

As Devin mentioned, the buildup of instance data is what causes your performance decrease. When we first switched to Laserfiche I had to run a workflow to populate metadata on approximately 40 million documents.

My first iteration was a single looping structure, and I found that it would start at about 10 seconds per document, but after a day it had slowed to over 1 minute per document.

Breaking the work into one or more child workflows allowed me to maintain that efficiency. I split it into 32 parallel processes (4 parallel branches × 8 CPUs) and managed to finish far faster than we expected.
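
Outside of Workflow, the same divide-and-conquer idea might look like this rough sketch: split the pending entry IDs into a fixed number of slices and process them in parallel worker processes. The worker body and the count of 32 are placeholders echoing the description above, not anything from the original post.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List


def process_slice(entry_ids: List[int]) -> None:
    """Placeholder worker: run the per-entry update over one slice of IDs."""
    for entry_id in entry_ids:
        pass  # call the per-entry update (SDK placeholder) here


def run_parallel(all_ids: List[int], workers: int = 32) -> None:
    if not all_ids:
        return
    # One contiguous slice per worker, mirroring the parallel child workflows.
    step = -(-len(all_ids) // workers)  # ceiling division
    slices = [all_ids[i:i + step] for i in range(0, len(all_ids), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process_slice, slices))
```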

However, as Devin also mentioned, this generated a huge amount of data in the Workflow database, which caused some issues with searching instance activity and with the automated database maintenance.

Fortunately, we used a completely separate Workflow server for this work, shortened how long it retained the logs, and wiped the whole thing when we were finished.

replied on March 5, 2019

Our production Workflow database has backups running every 30 minutes. This keeps the transaction logs to a manageable level.

Something else to keep in mind: even when running an SDK script, you'll have some chatter in the main Workflow logs because the subscriber still evaluates starting rules. So keep an eye on the starting rule queue on your main instance. I once got it 100k deep before users started complaining that none of their workflows were running. Now my big import process is tuned to stay just under that limit.

replied on March 5, 2019

You can exclude your SDK user from Workflow monitoring to speed up processing of these events.

replied on March 5, 2019

We did that, but the subscriber is still seeing the events.

replied on September 29, 2020

It is worth mentioning that, before starting a large process like this, you should purge any documents that are eligible for destruction (if possible). If you cut down the number of documents, the processing time will drop as well. Sometimes this can be more trouble than it's worth, depending on the personalities and protocols involved, but I'm putting it out there as a best practice.
