
Question

Does import agent know what documents it already imported?

asked on February 3, 2022

Hello, I know this sounds like a funny question, but what I'm working on is a migration from an older document system to our Laserfiche repository. It looks like I need to move around 3-4 TB of documents from the volumes of this old system to LF. I have a workflow ready to go, and the idea was to point Import Agent at the root folder where everything is stored and turn it on. I have some concerns.

1. Since this process will take a couple of days, I'm worried my current users will notice the system slow down if my workflow server is processing documents all day long. Will they notice? 

2. If I set up Import Agent to only run during off-hours, say 5 PM to 3 AM, will it try to import the same documents it has already imported if I don't have any post-processing (like deleting or moving files) turned on?

Thank you.

0 0

Answer

SELECTED ANSWER
replied on February 4, 2022

@████████ If your users were reporting slowness, my guess would be that it probably has less to do with Workflow being overloaded than it does with the repository being hit with a lot of changes all at once.

I'd recommend you narrow down "where" they were seeing the slowness, because it is unlikely that users would be all that aware of slow workflow performance, but a bogged down repository would be very noticeable.

Hard to say exactly where the problem might be, but my first guess would be the repository database being hit hard with all of the document changes.

2 0

Replies

replied on February 3, 2022

Don't just point Import Agent at the old folder. Import Agent does not keep a database of what it has already imported; it either moves files to a new folder or deletes them so that it doesn't grab them again. You can only select move or delete; there is no "none" option for post-processing.

So, if you don't want those files moved or deleted, you'll need to do something else. What we did was copy all of the legacy files to a separate environment and let IA delete the copies.

That way, we could keep track of processing without affecting the legacy system.

As far as the performance hit of any workflows goes, that will really depend on how you set things up: how the workflows are triggered, how much work they're doing, whether you're looping and how many times, and, more importantly, how many instances will be generated.

2 0
replied on February 3, 2022

Hello Jason, yeah, I just noticed that the only options are delete or move. I think I was thinking of Quick Fields agent; doesn't that let you skip post-processing? Anyway, my workflow was going to kick off on document creation, so one workflow instance per document. I guess that might not be the most efficient. I wonder if one workflow that searches for and processes the documents after they're imported would be better? What do you think?

0 0
replied on February 3, 2022

So my first project with Laserfiche was to import about 45 million documents from a legacy system and I learned a lot of lessons in the process.

After some trial and error, and a lot of revisions early on, this is what we landed on:

  1. Robocopied all the source files, broke them into subfolders of about 200-250 files each, gave the folders simple names like 00, 01, 02, etc., and grouped them in parent folders with the same naming scheme while trying to distribute everything as evenly as possible (e.g., 00\00, 00\01, 01\00, 01\01); see the sketch after this list
  2. I created an Import Agent profile that targeted the parent folder, set to retrieve from subfolders, and used the folder\subfolder name as part of the repository path so the same structure would be maintained
  3. I copied the IA profile to create one for each of the parent folders 
  4. I set up a workflow (let's call it Subfolder WF) that looped through the contents of one subfolder (e.g., everything in 00\00), then set metadata, moved the documents, etc., and deleted the empty folder after the loop.
  5. I set up another workflow (let's call it Folder WF) that would loop through all the contents of a parent folder. It would run Subfolder WF on the current entry of the loop and wait for it to finish
  6. Finally, I set up a "Batch" workflow that would trigger the Folder WF on all of my "parent" folders simultaneously (either by looping and not waiting, or by using parallel branches) for controlled parallel processing.
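For what it's worth, the splitting in step 1 is straightforward to script. Below is a rough Python sketch of that distribution logic only; the paths, batch size, and folder counts are placeholders, the actual bulk copy in my case was done with robocopy, and a real run would also need to handle duplicate file names.

```python
# Sketch of step 1: spread a flat copy of the legacy files into parent/subfolder
# batches of ~250 files each (00\00, 00\01, ...). All paths and sizes below are
# hypothetical placeholders.
import shutil
from pathlib import Path

SOURCE = Path(r"D:\LegacyCopy")       # flat copy of the legacy volumes (assumption)
STAGING = Path(r"D:\ImportStaging")   # folder the IA profiles will watch (assumption)
FILES_PER_SUBFOLDER = 250
SUBFOLDERS_PER_PARENT = 100           # subfolders 00-99 under each parent (assumption)

files = sorted(p for p in SOURCE.rglob("*") if p.is_file())

for i, src in enumerate(files):
    batch = i // FILES_PER_SUBFOLDER
    parent, sub = divmod(batch, SUBFOLDERS_PER_PARENT)
    dest_dir = STAGING / f"{parent:02d}" / f"{sub:02d}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    # These are copies, so Import Agent can safely delete them after import.
    shutil.copy2(src, dest_dir / src.name)
```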

 

The multiple IA profiles allowed me to pull things in faster, but be mindful of your IA server resources; additional profiles only help up to the number of threads the service can use at any given time.

The folders/workflows structured in that way allowed me to avoid running too many workflow instances while also avoiding the efficiency problems you get if you do too many loop iterations in a single workflow.

Running the parallel instances allowed me to speed things up, while also keeping tight control over how many could be running at once.

Now you do want to consider your WF activities when you decide how many items to loop through in a single subfolder. The more activities, the more data to track for each loop, and a workflow's execution can slow down dramatically if that builds up too much; it takes a lot, but in a bulk process like this it can definitely happen.

For example, at first I was looping through thousands of documents all in one go; the loop would start at about 7 seconds per document, but toward the end it was closer to 1 minute per document.

Breaking it into batches solved that problem: each batch/instance could finish before the tracked data built up enough to affect performance, and a fresh instance would start for the next batch.
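To make the "bite size" idea concrete, here's a simplified Python sketch of the batching pattern, purely as an analogy; process_document is an invented stand-in for the workflow's activities, and the batch size mirrors the 200-250 files per subfolder above. Nothing here calls Laserfiche itself.

```python
# Analogy only: "one workflow instance per subfolder batch". Each run_batch call
# starts with fresh state, so per-item tracking never accumulates past one batch.
from typing import Iterable, List

BATCH_SIZE = 250  # roughly the 200-250 files per subfolder mentioned above

def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield successive fixed-size batches so no single run grows unbounded."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def process_document(doc: str) -> None:
    pass  # placeholder for the set-metadata / move / etc. activities

def run_batch(batch: List[str]) -> None:
    for doc in batch:
        process_document(doc)

all_documents = [f"doc_{i}" for i in range(10_000)]  # hypothetical listing
for batch in chunked(all_documents, BATCH_SIZE):
    run_batch(batch)
```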

0 0
replied on February 3, 2022

Gotcha, thank you for all the info. I'll do some more testing. The bite-size chunks do sound like a good idea because I know how WF can get bogged down.

0 0
replied on February 3, 2022

On that note, I'd create some sort of "holding" folder off the root, not put them in the root. That way, if anything goes wrong, your users don't have to wade through thousands of docs to get to their normal work.

If these documents need individual processing and there's no need to know anything about the other ones in the batch, then triggering on document creation IS the most efficient way to process them. That lets you take advantage of Workflow's multi-tasking capabilities (which you wouldn't if you retrieved all the folder contents and iterated over them). Processing is faster this way, with less overhead on the WF, LF, and SQL servers.
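As a loose analogy (not the Laserfiche API; the worker function and pool size below are invented for illustration), per-document triggering behaves like submitting many small independent tasks to a pool, whereas a single get-folder-contents loop does the same work serially inside one long-running unit:

```python
# Analogy only: independent per-document tasks vs. one big sequential loop.
from concurrent.futures import ThreadPoolExecutor

def process_document(doc: str) -> str:
    return f"{doc} processed"   # placeholder for the per-document workflow

documents = [f"doc_{i}" for i in range(1_000)]

# Per-document triggering: many small, independent units the engine can schedule.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel_results = list(pool.map(process_document, documents))

# Folder-iteration approach: one long-running unit doing everything serially.
serial_results = [process_document(d) for d in documents]
```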

3 0
replied on February 3, 2022

I'll also note that Import Agent 10 throughput scales well with cores/vCPUs up to four, after which the marginal benefit of each additional core falls off hard. Two machines running Import Agent, each with 4 vCPU, will have something like 60-80% higher throughput than one machine with 8 vCPU.

Don't have Import Agent do page generation for the migration. It'll slow down your import throughput by an order of magnitude or more. If you need page gen, use DCC with Workflow on the backend.

1 0
replied on February 4, 2022

@████████ Good points. I suppose I should rephrase. While triggered workflows for individual items are the most efficient from a performance standpoint, my concern would be about the database impact of creating so many instances, strictly based on the problems we had with our bulk import.

This is obviously anecdotal and I don't think it affected processing times, but when we were triggering individual workflows with such a large number of documents it didn't take long before we started having issues with the instance history.

Again, I don't think it really affected throughput, but it did get to a point where we couldn't search workflow history reliably in the designer, and the automated database maintenance processes were failing (seemed to be timing out).

1 0
replied on February 4, 2022

Yeah, that makes sense, Jason. I just ran a test, and since I was using my production workflow server to process these docs, my users started complaining about slow performance in their normal use of LF. I had to stop importing and processing. I'm going to have to either limit the times the import runs or spin up an extra workflow server just for processing these documents. The other thing I'm not sure how to accomplish is an extra IA instance. Even though I have RIO, I cannot install another copy of IA on an extra machine; I would love to do that to speed up importing based on what Samuel Carson suggested.

I just tested importing about 12 GB of documents, and after 17 hours my system had only made it through just over 10 GB. Since I'm looking at 2-3 TB of data to import, if my math is correct that works out to about 5,100 hours for 3 TB, or roughly 212 days, and that would be running 24 hours a day. I definitely need to speed this up.
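(For reference, the arithmetic behind that estimate, as a quick sketch using 1,000 GB per TB:)

```python
# Back-of-the-envelope check of the numbers above.
observed_gb = 10.0             # just over 10 GB imported
observed_hours = 17.0          # in 17 hours
rate = observed_gb / observed_hours        # ~0.59 GB/hour

target_gb = 3 * 1000                       # 3 TB, using 1000 GB per TB
hours_needed = target_gb / rate            # 5,100 hours
days_needed = hours_needed / 24            # ~212 days

print(f"{rate:.2f} GB/hour -> {hours_needed:,.0f} hours (~{days_needed:.0f} days)")
```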

0 0
replied on February 4, 2022

That's a fair point. Both individual instances per document and workflows that would loop over large numbers of documents can cause SQL performance issues. And it's important to understand how maintenance works to plan for it.

Workflow clears out completed instances from the database a certain time after completion (30 days by default). And it does that in chunks of 5,000 at a time, during the maintenance window, to keep its impact on SQL to a minimum. So individual instances can expire faster than one giant instance that loops through lots of documents, but unless that instance takes weeks, that's probably not going to make much difference.

Now, all instances have to keep track of all the work they do and, obviously, a larger instance has more work to keep track of. That means when a larger instance completes, there's more data to be shifted around in SQL from the active tracking tables to the log tables where it lives until cleanup time. So that can have an impact on SQL performance, which is why we recommend that you keep your loops under 500 iterations.

However, the most important part of all this is the overall load on SQL. During a project like this, you're doing a lot of writing to SQL to keep track of instances and their work. So it's important to have a SQL maintenance plan in place that periodically rebuilds indexes and updates statistics on the WF database. Usually a weekly schedule for this is more than enough, but during a higher-than-normal load event like this, you'd want to step up its frequency, probably to nightly, to account for the extra activity.
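(If you don't already have maintenance tooling for this, even a small scheduled script covers the basics. The sketch below is hypothetical: the server, database name, and driver string are placeholders, and in practice this would more typically be a SQL Server Agent maintenance plan than a Python script.)

```python
# Minimal nightly maintenance pass against the Workflow database (placeholder names).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=SQLSERVER01;DATABASE=LFWorkflow;"   # hypothetical server/database
    "Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Rebuild indexes on every user table, then refresh statistics.
tables = cur.execute(
    "SELECT s.name, t.name FROM sys.tables t "
    "JOIN sys.schemas s ON t.schema_id = s.schema_id"
).fetchall()
for schema, table in tables:
    cur.execute(f"ALTER INDEX ALL ON [{schema}].[{table}] REBUILD")

cur.execute("EXEC sp_updatestats")
conn.close()
```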

And that also applies to the repository and its SQL database. If you don't have a SQL maintenance plan for it, get one. Again, more documents imported results in more writing to the repository's SQL database, and so does WF activity, so you should consider increasing the frequency of that maintenance plan too.

Audit Trail is probably another thing you want to consider during this. If your audit SQL database is on the same SQL Server as the repository and Workflow, it will get hit very hard trying to keep up with the events in the repository.

3 0
replied on February 4, 2022

@████████, you're definitely getting bottlenecked somewhere outside Import Agent, assuming you aren't generating text/pages like I mentioned earlier. We ran Import Agent multi-threaded performance testing some years ago; the first line of those results, "Pdf_Small_NotGeneratePage&Text", shows 516.7 MB/minute (31 GB/hour) for Import Agent 10 with four threads. Note that there is a 20x drop in throughput for the next test, which included page and text generation.
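(To put that benchmark next to the ~0.6 GB/hour observed in the 17-hour test above, a quick sketch; conversions use 1,000 MB per GB:)

```python
# Compare the quoted benchmark rate with the throughput observed in the test.
benchmark_mb_per_min = 516.7
benchmark_gb_per_hour = benchmark_mb_per_min * 60 / 1000   # ~31 GB/hour

observed_gb_per_hour = 10 / 17                             # ~0.59 GB/hour from the test

print(f"benchmark: {benchmark_gb_per_hour:.0f} GB/hour, observed: {observed_gb_per_hour:.2f} GB/hour")
print(f"gap: ~{benchmark_gb_per_hour / observed_gb_per_hour:.0f}x")
print(f"3 TB at the benchmark rate: ~{3000 / benchmark_gb_per_hour / 24:.1f} days")
```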

2 0