Question

Efficiency for importing thousands of files

asked on August 1

We are manually importing thousands of PDFs from an external scanning company.

We need to generate pages, OCR these files, and then move them to a workflow.

Some are thousands of pages long. (We are in LF Cloud.)

Any suggestions on how we can improve this process?

Thank you in advance.

Replies

replied on August 1

What are your limitations? I haven't done LF Cloud migrations, but I have done on-prem ones. This might not be what you're asking, but here are my thoughts. Since running many workflows simultaneously during production hours could hurt performance, here was my approach to importing millions of already-digitized pages from a different repository system. Once they were imported, instead of having a workflow run as soon as an entry is created (which can generate quite a few simultaneous workflows), I used an hourly workflow that searched for entries that weren't yet processed and returned roughly the number that could be processed in an hour (in my case, about 400). I then used a "for each entry" loop to perform the work (generating pages, OCR, renaming, updating metadata, moving) within a "deadline" activity. This deadline would end the workflow if it ran for 59 minutes, which prevented it from running at the same time as the next hourly workflow.

This approach scales easily by the hour, because I can schedule it to run once an hour during whichever hours of the day I choose by changing only the starting rule, not the workflow itself. Additionally, the "for each entry" loop runs inside a single workflow that handles one entry at a time, so there was essentially zero chance that it could spawn runaway workflows and bog the server down.
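
As a rough illustration of the pattern described above, here is a minimal Python sketch of one hourly run. It is not Laserfiche Workflow and does not call any Laserfiche API. The names `run_hourly_batch`, `pending`, `process`, and `clock` are hypothetical, and the defaults of 400 entries and 59 minutes are simply the figures from my case. One difference to note: the sketch only checks the deadline between entries, whereas a deadline activity would end the workflow itself.

```python
import time
from typing import Callable, Iterable, List


def run_hourly_batch(
    pending: Iterable[str],
    process: Callable[[str], None],
    batch_size: int = 400,
    deadline_seconds: float = 59 * 60,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Process unprocessed entries one at a time within a single run.

    Stops when batch_size entries are done or the deadline has passed,
    so this run cannot overlap with the next scheduled one.
    """
    started = clock()
    done: List[str] = []
    for entry in pending:
        # Cap the batch at what one run can realistically handle.
        if len(done) >= batch_size:
            break
        # Deadline check happens before starting each entry.
        if clock() - started >= deadline_seconds:
            break
        # Stand-in for generate pages, OCR, rename, update metadata, move.
        process(entry)
        done.append(entry)
    return done
```

A scheduler (the equivalent of the starting rule) would call this once per hour with the current list of unprocessed entries. Changing which hours it runs then means changing only the schedule, not the function.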

replied on August 5

All Laserfiche Cloud accounts come with Import Agent, which can extract pages from PDFs and OCR them.

replied on August 1

We have had a lot of success using the Windows client on a machine with a fast CPU. This way you can upload all the documents at once with the same settings, generate pages and OCR in one operation, and check on it until it completes. Since the processing all happens client side, the faster the CPU, the faster the job gets done. I'm not sure whether more cores help, though, as the job might be processed sequentially.

replied on August 6

Can you provide more detail on what you mean by "move to a workflow"? What actions would this workflow be performing on the entry?
