Question

Quick Fields Processing Documents with 9000+ pages

asked on September 22

I've got a PDF doc that is 81.8MB and 9700 pages. After running through a QF session it's only scanning in the first 841 pages before it stops processing with no obvious error. If it is broken up into chunks, there doesn't seem to be an issue, but we're trying to avoid that. We've tried scanning with the LF capture engine vs the universal capture engine, but no luck. Any suggestions to get this to work without having to break this file up?


Replies

replied on September 22

If you don't mind me asking (as a Capture Team developer)...
Where did this file come from?
How did it get so large?
And why don't you want to break it up?

Unfortunately, you're going to encounter all sorts of timeouts trying to process something like that (esp. with LFCE), timeouts that exist for good reasons.

The typical solution for this situation is to break the document up, either using some external process or with a separate Quick Fields session that sort of pre-processes the document and stores its parts in the repository where another Quick Fields session can pick them up via LFCE. Though, honestly, even using a separate Quick Fields session might not work because it's a PDF and you'd, at the least, need to run page generation (which might take too long--I'm not really sure without trying it) so you can identify how to split the document.

replied on September 22

If the Distributed Computing Cluster (DCC) is an option, it may provide a viable solution. Although Quick Fields is capable of page generation, it is not always the best option, especially with larger files.

Since version 11, the DCC can be used to offload page generation via the Schedule PDF Page Generation activity, and the callback settings can be used to move the document on to the next part of processing or to handle any errors.

After version 11 was introduced, we started switching most page generation activities over to the DCC instead of Quick Fields and it has proven to be much more efficient and reliable provided the DCC workers have enough resources.

replied on September 23

@████████


Where did this file come from? It comes from our state pay system. It's W2s for our employees.

How did it get so large? It's a page for everyone who worked that year. I expect this year's to be bigger. We had over 10K employees work this year.

And why don't you want to break it up? Breaking it up would disrupt our automated process. We would need to introduce a manual step to break it up into 1K-page chunks. Then we'd be working for the system as opposed to the system working for us.

As you mentioned, this would need to be an external step, since I can't get QF to read the PDF with Universal Capture. I've started to see huge delays when scanning files over 4K pages, which are averaging about 13 hours to process. I've successfully run jobs with two or three 4K-page documents, but it takes a weekend to process, and once the job is complete it leaves the server at 90% RAM utilization and will not release the RAM unless I reboot the server.

 @████████ ,

I'm not familiar with how DCC works, but it sounds like this might help my processing issue. At the very least, it might speed up the process. It's very frustrating to see scanning at 5 pages per minute.
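As a rough sanity check on those numbers, the arithmetic below assumes the 5 pages per minute rate holds steady for an entire file. That is a simplification; real throughput probably varies with page content and server load.

```python
# Rough throughput estimate. Assumes a constant scan rate, which is a simplification.
PAGES_PER_MINUTE = 5  # rate reported in this thread


def hours_to_process(pages: int, pages_per_minute: float = PAGES_PER_MINUTE) -> float:
    """Return the estimated processing time in hours at a constant rate."""
    return pages / pages_per_minute / 60


for pages in (4000, 9700):
    print(f"{pages} pages: about {hours_to_process(pages):.1f} hours")
```

At that rate a 4,000-page file works out to a little over 13 hours, which lines up with the timing reported above, and the full 9,700-page file would take more than a day.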

replied on September 23

Thanks, that's actually very helpful.  For one thing, it means that if there was an automated way to break it up (e.g., a Quick Fields session that could handle it or DCC), it can be broken up--you just don't want a manual step to do it (which is very understandable!).

Have you tried using a Quick Fields session that has nothing in it except page generation (and uses Universal Capture, not LFCE)?

The idea would be that it would run first and break up the document, say, every 500 pages, into several documents that would be stored in a different local folder.  And then a second Quick Fields session would process those documents afterwards.
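To get a feel for what "every 500 pages" means for a 9,700-page file, here is a small sketch that only computes the page ranges. It does not talk to Quick Fields at all, and the function name `chunk_ranges` is made up for illustration.

```python
from typing import List, Tuple


def chunk_ranges(total_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


ranges = chunk_ranges(9700, 500)
print(len(ranges))            # number of documents the first session would produce
print(ranges[0], ranges[-1])  # first and last page range
```

For 9,700 pages at 500 per chunk, that gives 20 documents, the last one covering pages 9501 to 9700.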

replied on September 23

No problem. Glad to clarify. I'm fairly new to LF, so I don't know the world of possibilities, but I can definitely try it. Just to make sure I'm following you:

 

1. Upload the file to the LF repo

2. Run a QF session just to generate pages and break the file up into 500-page chunks

3. Run a second QF session to scrape data and capture metadata

 

If I got any of this wrong, please let me know.

Thank you.

replied on September 26

That's the idea, yeah.

 

I would try using Universal Capture (so don't do step 1--just use step 2 to store the ____ page documents) since it might avoid a timeout that I know LFCE has.  That said, I'm not guaranteeing this will work--the PDF might just be too big to handle in Quick Fields.  You'll also likely need to experiment with different sized chunks: 500 might still be too big or it might not.  But using an empty session would be a quick way to test whether Quick Fields is going to be able to generate pages without timing out or whether you're going to need to use an external process.

Make sure you are NOT storing the original PDF as an edoc with the 500 page documents--you don't want it being reprocessed.
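If the empty-session test does time out and an external process ends up being needed, something along these lines might work as the automated split step. This is only a sketch: it assumes the third-party `pypdf` package (not something mentioned in this thread), the file naming scheme and folder names are hypothetical, and it has not been tried on a file this large, so memory use and run time are unknown. It refuses to write chunks into the same folder as the original so the original is not picked up again.

```python
from pathlib import Path
from typing import List


def chunk_output_path(out_dir: Path, stem: str, first_page: int, last_page: int) -> Path:
    """Build a chunk file name such as w2_00001-00500.pdf (naming scheme is hypothetical)."""
    return out_dir / f"{stem}_{first_page:05d}-{last_page:05d}.pdf"


def split_pdf(source: Path, out_dir: Path, chunk_size: int = 500) -> List[Path]:
    """Write consecutive chunk_size-page PDFs from source into out_dir."""
    # Third-party dependency (pip install pypdf), imported here so the
    # naming helper above can be used without it installed.
    from pypdf import PdfReader, PdfWriter

    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if out_dir.resolve() == source.parent.resolve():
        raise ValueError("write chunks to a different folder than the original PDF")

    out_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(str(source))
    total = len(reader.pages)
    written: List[Path] = []
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size, total)
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        target = chunk_output_path(out_dir, source.stem, start + 1, end)
        with target.open("wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

A call such as `split_pdf(Path("incoming/w2.pdf"), Path("chunks"))` (both paths are placeholders) would leave the chunk files in `chunks`, where the second Quick Fields session could pick them up.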
