Question

I need to bulk import PDFs that are already text-searchable (in Acrobat, say). What's the best approach?

asked on July 28, 2015

I have some 700-800 archive documents which have already been scanned and OCRed by the bureau and delivered as PDFs with searchable text in them, something I have confirmed myself.

I take it that when I configure Import Agent I don't need to turn on OCR but I DO need to turn on create PAGES.

Is there an optimal way to use the existing text so that processing and indexing are as fast as possible? This is about 10 GB of data and time is of the essence.

Cheers,

Will


Replies

replied on July 28, 2015

Will,

What happens if you just drag-and-drop some (or all) of these files into the LF client? I do manual imports of searchable PDFs that way all the time. I've both used the "Generate Pages" option and just brought the PDFs in as PDFs, and either way they generally come in with "OCR'ed Pages=All" after the import.

BTW, much as I like the "Generate Pages" option for highlighting and positioning on search results, the file bloat when converting to pages is pretty amazing. A typical 53KB PDF tax doc we get here (8 pages) turns into a 1.78MB doc when I use that option, more than 30x the original size. A 605KB PDF tax doc (226 pages) turned into a 97MB monster using "Generate Pages", so we've decided just to store them as PDFs.
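To put those two examples in proportion, here is a minimal Python sketch of the arithmetic. The function and variable names are made up for illustration, the sizes are the ones quoted above (taking 1 MB as 1024 KB), and the projection onto a 10 GB set is only a rough assumption: two tax documents may not be representative of other PDFs.

```python
KB = 1024
MB = 1024 * KB


def growth_factor(original_bytes, converted_bytes):
    """Return how many times larger the converted document is."""
    return converted_bytes / original_bytes


# Sizes quoted above: 53KB -> 1.78MB (8 pages) and 605KB -> 97MB (226 pages).
small_doc = growth_factor(53 * KB, 1.78 * MB)
large_doc = growth_factor(605 * KB, 97 * MB)

# Rough projection only: assumes a whole 10 GB set grows like one of these samples.
source_gb = 10
projected_low_gb = source_gb * small_doc
projected_high_gb = source_gb * large_doc

print(f"small doc grew about {small_doc:.0f}x, large doc about {large_doc:.0f}x")
print(f"10 GB could become roughly {projected_low_gb:.0f} to {projected_high_gb:.0f} GB")
```

Even as a ballpark, this suggests checking available volume space before turning on "Generate Pages" for a large batch.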

Geoff

replied on July 28, 2015

So I tried a couple of things to test the speed. I find that the time is going to be pretty long however I do it, because it doesn't seem to be able to pull the text stream straight out of the PDFs.

For example, the time to 'Generate Pages' for one 35-page document was minutes, and another, a 345-page document, took over 12 minutes and then threw what I find is the infamous "No response from server [740]" error, which has been posted here without many really effective solutions.

So I dug a little deeper and found some very strange things:

First of all, one of the documents refused to OCR at all with any combination of settings I tried (iFilter, alternative methods, use page images, etc.). When I tried to open it in Pages, I found it was one of the mystery missing documents I have seen from time to time. They usually show up as recurring errors: when I try to OCR everything, a search for all documents with no OCR text in them picks them up. The errors are:

"This document contains no pages" and "No response from server [740]" errors.

So I thought I'd just go and grab the original and have a look at the PDF itself on the source disk. Searching for the name in Windows Search, the file was simply not there: it is not on the original DVD from which the set was copied into the Import Agent ingestion folder! What the?

See the screengrabs for evidence. I must say I'm perplexed.

Notice in the Windows Search screengrab that it finds two hits, but even those do not follow the same naming convention: they use FINANCIAL, not FINANCE, and if I enter the next character (the ...5 in 2835) I get no hits.

Very weird. Something must be doing some kind of renaming??
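One way to check for renamed or missing files without relying on Windows Search is to list the file names on both sides and compare them. The following is a minimal sketch using only the Python standard library; the function names and the example paths are hypothetical, and it compares names only (case-insensitively), not file contents.

```python
import os


def file_names(root):
    """Map each lower-cased file name under root to the full paths where it occurs."""
    names = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            names.setdefault(name.lower(), []).append(os.path.join(dirpath, name))
    return names


def compare_folders(source_root, import_root):
    """Return (names only in source, names only in import folder), both sorted."""
    source = file_names(source_root)
    imported = file_names(import_root)
    only_in_source = sorted(set(source) - set(imported))
    only_in_import = sorted(set(imported) - set(source))
    return only_in_source, only_in_import


# Example usage with hypothetical paths (adjust to the DVD drive and ingestion folder):
# missing, unexpected = compare_folders(r"D:\\", r"C:\ImportAgent\Incoming")
# print("On the DVD but not in the import folder:", missing)
# print("In the import folder but not on the DVD:", unexpected)
```

A name that appears only in the import folder next to a similar name that appears only on the DVD (FINANCE versus FINANCIAL, say) would point to a rename somewhere between the two.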

Screen Shot 2015-07-28 at 10.55.11 pm.png
Screen Shot 2015-07-28 at 10.52.01 pm.png
Screen Shot 2015-07-28 at 11.09.11 pm.png
replied on July 29, 2015

Thanks Geoff, I will have to experiment a bit with these suggestions.  I really appreciate your feedback.

Will
