So I tried a couple of things to test the speed. It looks like the time is going to be pretty long however I do it, because it doesn't seem to be able to pull the text stream straight out of the PDFs.
For example, 'Generate Pages' took minutes for one 35-page document, and another, a 345-page document, took over 12 minutes and then threw what I gather is the infamous [740] "no response from server" error, which has been posted about here without many really effective solutions.
So I dug a little deeper and found some very strange things:
First of all, one of the documents refused to OCR at all with any of the possible combinations of settings (iFilter, alternative methods, use page images, etc.). When I tried to open it in Pages, I found it was one of the mystery missing documents I have seen from time to time. They usually show up as recurring errors when I try to OCR everything, because a search for all documents with no OCR text in them picks them up. They give the:
"This document contains no pages" and "No response from server [740]" errors.
So I thought I'd just go and grab the original from the source disk and have a look at the PDF itself.
But when I searched for the name in Windows Search, the file was simply not there. It is simply not on the original DVD from which the set was copied into the ImportAgent ingestion folder! What the?
See some of the screengrabs for evidence. I must say I'm perplexed.
Notice in the Windows Search screengrab that it finds two hits, but even those don't follow the same naming convention:
they use FINANCIAL, not FINANCE, and if I enter the next character, the ...5 in 2835, I get no hits.
Very weird. Must be doing some kind of renaming??
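
To test the renaming theory without relying on Windows Search, something like the rough sketch below could list files on the source disk whose names are merely similar to the one being looked for. It assumes Python is available on the machine; the function name `find_similar_names`, the drive letter and the file name in the usage line are all made up for illustration and are not the real names from my set.

```python
import difflib
import os


def find_similar_names(root, wanted, cutoff=0.6, limit=5):
    """Return paths under `root` whose file names resemble `wanted`.

    Matching is fuzzy and case-insensitive, so a file that differs by a
    word (FINANCIAL vs FINANCE) or by a digit should still be listed.
    """
    by_name = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            # Group full paths by lower-cased file name.
            by_name.setdefault(name.lower(), []).append(os.path.join(folder, name))
    # difflib ranks candidates by similarity ratio, best match first.
    close = difflib.get_close_matches(
        wanted.lower(), list(by_name), n=limit, cutoff=cutoff
    )
    return [path for name in close for path in by_name[name]]


# Hypothetical usage: D:\ is the DVD drive, the file name is a placeholder.
# for path in find_similar_names("D:\\", "FINANCE REPORT 2835.pdf"):
#     print(path)
```

If the missing name only ever turns up as a near match like this, that would point towards the files being renamed somewhere between the DVD and the ingestion folder, though it would not say where.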