If documents that do not have associated text do not need to be text searchable, how do we disable this feature? Also, is it possible to offload it to a DCC?
Question
How do we disable Search Engine Text Extraction?
Answer
The bulk TextProvider process run by the search engine itself can hit the CPU pretty hard, but the individual text extraction operations a user runs in the Client are nothing compared to OCR. Text extraction on any given document is quick and far less processor intensive than OCR. We aren't concerned about restricting it on Web Access, for example, the way we are with OCR.
To locate the documents, I'd suggest looking for efiles without pages and starting from there. So long as text has been generated for them once, TextProvider won't try to process them during indexing.
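As a rough illustration of the selection rule described above (electronic file present, no pages, no text generated yet), here is a minimal Python sketch. It does not use any Laserfiche API: the `DocRecord` class, its fields, and the sample data are all hypothetical stand-ins for whatever a repository search or export would give you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocRecord:
    """Hypothetical summary of one repository entry (not a Laserfiche type)."""
    entry_id: int
    name: str
    page_count: int   # number of image pages on the document
    has_efile: bool   # True if an electronic file is attached
    has_text: bool    # True if searchable text has already been generated


def needs_text_extraction(doc: DocRecord) -> bool:
    """An efile with no pages and no stored text is a candidate for one-time extraction."""
    return doc.has_efile and doc.page_count == 0 and not doc.has_text


def select_candidates(docs: List[DocRecord]) -> List[DocRecord]:
    """Return the documents that would still be re-processed at index time."""
    return [d for d in docs if needs_text_extraction(d)]


# Made-up sample data for illustration only.
sample = [
    DocRecord(101, "contract.docx", page_count=0, has_efile=True, has_text=False),
    DocRecord(102, "scan.tif", page_count=3, has_efile=False, has_text=False),
    DocRecord(103, "report.pdf", page_count=0, has_efile=True, has_text=True),
]

candidates = select_candidates(sample)
```

With this sample, only the first record qualifies: the scanned image is an OCR case rather than a text extraction case, and the last efile already has text.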
Replies
What are you looking to gain by disabling it? Also, it's performed on the search engine machine, not the client, so DCC doesn't really come into the picture.
This thread has some discussion on blocking search engine text extraction on certain extension types: https://answers.laserfiche.com/questions/50692/LFFTS-BlockedExtensions-List-registry-key
Thanks. It looks like I can just configure it to run only on permitted extensions and leave that list empty. I don't fully understand what TextProvider is doing, but from everything I have read it appears to OCR the documents temporarily for the search index. Then it drops all the work it performed rather than saving the associated text, only to OCR the documents again. OCR requires a lot of CPU cycles and we are looking for ways to free up CPU. I would rather set up a DCC job to run a once-and-done OCR of all documents.
There's no OCR'ing going on. All TextProvider does is perform text extraction on electronic documents - say, through iFilters - at index time. It's a way to get documents into the search queue without relying on users to have known to manually text extract them. That said, yes, it drops the text each time, and I would strongly recommend doing the text extraction yourself in the Client, so that the search engine can just use the stored text. However, again, there's no OCR going on. This solely has to do with generating text through text streams of electronic files.
How do I find and run permanent text extraction on these documents? I only know how to find non-OCRed images and OCR them.
If you run 'generate searchable text' on an electronic file, it will actually use text extraction. It only uses OCR on imaged documents. Strictly speaking, OCR is just one way to generate text on documents - specifically image documents in this case - and that's why the actions in the Client have more generic names. OCR is MUCH more processor intensive and time consuming than efile text extraction, which is why it's the one people think of.
Well, the text provider is a heavy hitter on CPU cycles. Even if it is not OCR, whatever it is doing takes a lot of operations, very similar to OCR. How can I find all these documents? The search criterion in Laserfiche says "has pages without text". These files would not have any pages, right?
The bulk TextProvider process run by the search engine itself can hit the CPU pretty hard, but the individual text extraction operations a user runs in the Client are nothing compared to OCR. Text extraction on any given document is quick and far less processor intensive than OCR. We aren't concerned about restricting it on Web Access, for example, the way we are with OCR.
To locate the documents, I'd suggest looking for efiles without pages and starting from there. So long as text has been generated for them once, TextProvider won't try to process them during indexing.
Got it. There are over 10 years of documents. That would explain it. Thanks for your help.
Sure thing! Is this a recent upgrade scenario from pre-8? That functionality has been there for a while.