How do we disable Search Engine Text Extraction?

replied on March 2, 2015

What are you looking to gain by disabling it? Also, it's performed on the search engine machine, not the client, so DCC doesn't really come into the picture.

This thread has some discussion on blocking search engine text extraction on certain extension types: https://answers.laserfiche.com/questions/50692/LFFTS-BlockedExtensions-List-registry-key

replied on March 3, 2015

Thanks. Looks like I can just configure it to run on permitted extensions only, with no extensions listed. I don't fully understand what the TextProvider is doing, but from everything I have read it appears to OCR the documents temporarily for the search index. It then drops all the work it performed rather than saving the associated text, only to OCR the documents again later. OCR requires a lot of CPU cycles and we are looking for ways to free up CPU. I would rather set up a DCC job to run a once-and-done OCR of all documents.

replied on March 3, 2015

There's no OCR'ing going on. All TextProvider does is perform text extraction on electronic documents - say, through iFilters - at index time. It's a way to get documents into the search queue without relying on users to have known to manually text extract them. That said, yes, it drops the extracted text each time, and I would strongly recommend doing the text extraction yourself in the Client so that the search engine can just use the stored text. However, again, there's no OCR going on. This solely has to do with generating text from the text streams of electronic files.

replied on March 3, 2015

How do I find and run permanent text extraction on these documents? I only know how to find non-OCRed images and OCR them.

replied on March 3, 2015

If you run 'generate searchable text' on an electronic file, it will actually use text extraction. It only uses OCR on imaged documents. Strictly speaking, OCR is just one way to generate text on documents - specifically image documents in this case - and that's why the actions in the Client have more generic names. OCR is MUCH more processor intensive and time consuming than efile text extraction, which is why it's the one people think of.

replied on March 4, 2015

Well, the TextProvider is a heavy hitter on CPU cycles. Even if it is not OCR, whatever it is doing takes a lot of operations, very similar to OCR. How can I find all these documents? The search criterion in Laserfiche says "has pages without text", but these files would not have any pages, right?

SELECTED ANSWER

replied on March 5, 2015

The bulk TextProvider process running through the search engine itself can hit the CPU pretty hard, but the individual text extraction operations a user runs in the Client are nothing compared to OCR. Text extraction on any given document is quick and light on the processor compared to OCR. For example, we aren't concerned about restricting it in Web Access, whereas we are with OCR.

To locate the documents, I'd suggest looking for efiles without pages and starting from there. So long as text is generated once for them, TextProvider won't try to process them during index.
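
As a rough illustration of that filter (not Laserfiche's search syntax or SDK), here is a small Python sketch. The `Entry` record and its fields `has_efile`, `page_count`, and `has_text` are hypothetical stand-ins for whatever your search results or export actually provide. It only shows the selection logic: electronic files with no pages that have not yet had text generated.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Entry:
    """Hypothetical, simplified view of a repository entry (not an SDK type)."""
    name: str
    has_efile: bool   # entry carries an electronic file (e.g. .docx, .pdf)
    page_count: int   # number of image pages on the entry
    has_text: bool    # text has already been generated and stored


def needs_text_extraction(entry: Entry) -> bool:
    # Electronic file, no image pages, and no stored text yet.
    return entry.has_efile and entry.page_count == 0 and not entry.has_text


def select_candidates(entries: Iterable[Entry]) -> List[Entry]:
    # Keep only the entries worth running 'generate searchable text' on.
    return [e for e in entries if needs_text_extraction(e)]


# Made-up sample data for illustration only.
sample = [
    Entry("contract.docx", has_efile=True, page_count=0, has_text=False),
    Entry("report.pdf", has_efile=True, page_count=0, has_text=True),
    Entry("scan-001", has_efile=False, page_count=3, has_text=False),
    Entry("memo.docx", has_efile=True, page_count=2, has_text=False),
]

if __name__ == "__main__":
    for entry in select_candidates(sample):
        print(entry.name)
```

Imaged entries such as `scan-001` are deliberately left out here, since those would need OCR rather than text extraction.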

replied on March 5, 2015

Got it. There are over 10 years of documents. That would explain it. Thanks for your help.

replied on March 5, 2015

Sure thing! Is this a recent upgrade scenario from pre-8? That functionality has been there for a while.
