Disabling indexing and removing OCR'd text from the repository

asked on June 10, 2014

Question from one of our clients:

"We’ve had a couple of instances where our Search Catalog becomes corrupt and needs to be rebuilt. One thing I noticed this last time I rebuilt the catalog is that a large percentage of our documents are being indexed, even though we don’t ever need to perform a text search against them. (i.e. Deeds of trust all have nearly the exact same text with signatures)

I want to remove the stored text associated with an entire category of documents. Any idea how to do that?

Alternately, is there a way to disable Full Text indexing across a category of documents so only those we want indexed are included in the catalog?"

Is there a way to disable the indexing of already indexed documents? Also, is there a simple way to do this as a group instead of individually?

replied on June 20, 2014

Hi Jason,

I can't quite tell if your customer is using "OCR" and "index" interchangeably, or if they're drawing a distinction between the two. To make sure we're on the same page here: OCR is the process of identifying and capturing the characters on a document, while indexing is the process of compiling that text from all files into a central, organized location. Assuming we're talking about TIFFs, the easiest way to keep a document's text out of the index is to not OCR it.

If you'd like large sections of your repository to be left un-OCR'd and unindexed, your best bet is probably to turn off OCR'ing by default. OCR can easily be disabled in any Quick Fields or Scanning session, and the automatic OCR'ing of documents dragged into the repository is an attribute that can be disabled and pushed out to users. Of course, I imagine there are still a lot of documents that you would like to OCR and index. To accomplish that, you might want to use our Distributed Computing Cluster (DCC) module, which lets you schedule the OCR of documents from Workflow by using the "Schedule OCR" activity. That way documents won't be OCR'd by default, but any document that needs OCR can trigger this workflow automatically.

As for removing the OCR'd text from a large number of existing documents... That's a pretty atypical request, and I don't think the functionality to accomplish it is built into the client interface. You might need to write some sort of simple SDK script to accomplish that.
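To give a rough idea of what such a script would do, here is a minimal Python sketch of the batch logic only. It does not use the Laserfiche SDK: `Document`, `clear_text_for_category`, and the `category` and `page_text` fields are hypothetical stand-ins invented for illustration, and a real script would swap in the SDK's own calls for finding documents and clearing their text.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Document:
    """Hypothetical stand-in for a repository document (not an SDK type)."""
    doc_id: int
    category: str  # assumed grouping, e.g. a template or folder name
    page_text: List[str] = field(default_factory=list)  # stored text, one entry per page


def clear_text_for_category(docs: Iterable[Document], category: str,
                            dry_run: bool = True) -> List[int]:
    """Remove stored text from every document in `category`.

    Returns the IDs of documents that had (or, in a dry run, would have had)
    text removed. With dry_run=True nothing is modified, so the selection can
    be reviewed before committing to it.
    """
    affected = []
    for doc in docs:
        if doc.category != category:
            continue
        if not any(doc.page_text):
            continue  # no stored text on any page, nothing to remove
        affected.append(doc.doc_id)
        if not dry_run:
            # Keep the page count but drop the text on each page.
            doc.page_text = ["" for _ in doc.page_text]
    return affected


if __name__ == "__main__":
    repo = [
        Document(1, "Deeds of Trust", ["page one text", "page two text"]),
        Document(2, "Contracts", ["text we still want searchable"]),
    ]
    print("Would clear:", clear_text_for_category(repo, "Deeds of Trust"))
    print("Cleared:", clear_text_for_category(repo, "Deeds of Trust", dry_run=False))
```

Running it as a dry run first is deliberate: removing text across an entire category is hard to undo, so it helps to review the list of affected documents before making the change.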

replied on June 20, 2014

And to add to what Brett said, even if a document has not been OCR'd, it should still show Yes in the Indexed column, because it has been processed by the full-text indexing engine. The Indexed column is a bit misleading: it only shows whether a document was processed by the indexing service, and all documents should be processed by the indexing service.

replied on January 20, 2015

Hi Brett,

I just found your post, and I may be coming from left field, but we're struggling with full-text search repeatedly failing, which forces us to delete and recreate the Search Catalog. With over 5.6M documents in our repository and no way to compartmentalize what is indexed (and what's not), the rebuild is an extremely slow process (one we keep having to repeat every month or so).

I appreciate any insight you may have!

replied on January 21, 2015

Hello Sarah. Can you have your reseller open a support case and attach the Application and System logs from Event Viewer, please?

