Question

How to search for documents that are not OCRed

asked on May 15, 2014

I was using a search for page count < 1 until I found that not all documents get pages added simply by being OCRed. A .docx file, for example, does not get pages added just because it has been OCRed. What is the best way to find all documents that have not been OCRed?

0 0

Replies

replied on May 15, 2014

To find documents that have not been OCRed, you can run a search like the following:

2 0
replied on May 15, 2014

It appears this excludes documents that have text associated with them, even though these documents have not been OCRed. I am looking for a way to find all documents that have not been OCRed and that therefore rank lower in search results and show no search details.

0 0
replied on May 15, 2014

The search I posted, if used by itself, will search the entire repository for all documents that don't have text (OCRed text) associated with them. After running a test, I see that it still does not address your issue of finding electronic files that have not had text extracted from them.

0 0
replied on May 15, 2014

The problem is that when I use your search I get no results, because all the documents have extracted text. If I right-click a single document and select OCR, it gets an improved search ranking and more details in the context hits location. I would like to OCR all documents that have extracted text but no OCR.

0 0
replied on May 15, 2014

That's interesting. I would have thought that extracted text would count as having been OCR'd and that telling it to OCR the document would have no effect other than to re-extract the text. It's not like it's going to Snapshot the document and OCR the images.

0 0
replied on May 15, 2014

It makes a big difference. First, we found that documents with only extracted text show no context details and no location data. Then we found that a document moves up in the search results once it has been OCRed.

0 0
replied on May 16, 2014

You cannot OCR an electronic file; you can only extract its text. When you import an electronic file and tell it not to generate searchable text, a search gives what Chad is referring to and shows no pages for the electronic file.

If you tell it to "OCR/Extract Text" the electronic file, either on import or once it's in the repository, it actually creates pages for the file (see the IIS Reset file below).

The pages don't show up as an image when you view the file in the document viewer, but you will see a thumbnail. Nor do they show in the Text pane in the document viewer. If you perform a search on these documents, you get the more detailed results that Chad was referring to as well.

In the end, Chad has a great question: how do you find all electronic files that have not had text generated for them? What the Generate Text button produces for an electronic document doesn't show up in the Text pane, and since the document doesn't have pages, you cannot perform a "has pages with no text" search.

3 0
replied on May 16, 2014

Could try this Advanced Search:

{LF:OCR=None} 

0 0
replied on May 16, 2014

It doesn't seem to work in the latest version. I am not sure what the search results represent; I only get results if I search the entire repository.

For example, there are some PDF files in here that have password protection, invalid pointers, and general database errors (does a PDF even have a database?). They can't be OCRed. I should be able to find them.

When I search for the .pdf extension and include that advanced search, I get no results. So it can't be returning everything.

0 0
replied on May 14, 2016

Brian,

{[]:[S Film ReOCRed]="No"} & ({LF:OCR=None})

I don't know if you found the solution or not, but I thought that I would post this as I had the same question.

We had to re-OCR the entire repository because the OCR settings had been set incorrectly, but we wanted to start with documents that were never OCRed. The above worked for us. We also wanted to filter on a template field, S Film ReOCRed. We verified the results, and the search returned the correct record set.
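
For anyone who needs the same search with a different template field, here is a minimal Python sketch that assembles the search string posted above. The function name and its parameters are hypothetical, and it only formats text; it does not connect to a repository. The syntax is copied from the search in this post and has not been verified against other Laserfiche versions.

```python
def build_not_ocred_search(field_name: str, field_value: str) -> str:
    """Compose the advanced-search string quoted in this post.

    Hypothetical helper: it only builds text and never contacts a repository.
    """
    # Field criterion, in the form used in the post: {[]:[Field Name]="Value"}
    field_clause = '{[]:[' + field_name + ']="' + field_value + '"}'
    # OCR criterion, copied verbatim from the thread.
    ocr_clause = "({LF:OCR=None})"
    # The post joins the two criteria with "&".
    return field_clause + " & " + ocr_clause


search = build_not_ocred_search("S Film ReOCRed", "No")
print(search)
```

Called with the field and value from this post, it reproduces the search string shown above, which can then be pasted into the Advanced Search box.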

Will do some further checking and add more details if necessary.

0 0