I was using a search for page count < 1 until I found that OCR does not add pages to every document. A .docx file, for example, does not get pages added just because it has been OCRed. What is the best way to find all documents that have not been OCRed?
Question
Replies
To find documents that have not been OCRd you can run a search like the following:
It appears this excludes documents that have text associated with them, but these documents have not been OCRed. I am looking for a way to find all documents that have not been OCRed, which therefore rank lower in search results and show no search details.
The search I posted, used by itself, searches the entire repository for all documents that don't have text (OCR'd text) associated with them. Having tested it, I see it still does not address your issue of finding electronic files that have not had text extracted from them.
The problem is that when I use your search I get no results, because all the documents have extracted text. If I right click a single document and select OCR, then it gets an improved search ranking and more details in the context hits location. I would like to OCR all documents that have extracted text but no OCR.
That's interesting. I would have thought that extracted text would count as having been OCR'd and that telling it to OCR the document would have no effect other than to re-extract the text. It's not like it's going to Snapshot the document and OCR the images.
It makes a big difference. First, we found that documents with only extracted text show no context details and no location data in search results. Then we found that a document's search ranking goes up once it is OCRed, so it moves up in the search results.
You cannot OCR an electronic file. You can only extract its text. If you import an electronic file and tell it not to generate searchable text, a search gives the results Chad is referring to and shows no pages for the electronic file.
If you tell it on import or after it's in the repository to "OCR/Extract Text" the electronic file, it actually creates pages for the file (see the IIS Reset file below).
The pages don't show up as an image when viewing the file in the document viewer, but you will see a thumbnail. They don't show in the Text pane in the document viewer either. If you perform a search on these documents, you get the more detailed results Chad was referring to.
In the end, Chad has a great question: how do you find all electronic files that have not had text generated for them? What the Generate Text button produces for an electronic document doesn't show up in the Text pane, and because the document doesn't have pages, you cannot perform a "has pages with no text" search.
Could try this Advanced Search:
{LF:OCR=None}
It doesn't seem to work in the latest version. I am not sure what the search results represent; I only get results if I search the entire repository.
For example there are some PDF files in here that have password protection, invalid pointers, and general database errors (does a PDF even have a database?). They can't be OCRed. I should be able to find them.
When I search for .pdf extension and include that advanced search I get no results. So it can't be returning everything.
Brian,
{[]:[S Film ReOCRed]="No"} & ({LF:OCR=None})
I don't know if you found the solution or not, but I thought that I would post this as I had the same question.
We had to re-OCR the entire repository because the OCR settings had been set incorrectly, but we wanted to start with documents that were never OCR'ed. The above worked for us. We wanted to filter on a template field, S Film ReOCRed, as well. We verified the results, and the search returned the correct record set.
Will do some further checking and add more details if necessary.
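For anyone who wants to reuse the combined search with a different template field, here is a minimal sketch of how the string can be assembled. It only builds the text to paste into the Advanced Search box and does not call any Laserfiche API. The function names are made up for illustration, and the clause syntax is copied from the search posted above, so whether it behaves the same in your version is an assumption you should verify.

```python
# Sketch only: composes the advanced search text shown earlier in this thread.
# Helper names are hypothetical; the clause syntax is taken from the post above.

NOT_OCRED = "{LF:OCR=None}"  # clause quoted in this thread for "no OCR"


def field_clause(field_name: str, value: str) -> str:
    """Build a template-field clause such as {[]:[S Film ReOCRed]="No"}."""
    return f'{{[]:[{field_name}]="{value}"}}'


def never_ocred_search(field_name: str, value: str) -> str:
    """Combine the field clause and the OCR clause with '&', as in the post."""
    return f"{field_clause(field_name, value)} & ({NOT_OCRED})"


if __name__ == "__main__":
    # Prints the text to paste into the Advanced Search box.
    print(never_ocred_search("S Film ReOCRed", "No"))
```

Swap in your own field name and value; if you don't track re-OCR status in a field, the OCR clause can be used on its own as posted earlier.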