Yet another OCR demystification question

asked on August 31, 2015

Hi guys,

I am wrangling a number of issues around the searchability, and in fact the accountability of the searchability, of the repository I maintain for a client.

To explain: they have a mix of pure electronic documents (principally MS Office and PDF), self-scanned paper documents, some of which were OCR'd on import and some not (due to speed and other constraints), and a heap of externally OCR'd PDFs (a bureau did the scanning and OCR, and ImportAgent added them).

For audit and compliance purposes, they need to know that all the contents of their document management system (some of which are stored as records) are deep-text searchable. I have built a workflow that generates a report based on search syntax that pulls in "All documents within a container (or whole Repo)" and "Documents with ALL/SOME/NO pages OCR'ed", and emails it to an admin person.

The ultimate goal is to have every document in the Repo carry a tag and some metadata stating that it has been fully OCR'ed (i.e. pages with text = ALL) and on what date this was determined. The metadata date field would be set by looking for a change in OCR status from SOME/NO to ALL, possibly by tagging documents on prior iterations of the search process that selects them for OCR processing on discovery/import, dating them in metadata as NO/SOME, and adding or removing tags accordingly.

This would allow a date history to accumulate in metadata and tags as documents proceed through NO to SOME (in cases where there were errors or, say, the DCC baulked at a job or timed out on a massive number of pages).
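The date-history idea can be sketched outside the repository as a small Python example. This is only an illustration of the logic: `record_status` and `fully_ocred_on` are hypothetical helpers, the dates are made up, and nothing here calls Laserfiche or Workflow itself.

```python
from datetime import date

def record_status(history, status, seen_on):
    """Append (status, date) only when the status differs from the last entry."""
    if not history or history[-1][0] != status:
        history.append((status, seen_on))
    return history

def fully_ocred_on(history):
    """Return the date the current ALL status was first seen, else None."""
    if history and history[-1][0] == "ALL":
        return history[-1][1]
    return None

# Illustrative run for one document (dates are invented for the example).
history = []
record_status(history, "NO", date(2015, 8, 1))
record_status(history, "SOME", date(2015, 8, 10))
record_status(history, "SOME", date(2015, 8, 17))  # unchanged, so not recorded again
record_status(history, "ALL", date(2015, 8, 31))

print(history)
print(fully_ocred_on(history))
```

Recording only the transitions keeps the history short while still showing when each document moved between states.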

An auditing company could run searches (or a business process prepared for the purpose) on demand to determine at any time how much of the Repository was available to deep-text search.

One of my many problems in solving this is that when I search for the three OCR cases (ALL/SOME/NO) I get, say, 18,304 documents for ALL, 21 for SOME and 181 for NO, but a more thorough search returns the true number of documents in the repository:

37,586 documents. (confirmed by the volumes)

So only about half of the documents are accounted for by the three OCR states, which means OCR state is not a conclusive indicator of the text searchability of the database.
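To make the gap concrete, here is the arithmetic on the figures quoted above as a short Python sketch. The `reconcile` helper is a made-up name for illustration; the counts are the ones from this post.

```python
# Counts reported for the three OCR states and for the whole repository.
ocr_counts = {"ALL": 18304, "SOME": 21, "NO": 181}
total_documents = 37586

def reconcile(counts, total):
    """Return (accounted, unaccounted, fraction_accounted)."""
    accounted = sum(counts.values())
    unaccounted = total - accounted
    return accounted, unaccounted, accounted / total

accounted, unaccounted, fraction = reconcile(ocr_counts, total_documents)
print(f"returned by an OCR-state search: {accounted}")    # 18506
print(f"not returned by any OCR search:  {unaccounted}")  # 19080
print(f"share accounted for:             {fraction:.1%}") # 49.2%
```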

When I add electronic components to the file set, the numbers detected go down (e.g. the 18K figure drops to 12.5K), so I can see that whether an electronic document has contents that can be indexed is not directly related to its OCR status. I suspect I will have to do some kind of comparison between pages, OCR status and indexed status?

This is the search that looks at "everything" and brings back the true total contents:

{LF:Name="*", Type="DB"} & {LF:LOOKIN="CLIENT_Repos\"}

and the kind of search I am using to pull in OCR status:

(({LF:AssociatedPages="Y"} & {LF:OCR=none}) & {LF:LOOKIN="CLIENT_Repos\"}) - {LF:LOOKIN="CLIENT_Repos\zzDeveloper"}

Note: I've also tried this without the AssociatedPages criterion, because I know some imports are not pulling in and generating pages.

Just a bit stumped as to how to make this all auditable and fit their concerns about compliance.

They have run searches and found that documents they know are in the system and that contain, say, an invoice number don't return a result until they have manually found the document, generated pages and OCR'ed it.

replied on September 30, 2015

It sounds like the main issue is the missing documents from your OCR search. Searching for documents that contain text on all/some/no pages ({LF:OCR=all/some/none}) only searches through documents that have existing pages. Pages contain TIFF images and/or text of the document, and not all electronic documents necessarily have pages.

You will likely find the missing documents if you search for “{LF:pagecount = 0}” using the advanced search syntax. This returns all documents that do not have any pages and are thus not searchable. The first step in accomplishing your goal of searchability and accountability is to make sure all documents have pages generated, and have been OCRed or have had their text extracted. Hopefully this helps point you in the right direction.
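As a rough illustration of why the zero-page case closes the gap, here is a minimal Python sketch that buckets documents by page count and pages with text. It assumes you have exported those two numbers per document by some means; the `classify` function and the sample rows are hypothetical and do not use any Laserfiche API.

```python
from collections import Counter

def classify(page_count, pages_with_text):
    """Bucket a document by how many of its pages carry text."""
    if page_count == 0:
        return "NO_PAGES"  # not returned by any of the OCR=all/some/none searches
    if pages_with_text == 0:
        return "NONE"
    if pages_with_text < page_count:
        return "SOME"
    return "ALL"

# Made-up sample rows: (document name, page count, pages with text).
sample = [
    ("invoice-a.pdf", 3, 3),
    ("scan-b.tif", 10, 4),
    ("scan-c.tif", 2, 0),
    ("report-d.docx", 0, 0),
]

buckets = Counter(classify(pages, with_text) for _, pages, with_text in sample)
print(dict(buckets))
```

With the fourth bucket included, every document lands in exactly one category, so the four counts should add up to the repository total.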
