Hi guys,
I am wrangling a number of issues to do with the searchability, and in fact the accountability of the searchability, of a repository I manage for a client.
To explain: they have a mix of purely electronic documents (principally MS Office and PDF), self-scanned paper documents (some OCR'ed on import and some not, due to speed and other constraints), and a heap of externally OCR'ed PDFs (a bureau did the scanning and OCR, and ImportAgent added them).
They need to know, for audit and compliance purposes, that all the contents of their document management system (some of which are stored as records) are deep-text searchable. I have built a workflow that generates a report from search syntax covering "All documents within a container (or whole Repo)" and "Documents with ALL/SOME/NO pages OCR'ed", and emails it to an admin person.
The ultimate goal is for each and every document in the Repo to carry a tag and some metadata stating that it has been fully OCR'ed (i.e. pages with text = ALL) and on what date this was determined. The metadata date field would be set on detecting a change in OCR state from SOME/NO to ALL, possibly by tagging documents on prior iterations of the search process that selects them for OCR processing on discovery/import, dating them in metadata as NO/SOME, and adding or removing tags accordingly.
This would allow a date history to accumulate in metadata and tags as documents proceeded from NO through SOME (in cases where there were errors, or say the DCC baulked at a job or timed out on a massive number of pages) to ALL.
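To make the idea of a dated state history concrete, here is a minimal sketch in Python. It is not Laserfiche code: the `OcrAudit` record, the `record_state` helper and the dates are all hypothetical, and in practice the state would come from the OCR searches and be written back as tags and metadata by the workflow.

```python
from dataclasses import dataclass, field
from datetime import date

VALID_STATES = ("NO", "SOME", "ALL")

@dataclass
class OcrAudit:
    """Hypothetical per-document audit record; not a Laserfiche type."""
    state: str = "NO"                            # current OCR state
    history: list = field(default_factory=list)  # (date, old_state, new_state)

def record_state(audit: OcrAudit, new_state: str, when: date) -> bool:
    """Append a dated entry only when the OCR state actually changes."""
    if new_state not in VALID_STATES:
        raise ValueError(f"unknown OCR state: {new_state}")
    if new_state == audit.state:
        return False  # nothing changed on this pass, so no new date is recorded
    audit.history.append((when, audit.state, new_state))
    audit.state = new_state
    return True

# Illustrative dates only: three passes of the report over one document.
doc = OcrAudit()
record_state(doc, "SOME", date(2024, 1, 5))  # OCR job timed out part-way
record_state(doc, "SOME", date(2024, 1, 6))  # unchanged, history untouched
record_state(doc, "ALL", date(2024, 1, 9))   # rerun completed
for when, old, new in doc.history:
    print(f"{when.isoformat()}: {old} -> {new}")
```

Recording only the transitions keeps the history short, and the date of the final move to ALL is the "date determined" an auditor would ask for.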
An auditing company could do searches (or run a business process prepared for the purpose) on demand to determine at any time how much of the Repository was available to deep-text search.
One of my many problems in solving this is that when I search the three OCR= cases (ALL/SOME/NO) I get, say, 18,304 documents for ALL, 21 for SOME and 181 for NO, but a more thorough search gives the true number of documents in the repository:
37,586 documents (confirmed by the volumes).
So only about half of the documents are accounted for by the three OCR states, which means OCR state is not a conclusive indicator of the text searchability of the database.
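For reference, the arithmetic behind "about half" can be checked with a few lines of Python. The figures are the ones quoted above; the variable names are just for illustration.

```python
# Counts returned by the three OCR-state searches (figures from this post).
ocr_counts = {"ALL": 18_304, "SOME": 21, "NO": 181}
# True repository total, confirmed by the volumes.
repository_total = 37_586

accounted = sum(ocr_counts.values())
unaccounted = repository_total - accounted
coverage = accounted / repository_total

print(f"Matched by an OCR state: {accounted}")
print(f"Matched by no OCR state: {unaccounted}")
print(f"Share accounted for:     {coverage:.1%}")
```

So 18,506 documents fall into one of the three OCR states and 19,080 fall into none of them, which is the gap that needs explaining.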
When I add electronic components to the file set, the numbers detected go down (e.g. the 18K figure drops to 12.5K), so the fact that the system holds electronic documents whose contents can be indexed is evidently not directly related to OCR status. I suspect I will have to do some kind of comparison between pages, OCR status and indexed status?
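A rough sketch of what such a comparison might look like, in Python. This rests on assumptions rather than on anything Laserfiche is known to expose: it supposes that for each document you can obtain an OCR state (or nothing, if no OCR search matched it) and a yes/no flag for whether indexed text exists. The function, the bucket names and the sample records are all hypothetical.

```python
from collections import Counter
from typing import Optional

def searchability(ocr_state: Optional[str], has_indexed_text: bool) -> str:
    """Bucket one document; both inputs are hypothetical audit fields."""
    if ocr_state == "ALL":
        return "searchable (OCR)"
    if ocr_state is None and has_indexed_text:
        # Assumed case: an electronic file whose text was indexed without OCR.
        return "searchable (electronic text)"
    if ocr_state == "SOME":
        return "partial"
    return "not searchable"

# Illustrative records only: (name, OCR state or None, has indexed text).
sample = [
    ("scan-a", "ALL", True),
    ("scan-b", "SOME", True),
    ("scan-c", "NO", False),
    ("report.docx", None, True),
    ("import-without-pages", None, False),
]

tally = Counter(searchability(state, text) for _, state, text in sample)
for bucket, count in sorted(tally.items()):
    print(f"{bucket}: {count}")
```

The point of the separate "electronic text" bucket is that documents matched by no OCR state are not automatically unsearchable, so the audit report would need both signals to give a defensible total.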
This is the search that looks at "everything" and brings back the true total contents:
{LF:Name="*", Type="DB"} & {LF:LOOKIN="CLIENT_Repos\"}
and the kind of search I am using to pull in OCR status:
(({LF:AssociatedPages="Y"} & {LF:OCR=none}) & {LF:LOOKIN="CLIENT_Repos\"}) - {LF:LOOKIN="CLIENT_Repos\zzDeveloper"}
Note: I've also tried this without the AssociatedPages criterion, because I know some imports are not pulling in or generating pages.
Just a bit stumped as to how to make this all auditable and fit their concerns about compliance.
They have run searches and found that documents they know are in the system and that contain, say, an invoice number don't return a result until they have manually found the document, generated pages and OCR'ed it.