Hi 'fichers,
So, I'm a fair way down the track with the DCC now and have two worker bees hammering away at tasks of various sorts to work out what kind of DCC solution is going to work for my client.
The problem is the trouble I have had overcoming some of the resulting "Partial OCR" document results. There is too much to go into here, but the issues range from iFilter 9 vs 11, to some kind of missing-page HTTP error for workers that isn't a problem when the job is manually OCRed on the client, to hanging OCR, to real difficulty seeing where a task (or tasks) assigned to one of the workers is at. I should note that I am not yet dabbling in the PowerShell scripts - that will come soon enough though.
In the meantime I have a client with the requirement to:
1) get through the current backlog of documents not OCRed so that they can draw a line in the sand and begin with a new workflow that will:
2) make reports or records that can be audited as the OCR process proceeds.
The thing is that the documents include a great many types but a typical important document is an Invoice that simply must be "deep text searchable" across the repository.
When documents have failed and are only Partial there is the very real chance that invoice numbers they are looking for will be lost and from an auditing perspective this has been flagged as a risk that needs to be appropriately mitigated.
So, as one of my approaches I have put together a simple search and report by email of the current status at any time.
It sends an email that gives the total number of documents overall, plus the number of documents with no pages, some pages and all pages containing text - these three counts can easily be added up to check against the total.
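To make the arithmetic in the report concrete, here is a rough sketch in Python, purely for illustration. It is not Laserfiche code: the function name `summarize` and the input shape (entry ID mapped to pages-with-text and total pages) are my own assumptions about what the search would return.

```python
def summarize(page_status):
    """Bucket documents by OCR coverage.

    page_status is a hypothetical mapping of
    entry ID -> (pages_with_text, total_pages).
    """
    counts = {"none": 0, "some": 0, "all": 0}
    for with_text, total in page_status.values():
        if with_text == 0:
            counts["none"] += 1      # un-OCRed document
        elif with_text < total:
            counts["some"] += 1      # partial OCR
        else:
            counts["all"] += 1       # fully OCRed
    counts["total"] = len(page_status)
    # Sanity check: the three buckets must add up to the overall total.
    assert counts["none"] + counts["some"] + counts["all"] == counts["total"]
    return counts
```

If the three buckets ever fail to add up to the total, the report itself is missing documents, which is worth knowing before trusting it for audit purposes.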
Clearly we are looking for a way of going from a report that, on any one day, shows a number of UN-OCRed documents (freshly submitted) and those that (for whatever reason) are only partially completed, to one in which every document is fully OCRed.
The email report is date/time stamped, and I want to be able to run it before and after the workflow that tackles the daily or weekly OCR job (however that is done). The critical information it needs to provide is a list of the Entry IDs that need attention, as I am finding that repeatedly sending the same "Partials" to the processor is not necessarily working them out.
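The before/after comparison could look something like the sketch below (again Python for illustration only, not Laserfiche code). The function name `needs_attention` and the status strings "none", "some" and "all" are hypothetical; each argument stands for one run of the report as a mapping of Entry ID to status.

```python
def needs_attention(before, after):
    """Compare two report snapshots (entry ID -> "none" | "some" | "all")."""
    # Everything that is still not fully OCRed after the job has run.
    incomplete = sorted(eid for eid, status in after.items() if status != "all")
    # Entries whose status did not change at all: resubmitting these
    # again is unlikely to help, so they need a closer look.
    stuck = [eid for eid in incomplete if before.get(eid) == after[eid]]
    return {"incomplete": incomplete, "stuck": stuck}
```

The "stuck" list is the one I care about most, since those are the Partials that keep coming back unchanged.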
Hence my thought to start a process whereby I flag and tag the documents to begin a methodology whereby the OCR status can be rigorously and robustly assessed and addressed.
So one of my colleagues sent me a manual OCR Script that can be called via VB out of a workflow.
This will have the distinct advantage of being able to fit into workflow and be called while the parent remains in a wait state for the results to come back, thus retaining a degree of control lost with DCC.
I propose something like this:
1) Run the report on all the counts (Full, None, Some, TOTAL)
2) Search None (UN-OCRed - cleanskins as I'm calling them)
3) Assign a Red-Flag Tag to each and every one (This will be the signifier that the OCR processor has identified the cleanskin and commenced work on it. I would like to add metadata of date/time and current "Cleanskin" status somehow but I've not got to that in workflow yet)
4) Search for all other Red-Flag Tagged Documents in the repository (these will be any that were missed somehow, any tagged Red for reprocessing (might be useful), and Partials, because the Red-Flag is not removed until the OCR sub-process returns successfully).
5) Call the OCR Sub-Workflow with a recursive "for each entry" and simply OCR every job, returning the result to the workflow to say Full or error or whatever. On Full received back, the Workflow will write the new date/time to metadata, set the status to OCRed, remove the Red-Flag and set a Green-Flag tag.
6) Some kind of final process could be introduced to then look at all un-flagged documents and recurse through them "checking" their OCR status and applying the Red or Green Flag Tags until the whole repository (up until the last running of the workflow) is Green.
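The steps above can be sketched as a small state machine. This is a Python model of the flagging logic only, not Workflow or Laserfiche code: the `run_cycle` name, the dictionary layout and the `ocr` callback (standing in for the OCR sub-workflow) are all assumptions made for illustration.

```python
from datetime import datetime


def run_cycle(entries, ocr, now=datetime.now):
    """One pass of the proposed flag-and-OCR cycle.

    entries: entry ID -> {"status": "none"|"some"|"all",
                          "flag": None|"red"|"green", "ocr_time": None}
    ocr:     hypothetical callable(entry_id) returning the new status;
             it may raise if the OCR job fails.
    Returns the sorted entry IDs that are still red afterwards.
    """
    # Steps 2-3: red-flag every cleanskin (no pages with text).
    for entry in entries.values():
        if entry["status"] == "none":
            entry["flag"] = "red"

    # Steps 4-5: OCR everything that is red, old or new.
    for entry_id, entry in entries.items():
        if entry["flag"] != "red":
            continue
        try:
            entry["status"] = ocr(entry_id)
        except Exception:
            continue  # leave the red flag in place for the next run
        if entry["status"] == "all":
            entry["flag"] = "green"
            entry["ocr_time"] = now()

    # Step 6: sweep un-flagged documents and flag them by current status.
    for entry in entries.values():
        if entry["flag"] is None:
            entry["flag"] = "green" if entry["status"] == "all" else "red"

    return sorted(eid for eid, e in entries.items() if e["flag"] == "red")
```

The point of the design is that red is only ever cleared by a successful "all" result, so the returned list is exactly the set of Entry IDs that still needs attention after each run.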
Now, as far as I have got, I have made a simple workflow to call the OCR script, but it is failing. I think it might be down to the way authentication happens in the version of Rio I am running (9.2.1) compared with the version the script came from.
Can anyone have a stab at what is going on here for me and suggest how I could improve my approach, fix the code or start again with something else?
Here is the script and attached are the pics that may help start the troubleshooting if anyone has bothered to read down this far into my ramblings :)
Imports System
Imports System.Collections.Generic
Imports System.ComponentModel
Imports System.Data
Imports System.Data.SqlClient
Imports System.Text
Imports LFSO91Lib

Namespace WorkflowActivity.Scripting.ManualOCRnoDCC
    '''<summary>
    '''Provides one or more methods that can be run when the workflow scripting activity is performed.
    '''</summary>
    Public Class Script1
        Inherits SDKScriptClass90

        '''<summary>
        '''This method is run when the activity is performed.
        '''</summary>
        Protected Overrides Sub Execute()
            'Write your code here.
            Dim wrapper As ConnectionWrapperRA = Me.ConnectToRA(ConnectionMethodRA.RepositoryAccess90)
            Dim docinfo As Laserfiche.RepositoryAccess.DocumentInfo = wrapper.BoundEntry
            'dim docinfo as docinfo = me.BoundEntryInfo
            'dim docinfo as Laserfiche.RepositoryAccess.DocumentInfo = wrapper.boundentry
            Dim myOCR As Laserfiche.DocumentServices.OcrEngine = Laserfiche.DocumentServices.OcrEngine.LoadEngine()
            myOCR.AutoOrient = True
            myOCR.Decolumnize = False
            myOCR.OptimizationMode = 2
            myOCR.Run(docinfo)
        End Sub
    End Class
End Namespace
I should say that I think I managed to nail all the assemblies - at least the script editor isn't complaining, and yet maybe the SDK runtime I have is newer etc.
So I have Rio 9.2.1, Windows 2012R2 64Bit
SDK is installed
Still new to all this guys so please be gentle and thanks in advance to anyone who may be able to suggest my best approach.
Will