Question

Some OCR Script assistance please: DCC is proving unwieldy and I have to be able to execute workflow and log/audit progress

Workflow

Updated July 27, 2015

asked on July 25, 2015

Hi 'fichers,

So, I'm a fair way down the track with the DCC now and have two worker bees hammering away at tasks of various sorts to work out what kind of DCC solution is going to work for my client.

The problem is the issues I have had in overcoming some of the resulting "Partial OCR" document results. There is too much to go into here, but they range from iFilter 9 vs 11, to some kind of missing-page HTTP error for workers that isn't a problem when the job is manually OCRed on the client, to hanging OCR, to real difficulty seeing where a task or tasks assigned to one of the workers is at. I should note that I am not yet dabbling in the PowerShell scripts; that will come soon enough though.

In the meantime I have a client with the requirement to:

1) get through the current backlog of documents not OCRed so that they can draw a line in the sand and begin with a new workflow that will:

2) make reports or records that can be audited as the OCR process proceeds.

The thing is that the documents include a great many types but a typical important document is an Invoice that simply must be "deep text searchable" across the repository.

When documents have failed and are only Partial there is the very real chance that invoice numbers they are looking for will be lost and from an auditing perspective this has been flagged as a risk that needs to be appropriately mitigated.

So, as one of my approaches I have put together a simple search and report by email of the current status at any time.

It sends an email that gives the total number of overall documents, the total number of none, some and all pages with text - and this can easily be added up to check the total etc.

Clearly we are looking for a way of going from a report that, on any one day, shows a number of UN-OCRed documents (freshly submitted) and those that (for whatever reason) are only partially completed.

The email report is date/time stamped and I want to be able to run it before and after the workflow that tackles the daily or weekly OCR job (however that is done). The critical information it needs to provide is a list of Entry IDs that need attention, as I am finding that repeatedly sending the same "Partials" to the processor is not necessarily working them out.
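The before-and-after comparison described here could be sketched roughly as follows. This is a plain Python illustration rather than Laserfiche code: the snapshot dictionaries (Entry ID mapped to a status of "none", "some" or "all") are assumed to come from whatever search feeds the email report, and the names `summarize` and `still_incomplete` are hypothetical.

```python
from collections import Counter

# Assumed status labels mirroring the report: no pages, some pages,
# or all pages of a document have text.
STATUSES = ("none", "some", "all")


def summarize(snapshot):
    """Count documents per OCR status and confirm the counts add up."""
    counts = Counter(snapshot.values())
    summary = {status: counts.get(status, 0) for status in STATUSES}
    summary["total"] = len(snapshot)
    # The three buckets should account for every document in the snapshot.
    if sum(summary[s] for s in STATUSES) != summary["total"]:
        raise ValueError("snapshot contains an unknown OCR status")
    return summary


def still_incomplete(before, after):
    """Entry IDs that were not fully OCRed before the run and still are not."""
    return sorted(
        entry_id
        for entry_id, status in before.items()
        if status != "all" and entry_id in after and after[entry_id] != "all"
    )


before = {101: "none", 102: "some", 103: "all"}
after = {101: "all", 102: "some", 103: "all", 104: "none"}
print(summarize(after))
print(still_incomplete(before, after))
```

Documents that appear only in the later snapshot (freshly submitted ones) are counted in the summary but left out of the "still incomplete" list, since they have not yet been through an OCR run.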

Hence my thought to start a process whereby I flag and tag the documents to begin a methodology whereby the OCR status can be rigorously and robustly assessed and addressed.

So one of my colleagues sent me a manual OCR Script that can be called via VB out of a workflow.

This will have the distinct advantage of being able to fit into workflow and be called while the parent remains in a wait state for the results to come back, thus retaining a degree of control lost with DCC.

I propose something like this:

1) Run the report on all the counts (Full, None, Some, TOTAL)

2) Search None (UN-OCRed - cleanskins as I'm calling them)

3) Assign a Red-Flag Tag to each and every one (This will be the signifier that the OCR processor has identified the cleanskin and commenced work on it. I would like to add metadata of date/time and current "Cleanskin" status somehow but I've not got to that in workflow yet)

4) Search for all other Red-Flag Tagged Documents in the repository. These will be any missed somehow or tagged Red for reprocessing (might be useful), plus Partials, because the Red-Flag is not removed until the OCR sub-process returns successfully.

5) Call the OCR Sub-Workflow with a recursive "for each entry" and simply OCR every job, returning the result to the workflow to say Full or error or whatever. On Full received back, the Workflow will write the new date/time to metadata, set the status to OCRed, remove the Red-Flag and set a Green-Flag tag.

6) Some kind of final process could be introduced to then look at all un-flagged documents and recurse through them "checking" their OCR status and applying the Red or Green Flag Tags until the whole repository (up until the last running of the workflow) is Green.
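The flag handling in steps 2 to 6 can be modelled as a small state machine. The sketch below is plain Python, not Workflow or SDK code: the `Doc` class, the `run_pass` function and the `ocr` callable are all hypothetical stand-ins, and the status strings are assumptions matching the None/Some/Full counts in the report.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

RED, GREEN = "red", "green"


@dataclass
class Doc:
    entry_id: int
    ocr_status: str             # "none", "some" or "all"
    flag: Optional[str] = None  # None, RED or GREEN


def run_pass(docs: Iterable[Doc], ocr: Callable[[Doc], str]) -> dict:
    """Run one pass of the proposed tagging workflow over the documents."""
    docs = list(docs)
    # Steps 2-3: red-flag every document that has no text at all.
    for doc in docs:
        if doc.ocr_status == "none":
            doc.flag = RED
    # Steps 4-5: OCR everything carrying a red flag; only a full result
    # swaps the red flag for a green one.
    for doc in docs:
        if doc.flag == RED:
            try:
                doc.ocr_status = ocr(doc)
            except Exception:
                continue  # OCR failed outright, so the red flag stays
            if doc.ocr_status == "all":
                doc.flag = GREEN
    # Step 6: classify anything still unflagged by its current status.
    for doc in docs:
        if doc.flag is None:
            doc.flag = GREEN if doc.ocr_status == "all" else RED
    return {
        RED: [d.entry_id for d in docs if d.flag == RED],
        GREEN: [d.entry_id for d in docs if d.flag == GREEN],
    }
```

In this model, an unflagged Partial is only tagged red in step 6, so it is picked up for OCR on the following pass rather than the current one. The red list returned by each pass doubles as the list of Entry IDs needing attention.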

Now, as far as I have got, I made a simple workflow to call the OCR script, but it is failing. I think it might be down to the way authentication happens in the version of Rio I am running (9.2.1) compared with where the script came from.

Can anyone have a stab at what is going on here for me and suggest how I could improve my approach, fix the code or start again with something else?

Here is the script and attached are the pics that may help start the troubleshooting if anyone has bothered to read down this far into my ramblings :)

Imports System
Imports System.Collections.Generic
Imports System.ComponentModel
Imports System.Data
Imports System.Data.SqlClient
Imports System.Text
Imports LFSO91Lib



Namespace WorkflowActivity.Scripting.ManualOCRnoDCC
    '''<summary>
    '''Provides one or more methods that can be run when the workflow scripting activity is performed.
    '''</summary>
    Public Class Script1
        Inherits SDKScriptClass90
        '''<summary>
        '''This method is run when the activity is performed.
        '''</summary>
        Protected Overrides Sub Execute()
            'Write your code here.


            Dim wrapper as ConnectionWrapperRA = Me.ConnectToRA(ConnectionMethodRA.RepositoryAccess90)
            dim docinfo as Laserfiche.RepositoryAccess.DocumentInfo = wrapper.BoundEntry
            'dim docinfo as docinfo = me.BoundEntryInfo
            'dim docinfo as Laserfiche.RepositoryAccess.DocumentInfo = wrapper.boundentry




            Dim myOCR as Laserfiche.DocumentServices.OcrEngine = Laserfiche.DocumentServices.OcrEngine.LoadEngine()
                MyOCR.AutoOrient = True
                MyOCR.Decolumnize = False
                MyOCR.OptimizationMode = 2
                myOCR.Run(docinfo)



        End Sub
    End Class
End Namespace

I should say that I think I managed to nail all the assemblies - at least the script editor isn't complaining, and yet maybe the SDK runtime I have is newer etc.

So I have Rio 9.2.1, Windows 2012R2 64Bit

SDK is installed

Still new to all this guys so please be gentle and thanks in advance to anyone who may be able to suggest my best approach.

Will

My OCR_with_Fail.PNG (56.49 KB)

My_OCR_report.PNG (174.49 KB)

My_OCR-the-errors.PNG (315.92 KB)

Answer

SELECTED ANSWER

replied on July 25, 2015

Will - You don't need the SDK kit to use the SDK script activity in Workflow (confusing!). The SDK kit is required to develop stand-alone Laserfiche applications and integrations. The SDK Script activity on the other hand is available as part of the Workflow installation. Just about anything that you can do with the SDK kit you can do in an SDK Script activity (although it would be counter-productive to use Workflow to do things that _should_ be separate Laserfiche integrations).

The major issue that SDK Script users face is that the documentation for the Laserfiche objects exposed by the SDK Script activity is provided with the SDK kit and is not generally available to non-SDK kit owners.

Back to the issues you are running into;

File browsing is turned off by default on the Workflow server. To enable file browsing start the Workflow Admin console, open the 'Security' node, and double-click on the 'File Browser Options';

Make sure the 'Global Assembly Cache' (GAC) is checked as we might need to actually browse the GAC to find the DocumentServices assembly.

When you try to add the DocumentServices assembly from the SDK script activity, click on the 'Name' column to make sure the assembly names are sorted in alpha ascending order, then look for the 'Laserfiche.DocumentServices' assembly.

If you do have to 'browse' to find the DocumentServices assembly then select the GAC (Global Assembly Cache) option.

Since you have a 9.2 Rio system you will find the 9.2 version of the DocumentServices assembly as it is part of the Laserfiche Rio server installation. (Again, make sure to click on the Name column header so the entries are sorted in alpha ascending order.)


Replies

replied on July 25, 2015

Will - As far as the script is concerned; yes, the problem you are running into is that you are trying to run an older script on your system. Here is a 9.2 SDK script that works on my VAR kit Avante system.

Imports System.ComponentModel
Imports System.Data
Imports System.Data.SqlClient
Imports System.Text
Imports Laserfiche.RepositoryAccess

Namespace WorkflowActivity.Scripting.SDKScript
    '''<summary>
    '''Provides one or more methods that can be run when the workflow scripting activity is performed.
    '''</summary>
    Public Class Script1
        Inherits RAScriptClass92
        '''<summary>
        '''This method is run when the activity is performed.
        '''</summary>
        Protected Overrides Sub Execute()
            'Write your code here. The BoundEntryInfo property will access the entry, RASession will get the Repository Access session

            'Only documents can be OCRed, so skip any other entry type.
            If Me.BoundEntryInfo.EntryType = EntryType.Document Then
                Dim docInfo As DocumentInfo = DirectCast(Me.BoundEntryInfo, DocumentInfo)
                Dim myOCR As Laserfiche.DocumentServices.OcrEngine = Laserfiche.DocumentServices.OcrEngine.LoadEngine()

                'Engine settings carried over unchanged from the original script.
                myOCR.AutoOrient = True
                myOCR.Decolumnize = False
                myOCR.OptimizationMode = 2
                'Run OCR against the bound document.
                myOCR.Run(docInfo)

            End If

        End Sub
    End Class
End Namespace

Drop a brand new SDK Script activity in your workflow and copy _only_ the code between the 'Sub Execute()' and the 'End Sub'.

In order to reference the DocumentServices routines you will need to add the appropriate reference. Here are some screen shots;

Right mouse click on the 'References' node in the Project Explorer, select 'Add Reference'

In the list presented look for the appropriate version of the DocumentServices assembly (9.2 in this case) and click OK

You should now be able to execute the script on a document to test it.

Let us know if you continue to have problems...


replied on July 25, 2015

Some additional thoughts;

I would assume that running OCR on documents from Workflow is going to have a negative impact on Workflow system performance. Therefore, if you are going to use this approach I would try to make it as efficient as possible.

One option would be to run the entire search, loop through results, and OCR in a single SDK script entry.

My assumption is that there is some overhead in loading and unloading the OCR engine and even though it will get cached it would be more efficient to load it once and run it against a series of documents instead of loading it for each document to OCR (As your current approach is going to do).

Maybe something like this;

If you think this has merit I can code up a script that will accomplish this for you...
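The load-once idea can be illustrated independently of the Laserfiche SDK. The sketch below is plain Python with hypothetical names throughout: `load_engine` stands in for whatever call creates the OCR engine, `fetch_document` for whatever looks up a document by Entry ID, and the engine is assumed to expose a `run` method and optionally a `close` method. None of these are Laserfiche API names.

```python
def ocr_batch(entry_ids, load_engine, fetch_document):
    """OCR a series of documents with a single engine instance.

    Returns a dict mapping each Entry ID to "ok" or an error description,
    so the caller can log which entries still need attention.
    """
    results = {}
    engine = load_engine()  # the load cost is paid once for the whole batch
    try:
        for entry_id in entry_ids:
            try:
                engine.run(fetch_document(entry_id))
                results[entry_id] = "ok"
            except Exception as exc:
                # One bad document should not abort the rest of the batch.
                results[entry_id] = f"error: {exc}"
    finally:
        # Release the engine if it offers a way to do so (assumed optional).
        close = getattr(engine, "close", None)
        if callable(close):
            close()
    return results
```

Recording a per-entry result rather than stopping on the first failure keeps the audit trail complete, which matters here given that the same Partials can fail repeatedly.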


replied on July 25, 2015

Wow Cliff, that's great, thank you!

- I am looking now at my current queue of DCC operations, and the main batch I set to work last night is now spinning its wheels after 9 hours, with very little to show for it when I search for the number of "none" documents in the directories I was working on.

I feel like it just isn't timing out efficiently and stays stuck on jobs that are not getting anywhere. This will mean doing a cancel all and either running again the same process to see if it handles the docs differently or changing the order (ie start on partial etc).

Anyway, your thoughts most certainly have merit. As I have a main server that is really only available for at most 5 hours a night due to the need for high availability at other times I will definitely need to design something with efficiency in mind.
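A nightly window like the one described could be respected by refusing to start new documents once a time budget is spent. The following is a minimal Python sketch under stated assumptions: `process_one` is a hypothetical callable that OCRs a single entry, and the function only checks the clock between documents, so it cannot interrupt a document that has already stalled.

```python
import time


def process_within_window(entry_ids, process_one, budget_seconds,
                          clock=time.monotonic):
    """Process entries until the time budget runs out.

    Returns (done, remaining) so the leftover Entry IDs can be carried
    over to the next night's run.
    """
    deadline = clock() + budget_seconds
    done, remaining = [], list(entry_ids)
    # Check the clock before each document; stop once the window has closed.
    while remaining and clock() < deadline:
        entry_id = remaining.pop(0)
        process_one(entry_id)
        done.append(entry_id)
    return done, remaining
```

For a five-hour window the budget would be `5 * 60 * 60` seconds. The `clock` parameter exists only so the behaviour can be checked with a fake clock.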

I will have a crack at the new script and report back. Thanks again!

Will


replied on July 25, 2015

I've run into another obstacle - it looks like the 9.2 SDK is not properly installed - although the Runtime instance was installed earlier.

When I run the 9.2 SDK installer and it gets to the Licensing Block allocation, it never provides a block to install it, even when I do a renew master license. I get an error stating that this license does not provide for a valid license block. Pic attached.

The reason I'm digging in this direction is that I can't see any assembly instances in the Script Editor like you showed me to identify, so I assume it is because the shared libs are not installed.

And I can't browse to boot... Nowhere I can see to configure allowing browsing of the tree.

Probably I can hook onto one of the existing instances, right? I might try that now, but I would like to understand why Rio can't add SDK - it's not a product constraint is it? Extra $$$s?

So, do I need the SDK at all or am I barking up the wrong tree?

Thanks for the help.

Will

license-bloclk.PNG (95.54 KB)

add_assembly.PNG (24.87 KB)

Assemblies.PNG (315.55 KB)

no-browse.PNG (138.38 KB)

replied on July 26, 2015

Roger that Cliff, I'll get all that sorted now.

While I cook up a good approach I am manually processing the backlog from the client. Interesting to see which documents are still baulking the system.

I'll post back once I make progress.

Really appreciate your detailed assistance! Thanks Mate!

Will


replied on July 26, 2015

Awesome, so now it is properly assembled with its dependencies. I will hook together a simple test workflow to call it and see how it goes.

I am not familiar with calling scripts and assessing the status returned success/fail but I'll have a play.

Will


replied on July 26, 2015

Good news! Let me know if you need help returning token values from the Script activity (Hint: 'Me.SetTokenValue(tokenName, tokenValue)')


replied on July 26, 2015

Wow I'm struggling to get the OCR to behave - errors even when launched manually - so handling these will be interesting through the script.

I have deferred the scripting till I have manually completed the OCR backlog.

I have kept a shadow tree structure at the root level with samples of all the trouble documents.


replied on July 27, 2015

Will,

Not certain if this is your problem, but I spent a lot of time back in March and April processing around 500,000 files with DCC, Nuance PowerPDF and the Omnipage (Nuance) OCR engine. Turns out there are certain types of file content that cause a lot of problems for the OCR engine. I worked for several months with Nuance, sending them multiple sample files that displayed the issue (usually stalled files), and after three rounds of hotfixes they have pretty much nailed the issue - at least for the problem files I submitted.

The changes they've made allow the engine to give up sooner on the problem areas of the page - areas that were never going to OCR anyway, and that made a huge difference in completion rate on large conversions since the engine now rarely stalls on bad files.

That's the good news. The bad news is those hotfixes were only applicable to the version of the OCR engine used by PowerPDF (also a Nuance product) and have not been applied to the OEM version that Laserfiche redistributes. I posted here a while back asking when Laserfiche might be able to incorporate an updated version of the Nuance OCR engine that includes the hotfixes, but got no response. Maybe someone from Laserfiche will pick up on this topic and have an answer.

Geoff
