OCR only first page in WF

SELECTED ANSWER

replied on February 28, 2018

Not at this time.

replied on March 1, 2018

Ok, thank you.

replied on March 2, 2018

Depending on your environment and needs, you can OCR the first page of a document with an SDK Script activity, but you probably want to run it on a secondary Workflow server so it does not bog down your main production workflow.

In the SDK Script activity, add a reference to Laserfiche.DocumentServices. Then, in the Imports section at the top of the code, add Imports Laserfiche.DocumentServices:

Imports System
Imports System.Collections.Generic
Imports System.ComponentModel
Imports System.Data
Imports System.Data.SqlClient
Imports System.Text
Imports Laserfiche.DocumentServices
Imports Laserfiche.RepositoryAccess

Namespace WorkflowActivity.Scripting.SDKScript
    '''<summary>
    '''Provides one or more methods that can be run when the workflow scripting activity is performed.
    '''</summary>
    Public Class Script1
        Inherits RAScriptClass102
        '''<summary>
        '''This method is run when the activity is performed.
        '''</summary>
        Protected Overrides Sub Execute()
            'Write your code here. The BoundEntryInfo property will access the entry, RASession will get the Repository Access session
            Dim sError As String = "None"
            Try
                ' Retrieves a document to be processed with OCR.
                If BoundEntryInfo.EntryType = EntryType.Document Then
                    Using Doc As DocumentInfo = DirectCast(BoundEntryInfo, DocumentInfo)
                        Doc.Lock(LockType.Exclusive)
                        Try
                            ' Instantiate a new OCR engine.
                            Using ocr As OcrEngine = OcrEngine.LoadEngine()
                                ' Configure OCR options.
                                ocr.AutoOrient = True
                                ocr.Decolumnize = True
                                ocr.OptimizationMode = OcrOptimizationMode.Accuracy
                                ' Generate text for page 1 of the given document.
                                Dim ps As PageSet = New PageSet()
                                ps.AddPage(1)
                                ocr.Run(Doc, ps)
                            End Using
                        Finally
                            ' Unlock the document even if OCR throws.
                            Doc.Unlock()
                        End Try
                    End Using
                End If
            Catch ex As Exception
                sError = ex.Message
                WorkflowApi.TrackError(ex.Message)
            End Try
            SetTokenValue("Script_Error", sError)
        End Sub
    End Class
End Namespace

replied on March 6, 2018

Hello Bert,

Yes, I want to use a secondary Workflow Server with Distributed Computing Cluster.

Thanks for your answer.

Regards

replied on March 8, 2021

Hi @Bert

Thanks for your script. I'm using it and it's working pretty well.

I do have one problem, though.

Sometimes the OCR reads the text line by line, and sometimes column by column.

For example:

Document :

Company : COPY-R

Firstname : Olivier

ocr by line :

Company : COPY-R

Firstname : Olivier

ocr by column :

Company : Firstname :

COPY-R Olivier

I use pattern matching to extract the values, and my patterns assume line-by-line OCR. Is it possible to force the OCR to always read line by line, rather than sometimes by column?

Thanks in advance.

Regards

replied on March 8, 2021

My guess is that on the source image, the labels and the values do not share a common baseline, so the OCR engine sometimes puts them on separate lines.

In the script, you can try changing ocr.Decolumnize = True to False.
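For reference, a minimal sketch of that change, reusing only the option lines from the script earlier in this thread. It assumes the rest of the script stays as posted, and whether it changes how lines are grouped will depend on the source image:

```vb
' Inside the "Using ocr As OcrEngine = OcrEngine.LoadEngine()" block of the script above.
ocr.AutoOrient = True
' Changed from True to False; the other options are unchanged.
ocr.Decolumnize = False
ocr.OptimizationMode = OcrOptimizationMode.Accuracy
```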

replied on March 8, 2021

Hi Bert,

Thanks for your reply. Yes, I thought the same, but I understood "Decolumnize = True" to already mean "not in columns", so I was confused.

I'm going to try False. Thanks a lot.

Regards

replied on March 8, 2021

ocr.Decolumnize = True means the engine removes the extra white space that creates columns, so I do not think changing it will help. Again, the issue is that the source does not have the same baseline across the data in the line.

replied on March 10, 2021

Hi Bert.

I have another problem, this time with using the SDK to OCR my files.

It looks like the OCR has a limit.

I tried to OCR a PDF with 198 pages. Only the first 80 pages were OCRed.

I also get a warning.

replied on March 11, 2021

The OCR engine does not OCR PDFs. It will only OCR Laserfiche images. Since you are saying that 80 pages were OCRed, I assume that you have already generated the pages.

There is no page limit on the OCR engine, so I would suggest going through a support case to try to figure out what the issue is.

replied on March 11, 2021

The warning indicates that the script ran too long and exceeded the default 2-minute timeout. The value can be changed in the WF Server properties, but the change will affect all scripts.

If the goal is to OCR (and/or generate pages from) PDFs, then this is the wrong thread and you don't need scripts. Use Distributed Computing Cluster 11 and the Schedule OCR/PDF Page Generation activity.

replied on March 11, 2021

Thanks, all, for your replies.

@Miruna, what are the downsides if I increase the timeout beyond 2 minutes?

replied on March 11, 2021

Scripts may stay actively running for longer, so your Workflow server would take longer to get through these tasks. Under load, you may end up with a processing backlog.
