Hello
Is it possible to OCR only the first page of a document with the Schedule OCR activity in Workflow?
Thanks.
Not at this time.
Ok, thank you.
Depending on your environment and needs, you can OCR the first page of a document with an SDK Script activity, but you probably want to run it on a secondary Workflow server so it does not bog down your main production workflow.
In the SDK Script activity, add a reference to Laserfiche.DocumentServices. Then, in the Imports section (top of the code), add Imports Laserfiche.DocumentServices.
    Imports System
    Imports System.Collections.Generic
    Imports System.ComponentModel
    Imports System.Data
    Imports System.Data.SqlClient
    Imports System.Text
    Imports Laserfiche.DocumentServices
    Imports Laserfiche.RepositoryAccess

    Namespace WorkflowActivity.Scripting.SDKScript
        '''<summary>
        '''Provides one or more methods that can be run when the workflow scripting activity is performed.
        '''</summary>
        Public Class Script1
            Inherits RAScriptClass102

            '''<summary>
            '''This method is run when the activity is performed.
            '''</summary>
            Protected Overrides Sub Execute()
                'Write your code here. The BoundEntryInfo property will access the entry, RASession will get the Repository Access session
                Dim sError As String = "None"
                Try
                    ' Retrieves a document to be processed with OCR.
                    If BoundEntryInfo.EntryType = EntryType.Document Then
                        Using Doc As DocumentInfo = DirectCast(BoundEntryInfo, DocumentInfo)
                            Doc.Lock(LockType.Exclusive)
                            ' Instantiates a new OCR engine.
                            Using ocr As OcrEngine = OcrEngine.LoadEngine()
                                ' configure OCR options
                                ocr.AutoOrient = True
                                ocr.Decolumnize = True
                                ocr.OptimizationMode = OcrOptimizationMode.Accuracy
                                ' Generate text for page 1 of the given document
                                Dim ps As PageSet = New PageSet()
                                ps.AddPage(1)
                                ocr.Run(Doc, ps)
                            End Using
                            ' unlock the document
                            Doc.Unlock()
                        End Using
                    End If
                Catch ex As Exception
                    sError = ex.Message
                    WorkflowApi.TrackError(ex.Message)
                End Try
                SetTokenValue("Script_Error", sError)
            End Sub
        End Class
    End Namespace
Hello Bert,
Yes, I want to use a secondary Workflow Server with Distributed Computing Cluster.
Thanks for your answer.
Regards
Hi @Bert
Thanks for your script. I'm using it and it's working pretty well.
I do have one problem, though.
Sometimes the OCR reads the text by line, and sometimes by column.
For example:
Document :
Company : COPY-R
Firstname : Olivier
OCR by line :
Company : COPY-R
Firstname : Olivier
OCR by column :
Company : Firstname :
COPY-R Olivier
I'm using pattern matching to get the values (based on the by-line OCR output). Is it possible to force the OCR to always read by line, instead of switching unpredictably between line and column?
Thanks in advance.
Regards
My guess is that on the source image, the labels and the values do not share a common bottom edge, so the OCR engine sometimes puts them on separate lines.
In the script, you can try changing ocr.Decolumnize = True to False.
Hi Bert,
Thanks for your reply. Yes, I thought the same, but I assumed "Decolumnize = True" already meant "not in columns", so I was confused.
I'm going to try False. Thanks a lot.
Regards
ocr.Decolumnize = True means the engine removes the extra white space that creates columns, so I do not think setting it to False will help. Again, the issue is that the source does not have the same line bottom across the data in the line.
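If the OCR layout cannot be made consistent, one workaround is to make the value extraction accept both layouts shown above. The following is only a minimal sketch in plain .NET, not a Laserfiche API: the module name OcrFieldParser and the function ExtractFields are hypothetical, and it assumes that in the "column" layout each value contains no spaces, so values can be paired with labels by position.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module OcrFieldParser

    ' Hypothetical helper (not part of any Laserfiche SDK).
    ' Returns label/value pairs from OCR text in either layout:
    '   by line:   "Company : COPY-R"
    '   by column: "Company : Firstname :" followed by "COPY-R Olivier"
    Public Function ExtractFields(ByVal ocrText As String) As Dictionary(Of String, String)
        Dim fields As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)

        ' Split into trimmed, non-empty lines.
        Dim lines As New List(Of String)
        For Each raw As String In Regex.Split(ocrText, "\r\n|\r|\n")
            If raw.Trim().Length > 0 Then lines.Add(raw.Trim())
        Next

        Dim i As Integer = 0
        While i < lines.Count
            Dim line As String = lines(i)
            Dim colon As Integer = line.IndexOf(":"c)

            If colon < 0 Then
                ' No label on this line; skip it.
                i += 1
            ElseIf line.EndsWith(":") Then
                ' Labels only: try to pair them with the values on the next line.
                Dim labels As New List(Of String)
                For Each part As String In line.Split(":"c)
                    If part.Trim().Length > 0 Then labels.Add(part.Trim())
                Next

                Dim paired As Boolean = False
                If i + 1 < lines.Count AndAlso lines(i + 1).IndexOf(":"c) < 0 Then
                    ' Assumption: values are separated by white space and contain none.
                    Dim values As String() = Regex.Split(lines(i + 1), "\s+")
                    If values.Length = labels.Count Then
                        For k As Integer = 0 To labels.Count - 1
                            fields(labels(k)) = values(k)
                        Next
                        paired = True
                    End If
                End If
                i += If(paired, 2, 1)
            Else
                ' Label and value on the same line.
                fields(line.Substring(0, colon).Trim()) = line.Substring(colon + 1).Trim()
                i += 1
            End If
        End While

        Return fields
    End Function

    Sub Main()
        Dim byLine As String = "Company : COPY-R" & Environment.NewLine & "Firstname : Olivier"
        Dim byColumn As String = "Company : Firstname :" & Environment.NewLine & "COPY-R Olivier"
        Console.WriteLine(ExtractFields(byLine)("Firstname"))
        Console.WriteLine(ExtractFields(byColumn)("Firstname"))
    End Sub

End Module
```

When the number of values on the second line does not match the number of labels (for example, a value that contains a space), the sketch leaves those labels out rather than guessing, so a missing key is the signal to review the document manually.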
Hi Bert.
I have another problem using the SDK to OCR my files.
It looks like the OCR has a limit.
I tried to OCR a PDF with 198 pages. Only the first 80 pages were OCRed.
I have this warning
The OCR engine does not OCR PDFs. It will only OCR Laserfiche images. Since you are saying that 80 pages were OCRed, I assume that you have already generated the pages.
There is no page limit on the OCR engine, so I would suggest going through a support case to try to figure out what the issue is.
The warning indicates that the script took too long and exceeded the default 2-minute timeout. The value can be changed in the WF Server properties, but the change will affect all scripts.
If the goal is to OCR (and/or generate pages from) PDFs, then this is the wrong thread and you don't need scripts. Use Distributed Computing Cluster 11 and the Schedule OCR/PDF Page Generation activity.
Thanks, all, for your replies.
@Miruna, what are the downsides of changing the timeout to more than 2 minutes?
Scripts may stay actively running longer, so your Workflow server would take longer going through these tasks. Under load, you may end up with a backlog in processing.