Question

Generate OCR for a large number of documents in an efficient way

Laserfiche

Updated November 13, 2018

asked on November 13, 2018

I have a client with 200,000 documents and 5,000,000 images. We want to generate OCR text for all of these documents, but the sheer number of files is making this difficult.

We have installed Distributed Computing Cluster (DCC); however, when we run the process on just 10% of the documents, memory and CPU usage on the servers reach maximum capacity.

If we try to perform the OCR through the Laserfiche Client, it fails.

Can you recommend the best approach for running OCR on all of these documents?


Replies

replied on November 13, 2018

Just a side note on the DCC. With the default configuration, each worker will allow 1 concurrent task per processor core, meaning it will utilize 100% CPU under heavy load.

For example, if the server has 8 cores and you send it 8 documents to OCR at the same time, it will use 100% CPU until at least one task/document is completed.

In our environment, we changed this setting so the server will always have 1 available core to prevent it from staying constantly maxed out.

We set up 2 DCC workers, each with 12 cores (11 concurrent tasks + 1 free core) and 12GB of RAM; this configuration has been able to OCR more than 30,000 documents per day without getting maxed out.
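
The sizing above can be turned into rough arithmetic. The sketch below is illustrative only. It assumes one core is held back per worker, as described above, and it assumes a backlog of 200,000 documents would OCR at about the same per-document rate as the 30,000 per day reported here, which may not hold if page counts differ. The class and method names are made up for this example and are not part of DCC or the Laserfiche SDK.

```csharp
using System;

// Hypothetical helper for back-of-the-envelope sizing; not part of DCC or the Laserfiche SDK.
static class OcrCapacity
{
    // Concurrent tasks per worker when some cores are held back
    // (the default described above is 1 task per core, i.e. no cores held back).
    public static int ConcurrentTasks(int cores, int reservedCores = 1)
    {
        return Math.Max(1, cores - reservedCores);
    }

    // Days needed to clear a backlog at an observed daily throughput, rounded up.
    public static int DaysToClear(long backlogDocuments, long documentsPerDay)
    {
        if (documentsPerDay <= 0)
            throw new ArgumentOutOfRangeException(nameof(documentsPerDay));
        return (int)((backlogDocuments + documentsPerDay - 1) / documentsPerDay);
    }
}

static class Program
{
    static void Main()
    {
        int perWorker = OcrCapacity.ConcurrentTasks(cores: 12);  // 12 cores, 1 kept free
        int clusterSlots = 2 * perWorker;                         // two workers
        // ASSUMPTION: the backlog processes at the same per-document rate as reported above.
        int days = OcrCapacity.DaysToClear(200000, 30000);

        Console.WriteLine("Tasks per worker: {0}", perWorker);
        Console.WriteLine("Cluster slots:    {0}", clusterSlots);
        Console.WriteLine("Days for backlog: {0}", days);
    }
}
```

Treat the result as a lower bound for planning: documents with many pages will take longer per document than the average behind the 30,000 per day figure.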


replied on November 13, 2018

Also, never underestimate the value of having a few workstations that you can take over at night. During one big OCR project, I had my workstations and those of a couple other folks in IT set up to join DCC at night.

We used Windows Tasks and some PowerShell scripts to make sure that they gracefully entered and exited the cluster, and to provide a kill switch.
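
The scripts themselves are not shown in this thread. As a rough illustration of the scheduling logic only, the sketch below decides whether a workstation should be in the cluster at a given time. It is written in C# to match the SDK sample in this thread rather than PowerShell, and everything in it is an assumption: the 19:00 to 06:00 window, the kill-switch file path, and the class and method names are invented for the example. Actually starting or stopping the DCC worker is left out because it depends on how the worker is installed.

```csharp
using System;
using System.IO;

// Hypothetical scheduling helper; names, window, and file path are assumptions for illustration.
static class NightWorkerSchedule
{
    // True when 'now' falls inside [start, end), including windows that wrap past midnight.
    public static bool IsInWindow(TimeSpan now, TimeSpan start, TimeSpan end)
    {
        if (start <= end)
            return now >= start && now < end;
        return now >= start || now < end;
    }

    // The worker should run only inside the window and only while no kill-switch file exists.
    public static bool ShouldRun(DateTime now, TimeSpan start, TimeSpan end, string killSwitchPath)
    {
        if (File.Exists(killSwitchPath))
            return false;
        return IsInWindow(now.TimeOfDay, start, end);
    }
}

static class Program
{
    static void Main()
    {
        TimeSpan start = new TimeSpan(19, 0, 0);  // assumed start of the after-hours window
        TimeSpan end = new TimeSpan(6, 0, 0);     // assumed end of the after-hours window
        string killSwitch = Path.Combine(Path.GetTempPath(), "dcc-worker.stop"); // hypothetical path

        bool run = NightWorkerSchedule.ShouldRun(DateTime.Now, start, end, killSwitch);
        Console.WriteLine(run ? "Join the cluster" : "Stay out of the cluster");
    }
}
```

A scheduled task could run a check like this every few minutes and start or stop the worker accordingly. Creating the kill-switch file keeps the machine out of the cluster regardless of the time.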


replied on November 13, 2018

Very nice. We have to OCR our new documents on the fly, so the bulk of our load is during business hours, but I may follow up on this in case we ever need to go back and bulk OCR our legacy documents. We have about 40-50 million, and a dynamic cluster like that would come in very handy to get more done after hours.


replied on November 13, 2018

Another option is to use the LF SDK, if you have developer resources. Here is a sample program that finds all documents with the "Needs OCR" field set to "Yes" and OCRs them:

using System;
using Laserfiche.RepositoryAccess;
using Laserfiche.DocumentServices;

namespace BatchOCR
{
    class Program
    {
        static void Main(string[] args)
        {

            string server = "LFServerName"; // server name here
            string repository = "MyRepoName"; // repository name here
            string username = ""; // username here
            string password = ""; // password here

            try
            {
                RepositoryRegistration rr = new RepositoryRegistration(server, repository);

                using (Session session = new Session())
                {
                    if (!string.IsNullOrEmpty(username))
                        session.LogIn(username, password, rr);
                    else
                        session.LogIn(rr);

                    using (Search search = new Search(session, @"{[]:[Needs OCR]=""Yes""}"))
                    {
                        search.Run();

                        SearchListingSettings settings = new SearchListingSettings();

                        using (SearchResultListing listing = search.GetResultListing(settings))
                        {
                            foreach (EntryListingRow row in listing)
                            {
                                int entryId = (int)row[SystemColumn.Id];

                                try
                                {
                                    using (EntryInfo en = Entry.GetEntryInfo(entryId, session))
                                    {
                                        if (en.EntryType == EntryType.Document)
                                        {
                                            DocumentInfo doc = (DocumentInfo)en;

                                            // Take an exclusive lock so other running copies skip this document
                                            doc.Lock(LockType.Exclusive);

                                            Console.WriteLine(String.Format("OCRing {0}", en.Path));

                                            using (OcrEngine ocrEngine = OcrEngine.LoadEngine())
                                            {
                                                ocrEngine.Run(doc);
                                            }

                                            // Clear the flag so the document is not returned by the search again
                                            FieldValueCollection fvc = doc.GetFieldValues();
                                            fvc.Remove("Needs OCR");
                                            doc.SetFieldValues(fvc);
                                            doc.Save();
                                        }
                                    }
                                }
                                catch (LockedObjectException)
                                {
                                    // Another bot is probably OCRing the document
                                }
                            }
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
        }
    }
}

To use this, create a list field named "Needs OCR". Run a search in the LF client that returns all of the documents you want to OCR, select them all, and set this field to "Yes" on them. Then run this program to do a batch OCR. To speed up the process, run it on multiple machines simultaneously. This uses the .NET SDK; I tested it with 10.3.

