Question

Any tips on getting Quick Fields OCR quality to at least match LF Client OCR output with PDFs?

asked on July 18, 2017

Hi guys,

I'm having trouble getting acceptable results from Quick Fields (version 10.1.0.168) Zone OCR on PDF documents.  So far the failure rate is 36%: 21 of the customer's 58 test documents fail.  I've spent considerable time adjusting the Image Enhancement settings (in the correct order: colour removal for smoothing, then despeckle and smoothing adjusted 1 pixel at a time, testing after each change), trying different DPI scaling, setting Zone OCR to "Accuracy" and so on.  I still get better OCR results if I OCR the file in the Laserfiche client, but that isn't an option because I need to extract a number from the PDF for both a metadata field and the file name of the document in Laserfiche.  There are also several thousand documents to back scan.

I'm OCR'ing a zone to find a text string "DAILY JOB RECORD" and the number that follows that text.  I verify the orientation of the page by checking that one of the three words is recognised.  I then strip all letters and spaces from the result to leave me with a 7-digit number.  The number is what I'm after.  Hopefully the screenshots below help clarify things.  
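The post-processing described above (an orientation check on the three words, then stripping letters and spaces to leave a 7-digit number) can be sketched outside Quick Fields roughly as follows. This is a minimal Python illustration, not Quick Fields configuration; the function name `extract_job_number` and the sample strings are hypothetical.

```python
import re

KEYWORDS = ("DAILY", "JOB", "RECORD")

def extract_job_number(zone_text):
    """Return the 7-digit number from zone OCR text, or None if the read fails."""
    upper = zone_text.upper()
    # Orientation check: at least one of the three words must be recognised.
    if not any(word in upper for word in KEYWORDS):
        return None
    # Strip all letters and whitespace, as described in the post.
    stripped = re.sub(r"[A-Za-z\s]", "", zone_text)
    # Accept the result only if exactly seven digits remain.
    return stripped if re.fullmatch(r"\d{7}", stripped) else None

print(extract_job_number("DAILY JOB RECORD 1234567"))
print(extract_job_number("7"))
```

Note that a strip-only approach rejects a read in which a letter has been misrecognised as a digit (for example "J0B"), because the stray digit ends up in the result.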

Here are the Quick Fields PDF settings:

I've tested with and without converting to B&W and get the same results; however, I get better results at 600 DPI than at 300.

Here's a successful extraction:

Here's an unsuccessful one - the whole string in the zone (the OCR zone is a lot bigger than displayed, I'm just limiting what I'm putting online) has been recognised simply as "7":

If I then store the document in Laserfiche, open it in the Laserfiche client and run OCR, it successfully recognises the whole string, including the number, from the same document that failed in Quick Fields (where the OCR returned only "7"):

If I print the PDF via Snapshot I also get an accurate read:

 

I really need to get at least the same quality of result in Quick Fields.  I've tested all the Image Enhancement settings except Invert and Line Removal, but nothing has helped, even with the three Smoothing options and Despeckle set as low as 1 pixel.  I'm also testing with both colour and B&W original PDFs, but it doesn't seem to make any difference: I still get a much better result in the Laserfiche client than in Quick Fields.

Any tips or advice would be much appreciated.

Thanks,

Mike

 


Answer

SELECTED ANSWER
replied on July 19, 2017

Have you taken off the color removal?  Since the numbers are red, color removal might be making them too light to read.  I also have issues getting good-quality OCR in Quick Fields.

 

Have you thought about trying 2 OCR zones in the exact same area?  One would have a character preference set to Letters and look for the words so you can check orientation.  The second would have a character preference set to Numbers.  After the numbers zone, maybe use pattern matching to look for the expected number of digits in a row instead of just stripping out the letters and spaces?
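To illustrate the pattern-matching idea, here is a rough Python sketch that looks for a run of exactly seven digits (allowing a stray space between digits) rather than stripping letters. It is not Quick Fields pattern-matching syntax, which may differ; the name `find_job_number` and the sample strings are made up for the example.

```python
import re

# Seven digits, each optionally followed by one whitespace character,
# with no digit immediately before or after the run.
SEVEN_DIGITS = re.compile(r"(?<!\d)(?:\d\s?){7}(?!\d)")

def find_job_number(zone_text):
    """Return the first 7-digit run in the text with spaces removed, or None."""
    match = SEVEN_DIGITS.search(zone_text)
    if match is None:
        return None
    return re.sub(r"\s", "", match.group(0))

print(find_job_number("DAILY J0B RECORD 123 4567"))
print(find_job_number("7"))
```

Because the pattern requires seven digits in a row, an isolated digit elsewhere in the zone (such as a "0" misread inside a word) is ignored instead of being glued onto the number.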

 


Replies

replied on July 19, 2017

My first guess would be that the Zone OCR process is actually returning text from a different location than the one you're expecting. Do you get the same result (i.e. just a "7") if you OCR the whole page in Quick Fields instead of just a zone? Do you have any deskew, resize, or border removal processes in your session? What is the 'Use Existing Text' advanced option set to for your Zone OCR process?

replied on July 19, 2017

Hi Tessa,

I have just tested with whole page OCR and found the 7 on the example document I described:

 

I've got Single Line set as True in the Advanced options for the zone OCR settings so I'll revisit that and see if I can get that one sorted without breaking anything else.  :o)

I'll also try Jennifer's two zones suggestion and see if that helps extract the number more accurately.

Thanks again,

Mike

replied on July 19, 2017

Ah, I see. If you have Single Line set to true and it's reading that little corner as a 7, then that explains why it's not getting anything else. If I were you I'd just try moving the Zone down a little bit. 

replied on July 19, 2017

Hi ladies,

Thank you very much for your responses.  To answer your questions: I had colour removal on for the smoothing and found exactly the issue you mentioned, Jennifer, so I quickly discarded that option.  I currently have pattern matching running to confirm that the correct number is extracted and to remove spaces from the number that's returned, but I'll try the second OCR zone approach as that's not something I'd thought of.  Thanks.

Tessa, I'm running an initial test set of 30 documents through the process (I'm also trying to identify common read errors, such as "L" for 4) and getting approximately 60% successful OCR results, so I'm confident that the text is being returned from the correct location; the problem appears to be the accuracy of the recognition.  But I'll add a full-page OCR to the process and see if that generates a different result.
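One way to act on recurring misreads like "L" for 4 is to apply a small substitution table to the candidate number before validating it. A rough Python sketch, outside Quick Fields: only the L-to-4 pair comes from the tests described here; the other pairs and the name `repair_number` are assumptions for illustration.

```python
import re

# Only "L" -> "4" is an observed misread; the remaining pairs are
# illustrative assumptions and should be replaced with errors actually seen.
CONFUSIONS = str.maketrans({"L": "4", "O": "0", "I": "1", "S": "5"})

def repair_number(candidate):
    """Fix known letter-for-digit misreads, then require exactly 7 digits."""
    cleaned = re.sub(r"\s", "", candidate).upper().translate(CONFUSIONS)
    return cleaned if re.fullmatch(r"\d{7}", cleaned) else None

print(repair_number("123L567"))
```

The table should only be applied to the text already isolated as the number, not to the whole zone, otherwise letters in "DAILY JOB RECORD" would be turned into digits as well.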

I've tested deskew, but encountered a few issues with it randomly adjusting documents to 45-degree angles and losing the zone area altogether, so I've removed that.  I'm not using resize or border removal processes.  "Use Existing Text" is set to True.

Thanks very much,

Mike
