
Question


In the SDK, can I check whether a document has had the generate text process (OCR) performed on it?

asked on June 12, 2014

I want to check through SDK code whether text has been generated on a document. I cannot see any attribute on DocumentInfo that would provide this information.

I can check each page and look at the "HasText" attribute, although this would not be set for a blank page.

 


Answer

SELECTED ANSWER
replied on June 12, 2014

Use the OcrPageCount entry listing column:

 

EntryListingSettings elparams = new EntryListingSettings();
elparams.AddColumn(SystemColumn.OcrPageCount);

SingleEntryListing entrylisting = new SingleEntryListing(entryID, elparams, sess);
Object colVal = entrylisting.GetDatum(1, SystemColumn.OcrPageCount);
bool hasOCRedPages = false;
// Superseded: the column holds an OcrState value, not a page count (see the EDIT below).
if (colVal != null)
    hasOCRedPages = (int)colVal > 0;

This is what the client does for the "OCR'ed Pages" column.

 

EDIT: The column returns an enumeration, not the actual count of OCRed pages, so the code should actually be this:

 

// Request only the OCR state column for the entry.
EntryListingSettings elparams = new EntryListingSettings();
elparams.AddColumn(SystemColumn.OcrPageCount);

// Fetch the listing row for this one entry and read the column value.
SingleEntryListing entrylisting = new SingleEntryListing(entryID, elparams, sess);
Object colVal = entrylisting.GetDatum(1, SystemColumn.OcrPageCount);
bool hasOCRedPages = false;
if (colVal != null)
{
    // The value is an OcrState, not a count of OCR'ed pages.
    OcrState ocrState = (OcrState)colVal;
    if (ocrState == OcrState.SomePages || ocrState == OcrState.AllPages)
        hasOCRedPages = true;
}

 

replied on June 13, 2014

Thanks for this Robert.

 

I did a quick test and it seems to return only three values: 1, 2, or 3. These seem to match the options in the client search:

1 = No pages OCR'd

2 = Some pages OCR'd

3 = All pages OCR'd
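
For anyone who only has the raw number to work with, the mapping above could be sketched roughly as follows. This is a minimal, self-contained illustration: ObservedOcrState and OcrColumnHelper are hypothetical names that simply mirror the three values seen in the test, and real code should prefer the SDK's own OcrState enumeration.

```csharp
using System;

// Hypothetical local mirror of the values observed in the test above.
// It is NOT the SDK's OcrState type; it only records 1, 2 and 3.
enum ObservedOcrState
{
    NoPages = 1,
    SomePages = 2,
    AllPages = 3
}

static class OcrColumnHelper
{
    // Returns true when the raw column value indicates at least one OCR'd page.
    // A null value (column not returned) is treated as "no text generated".
    public static bool HasOcredPages(object colVal)
    {
        if (colVal == null)
            return false;

        // Convert.ToInt32 accepts a boxed int as well as a boxed enum value.
        int raw = Convert.ToInt32(colVal);
        return raw == (int)ObservedOcrState.SomePages
            || raw == (int)ObservedOcrState.AllPages;
    }
}
```

With this, the value returned by GetDatum can be passed straight in, for example OcrColumnHelper.HasOcredPages(colVal).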

replied on June 13, 2014

You're right, it actually returns an OcrState enumeration instead of the count of text pages. I updated my original post.


Replies

replied on June 12, 2014

Another way to do it would be to write your own function for running the OCR service against a document, and then use an if statement to check whether that function was run for the document.

 

If you wanted to keep that information after the program closes, you could store the document name in a text file during your OCR function and then read that file back in on program start. Since the OCR function itself stores the name, you can assume that a document was OCR'ed if its name is in the file.

 

This method will only help you find OCR'ed documents that were imported using your custom function, but that's better than nothing. If you want some code for this, let me know. Rob's solution is probably the better way of doing things.
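
A rough sketch of the text-file idea, assuming one document name per line. OcrLog, MarkOcred and WasOcred are made-up names for illustration and are not part of the SDK; the call to the OCR service itself is left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper that remembers which documents went through
// a custom OCR routine, persisting the names to a plain text file.
class OcrLog
{
    private readonly string _path;
    private readonly HashSet<string> _names;

    public OcrLog(string path)
    {
        _path = path;
        // Load names recorded by earlier runs, if the log file already exists.
        _names = File.Exists(path)
            ? new HashSet<string>(File.ReadAllLines(path))
            : new HashSet<string>();
    }

    // Call this from the custom OCR routine once OCR has completed.
    public void MarkOcred(string documentName)
    {
        // Append only when the name is new, so the file has no duplicates.
        if (_names.Add(documentName))
            File.AppendAllText(_path, documentName + Environment.NewLine);
    }

    // The "if" check: was this document processed by the custom routine?
    public bool WasOcred(string documentName)
    {
        return _names.Contains(documentName);
    }
}
```

Document names are not necessarily unique, so recording the entry ID instead of the name may be the safer key.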

 

EDIT: some words, some clarification, etc
