Question

Creating Searchable Text for Image-Based PDFs in WebAccess

asked on April 19, 2016

I have DCC and iFilters installed, so when I import Office documents, PDFs with text, and TIFFs using drag and drop in WebAccess, they all become text searchable.

My challenge is that image-based PDFs (scans) do not, even though I have Generate Pages and Text enabled when importing these files.


Replies

replied on April 19, 2016

Try generating pages first, then generating text to OCR the images.

replied on April 19, 2016

In WebAccess, these actions have to be done at the point of import or provided by DCC, as they are not available in WebAccess as drop-down actions to perform once the files are imported. I've also noticed that the PDF files don't get queued up in DCC.

replied on April 19, 2016

Hi Steve, 

You can generate pages directly from the main Web Access UI. You need to use DCC for OCR'ing image pages, but that's the same as during import. 

replied on April 19, 2016

I was able to generate the text manually using Generate Searchable Text, but I believe you have a bigger issue: it should have been generated automatically on import, since those functions (Generate Pages and Text) were selected at the time of import.

replied on April 19, 2016

Hi Steve, 

If you are talking about PDFs specifically, you should check whether they have text streams. By default, text generation on PDFs does not use OCR (since you need the generated pages for that); instead, it takes the text stream directly. If they do not have text streams, then standard text generation will not cover them. Once pages have been generated, generating text always OCRs them, so it no longer matters whether they have text streams.

To ensure that OCR happens on import (to generate text even without text streams), go to 'advanced options for PDFs' in the generate text section in user options and ensure 'OCR documents without a text stream when generating pages' is checked.
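
To make the rules above easier to follow, here is a small sketch of the decision as I have described it. It is only a model of the behavior, not Laserfiche code or a Laserfiche API: the class, the field names, and the function are all hypothetical, and the `ocr_without_text_stream` flag simply stands in for the 'OCR documents without a text stream when generating pages' option.

```python
from dataclasses import dataclass


@dataclass
class PdfImport:
    """Hypothetical description of a PDF at the time text generation runs."""
    has_text_stream: bool            # PDF carries an embedded text layer
    pages_already_generated: bool    # image pages exist before text generation
    ocr_without_text_stream: bool = False  # stand-in for the user option


def text_source(doc: PdfImport) -> str:
    """Return 'ocr', 'text_stream', or 'none' per the rules described above."""
    if doc.pages_already_generated:
        # Once pages exist, generating text OCRs them regardless of text streams.
        return "ocr"
    if doc.has_text_stream:
        # Default path: text is taken directly from the PDF's text stream.
        return "text_stream"
    if doc.ocr_without_text_stream:
        # Option checked: scanned PDFs get OCR'd as part of page generation.
        return "ocr"
    # Scanned PDF, option unchecked: no searchable text is produced.
    return "none"


if __name__ == "__main__":
    scan = PdfImport(has_text_stream=False, pages_already_generated=False)
    print(text_source(scan))
```

Under this model, a scanned PDF imported with the option unchecked ends up with no searchable text, which matches the symptom in the original question.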

replied on April 19, 2016

Yes, I am talking about image-based PDFs with no text stream. These PDFs originated from a scan and there is no text layer.

The PDFs with text work fine (i.e., those produced from an electronic file such as an Office document).

Can you send a picture of the settings screen you refer to in the bottom paragraph? I cannot find those settings in WebAccess, and I'm logged in as an admin.

replied on February 25, 2020

I am also trying to find this option but it does not seem to be within the WebAccess options. I am using 10.3 as well.

replied on March 3, 2020
