Question

PDF Generate Pages and Text

asked on September 22

In one of our repositories, we do not generate pages and searchable text on import, but occasionally we want the PDFs to be searchable. The menu lets us generate pages and searchable text, but the document still cannot be searched afterward. The text quality is good. Please see the attached video. The same happens with the desktop version.


Answer

SELECTED ANSWER
replied on October 31

OK, if it's an image-only PDF, you do have to generate pages first and then OCR those generated pages to get text. The repository web client can extract text directly from a PDF that already contains text, but it needs help to do full OCR text extraction from an image page. Since you are on a self-hosted system, OCR text generation on pages through the repository web client runs through DCC (Distributed Computing Cluster), a separately installed component that is also included in the Laserfiche package. The work is offloaded because OCR text generation from image pages is very processor-intensive and could interfere with normal operations if performed directly on the application server.

Have you confirmed that DCC is up and running and connected to your repository web client application server, and checked the DCC service for any issues listed there? The search service itself doesn't matter if no text is being generated. Your SP is correct that the locally installed Windows client bypasses this, because it can do the OCR text generation locally. Using it is not mandatory; it just doesn't require connecting to the DCC service to offload the operation.
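
For anyone who finds it easier to read as logic, here is a rough Python sketch of where the OCR step runs for an image-only PDF once pages have been generated, based only on what is described in this thread. The function name, parameter values, and return strings are made up for illustration and are not part of any Laserfiche API.

```python
def ocr_route(client: str, deployment: str) -> str:
    """Say where OCR runs for an image-only PDF whose pages are already generated.

    Hypothetical helper for illustration only; not a Laserfiche API.
    """
    if client not in ("web", "windows"):
        raise ValueError(f"unknown client: {client!r}")
    if deployment not in ("self-hosted", "cloud"):
        raise ValueError(f"unknown deployment: {deployment!r}")

    if client == "windows":
        # The Windows client can OCR locally, so DCC is not involved.
        # Assumption: the thread only discusses this for self-hosted systems.
        return "local"
    if deployment == "self-hosted":
        # Web client on a self-hosted system: OCR is offloaded to DCC.
        return "dcc"
    # Web client on Laserfiche Cloud: OCR goes to the Cloud text generation service.
    return "cloud-text-generation"


print(ocr_route("web", "self-hosted"))      # dcc
print(ocr_route("windows", "self-hosted"))  # local
```

In the situation described here (web client, self-hosted), the route is DCC, which is why a stopped or disconnected DCC service would leave the pages without searchable text.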


Replies

replied on September 22

Hi Vikki, 

It's generally not necessary to run both generate pages and generate search text as separate steps. Generating pages will usually also add the text to the search index; generating text is a separate process you would normally use for documents that already have images. Also, when done through the repository web client, text generation does not happen immediately: it is offloaded either to DCC (for self-hosted systems) or to the Cloud text generation service (for Laserfiche Cloud), and there may be a delay before the text is returned.

Lastly, you are in the file view (viewing the document as a PDF), and the search you are using is the embedded Adobe Reader search, so none of the above options will actually affect it. You can tell because that's an Adobe Reader toolbar, not the Laserfiche document viewer toolbar. That search can only find text if the PDF has an embedded text stream, which is a property of the original PDF itself, not something produced by generating text through Laserfiche. Once you have generated pages, toggle to the page view so that Laserfiche search is used for search operations.
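
As a rough illustration of the point about the two views, the Python sketch below encodes which search is in play and which text it depends on. The names and return values are hypothetical and only restate the explanation above; nothing here calls Laserfiche or Adobe software.

```python
from typing import Tuple


def search_in_view(view: str, pdf_has_embedded_text: bool,
                   laserfiche_text_generated: bool) -> Tuple[str, bool]:
    """Return (which search is used, whether it has any text to search).

    Hypothetical helper for illustration only.
    """
    if view == "file":
        # File view shows the original PDF, so only its own text stream counts.
        return ("embedded PDF viewer search", pdf_has_embedded_text)
    if view == "page":
        # Page view uses Laserfiche search over the text generated in Laserfiche.
        return ("Laserfiche search", laserfiche_text_generated)
    raise ValueError(f"unknown view: {view!r}")


# Image-only PDF viewed in file view, even after text was generated in Laserfiche:
print(search_in_view("file", False, True))
# prints ('embedded PDF viewer search', False)
```

Switching the same document to the page view would return `('Laserfiche search', True)`, provided the generated text actually arrived.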

replied on September 22

Thank you for your reply, Justin. I can't get the search to work whether the file view is toggled on or off.

replied on September 23

I just wanted to mention that we have verified that the full-text search service is running on our application server, and we also restarted it just in case. I "generated pages" on several other PDFs in the repo this morning, with the same result.

replied on September 23

Does it work if you do a text search across the repository as a whole? One interesting thing is that the word IS getting highlighted, which would mean text has been generated there and associated with that spot for it to be found; it's just not showing up in the document viewer search.

replied on September 23

I highlighted that to show you where it exists; the search doesn't find it. It does not find it when searching document text from the repo as a whole either.

replied on September 23

It does not find it; I highlighted the search term and where it exists just to show that. Unfortunately, it does not find it from a document text search in the repo as a whole either.

replied on September 23

OK, it sounds like that's something you'd want to open a support case on at this point, to dig into why the text isn't making it into your search index.

replied on September 23

Thank you!

replied on October 31

@████████  The answer from Laserfiche to our solution provider was:  "To get words out of a document, the process extracts text from it. Since it is an image only PDF, they will have to generate LF pages first then OCR. The simplest way to do this for PDFs such as this is to use a locally installed Windows client."   

Our users only use the Web Client, and we have been generating pages and then generating text. I am confused, as this seems to be what they are saying we should do, but when we do, there is still no searchable text. Do you possibly have any other ideas on this issue?

replied on November 3

This is very helpful. I know DCC was set up in the past but something related here must be the issue. We will check that. Thank you!
