In one of our repositories, we do not generate pages and searchable text on import, but occasionally we want the PDFs to be searchable. The web client lets us generate pages and searchable text from the menu, but afterward the document still cannot be searched. The text quality is good. Please see attached video. The same happens with the desktop version.
Question
Answer
Ok, if it's an image-only PDF you do have to generate pages first, then OCR those generated pages to get text. The repository web client will handle extracting text directly from a PDF that already contains text, but it needs help to do full OCR text extraction from an image page. Since you are on a self-hosted system, OCR text generation on pages through the repository web client is handled by DCC (Distributed Computing Cluster), a separately installed component that is also included in the Laserfiche package. This is because OCR text generation from image pages is very processor-intensive and could interfere with normal operations if performed directly on the application server.
Have you confirmed that DCC is up and running and connected to your repository web client application server, and checked the DCC service for any reported issues? The search service itself doesn't matter if you aren't getting text generated. Your SP is correct that the locally installed Windows client bypasses this because it can do the OCR text generation locally - using it is not mandatory, it just doesn't need to connect to the DCC service to offload the operation.
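As a rough illustration of the routing described in this thread, here is a small decision function. It is not Laserfiche code: `Client`, `Deployment`, and `text_generation_plan` are names made up for this sketch, and it only models the cases mentioned here.

```python
from enum import Enum
from typing import List


class Client(Enum):
    WEB = "web client"
    WINDOWS = "Windows client"


class Deployment(Enum):
    SELF_HOSTED = "self-hosted"
    CLOUD = "Laserfiche Cloud"


def text_generation_plan(pdf_has_text: bool, client: Client,
                         deployment: Deployment) -> List[str]:
    """Hypothetical model of where searchable text comes from."""
    if pdf_has_text:
        # A PDF that already contains text needs no OCR.
        return ["extract the embedded text directly from the PDF"]

    # Image-only PDF: pages must exist before anything can be OCRed.
    steps = ["generate pages from the image-only PDF"]
    if client is Client.WINDOWS:
        # The locally installed Windows client can run OCR itself.
        steps.append("OCR the generated pages locally")
    elif deployment is Deployment.SELF_HOSTED:
        # The web client offloads OCR so the application server is not loaded.
        steps.append("offload OCR of the generated pages to DCC")
    else:
        steps.append("offload OCR of the generated pages to the Cloud text generation service")
    return steps


if __name__ == "__main__":
    for step in text_generation_plan(False, Client.WEB, Deployment.SELF_HOSTED):
        print(step)
```

For a self-hosted web client and an image-only PDF, the second step depends on DCC, which is why a stopped or disconnected DCC service would leave the document without searchable text.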
Replies
Hi Vikki,
It's not generally necessary to do BOTH generate pages and generate search text separately. Generally the process of generating pages will also add the text to the search index - generating text is usually a process you run separately for documents that already have images. Also, when used through the repository web client, generating text is not an immediate process - it's offloaded either to DCC (for self-hosted) or to the Cloud text generation service (for Laserfiche Cloud), and there may be a delay before the text is returned.
Lastly, you are in the file view (viewing as a PDF) and the search you are using is the embedded Adobe Reader search, so none of the above options will actually affect it. You can tell because that's an Adobe Reader toolbar, not the Laserfiche document viewer toolbar. It can only search text if the PDF has an embedded text stream, which is a property of the original PDF itself, not something added by generating text through Laserfiche. Once you have generated pages, toggle to the page view, and Laserfiche search will then be used for search operations.
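To check whether a particular PDF already carries an embedded text stream, a rough heuristic like the one below could help. It is only a sketch using the Python standard library, not anything from Laserfiche: the function name `has_embedded_text` is made up, it only understands uncompressed and Flate-compressed streams, and it can misjudge unusual files.

```python
import re
import zlib

# One match per stream object: group 1 is everything before the "stream"
# keyword (including the stream dictionary), group 2 is the raw stream body.
_STREAM_OBJ = re.compile(rb"\bobj\b(.*?)\bstream\r?\n(.*?)endstream", re.DOTALL)
# Stream dictionaries that describe image data rather than page content.
_IMAGE_DICT = re.compile(rb"/Subtype\s*/Image\b")
# A text object (BT ... ET) containing a text-showing operator (Tj or TJ).
_TEXT_SHOWN = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def has_embedded_text(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any non-image stream appears to draw text."""
    for header, raw in _STREAM_OBJ.findall(pdf_bytes):
        if _IMAGE_DICT.search(header):
            continue  # skip image data, it is not page content
        try:
            # decompressobj tolerates the end-of-line left after the data
            body = zlib.decompressobj().decompress(raw)
        except zlib.error:
            body = raw  # assume the stream is stored uncompressed
        if _TEXT_SHOWN.search(body):
            return True
    return False


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(has_embedded_text(handle.read()))
```

If this reports False for a file, the embedded Adobe Reader search in the file view has nothing to search, and searchable text has to come from generated pages plus OCR instead.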
Thank you for your reply, Justin. I can't get the search to work whether the file view is toggled on or off.
I just wanted to mention that we have verified that the full text search service is running on our application server and we also restarted it, just in case. I have "generated pages" on several other PDFs in the repo this morning with the same result.
Does it work if you do a text search from the repository as a whole? One interesting thing is that the word IS getting highlighted, which means text has been generated there and associated with the page for it to find it - it's just not showing from the doc viewer search.
It does not find it. I highlighted the search term where it appears just to show that it exists. Unfortunately, a document text search in the repo as a whole does not find it either.
Ok, sounds like that's something you'd want to open up a support case on at this point, to dig into why it's not making it into your search index.
Thank you!
@████████ The answer from Laserfiche to our solution provider was: "To get words out of a document, the process extracts text from it. Since it is an image only PDF, they will have to generate LF pages first then OCR. The simplest way to do this for PDFs such as this is to use a locally installed Windows client."
Our users only use the Web Client and we have been generating pages and then generating text. I am confused, as it seems this is what they are saying we should do, but when we do, there is still no searchable text. Do you possibly have any other ideas on this issue?
This is very helpful. I know DCC was set up in the past but something related here must be the issue. We will check that. Thank you!