Hi,
Is it possible to make a scanned PDF file text-searchable (perform an OCR on the document) using the .NET SDK version 8.2?
I tried using the OcrEngine, but it does not work because my document does not have pages.
Thanks,
```csharp
LFDocument Doc = null;
LFConnection conn = null;
try
{
    int docId = 599759;

    // Creates a new application object.
    LFApplication app = new LFApplication();
    // Finds the appropriate server.
    LFServer serv = (LFServer)app.GetServerByName("server");
    // Gets the repository from the server.
    LFDatabase db = (LFDatabase)serv.GetDatabaseByName("rep");
    // Creates a new LFConnection object.
    conn = new LFConnection();
    // Sets the user name and password.
    conn.UserName = "username";
    conn.Password = "password";
    // Connects to repository.
    conn.Create(db);

    // Instantiates a new text extractor.
    TextExtractor MyTE = new TextExtractor();
    // Retrieves a document.
    Doc = (LFDocument)db.GetEntryByID(docId);

    /* Extracts text from the electronic file associated with
     * the document and appends new text pages to the
     * Laserfiche document. */
    Console.WriteLine(string.Format("Start Extract on document {0}", Doc.FullPath));
    MyTE.ExtractText(Doc, 1, Import_Page_Action.IMPORT_PAGE_ACTION_OVERWRITE);
    Console.WriteLine(string.Format("Done Extract on document {0}", Doc.FullPath));
}
catch (Exception error)
{
    string errorMsg = error.Message;
    Console.WriteLine(errorMsg);
}
finally
{
    if (Doc != null)
        Doc.Dispose();
    if (conn != null)
        conn.Terminate();
}
```
This does nothing to the PDF file. Am I doing something wrong?
The Laserfiche Text Extraction will only extract text that is already in the PDF. It will not create a text layer in a PDF that has no text.
If you are using Acrobat to make your PDFs, you can make them searchable before bringing them into Laserfiche. Once a PDF is searchable, you can use the iFilters to extract text from it. https://acrobatusers.com/tutorials/how-to-create-a-searchable-pdf-file
For electronic files such as a PDF, you need to perform a text extraction using TextExtractor, which is covered in the SDK documentation.
OCR (Optical Character Recognition) is a process by which a computer attempts to "read" the text in an image, so it is used on Laserfiche image pages to turn a picture of text into text. Electronic files, however, typically already have text associated with them, and that text simply has to be saved in a form the Laserfiche search engine can use. This process is text extraction.
What exactly does the TextExtractor do? I tried it and it does not seem to do anything to the PDF file. It does not create pages or convert the file into a searchable PDF.
I should also mention that the PDF file has no text, since it was scanned. So it needs to be OCRed before the text extraction.
Am I not understanding this correctly?
Sorry, I did not notice that you specified that the PDF was a scanned document.
If there is no text associated with the PDF itself, then there is nothing for TextExtractor to do. You would need to generate pages from the PDF, then run OCR on the generated pages. Our OCR engine can only OCR image pages.
I do not believe it is possible to generate pages with the SDK 8.2 methods (as mentioned in this post, we cannot directly enable page generation from PDFs because of third-party licensing).
In SDK 9.0+, CAT (Client Automation Tools) allows you to access the image generation tools available (Snapshot and PDF page generation) by launching and manipulating the Laserfiche Client.
Quick Fields can also generate pages, if you want to process the documents with Quick Fields first and then run the SDK application.
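To make the choice between the paths described in this reply easier to follow, here is a minimal sketch of the decision in plain C#. It does not call the Laserfiche SDK at all: `TextStrategy`, `TextStrategyPicker`, and the two inputs (`pdfHasEmbeddedText`, `imagePageCount`) are hypothetical names made up for illustration, and how you determine those two values in your own application is left open.

```csharp
using System;

// Hypothetical names throughout; nothing here is part of the Laserfiche SDK.
enum TextStrategy
{
    ExtractText,           // the electronic file already contains text
    OcrExistingPages,      // image pages exist, so OCR can run on them
    GeneratePagesThenOcr   // scanned PDF with no pages: generate pages first
}

static class TextStrategyPicker
{
    // Chooses a processing path from two facts about the document:
    // whether its electronic file has embedded text, and how many
    // image pages the Laserfiche document currently has.
    public static TextStrategy Choose(bool pdfHasEmbeddedText, int imagePageCount)
    {
        if (pdfHasEmbeddedText)
            return TextStrategy.ExtractText;
        if (imagePageCount > 0)
            return TextStrategy.OcrExistingPages;
        return TextStrategy.GeneratePagesThenOcr;
    }
}

static class Program
{
    static void Main()
    {
        // A scanned PDF with no text layer and no generated pages.
        Console.WriteLine(TextStrategyPicker.Choose(false, 0)); // prints GeneratePagesThenOcr
    }
}
```

For the document in this thread (scanned, no text, no pages), the sketch lands on the last branch: pages have to be generated, for example through CAT in SDK 9.0+ or through Quick Fields, before OCR can run.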
What application did you use when you created the PDF?
Hi Kevin,
If your question has been answered, please let us know by clicking the "Mark this reply as the answer" button on the appropriate response.
If you still need assistance with this matter, just update this thread. Thanks!