Question

How to make a PDF file text-searchable?

SDK
asked on October 31, 2014

Hi,

Is it possible to make a scanned PDF file text-searchable (perform an OCR on the document) using the .NET SDK version 8.2?

I tried using the OcrEngine, but it does not work because my document does not have any pages.

 

Thanks,

Answer

APPROVED ANSWER
replied on October 31, 2014

Sorry, I did not notice that you specified that the PDF was a scanned document.

If there is no text associated with the PDF itself, then there is nothing for TextExtractor to do. You would need to generate pages from the PDF, then run OCR on the generated pages. Our OCR engine can only process image pages.

I do not believe it is possible to generate pages with the SDK 8.2 methods (as mentioned in this post, we cannot directly enable page generation from PDFs because of third-party licensing).

In SDK 9.0+, CAT (Client Automation Tools) allows you to access the image generation tools available (Snapshot and PDF page generation) by launching and manipulating the Laserfiche Client.

Quick Fields can also generate pages, so you could process the documents with Quick Fields first and then run the SDK application.
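
To summarize the options above, here is a minimal sketch of the decision in plain C#. It does not call the Laserfiche SDK at all: `SearchablePath`, `SearchablePlanner`, and `Choose` are hypothetical names made up for illustration, and the two inputs (whether the PDF already has embedded text, and how many image pages the document has) are assumed to be known from elsewhere.

```csharp
using System;

// Hypothetical types for illustration only; none of these are part of the
// Laserfiche SDK.
public enum SearchablePath
{
    ExtractText,          // the electronic file already carries text
    OcrImagePages,        // the document has Laserfiche image pages
    GeneratePagesThenOcr  // scanned PDF with no text and no pages
}

public static class SearchablePlanner
{
    // Picks the processing route described in the answer above.
    public static SearchablePath Choose(bool pdfHasEmbeddedText, int imagePageCount)
    {
        if (pdfHasEmbeddedText)
            return SearchablePath.ExtractText;
        if (imagePageCount > 0)
            return SearchablePath.OcrImagePages;
        // Nothing to extract and nothing to OCR yet: pages must be
        // generated first (CAT in SDK 9.0+, or Quick Fields).
        return SearchablePath.GeneratePagesThenOcr;
    }
}

public static class Program
{
    public static void Main()
    {
        // A scanned PDF stored as an electronic file: no text, no pages.
        Console.WriteLine(SearchablePlanner.Choose(false, 0));
        // The same document once pages have been generated.
        Console.WriteLine(SearchablePlanner.Choose(false, 3));
        // A PDF that was saved with its text, e.g. from a word processor.
        Console.WriteLine(SearchablePlanner.Choose(true, 0));
    }
}
```

The scanned PDF in this thread falls into the last branch of `Choose`, which is why neither TextExtractor nor the OCR engine has anything to work with until pages exist.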

Replies

replied on October 31, 2014
// Assumes references to the SDK 8.2 LFSO and DocumentProcessor interop assemblies.
LFDocument Doc = null;
LFConnection conn = null;
try
{
    int docId = 599759;

    // Creates a new application object.
    LFApplication app = new LFApplication();
    // Finds the appropriate server.
    LFServer serv = (LFServer)app.GetServerByName("server");
    // Gets the repository from the server.
    LFDatabase db = (LFDatabase)serv.GetDatabaseByName("rep");
    // Creates a new LFConnection object.
    conn = new LFConnection();
    // Sets the user name and password.
    conn.UserName = "username";
    conn.Password = "password";
    // Connects to the repository.
    conn.Create(db);

    // Instantiates a new text extractor.
    TextExtractor MyTE = new TextExtractor();
    // Retrieves a document.
    Doc = (LFDocument)db.GetEntryByID(docId);

    // Extracts text from the electronic file associated with
    // the document and appends new text pages to the
    // Laserfiche document.
    Console.WriteLine("Start Extract on document {0}", Doc.FullPath);
    MyTE.ExtractText(Doc, 1, Import_Page_Action.IMPORT_PAGE_ACTION_OVERWRITE);
    Console.WriteLine("Done Extract on document {0}", Doc.FullPath);
}
catch (Exception error)
{
    // Reports the failure message to the console.
    Console.WriteLine(error.Message);
}
finally
{
    // Releases the document and closes the connection.
    if (Doc != null)
        Doc.Dispose();
    if (conn != null)
        conn.Terminate();
}

 

This does nothing to the PDF file. Am I doing something wrong?

replied on October 31, 2014

Laserfiche text extraction only extracts text that is already in the PDF. It will not create a text layer in a PDF that has none.

replied on November 3, 2014

If you are using Acrobat to make your PDFs, you can make them searchable before bringing them into LF. Once a PDF is searchable, you can use the iFilters to extract text from it. https://acrobatusers.com/tutorials/how-to-create-a-searchable-pdf-file

replied on October 31, 2014

For electronic files such as a PDF, you need to perform a text extraction using TextExtractor, which is covered in the SDK documentation.

OCR (Optical Character Recognition) is a process by which a computer attempts to "read" the text in an image, so it is used on Laserfiche image pages to turn a picture of text into text. Electronic files, however, typically already have text associated with them, and that text simply needs to be saved in a form the Laserfiche search engine can use. This process is text extraction.

 

 

replied on October 31, 2014

What exactly does the TextExtractor do? I tried it and it does not seem to do anything to the PDF file. It does not create pages or convert the file into a searchable PDF.

replied on October 31, 2014

I should also mention that the PDF file has no text, since it was scanned. So it needs to be OCR'd before the text extraction can do anything.

 

Am I not understanding this correctly?

replied on October 31, 2014

What application did you use when you created the PDF?

replied on November 6, 2014

Hi Kevin, 

If your question has been answered, please let us know by clicking the "Mark this reply as the answer" button on the appropriate response.

If you still need assistance with this matter, just update this thread. Thanks!
