Question

Want searchable text but not the TIFFs

asked on June 26, 2017

What I have:  PDFs in LF from a scanned source.

What I need:  The same PDFs with searchable text files, but not TIFF images.

I get nothing when I "generate searchable text" from the scanned PDFs.  I can "generate pages" from them using the LF Client, and then produce the searchable text files.  However, this leaves me with both the TIFF images and the searchable text TXT files.

Is there a way to remove the TIFFs but leave the searchable text TXT files?

I'm pretty handy with the SDK...

Answer

SELECTED ANSWER
replied on June 26, 2017

Do you just need the PDF to be searchable, or are you trying to use text search in the repository?

If it is the latter, I'm pretty sure OCR is tied to TIFF pages for a variety of reasons, and I don't believe there's any way around that requirement.

If it is the former...

  1. Generate your pages (can't do this with SDK so it will need to be done manually, with Quick Fields, or with Import Agent)
  2. Generate the OCR text (could combine this with step 1 depending on your approach)
  3. Use the SDK's DocumentExporter to create a PDF and set IncludeText in the export options
  4. Ditch the TIFF pages and replace your original PDF with the new text-searchable version

Now that you have a searchable PDF, you might be able to extract the text stream without generating pages, and get searchable text with no TIFF pages (results might be spotty depending on the document).

The problem you're facing is that OCR requires an image, and text "extraction" requires a text stream to be in the electronic document. From the sound of things, you don't have either at the moment.

However, I'm not sure it's possible to automate text extraction, so that could be the biggest sticking point.
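
To tell which situation a given PDF is in (image only, or already carrying a text stream), a quick check along these lines might help. This is only a sketch: it assumes iTextSharp 5.x, the third-party library used later in this thread, and the names PdfTextCheck, ContainsText, and ReadPageTexts are made up for illustration; they are not part of the Laserfiche SDK.

```csharp
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of the Laserfiche SDK or iTextSharp.
public static class PdfTextCheck
{
    // True if at least one page's extracted text contains non-whitespace characters.
    public static bool ContainsText(IEnumerable<string> pageTexts)
    {
        return pageTexts.Any(t => !string.IsNullOrWhiteSpace(t));
    }

    // Lazily reads the text of each page with iTextSharp (page numbers are 1-based).
    public static IEnumerable<string> ReadPageTexts(string pdfPath)
    {
        using (PdfReader reader = new PdfReader(pdfPath))
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                yield return PdfTextExtractor.GetTextFromPage(reader, page);
            }
        }
    }
}
```

Calling PdfTextCheck.ContainsText(PdfTextCheck.ReadPageTexts("C:\\temp\\temp.pdf")) on a purely scanned PDF should come back false, which would be consistent with "generate searchable text" producing nothing; a PDF exported with its OCR text should come back true.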

replied on June 29, 2017

Thank you -- I'm going to give that a try.

replied on June 29, 2017

It worked!  The steps:

1. "Generate Pages" in LF Client.

2. "Generate Searchable Text" in LF Client.

3. With SDK, export the PDF with text.  The C# code is:

// di is the DocumentInfo of the document whose pages were generated and OCRed
DocumentExporter de = new DocumentExporter();
de.ExportPdf(di, di.AllPages, PdfExportOptions.IncludeText, "C:\\temp\\temp.pdf");

4. Extract the text from the exported PDF file and import it back into the document with the SDK.  I used the free iTextSharp library for the extraction.  The C# code is:

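using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// di is the same DocumentInfo that was exported in step 3
DocumentImporter dimp = new DocumentImporter();
dimp.Document = di;

using (PdfReader reader = new PdfReader("C:\\temp\\temp.pdf"))
{
    StringBuilder sb = new StringBuilder();

    // iTextSharp page numbers are 1-based
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // Fresh extraction strategy per page; end each page with a line break
        // so the last word of one page does not run into the next page
        sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()));
    }

    // Import the combined text into the document as UTF-8
    byte[] byteArray = Encoding.UTF8.GetBytes(sb.ToString());
    using (MemoryStream stream = new MemoryStream(byteArray))
    {
        dimp.ImportText(stream, Encoding.UTF8);
    }
}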

 
