Want searchable text but not the TIFFs

SELECTED ANSWER

replied on June 26, 2017

Do you just need the PDF to be searchable, or are you trying to use text search in the repository?

If it is the latter, I'm pretty sure OCR is tied to TIFF pages for a variety of reasons, and I don't believe there's any way around that requirement.

If it is the former...

1. Generate your pages (this can't be done with the SDK, so it has to be done manually, with Quick Fields, or with Import Agent).
2. Generate the OCR text (this could be combined with step 1, depending on your approach).
3. Use the SDK's DocumentExporter to create a PDF, and set IncludeText in the export options.
4. Ditch the TIFF pages and replace your original PDF with the new text-searchable version.

Now that you have a searchable PDF, you might be able to extract the text stream without generating pages, and get searchable text with no TIFF pages (results might be spotty depending on the document).

The problem you're facing is that OCR requires an image, and text "extraction" requires a text stream to be present in the electronic document. From the sound of things, you don't have either at the moment.

However, I'm not sure it's possible to automate text extraction, so that could be the biggest sticking point.


replied on June 29, 2017

Thank you -- I'm going to give that a try.


replied on June 29, 2017

It worked! The steps:

1. "Generate Pages" in LF Client.

2. "Generate Searchable Text" in LF Client.

3. With SDK, export the PDF with text. The C# code is:

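// di is the DocumentInfo for the document whose pages and text were generated in steps 1 and 2.
DocumentExporter de = new DocumentExporter();
de.ExportPdf(di, di.AllPages, PdfExportOptions.IncludeText, "C:\\temp\\temp.pdf");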

4. Extract the text from the exported PDF file, then import it into the document with the SDK. I used the free iTextSharp library for the extraction. The C# code is:

using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Laserfiche.DocumentServices;

// di is the same DocumentInfo used for the export in step 3.
DocumentImporter dimp = new DocumentImporter();
dimp.Document = di;

using (PdfReader reader = new PdfReader("C:\\temp\\temp.pdf"))
{
    StringBuilder sb = new StringBuilder();

    // iTextSharp page numbers are 1-based.
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        sb.Append(PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()));
    }

    // Import the extracted text into the document as a UTF-8 stream.
    byte[] byteArray = Encoding.UTF8.GetBytes(sb.ToString());
    using (MemoryStream stream = new MemoryStream(byteArray))
    {
        dimp.ImportText(stream, Encoding.UTF8);
    }
}
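
One thing that may be worth watching in the loop above: each page's text is appended directly after the previous page's, so the last word of one page can fuse with the first word of the next, which could hurt search hits. Below is a minimal sketch of one way to handle that, using only the .NET base class library. The PageTextJoiner class and its method names are hypothetical (they are not part of the Laserfiche SDK or iTextSharp); the idea is to collect the per-page strings, join them with a newline, and pass the resulting stream to ImportText as before.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the Laserfiche SDK or iTextSharp.
public static class PageTextJoiner
{
    // Joins per-page text with a newline so words at page boundaries stay separate.
    public static string JoinPages(IEnumerable<string> pageTexts)
    {
        var sb = new StringBuilder();
        foreach (string page in pageTexts)
        {
            // Skip pages that produced no text (for example, a page with no text layer).
            if (string.IsNullOrWhiteSpace(page))
            {
                continue;
            }
            if (sb.Length > 0)
            {
                sb.Append('\n');
            }
            sb.Append(page.Trim());
        }
        return sb.ToString();
    }

    // Wraps the text in a UTF-8 encoded stream (Encoding.UTF8.GetBytes adds no byte order mark).
    public static MemoryStream ToUtf8Stream(string text)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(text));
    }
}
```

To use it with the code above, the loop would add each GetTextFromPage result to a List<string> instead of appending to the StringBuilder, and the stream from ToUtf8Stream(JoinPages(pages)) would be passed to dimp.ImportText with Encoding.UTF8.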
