Text cannot be extracted from this file type

asked on December 21, 2015

The following code attempts to import a PDF into a newly created document and extract the text from it, using LFSO83 and DocumentProcessor83, but gets the following error during the call to ExtractText:

Text cannot be extracted from this file type.

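// Attach the importer to the newly created document entry.
DocumentImporter importer = new DocumentImporter();
importer.Document = (LFDocument)lfEntry;
// Import the PDF as the document's electronic file.
importer.ImportElectronicFile(PdfPath);
// Extract text from the electronic file; this is the call that fails.
TextExtractor te = new TextExtractor();
te.ExtractText((LFDocument)importer.Document, 1, Import_Page_Action.IMPORT_PAGE_ACTION_OVERWRITE);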

A previous LF Answers post states that PDF text can be extracted using the TextExtractor class. Are there prerequisites (like Adobe Reader) that need to be installed on the machine in order to successfully run this code?

SELECTED ANSWER

replied on December 22, 2015

A PDF IFilter is required for text extraction. You can follow https://support.laserfiche.com/KB/1011240 to install the IFilter. If another version of the PDF IFilter (Adobe / Adobe Reader) is already installed, you may need to uninstall it first.

You can use TextProvider.exe to test whether the IFilter works. On a 64-bit machine, run both of the following commands:

"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider64.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-64.txt

and

"C:\Program Files (x86)\Common Files\Laserfiche\Text Provider\TextProvider.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-32.txt

After that, check the content in c:\your-text-file-output-64.txt and c:\your-text-file-output-32.txt. If at least one of them has text, then the IFilter is installed correctly.

On a 32-bit machine, run the following command instead:

"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-32.txt

You may also want to have a look at this post.
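
The TextProvider check described above can also be scripted. The following is a minimal C# sketch, not a Laserfiche-provided tool. It assumes the default install paths and the -cmd -ExtractTextFromFile arguments shown in the commands above, and it assumes TextProvider accepts quoted paths. The names IFilterCheck, HasText, and RunTextProvider are made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class IFilterCheck
{
    // True if the file exists and contains at least one non-whitespace character.
    public static bool HasText(string path)
    {
        return File.Exists(path) && File.ReadAllText(path).Trim().Length > 0;
    }

    // Runs one TextProvider executable with the arguments from the commands above
    // and reports whether the output file ended up with any text in it.
    public static bool RunTextProvider(string exePath, string pdfPath, string outputPath)
    {
        if (!File.Exists(exePath))
            return false; // this TextProvider build is not installed

        File.Delete(outputPath); // avoid reading stale output; does nothing if the file is missing

        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            // Quoting the paths is an assumption; the original commands use unquoted paths.
            Arguments = string.Format("-cmd -ExtractTextFromFile \"{0}\" \"{1}\"", pdfPath, outputPath),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
        }
        return HasText(outputPath);
    }

    static void Main()
    {
        string pdf = @"c:\your-pdf-file.pdf"; // placeholder path from the commands above

        // Executable and output file pairs: 64-bit, 32-bit on a 64-bit machine, 32-bit machine.
        string[][] candidates =
        {
            new[] { @"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider64.exe", @"c:\your-text-file-output-64.txt" },
            new[] { @"C:\Program Files (x86)\Common Files\Laserfiche\Text Provider\TextProvider.exe", @"c:\your-text-file-output-32.txt" },
            new[] { @"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider.exe", @"c:\your-text-file-output-32.txt" }
        };

        bool anyText = false;
        foreach (string[] candidate in candidates)
        {
            bool ok = RunTextProvider(candidate[0], pdf, candidate[1]);
            Console.WriteLine("{0}: {1}", candidate[0], ok ? "text extracted" : "no text (or not installed)");
            anyText |= ok;
        }
        Console.WriteLine(anyText ? "At least one IFilter produced text." : "No text was extracted.");
    }
}
```

As with the manual commands, text from at least one of the executables suggests that the IFilter is installed correctly for that bitness.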

replied on December 23, 2015

Thank you, I did not know about the TextProvider.exe files.

I uninstalled Adobe Acrobat Reader DC v15 and installed IFilter 5 via the link provided, followed by a reboot. However, the TextProvider.exe commands still do not produce text output for the attached PDF file. Is there any logging or other troubleshooting aid that would show why it is failing?

Even if we do get this working, requiring the IFilter could be a distribution and configuration obstacle for us. We will probably explore other routes as well.

BytesTest1_635101086606833545.pdf (5.99 KB)

replied on December 23, 2015

Hello,

The attached PDF file works for me using Adobe Reader 7 (link). Unfortunately, IFilters are third-party applications, and I don't think the PDF IFilter has logs. With PDF IFilters, some versions don't work at all, and some versions work only for some PDF files, so you may need to try several different versions with your files. In your case I suggest trying Adobe Reader 7 first.

replied on December 28, 2015

Thank you again for your assistance. We will explore a non-IFilter solution for now.
