Question

Extract text from an imported document using DocumentImporter.ExtractTextFromEdoc

SDK

Updated October 19, 2015

asked on June 22, 2015

Hello,

I want to extract text from an imported PDF file. The SDK has an option (DocumentImporter.ExtractTextFromEdoc) that is supposed to do that. The document imports without issues, but it has no searchable text. I have also attached the sample file that I am using for testing. The file definitely has searchable text.

What am I doing wrong?

// Prevent a memory leak with the using statement
using (DocumentInfo doc = new DocumentInfo(mySess))
{
    // Check that the destination folder exists
    EntryInfo docentryinfo = Entry.TryGetEntryInfo(laserfichedestinationfolder, mySess);
    if (docentryinfo != null)
    {
        // Create a document in the destination folder
        doc.Create(laserfichedestinationfile, "DEFAULT", EntryNameOption.AutoRename);
        docentryinfo.Lock(LockType.Exclusive);
        try
        {
            DocumentImporter DI = new DocumentImporter();
            DI.Document = doc;

            // Doesn't do anything
            DI.ExtractTextFromEdoc = true;

            // Import the local file into the Laserfiche document
            DI.ImportEdoc(System.Web.MimeMapping.GetMimeMapping(Path.GetFileName(localfilepath)), localfilepath);
        }
        finally
        {
            // Release the folder lock even if the import throws
            docentryinfo.Unlock();
        }
    }
}

SouthHadley7229A__App.pdf (108.81 KB)

lfpagessnip.JPG (208.75 KB)

Answer

SELECTED ANSWER

replied on October 13, 2015

Would you please copy or export the PDF file to a local folder on the machine where you run the SDK, and then run the following commands:

"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider64.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-64.txt

and

"C:\Program Files (x86)\Common Files\Laserfiche\Text Provider\TextProvider.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-32.txt

After that, check the contents of c:\your-text-file-output-64.txt and c:\your-text-file-output-32.txt.

If the text files are empty, please uninstall all the PDF IFilters (Adobe / Adobe Reader), install the one from https://support.laserfiche.com/KB/1011240, and try again.

If it is a 32-bit machine, run the following command instead:

"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-32.txt

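The check described in the answer above (look at whether the TextProvider output files contain any text) could be automated with a few lines of C#. This is only a minimal sketch: the class and method names are made up for illustration, the output paths are the placeholder names used in the commands above, and treating a missing or whitespace-only file as "no text extracted" is an assumption, not documented TextProvider behavior.

```csharp
using System;
using System.IO;

static class ExtractionCheck
{
    // Hypothetical helper: returns true when the output file is missing
    // or contains nothing but whitespace.
    public static bool IsEmptyOutput(string path)
    {
        if (!File.Exists(path))
            return true;
        return string.IsNullOrWhiteSpace(File.ReadAllText(path));
    }

    static void Main()
    {
        // Placeholder output paths taken from the commands in the answer.
        string[] outputs =
        {
            @"c:\your-text-file-output-64.txt",
            @"c:\your-text-file-output-32.txt"
        };

        foreach (string path in outputs)
        {
            Console.WriteLine("{0}: {1}", path,
                IsEmptyOutput(path) ? "no text extracted" : "text extracted");
        }
    }
}
```

If only one of the two files has text, that would suggest the IFilter is working for one bitness but not the other.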
Replies

replied on June 22, 2015

Text extraction from electronic documents is done using iFilters. Make sure that the server and the machine doing the import have the correct iFilter for the file format.

replied on August 21, 2015

There were problems with the PDF iFilters in version 10 and higher. Remove the v11 PDF iFilters and then get and install Acrobat Reader 9.

replied on June 30, 2015

I have installed the PDF IFilter from http://www.adobe.com/support/downloads/thankyou.jsp?ftpID=5542&fileID=5550 on both the client and the server, but the problem still persists.

Any clue?

replied on August 13, 2015

The IFilter has to be installed on the same machine where your program that calls into DocumentImporter is running.

replied on August 14, 2015

I am running the program on the client, so I am guessing this has to be installed on the client machine.

I have been advised to install IFilter v7. Hence, I am installing AdbeRdr70_enu_full (from ftp://ftp.adobe.com/pub/adobe/reader/win/7x/7.0/enu/).

Is this the right version that you mentioned?

replied on August 20, 2015

The IFilter is installed, and it didn't fix the issue.

replied on August 21, 2015

I called support and was told to submit troubleshooting details here. This is what happened:

1) I was asked to post on answers.laserfiche.com (which I had already done). Tech support doesn't provide support for SDK programming (makes sense).

2) Tech support talked with the developer and gave me instructions.

3) I have installed the IFilter (both 32-bit and 64-bit) on my client, and I told support that. The 32-bit IFilter comes with Adobe Reader itself according to this documentation (http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542), and the 64-bit one is installed from the same link.

4) Find out whether the IFilter is loading. To do that, I was asked to enable logging on my client machine. I was told to enable both 32-bit and 64-bit logging (just to make sure). The log file is created in C:\logs, but it is empty.

HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Laserfiche\TextProvider (32-bit)

HKEY_LOCAL_MACHINE\SOFTWARE\Laserfiche\TextProvider (64-bit)

32 bit

replied on August 21, 2015

Hi Bert,

Thank you for the reply. I installed Reader 9, but there are a couple of issues with it:

1) It extracts text, but I can't see the document in the electronic file pane. I took the following steps to troubleshoot that:

1a) I made sure that under Tools --> Options --> View --> Open With --> "Open PDF files by default using", the Laserfiche document viewer option is checked.

1b) I made sure that under Tools --> Options --> View --> Open With --> Electronic File Pane, the "Show PDF files in the electronic file pane" option is checked.

1c) I checked that IE 11 has the Adobe Reader add-on installed. I also tried enabling and disabling it.

1d) I also checked that the "Display PDF in browser" option is selected in Adobe Reader --> Preferences --> Internet.

2) It is a security issue if I ask the IT department to install software that is two versions old on the end users' computers.

3) It is inconvenient to use an external application to open the PDF when the metadata and the text are part of the Laserfiche document viewer (the metadata and the text can't be seen in an external application, which is very counterproductive).

Any thoughts?

---Thanks

replied on August 27, 2015

For the benefit of other readers: unfortunately, Tushar's issue is not specific to the SDK but lies with his client install. He has opened a Support case for his client issue.

replied on September 16, 2015

Update: I have installed the latest version of the client, and I am using the latest SDK. I can extract text using the client, but I still cannot extract text using the SDK.

Text extraction works in the Laserfiche client (latest) using both

1) native text extraction and 2) extraction using the PDF IFilter (see image for details).

I have PDF IFilter 9 (64-bit) with Adobe Reader 11 installed.

Here is the code that I am using with the SDK:

using (DocumentInfo doc = new DocumentInfo(mySess))
{
    try
    {
        // Check that the destination folder exists
        folderdocentryinfo = Entry.TryGetEntryInfo(laserfichedestinationfolder, mySess);
        if (folderdocentryinfo != null)
        {
            // Create a document in the destination folder
            doc.Create(laserfichedestinationfile, "DEFAULT", EntryNameOption.AutoRename);
            folderdocentryinfo.Lock(LockType.Shared);
            DocumentImporter DI = new DocumentImporter();
            DI.Document = doc;
            // Doesn't work
            DI.ExtractTextFromEdoc = true;
            docid = doc.Id;
            // Import the local file into the Laserfiche document
            DI.ImportEdoc(System.Web.MimeMapping.GetMimeMapping(Path.GetFileName(localfilepath)), localfilepath);
        }
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace
        throw;
    }
    finally
    {
        // folderdocentryinfo, documententryinfo, and docid are declared outside this snippet
        if (folderdocentryinfo != null)
            folderdocentryinfo.Unlock();
        if (documententryinfo != null)
            documententryinfo.Unlock();
    }
}

lfsnip.JPG (168.33 KB)

replied on September 30, 2015

Any ideas?

replied on October 19, 2015

Using iFilter 5, the SDK text extraction works (except for one file that may have been a bad file; we will monitor).

The computer running the application is 64-bit, but the application is compiled as 32-bit, and Outlook/Office is 32-bit. This should be a supported configuration for us.

We are able to have iFilter 5 installed alongside a later version of Adobe Acrobat.
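Since the working setup here was a 32-bit application on a 64-bit machine, it may help to confirm which case your own program falls into before deciding which IFilter bitness to install. The sketch below only reports the bitness of the running process and the OS using the standard .NET Environment properties. The class name and the description strings are made up for illustration, and the thread does not state which TextProvider executable the SDK invokes in each case, so treat the result as a hint, not a diagnosis.

```csharp
using System;

static class BitnessReport
{
    // Hypothetical helper: turns the two bitness flags into a short label.
    public static string Describe(bool is64BitOs, bool is64BitProcess)
    {
        if (!is64BitOs)
            return "32-bit process on 32-bit OS";
        return is64BitProcess
            ? "64-bit process on 64-bit OS"
            : "32-bit process on 64-bit OS";
    }

    static void Main()
    {
        // Both properties are part of System.Environment (.NET 4.0 and later).
        Console.WriteLine(Describe(
            Environment.Is64BitOperatingSystem,
            Environment.Is64BitProcess));
    }
}
```

Run this from the same project, with the same platform target, as the code that calls DocumentImporter; a separate project built as Any CPU may report a different result.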
