Question

Extract text from an imported document using DocumentImporter.ExtractTextFromEdoc

SDK
asked on June 22, 2015

Hello,

I want to extract text from an imported PDF file. The SDK has an option (DocumentImporter.ExtractTextFromEdoc) that is supposed to do this. The document imports without any issues, but it has no searchable text. I have also attached a sample file that I am using for testing. The file definitely has searchable text.

What am I doing wrong?

    // Requires: using System.IO; using Laserfiche.RepositoryAccess; using Laserfiche.DocumentServices;

    // Dispose the DocumentInfo with a using statement to prevent a memory leak
    using (DocumentInfo doc = new DocumentInfo(mySess))
    {
        // Check whether the destination folder exists
        EntryInfo docentryinfo = Entry.TryGetEntryInfo(laserfichedestinationfolder, mySess);
        if (docentryinfo != null)
        {
            // Create a document in the destination folder
            doc.Create(laserfichedestinationfile, "DEFAULT", EntryNameOption.AutoRename);
            docentryinfo.Lock(LockType.Exclusive);
            try
            {
                DocumentImporter DI = new DocumentImporter();
                DI.Document = doc;

                // Has no visible effect: the imported document has no searchable text
                DI.ExtractTextFromEdoc = true;

                // Import the local file into the Laserfiche document
                DI.ImportEdoc(System.Web.MimeMapping.GetMimeMapping(Path.GetFileName(localfilepath)), localfilepath);
            }
            finally
            {
                // Release the folder lock even if the import throws
                docentryinfo.Unlock();
            }
        }
    }

 

Attachment: lfpagessnip.JPG (208.75 KB)

Answer

SELECTED ANSWER
replied on October 13, 2015

Please copy or export the PDF file to a local folder on the machine where you run the SDK, and then run the following commands:

"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider64.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-64.txt

and

"C:\Program Files (x86)\Common Files\Laserfiche\Text Provider\TextProvider.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-32.txt

After that, check the content of c:\your-text-file-output-64.txt and c:\your-text-file-output-32.txt.

If either text file is empty, please uninstall all the PDF iFilters (Adobe / Adobe Reader), install the one from https://support.laserfiche.com/KB/1011240, and try again.

On a 32-bit machine, run the following command instead:

"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider.exe" -cmd -ExtractTextFromFile c:\your-pdf-file.pdf c:\your-text-file-output-32.txt

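The check described above can also be scripted. The following is a minimal C# sketch, not part of the original answer, that runs each Text Provider executable on one PDF and prints whether any text came out. The executable paths and the -cmd -ExtractTextFromFile arguments are taken from the commands above; the class name, the output file names in the temp folder, and the quoting of the file paths are assumptions made for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;

class TextProviderCheck
{
    // Runs one Text Provider executable on the PDF and reports how much text it wrote.
    static void Run(string exePath, string pdfPath, string outputPath)
    {
        if (!File.Exists(exePath))
        {
            Console.WriteLine("Not installed: " + exePath);
            return;
        }

        // Remove output from an earlier run so a stale file is not mistaken for success.
        File.Delete(outputPath);

        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            // Same switches as in the commands quoted in the answer.
            Arguments = "-cmd -ExtractTextFromFile \"" + pdfPath + "\" \"" + outputPath + "\"",
            UseShellExecute = false
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
        }

        long length = File.Exists(outputPath) ? new FileInfo(outputPath).Length : 0;
        Console.WriteLine(exePath + ": " + (length == 0 ? "no text extracted" : length + " bytes of text"));
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: TextProviderCheck <path-to-pdf>");
            return;
        }

        string pdf = Path.GetFullPath(args[0]);
        string temp = Path.GetTempPath(); // hypothetical output location

        // 64-bit Windows: 64-bit and 32-bit Text Provider.
        Run(@"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider64.exe", pdf, Path.Combine(temp, "text-output-64.txt"));
        Run(@"C:\Program Files (x86)\Common Files\Laserfiche\Text Provider\TextProvider.exe", pdf, Path.Combine(temp, "text-output-32.txt"));
        // 32-bit Windows: the 32-bit Text Provider lives under Program Files.
        Run(@"C:\Program Files\Common Files\Laserfiche\Text Provider\TextProvider.exe", pdf, Path.Combine(temp, "text-output-32-only.txt"));
    }
}
```

Executables that are not present are simply reported as not installed, so the same program can be run on either kind of machine.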
 


Replies

replied on June 22, 2015

Text extraction from electronic documents is done with iFilters. Make sure that both the server and the machine doing the import have the correct iFilter for the file format.
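To see which PDF iFilter, if any, is registered on a machine, one option is to walk the registry entries that Windows uses for IFilter lookup. This sketch is not from the thread. It assumes the standard registration layout (file extension, then PersistentHandler, then PersistentAddinsRegistered, then InprocServer32) under HKEY_LOCAL_MACHINE\SOFTWARE\Classes, and it checks the 32-bit and 64-bit registry views separately. The class and method names are illustrative.

```csharp
using System;
using Microsoft.Win32;

class PdfFilterLookup
{
    // Interface ID of IFilter, used as a sub-key name under PersistentAddinsRegistered.
    const string IFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    // Returns the default (unnamed) value of a key, or null if the key does not exist.
    static string DefaultValue(RegistryKey root, string path)
    {
        using (RegistryKey key = root.OpenSubKey(path))
        {
            return key == null ? null : key.GetValue(null) as string;
        }
    }

    static void Lookup(RegistryView view)
    {
        using (RegistryKey machine = RegistryKey.OpenBaseKey(RegistryHive.LocalMachine, view))
        {
            const string classes = @"SOFTWARE\Classes\";

            // .pdf -> persistent handler CLSID -> filter CLSID -> DLL that implements the filter
            string handler = DefaultValue(machine, classes + @".pdf\PersistentHandler");
            string filter = handler == null ? null
                : DefaultValue(machine, classes + @"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IFilterIid);
            string dll = filter == null ? null
                : DefaultValue(machine, classes + @"CLSID\" + filter + @"\InprocServer32");

            Console.WriteLine(view + ": " + (dll ?? "no PDF iFilter found"));
        }
    }

    static void Main()
    {
        Lookup(RegistryView.Registry32);
        Lookup(RegistryView.Registry64);
    }
}
```

If the two views report different DLLs, or only one reports a DLL, that would be consistent with the 32-bit and 64-bit iFilters being installed independently, as discussed later in this thread.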

replied on August 21, 2015

There were problems with the PDF iFilters in version 10 and higher. Remove the v11 PDF iFilters, then download and install Acrobat Reader 9.

replied on June 30, 2015


I have installed the PDF iFilter from http://www.adobe.com/support/downloads/thankyou.jsp?ftpID=5542&fileID=5550 on both the client and the server, but the problem still persists.

Any clue?

 

replied on June 30, 2015

There are widely reported issues with Adobe's v11 iFilter. Please uninstall v11 and install v9 or v7 (v7 is faster):

ftp://ftp.adobe.com/pub/adobe/reader/win

replied on August 13, 2015

I am guessing this only has to be installed on the Laserfiche server, right?

I am installing AdbeRdr70_enu_full from the following location:

ftp://ftp.adobe.com/pub/adobe/reader/win/7x/7.0/enu/

Is this the version that you mentioned?

 

 

replied on August 13, 2015

The iFilter has to be installed on the machine where the program that calls into DocumentImporter is running.

replied on August 14, 2015

I am running the program on the client, so I am guessing the iFilter has to be installed on the client machine.

I have been advised to install iFilter v7, so I am installing AdbeRdr70_enu_full from the following location:

ftp://ftp.adobe.com/pub/adobe/reader/win/7x/7.0/enu/

Is this the version that you mentioned?

 

replied on August 20, 2015

The iFilter is installed, and it did not fix the issue.

replied on August 21, 2015

I called support and was told to submit troubleshooting details here. This is what happened:

1) I was asked to post on answers.laserfiche.com (which I had already done). Tech support doesn't provide support for SDK programming (makes sense).

2) Tech support talked with the developer and relayed instructions to me.

3) I have installed the iFilter (both 32-bit and 64-bit) on my client, and I told support that. According to http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542, the 32-bit iFilter comes with Adobe Reader itself. The 64-bit iFilter is installed from that same link.

4) Find out whether the iFilter is loading. To do that, I was asked to enable logging on my client machine. I was told to enable both 32-bit and 64-bit logging (just to make sure). The log file is created in C:\logs, but it is empty.

    HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Laserfiche\TextProvider (32-bit)
    HKEY_LOCAL_MACHINE\SOFTWARE\Laserfiche\TextProvider (64-bit)
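To confirm what is actually set under the two keys listed above, a short C# sketch such as the following could be used. It is not from the thread, and it only enumerates whatever values exist, because the names of the logging values are not given here. The class and method names are illustrative.

```csharp
using System;
using Microsoft.Win32;

class TextProviderRegistryDump
{
    // Prints every value under SOFTWARE\Laserfiche\TextProvider for one registry view.
    static void Dump(RegistryView view)
    {
        using (RegistryKey machine = RegistryKey.OpenBaseKey(RegistryHive.LocalMachine, view))
        using (RegistryKey key = machine.OpenSubKey(@"SOFTWARE\Laserfiche\TextProvider"))
        {
            Console.WriteLine(view + ":");
            if (key == null)
            {
                Console.WriteLine("  (key not present)");
                return;
            }

            foreach (string name in key.GetValueNames())
            {
                Console.WriteLine("  " + name + " = " + key.GetValue(name));
            }
        }
    }

    static void Main()
    {
        // On 64-bit Windows the 32-bit view corresponds to the Wow6432Node key.
        Dump(RegistryView.Registry32);
        Dump(RegistryView.Registry64);
    }
}
```

Opening the keys through RegistryView avoids hard-coding Wow6432Node and gives the same result whether the checking program itself is built as 32-bit or 64-bit.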

 

 

replied on August 21, 2015

Hi Bert,

Thank you for the reply. I installed Reader 9, but there are a few issues with it:

1) It extracts text, but I can't see the document in the electronic file pane. I took the following steps to troubleshoot that:

    1a) I made sure that in tools --> options --> view --> open with --> open pdf files by default using, the "The Laserfiche document viewer" option is checked.

    1b) I made sure that in tools --> options --> view --> open with --> electronic file pane, the "Show PDF files in the electronic file pane" option is checked.

    1c) I checked that IE 11 has the Adobe Reader add-on installed. I also tried enabling and disabling it.

    1d) I also checked that the "display pdf in browser" option is selected in Adobe Reader --> preferences --> Internet.

2) It is a security issue if I ask the IT department to install software that is two versions old on the end users' computers.

3) It is inconvenient to use an external application to open the PDF, because the metadata and the text are shown in the Laserfiche document viewer and can't be seen from an external application, which is very counterproductive.

Any thoughts?

Thanks

replied on August 27, 2015

For the benefit of other readers: unfortunately, Tushar's issue is not specific to the SDK but lies with his client install. He has opened a Support case for his client issue.

replied on September 16, 2015

Update: I have installed the latest version of the client and I am using the latest SDK. I can extract text using the client, but I still cannot extract text using the SDK.

Text extraction works in the latest Laserfiche client using both:

1) native text extraction, and
2) extraction using the PDF iFilter (see the attached image for details).

I have PDF iFilter 9 (64-bit) installed together with Adobe Reader 11.

Here is the code that I am using with the SDK:

    using (DocumentInfo doc = new DocumentInfo(mySess))
    {
        try
        {
            // Check whether the destination folder exists
            folderdocentryinfo = Entry.TryGetEntryInfo(laserfichedestinationfolder, mySess);
            if (folderdocentryinfo != null)
            {
                // Create a document in the destination folder
                doc.Create(laserfichedestinationfile, "DEFAULT", EntryNameOption.AutoRename);
                folderdocentryinfo.Lock(LockType.Shared);
                DocumentImporter DI = new DocumentImporter();
                DI.Document = doc;
                // Doesn't work
                DI.ExtractTextFromEdoc = true;
                docid = doc.Id;
                // Import the local file into the Laserfiche document
                DI.ImportEdoc(System.Web.MimeMapping.GetMimeMapping(Path.GetFileName(localfilepath)), localfilepath);
            }
        }
        catch (Exception)
        {
            // Rethrow with "throw;" rather than "throw e;" to preserve the original stack trace
            throw;
        }
        finally
        {
            // folderdocentryinfo and documententryinfo are declared outside this snippet
            if (folderdocentryinfo != null)
                folderdocentryinfo.Unlock();
            if (documententryinfo != null)
                documententryinfo.Unlock();
        }
    }

 

Attachment: lfsnip.JPG (168.33 KB)

replied on September 30, 2015

Any ideas?

replied on October 19, 2015

With iFilter 5, the SDK text extraction works (except for one file that may have been a bad file; we will monitor it).

The computer running the application is 64-bit, but the application is compiled as 32-bit and Outlook/Office is 32-bit. This should be a supported configuration for us.

We are able to have iFilter 5 installed alongside a later version of Adobe Acrobat.
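Because this thread distinguishes between 32-bit and 64-bit iFilters and Text Provider executables, it can help to confirm which bitness the importing application actually runs as. The following is a small C# sketch, not from the thread, using standard .NET properties. The class name is illustrative, and the assumption that a 32-bit process relies on the 32-bit components is inferred from the replies above rather than confirmed.

```csharp
using System;

class BitnessCheck
{
    static void Main()
    {
        // True on 64-bit Windows, regardless of how this program was compiled.
        Console.WriteLine("64-bit operating system: " + Environment.Is64BitOperatingSystem);

        // False when the program is compiled as 32-bit (x86), even on 64-bit Windows.
        Console.WriteLine("64-bit process:          " + Environment.Is64BitProcess);

        // 4 in a 32-bit process, 8 in a 64-bit process.
        Console.WriteLine("Pointer size (bytes):    " + IntPtr.Size);
    }
}
```

For the result to be meaningful, these lines need to run inside the application that calls DocumentImporter, or in a program built with the same platform target.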
