Discussion

Workflow SDK script to extract text from pdf.

posted on July 29, 2019

I want to generate text for PDF files WITHOUT creating LF pages. That would be the same as right-clicking the PDF in Laserfiche and selecting "Generate Searchable Text".

Workflow has no such ability, as it can only Schedule OCR for TIFF images using the DCC.

As this is a function in the LF client, and I know it's also an ability of the SDK, I would hope that we could do an SDK script activity to accomplish it.  The problem is that I'm not a developer, so I have no idea how to write the code.

Any of you very smart individuals care to help me out??

Thanks in advance!

replied on July 30, 2019

Here's a very simplified example. However, I believe you need an iFilter installed for the associated file type before the extractor will work.

Be sure to add a reference to Laserfiche.DocumentServices

// initialize extractor
using(TextExtractor te = TextExtractor.LoadExtractor()){
    // get the document info object
    DocumentInfo doc = (DocumentInfo)this.BoundEntryInfo;
    
    // extract the text
    te.ExtractFrom(doc);
    
    // release the document info object
    doc.Dispose();
}

Note that when you test a script in the script editor, it runs on whatever PC you're working on, but when it runs in a workflow, it runs server side. You'll need the iFilter installed in both places.
