Question

Extracted text maximum line length

asked on January 24, 2018

I have enabled the "Automatically extract text when saving documents from Microsoft Office" option for Generate Text, and the extracted text seems to have a limit on the number of characters per line.

 

For example, suppose a Word document has one paragraph, without any line breaks:

Laserfiche enables organizations to manage documents, videos, photos and other content. Investing in a content services platform will help your company eliminate paper, optimize costs and power innovation.

Then the generated text displays:

Laserfiche enables organizations to manage documents, videos, photos and other content. Investing in a content services platform will help your company eliminate paper, optimize costs
 and power innovation.

A newline character seems to be added after "costs".

 

The reason I care about this is that I am using the generated text as a token in a Workflow. Does anyone know about this limitation? 

 

Any help would be appreciated. Thanks in advance.

Replies

replied on January 25, 2018

I can't seem to duplicate the issue, and I'm pretty sure that there aren't any settings that affect this. Are you running into this issue consistently with all Word documents? Do you have a sample that is safe to share?

There really isn't a guarantee that extracted text will be useful for anything other than supporting full-text searching. Given the way text is stored in a Word document, there may be factors at work that aren't readily visible just by looking at it. You might open one of the affected documents and show formatting marks to see if anything unexpected is there.

That being said, I've never had an issue with a contiguous paragraph like you are describing.

For reference, hidden formatting characters in Word can be shown with the Show/Hide ¶ button on the Home tab (Ctrl+Shift+8 on Windows).

replied on January 25, 2018

This issue is consistent across all Word documents in Laserfiche. I can create a new Word document with a long paragraph, and the generated text page will contain its own line breaks, which differ from the original document.

Below is the text panel of the document preview for a sample Word document that contains only one very long paragraph. In the original, there is no newline mark until the end of the paragraph. If I make the text panel wide enough, I can see that the paragraph is divided into several lines. (It is not word-wrapped the way Notepad does it.)

I found this while trying to find a way to convert a Word file into PDF, and I was wondering if this is something that can be changed or fixed with the OCR process.

replied on April 6, 2018

I'm experiencing the same issue. I am attempting to import Word documents and text files (plain text and CSV) into the repository, both manually and via the Import Agent. In all cases, it looks like the lines in the extracted text are broken (at a space) around the 180-character mark. This lines up well with the example you posted originally.

I was hoping to use the Import Agent to import CSV files provided to us by a third party into the repository, triggering a Workflow process that would parse the text of the CSV and create a number of Forms instances. It doesn't look like this will work, however, as each record line in the CSV is much longer than the point where the extracted text is being broken apart.
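
If the extracted text has to be used anyway, the inserted breaks might be undone after the fact. The following is only a rough Python sketch of that idea, not a Laserfiche or Workflow API. It assumes what the examples in this thread suggest: the break falls at a space, the continuation line starts with that space, and the broken line is roughly 180 characters long. The function name `rejoin_wrapped_lines` and the `min_len` threshold are hypothetical.

```python
def rejoin_wrapped_lines(text: str, min_len: int = 170) -> list[str]:
    """Merge lines that look like continuations of an over-long line.

    Assumption (taken from the examples in this thread, not from any
    documented behavior): a continuation line starts with a space and
    follows a physical line of at least `min_len` characters.
    """
    merged: list[str] = []
    prev_len = 0  # length of the previous physical line
    for line in text.splitlines():
        looks_wrapped = prev_len >= min_len and line.startswith(" ")
        if merged and looks_wrapped:
            merged[-1] += line  # the leading space restores the word gap
        else:
            merged.append(line)
        prev_len = len(line)
    return merged
```

This is a heuristic: a genuine line that happens to start with a space after a long line would be merged incorrectly, so it is safer to read the original file when that is an option.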

replied on April 6, 2018

As a workaround, I used "Download Electronic Document" in Workflow and worked with the result instead of the extracted text. It kept the full length of the original text lines when passed to another Workflow activity such as "Update Word Document". I am not really sure if it would work for CSV files, though.
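
Whether this workaround carries over to CSV files is not confirmed in this thread. As an illustration only, the Python sketch below parses the bytes of the original file rather than the extracted text, so record lines keep their full length. How the bytes are obtained from Workflow is not shown, and the function name `parse_csv_bytes` and the default encoding are assumptions.

```python
import csv
import io


def parse_csv_bytes(data: bytes, encoding: str = "utf-8-sig") -> list[list[str]]:
    """Parse CSV records from the raw bytes of the original file.

    `data` is assumed to be the downloaded electronic document; the
    encoding is a guess and may need to match the third party's export.
    """
    text = data.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can
    # handle them itself, including newlines inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))
```

Because the parser works on the file content itself, a record longer than 180 characters stays a single row.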
