Question

Extracting document with SDK - filename extensions are missing

asked on November 18, 2021

I'm using the SDK to extract documents, but have found something strange with some file types.

The code to extract the document is:

Using myFile As FileStream = File.Create(myOutFile)
    Using lfReadStream As LaserficheReadStream = myDocumentInfo.ReadEdoc(zMimeType)
        lfReadStream.CopyTo(myFile)
    End Using
End Using

This works fine if the file is .htm, .zip, .pdf or .eml.  However, for .txt, .tif or .png files, the above results in an empty file.

When I look at the properties of the files I'm working with, the files that don't work indicate they have no 'Electronic Document Properties'.  Looking at the code I'm using, I can see that since I'm using 'ReadEdoc', and there's no electronic document, that would explain why I'm getting an empty file.  So I have a few questions:

  • Why are some types stored as electronic documents while others are not?  Can that be changed?
  • How do I extract the contents of the files when they're not stored as an electronic document?
  • The filenames are returned without an extension, but for the electronic documents that can be determined from the properties.  How can I determine the original filename extension for the files that don't have electronic documents?

 

Thanks!

Answer

SELECTED ANSWER
replied on November 18, 2021

Documents that are not stored as "Electronic Documents" are stored as native Laserfiche Documents, which consist of TIFF pages, so if there is no Electronic Document component it means the document was either created as a native document, converted on import, or pages were generated after import and the original electronic document was removed.

The key thing to note is that purely native documents do not store the original file at all; Laserfiche creates new files from that content and stores the resulting items as a Laserfiche document. Basically, if you import a PDF and store it as a native doc with no eDoc attached, then it is no longer a PDF.

Whether applicable file types are converted to a native format is decided at import/creation. For example, Import Agent has options for generating pages, keeping the original PDF, or both. The client also has similar options when users manually import files. However, these settings only apply to what happens when you bring in a new file; you can't "undo" the conversion for existing documents.

If there is no electronic document, then there is no way to determine the "original" format because that format was not stored in Laserfiche. A native document can have both (we store both for some of our documents), in which case you can have the native pages without losing the original eDoc, but you can't "recover" an eDoc file type if the eDoc was removed or never stored in the repository.

If you need to know the original format, then you would need to save that while you still have the eDoc, either by keeping both or by storing the original file type somewhere else, such as in the metadata.

To provide a comparison/analogy:

You can generate a PDF from a Word document, an Excel file, printing a web page, etc., but once you generate that PDF there is no record of what the document "used" to be and there is no easy way to "reverse engineer" the original file type.

The key difference with Laserfiche is that you can retain that information; however, it must be done while the electronic document file is still available.

replied on November 18, 2021

Jason, thanks for the excellent explanation.

I'm dealing with both documents that were imported from another storage system that is being retired (Fortis), as well as new documents going forward, some of which will be created by the program I'm writing now, some from scanning/faxing, and some manually uploaded through the GUI or Web interface.

Since I was anticipating that determining the original file type might not be possible, I was thinking of adding the extension to the filename, so "Document1.tif" would be loaded as "Document1_tif.tif", "Data1.txt" as "Data1_txt.txt", etc.  Though I also like your idea of saving it in the metadata.
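
For illustration, the naming scheme described above could be sketched roughly as follows. It uses only System.IO.Path and does not touch the Laserfiche SDK; the module and function names are hypothetical, and the recovery function assumes the entry name comes back without its extension (for example "Data1_txt"), as described in the question.

```vbnet
Imports System
Imports System.IO

Module ExtensionNaming
    ' Builds a name that embeds the original extension,
    ' e.g. "Data1.txt" -> "Data1_txt.txt".
    Function NameWithExtensionSuffix(originalFileName As String) As String
        Dim baseName As String = Path.GetFileNameWithoutExtension(originalFileName)
        ' GetExtension returns ".txt" (with the dot), or "" when there is none.
        Dim ext As String = Path.GetExtension(originalFileName).TrimStart("."c).ToLowerInvariant()
        If ext.Length = 0 Then
            Return baseName
        End If
        Return baseName & "_" & ext & "." & ext
    End Function

    ' Recovers the extension from a name such as "Data1_txt".
    ' Returns "" when the name contains no usable "_ext" suffix.
    Function ExtensionFromSuffixedName(entryName As String) As String
        Dim pos As Integer = entryName.LastIndexOf("_"c)
        If pos < 0 OrElse pos = entryName.Length - 1 Then
            Return ""
        End If
        Return entryName.Substring(pos + 1)
    End Function

    Sub Main()
        Console.WriteLine(NameWithExtensionSuffix("Document1.tif"))
        Console.WriteLine(NameWithExtensionSuffix("Data1.txt"))
        Console.WriteLine(ExtensionFromSuffixedName("Data1_txt"))
    End Sub
End Module
```

Note that the recovery step cannot tell a deliberate suffix from an ordinary underscore in a name ("Final_report" would yield "report"), and it stops working as soon as someone renames the entry.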

The data I'm working with now was an initial import from Fortis done by a 3rd party vendor.  Though I've run into some other issues with how that was done, so it's looking like the data will need to be converted again anyway (the vendor won't be happy, since there are many terabytes of documents), so we can include these things when it's redone.

replied on November 18, 2021

I feel your pain. When I first started my current job, my main task was importing about 40-45 million documents from the legacy system into Laserfiche.

About a month into the process we found some issues with how the vendor's import tool was converting the documents and had to wipe everything and start over.

 

As for storing the extension, one big reason you may want to avoid putting it in the name is that names are very easily changed.

Putting it in the metadata means it is out of the way and can easily be locked down by restricting which accounts/users can edit that field.
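
As a rough sketch of the metadata approach, something like the following could record the extension at import time. The field name "Original Extension" is hypothetical and would need to exist in the repository and be available on the entry (for example through its template). The RepositoryAccess calls shown (GetFieldValues, SetFieldValues, Save) are written from memory of the SDK and should be checked against the documentation for your SDK version.

```vbnet
Imports System.IO
Imports Laserfiche.RepositoryAccess

Module OriginalExtensionField
    ' Hypothetical field name; create the field in the repository first.
    Private Const ExtensionFieldName As String = "Original Extension"

    ' Records the source file's extension (without the dot) on the entry.
    ' Intended to be called at import time, while the source path is known.
    Sub SaveOriginalExtension(doc As DocumentInfo, sourcePath As String)
        Dim ext As String = Path.GetExtension(sourcePath).TrimStart("."c).ToLowerInvariant()

        ' Assumed SDK pattern: read the field collection, change it, write it back.
        Dim fields As FieldValueCollection = doc.GetFieldValues()
        fields(ExtensionFieldName) = ext
        doc.SetFieldValues(fields)
        doc.Save()
    End Sub
End Module
```

Restricting who can edit that field is done through the repository's security settings rather than in code.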

replied on November 18, 2021

So it seems to me that native documents are the preferred storage method, and electronic documents are the fall-back for document types LF doesn't support as native.  It seems crazy that the native format doesn't automatically keep track of the original document type.

I will need to discuss this with the users, but they currently pull out documents in the same format they were stored in.  If that is truly the case, is there a downside to removing all the extensions from the File Conversion options and storing everything as an Electronic Document?  My biggest concern would be if there's a big difference in storage size, such as if native documents are compressed more than electronic.

replied on November 18, 2021

Actually, native documents tend to take up a lot more space than electronic documents because TIFF images are fairly high quality and can get pretty large when they're in color or grayscale, especially when the source had a high DPI.

The downside of non-native documents is that you wouldn't have all of the Laserfiche editing capabilities, like annotations and such.

It all depends on how you intend to use Laserfiche; if you're just storing files and metadata and that's all you need, then keeping electronic files is fine. If you need to redact, stamp, add/remove pages, etc., then you'd need native documents.

replied on November 18, 2021

Good point about the annotation.  I know that is very important to the end user for certain documents.  I'll have to discuss this with them and let them decide how they want it handled.

Thanks again for all your input on this!

Replies

replied on November 18, 2021

I've found the answer to my second question as to how to extract the document data.

However, I still need to determine what the original file type was in order to extract it in the correct format.  I don't see anything in the DocumentInfo that provides that.

replied on November 18, 2021

The two ways you're going to know the file type of something that's been imported with "generate pages" is if you also attached the Edoc to the same entry, or you've stored it somewhere - say a field in the entry's metadata ... otherwise it's multiple .tiff files.

Even if you had the original file type, you're going to have a rough road converting it back to its original type, unless it's an image file type.

For instance, if you generate pages on an Excel or Word file (and don't include the Edoc), you're not going to be able to convert it from multiple .tiffs to either of these, so exporting to .pdf is your best option.
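
A minimal sketch of that decision, assuming the DocumentServices DocumentExporter class: export the eDoc when one is attached, otherwise fall back to a PDF built from the generated pages. The module and function names are hypothetical, and the SDK members used here (IsElectronicDocument, Extension, AllPages, ExportElecDoc, ExportPdf, PdfExportOptions.None) are written from memory, so verify them against your SDK version before relying on this.

```vbnet
Imports Laserfiche.DocumentServices
Imports Laserfiche.RepositoryAccess

Module ExportByStorageType
    ' Exports the eDoc when one exists, otherwise the generated pages as a PDF.
    ' Returns the full path of the file that was written.
    Function ExportDocument(doc As DocumentInfo, outPathWithoutExtension As String) As String
        Dim exporter As New DocumentExporter()
        Dim outPath As String

        If doc.IsElectronicDocument Then
            ' The stored eDoc keeps its extension, so the original type is known.
            outPath = outPathWithoutExtension & "." & doc.Extension.TrimStart("."c)
            exporter.ExportElecDoc(doc, outPath)
        Else
            ' No eDoc: only the generated pages exist, so the original type is lost
            ' and PDF is used as a neutral output format.
            outPath = outPathWithoutExtension & ".pdf"
            exporter.ExportPdf(doc, doc.AllPages, PdfExportOptions.None, outPath)
        End If

        Return outPath
    End Function
End Module
```

This does not handle entries that have neither an eDoc nor image pages (for example text-only documents); those would need a separate branch.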

 

 
