Question

Generate Text vs Generate Pages - The Difference? PDF vs DOC

asked on June 2, 2014

We recently switched our systems to LF and have run into a dilemma: should we use the Generate OCR Text option or the Generate Pages option for PDF and DOC files?

 

We noticed that when using the Generate Pages option on PDFs, our disk utilization jumped from 195 GB to 219 GB. That single Generate Pages run covered PDFs that total only 11 GB.
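
To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes the whole increase in disk usage came from the generated pages and nothing else was written to the volume in the meantime.

```python
# Figures quoted in the post, in gigabytes.
disk_before_gb = 195
disk_after_gb = 219
pdf_total_gb = 11

# Space consumed during the Generate Pages run
# (assumption: nothing else grew on the disk in the meantime).
growth_gb = disk_after_gb - disk_before_gb

# Extra space used per gigabyte of source PDF.
growth_ratio = growth_gb / pdf_total_gb

print(f"Growth: {growth_gb} GB ({growth_ratio:.1f}x the size of the source PDFs)")
```

Under that assumption, the generated pages take roughly twice as much space as the PDFs they came from.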

 

So I came here to see if anyone can shed more light and explain the technical difference between generating OCR text and generating pages. Also, would PDFs require more space for Generate Pages than DOC files do?

 

We still have tons of files to OCR so that our Context Hits show the location of results when searching.

 

Also, is there a way to delete the pages generated for files?


Replies

replied on June 2, 2014

Generating Text and Generating Pages are two completely different processes.

 

Generating Text generates a .txt file on the server that contains the OCR'ed text of the document for searching purposes.

 

Generating Pages actually creates TIFF images of the PDF (or other electronic document).

 

So the first is a small text file that just contains text.

 

The second creates entirely new TIFF images, which are going to be larger. If you need to convert the PDFs to TIFF, then you use this process (going to Tasks - Delete Electronic Files afterwards to remove the PDF).
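
To give a feel for the size gap between the two outputs, here is a rough sketch. Every input below is an illustrative assumption (US Letter page, 300 DPI, black-and-white image, about 3000 characters of text per page), not a Laserfiche default, and the image figure is the uncompressed size. Real TIFF files are normally compressed, so actual sizes will be smaller.

```python
# Rough per-page size comparison: generated text vs. generated image.
# All inputs are illustrative assumptions, not Laserfiche defaults.
PAGE_WIDTH_IN = 8.5      # US Letter width in inches
PAGE_HEIGHT_IN = 11.0    # US Letter height in inches
DPI = 300                # assumed render resolution
BITS_PER_PIXEL = 1       # assumed black-and-white (bitonal) image
CHARS_PER_PAGE = 3000    # assumed amount of text on a dense page


def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one page image, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def text_bytes(chars):
    """Size of plain text at one byte per character."""
    return chars


image_size = raw_image_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BITS_PER_PIXEL)
text_size = text_bytes(CHARS_PER_PAGE)

print(f"Image page (uncompressed): {image_size:,} bytes")
print(f"Text page:                 {text_size:,} bytes")
print(f"Image is about {image_size // text_size}x larger before compression")
```

Even allowing for compression, an image of every page costs far more disk space than the text extracted from it, which is consistent with the jump described in the question.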

replied on June 2, 2014

Chris,

I see there are two options for generating pages from a PDF as well. Using Snapshot Printer, the generated pages are much smaller in size compared to the default setting on the client. What is the difference there in TIFF generation?

 

I need to figure out a way to delete all pages for PDFs. Is that feasible without the SDK?

replied on June 2, 2014

You can generate image and text pages from the electronic files associated with electronic documents in one of two ways: PDF electronic documents can have image pages generated natively from the PDF file (this method can retain PDF annotations) or you can generate pages from any type of document (including, but not limited to, PDF files) by processing them with Snapshot.

 

Usually people import PDFs, convert them to TIFFs, and delete the PDFs. But just to confirm, you wish to find the PDFs that you have generated pages for and delete the TIFF images, not the PDFs?

replied on June 2, 2014

That would be correct. I need to remove the TIFFs because users primarily open the PDF in its native app instead of viewing the Laserfiche pages.

 

One option would be to revert the version and delete the other versions of the document, but it seems I cannot automate this. I looked in Workflow Designer and did not see any controls for removing versions, or even for removing LF pages from a document. I do not want to go through the 3K pages we processed over the weekend one by one by hand!

replied on June 2, 2014

I am working on this Workflow. Basically, I search the repository first to find the documents I want, and a second search outputs 1 file only (this will be the dumping file where pages get added). Then, using the Move Pages option, the pages are moved to the document found by the second search.

 

Would that be a simple solution? I am not sure how it impacts the version control of the file having its pages stripped. I do not want to keep a version with pages that the document can be reverted to.

 

replied on June 2, 2014

Yes, you would do something like that.  Not sure what your second search would be for, but you would need a For Each loop and a Create Entry.

 

1.  Search for your PDFs.

     For Each returned result

          2.  Create an empty document with the same name as the Current Entry you are on (%(ForEachEntry_CurrentEntry_Name)). Put it in a new separate folder.

          3.  Move the pages from the results found to the newly created empty doc.

 

Just do a few tests to make sure it does what you need it to do.
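
For anyone who wants to reason about the loop before building it, here is a minimal Python simulation of the steps above. It runs entirely in memory: the `Document` class, the `strip_pages` function, and the sample entries are hypothetical stand-ins and are not Workflow activities or Laserfiche SDK calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Hypothetical stand-in for a repository entry (not a Laserfiche type)."""
    name: str
    has_pdf: bool = False
    pages: List[str] = field(default_factory=list)


def strip_pages(search_results, dump_folder):
    """For each result, create an empty doc with the same name and move the pages into it."""
    for entry in search_results:
        # Step 2: empty document named after the current entry, in a separate folder.
        receiver = Document(name=entry.name)
        dump_folder.append(receiver)
        # Step 3: move (not copy) the pages to the new document.
        receiver.pages.extend(entry.pages)
        entry.pages.clear()
    return dump_folder


repository = [
    Document("invoice.pdf", has_pdf=True, pages=["p1.tif", "p2.tif"]),
    Document("memo.pdf", has_pdf=True, pages=["p1.tif"]),
    Document("scan-only", has_pdf=False, pages=["p1.tif"]),
]

# Step 1: search for PDFs that have generated pages.
results = [doc for doc in repository if doc.has_pdf and doc.pages]
dump = strip_pages(results, [])

for doc in repository:
    print(doc.name, "pages left:", len(doc.pages))
```

The scanned document without a PDF is left alone because the search in step 1 never returns it, which is the behavior to verify in your own test run.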

 

replied on June 2, 2014

This is what I have come up with. Too bad there is no easy Workflow button that removes versions of a document; it would have made things easier.

 

replied on June 3, 2014

I built a working workflow that removes the pages, stores the extracted pages (which I will delete manually), and deletes the original file (to get rid of its version history). I wish the LF developers would add a simple feature/button inside Workflow Designer to remove versions from a file. If it existed, I would not have had to build this lengthy workflow :(.

 

replied on June 4, 2014

Hi Sumeet,

 

First of all, let's address the root cause here. You're finding that TIFF images are generated for your PDFs, but you don't want TIFF images. If that's the case, you can disable the Generate Laserfiche Pages option upon import.

Make sure that box is unchecked, and you won't automatically have TIFF pages generated. You'll still have the option to generate pages in the system after the fact, if desired.

 

As for removing the pages that you already have, all of you are right that Workflow is the way to go. It sounds like you've got something that works already, but here's how I would approach this. 

 

  1. Search for PDFs in the system with pages.
  2. For each:
    1. Move the original document into a holding folder
    2. Create a blank doc in that holding folder to receive the pages
    3. Move the pages out into the blank document
    4. Duplicate the document and put the new copy back into the original document's path (by having the original document and the new document in separate folders you'll avoid naming conflicts)
  3. Delete everything in the holding folder once you're done processing each PDF
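
The holding-folder steps above can be sketched as a small in-memory model. Everything here is hypothetical: `Doc`, `duplicate`, and `remove_generated_pages` are illustrative names, folders are plain dictionaries, and the sketch assumes that a duplicated document starts without the original's version history. Confirm that assumption on a test document before relying on it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical repository document (not a Laserfiche type)."""
    name: str
    pdf: Optional[str] = None                          # attached electronic file
    pages: List[str] = field(default_factory=list)     # generated TIFF pages
    versions: List[str] = field(default_factory=list)  # version history


def duplicate(doc: Doc) -> Doc:
    """Assumption: a duplicate keeps current content but has no version history."""
    return Doc(name=doc.name, pdf=doc.pdf, pages=list(doc.pages))


def remove_generated_pages(folder: Dict[str, Doc], holding: Dict[str, Doc]) -> int:
    """Apply steps 1 to 3 to every PDF in `folder` that has pages; return the count."""
    cleaned = 0
    for name in list(folder):                    # 1. PDFs with pages
        if not (folder[name].pdf and folder[name].pages):
            continue
        original = folder.pop(name)              # 2.1 move the original to holding
        holding[name] = original
        receiver = Doc(name=name + " (pages)")   # 2.2 blank doc to receive the pages
        holding[receiver.name] = receiver
        receiver.pages, original.pages = original.pages, []  # 2.3 move the pages out
        folder[name] = duplicate(original)       # 2.4 clean copy back at the original path
        cleaned += 1
    holding.clear()                              # 3. delete everything in holding
    return cleaned


folder = {
    "report.pdf": Doc("report.pdf", pdf="report.pdf",
                      pages=["1.tif", "2.tif"], versions=["v1", "v2"]),
    "notes.pdf": Doc("notes.pdf", pdf="notes.pdf"),
}
holding: Dict[str, Doc] = {}
cleaned = remove_generated_pages(folder, holding)
print(cleaned, folder["report.pdf"])
```

Keeping the original and the copy in separate folders is what avoids the naming conflict mentioned in step 2.4; in the model, that is why the original is popped out of `folder` before the duplicate is written back under the same name.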

 

Finally, what's the motivation for removing the TIFF images? It sounds like you want to do this because your users primarily work with PDFs in a native application and not within Laserfiche. If that's the case, it's worth noting that by default a PDF (whether it has pages or not) will open up in the Laserfiche Document Viewer. Removing the pages won't change this default behavior. However, a user can always right click on the document and select either 'Edit Electronic File' or 'View Electronic File.' This functionality is there for any electronic document. If you would like to remove the TIFF images for a different reason, please let us know so we can better understand your situation!

replied on August 25, 2015

Hello,

Is there a way from the admin console to disable "Generate Laserfiche pages" for all users?

Perhaps by adding or modifying an attribute.

replied on August 26, 2015

Please do not post the exact same question in multiple places: https://answers.laserfiche.com/questions/82745/Generate-Laserfiche-Pages

replied on June 13, 2014

Hey Sumeet,

 

If one of these posts answered your question, go ahead and mark it by pressing the "This answered my question" button on the appropriate response.

 

If you still need assistance with this matter, just update this thread. Thanks!
