Question

Comparing documents in Laserfiche

asked on May 9, 2018

Dear Laserfiche gurus,

 

Is there a way to compare documents / images to determine whether they are absolutely identical? We have two large sets of identical or similar documents and would love to delete the duplicates, so we are looking for a way to do this with minimal human intervention, since the numbers are large. Is this something a workflow can do? Or is there another tool that can take care of this? Thank you for any ideas!!!

 

Olga.

Answer

SELECTED ANSWER
replied on May 25, 2018

One thing to keep in mind is that no two documents will have the same entry ID. Entry IDs are assigned at the time the document is created, and are unique within a repository.

You could start with a workflow that walks through a search of documents, compares their metadata, and puts the duplicates someplace.

As far as comparing pages, are the contents of the documents only rendered Forms submissions, or can there be more attached? If it's only from Forms, then you should have a fairly consistent base to start comparing pages. Documents that have matching metadata but different page counts need to be looked into further. Documents with matching metadata and the same page counts also need to be looked into, but in a different way.
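To make the page-count split above concrete, here is a minimal sketch. It assumes the page images have already been exported from the repository as raw bytes (how you export them is up to you; nothing here calls Laserfiche). The function names and result labels are made up for illustration. Note that a hash match only proves the bytes are identical; two renderings of the same form can look the same and still differ in bytes.

```python
import hashlib
from typing import List


def page_digest(page_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of one exported page image."""
    return hashlib.sha256(page_bytes).hexdigest()


def compare_documents(pages_a: List[bytes], pages_b: List[bytes]) -> str:
    """Classify two documents by page count, then by page content."""
    # Different page counts: needs a closer look, never an automatic delete.
    if len(pages_a) != len(pages_b):
        return "different_page_count"
    # Same page count: compare page by page, in order.
    digests_a = [page_digest(p) for p in pages_a]
    digests_b = [page_digest(p) for p in pages_b]
    if digests_a == digests_b:
        return "identical"
    return "same_count_different_content"
```

Only the "identical" result is a candidate for automatic cleanup; the other two results go into separate piles for review.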

Are the results of Forms submissions stored in a database somewhere? If so, you can use that to inform the process of weeding out duplicates. Otherwise, you might have to OCR the documents and then pick out differentiating values, and use those. If you have access to the data in database form, you can weed out duplicates that way, and then get rid of documents you don't need.
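If you do end up going the OCR route, picking out differentiating values might look something like the sketch below. The field labels ("Submission ID", "Date") and the patterns are purely hypothetical; you would replace them with whatever actually appears on your forms. The OCR text is assumed to be available as a plain string.

```python
import re
from typing import Dict

# Hypothetical labels and formats; adjust to the text on your own forms.
FIELD_PATTERNS = {
    "submission_id": re.compile(r"Submission ID:\s*(\d+)", re.IGNORECASE),
    "submitted_on": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}


def extract_key_values(ocr_text: str) -> Dict[str, str]:
    """Pull the differentiating values out of one document's OCR text."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            found[name] = match.group(1)
    return found
```

Two documents whose extracted values match are candidates for the same submission. A document where a value is missing is one to look at by hand, since OCR can misread characters.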

If you are allowed to query the Laserfiche repository database directly, you can run queries to look at the metadata and spot duplicates there as a first cut. Then you can save those entry IDs off into a different database, or even just a spreadsheet. I often have tables that have just a single column of entry IDs. I populate them using Laserfiche searches or database queries, and then I have a list of entry IDs that I can act on in Workflow.
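As a rough illustration of that first cut, here is a sketch that runs against a stand-in table in SQLite. The table `doc_metadata` and its columns are invented for the example and are not the Laserfiche repository schema; think of it as metadata you have exported somewhere you can query. The idea is just a GROUP BY on the fields that should match, keeping groups with more than one entry.

```python
import sqlite3
from typing import List


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory table standing in for exported metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_metadata ("
        "entry_id INTEGER PRIMARY KEY, form_name TEXT, "
        "submitter TEXT, submitted_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO doc_metadata VALUES (?, ?, ?, ?)",
        [
            (101, "Leave Request", "olga", "2018-05-01"),
            (205, "Leave Request", "olga", "2018-05-01"),
            (300, "Leave Request", "sam", "2018-05-02"),
        ],
    )
    return conn


def find_metadata_duplicates(conn: sqlite3.Connection) -> List[List[int]]:
    """Return groups of entry IDs whose metadata values are all equal."""
    dup_keys = conn.execute(
        "SELECT form_name, submitter, submitted_on FROM doc_metadata "
        "GROUP BY form_name, submitter, submitted_on "
        "HAVING COUNT(*) > 1 "
        "ORDER BY form_name, submitter, submitted_on"
    ).fetchall()
    groups = []
    for key in dup_keys:
        rows = conn.execute(
            "SELECT entry_id FROM doc_metadata "
            "WHERE form_name = ? AND submitter = ? AND submitted_on = ? "
            "ORDER BY entry_id",
            key,
        )
        groups.append([row[0] for row in rows])
    return groups
```

With the demo rows above, entries 101 and 205 come back as one group and entry 300 is left alone.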

You certainly don't need to hit the database to make this happen. You can do it all by collecting multi-value tokens within a workflow. I don't prefer this because often I'll want to explore documents in several different ways, and add to my list of "To Delete/do operation 'x'" entry ids in several batches over time. It makes it much easier when it's in a separate database table that I can write to. We have a scratchpad database that we use for just this kind of thing.
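For what it's worth, a scratchpad table like the one described can be very small. The sketch below uses SQLite only so that it is self-contained; the table name `to_delete` and the helper names are made up, and any database you can write to would do. The primary key plus INSERT OR IGNORE means you can add overlapping batches over time without creating duplicate rows.

```python
import sqlite3
from typing import Iterable, List


def open_scratchpad(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the scratchpad with a single-column table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS to_delete (entry_id INTEGER PRIMARY KEY)"
    )
    return conn


def add_batch(conn: sqlite3.Connection, entry_ids: Iterable[int]) -> None:
    """Add one batch of entry IDs; IDs already listed are skipped."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO to_delete (entry_id) VALUES (?)",
            [(entry_id,) for entry_id in entry_ids],
        )


def pending_ids(conn: sqlite3.Connection) -> List[int]:
    """Return every entry ID collected so far, in ascending order."""
    rows = conn.execute("SELECT entry_id FROM to_delete ORDER BY entry_id")
    return [row[0] for row in rows]
```

The resulting list is what you would then feed to whatever acts on the entries.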

There's no catchall method for finding duplicates. Just think of it as an iterative process, and use various criteria to whittle the list down one chunk at a time. I'm sorry some of this is vague and rambling; that's the way my process tends to be when I'm faced with these kinds of tasks. Let me know if anything needs clarifying.

Replies

replied on May 11, 2018

It depends... I have some possible solutions in mind.

What technical resources and skills do you have access to?

It also depends on the metadata.

  • Is the metadata identical with identical pages?
    • Doable
  • Is the metadata identical with different pages?
    • Doable, but could get messy
  • Is the metadata different with identical pages?
    • Also messy
  • Is the metadata similar with similar pages?
    • This will need to be done by hand
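The cases above can be written down as a small lookup, which is handy if you want a script to sort document pairs into piles. This is only a sketch: the enum, the labels, and the fallback for combinations not listed above (treated here as "review by hand") are assumptions for illustration.

```python
from enum import Enum


class Match(Enum):
    IDENTICAL = "identical"
    SIMILAR = "similar"
    DIFFERENT = "different"


# (metadata match, page match) -> how the pair would be handled.
HANDLING = {
    (Match.IDENTICAL, Match.IDENTICAL): "automate",
    (Match.IDENTICAL, Match.DIFFERENT): "automate with care",
    (Match.DIFFERENT, Match.IDENTICAL): "automate with care",
    (Match.SIMILAR, Match.SIMILAR): "review by hand",
}


def handling_for(metadata: Match, pages: Match) -> str:
    """Look up the handling for a pair; unlisted cases go to a person."""
    return HANDLING.get((metadata, pages), "review by hand")
```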
replied on May 24, 2018

Hi Devin,

I will answer the second part first. We have a lot of temporary copies of form submissions. These have never been deleted. The metadata of the form submission and the temporary copy is different, but I think we could find something that is common for both (entry ID?).

The pages would sometimes be identical, sometimes similar, sometimes only one is saved but not the other, etc. This is what we are trying to determine. If the two are absolutely the same, delete the temporary copy; if the two are different, compare them to decide which to keep; if only one is saved, keep it and rectify the issue, etc.
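Written as a rough sketch, the rule above could look like this. It assumes each document's content is available as bytes, with None standing for "not saved"; the function name and the result labels are invented for illustration.

```python
from typing import Optional


def decide(submission: Optional[bytes], temp_copy: Optional[bytes]) -> str:
    """Apply the delete / compare / keep rule to one pair of documents."""
    if submission is None and temp_copy is None:
        return "nothing_saved"
    # Only one of the two exists: keep it and look into why.
    if submission is None or temp_copy is None:
        return "keep_and_rectify"
    # Both exist and are byte-for-byte the same: the temporary copy can go.
    if submission == temp_copy:
        return "delete_temp_copy"
    # Both exist but differ: a person decides which one to keep.
    return "compare_and_decide"
```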

I thought a workflow could be created to find documents with identical entry IDs (?) and then compare the pages somehow to determine if the match is 100% or partial, and then delete, keep or review.

I am not entirely sure about all the tools; we use workflows, the SDK, and Quick Fields. Are there any specific ones you have in mind?

Thank you,

Olga.

 

replied on June 6, 2018

Thank you, Devin, for these suggestions. I think this is a very good start!
