Question

Document de-duplication

asked on March 4, 2018

Hi,

One of our customers has a simple-sounding requirement, but the implementation seems anything but. They are a law firm and receive documents containing hundreds of pages per case (paralegal, medical, police reports, etc., rolled into one PDF document). Their problem is that some pages are sometimes repeated, and they want those pages flagged or deleted.

At the moment, they print all pages, lay them on the floor, manually de-duplicate them and scan the unique pages into one final PDF report. Unfortunately, they have no control over the duplication occurring at the source and are looking for ways to automate the de-duplication after they receive the documents.

Can someone please suggest what the best way is to implement this using QuickFields, LF Workflows or client?

Thanks in advance.

Regards,

Adarsh

0 0

Answer

SELECTED ANSWER
replied on March 5, 2018

If the pages are indeed generated documents, then an OCR-only approach might suffice. I was also thinking about scanned forms, especially forms completed by hand. The OCR text of those documents would be an almost exact match, but the handwritten values would be different.

In the quick search I did on code to differentiate between two images, I saw some promising algorithms that basically reduce the dpi of the image, turn it grayscale, adjust for any image size differences, and then create a uniform hash for the image. That is the approach I think I would use, as it could handle both handwritten and generated document pages.
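As a rough illustration of that family of algorithms, here is a minimal Python sketch of an "average hash". The choice of average hash is my assumption, since the post does not name a specific algorithm. The function names are hypothetical, and the input is assumed to be a page image already decoded into rows of 0-255 grayscale values; decoding the image is not shown.

```python
def average_hash(pixels, size=8):
    """Compute a size*size-bit average hash of a grayscale image.

    `pixels` is a list of rows of 0-255 ints with at least `size` rows and
    columns (assumed to be already decoded from the page image).
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to size x size cells by averaging each block of pixels.
    for r in range(size):
        top, bottom = r * height // size, (r + 1) * height // size
        for c in range(size):
            left, right = c * width // size, (c + 1) * width // size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            cells.append(sum(block) / len(block))
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")
```

Two scans of the same page should give hashes with a small Hamming distance rather than an identical value, so any cut-off for "similar enough" would have to be tuned on real pages.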

As you allude, I would probably shy away from Workflow for anything other than a proof of concept. For a production environment, I would suggest a service-based integration that could be triggered by Workflow and/or a business process.

Also, if I were going that far, I would probably build a client plug-in that would provide a better UI for displaying images side by side than the LF client itself.

1 0

Replies

replied on March 5, 2018

Hi Adarsh,

If you're looking to de-duplicate on a page-by-page basis, you could write a workflow script to generate a checksum for each stored page. The solution will alert you to the page numbers that have been duplicated.

The solution could be extended to a more generic de-duplication solution, checking every page or document that enters the repository.
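A minimal Python sketch of the checksum idea follows. It is not a Workflow script, and the function name and input are hypothetical: `pages` is assumed to be a list of bytes objects, one per stored page, and reading them from the repository is not shown.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(pages):
    """Group 1-based page numbers whose raw bytes are identical.

    `pages` is a list of bytes objects, one per page (hypothetical input).
    """
    pages_by_digest = defaultdict(list)
    for number, data in enumerate(pages, start=1):
        digest = hashlib.sha256(data).hexdigest()
        pages_by_digest[digest].append(number)
    # Keep only digests that occur on more than one page.
    return [nums for nums in pages_by_digest.values() if len(nums) > 1]


if __name__ == "__main__":
    sample = [b"page A", b"page B", b"page A", b"page C", b"page B"]
    print(find_exact_duplicates(sample))  # [[1, 3], [2, 5]]
```

A checksum like this only catches byte-identical pages; two separate scans of the same sheet would almost certainly produce different digests.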

-Ben

1 0
replied on March 5, 2018

Adarsh,

Interesting challenge! The primary issue that I would see in building an algorithm would be to define what a 'duplicate' page actually means. For example, if you take a single page and scan it twice, each scanned image will be minutely different. A human doing a side-by-side comparison might easily determine that they are 'similar' or 'alike', but that does not make them duplicates. In addition, if you built an algorithm that could determine that two or more images are similar, how would you determine which image to keep? A human looking at those similar images might choose one over the other because one looks like a 'cleaner image', but how would you build that logic into an automated process?

My thought to pursue this would be to develop some type of 'hash value' that could be applied to each page of the document. That hash value might consist of a blend of OCR'd text and/or image pixel valuations for each page. Once the hash values were determined, I would then do a bubble sort of the pages so that 'similar' pages were next to one another. At that point I would have a human step through and do a side-by-side comparison of adjacent pages to determine whether they are indeed 'duplicates'. Perhaps at some point, if the hashing algorithm were determined to be accurate enough, you could apply a rule that would eliminate pages that had similar hash values.
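A minimal Python sketch of the sort-then-review step, using the built-in sort in place of a bubble sort. The function name is hypothetical, and `page_keys` stands in for whatever per-page 'hash value' is chosen; here it is simply assumed to be a sortable key such as normalized OCR text.

```python
def review_pairs(page_keys):
    """Sort pages by key and return adjacent (page_a, page_b) pairs to review.

    `page_keys` is a list of sortable keys, one per page (for example a hash
    or normalized OCR text; the choice of key is an assumption here).
    Page numbers are 1-based.
    """
    # sorted() is stable, so pages with equal keys keep their original order.
    ordered = sorted(enumerate(page_keys, start=1), key=lambda item: item[1])
    # Pair each page with its neighbor in the sorted order.
    return [(a[0], b[0]) for a, b in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    keys = ["invoice 42", "police report", "invoice 42", "medical note"]
    print(review_pairs(keys))  # [(1, 3), (3, 4), (4, 2)]
```

Sorting only brings look-alike pages together if similar pages produce nearby keys. That holds for identical keys and for text sharing a long prefix, but not for a cryptographic checksum, where a one-character difference gives an unrelated value.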

All that being said, I am not sure that QuickFields, Workflow, and the LF client would provide the necessary functionality and/or UI to effectively make it happen. Perhaps this might be a set of SDK applications and/or service-based integrations instead?

1 0
replied on March 5, 2018

Hi Cliff,

Agreed, I was thinking too simplistically. Thinking this through a bit more, I'd keep away from any kind of pixel- or image-based hash. The same document scanned twice on the same scanner will produce variations in the scan artifacts and image details. A single artifact and the comparison would lose all value. I would stick to just OCR. Again, OCR isn't perfect, and a small imperfection (such as a 5 substituted for an S) will cause a duplicate to fail to match.

A better, more complex approach would be to compute a percentage match, comparing every page with every other page. It's likely to be a lengthier and more complex process, but you could then review any pages with a match of, for example, 90% or more.

https://stackoverflow.com/questions/31315231/how-to-compare-strings-for-percentage-match-using-vb-net#31320065

(this is not my preferred method, just the first one I came across)
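For illustration only, a percentage match on OCR text can be sketched in Python with the standard library's `difflib.SequenceMatcher`. This is not the method from the linked VB.NET answer; the function names, the 0.90 default threshold and the input format (a list of OCR text strings, one per page) are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lower-case and collapse whitespace so layout noise does not count."""
    return " ".join(text.lower().split())


def similar_pages(page_texts, threshold=0.90):
    """Return (page_a, page_b, ratio) for every pair at or above threshold.

    `page_texts` is a list of OCR text strings, one per page (assumed input).
    Page numbers are 1-based. Every pair is compared, so the cost grows
    quadratically with the page count.
    """
    cleaned = [normalize(t) for t in page_texts]
    matches = []
    for (i, a), (j, b) in combinations(enumerate(cleaned, start=1), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            matches.append((i, j, round(ratio, 3)))
    return matches


if __name__ == "__main__":
    texts = [
        "Patient reported pain in left knee on 5 March.",
        "Police report: vehicle collision at Main St.",
        "Patient reported pain in left knee on S March.",  # OCR read 5 as S
    ]
    for page_a, page_b, ratio in similar_pages(texts):
        print(page_a, page_b, ratio)
```

Because a single misread character lowers the ratio only slightly, the 5-versus-S case still lands above the threshold and gets flagged for review instead of being missed.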

As for the rest of your process, for sure.

"bubble sort" haha - that was the first sort algorithm I had to write for school and it's still useful sometimes :)

Also, yes, this would have to be entirely SDK-driven, either in Workflow or QF. My preference is Workflow, and a custom activity at that. It would be super easy to build and would make the code more reusable. However, if using the percentage-match approach, I'd take the process out of Workflow, because code that runs for longer than a second or two is slower than Workflow was designed for.

1 0
replied on March 8, 2018

Hi Ben, Cliff,

Thank you for your detailed answers. They gave me a lot of knowledge and a place to start.

I will let you know how we proceed and if any good/bad comes out of that.

Thanks,

Adarsh

0 0
replied on October 29, 2018

Hi Adarsh,

Did anything come out of this? Did you create anything that worked?

-Ben

0 0