Dears,
Is it a server-side process or a client-side one?
Thanks,
Dory Mina
Hi Dory,
This is a client-side process.
Hope this helps!
Well, it can be made "server" side with Import Agent or Quick Fields!
Is even generating pages from a PDF a client-side process?
Yup. Import Agent and Quick Fields will do it too. However, what's strange is that in Import Agent 9.2 you can't generate pages AND OCR at the same time. Hopefully that's fixed in the next version.
Hopefully they've fixed it in 9.2!
I know when it first came out that was a bug. It would generate the pages but there wouldn't be any text associated with them. I ran into that issue because I wanted to use Import Agent to process the PDFs and then have Quick Fields use the existing text when breaking apart and identifying documents (so it would run faster, as the OCR would already be done ahead of time).
EDIT: I found the issue: Import Agent will not *OCR* PDFs. It always uses the text stream and extracts the text. My problem was that some of the incoming PDFs did not actually have a text stream, and in those cases the PDFs are not OCR'd.
https://answers.laserfiche.com/questions/61780/Import-Agent-9--OCR-Images-files-extracted-from-PDFs
I'm curious to know if they've ever fixed this...
I'm using Import Agent 9.0.0.451 and it works fine for me. I wasn't aware there was a different version of Import Agent 9.
Perhaps it was something specific with the PDF you were using for testing?...
Ah OK! This is a different thing altogether.
There are two types of PDFs: image PDFs and text PDFs. A text PDF has a text layer associated with it and can be OCR'd without needing to do anything else. An image PDF is theoretically just a TIFF image in a PDF wrapper, not really a PDF as such. For the OCR engine to get its teeth into the text, it first has to 'convert' the PDF into a TIFF image, which is a workable open format; then it can extract the text from that. The limitation here, I believe, is that Adobe doesn't always let other applications read into a PDF, which is why Laserfiche has to convert it to a TIFF image first.
The OCR engine used by Laserfiche is OmniPage, so the fix for this will have to come from either OmniPage or Adobe; it is totally out of Laserfiche's hands.
Hope this helps explain!
I know they've said that there is a new OCR engine coming so hopefully it'll have lots of goodies!
Oh! Excited!
Just to clarify a couple things here.
There's actually nothing to do with OCR in the case of a PDF with embedded text: the text already exists, so no OCR is needed. Instead, when Import Agent (or the Client, etc.) generates image pages from the PDF, it also extracts the text stream that already exists.
If a text stream does NOT exist, then it's just like any other image doc once the pages are generated. In the Client there's an option to go ahead and automatically OCR generated pages that don't have accompanying text streams already, but that's not currently supported in Import Agent. We'll definitely look into getting that in the next update of it though.
Nothing to do with the OCR engine itself though.
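To make the distinction concrete, here is a minimal Python sketch of the decision described above. It is not Laserfiche code and does not call any Laserfiche API: `GeneratedPage`, `choose_text_source`, and the `ocr_missing_text` flag are hypothetical names made up purely for illustration, and the sample pages are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneratedPage:
    """One image page generated from a PDF (hypothetical model, not a Laserfiche type)."""
    number: int
    embedded_text: Optional[str]  # text stream pulled from the PDF, if any


def choose_text_source(page: GeneratedPage, ocr_missing_text: bool) -> str:
    """Decide where a page's text comes from: "text_stream", "ocr", or "none"."""
    if page.embedded_text and page.embedded_text.strip():
        # The PDF already carries text: extract it, no OCR involved.
        return "text_stream"
    if ocr_missing_text:
        # No text stream: treat the page like any other image page and OCR it.
        return "ocr"
    # No text stream, and OCR of generated pages is not enabled.
    return "none"


# Invented sample: one page with text, one without, one with only whitespace.
pages = [GeneratedPage(1, "Invoice 1001"), GeneratedPage(2, None), GeneratedPage(3, "   ")]
with_ocr = [choose_text_source(p, ocr_missing_text=True) for p in pages]
without_ocr = [choose_text_source(p, ocr_missing_text=False) for p in pages]
print(with_ocr)
print(without_ocr)
```

With the flag off, pages that lack a text stream end up with no text at all, which matches the Import Agent behavior described in this thread. With the flag on, they fall through to OCR, like the option in the Client.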
Hi Dory,
The released Import Agent 10 supports OCR for PDF pages that do not have a text stream. You can refer to another post.
Thanks,
Qinmei