Question

PDF added from Laserfiche Forms doesn't have text

asked on January 25, 2018

Hi all,

I have one PDF added from two different starting points ("test from LLForm.pdf" and "test from Windows.pdf").

When I add the PDF from Windows to Laserfiche, the file has "Text".

But the same PDF added from Laserfiche Forms to the same Laserfiche folder doesn't have it.

I don't understand why. I need this text, and I need it from LFForm. How can I do this?

 

Thanks in advance.

Regards

Replies

replied on January 25, 2018

This is likely because you have your client set to extract text from PDFs on import. Bringing a PDF into the repository using any other method won't run through this step.

A couple of ways to make it work are to either have a Quick Fields Agent process extract the text, or to use a PDF library called from a Workflow script.

A more squirrelly way to do it would be to export the file to a file system folder, and have Import Agent bring it back in, extracting the text in the process.

replied on January 25, 2018

Hi Devin,

 

Thank you for your reply.

My client doesn't have Import Agent, so I can't use it.

About Quick Fields, does it require human action?

Right now, it looks like Workflow is the best way. Do you have a workflow template?

 

Regards

replied on January 29, 2018

You could maybe use a Workflow activity with an SDK script to do it:

https://answers.laserfiche.com/questions/48724/Pdf-to-tiff-via-sdk-and-tiff-to-pdf-its-possible-

replied on January 29, 2018

Hi Rene,

 

Yes, I already read that post and tried it, but without success.

I get this error. It says "Invalid statement in a namespace".

This is my code:

Imports System
Imports System.Collections.Generic
Imports System.ComponentModel
Imports System.Data
Imports System.Data.SqlClient
Imports System.Text
Imports System.Runtime.InteropServices

Namespace WorkflowActivity.Scripting.Script
    '''<summary>
    '''Provides one or more methods that can be run when the workflow scripting activity is performed.
    '''</summary>


Public Class Ghostscript

    <StructLayout(LayoutKind.Sequential)> _
    Public Structure GSVersion
        Public product As String
        Public copyright As String
        Public revision As Integer
        Public revisionDate As Integer
    End Structure

    <DllImport("gsdll32.dll", CharSet:=CharSet.Ansi, CallingConvention:=CallingConvention.StdCall)> _
    Private Shared Function gsapi_revision(ByRef version As GSVersion, ByVal len As Integer) As Integer
    End Function

    <DllImport("gsdll32.dll", CharSet:=CharSet.Ansi, CallingConvention:=CallingConvention.StdCall)> _
    Private Shared Function gsapi_new_instance(ByRef pinstance As System.IntPtr, ByVal handle As System.IntPtr) As Integer
    End Function

    <DllImport("gsdll32.dll", CharSet:=CharSet.Ansi, CallingConvention:=CallingConvention.StdCall)> _
    Private Shared Function gsapi_init_with_args(ByVal pInstance As IntPtr, ByVal argc As Integer, <[In](), Out()> ByVal argv As String()) As Integer
    End Function

    <DllImport("gsdll32.dll", CharSet:=CharSet.Ansi, CallingConvention:=CallingConvention.StdCall)> _
    Private Shared Function gsapi_exit(ByVal instance As IntPtr) As Integer
    End Function

    <DllImport("gsdll32.dll", CharSet:=CharSet.Ansi, CallingConvention:=CallingConvention.StdCall)> _
    Private Shared Sub gsapi_delete_instance(ByVal pinstance As System.IntPtr)
    End Sub

    Public Shared Sub getVersion(ByRef version As GSVersion)
        gsapi_revision(version, Marshal.SizeOf(version))
    End Sub

    Public Shared Sub run(ByVal argv As String())
        Dim inst As IntPtr = IntPtr.Zero
        Dim code As Integer = gsapi_new_instance(inst, IntPtr.Zero)
        If code <> 0 Then
            Return
        End If
        code = gsapi_init_with_args(inst, argv.Length, argv)
        gsapi_exit(inst)
        gsapi_delete_instance(inst)
    End Sub

End Class

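' Note: this Sub is declared directly inside the Namespace, outside any Class or Module, which VB.NET does not allow.
Private Sub ToTIFFG4(ByVal sPDFPath As String, ByVal sOutputFolder As String)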
    If Not String.IsNullOrEmpty(sPDFPath) Then
        If Not String.IsNullOrEmpty(sOutputFolder) Then
            If IO.File.Exists(sPDFPath) Then
                Try
                    If Not IO.Directory.Exists(sOutputFolder) Then
                        IO.Directory.CreateDirectory(sOutputFolder)
                    End If
                    Dim fi As IO.FileInfo = New IO.FileInfo(sPDFPath)
                    Dim sOutName As String = IO.Path.Combine(sOutputFolder, fi.Name.Replace(fi.Extension, "_G4.tiff"))
                    If IO.File.Exists(sOutName) Then
                        IO.File.Delete(sOutName)
                    End If
                    Dim gsVer As New Ghostscript.GSVersion()
                    Ghostscript.getVersion(gsVer)
                    If gsVer.revision > 900 Then
                        Dim argv As String() = {"PDF2TIFF", "-q", "-sOutputFile=" & sOutName, "-dNOPAUSE", "-dBATCH", "-P-", _
                     "-dSAFER", "-sDEVICE=tiffg4", "-r300", sPDFPath}
                        Ghostscript.run(argv)
                    End If
                Catch ex As Exception
                    MsgBox(ex.Message)
                End Try
            End If
        End If
    End If
End Sub

End Namespace

 

replied on January 29, 2018

Keep in mind that this method appears to be attempting to create images from a PDF, not extract text.

One question that I should ask is what are you wanting to do with the text that you extract? Be aware that depending on how the PDF was written, it may have a wonky internal structure and it may be tricky to get text that makes sense.

 

replied on January 29, 2018

Hi Devin, thank you for your help.

 

This is a little bit complicated.

My customer uses the PIXI software. In it, an agent fills in a form for each of their customers. The form holds a lot of information (id, lastname, firstname, dob, ...).

From the software, the agent prints a PDF file (I'll call it PDF1) and a paper copy to be signed.

PDF1 is named now().pdf (example: 20180129094012.pdf).

Once the customer has signed the form, the agent scans the signed form and the attachments (passport, invoice, ...) and gets a new PDF (I'll call it PDF2).

PDF2 is also named now().pdf (example: 20180129094205.pdf).

 

In Laserfiche, I need to archive the files like this:

REPOSITORY \ <Customer> \ <Form_ID> \ PDF1

REPOSITORY \ <Customer> \ <Form_ID> \ PDF2

 

The form and the attachments need to be in the same folder, and I need to get information from the form.

Currently, my difficulties are:

#1. The agent doesn't rename the PDFs (1 and 2).

#2. Even when the agent does rename the PDFs (1 and 2), they get it wrong (example: 1234.pdf instead of 1235.pdf, or PDF1 becomes 1.pdf and PDF2 becomes 2.pdf).

#3. The first page of the scan is not always the same (sometimes the order is "form, passport, invoice", and sometimes "passport, form, invoice").

 

My solution was to retrieve the information from PDF1 using a simple drag and drop + Workflow + pattern matching.
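
A minimal sketch of that pattern-matching step, written in Python rather than in Workflow itself, and assuming the text has already been extracted from PDF1. The field labels (`Form ID`, `Last name`, ...), the sample text, and the "lastname firstname" customer folder name are all hypothetical, since the real PIXI form layout is not shown in this thread.

```python
import re

# Hypothetical field labels: the real PIXI form layout is not shown in this thread.
FIELD_PATTERNS = {
    "form_id": re.compile(r"Form ID[ \t]*:[ \t]*(\d+)"),
    "lastname": re.compile(r"Last name[ \t]*:[ \t]*(.+)"),
    "firstname": re.compile(r"First name[ \t]*:[ \t]*(.+)"),
    "dob": re.compile(r"Date of birth[ \t]*:[ \t]*(\d{2}/\d{2}/\d{4})"),
}


def extract_fields(text):
    """Return the fields found in the extracted text; missing ones are left out."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields


def target_folder(fields, root="REPOSITORY"):
    """Build the root / customer / form ID folder path from the extracted fields."""
    required = ("lastname", "firstname", "form_id")
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    # Assumption: the customer folder is named "lastname firstname".
    customer = f"{fields['lastname']} {fields['firstname']}"
    return "\\".join([root, customer, fields["form_id"]])


# Made-up text standing in for what text extraction would produce from PDF1.
sample_text = """Form ID: 1235
Last name: Dupont
First name: Jean
Date of birth: 01/02/1980
"""
fields = extract_fields(sample_text)
print(fields)
print(target_folder(fields))
```

None of this works on a document that has no text, which is the problem with the attachments coming from the web form.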

PDF2 is more difficult. I don't know how to merge PDF2 into PDF1 without: Import Agent, QF import, drag and drop to Laserfiche, and renaming it.

 

My solution was to use a web form with file upload fields. Using Workflow, we could get the information from attachment 1 (PDF1) and simply merge attachment 2 into attachment 1.

But attachments submitted through the web form don't get text, so I can't use pattern matching on them.

I don't know if this is clear to you. Don't hesitate to ask me for more information if you need it.

 

Regards

replied on January 29, 2018

Allowing the user to drag and drop a document is going to be the most consistent way to extract the text. You can configure the client to always show the metadata dialog when they drag a document into a folder. This would give them an opportunity to add metadata to the document. You could use this to find related documents and merge them.

If you are worried about data entry errors, that's where Workflow can help you out.

Here's one possible solution:

  1. User drops the first document onto the client
  2. Client displays the metadata dialog
  3. User enters metadata
  4. Workflow picks up the document, and if the user entered required fields, Workflow can validate their input against the PIXI database. If it can't find a matching record, it can route the document to a "needs correction" folder and send the user an email detailing what went wrong.
  5. The user drops the second document onto the client.
  6. Repeat step 4.
  7. Once both documents have been vetted, you can use the metadata to find both documents and merge them together.
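
The validation and merge logic in steps 4 and 7 can be sketched roughly as follows. This is plain Python, not Workflow, and everything concrete in it is an assumption for illustration: the `PIXI_RECORDS` dictionary stands in for the PIXI database, and the `form_id` field, the folder names, and the third file name are hypothetical.

```python
from dataclasses import dataclass, field

# Stand-in for the PIXI database: form ID -> customer name (hypothetical data).
PIXI_RECORDS = {"1235": "Dupont Jean"}


@dataclass
class Document:
    name: str
    metadata: dict = field(default_factory=dict)
    folder: str = "Incoming"


def vet(doc):
    """Step 4: validate the user's metadata and route the document.

    Returns an error message for the notification email, or None when the
    document passed validation.
    """
    form_id = doc.metadata.get("form_id")
    if not form_id:
        error = "required field 'form_id' is empty"
    elif form_id not in PIXI_RECORDS:
        error = f"no PIXI record matches form_id {form_id!r}"
    else:
        doc.metadata["customer"] = PIXI_RECORDS[form_id]
        doc.folder = "Vetted"
        return None
    doc.folder = "Needs correction"
    return error


def merge_candidates(documents):
    """Step 7: group vetted documents by form ID; keep groups of two or more."""
    groups = {}
    for doc in documents:
        if doc.folder == "Vetted":
            groups.setdefault(doc.metadata["form_id"], []).append(doc.name)
    return {form_id: names for form_id, names in groups.items() if len(names) >= 2}


docs = [
    Document("20180129094012.pdf", {"form_id": "1235"}),
    Document("20180129094205.pdf", {"form_id": "1235"}),
    Document("20180129101500.pdf", {"form_id": "9999"}),
]
errors = {doc.name: vet(doc) for doc in docs}
print(errors)
print(merge_candidates(docs))
```

The point of the sketch is that the merge in step 7 only works because both documents carry the same validated value; without any metadata on the second document there is nothing to group on.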

 

Does all of that make sense?

 

replied on January 30, 2018

Hi Devin,

 

#1: OK

#2: OK

#3: My customer doesn't want to populate metadata. Anyway, this is not a problem; using Workflow I can retrieve the data from the PDF.

#4: The document will still be OK, so we don't need this step.

#5: My customer doesn't want to populate metadata => the second document is not identifiable.

Because of #5, we can't have #6 and #7.

 

replied on February 1, 2018

Well, I can't help you if the users won't do their part. :)

If you don't mind slowing down the scanning process, you can have the Client OCR documents as they're scanned. That might help you. However, when you're doing pixel-based OCR as opposed to extracting text from a PDF, I recommend you still validate the data against a database. You never know if the OCR was completely accurate.
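
As a rough illustration of validating OCR output against a database, here is a small Python sketch. The fuzzy comparison with `difflib` is my own assumption about how a near miss could be flagged, and `KNOWN_CUSTOMERS` is made-up data standing in for the real database.

```python
import difflib

# Hypothetical rows standing in for the customer names in the database.
KNOWN_CUSTOMERS = ["Dupont Jean", "Durand Marie"]


def check_ocr_value(value, known=KNOWN_CUSTOMERS, cutoff=0.8):
    """Classify an OCR'd value against the known values.

    Returns (status, match) where status is 'exact', 'suspect' (only a close
    match was found, so a person should review it), or 'unknown'.
    """
    if value in known:
        return "exact", value
    close = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    if close:
        return "suspect", close[0]
    return "unknown", None


print(check_ocr_value("Dupont Jean"))   # exact match
print(check_ocr_value("Dup0nt Jean"))   # OCR read the letter o as a zero
print(check_ocr_value("Martin Paul"))   # nothing similar in the database
```

A "suspect" result is a hint for review, not a correction to apply automatically.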

replied on February 1, 2018

Thank you, Devin. We are going to get Import Agent. In my opinion, this is the best solution.
