
Question


Workflow Performance Degradation w/Search Activities

asked on January 6, 2014

Hello,

 

I am currently experiencing performance degradation with a running workflow. Basically, the workflow searches within a given folder (no subfolders) based on 3 fields whose values are retrieved from an Excel file (with ~35,000 lines). If the search fails, the workflow creates a new entry in that folder with those field values. As expected, the workflow slows down over time, since the search activity runs against an ever-increasing number of entries. I know this behavior is normal, but I'd like to know what can be done to minimize the degradation. I have run the same workflow on a SQL Express server with 2 CPUs vs. a Standard SQL server with 8 CPUs, and neither environment meets the required deadline.

 

For instance, should I create and maintain a field whose content concatenates the values of the 3 fields above, and run the search against that single field?

 

Thanks in advance,

 

Lou


Answer

APPROVED ANSWER
replied on January 6, 2014

I thought so. There is a lot of overhead in keeping track of the activities you've run in a single instance, so that's most likely where the performance hit you're seeing comes from, not the search itself.

 

It might get marginally better if you split the search and document creation into a separate WF rule and invoke it for each row. Can you describe in more detail what exactly you're trying to accomplish?

 

Yes, all of Laserfiche will be at our Empower 2014 conference next week, so replies on these forums will be way slower than usual ;)


Replies

replied on January 6, 2014

Are you saying you're running the same search 35,000 times (once for each row in the Excel spreadsheet) in the same workflow instance?

replied on January 6, 2014

Hello Miruna,

 

Yes indeed...

 

PS: are you going to attend Empower?

replied on January 7, 2014

Hello Miruna,

 

I have disabled activity tracking (all 8 categories) for the workflow in the WF Admin Console (and then restarted both Laserfiche Workflow services and the workflow itself). However, when I press <F5> on the running workflow in the WF Designer, I still get the 'Retrieving activity messages' and 'Processing activity message x of y' progress bars, where 'y' is getting quite large. What's the explanation?

 

PS: please find attached the Advanced Server Options screenshot

Advanced Server Options.png
replied on January 7, 2014

It is not recommended to disable all activity tracking. If something goes wrong, you will have no way of troubleshooting with your current configuration. At the very least, activity errors, warnings, conditions, and tokens should be on.

 

When you open the instance details, the Designer still has to check whether there are any rows in the table in SQL. Depending on when the instance ran, it may already have data tracked (the Advanced Server Options apply only to new instances started after you made the changes). Also, disabling activity tracking does not mean the WF Server will not log each activity it runs in the instance. So when you open the instance, it still goes through that data: 30,000 times however many activities you run in For Each Row.

replied on January 7, 2014

Thank you Miruna for the info re activity logging.

 

Back to my original question: I'd like your opinion about searching for documents using more than one field (for instance 3). How does Laserfiche translate the search into SQL? Does it launch distinct SELECT statements (one per field) and then join the results to keep only the rows returned by all of them? Or does it execute the SELECT statements sequentially, in a kind of 'refine results' way?

 

I ask in order to make up my mind about whether an additional field concatenating all (for instance 3) field values could be used instead, so that the search activity would run against a single field.

 

Lou

replied on January 7, 2014

It's a single query for the search. There are a few more queries involved in checking security on each document before it is presented to the user in the search results. Like I said, most of the performance degradation will be on the WF side, from keeping track of all the activities it has run in the current instance. But you will see some performance degradation on the LFS/SQL side as well, from the sheer number of searches in a short period of time.
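
For illustration only (the table and column names below are placeholders, not Laserfiche's actual schema), a three-field search resolves to one statement with all the predicates combined, rather than one statement per field with the results joined afterwards:

-- Hypothetical sketch only; placeholder tables, not the real Laserfiche schema
SELECT d.doc_id
FROM docs d
JOIN field_values f1 ON f1.doc_id = d.doc_id AND f1.field_name = 'Field1' AND f1.value = 'A'
JOIN field_values f2 ON f2.doc_id = d.doc_id AND f2.field_name = 'Field2' AND f2.value = 'B'
JOIN field_values f3 ON f3.doc_id = d.doc_id AND f3.field_name = 'Field3' AND f3.value = 'C'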

 

What is the purpose of this workflow?

replied on January 7, 2014

The client is a city, Ville de Brossard, and they are transferring all of the city's property documents into Laserfiche. Documents will be available to city employees and eventually to citizens as well. I will be pleased to meet you next week at Empower to discuss the project further.

 

Lou

replied on January 7, 2014

Miruna,

 

If I understand correctly, it should be cheaper to split the 35,000 iterations into (for instance) 100 workflow instances, each responsible for handling 350 rows. In your opinion, to what extent would the overhead of dispatching those 100 instances (rather than a single one) defeat the purpose?

replied on January 7, 2014

How do you plan on doing that? You would also have to split the Excel spreadsheet as Excel doesn't allow concurrent connections.

 

You'd get some performance benefit from not having so much data tracked per instance. The LF Server and SQL may be impacted more since your searches would now be concurrent.

 

But, I guess, that brings me back to my question above. What is the purpose of this workflow? If it's a one time run, then the performance is probably not that big a deal. If you're expecting it to run periodically, what is it supposed to achieve?

replied on January 8, 2014

Hello Miruna,

 

The workflow is expected to run on a daily basis, which is why performance matters.

 

Here is how I intend to address the problem:

 

> I have inserted a «Modulus» column in the spreadsheet with the formula =MOD(ROW(),16), so each row now has a value between 0 and 15

> I have added an input parameter «Modulus» to the problematic workflow

> I have created a (sort of) dispatch workflow that invokes the problematic workflow 16 times, with values from 0 to 15

> The problematic workflow no longer retrieves all 35,000 rows from the spreadsheet, but only those whose «Modulus» cell equals its «Modulus» input parameter value

 

Given that, each instance of the (formerly) problematic workflow will iterate over roughly 35,000 / 16 ≈ 2,200 rows, so the performance degradation should be alleviated.
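
For reference, a minimal sketch of what the filtering query could look like, assuming the Query Data activity reads the spreadsheet through an OLE DB connection (where sheets are addressed as [Sheet1$]) and assuming the three columns are named Field1, Field2 and Field3:

SELECT [Field1], [Field2], [Field3]
FROM [Sheet1$]
WHERE [Modulus] = 5   -- 5 stands in for the instance's «Modulus» input parameter token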

 

Louis-Simon

replied on January 8, 2014

Is this basically sync-ing data from some sort of external source? Like, you get a list of customers and need to make sure they have corresponding folders in LF? If so, then would that mean that the table could get progressively bigger?

 

(I'm trying to find a more efficient solution that wouldn't involve checking for existing entries by searching every day)
 

replied on January 8, 2014

You've got the point, Miruna: sync-ing data from an Excel spreadsheet, as follows:

 

> one cell in the spreadsheet may have more than one value, each of them separated by a space

> for each value in that cell, need to check whether there is a corresponding entry in Laserfiche

> if there is no corresponding entry, need to create a dummy entry - as opposed to a PDF file - with the value

 

The point is:

> rows are added from time to time in the spreadsheet, which is why I run the workflow on a daily basis

> ...and similarly, values may be added to a cell of an existing row

 

Lou

replied on January 8, 2014

Ok, so basically check the Excel spreadsheet every day for new values and handle those.

 

I'd recommend letting SQL do most of the work of figuring out what the new values are. I get the appeal of the UI in Workflow (wink), but SQL is a better tool.

 

So here is what I would do:

 

1. Use SQL SSIS (or the Import Data wizard in SQL Management Studio) to import the Excel spreadsheet into a table, let's call it "Reference", and run your workflow once.

2. At this point, every row in the Reference table would have a corresponding folder in LF. Add a "Status" column to your table and set it to 1.
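
A minimal sketch of that step, using the same table name and the "Status" flag referenced in the query and steps below:

-- Add the flag column and mark every existing row as already processed
ALTER TABLE reference ADD [Status] INT NULL
UPDATE reference SET [Status] = 1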

 

Now, you'd have a reference point for data.

 

3. Create a SQL Scheduled Job (or save the Import Data package you created above) that runs daily and imports the Excel file into a TEMP table.

 

4. Create a scheduled job that runs a query to find the values that are in the TEMP table but not in the REFERENCE table, and inserts them into REFERENCE. Something like:

 

INSERT INTO reference([column1],[column2],[column3])
SELECT [column1],[column2],[column3] FROM temp
WHERE [column1] NOT IN 
(SELECT [column1] FROM reference WHERE status = 1)
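
One note on that sample: as written, it keys only on [column1]. If a row is really identified by the combination of all three fields (as in the original 3-field search), a NOT EXISTS form covers that; a sketch with the same placeholder names:

INSERT INTO reference([column1],[column2],[column3])
SELECT t.[column1], t.[column2], t.[column3]
FROM temp t
WHERE NOT EXISTS
    (SELECT 1 FROM reference r
     WHERE r.[column1] = t.[column1]
       AND r.[column2] = t.[column2]
       AND r.[column3] = t.[column3])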

 

In your WF:

 

5. Modify the Query Data activity to look at the REFERENCE table instead of the spreadsheet and only get the values where the Status column is not 1.
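
A sketch of what that Query Data statement could look like (with the same placeholder column names as above):

SELECT [column1],[column2],[column3]
FROM reference
WHERE [Status] IS NULL OR [Status] <> 1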

 

6. Do whatever you did before to create the folders

 

7. Run an Update Data activity to set the Status column to 1 for everything in the table.
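
For example (again with the placeholder names, assuming new rows come in with Status left NULL):

UPDATE reference
SET [Status] = 1
WHERE [Status] IS NULL OR [Status] <> 1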

 

8. Run a Custom Query to drop the TEMP table (the SQL import job would re-create it next time it runs)

 

This would basically eliminate all searches. At the end of each run, your REFERENCE table would be updated to have all the values you've processed so far. And most days, the workflow instance would only run a Query Data activity since you're saying the spreadsheet is not updated daily.

 

If whatever application is generating the Excel spreadsheet could write it to SQL directly, things would be even easier.

replied on January 7, 2014

Could you do a single search across the entire repository, go through each entry, test its metadata fields, and match them against the Excel file? That way you'd have one search containing every document; you could iterate through the results and mark off each row of the Excel file you've already found. Then a separate workflow at the end could look for all the rows with a null value in that column (indicating the document still needs to be created) and create those.

replied on January 7, 2014

Excel spreadsheets don't allow updates, so there would be no way of marking rows off (at least not in Excel). The load on LFServer/SQL would be comparable or greater, since you'd now have one search that returns possibly more than 35,000 results, plus one call to get fields for each of the search results. The performance degradation on the WF side would be the same if not greater, given that you'd also be iterating through the search results (potentially way more than 35,000), so you'd still be recording a lot of activities for the instance.

replied on January 7, 2014

Miruna,

 

You are actually correct. I guess I overlooked that. But you can insert/update data in a spreadsheet. Or, if you do not wish to do that, you can easily turn the spreadsheet into a SQL table.

 

To reduce the overhead, you can have the workflow invoke itself after incrementing a counter variable and passing it to the next invoked workflow. It would always check the list of exported documents for the next ID, and if that ID isn't there (say it's a folder we don't care about), the counter is incremented and a Find Entry runs for the next number when the workflow is called again. You would be making Find Entry return one item at a time until you were done with the entire repository. Once done, you would have a list of the files that need to be created.

 

EDIT: I am having to do a similar amount of searching for one of our clients, and I was planning on doing something similar to this so I can reduce the overhead at any given time. I have also implemented a solution that gathers information in stages and limits the workflow to running during non-business hours.

