
Question

Re-organize a repository using workflow

asked on January 21, 2014

We are looking for a way to re-organize a repository with a new folder structure using Workflow. The impediment is the Search Repository timeout. Once we put the logic in place to route documents based on metadata, we simply need to run the workflow across all entries. The only way I know to do this is with the Search Repository option, but it comes with a built-in timeout which stops the workflow. Does anyone know a way to run a workflow across all entries in the repository, preferably with the option to schedule it to run only during certain time periods?

0 0

Answer

SELECTED ANSWER
replied on January 21, 2014

Create a document with a control field on it. Inside that control field, place a 0. (Let's pretend this document's LF entry ID is #456456.)

 

Create workflow #1, which files your documents away.

 

Create workflow #2, which looks at #456456 and gets the current number. It then creates a Token1 that contains this number and a Token2 that adds 100 to it.

 

Have workflow #2 perform the following search:

 

{LF:Name="*", Type="DB"} & {LF:ID>Token1} & {LF:ID<=Token2}

 

This will return all documents but not shortcuts or folders. (Keep this in mind if you apply metadata to folders; if so, you'll need to account for that and modify workflow #1 to move that metadata.)

 

Create a For Each Entry loop that kicks off an Invoke Workflow activity, passing the entry ID to workflow #1.

 

Modify entry #456456, setting the control field to Token2 (so the next run picks up where this one left off).

 

Schedule workflow #2 to run every few minutes during the time spans you want. You'll want to monitor how fast it's running and adjust the number it increments by each time (change the +100 to whatever number the server can handle).
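Put together, one pass of workflow #2 boils down to something like this. This is a Python-style sketch for illustration only; read_counter, search_documents, invoke_filing_workflow, and write_counter are placeholder names standing in for the corresponding Workflow activities, not real APIs.

```
# Sketch of one run of the ID-range batching described above.
BATCH_SIZE = 100  # the "+100" increment; tune this to what the server can handle

def run_batch(read_counter, search_documents, invoke_filing_workflow, write_counter):
    token1 = read_counter()            # current value stored on control entry #456456
    token2 = token1 + BATCH_SIZE       # upper bound for this batch

    # Equivalent of: {LF:Name="*", Type="DB"} & {LF:ID>Token1} & {LF:ID<=Token2}
    for entry_id in search_documents(min_id=token1 + 1, max_id=token2):
        invoke_filing_workflow(entry_id)   # hand each entry off to workflow #1

    write_counter(token2)              # record the high-water mark for the next run
```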

 

 

2 0
replied on January 21, 2014

Awesome idea! Do you know what happens when it hits entry IDs that no longer exist, or reaches the end of the range? Does it exit the loop?

0 0
replied on January 21, 2014

Oh yeah... never thought about that.

 

I'm guessing you just have to watch for that. It'll keep running; it just won't return any results. I'm not sure if it'll generate errors at that point, but I don't think it'll cause any issues.

 

You could also write a condition in there that says if the number is larger than X, end the workflow.

0 0
replied on January 21, 2014

You don't need to do anything for that case. The search won't return any results, so the rest of the activities (assuming you're running the results through a For Each Entry loop) will not run. Or you can wrap the subsequent activities in a Conditional Sequence that checks if the ResultsCount token from the search is greater than 0.
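In the same sketch style as the loop above (illustrative placeholders only, not Workflow syntax), those two safeguards would look roughly like this:

```
# Illustrative only: the two stopping conditions discussed above.
MAX_ENTRY_ID = 500000   # hypothetical "if the number is larger than X, end workflow" bound

def run_batch_guarded(read_counter, search_documents, invoke_filing_workflow, write_counter):
    token1 = read_counter()
    if token1 >= MAX_ENTRY_ID:
        return                          # end the workflow once the range is exhausted

    token2 = token1 + 100
    results = list(search_documents(min_id=token1 + 1, max_id=token2))
    if len(results) > 0:                # mirrors checking the search's ResultsCount token
        for entry_id in results:
            invoke_filing_workflow(entry_id)

    write_counter(token2)               # advance even when a range had no hits
```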

 

Like Chris says, you'll want to keep an eye on it and eventually turn the workflow off.

1 0
replied on January 21, 2014

Thanks, that makes sense!

0 0
replied on January 21, 2014

This solution is working very well. It meets both of my needs: the search by ID returns very quickly, and I can run it in chunks during any time period. Thank you!

0 0
replied on January 21, 2014

Oh awesome!

 

In my response below I mentioned some things about your Workflow DB and the logging it does. If you have a ginormous number of files you are moving, you might want to keep an eye on it. You can disable the majority of what's recorded under Workflow Administration / Advanced Server Options / Workflow. This way you can at least minimize how much data it has to record for each sub-workflow invoked.

0 0

Replies

replied on January 21, 2014

I think you might be thinking about this a little backwards. 

 

The logical approach would be to search for everything: do one very large search, say, for everything in the repository, and then use a "For Each Entry" activity to go through each of the entries.

 

Once you have this, it's only a matter of setting up routing conditions to figure out what information the entry has and then route it to the new location. 

 

If you are worried about the overhead, you can break this down into smaller searches and run one per night. In this case, you can arrange a DB table to store the Entry IDs of all the documents along with the routing status of each as a boolean. You do a search for the entire contents of the repository and then go through each entry, storing its Entry ID and a 'false' value.

 

With that done, you can have a workflow run from the end of work hours until morning, going through each entry and changing the value to 'true' after handling the entry with that ID.
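A rough sketch of that tracking table, assuming SQL Server reached from Python via pyodbc; the connection string, table, and column names (EntryMigration, EntryId, Routed) are made up for illustration:

```
import pyodbc

# Hypothetical tracking table (names are illustrative, not from the post above):
#   CREATE TABLE EntryMigration (EntryId INT PRIMARY KEY, Routed BIT NOT NULL DEFAULT 0)
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=sqlhost;DATABASE=MigrationTracking;Trusted_Connection=yes;")
cur = conn.cursor()

def record_entry(entry_id):
    """Store an entry ID with Routed = 0 ('false') when it is first seen."""
    cur.execute("INSERT INTO EntryMigration (EntryId, Routed) VALUES (?, 0)", entry_id)
    conn.commit()

def next_unrouted(batch_size=100):
    """Fetch the next batch of entries the overnight workflow still has to handle."""
    cur.execute("SELECT TOP (?) EntryId FROM EntryMigration WHERE Routed = 0 ORDER BY EntryId",
                batch_size)
    return [row.EntryId for row in cur.fetchall()]

def mark_routed(entry_id):
    """Flip the flag to 'true' after the entry has been moved."""
    cur.execute("UPDATE EntryMigration SET Routed = 1 WHERE EntryId = ?", entry_id)
    conn.commit()
```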

 

EDIT: Feel free to come up with variations of this idea; I know I have done this sort of thing myself. It's a bit tricky, but it gets around some of the inherent limitations you are finding with the current method you tried.

0 0
replied on January 21, 2014

This was how I originally set up the workflow; however, when returning 100 or more entries, the search timeout is reached and the workflow is canceled. I think the timeout is at about 4-5 hours. Even if the timeout could be removed, it could take a very long time for every entry referenced in the database to be returned. Then, once it is returned, you have one shot at it; if there are any problems routing, you will have to initiate the search all over again.

 

A SQL query by ID and document type should be very quick regardless of the size of the database. I am going to test soon.

 

0 0
replied on January 21, 2014

If you took this approach, I think you'd want to do a custom DB query that returned the next ### of results that were not folders as a multivalue token, then cycle through a loop of that value and kick off another workflow for each one.

 

If your system is so large that it times out, I would want to have Workflow working on multiple documents at once. If you did it all in one workflow, you'd only have one continuing thread for it instead of 4 threads per CPU. You could use the same process of setting up a control file to store the last index value inside the table you create (I'm assuming this table would be something you created through SQL directly, not by a search in Laserfiche), so you could easily add an index value to it. The trick is setting the right ### of results so that every time it kicks off, it has finished the previous batch. Depending on overhead during the day, you could potentially lower that ### to something smaller and continue to have it running during the day. Since you are not doing searches with this method, the overhead wouldn't be as bad.
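As a sketch of that custom query idea (Python with pyodbc again; EntryView, EntryId, and EntryType are assumed names, not the actual repository schema, so point it at whatever table or view you build):

```
def next_batch(conn, last_id, batch_size=100):
    """Return the next ### of non-folder entry IDs after last_id, plus the new last index.

    conn is an open pyodbc connection; the table/column names are placeholders."""
    sql = ("SELECT TOP (?) EntryId FROM EntryView "
           "WHERE EntryId > ? AND EntryType = 'Document' "
           "ORDER BY EntryId")
    cur = conn.cursor()
    cur.execute(sql, batch_size, last_id)
    ids = [row.EntryId for row in cur.fetchall()]       # becomes the multivalue token
    return ids, (ids[-1] if ids else last_id)           # new "last index" to store back
```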

 

 

One other thing to keep in mind with this is your Workflow database. You may want to go into Advanced Server Options and choose to turn off all tracking options for the actual filing workflow. It would normally record a new entry for each and every workflow that ran, which would cause the Workflow DB to grow enormously and potentially slow down your SQL server. I don't think you can turn this off completely, but you can turn off all messages inside it so that the amount of data it writes per workflow session is very small.

0 0
replied on January 21, 2014

Good to know; I turned off logging. I just finished timing how long it takes to run on 1,000 entries: about 30 minutes. If we run 10-hour nightly sessions, we will have everything migrated in 1 year. I would like to reduce that to 1 month, but that would require 12x the performance.

About 1,200 ms of the 1,400 ms total per workflow is spent in the lookups, which reach out to an external SQL server.

 

I am not sure it is the external database that has the performance problem, though; the lookup activity appears to have a minimum of around 500 ms.

 

I have two lookup activities in the workflow: one does a lookup on a very large table, and the other does a lookup on a table with only 20 items in it and 2 columns. In Studio this query takes less than 1 ms. Both lookup activities take a bit over 500 ms. This may be the authentication and initial connection to the engine.

 

Edit: I am able to run 1000 simultaneous workflows, but that took just as long; all the lookups hung up for some time.

0 0
replied on January 21, 2014

Are you able to run this on weekends? A full weekend of run time could help shorten your timeline.

 

Anyway, have you tried actually running the move activity in the workflow? This might take a lot longer than you expect in your estimates.

 

Also, are you able to use the SDK? I think you might get better performance running through all these files and moving them once you take out any overhead that Workflow might introduce. Maybe even using 9.1 might help you with your performance crisis.

 

It is interesting that you have figured out your bottleneck. You really do not want to run that many simultaneous workflows if you are using searches and SQL lookups. Can you explain how you plan to go through all the files? Could you not set up some sort of "automatic filer" workflow, where you drag a collection of files into a folder on the root and the workflow acts on each one of the files within?

 

How many entries are we dealing with?

1 0
replied on January 22, 2014

"I have two lookup activities in the workflow: one does a lookup on a very large table, and the other does a lookup on a table with only 20 items in it and 2 columns. In Studio this query takes less than 1 ms. Both lookup activities take a bit over 500 ms. This may be the authentication and initial connection to the engine."

 

With only 20 items in 2 columns, I would think you could write a workflow process with all the possible values hard-coded. It would take some time, but then it wouldn't have to do this lookup, which would save quite a bit of time. You have hit the nail on the head with the 500 ms limit; that is roughly how long it takes to set up and tear down the connection to another table.
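For illustration, hard-coding the small table could look something like this (Python here just to show the shape; the document types and destination paths are made up, and the real logic would live in whatever language your script activity uses):

```
# Illustrative only: replace the 20-row lookup with a hard-coded mapping so no
# database connection is opened per workflow instance. Keys and paths are made up.
DOC_TYPE_FOLDER = {
    "Invoice":        r"\Accounting\Invoices",
    "Purchase Order": r"\Accounting\Purchase Orders",
    "Contract":       r"\Legal\Contracts",
    # ... the remaining rows of the small table go here ...
}

def destination_for(doc_type):
    # Equivalent of the case statement: return the mapped path, or a
    # fallback folder for anything unexpected.
    return DOC_TYPE_FOLDER.get(doc_type, r"\Unfiled")
```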

 

One other possibility is the fact that you can run multiple Workflow servers on multiple machines if you need to. I'm not sure whether that would be compatible with what you are trying to accomplish, as you may be running into SQL issues at that point; to truly help with this you'd need multiple copies of that third-party SQL DB, each with its own server, too.

 

 

1 0
replied on January 22, 2014

Seeing all of your posts about the time and how long it's going to take, I'm wondering if there's another angle that could be used. Do you have the SDK and some coding skills? (If you can code but don't have the SDK, you should consider it; it's really reasonable.)

 

In my original plan up above I was thinking of a simple look-at-the-metadata-and-file-it-away workflow for each doc, but since you need to do lookups, you obviously can't go crazy with how many it processes at once.

 

I now wonder if you could pass the multivalue token that contains all of the found entries to sub-workflow #2 as an input parameter, which is then used in an SDK script. That SDK script makes one connection to your DB programmatically and either moves the document to its final place (i.e., you do all of the filing workflow functions via script) or fills in 2 temp fields on that document and tags the document as 'ready for filing'. If you use the temp field/tag option, setting that tag causes Subscriber to kick off workflow #3, which actually files the document away using those temp fields so no lookup has to happen (and then removes the temp fields).
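A sketch of that batch handoff (Python for illustration; lookup_destination, set_temp_fields, and add_tag are hypothetical placeholders for the SQL query and the SDK calls that write the temp fields and apply the tag):

```
def process_batch(entry_ids, conn, lookup_destination, set_temp_fields, add_tag):
    """One DB connection for the whole batch instead of one per workflow instance."""
    cur = conn.cursor()                       # conn: an already-open database connection
    for entry_id in entry_ids:                # entry_ids = the multivalue token passed in
        dest_folder, new_name = lookup_destination(cur, entry_id)   # hypothetical query helper
        set_temp_fields(entry_id, dest_folder, new_name)            # stash results on the entry
        add_tag(entry_id, "ready for filing")                       # workflow #3 takes it from here
```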

 

I would think you could get more throughput with this method, as there would not be a setup and teardown of the database connection for every entry. There would be a bit of a lag invoking the SDK, but not much. You could also do a trick to have it work on more than one of these lists at once by using:

 

Token1 = starting entry
Token2 = Token1 + 100    --  search, then pass to Invoke Workflow #2 (don't wait)

Token3 = Token1 + 100
Token4 = Token2 + 100    --  search, then pass to Invoke Workflow #2 (don't wait)

Token5 = Token3 + 100
Token6 = Token4 + 100    --  search, then pass to Invoke Workflow #2 (don't wait)

 

This way you could have 3 different SDK scripts running at a time.
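The staggering above boils down to something like this (illustrative sketch only; start_batch_workflow stands in for the search plus the Invoke Workflow "don't wait" step):

```
def dispatch_staggered(start, start_batch_workflow, window=100, batches=3):
    """Kick off several 100-ID windows without waiting for any of them to finish."""
    for i in range(batches):
        lower = start + i * window          # Token1 / Token3 / Token5
        upper = lower + window              # Token2 / Token4 / Token6
        start_batch_workflow(lower, upper)  # search (ID > lower, ID <= upper), pass results on
```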

 

 

If you eliminate the lookup piece, it should cycle through the filing parts of the workflows really fast. You can code all of your filing logic in your SDK script to speed it up even more, but if it's complex, with lots of decision trees that you've already created, it may not be worth the time, since these workflows should fly. It would be faster to do this via an invoke and pass the results as input tokens, but I'm not sure how you invoke with input parameters in the Workflow SDK. (It looks like the SDK hasn't been updated since 8.3?)

1 0
replied on January 22, 2014

Thank you for all the updates. I came to a similar conclusion this morning: I was able to eliminate both lookups using scripts with case statements and some pattern matches. Now that the lookups are gone, I can run 1000 instances simultaneously and be done in about 2 minutes. It seems the more I run simultaneously, the more routes I can squeeze into a single minute, but I am not sure how to solve for the optimal amount. This is only a dual-core server.

0 0
replied on January 22, 2014 Show version history

I'm pretty sure that in Workflow admin / advanced settings you can change how many workflows can run at the same time per core. I think the default is 4 per core.

 

I have no clue as to what will happen when you change those, though.

 

That's one for Laserfiche. I wish we could tag someone in here and summon @Miruna; she'd be the one to ask.

0 0
replied on January 22, 2014

That sounds about right. If I run them one at a time, it takes only twice as long, about 4 minutes. I actually had to stop trying to run 1000 simultaneously because a strange issue was happening during testing (which was why I ran several tests): it would complete about 995 of them in under one minute, then the last 5 would sometimes sit for up to 10 minutes, stuck on my script. The script is nothing more than a case statement to replace the lookup of the table with only a few rows in it. Even at 1000 per 4 minutes, though, that is 150,000 in one night. Much closer to my goal.

0 0
replied on August 18, 2015

Ooh, summons! No idea how I missed that.

Increasing the number of concurrent workflows to run per CPU would increase the resources used by the server, and it would not necessarily help in this case, since searches are throttled to protect the Laserfiche Server, so these workflows may queue up if the search takes longer. You could change the number of concurrent searches too, but I wouldn't recommend that.

Chris's solution takes advantage of the Workflow architecture by increasing the number of workflow instances while minimizing the number of searches run and keeping the number of iterations in the same workflow low to limit the data tracked to SQL. The limit is 4 actively running concurrent instances per CPU per workflow definition.

So, if you break out the activities inside For Each Entry into their own workflow and invoke it, you are distributing the work and taking advantage of multiple instances. Your search time is the same, but each iteration of For Each Entry is faster since there's only one activity. Each invoked instance is now independent of the rest of the search results, and multiple documents can be processed concurrently.

If you wanted to get even more distributed, you could publish the same workflow definition, say, 5 times and run the search results through a Routing Decision that distributes the ones with entry IDs ending in 1 and 2 to the first workflow, 3 and 4 to the second one, 5 and 6 to the third one, etc.
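That last-digit split could be sketched like this (illustrative only; the published workflow names and the invoke() placeholder are made up, not a real API):

```
def route_by_last_digit(entry_id, invoke):
    """Send IDs ending in 1-2 to copy #1, 3-4 to copy #2, ..., 9-0 to copy #5."""
    last_digit = entry_id % 10
    copy_number = (last_digit + 1) // 2 if last_digit > 0 else 5
    invoke(f"FileDocument_Copy{copy_number}", entry_id)   # hypothetical published copy name
```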

 

 

1 0
