
Discussion


How to manage 300M or more records in your Repo

posted on September 13, 2024

Hi all,

I'm planning a system for a customer who will store about 300 million records after about 5 years. To me, that sounds like a TOC table of about 300 million rows (plus folders), and a PROPVAL table of about 3 billion rows. It'll be huge.
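Once the migration starts, I plan to keep an eye on how those two tables actually grow with something along these lines. This is just a sketch and assumes the tables are literally named toc and propval in the dbo schema:

```sql
-- Rough sketch: track row counts of the two big tables over time.
-- Table names follow the estimate above; the dbo schema is assumed.
SELECT t.name      AS table_name,
       SUM(p.rows) AS row_count
FROM sys.tables t
JOIN sys.partitions p
  ON p.object_id = t.object_id
 AND p.index_id IN (0, 1)   -- heap or clustered index rows only
WHERE t.name IN ('toc', 'propval')
GROUP BY t.name;
```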

With MSSQL, I can throw CPUs and RAM at it and it should be fine.

If 99% of the time, documents will be retrieved via EntryID, I expect results to be near instant. 

For the other 1%, users will be performing explicit metadata searches (no wild-cards).

Does anyone else have a single Repository close to this on any version? 

replied on September 13, 2024

@████████might.

Folder:Document ratios are usually higher than people think, so figure you'll have at least 400M toc entries.

You imply there will be an average of 10 metadata field values (propval) per document. At that scale, I would be thinking very hard about any and all opportunities to get away with using fewer fields without loss in business value. Dropping the average fields per doc from 10 to 9 eliminates 300M propval rows, a number that would qualify a repository as "large" by itself. Consolidating multiple single value fields to a multi-valued field doesn't count as "fewer" because each value is still a propval row.
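If you want hard numbers on which fields contribute the most rows before deciding what to consolidate, a query along these lines helps. Treat it as a sketch only: the prop_id column name is an assumption about the propval schema, so verify it against your database (ideally a restored copy) first.

```sql
-- Sketch: which metadata fields account for the most propval rows?
-- prop_id is an assumed column name; confirm against your schema.
SELECT TOP (20)
       prop_id,
       COUNT_BIG(*) AS value_rows
FROM dbo.propval WITH (NOLOCK)   -- read-only; avoid blocking production
GROUP BY prop_id
ORDER BY value_rows DESC;
```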

Do ALL the SQL Server performance optimizations. I like to reference Microsoft's Checklist: Best practices & guidelines - SQL Server on Azure VMs article. Only a few of the guidelines are truly Azure-specific. The rest are either generic or translate easily enough to other hosting environments.

  • Use the absolute fastest ("frequency optimized") latest-generation CPUs available - this has a huge positive performance impact.
  • Put all eight tempdb files on their own super-fast NVMe-backed disk. Searches in large Laserfiche repositories tend to absolutely hammer SQL tempdb.
  • Use only flash-based storage for data (mdf) and log (ldf) files.
  • No less than 128 GB of RAM.
  • Have SQL index and statistics maintenance plans in place from day 1 (a starting-point sketch follows this list).
  • Set up extensive SQL monitoring and alerting. This is a good place to start: Monitor and Tune for Performance - SQL Server | Microsoft Learn
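To make the tempdb and maintenance bullets concrete, here is the kind of baseline check I'd start from. It is only a sketch; production deployments usually run a dedicated maintenance solution (e.g., Ola Hallengren's scripts), and the thresholds below are illustrative.

```sql
-- Sketch: confirm tempdb has the expected number of data files.
SELECT COUNT(*) AS tempdb_data_files
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS';

-- Sketch: list heavily fragmented indexes in the repository database
-- (run in that database; thresholds are illustrative only).
SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent  AS frag_pct
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
  AND ips.page_count > 1000
ORDER BY frag_pct DESC;

-- Refresh statistics database-wide (schedule during off-hours).
EXEC sp_updatestats;
```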

 

You may realistically need SQL Server Enterprise Edition to get past the 128 GB RAM limit on Standard edition.
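A quick way to check what you're actually running on, using only standard server properties (nothing Laserfiche-specific):

```sql
-- Edition and version of the instance.
SELECT SERVERPROPERTY('Edition')        AS edition,
       SERVERPROPERTY('ProductVersion') AS product_version;

-- How much memory SQL Server is currently allowed to use.
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'max server memory (MB)';
```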

Basic Searches (aside from Document Text, which is served by LFFTS rather than SQL) can easily overwhelm the database because they inject a *wildcard* on each end of the search term. I'm not sure offhand whether Laserfiche Web Client 12 will allow enough out-of-the-box Quick Search customization to disable all Basic Search types, but if not, you'll want to implement something like this to disable them: Can Web Client 11 search bar defaults be set? - Laserfiche Answers (I know you're familiar with it since you've commented in that thread; I'm stating it for the benefit of other readers).
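To illustrate why that leading wildcard hurts: a trailing wildcard can use an index seek (if the value column is indexed), while a leading wildcard forces SQL to scan every value. The column name below is purely illustrative, not the actual Laserfiche schema.

```sql
-- Illustrative only; str_val is an assumed column name.
-- Trailing wildcard: can seek an index on the value column.
SELECT COUNT_BIG(*) FROM dbo.propval WHERE str_val LIKE N'invoice%';

-- Leading wildcard: defeats the seek and scans every row.
SELECT COUNT_BIG(*) FROM dbo.propval WHERE str_val LIKE N'%invoice%';
```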

replied on September 16, 2024

I might think of more to add later, but Sam covered a lot of good points; system resources and database maintenance will be crucial.

Our big repository is approaching 60 million document entries, each of which has at least 20 different metadata fields, and it does indeed have a compounding effect.

In addition to resource allocation, some other things to consider are folder structure, column displays in the client, update activities, and audit trail.

  • To Sam's point about folder count estimates, in my experience it is best to limit the number of entries within a folder for optimal client performance. For example, I started splitting things up into more subfolders because navigation can get very sluggish with 1,000 or more entries in a single folder (see next item; a query for spotting oversized folders is sketched after this list).
  • Additionally, the fewer columns displayed in the client the better, especially calculated values like Total Document Size; every column in the client has a cost, and that adds up quickly when you have a large database, and even more so when you have a lot of entries in a single folder and/or a large number of search results.
  • If the large size is accompanied by many frequent changes to entries, that can also be taxing on the workflow subscriber, and it can generate a massive amount of data in Audit Trail in addition to the transaction load on the database.
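If you want to find folders that have crept past that kind of threshold, a query along these lines works. The toc/parentid names are assumptions based on what I've seen, so verify your schema and run it read-only (ideally against a restored copy):

```sql
-- Sketch: folders with more than 1,000 child entries.
-- toc / parentid are assumed names; verify the schema first.
SELECT parentid, COUNT_BIG(*) AS child_entries
FROM dbo.toc WITH (NOLOCK)
GROUP BY parentid
HAVING COUNT_BIG(*) > 1000
ORDER BY child_entries DESC;
```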

 

Another thing to be mindful of is volume size; logical volumes are a must, and our big repository is over 100TB at this point:

  • We try to limit each drive to no more than 4-8 TB. The repository isn't really affected by logical volume size, but larger drives can cause headaches for backup and recovery processes.
  • Leave a modest amount of breathing room on the drives; when you update a document with any new pages or electronic files, those will be added to that entry's current volume unless you migrate it, so you never want those older drives to be completely filled (see the free-space check sketched after this list).
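For the drives that host SQL data and log files, you can watch free space from inside SQL Server; the repository volume drives still need OS-level monitoring, since SQL only sees the drives hosting its own files.

```sql
-- Free space on every drive that hosts a SQL data or log file.
SELECT DISTINCT
       vs.volume_mount_point,
       vs.total_bytes     / 1073741824 AS total_gb,
       vs.available_bytes / 1073741824 AS free_gb
FROM sys.master_files mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) vs;
```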

 

Basically, you want to think about the upfront implications like SQL performance and storage/resource optimization, but also plan for the expected level of insert/update activity, long-term growth, and the implications for secondary systems that will interact with this repository.

 

We created a second repository where most of our document editing actually occurs to lessen the burden on the primary repository, i.e., 95% of the editing and workflow activity happens in a sort of 'staging' repository.

We only add/change things in the big one once a document is finalized in that first environment to minimize activity in the big repository; this is not practical in every environment, but I wanted to mention it since it provides important context for our resource allocation and our system load/performance.

 

UPDATE: Thanks for the correction Sam, 4GB would be a painfully small drive lol

replied on September 16, 2024

; )

replied on September 18, 2024

Hi @Samuel Carson and @Jason Smith,

Thanks for your replies. Tagging doesn't seem to work for me but hopefully you see this!

It's all great advice. Some of this was already part of my plan and some of it is new, but having it all together in a single post rocks. 🕺

Sam: The migrated data, which will be between 155 and 225 million records, will contain about 10 attributes each (to make sure we capture data from the previous system). The next 75 to 145 million will contain around 5 attributes each. I'll be keeping my eye on the scope creep!

Jason: 4 GB per volume, got it! They'll fit neatly on all the DVDs we have lying around 😜. We're lucky in that the folder structure isn't intended to be discovery-friendly, so people will have to search, not browse. And at any rate, my filing workflow ensures no folder contains more than 1,000 records. It also means I know the maximum number of folders up front. Also, the documents are edited elsewhere before being stored in LF, so that'll help.

With your current system, do you know how many milliseconds it takes to retrieve a document using a known Entry ID, whether via the Web Client or the .NET Framework API?

Regards,

Ben

replied on September 18, 2024

Hi Ben,

I don't know the exact numbers, but by far the bigger factor in retrieval is the document size. Locating a document via Entry Id is nearly instant since those are indexed primary keys; the speed is more influenced by how long it takes to retrieve the pages and/or electronic file.
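If you ever want to see just the SQL portion of that in isolation, SET STATISTICS TIME/IO shows how small the keyed lookup cost is compared to everything else. The tocid column name here is my assumption for the key column; check your schema before running it.

```sql
-- Sketch: measure only the database cost of a keyed entry lookup.
-- tocid is assumed to be the primary key column; verify before use.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT * FROM dbo.toc WHERE tocid = 123456;   -- some known entry ID
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```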

replied on September 18, 2024

That's great. According to MS, SQL Server can hold 9999 squintillion* records (*under the right conditions), but I'm glad to hear about the positive experience from someone using the same database structure.

replied on September 18, 2024

And according to Laserfiche, a document entry can have up to 2^31-1 (2,147,483,647) pages, at which point the 32-bit signed int data type used for page IDs runs out of values. I wouldn't want to try opening such a document!

As Jason noted, retrieving a document by ID is basically [Network traffic time + SQL query execution time + Laserfiche Server access right checks + file data transfer time]. You can get a baseline by creating an empty entry with no edocs or images associated and testing how long the retrieve call takes to complete from the client's perspective. For real documents, any additional time on top of that baseline would be predominantly from file transfer.

replied on October 9, 2024

Since searching via the UI isn't a priority but reducing the propval size is, can you foresee any issues in making the majority of the fields independent? That way only populated fields make it to propval.

  • Changing the template via the UI, code, or workflow requires extra steps to ensure all attributes are retained.
  • Searching fields via template, in the UI, doesn't display all available fields, so they have to be added to the search manually; however...
  • Of the 15 non-system attributes, about 8 are used for one process, 9 are used for another, and 5 for another. While there is overlap, using independent fields saves 6-7 propval rows per document.
  • I considered using three templates, but they are all essentially the same document and every search should bring back documents from each sub-set. Instead, the one template has 3 fields used in all scenarios. I'm not sure if I'm explaining it well, but so far it's working well.