Can anything be done to maintain connectivity when AWS servers go down for maintenance?

replied on October 31, 2024

My recommendation would be to move the full-text search catalog, audit logs, and SQL databases (if applicable) to EBS storage instead of FSx. The error you point out relates only to full-text search, which a simple restart of the Full-Text Search service should resolve.

replied on October 31, 2024

I will second that network file storage that is periodically unavailable for up to ~20-minute periods is not a good fit for audit logs, search catalogs, and SQL databases.

Laserfiche Repository Server will halt operation if it cannot access the currently active audit log file. I understand this to be a deliberate safety mechanism. Laserfiche audit logs track user actions and system changes to ensure accountability and traceability, which are critical for regulatory and policy compliance. If Laserfiche Repository Server is unable to log audit events for actions as they happen, it will prevent actions from happening. There is no "catch up on writing the audit logs later" mechanism, because there's no way to guarantee it will be able to write them out later, which could result in unaudited events, which is not permissible.
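To illustrate the "no audit record, no action" behavior described above, here is a minimal Python sketch of a fail-closed audit pattern. It is not Laserfiche's implementation; the function names, the JSON record layout, and the `AuditUnavailable` exception are all hypothetical and exist only to show the ordering: persist the audit record first, and refuse the action if that write fails.

```python
import json
import os
import time


class AuditUnavailable(Exception):
    """Raised when the audit record cannot be persisted (hypothetical name)."""


def write_audit_record(log_path, event):
    """Append one event to the audit log and force it to stable storage."""
    try:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(event) + "\n")
            log.flush()
            os.fsync(log.fileno())  # do not trust buffered writes
    except OSError as exc:
        # The storage is unreachable or read-only: fail closed.
        raise AuditUnavailable(str(exc)) from exc


def perform_audited(log_path, user, action_name, action):
    """Run action() only after its audit record is safely written."""
    event = {"ts": time.time(), "user": user, "action": action_name}
    write_audit_record(log_path, event)  # raises before action() can run
    return action()
```

If the log lives on a share that is down for patching, `write_audit_record` raises and the action never runs, which mirrors why the whole repository appears to lock up rather than just audit logging.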

It's helpful to think of Laserfiche Full-Text Search catalog files as similar to SQL data files, where data in memory is flushed to the files for persistence. Inability to flush data to catalog files is generally bad and could have adverse effects, such as catalog corruption (requiring re-indexing the repository to rebuild the catalog).

The AWS documentation "Using FSx for Windows File Server with Microsoft SQL Server" is insightful:

Using Amazon FSx for Active SQL Server Data Files

Microsoft SQL Server can be deployed with an SMB file share as the storage option for active data files. Amazon FSx is optimized to provide shared storage for SQL Server databases by supporting continuously available (CA) file shares. These file shares are designed for applications like SQL Server that require uninterrupted access to shared file data. While you can create CA shares on Single-AZ 2 file systems, it is required that you use CA shares on Multi-AZ file systems for all SQL Server deployments, whether HA or not.

The Amazon FSx for Windows File Server documentation, under "Administering FSx for Windows file systems - File system maintenance windows", states the following:

Amazon FSx for Windows File Server performs routine software patching for the Microsoft Windows Server software that it manages. The maintenance window lets you control the day and time of the week when software patching occurs. You choose the maintenance window during file system creation. If you have no time preference, a 30-minute default window is assigned.

FSx for Windows File Server lets you adjust your maintenance window to accommodate your workload and operational requirements. You can move your maintenance window as frequently as required, provided that a maintenance window is scheduled at least once every 14 days. If a patch is released and you haven’t scheduled a maintenance window within 14 days, FSx for Windows File Server proceeds with maintenance on the file system to ensure its security and reliability. For more information about how to adjust your file system's maintenance window, see Changing the weekly maintenance window.

While patching is in progress, expect your Single-AZ file systems to be unavailable, typically for less than 20 minutes. Your Multi-AZ file systems remain available and automatically fail over and fail back between the preferred file server and the standby file server. For more information, see Failover process for FSx for Windows File Server. Because patching for Multi-AZ file systems involves failover and failback, any traffic to the file system during this time must be synchronized between the preferred file server and the standby file server. To reduce patching time, we recommend scheduling your maintenance window during idle periods when there's minimal load on your file system.
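If you stay on FSx, moving the maintenance window to a genuinely idle period is the cheapest first step. The sketch below builds the request for the FSx `UpdateFileSystem` API, which takes the weekly window as a `d:HH:MM` string in UTC with 1 = Monday through 7 = Sunday. The helper names and the file system ID are hypothetical, and the actual boto3 call is shown only as a comment, so treat this as a starting point rather than a tested deployment script.

```python
# Day-of-week numbers as used by FSx: 1 = Monday ... 7 = Sunday.
DAYS = {"mon": 1, "tue": 2, "wed": 3, "thu": 4, "fri": 5, "sat": 6, "sun": 7}


def weekly_maintenance_start(day, hour, minute=0):
    """Format a weekly maintenance start time as 'd:HH:MM' (UTC)."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"{DAYS[day]}:{hour:02d}:{minute:02d}"


def build_update_request(file_system_id, day, hour, minute=0):
    """Build keyword arguments for the FSx UpdateFileSystem call."""
    return {
        "FileSystemId": file_system_id,
        "WindowsConfiguration": {
            "WeeklyMaintenanceStartTime": weekly_maintenance_start(day, hour, minute)
        },
    }


# Hypothetical file system ID; Sunday 03:00 UTC.
request = build_update_request("fs-0123456789abcdef0", "sun", 3)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("fsx").update_file_system(**request)
```

This only changes when the outage happens. A Single-AZ file system is still unavailable while it is being patched.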

So, one solution may be changing the FSx for Windows File Server deployment from Single-AZ to Multi-AZ. This of course has cost implications. Multi-AZ deployments cost roughly twice as much as Single-AZ since there's a second set of compute and storage resources provisioned.

Alternatively, you could consider two separate FSx for Windows File Server deployments:

1. Large Single-AZ deployment for repository volumes, which are presumably the majority of the storage usage. General Laserfiche Repository Server operation is not especially impacted by loss of access to a volume, outside of operations that touch files in those volumes. That is not usually a problem in the middle of the night on a weekend unless you have automated business processes that run then.
2. Smaller Multi-AZ deployment for Laserfiche Full-Text Search catalogs and repository audit logs. This would likely benefit from making it a continuously available (CA) share, as is recommended for SQL Server. CA shares leverage an SMB 3.0 feature called SMB Transparent Failover. Microsoft and other documentation indicate CA shares are best suited for use cases with smaller numbers of active files where the application does not tolerate loss of connectivity well. That description applies to Laserfiche search catalogs and audit logs.

There are recommendations against using CA shares for "general purpose file server usage" as there appear to be meaningful performance penalties with large numbers of files, like Laserfiche repository volumes have.
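To get a rough feel for why the split layout can be attractive, here is a back-of-the-envelope comparison in Python. It assumes only the "roughly twice" Multi-AZ factor mentioned above and counts cost in relative units per GB; the storage sizes are made up for illustration, and real FSx pricing has other dimensions (throughput capacity, backups, region) that this ignores.

```python
MULTI_AZ_FACTOR = 2.0  # assumption: Multi-AZ costs about twice Single-AZ per GB


def relative_cost(single_az_gb, multi_az_gb, factor=MULTI_AZ_FACTOR):
    """Storage cost in relative units, where 1 unit = 1 GB of Single-AZ."""
    return single_az_gb + multi_az_gb * factor


# Hypothetical sizes: 2000 GB of repository volumes,
# 100 GB of search catalogs and audit logs.
all_single_az = relative_cost(2100, 0)   # everything on Single-AZ
all_multi_az = relative_cost(0, 2100)    # everything on Multi-AZ
split = relative_cost(2000, 100)         # volumes Single-AZ, rest Multi-AZ

print(all_single_az, all_multi_az, split)  # prints 2100.0 4200.0 2200.0
```

Under these assumptions the split costs only slightly more than all Single-AZ while protecting the files that cannot tolerate an outage.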

Or just use EBS volumes and avoid this headache entirely. I most commonly see Amazon FSx for Windows File Server deployed in the Multi-AZ config with Laserfiche to address "highly available storage" requirements, especially to provide shared storage for Windows Server Failover Clustered Laserfiche Repository Server & Full-Text Search instances.

If you're having this maintenance downtime issue in the first place, I suspect that means it's a single-AZ Laserfiche architecture (nothing wrong with that) with a Single-AZ FSx deployment. I would always choose local EBS volumes over FSx for a single-AZ solution architecture. It is a simpler deployment architecture with lower operational complexity, with no meaningful loss of capabilities or functionality vs a Single-AZ FSx deployment that I'm aware of.

replied on November 1, 2024

Thanks for the info, Samuel. From the sound of this, if we were to move the audit logs from the FSx file share to the local server and leave the search catalogs on the FSx file share, do you think that would solve the issue? From what I'm reading, it's really the inability to write audit logs that is causing everything repository-wise to lock up in a protective way.

replied on November 4, 2024

You're welcome, Michael. Moving the audit logs from the FSx share to an EBS volume on the local server would stop the Laserfiche Repository Server from halting due to audit write failures.

However, I strongly advise moving the search catalogs locally too. You have two problems with the current storage setup, audit logs and search catalogs; solve both, not just the more visible one.

As described in my response above, you should treat them like SQL Server data files. You really do not want intermittent write failures for those where the Full-Text Search service cannot flush data in memory. That's asking for all sorts of issues, from subtle, like subsets of content not indexed, to less subtle, like search index corruption and some or all full-text searches failing. The Full-Text Search service does various search catalog optimization and "housekeeping" operations when it's not actively serving search queries. Don't assume that there's no Full-Text Search service activity during the middle of the night on a weekend just because there isn't end user activity. It could still be negatively impacted by loss of connectivity to the FSx share at that time.
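If you want evidence of how often the share is actually unwritable before deciding, a small write probe run on a schedule can record it. This is a generic Python sketch, not a Laserfiche or AWS tool; the function name and the example UNC path are hypothetical. It writes a tiny file, forces it to disk, and reports success or the error.

```python
import os
import time
import uuid


def probe_write(directory):
    """Try a small durable write in `directory`.

    Returns (True, elapsed_seconds) on success or (False, error_text) on failure.
    """
    path = os.path.join(directory, f".write-probe-{uuid.uuid4().hex}")
    start = time.monotonic()
    try:
        with open(path, "wb") as probe:
            probe.write(b"probe")
            probe.flush()
            os.fsync(probe.fileno())  # make sure the write really reached storage
        os.remove(path)
    except OSError as exc:
        return False, str(exc)
    return True, time.monotonic() - start


# Example (hypothetical share path):
#   ok, detail = probe_write(r"\\fsx.example.internal\share\SearchCatalogs")
```

Logging the result every minute or so across a maintenance window would show exactly how long the Full-Text Search service is left unable to flush.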
