
Question

Can anything be done to maintain connectivity when AWS servers go down for maintenance?

asked on October 31, 2024

I have a client that is hosting their on-prem Laserfiche in AWS. They are using an FSx file share for all repository and volume files. On the first Monday of the month, users are unable to access the system and we have to reboot the server hosting LF Server to restore access (restarting the service does not do the trick). In our research, we found that server maintenance for the FSx file share happens on the last Friday of the month at 10pm EST, and in Event Viewer we see the following in the Admin SearchEngine logs at the time of the maintenance window:

 

"Failed to flush "\\<server>fx.domain.local\LaserficheRepositories\SEARCH\lfqueue0.idx\0.idx". Error 53: The network path was not found",

 

However, the path clearly exists when navigating to it on the server, and it's working at all other times.

For the time being, the client is going to schedule a reboot of the LF Server early Saturday morning, but I don't think this should be necessary. There should be some sort of self-recovery mechanism: servers require maintenance from time to time, and that shouldn't mean the LF Server needs to be rebooted.

The maintenance window for Laserfiche is set from 1am-3am, so there shouldn't be any collision with the AWS server's maintenance. I'm open to suggestions on how to resolve this without needing to reboot the server. 

Thanks in advance.


Replies

replied on October 31, 2024

My recommendation would be to move the full-text search catalog, audit logs, and SQL databases (if applicable) to EBS storage instead of FSx. The error you point out relates only to full-text search, which a simple restart of the Full-Text Search service should resolve.

replied on October 31, 2024

I will second that network file storage that is periodically unavailable for up to ~20-minute periods is not a good fit for audit logs, search catalogs, and SQL databases.

Laserfiche Repository Server will halt operation if it cannot access the currently active audit log file. I understand this to be a deliberate safety mechanism. Laserfiche audit logs track user actions and system changes to ensure accountability and traceability, which are critical for regulatory and policy compliance. If Laserfiche Repository Server is unable to log audit events for actions as they happen, it will prevent actions from happening. There is no "catch up on writing the audit logs later" mechanism, because there's no way to guarantee it will be able to write them out later, which could result in unaudited events, which is not permissible. 
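To make the "fail closed" idea concrete, here is a minimal Python sketch. It is not Laserfiche's actual implementation, and the names `perform_audited` and `AuditUnavailable` are hypothetical. It only illustrates the general pattern described above: the audit record must be written durably first, and if that write fails, the action is never attempted.

```python
import os


class AuditUnavailable(Exception):
    """Raised when the audit record cannot be written (hypothetical name)."""


def perform_audited(action, audit_path, event):
    """Write the audit event durably first, then run the action.

    Illustrative fail-closed pattern only; not how Laserfiche is implemented.
    """
    try:
        with open(audit_path, "a", encoding="utf-8") as log:
            log.write(event + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the record to storage before proceeding
    except OSError as exc:
        # Storage unreachable (for example, the share is offline): refuse the action.
        raise AuditUnavailable(f"cannot write audit log: {exc}") from exc
    return action()
```

With this shape, an unreachable audit location blocks every audited operation until the storage is back, which matches the lock-up behavior seen during the maintenance window.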

It's helpful to think of Laserfiche Full-Text Search catalog files as similar to SQL data files, where data in memory is flushed to the files for persistence. Inability to flush data to catalog files is generally bad and could have adverse effects, such as catalog corruption (requiring re-indexing the repository to rebuild the catalog).

Reviewing the "Using FSx for Windows File Server with Microsoft SQL Server" documentation is insightful:

Using Amazon FSx for Active SQL Server Data Files

Microsoft SQL Server can be deployed with an SMB file share as the storage option for active data files. Amazon FSx is optimized to provide shared storage for SQL Server databases by supporting continuously available (CA) file shares. These file shares are designed for applications like SQL Server that require uninterrupted access to shared file data. While you can create CA shares on Single-AZ 2 file systems, it is required that you use CA shares on Multi-AZ file systems for all SQL Server deployments, whether HA or not.

The Amazon FSx for Windows File Server documentation, under "Administering FSx for Windows file systems - File system maintenance windows", states the following:

Amazon FSx for Windows File Server performs routine software patching for the Microsoft Windows Server software that it manages. The maintenance window lets you control the day and time of the week when software patching occurs. You choose the maintenance window during file system creation. If you have no time preference, a 30-minute default window is assigned.

FSx for Windows File Server lets you adjust your maintenance window to accommodate your workload and operational requirements. You can move your maintenance window as frequently as required, provided that a maintenance window is scheduled at least once every 14 days. If a patch is released and you haven’t scheduled a maintenance window within 14 days, FSx for Windows File Server proceeds with maintenance on the file system to ensure its security and reliability. For more information about how to adjust your file system's maintenance window, see Changing the weekly maintenance window.

While patching is in progress, expect your Single-AZ file systems to be unavailable, typically for less than 20 minutes. Your Multi-AZ file systems remain available and automatically fail over and fail back between the preferred file server and the standby file server. For more information, see Failover process for FSx for Windows File Server. Because patching for Multi-AZ file systems involves failover and failback, any traffic to the file system during this time must be synchronized between the preferred file server and the standby file server. To reduce patching time, we recommend scheduling your maintenance window during idle periods when there's minimal load on your file system.
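To check when the next FSx maintenance window falls relative to other scheduled work, a small helper can do the date arithmetic. This is a sketch under assumptions: as I understand the FSx API, the weekly maintenance start time is a string in `D:HH:MM` form, in UTC, where `D` is 1 for Monday through 7 for Sunday. The 30-minute duration is taken from the default window mentioned above, and the function names are my own.

```python
from datetime import datetime, timedelta, timezone


def next_fsx_window(weekly_start, now, duration_minutes=30):
    """Return (start, end) in UTC of the next weekly window ending up after `now`.

    `weekly_start` is assumed to be "D:HH:MM" in UTC with Monday = 1.
    `now` must be a timezone-aware datetime.
    """
    day_str, hour_str, minute_str = weekly_start.split(":")
    day, hour, minute = int(day_str), int(hour_str), int(minute_str)
    if not (1 <= day <= 7 and 0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"invalid weekly start time: {weekly_start!r}")
    now = now.astimezone(timezone.utc)
    # isoweekday() also uses Monday = 1 ... Sunday = 7.
    days_ahead = (day - now.isoweekday()) % 7
    start = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=minute, second=0, microsecond=0
    )
    if start < now:
        start += timedelta(days=7)  # this week's window start already passed
    return start, start + timedelta(minutes=duration_minutes)


def windows_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open time ranges [start, end) intersect."""
    return a_start < b_end and b_start < a_end
```

For example, Friday 10pm EST (taking EST as UTC-5) is Saturday 03:00 UTC, which would be written as `6:03:00`. Comparing that window against the Laserfiche maintenance window, once both are converted to UTC, confirms whether they actually collide.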

So, one solution may be changing the FSx for Windows File Server deployment from Single-AZ to Multi-AZ. This of course has cost implications. Multi-AZ deployments cost roughly twice as much as Single-AZ since there's a second set of compute and storage resources provisioned.

Alternatively, you could consider two separate FSx for Windows File Server deployments:

  1. Large Single-AZ deployment for repository volumes, which are presumably the majority of the storage usage.
    1. General Laserfiche Repository Server operation is not especially impacted by loss of access to a volume, outside of operations that touch files in those volumes. Not usually a problem in the middle of the night on a weekend unless you have automated business processes that run then.
  2. Smaller Multi-AZ deployment for Laserfiche Full-Text Search catalogs and repository audit logs. This would likely benefit from making it a continuously available (CA) share, as is recommended for SQL Server. CA shares leverage an SMB 3.0 feature called SMB Transparent Failover. Microsoft and other documentation indicate that CA shares are best suited for use cases with smaller numbers of active files where the application does not tolerate loss of connectivity well. That description applies to Laserfiche search catalogs and audit logs.

    There are recommendations against using CA shares for "general purpose file server usage" as there appear to be meaningful performance penalties with large numbers of files, like Laserfiche repository volumes have.

 

Or just use EBS volumes and avoid this headache entirely. I most commonly see Amazon FSx for Windows File Server deployed in the Multi-AZ config with Laserfiche to address "highly available storage" requirements, especially to provide shared storage for Windows Server Failover Clustered Laserfiche Repository Server & Full-Text Search instances.

If you're having this maintenance downtime issue in the first place, I suspect that means it's a single-AZ Laserfiche architecture (nothing wrong with that) with a Single-AZ FSx deployment. I would always choose local EBS volumes over FSx for a single-AZ solution architecture. It is a simpler deployment architecture with lower operational complexity, with no meaningful loss of capabilities or functionality vs a Single-AZ FSx deployment that I'm aware of.
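If you want to confirm the deployment type rather than infer it, the FSx `DescribeFileSystems` API reports it. Below is a minimal Python sketch that scans a response of the shape returned by boto3's `describe_file_systems()` and lists the Windows file systems that are Single-AZ. The function name and the sample IDs in the usage note are hypothetical, and pagination (`NextToken`) is left out for brevity.

```python
def single_az_windows_file_systems(response):
    """List FSx for Windows file systems whose deployment type is Single-AZ.

    `response` is assumed to have the shape returned by
    boto3.client("fsx").describe_file_systems().
    """
    flagged = []
    for fs in response.get("FileSystems", []):
        win = fs.get("WindowsConfiguration")
        if not win:
            continue  # not an FSx for Windows File Server file system
        deployment = win.get("DeploymentType", "")
        if deployment.startswith("SINGLE_AZ"):  # SINGLE_AZ_1 or SINGLE_AZ_2
            flagged.append(
                {
                    "id": fs.get("FileSystemId"),
                    "deployment": deployment,
                    "maintenance_start": win.get("WeeklyMaintenanceStartTime"),
                }
            )
    return flagged
```

Any file system this returns will be unavailable while it is being patched, so its `maintenance_start` value tells you when to expect the outage.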

replied on November 1, 2024

Thanks for the info, Samuel. From the sound of this, if we were to move the audit logs from the FSx file share to the local server and leave the search catalogs on the FSx file share, do you think that would solve the issue? From what I'm reading, it's really the inability to write audit logs that is causing everything repository-wise to lock up in a protective way.

replied on November 4, 2024

You're welcome, Michael. Moving the audit logs from the FSx share to an EBS volume on the local server would stop the Laserfiche Repository Server from halting due to audit write failure.

However, I strongly advise moving the search catalogs locally too. You have two problems with the current storage setup, audit logs and search catalogs; solve both, not just the more visible one. 

As described in my response above, you should treat them like SQL Server data files. You really do not want intermittent write failures for those where the Full-Text Search service cannot flush data in memory. That's asking for all sorts of issues, from subtle, like subsets of content not indexed, to less subtle, like search index corruption and some or all full-text searches failing. The Full-Text Search service does various search catalog optimization and "housekeeping" operations when it's not actively serving search queries. Don't assume that there's no Full-Text Search service activity during the middle of the night on a weekend just because there isn't end user activity. It could still be negatively impacted by loss of connectivity to the FSx share at that time.
