Hi Suzi,
Sorry to hear you're running into frustrating sync failures.
Directory Server AD sync works by running all of the sync rules to fully construct the future user and licensing state in a temporary table. Once that state is complete and validated, it is merged into, or overwrites, the existing user table.
This was a deliberate design decision that involved certain tradeoffs. One of these is, as you've discovered, that any single error causes the whole sync to fail. On the flip side, this approach categorically prevents partial sync failures, which can be far more subtle and messy. In some partial-failure cases, users could gain or lose licenses seemingly at random, including existing users having their licenses removed and "given" to someone else.
Having sync be all or nothing prevents this. If there's an issue with the update, the current state is preserved. A sync failure cannot screw up your existing configuration.
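To make the all-or-nothing idea concrete, here is a minimal Python sketch of the general "stage, validate, then swap" pattern. It is not LFDS's actual code: `run_sync`, the rule functions, and the dictionary standing in for the user table are all hypothetical, and it shows only the overwrite case.

```python
from typing import Callable, Dict, List

UserTable = Dict[str, str]  # hypothetical: username -> license type


def run_sync(
    live_table: UserTable,
    rules: List[Callable[[UserTable], None]],
    validate: Callable[[UserTable], None],
) -> bool:
    """Build the new state in a staging table; touch live_table only on success."""
    staged: UserTable = {}  # stands in for the temp table
    try:
        for rule in rules:
            rule(staged)      # each rule adds users/licenses to the staged state
        validate(staged)      # raises if the staged state is not acceptable
    except Exception:
        return False          # any error: live_table is left exactly as it was
    # Only reached when every rule and the validation succeeded.
    live_table.clear()
    live_table.update(staged)
    return True
```

The point of the pattern is that a failing rule never gets the chance to leave the live table half-updated, which is why one bad account stops the whole sync.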
Regarding your other question: "Also, why isn't there a notification when the synch fails? The only way we know is if someone logs in and checks manually or when we get Helpdesk tickets that new students can't login."
I believe there is an open feature request for email notifications on group sync failures. If I can track it down, I'll update it with a link to this Answers post. More real-world use cases for a feature request are always helpful with prioritization.
Edit 2024-06-18: The feature request is #528394 "Add email notifications for AD group sync errors".
In the meantime, there is something you can do to get proactive notifications, so you don't have to wait for helpdesk tickets to come in to find out that a sync failed.
When LFDS encounters an AD Group Sync failure, it logs one or more Warning/Error event messages in its Windows Event Log (ETW) channel. These messages will have a unique event ID (say "1234") for those event types and/or some common identifying text like "Sync exception". Most third-party server monitoring and alerting tools can be configured to ingest specific ETW channels and to raise an alert when a certain event ID, or a certain combination of event ID and message content, occurs.
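If your monitoring tool lets you script the matching step, the filter logic could look roughly like the Python sketch below. Everything in it is an assumption for illustration: `EventRecord` and `find_sync_failures` are made-up names, and the event ID 1234 and the text "Sync exception" are the same placeholders as above, not the real values LFDS logs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass
class EventRecord:
    """Hypothetical shape of one event already read from the log channel."""
    event_id: int
    level: str      # e.g. "Information", "Warning", "Error"
    message: str


def find_sync_failures(
    events: Iterable[EventRecord],
    event_ids: FrozenSet[int] = frozenset({1234}),  # placeholder ID
    needle: str = "Sync exception",                 # placeholder text
) -> List[EventRecord]:
    """Return Warning/Error events matching a known ID or the identifying text."""
    hits = []
    for ev in events:
        if ev.level not in ("Warning", "Error"):
            continue  # ignore informational noise
        if ev.event_id in event_ids or needle.lower() in ev.message.lower():
            hits.append(ev)
    return hits
```

Anything returned by `find_sync_failures` would be what you alert on (email, ticket, chat message). How the events are read from the channel depends on your monitoring tool and is left out here.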
It may also be worth pursuing why these "bad accounts" are showing up "a few times a month" and seeing if there are proactive ways to address those at the source. Perhaps there could be a script that runs daily (before the LFDS sync) to check the relevant AD groups for any "bad accounts" and remove them.
If you know what identifies a "bad account", you need not wait for LFDS sync to trip up on it to take corrective action (manual or automated).
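As a rough illustration of such a daily pre-sync check, here is a small Python sketch. It assumes you have already pulled the group memberships out of AD into plain dictionaries, and it uses "missing email address" purely as a made-up example of what a "bad account" might be. Substitute whatever actually trips up your sync. `find_bad_accounts` and `missing_email` are hypothetical names.

```python
from typing import Callable, Dict, List

Account = Dict[str, str]  # AD attributes for one account, e.g. sAMAccountName, mail


def missing_email(account: Account) -> bool:
    """Example criterion only (an assumption): flag accounts with no mail attribute."""
    return not account.get("mail")


def find_bad_accounts(
    group_members: Dict[str, List[Account]],
    is_bad: Callable[[Account], bool],
) -> Dict[str, List[str]]:
    """Map each group name to the account names in it that match the criterion."""
    flagged: Dict[str, List[str]] = {}
    for group, members in group_members.items():
        bad = [m["sAMAccountName"] for m in members if is_bad(m)]
        if bad:
            flagged[group] = bad  # only groups with at least one bad account
    return flagged
```

A scheduled job could run this before the LFDS sync and either report the flagged accounts or remove them from the groups, depending on how much you want to automate.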
Hopefully that helps with understanding why the sync behavior is the way it is, and gives you something actionable to pursue with getting proactive sync failure notifications and addressing the source of said failures.
Best,
Sam