Alerts are often overlooked
Sometimes a problem can hide in plain sight, especially if the mechanism for warning you about it is something like a regularly scheduled email. If you don't notice the email immediately, or you put off dealing with it, you can forget it's there at all, even if it's sent to you every day.
In this particular case a client came close to losing data for one of their customers. The warning signs were there - one process in particular was sending regular emails about the issue, but it was wrapped up in a scheduled alert that included a bunch of other, more minor issues. Eventually everybody got used to ignoring those emails, and for a long time the only copy of some critical data was sitting in a single place with no backup or recovery option.
The warning signs aren't always obvious
The core of this issue revolved around a migration between an old mechanism for storing files and a new one. The idea was to run both mechanisms in parallel before completely switching over to the new mechanism and getting rid of the old one.
The intention was to run the new mechanism on a completely separate volume, but the new volume was never formatted correctly and was effectively unusable. The application had a fallback where it would revert to older storage mechanisms if a new one failed. This meant that, despite not having a usable new storage mechanism, the application kept working by continuing to use the old, soon-to-be-removed one.
There was one part of the application that noticed the issue and even reported on it, but it was a scheduled process that summarised all manner of routine problems, most of them low severity. The support staff were supposed to be reading and addressing these issues, but some were so minor that they didn't consider them worth dealing with. Eventually they treated the whole report as noise (because almost all of it was), and in doing so lost a critical part of their monitoring infrastructure. That nearly cost their customer a great deal of important data.
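The effect of that fallback can be illustrated with a minimal sketch. This is not the client's code: the class names, the function `save_with_fallback` and the file name are all hypothetical, chosen only to show how a fallback that swallows errors lets every save succeed while the new volume is never used.

```python
class StorageError(Exception):
    """Raised when a storage backend cannot be used."""


class BrokenVolume:
    """Hypothetical stand-in for the new, incorrectly formatted volume."""
    name = "new-volume"

    def save(self, key, data):
        raise StorageError("cannot access volume")


class LegacyStore:
    """Hypothetical stand-in for the old mechanism due to be removed."""
    name = "legacy"

    def __init__(self):
        self.files = {}

    def save(self, key, data):
        self.files[key] = data


def save_with_fallback(backends, key, data):
    """Try each backend in order and return the name of the one that worked."""
    for backend in backends:
        try:
            backend.save(key, data)
            return backend.name
        except StorageError:
            # The failure is swallowed here, so the caller never sees it.
            continue
    raise StorageError("no storage backend available")


legacy = LegacyStore()
used = save_with_fallback([BrokenVolume(), legacy], "invoice.pdf", b"...")
print(used)  # prints "legacy": the save succeeded, but not where intended
```

From the caller's point of view nothing went wrong, which is why the only trace of the problem ended up in a report rather than in front of a person.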
To demonstrate what they were faced with, picture a several-hundred line report, where one of the lines is usually
Storage: OK
but in this particular case happened to be
Storage: Issue: Cannot access storage 4 (IOError)
This was the only clue that anything was wrong.
Alerts are different from reports
There are many factors that can contribute to this sort of issue and plenty of ways to attempt to remedy it. Maybe the scheduled report should have had stronger messaging to reflect the severity of the issues it found, or perhaps the support processes should have been stricter. Maybe the application should have been less tolerant of misconfiguration and less willing to fall back on older mechanisms. Addressing any of these would help, but none of them would address the root of the problem.
Alerts should be treated differently from reports. Scheduled reports, generally, aren't read or looked at until necessary, so wrapping up potentially important information in an otherwise uninteresting report is not a reliable approach.
A different approach - only alert if something is wrong
A much more powerful approach is to have an alert system that really signifies something when it triggers; something that makes everyone get a bit nervous because for the alert to even arrive means that something serious is happening that needs attention. Another factor is to specialise your alerts: rather than a generic 'something wrong' email or 'application error' alert, you should prefer to have specific alerts with detailed information, so that the recipients can decide more easily which to address first.
More directly, your alerts should trigger only when something needs active attention. It's fine to log and record issues and errors but when sending an alert it should really mean 'someone needs to do something about this very soon'.
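One way to picture the split between recording and alerting is a sketch using Python's standard logging module, where everything is kept in a history but only the most severe records reach the channel that interrupts a person. The `ListHandler` class, the logger name and the first message are made up for illustration; a real setup would send the alert somewhere rather than append it to a list.

```python
import logging


class ListHandler(logging.Handler):
    """Keeps messages in a list; stands in for a log file or an alert channel."""

    def __init__(self, level):
        super().__init__(level=level)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("alert_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output away from the root logger

history = ListHandler(logging.DEBUG)    # everything is recorded
alerts = ListHandler(logging.CRITICAL)  # only this level reaches a person
logger.addHandler(history)
logger.addHandler(alerts)

# Made-up routine message: worth recording, not worth interrupting anyone.
logger.warning("3 temporary files are older than 7 days")
# The message from the report: someone needs to act on this soon.
logger.critical("Cannot access storage 4 (IOError)")

print(len(history.messages))
print(alerts.messages)
```

The sketch only routes by severity; deciding which events deserve the top level is still a judgement the application has to make.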
Unfortunately it's difficult for applications and monitoring systems to isolate what needs attention and alert accordingly (especially if something drastically wrong means they aren't able to signal at all).
Invert the alert control - all alerts need attention
Process Warden helps invert the way traditional alerts and reporting work. Instead of making you read scheduled reports or interpret many alert emails to decide whether each error is important, it flips the whole concept on its head and answers the question 'what needs attention?'
By setting up your alerts to monitor for success instead of failure you get notified if any issue causes a failure, regardless of what the nature of the problem is. This means that an alert is only sent when something actively needs attention to be resolved.
In this particular case Process Warden took over the job of reading the report emails, and it doesn't get tired of them or go blind to them. It checks every single email, every day, and as part of those checks it ensures that the report confirms all storage mechanisms are working as expected ('Storage: OK'). If any email on any day does not confirm that the storage mechanisms are working then Process Warden will light up and trigger an alert. This works even if the email is never sent, doesn't contain a 'Storage' line, or says anything other than 'Storage: OK'.
The end result is that the support staff no longer ignore the alert emails, because they generally don't get any. When they do, it's a significant event and gets a lot of attention. On every other day the report emails are still sent and read, but this time by a process that reads the entire thing, checks vigilantly that each one declares 'Storage: OK' (and many other significant lines), and alerts the support staff if anything is not exactly as it should be.
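The idea of checking for success rather than failure can be sketched in a few lines. This is an illustration only, not Process Warden's implementation or API: the function `needs_alert` is hypothetical, and it assumes the report arrives as plain text with one status per line, or as None when no email arrived.

```python
def needs_alert(report_text):
    """Return True unless the report positively confirms 'Storage: OK'.

    report_text is None when no report email arrived at all.
    """
    if report_text is None:
        return True
    lines = [line.strip() for line in report_text.splitlines()]
    storage_lines = [line for line in lines if line.startswith("Storage:")]
    # Anything other than exactly one 'Storage: OK' line counts as a failure.
    return storage_lines != ["Storage: OK"]


print(needs_alert("Queue: OK\nStorage: OK"))                                # False
print(needs_alert("Storage: Issue: Cannot access storage 4 (IOError)"))     # True
print(needs_alert("Queue: OK"))                                             # True
print(needs_alert(None))                                                    # True
```

Because the check never tries to recognise specific failures, a missing email, a missing line and an unexpected error message all lead to the same outcome: an alert.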