It was 9:30pm on a Saturday. I was halfway through a Netflix binge session when monitoring alerts started to pop up one by one on my phone. Something is down. Pause Netflix and investigate.
What is system reliability?
Outages are inevitable and happen to every software vendor. What matters is how your systems are designed to accommodate unforeseen circumstances.
There are two extremes for how such outages can be handled:
Outages in systems without reliability:
The software company doesn't know that there are errors until customers log support requests.
When errors occur, the impact is widespread and prevents users from doing anything.
It is hard to find the cause of the error due to poor logging.
Once the error is resolved, actions that occurred during the outage period are lost.
Outages in systems with reliability:
The software company receives alerts in real time when errors occur, before customers even know there is a problem.
When errors occur they are isolated and allow other areas to function. For example, Online Enrolments or changes to records can still be processed even if there is a delay in automated tasks or emails actually being sent and delivered.
Tracking down the cause of the error is quick thanks to good logging, so the team can focus on the fix.
Once the error is resolved, actions that occurred during the outage period are queued and will resume. For example, emails, reports, and automated workflows are queued and processed as soon as the service is back online.
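To make the queue-and-resume idea above concrete, here is a minimal sketch in Python. It is not Wisenet's implementation: the class name `OutageTolerantQueue`, the `send` callback, and the use of `ConnectionError` to signal an unavailable service are all assumptions chosen for illustration. A production system would more likely use a durable message queue than an in-memory one.

```python
from collections import deque
from typing import Callable, Deque, List


class OutageTolerantQueue:
    """Holds tasks (e.g. emails) while a downstream service is unavailable."""

    def __init__(self, send: Callable[[str], None]) -> None:
        # `send` is a hypothetical delivery function that raises
        # ConnectionError while the downstream service is down.
        self._send = send
        self._pending: Deque[str] = deque()

    def submit(self, task: str) -> None:
        # Queue first so the task is not lost if delivery fails,
        # then attempt delivery straight away.
        self._pending.append(task)
        self.flush()

    def flush(self) -> int:
        """Deliver queued tasks in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # service still down; keep the task queued
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> List[str]:
        # Snapshot of the tasks still waiting for delivery.
        return list(self._pending)
```

The caller's own work (saving a record, processing an enrolment) completes regardless of whether `send` succeeds. Calling `flush()` once the service is back online delivers whatever accumulated during the outage, in the original order.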
The Wisenet Outage Result
Thankfully Wisenet is designed for reliability. We were alerted in real time and used the alerts to isolate the issue and determine the extent of the outage. The errors were centred on the automation and events services; core user access was not impacted. We were able to review the logs and find the root cause.
The issue was caused by a complexity in how an SSL certificate had been signed with an expiring sub-certificate, which originated with our vendor. We obtained new SSL certificates and pointed all services to them, which immediately resolved the issue. Service health status was continually updated and tooltip alerts were in place within the LRM application.
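A root cause like this one can often be caught early by checking the expiry date of every certificate in a chain, not only the leaf. The sketch below is one way to do that in Python and is not the check Wisenet runs. It assumes the chain has already been collected as a list of dictionaries with a hypothetical `name` label and a `notAfter` string in the format returned by Python's `ssl.SSLSocket.getpeercert()`. How the chain is obtained is left out.

```python
import ssl
from typing import Dict, List

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: float) -> float:
    """Days remaining before `not_after`, relative to the epoch time `now`."""
    # ssl.cert_time_to_seconds parses strings such as
    # "Jan  5 09:34:43 2018 GMT" into seconds since the epoch.
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def expiring_soon(
    chain: List[Dict[str, str]], now: float, warn_days: int = 30
) -> List[str]:
    """Names of certificates in `chain` that expire within `warn_days`."""
    # `name` is a hypothetical label added for this example; `notAfter`
    # mirrors the key used in the dict returned by getpeercert().
    return [
        cert["name"]
        for cert in chain
        if days_until_expiry(cert["notAfter"], now) <= warn_days
    ]
```

Run on a schedule with `now=time.time()`, a non-empty result could feed the same alerting channel used for outages, giving warning well before a certificate lapses.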
All of the above processes are designed to make outages:
Less disruptive and more transparent for customers.
Easier to resolve and less stressful for the Wisenet team.
Each outage is reviewed in a postmortem covering what we learnt about the problem and the process we followed to resolve it. We then determine how we can further improve our outage practices.
Did you know that we publish all of our activities on our Service Health Dashboard? Take a look!