healthcare.nvoq.com service interruption
Incident Report for nVoq Public
Postmortem

Incident Summary

Incident Start Date: April 8, 2024, 6:54 AM MDT
Incident End Date: April 8, 2024, 9:00 AM MDT
Duration: Approximately 2 hours and 6 minutes
Affected Services: Admin console, HTTP Dictations, Voice Shortcuts, Websocket Dictations
Service Environment: https://healthcare.nvoq.com

Incident Description

On April 8, 2024, at 6:56 AM MDT, the nVoq operations team began receiving alerts that automated checks of the Healthcare environment were failing. This incident looked similar to previous incidents on Jan 23, 2024 and Mar 25, 2024, in which application servers became unresponsive when third-party token authentication requests were delayed. The normal remediation methods for this type of incident were unavailable because of a power outage at an on-premises data center that hosted the management paths to the application back-end. Alternate access methods were used to update the application, which unintentionally restarted the deployment of all application components and delayed the recovery of the Healthcare system until the re-deployment completed. Once the re-deployment completed, the Healthcare system returned to a normal state.

Incident Timeline

  • April 8, 2024, 6:54 AM MDT: The first alert was received that an automated check was failing on the Healthcare environment.
  • April 8, 2024, 6:58 AM MDT: The operations team began triage of the issue, which was complicated because the normal management paths were down due to a power outage at an on-premises data center.
  • April 8, 2024, 7:26 AM MDT: The operations team began a refresh of the application server instances.
  • April 8, 2024, 7:42 AM MDT: The application server refresh completed, but no change in system behavior was observed.
  • April 8, 2024, 8:04 AM MDT: The operations team began a second application update, which unintentionally re-deployed all application back-end components.
  • April 8, 2024, 9:00 AM MDT: The application re-deployment completed, and the Healthcare system returned to a normal state.

Root Cause Analysis

The root cause of the outage was identified as follows:

Root Cause: Analysis indicates an issue with third-party token authentication processing. Remediation steps taken by the operations team then unintentionally re-deployed the entire application, extending the outage of the Healthcare system.

Contributing Factors: A widespread wind event in Colorado caused extended power outages, which took offline an on-premises data center that nVoq uses to manage the Healthcare system running on AWS. The length of the utility power outage led to a diesel shortage, and the generators powering the data center eventually ran out of fuel. With this data center offline, the operations team was unable to use the regular access methods to restore the Healthcare system. Human error was also a contributing factor: efforts to reconfigure the Healthcare system unintentionally re-deployed the entire application back-end instead of only the web front-end servers. This extended the outage while the re-deployment process completed.

Preventive Measures: We will be updating our application server code to better handle token processing when the third-party token verification servers are unavailable. In addition, we are taking steps to remove all management path dependencies on the Colorado data center.
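
For illustration only, the following Python sketch shows one common way an application server can avoid becoming unresponsive when a remote token verifier is slow or down: bound each verification call with a timeout, and stop calling the verifier for a cooldown period after repeated failures. This is not nVoq's implementation, which this report does not describe. The class name GuardedTokenVerifier, the exception TokenVerifierUnavailable, the verify_remote callable, and all threshold values are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class TokenVerifierUnavailable(Exception):
    """Raised when the remote verifier cannot give an answer in time."""


class GuardedTokenVerifier:
    """Wraps a remote verification callable with a timeout and a simple
    circuit breaker. All names and defaults here are hypothetical."""

    def __init__(self, verify_remote, timeout_s=2.0,
                 failure_threshold=3, cooldown_s=30.0):
        self._verify_remote = verify_remote      # callable: token -> bool
        self._timeout_s = timeout_s
        self._failure_threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._failures = 0
        self._open_until = 0.0                   # monotonic time; 0 = closed
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=4)

    def _record_failure(self):
        with self._lock:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Open the circuit: reject calls until the cooldown passes.
                self._open_until = time.monotonic() + self._cooldown_s
                self._failures = 0

    def verify(self, token):
        with self._lock:
            if time.monotonic() < self._open_until:
                # Fail fast instead of queuing behind a verifier that is down.
                raise TokenVerifierUnavailable("circuit open")
        future = self._pool.submit(self._verify_remote, token)
        try:
            result = future.result(timeout=self._timeout_s)
        except FutureTimeout:
            self._record_failure()
            raise TokenVerifierUnavailable("verification timed out")
        except Exception as exc:
            self._record_failure()
            raise TokenVerifierUnavailable(str(exc)) from exc
        with self._lock:
            self._failures = 0                   # a success resets the count
        return result
```

In this sketch the caller decides what to do with TokenVerifierUnavailable, for example returning an error response promptly rather than holding the request open. A timed-out call still occupies a worker thread until the remote call returns, so a production version would also need a timeout on the underlying network request.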

Conclusion

At nVoq, we understand that our Voice service availability is critical to your daily workflow, and we apologize for the outage that affected the Healthcare environment on April 8, 2024.

Our team will continue to monitor the situation closely and implement any necessary changes to ensure the reliability and stability of the Healthcare environment.

Posted Apr 20, 2024 - 23:07 MDT

Resolved
This incident is now considered resolved.

We will add a root cause analysis to this incident entry once we have concluded our investigation.
Posted Apr 08, 2024 - 14:02 MDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 08, 2024 - 09:08 MDT
Update
We are continuing to investigate this issue.
Posted Apr 08, 2024 - 08:35 MDT
Investigating
We are investigating reports of errors on the healthcare.nvoq.com system.
Posted Apr 08, 2024 - 07:05 MDT
This incident affected: healthcare.nvoq.com (Healthcare: nVoq Voice Administration Portal, Healthcare: Voice Shortcuts, Healthcare: HTTPS Dictations, Healthcare: Websocket Dictations).