At nVoq, we understand that SayIt service availability is critical to your daily workflow, and we apologize for the healthcare.nvoq.com service interruption on Friday May 31 and the service slowdowns on Monday June 3 and Tuesday June 4. A system update on May 30 introduced a bug that caused SayIt servers to run out of memory and crash under very specific conditions. We have deployed a new version of the SayIt software containing a fix, and have expanded our testing to try to expose this type of bug before it reaches production in the future. Below is our root cause analysis of the incident. If you have any questions or comments, please contact us at firstname.lastname@example.org.
nVoq’s deployment of version 14.1 of the nVoq platform was completed at 4:34 PM MDT, Thursday May 30, 2019. This version of the software contained a previously undetected defect that caused excessive memory consumption on the application servers, leading to slow performance and occasional failed dictations and user-administration transactions. Signs of performance degradation began Friday May 31 at 1:53 PM MDT. The nVoq DevOps team began investigating reports of performance problems and reviewing system monitoring data. Normal system performance was restored by 4:00 PM MDT.
There were no issues on Saturday June 1 or Sunday June 2. Investigation by the DevOps team continued over the weekend.
By Monday morning, it was known that memory usage on the application servers was part of the problem and that restarting the application servers before they reached their memory limit resolved the issue. However, the root cause was still unknown. As a temporary workaround, the DevOps team added monitoring and lowered alert thresholds to detect this new set of conditions. A pool of spare application servers was created so that production servers could be replaced immediately, before they reached their memory limit. To minimize user impact, production servers were manually replaced with servers from this pool when programmatic alerts indicated high memory use. This continued until the patch was deployed.
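The workaround described above can be illustrated with a minimal sketch. This is not nVoq's actual tooling, and the replacements during the incident were performed manually. The `ServerPool` class, the function names, the server names, and the 0.75 threshold are all hypothetical, chosen only to show the decision logic: find production servers at or above a lowered alert threshold and swap each one for a spare before it reaches its memory limit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ServerPool:
    """Hypothetical model of production servers plus a pool of spares."""
    production: Dict[str, float]  # server name -> fraction of memory limit in use
    spares: List[str] = field(default_factory=list)


def servers_over_threshold(pool: ServerPool, alert_threshold: float) -> List[str]:
    """Return production servers at or above the threshold, highest usage first."""
    return sorted(
        (name for name, used in pool.production.items() if used >= alert_threshold),
        key=lambda name: pool.production[name],
        reverse=True,
    )


def replace_from_spares(pool: ServerPool, alert_threshold: float) -> List[Tuple[str, str]]:
    """Swap each alerting server for a spare while spares remain.

    Returns the (retired, replacement) pairs. A fresh spare is assumed to
    start with negligible memory use, recorded here as 0.0.
    """
    swaps = []
    for name in servers_over_threshold(pool, alert_threshold):
        if not pool.spares:
            break  # no spares left; remaining servers need manual attention
        replacement = pool.spares.pop(0)
        del pool.production[name]
        pool.production[replacement] = 0.0
        swaps.append((name, replacement))
    return swaps


if __name__ == "__main__":
    # Hypothetical readings: two of three servers are above the lowered threshold.
    pool = ServerPool(
        production={"app1": 0.91, "app2": 0.40, "app3": 0.78},
        spares=["spare1", "spare2"],
    )
    for retired, replacement in replace_from_spares(pool, alert_threshold=0.75):
        print(f"replace {retired} with {replacement}")
```

In this sketch the server closest to its memory limit is replaced first, so that a limited spare pool is spent on the servers most at risk.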
Memory data captured Tuesday morning from the application servers led to the identification of the root cause. The defect was then reproduced on nVoq’s test systems while a solution was developed, and a patch was then built. The build was tested internally to verify the fix. On Tuesday evening, the patch was applied to eval.nvoq.com. On Wednesday morning, further testing was performed on eval.nvoq.com by both nVoq employees and some ISV partners.
The patch was deployed on healthcare.nvoq.com Wednesday June 5 at 1:12 PM MDT. Monitoring has validated that the defect and resulting performance degradation were resolved by the patch. It is estimated that up to 10% of dictations were impacted by slowness or failure conditions, with the greatest impact occurring Friday afternoon. Canadian and Agent Assist customers were not impacted.