Incident Report #20170801
Incident Date: August 1, 2017 11:35AM PDT
Outage Incident Report
On August 1, 2017, between 11:35 AM and 11:55 AM PDT, Lambda’s Shared Professional Cluster environment was offline.
Users who were logged in when the incident began initially saw a 502 Bad Gateway error when performing an action on the site. After that, any user attempting to access the site received a Site Cannot Be Reached message from their browser.
The downtime was the result of an incorrect web server (nginx) configuration that was loaded during the creation of a new auto-scaler unit.
All times below are in Pacific Daylight Time.
- 11:35 AM incident begins
- 11:36 AM initial client tickets begin to be submitted
- 11:37 AM monitoring software provides warning of shared sites showing as offline
- 11:39 AM Support confirms that multiple sites are offline
- 11:39 AM System Operations begins investigating the affected environment
- 11:47 AM location of problem within auto-scaling unit detected
- 11:49 AM cause isolated to the nginx configuration within the auto-scaling unit
- 11:50 AM auto-scaling unit with issue stopped
- 11:51 AM nginx configuration updated to remove problem
- 11:52 AM auto-scaling unit(s) restarted
- 11:52 AM support provides incident ticket
- 11:53 AM sites begin to come back online
- 11:55 AM monitoring software reports all sites back online
- 11:55 AM incident ends
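The timeline above implies a 20-minute outage. The following is a rough sketch of how that duration translates into a monthly availability figure. It assumes availability is measured per calendar month and that this was the only outage in August 2017; neither assumption comes from the report, and the function name `monthly_availability` is hypothetical.

```python
import calendar
from datetime import datetime


def monthly_availability(start, end):
    """Return (downtime in minutes, availability % for the calendar month of `start`)."""
    downtime_min = (end - start).total_seconds() / 60
    days_in_month = calendar.monthrange(start.year, start.month)[1]
    total_min = days_in_month * 24 * 60
    availability_pct = 100 * (1 - downtime_min / total_min)
    return downtime_min, availability_pct


# Incident start and end times taken from the timeline (PDT).
downtime, pct = monthly_availability(
    datetime(2017, 8, 1, 11, 35),
    datetime(2017, 8, 1, 11, 55),
)
print(f"{downtime:.0f} minutes of downtime, {pct:.3f}% availability")
# prints "20 minutes of downtime, 99.955% availability"
```

Any actual SLA impact depends on the contract terms, which may define the measurement window or exclusions differently.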
Number of Incidents
Impact on Uptime SLA
Resolution and recovery
Lambda’s Operations team found, within the web server (nginx) configuration, a configuration file (.conf) belonging to a client that is no longer hosting with Lambda. This file references that client’s folder on the web server.
Because this folder had been deleted, the web server was not able to complete its startup procedures and come online during the auto-scaling unit creation. This had a cascade effect that caused the auto-scaling unit to fail, which in turn affected the management unit of the hosting environment, leading to the unavailability of all LMS sites on this environment.
The Operations team isolated the auto-scaling unit and stopped it. The configuration of the web server was updated to not include the client configuration file and a new auto-scaling unit was started.
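A stale reference of this kind can be caught before a new unit is created. The sketch below scans a directory of `.conf` files and reports paths that no longer exist on disk. The report does not say which directive pointed at the deleted folder, so the directive list, the function name `find_missing_paths`, and the one-directive-per-line assumption are all illustrative rather than a description of Lambda’s tooling.

```python
import re
from pathlib import Path

# Hypothetical selection of nginx directives whose first argument is a filesystem path.
PATH_DIRECTIVES = ("root", "alias", "access_log", "error_log",
                   "ssl_certificate", "ssl_certificate_key")
LOG_DIRECTIVES = ("access_log", "error_log")

# Captures the directive name and its first argument; assumes each directive
# starts on its own line.
DIRECTIVE_RE = re.compile(
    r"^\s*(" + "|".join(PATH_DIRECTIVES) + r")\s+([^;\s]+)[^;]*;",
    re.MULTILINE,
)


def find_missing_paths(conf_dir):
    """Return (conf file name, directive, path) for every referenced path that is missing."""
    missing = []
    for conf in sorted(Path(conf_dir).glob("*.conf")):
        for name, value in DIRECTIVE_RE.findall(conf.read_text()):
            value = value.strip("\"'")
            # Skip values such as "off", relative paths, and paths built from variables.
            if not value.startswith("/") or "$" in value:
                continue
            path = Path(value)
            # Log files are created at startup, so only their directory must exist.
            required = path.parent if name in LOG_DIRECTIVES else path
            if not required.exists():
                missing.append((conf.name, name, value))
    return missing
```

Calling `find_missing_paths("/etc/nginx/conf.d")` (the directory is an assumption) returns an empty list when every referenced path is present. Running `nginx -t`, which tests the configuration without starting the server, on the image used for new units is a complementary check.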
Corrective and Preventative Measures
The client folder was deleted from the hosting environment during the night between July 31, 2017 and August 1, 2017. The procedure to fully update the hosting environment’s configuration for this change was not completed.
Lambda’s Director of IT will review with the Operations team the steps required to complete a client cancellation on an auto-scaling environment.