Incident Report #20180910
Incident Date: Sept. 10, 2018
Outage Incident Report
Summary
On the morning of September 10, 2018, at approximately 9:29AM PDT, the hosting instance labelled PHP7 (IP address 35.164.206.247) was taken offline. This was the result of a resource problem related to the file-scanning software Clam AntiVirus (ClamAV).
ClamAV began to consume large amounts of system resources, which first appeared as slowness when accessing the LMS. This ultimately led to a state where there were not enough resources to sustain both the antivirus and LMS access.
The Operations team has resolved the resource problem by doubling system resources on the hosting environment. We are continuing to investigate ClamAV and its configuration to resolve this issue fully.
Timeline
All times are listed in Pacific Daylight Time.
- 8:39AM - First reported instance of slowness with LMS instance by single client
- 8:40AM - Lambda support monitoring with System Operations to isolate issue
- 9:15AM - Logs indicate Clam AntiVirus as the source of the problem
- 9:29AM - Instance becomes unresponsive because of ClamAV
- 9:29AM - Instance goes offline
- 9:32AM - Operations team increases resources to instance
- 9:34AM - Instance is rebooted
- 9:39AM - Instance comes online
Statistical evaluations
Number of Incidents | Recovery/Resolution time | Impact on Uptime SLA
1 | 10 mins / 1 h | Yes / 10 minutes downtime
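The figures above can be checked against the timeline. The following is a minimal sketch that assumes the 10 minutes is the span from the instance going offline (9:29AM) to coming back online (9:39AM), and that the 1 hour is the span from the first client report (8:39AM) to recovery. The report does not state these definitions explicitly, so that reading is an assumption.

```python
from datetime import datetime

# Timestamps copied from the Timeline section (all PDT, September 10, 2018).
FMT = "%I:%M%p"
first_report = datetime.strptime("8:39AM", FMT)  # first report of slowness
went_offline = datetime.strptime("9:29AM", FMT)  # instance goes offline
back_online = datetime.strptime("9:39AM", FMT)   # instance comes online

# Assumed definitions: downtime = offline to online,
# resolution = first report to online.
downtime_min = (back_online - went_offline).total_seconds() / 60
resolution_min = (back_online - first_report).total_seconds() / 60

print(f"Downtime: {downtime_min:.0f} minutes")
print(f"First report to recovery: {resolution_min:.0f} minutes")
```

Under those assumptions the script reports 10 minutes of downtime and 60 minutes from first report to recovery, which matches the table.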
Resolution and recovery
- Existing ClamAV processes were terminated
- System Resources (CPU & RAM) were increased
Corrective and Preventative Measures
- Corrective Measures
  - Increase of system resources (CPU & RAM) to allow for restoration of normal functionality to hosting instance
- Preventative Measures
  - Review of ClamAV settings on all instances
  - Adjust ClamAV to run in parallel with normal system operations
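The report does not say which ClamAV settings will be changed. As one illustrative possibility only, a scheduled scan can be started at reduced CPU and I/O priority so that it competes less with LMS traffic. The sketch below assumes a Linux host with the standard `nice` and `ionice` utilities and the `clamscan` command-line scanner. The helper name `build_low_priority_scan` and the scan path are hypothetical, and this is not presented as the Operations team's actual fix.

```python
import shlex


def build_low_priority_scan(path, niceness=19):
    """Return an argv list for a recursive clamscan at reduced priority.

    Hypothetical helper for illustration; it only builds the command
    and does not execute it.
    """
    return [
        "nice", "-n", str(niceness),  # lowest CPU scheduling priority
        "ionice", "-c", "3",          # idle I/O scheduling class (Linux)
        "clamscan", "--recursive", "--infected", path,
    ]


# Hypothetical upload directory; substitute the real scan target.
cmd = build_low_priority_scan("/var/www/lms-uploads")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` from a scheduled job. Lowering priority does not cap memory use, so a review of scanner limits would still be needed alongside it.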