Incident Report #20170224
Reported by: Marcos Rocha | Incident Manager: Sam McCullough
Incident Date: Feb 24, 2017 PST | Report Date: Mar 6, 2017 PST
Outage
Summary
- A procedure run on a single site in the Shared Server Environment, enrolling thousands of users, caused performance issues.
- The Scheduled Task was unable to complete within its allotted time before a second enrollment process was triggered.
- A backlog of processes built up over time.
- The issue started on Thursday, Feb 23 at 6:00PM PST.
- Site response time was impacted.
- Site accessibility was impacted.
- Sites were moved to new infrastructure located on AWS Canada.
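The backlog described in the summary can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the actual LMS scheduler: it assumes runs are triggered at a fixed interval and execute one at a time in trigger order. The function name and the numbers used are hypothetical.

```python
def simulate_backlog(interval_min, duration_min, triggers):
    """Count unfinished runs at each trigger time.

    Assumption (not from the report): runs are triggered every
    `interval_min` minutes, each takes `duration_min` minutes,
    and they execute strictly one after another.
    """
    backlog = []
    finish_times = []
    free_at = 0  # minute at which all previously queued runs are done
    for i in range(triggers):
        trigger_time = i * interval_min
        # A run starts when triggered, or when earlier runs finish.
        begin = max(trigger_time, free_at)
        free_at = begin + duration_min
        finish_times.append(free_at)
        # Runs (including this one) still unfinished at this trigger.
        pending = sum(1 for f in finish_times if f > trigger_time)
        backlog.append(pending)
    return backlog


# Healthy case: each run finishes before the next trigger.
print(simulate_backlog(60, 30, 4))
# Overloaded case: each run takes longer than the trigger interval.
print(simulate_backlog(60, 90, 8))
```

In the healthy case the count stays at one run per trigger. Once a run takes longer than the interval between triggers, the count of unfinished runs keeps growing, which matches the pattern reported here.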
Timeline
- Thursday, Feb 23 at 6:00PM PST
  - A mistaken procedure on a single LMS enrolled thousands of users
  - The process was unable to complete and blocked other processes from completing
- Friday, Feb 24 at 12:00AM PST
  - An infrastructure issue on Softlayer caused problems with network connectivity
  - Disk I/O on the database instance reached 100%, causing slow responses and longer site response times
- Sunday, Feb 26 at 6:00PM PST
  - Site response times reached levels that triggered warnings
- Monday, Feb 27 at 10:00AM PST
  - Started moving clients to new infrastructure in AWS Canada
- Tuesday, Feb 28 at 7:00AM PST
  - Finished moving clients to new infrastructure in AWS Canada
Resolution and recovery
- The Enrollment Scheduled Task on the identified LMS was reset to its default scheduled time.
- All sites were moved to new infrastructure located on AWS Canada.
- All data was moved without any loss.