Postmortem
Usually, a postmortem includes the following sections:
Issue Summary (this is often the only part executives read; it contains):
- duration of the outage with start and end times (including time zone)
- what was the impact (what service was down or slow? What were users experiencing? What percentage of users was affected?)
- what was the root cause
Timeline, which contains:
- when the issue was detected
- how was the issue detected (monitoring alert, an engineer noticed something, a customer complained…)
- actions taken (what parts of the system were investigated, what the assumptions were about the root cause of the issue)
- misleading investigation/debugging paths that were taken
- which teams/individuals the incident was escalated to
- how the incident was resolved
Root cause and resolution contains:
- a detailed explanation of what was causing the issue
- a detailed explanation of how the issue was fixed
Corrective and preventative measures containing:
- what are the things that can be improved/fixed (broadly speaking)
- a list of tasks to address the issue (be very specific, like a TODO, example: patch Nginx server, add monitoring on server memory…)
Issue Summary
Duration of Outage: March 24, 2024, from 7:11 AM to 7:32 AM GMT (21 minutes)
Impact: The Apache web server was malfunctioning, causing all websites hosted on the server to return a 500 Internal Server Error. Users attempting to access any website hosted on the server would be unable to view content.
Root Cause: Apache failed to access a file due to incorrect file permissions.
Timeline
- 7:11 AM GMT: An engineer noticed the website was not loading correctly and verified a 500 error using curl.
- 7:12 AM GMT: Initial investigation focused on configuration errors or incorrect permissions for Apache itself.
- 7:15 AM GMT: After no issues were found with Apache configuration, strace was used to track system calls made by the Apache process.
- 7:20 AM GMT: The strace output showed a "permission denied" error on a file-access attempt, leading to the discovery of incorrect file permissions.
- 7:25 AM GMT: File permissions were corrected using a Bash script.
- 7:30 AM GMT: The fix was applied and verified; by 7:32 AM GMT the website was serving all pages normally.
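The detection and tracing steps in the timeline can be sketched in shell. The URL and the sample strace line below are illustrative, not captured from the real incident:

```shell
#!/usr/bin/env bash
# Reproduce the engineer's first check (hypothetical URL):
#   curl -sI http://localhost | head -n 1
# Then capture system calls from a running Apache worker, e.g.:
#   strace -f -e trace=file -p "$(pgrep -o apache2)" 2> strace.log
#
# In saved strace output, a permission failure shows up as EACCES
# on an open/openat call. This helper filters for exactly that.
find_denied_opens() {
  grep -E 'open(at)?\(.*EACCES' "$1"
}
```

A line such as `openat(AT_FDCWD, "...", O_RDONLY) = -1 EACCES (Permission denied)` is what pointed the investigation at file permissions rather than Apache configuration.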
Root Cause and Resolution
The root cause of the outage was incorrect file permissions on a file required by Apache. strace, a system-call tracing tool, revealed that Apache attempted to access the file but was denied due to permission restrictions. The Bash script used to fix the issue directly modified the file permissions to grant Apache the necessary read access.
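The Bash fix might have looked like the following. This is a hypothetical reconstruction, as the actual script was not preserved; the target mode (644) is an assumption:

```shell
#!/usr/bin/env bash
# Sketch of the one-off permission fix applied during the incident.
set -euo pipefail

fix_apache_perms() {
  local file="$1"
  # 644: owner can read/write; group and others (including the
  # Apache worker user) can read.
  chmod 644 "$file"
  # Report the resulting mode so the operator can confirm the fix
  # before declaring the incident resolved.
  stat -c '%a' "$file"
}
```

Echoing the resulting mode back matches the verification step logged at 7:30 AM before the incident was closed.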
Corrective and Preventative Measures
- Improve Monitoring: Implement automated monitoring to detect 500 errors from Apache and notify administrators promptly.
- Automate File Permission Management: Use the Puppet code below to manage file permissions for critical files accessed by Apache.
# Keep the file Apache was denied readable by the web server
# (owner/group 'www-data' is assumed; adjust to your Apache run user)
file { '/var/www/html/wp-settings.php':
  ensure => file,
  owner  => 'www-data',
  group  => 'www-data',
  mode   => '0644',
}
This eliminates the need for manual intervention and reduces the risk of human error.
- Code Review: Review application code to identify any hardcoded file paths that might lead to permission issues if the server environment changes.
- Regular Permission Audits: Conduct periodic audits of file permissions for critical server applications to ensure proper access control.
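The monitoring measure above could be sketched as a small probe. The curl invocation is standard, but the alerting wiring is an assumption; plug the ALERT case into your real paging or notification system:

```shell
#!/usr/bin/env bash
# Sketch of an automated 500-error probe for the web server.

# Pure helper: classify an HTTP status code string.
classify_status() {
  case "$1" in
    5??) echo "ALERT" ;;   # any 5xx: server-side failure, page someone
    *)   echo "OK"    ;;
  esac
}

# Probe a URL and print OK/ALERT based on the returned status code.
probe_site() {
  local status
  status="$(curl -s -o /dev/null -w '%{http_code}' "$1")"
  echo "$(classify_status "$status") $1 ($status)"
}
```

Run from cron or a monitoring agent, a probe like this would have caught the 500 errors at 7:11 AM without relying on an engineer happening to notice.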
Postmortem Conclusion
This incident highlights the importance of proactive monitoring and automated configuration management. By implementing the corrective and preventative measures outlined above, we can significantly reduce the risk of future outages caused by permission issues and ensure a more robust and reliable web server environment. It's important to note that this is a fictional scenario, but the troubleshooting steps and corrective actions are applicable to real-world situations.