Postmortem

Scenario

Kofi realized that all of the Apache servers at his company were returning a 500 error. Upon investigation, he found the issue, fixed it, and then automated the fix with Puppet.
Additionally, he has to write a postmortem so that the rest of the company's employees have easy access to information detailing the cause of the outage. Outages can have a huge impact on a company, so managers and executives need to understand what happened and how it will affect their work. A postmortem also shows that the root cause(s) of the outage has been identified and that measures are being taken to prevent it from happening again.

Usually, a postmortem has the following requirements:

Issue Summary (this is often what executives will read), which contains:

  • duration of the outage with start and end times (including time zone)
  • what was the impact (what service was down or slow? What were users experiencing? What percentage of users were affected?)
  • what was the root cause

Timeline, which contains:

  • when the issue was detected
  • how the issue was detected (monitoring alert, an engineer noticed something, a customer complained…)
  • actions taken (what parts of the system were investigated, what were the assumptions on the root cause of the issue)
  • misleading investigation/debugging paths that were taken
  • which team(s) or individual(s) the incident was escalated to
  • how the incident was resolved

Root cause and resolution, which contains:

  • a detailed explanation of what was causing the issue
  • a detailed explanation of how the issue was fixed

Corrective and preventative measures, which contain:

  • what are the things that can be improved/fixed (broadly speaking)
  • a list of tasks to address the issue (be very specific, like a TODO, example: patch Nginx server, add monitoring on server memory…)

A postmortem must be brief and straight to the point: between 400 and 600 words that communicate what happened, why it happened, and how you're preventing it from happening again. It's not about assigning blame, but about learning and improving your systems. Now, let's take a look at Kofi's postmortem.

Issue Summary

Duration of Outage: March 24, 2024, from 7:11 AM to 7:32 AM GMT (21 minutes)

Impact: The Apache web servers were malfunctioning, causing all websites hosted on them to return a 500 Internal Server Error. Users attempting to access any of these websites were unable to view content.

Root Cause: Apache failed to access a file due to incorrect file permissions.

Timeline

  • 7:11 AM GMT: An engineer noticed that the hosted websites were not loading and confirmed a 500 Internal Server Error using curl.
  • 7:12 AM GMT: Initial investigation focused on configuration errors or incorrect permissions for Apache itself.
  • 7:15 AM GMT: After no issues were found in the Apache configuration itself, strace was used to trace the system calls made by the Apache process (a shell sketch of these checks follows this timeline).
  • 7:20 AM GMT: strace flagged a failed file access attempt, leading to the discovery of incorrect file permissions.
  • 7:25 AM GMT: File permissions were corrected using a Bash script.
  • 7:32 AM GMT: The issue was confirmed resolved, and all websites were functioning normally again.
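
For reference, the checks from the 7:11 and 7:15 entries could look roughly like the shell sketch below. The URL and the apache2 process name are placeholders, not details taken from Kofi's actual session:

# Confirm the 500 Internal Server Error reported at 7:11
curl -sI http://localhost | head -n 1
# expected output: HTTP/1.1 500 Internal Server Error

# Trace file-related system calls from an Apache worker, as done at 7:15
# -f follows forked children, -e trace=file limits output to file access calls
strace -f -e trace=file -p "$(pgrep -n apache2)" 2>&1 | grep EACCES
# EACCES (permission denied) entries point to the file Apache cannot read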

Root Cause and Resolution

The root cause of the outage was incorrect file permissions on a file that Apache required. strace, a system call tracing tool, showed that Apache attempted to access the file but was denied because of permission restrictions. A Bash script was used to fix the issue by modifying the file permissions to grant Apache the necessary read access.
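
As an illustration, the fix script could have looked something like the sketch below. The exact path and the www-data ownership are assumptions, since the postmortem does not list them:

#!/usr/bin/env bash
# Sketch of a permission fix script (path and ownership are assumed,
# not taken from Kofi's actual script)
set -euo pipefail

FILE=/var/www/html/wp-settings.php   # the file strace flagged (assumed)

chown www-data:www-data "$FILE"      # hand ownership to the Apache user (Debian/Ubuntu)
chmod 644 "$FILE"                    # owner read/write; group and others read-only

With mode 644 the file is world-readable, so the chmod alone is enough for Apache to read it; the chown simply keeps ownership consistent with the rest of the web root.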

Corrective and Preventative Measures

  • Improve Monitoring: Implement automated monitoring to detect 500 errors from Apache and notify administrators promptly (a minimal health-check sketch follows this list).
  • Automate File Permission Management: Use Puppet to keep critical files accessed by Apache in the expected state instead of correcting them by hand. The manifest below is the automation Kofi wrote for this incident:
# Puppet manifest used to automate the fix for the Apache 500 error

exec { 'Fix wordpress site':
  # replace the misspelled ".phpp" extension with ".php" in wp-settings.php
  command  => 'sudo sed -i "s/.phpp/.php/" /var/www/html/wp-settings.php',
  # only run while the misspelling is still present, so the resource stays idempotent
  onlyif   => 'grep -q "\.phpp" /var/www/html/wp-settings.php',
  provider => shell,
}
    This eliminates the need for manual intervention and reduces the risk of human error.
  • Code Review: Review application code to identify any hardcoded file paths that might lead to permission issues if the server environment changes.
  • Regular Permission Audits: Conduct periodic audits of file permissions for critical server applications to ensure proper access control.
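
As a minimal sketch of the monitoring item above, a cron-driven health check could look like the script below. The URL, the alert address, and the use of mail are assumptions rather than an existing setup:

#!/usr/bin/env bash
# Minimal 5xx health check, e.g. run from cron every minute
# (URL and alert address are placeholders)

URL=http://localhost
STATUS=$(curl -s -o /dev/null -w '%{http_code}' "$URL")

if [ "$STATUS" -ge 500 ]; then
    echo "Apache returned $STATUS for $URL at $(date -u)" \
        | mail -s "ALERT: Apache 5xx on $(hostname)" admin@example.com
fi

In practice this check would live in the existing monitoring stack rather than cron and mail, but the principle is the same: watch the HTTP status code that users actually see, not just whether the Apache process is running.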

Postmortem Conclusion

This incident highlights the importance of proactive monitoring and automated configuration management. By implementing the corrective and preventative measures outlined above, we can significantly reduce the risk of future outages caused by permission issues and ensure a more robust and reliable web server environment. It's important to note that this is a fictional scenario, but the troubleshooting steps and corrective actions are applicable to real-world situations.
