When investigating an incident

I just solved a very frustrating incident that literally woke me up at 4am.

I looked at the usual things (logs, traces) to find out what the issue was, but I should have thought about rolling back.

This is a reminder to me.

When investigating an incident, the first thing to work out is which of two things it is: a code problem or an infrastructure problem.

  1. If it's an infrastructure problem (e.g. Kubernetes), restarting the pod will almost always solve the issue. If this doesn't help, move on to the second option.
  2. If it's not an infra problem, then it's most likely related to deployed code.
  • Check if the staging environment has the same issue
  • Check the logs to see what the errors are
  • Go to the fragment that has the most errors, then check the latest merged PR and the latest version on the CI pipeline
  • Use Helm to roll back to a stable version (first on staging, then on live)
  • Check if the problem has been mitigated
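
As a rough sketch of the restart and rollback steps above, here is a small Python helper that builds the `kubectl` and `helm` commands and only prints them unless told otherwise. The release name `my-service`, revision `41`, and the namespace names `staging` and `live` are made up for illustration. It also assumes both environments are namespaces in the same cluster, which may not match your setup. Run `helm history` first to pick the revision you actually trust.

```python
import subprocess
from typing import List, Sequence


def restart_command(deployment: str, namespace: str) -> List[str]:
    """Step 1: restart the pods behind a deployment."""
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def history_command(release: str, namespace: str) -> List[str]:
    """List past Helm revisions so a known-stable one can be chosen."""
    return ["helm", "history", release, "-n", namespace]


def rollback_command(release: str, revision: int, namespace: str) -> List[str]:
    """Step 2: roll a Helm release back to a specific revision."""
    return ["helm", "rollback", release, str(revision), "-n", namespace, "--wait"]


def rollback_plan(
    release: str,
    revision: int,
    namespaces: Sequence[str] = ("staging", "live"),  # hypothetical names
) -> List[List[str]]:
    """Rollback commands in order: staging first, then live."""
    return [rollback_command(release, revision, ns) for ns in namespaces]


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print("$ " + " ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run only: nothing here touches the cluster.
    run(restart_command("my-service", "staging"))
    run(history_command("my-service", "staging"))
    for cmd in rollback_plan("my-service", 41):
        run(cmd)
```

Keeping the dry run as the default is deliberate: at 4am I would rather read the command before it runs. After each rollback, check the dashboard before moving on from staging to live.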

This is just a reminder for my future self: these things are easy!

This is what a Grafana dashboard looks like when a service is doing well

Error rates are down
