When investigating an incident

Friday, June 03, 2022

#on-call

I just solved a very frustrating incident that literally woke me up at 4am.

I looked at the usual things (logs and traces) to find out what the issue was, but I should have thought about rolling back.

This is a reminder to me.

When investigating an incident, there are two possibilities to keep in mind: either it's a code problem or it's an infrastructure problem.

If it's an infrastructure problem (e.g. Kubernetes), restarting the pod will almost always solve the issue. If this doesn't help, then we go to the second option.
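A minimal sketch of that restart, assuming the pods are managed by a Deployment and that kubectl is already pointed at the right cluster. The deployment name and namespace are hypothetical placeholders. It defaults to a dry run that only prints the command, which is what I'd want at 4am.

```python
import subprocess


def restart_command(deployment: str, namespace: str) -> list[str]:
    """Build a `kubectl rollout restart` command for one Deployment."""
    return [
        "kubectl", "rollout", "restart",
        f"deployment/{deployment}",
        "--namespace", namespace,
    ]


def restart(deployment: str, namespace: str, dry_run: bool = True) -> list[str]:
    """Print the restart command; only execute it when dry_run is False."""
    command = restart_command(deployment, namespace)
    print(" ".join(command))
    if not dry_run:
        # Raises CalledProcessError if kubectl exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # "my-service" and "production" are made-up names for illustration.
    restart("my-service", "production")
```

This triggers a rolling restart of every pod in the Deployment rather than deleting a single pod by hand.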
If it's not an infra problem, then it's most likely related to the deployed code. In that case:

Check if the staging environment has the same issue
Check the logs to see what the errors are
Go to the fragment that has the most errors, then check the latest merged PR and the latest version on the CI pipeline
Use Helm to roll back to a stable version (first on staging, then on live)
Check if the problem has been mitigated
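The rollback step could look roughly like this. It is a sketch under assumptions: the release name, the revision number, and the kube-context names "staging" and "live" are all hypothetical, and the stable revision is something I would pick by hand from `helm history <release>`. It only prints the commands unless dry_run is turned off.

```python
import subprocess


def rollback_plan(release: str, revision: int,
                  contexts: tuple[str, ...] = ("staging", "live")) -> list[list[str]]:
    """Build one `helm rollback` command per environment, staging first."""
    return [
        ["helm", "rollback", release, str(revision),
         "--kube-context", context, "--wait"]
        for context in contexts
    ]


def run_plan(plan: list[list[str]], dry_run: bool = True) -> None:
    """Print each command in order; execute them only when dry_run is False."""
    for command in plan:
        print(" ".join(command))
        if not dry_run:
            # Stops the whole plan if helm exits with a non-zero status.
            subprocess.run(command, check=True)
            # Pause so the dashboards can be checked before moving on.
            input("Problem mitigated here? Enter to continue, Ctrl+C to stop. ")


if __name__ == "__main__":
    # "my-service" and revision 41 are made-up values for illustration.
    run_plan(rollback_plan("my-service", 41))
```

Running staging before live keeps the order from the list above, and the pause after each rollback is where the last step (checking that the problem is mitigated) fits.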

This is just a reminder for me in the future. These things are easy!

This is a Grafana dashboard of when a service is doing well

Error rates are down
