When investigating an incident
I just solved a very frustrating incident that literally woke me up at 4am.
I looked at the usual things (logs, traces) to find out what the issue was, but I should have thought about rolling back sooner.
This is a reminder to me.
When investigating an incident, there are two possibilities to keep in mind: either it's a code problem or it's an infrastructure problem.
- If it's an infrastructure problem (e.g. Kubernetes), restarting the pod will almost always solve the issue. If this doesn't help, move on to the second option.
- If it's not an infra problem, then it's most likely related to deployed code. In that case:
- Check if the staging environment has the same issue
- Check the logs to see what the errors are
- Go to the fragment that has the most errors and check the latest merged PR and the latest version on the CI pipeline
- Use Helm to roll back to a stable version (staging first, then live)
- Check if the problem has been mitigated
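To make the restart and rollback steps above concrete, here is a minimal sketch in Python that only builds the `kubectl` and `helm` commands and prints them by default. The context names (`staging`, `live`), the release name, the namespace, and the revision numbers are all hypothetical placeholders, and I'm assuming each environment is reachable through its own kube context. Revision numbers are per environment, so look them up with `helm history` first.

```python
import subprocess
from typing import Dict, List

# Hypothetical kube context names; staging always comes before live.
ENVIRONMENTS = ["staging", "live"]


def restart_command(deployment: str, namespace: str, context: str) -> List[str]:
    """Build the kubectl command that restarts the pods of a deployment."""
    return [
        "kubectl", "--context", context, "-n", namespace,
        "rollout", "restart", f"deployment/{deployment}",
    ]


def history_command(release: str, namespace: str, context: str) -> List[str]:
    """Build the helm command that lists past revisions of a release."""
    return ["helm", "history", release, "-n", namespace, "--kube-context", context]


def rollback_command(release: str, revision: int, namespace: str, context: str) -> List[str]:
    """Build the helm command that rolls a release back to a given revision."""
    return [
        "helm", "rollback", release, str(revision),
        "-n", namespace, "--kube-context", context,
    ]


def rollback_plan(release: str, namespace: str, revisions: Dict[str, int]) -> List[List[str]]:
    """Return rollback commands in order: staging first, then live.

    `revisions` maps each environment to the stable revision found
    with `helm history` for that environment.
    """
    return [
        rollback_command(release, revisions[env], namespace, env)
        for env in ENVIRONMENTS
        if env in revisions
    ]


def run(command: List[str], dry_run: bool = True) -> None:
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)
```

For example, `for cmd in rollback_plan("my-service", "default", {"staging": 4, "live": 7}): run(cmd)` prints the two rollback commands in the right order without touching the cluster. Passing `dry_run=False` actually executes them, so check staging before letting the live one run.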
This is just a reminder for my future self: these things are easy!
This is a Grafana dashboard of when a service is doing well
Error rates are down