100% uptime is impossible. Modern architectures are designed around failure, but what does that mean for the human side of incident management? This talk will consider how to prepare for outages, how to structure the response, and how those experiences and techniques differ between small and large companies.
Key topics will include:
On-call - rotations, scheduling, systems and policies
Preparing for downtime - teams, systems and product architecture
Documentation
Checklists and playbooks
How we actually handle incidents
Post-mortems