Deployment is a solved problem. Sure, there is still work to be done, but the DevOps community has proven that anyone can both scale deployment automation and distribute the capability to execute deployments. Now we have to turn our attention to the next critical constraint: What happens after deployment?
We all know that failure is inevitable and can come our way at any moment. How do we respond quickly and effectively to those failures? What works when there is just a small set of teams or an isolated system to manage will quickly break down when the organization grows in size and complexity. On the other hand, what has been commonly practiced in large-scale enterprises is proving to be too cumbersome, too silo-dependent, and simply too slow for today's business needs.
How do we rapidly respond to incidents and recover complex, interdependent systems while working within an equally complex and interdependent organization? How do Ops teams embrace the DevOps- and Agile-inspired demand for speed while maintaining quality and control?
This talk examines the trial-and-error lessons learned by some forward-thinking enterprises that are currently streamlining how they:
- Resolve incidents
- Reduce friction between teams
- Divide up operational responsibilities
- Improve the quality of their ongoing operations (and organizational learning)