Site reliability engineering basics; Release engineering; Change management; Incident management; Postmortems; Troubleshooting; Distributed design; Organization
Site reliability engineering (SRE) is an emerging paradigm in DevOps. The biggest names in tech, including Google, Netflix, Microsoft, and LinkedIn, all use SRE. In fact, across the industry, "site reliability engineer" is replacing "DevOps engineer" in job posts. Simply put, SRE is software engineering applied to operations for the cloud-native era.

This course introduces the basics of site reliability engineering, including how SRE fits into DevOps and how it can be integrated into your own business environment. Instructors Ernest Mueller and James Wickett cover the major areas of expertise: release engineering, change management, incident management and retrospectives, self-service automation, troubleshooting, performance, and deliberate adversity. Learn how to define reliability through SLAs and SLOs, handle crises, design distributed systems, and scale your systems and your team. Plus, explore time and project management strategies that bring humanity back to the SRE's job.