An Introduction To Site Reliability Engineering

Introduction

As of late 2019, the term Site Reliability Engineering (SRE) is quickly gaining ground in the IT Services and DevOps domains. This might be the first time you are hearing about SRE, so I thought it would be a good idea to write down the basic ideas and concepts. With this article, you will be up to speed on the SRE fundamentals in under five minutes.

What is Site Reliability Engineering?

Site Reliability Engineering is a term that is quickly rising to prominence, largely because it is the main operating model for IT Service Management at Google. Google has been building SRE teams to manage its production systems since the early 2000s, and the approach became widely known after the company published its book on the subject in 2016. A great way to explain Site Reliability Engineering comes from that book, Site Reliability Engineering, edited by Beyer, Jones, Petoff and Murphy[1]:

“Site Reliability Engineering is what happens when you ask a software engineer to design an operations team.”

Although this is obviously not an official definition, I think it highlights a core focus of Site Reliability Engineering: it applies (software) engineering best practices to IT operations. If you are familiar with software engineering, you will know that this domain contains many problem-solving techniques, from debugging to root cause analysis. Above all else, software engineering requires a problem-solving attitude and patience.

The approach of integrating Development (Dev) best [...]