Site Reliability Engineering | An Introduction by Jan-Willem Middelburg
Introduction to SRE
It’s 10 o’clock, so we will get started with today’s webinar. Before we do, let me introduce a couple of the rules that we use during these kinds of webinars.
First of all, this session is recorded, so if you miss anything during the webinar, you can always review it later. After the webinar, you will receive a link to the digital download, where you can download the complete video, and we will also post it on YouTube so you can watch it again at any time. The second rule is that you can ask questions during the webinar. We have around 25 people on the call today, so in order to moderate that a little bit, I would kindly ask you to post your questions in the Questions panel on the left-hand side. I have placed everybody on mute by default. Towards the end of the webinar, when we review the questions, I can open up the mics for selected individuals, but otherwise it becomes too many people talking at the same time. Last but not least, we would really like to know how you experienced this webinar, so at the end you will be prompted to give a little bit of feedback. We would really appreciate it if you do that. Also let us know which topics you would like to learn more about, and we will schedule those accordingly in the future.
So, let’s get started, because I think we have a very interesting topic today. Before we do that, let me take a quick moment to introduce myself. My name is Jan-Willem Middelburg, and I am CEO and co-founder of Cybiant, a company that is all about Automation and Big Data. I have a long history in IT and have been working in the field for a decade. I have also authored a number of books, including ones on Serious Gaming, the Enterprise Big Data Framework and Service Automation. I give a lot of talks and training courses all across the world. What I will do today is provide you with a bit of a deep dive into what SRE is all about and why so many companies are looking into it today.
I have prepared a bit of an agenda for today’s session. The session will take around 30 minutes, followed by a 10-minute Q&A. In this session I will try to cover the following topics. I will start with an introduction to SRE, or Site Reliability Engineering. It’s a relatively new term which has come to prominence especially in the last 2 to 3 years. We see that many organisations are now looking into it, and some of the larger ones have even started to adopt it. So, I will start by providing a bit of background and the definition of SRE, and explain why it should or shouldn’t be interesting for your particular organization.
In the second part of the webinar, I will dive into the differences between Site Reliability Engineering, DevOps and ITSM, and I will also touch on how those 3 topics are related. Next, we will move on to the core concepts of SRE, so exactly what makes this topic different from, for instance, DevOps or ITSM. What I have tried to do is give an overview that compares the different frameworks and shows how you should look at each of them in more detail. Then, we will define Reliability, and we will do that with something called Service Level Objectives. I will define Service Level Objectives and provide an overview of how you could define them yourself as well. We’ll end the webinar by sharing some resources and digital downloads that you can look into in case you are interested in learning more.
It’s impossible to cover everything within these 30 minutes, so we have put together an overview of where you can find additional resources in case you want to read more about Site Reliability Engineering. At the end of the webinar, we will host the live Q&A session. Once again, you can ask questions on the left-hand side of the screen, and towards the end we will collect all of those and review them together.
What is SRE?
So, that is the agenda for today. Let’s get started. I think it’s impossible to give an introduction to SRE without first discussing the most common definition of Site Reliability Engineering. Maybe the surprising thing about SRE is that there is no official definition available, but if you browse online you will see that the one most frequently used is the following: “Site Reliability Engineering (SRE) is what happens when a software engineer is tasked with what used to be called operations.” I would like you to take a moment to let this definition sink in.
First of all, I would like you to notice that this definition comes from Ben Treynor, the VP of Engineering at Google. As I will explain a little later, Google has had a very large impact on developing SRE; for them, it is their target operating model. I think one of the reasons why SRE is so popular today is that this has been the default way in which Google looks at service management for the last decade. It is not something completely new; it has been in use for quite a long time. But good things always take a little bit of time to spread to other organizations. That, I think, is why the buzz around SRE has really come up in the last 2 to 3 years and not 10 years ago, when Google started with it.
The second thing I would like you to notice about this definition is that it clearly combines software engineering with the operations department. This might not be surprising for people who are already familiar with DevOps. What SRE is really all about is how we collaborate between the software or development (Dev) departments and operations (Ops).
The third major thing I want you to take away from this definition, and I think it’s a bit provocative, is that it talks about something that “used to be” called operations. So, there is a very clear signal that people are moving away from operations, or at least have a different definition for it. What that exactly means we will cover in some of the next slides, but I really think this is a very fundamental aspect of what SRE is.
Where does SRE come from?
I already mentioned that the definition on the previous slide came from somebody who works at Google, and to place SRE into perspective it’s good to know where exactly it comes from. SRE is basically a methodology that was developed at Google, and it has been their approach to service management for over a decade. If you browse online and go to the Google YouTube channel, you will find additional videos from actual SRE teams. Site Reliability Engineer is an actual job function, and there are actual teams within Google that consist of these engineers. It’s good to know that this is not just a paper-based methodology; it’s not something that was invented by academics and then transferred over to organizations. SRE is really a methodology that came out of the operations departments themselves.
The fact that Google uses it has of course also helped its popularity. Because it works at Google, a lot of other organizations are looking at it as well. Much of the material that I will cover today is covered in far more detail in the 2 books that you see here on the slide. The good news is that these books are available for free online; I think if you just Google the titles, you will get links to the individual books. I think it’s always good to have a bit of an understanding of where SRE comes from, in order to place it into a bigger perspective on how it can be applied and especially what the thinking behind it is.
The Basic Idea
Let’s look at the basic idea behind SRE. In most organizations, there are 2 different teams that focus on different aspects of the enterprise. On the left-hand side, you see development teams. Development teams are working on producing code, they are developing applications, they are making apps or shipping new ideas, and what the teams on that side of the wall do is that they continuously update and integrate new features into their products, so that’s what we refer to as the development teams. On the other side of the wall, you see that there are the operations teams. The operations teams frequently manage the core and critical IT infrastructure, and they ensure that the operations keep running smoothly.
Traditionally, in most organizations, these have been two different teams, frequently with different managers, different line management, and even different departments. What has been said is that between those teams there is frequently a wall of confusion, meaning that developers ship their code to operations, and operations provide feedback to developers, but sometimes there is confusion between the two. That is mainly because they have different core objectives, and they really have different views on what should be done within the organization. I think that is best explained by looking at the different objectives of the 2 teams.
Developers traditionally want to be very Agile. They want to ship their code very frequently, and they want to make sure that they can incorporate new changes into the products very quickly so that they are always ahead of the market. Being able to develop new ideas requires a sense of agility, so the development teams by definition have an almost insatiable hunger for agility. That is completely the opposite of the operations teams. As you can imagine, in order to manage critical infrastructure, you need a high level of stability and security. That means operations departments frequently have a very process-oriented approach, with checks and balances to ensure that nothing disrupts the core infrastructure of the organization. As you might imagine, the operations department is not very keen to continually update or ship new code into production. Because of those different objectives, we can say that there is a wall of confusion between the two teams, and that has typically been identified as the “Classic Clash” between the Dev teams and the Operations teams. On the one hand, we have the developers, who want to ship code fast so they can continually improve features and innovate on their products. They have a very Agile way of working and they want to move their code to production as quickly as possible. They prefer to do continuous deployments, where they can immediately get code into live environments.
On the other side of the clash, we have the operations departments. Operations traditionally want stability, and they want an IT infrastructure that is secure and can be controlled. In order to control and structure IT operations efficiently, those parts of the organization frequently use processes. They make sure that there are checks and balances in place to ensure that nothing gets introduced into the environment that can potentially disrupt it or make it unstable. As you can see, the two frequently have opposing goals and targets: on the one hand we have the developers, who want to do things quickly, and on the other hand we have the operations side, which is keen on stability. But wait a minute, is this not exactly what we have covered in DevOps? For the people who are familiar with DevOps, this is a methodology that also looks at this particular integration. So, where exactly is the difference between the two?
The definition of DevOps
In order to understand that, let’s take one step back and first review what DevOps is all about. On the screen here, you see Patrick Debois, the person who first coined the term DevOps at the DevOpsDays in Ghent, Belgium, which he first organized all the way back in 2009. Patrick recognized very early on that these opposing goals between Dev and Ops frequently inhibit the organization from doing what it’s supposed to do, so he coined the terms DevOpsDays and DevOps in particular, in order to move towards a more collaborative effort between development and operations, and out of that the DevOps movement originated. If we look at the definition of DevOps that is most commonly used today, it says that “DevOps is a set of cultural practices that has been designed to foster collaboration between Dev and Ops and other parts of the organization.” I think that is a very good way to describe what DevOps is. In order to overcome the wall of confusion that we covered in the previous slide, we need better collaboration between the development team and the operations team, and what better way than to make sure that they are all aligned and strive towards a common goal? That is what DevOps is all about.
A second key aspect of DevOps is that it’s a movement, sometimes referred to as a culture or a philosophy. It does contain some processes and practices to make sure that the organization can reach this goal, but more importantly it is “by the people who practice DevOps, for the people who practice DevOps”. In that sense it’s not very prescriptive: it doesn’t say that you have to follow a particular process, and it doesn’t say that you need to have a particular formal organizational structure. It’s a cultural movement that aims to foster collaboration between different kinds of departments. If we dive a little deeper into what DevOps is and why it works in organizations, you could say that there are 5 core elements that constitute DevOps.
First of all, DevOps aims to reduce organizational silos. We already mentioned that in most traditional organizations the development department is a different organization from the Ops department, or IT operations. Traditionally they have also been managed by different teams, and what happens when you do that is that you create organizational silos: there are different teams, and each silo functions on its own. It’s really difficult to break through that, so what DevOps wants to do is reduce those organizational silos and make sure that the people who work within development continuously collaborate with the people in IT operations. A DevOps team is a mixed team that contains people from development as well as operations.
A second key DevOps practice is that we need to accept failure as normal. Because we are managing IT systems, we need to recognize that we are dealing with technology, and technology will fail every once in a while. We need to recognize that those failures might occur, so that we can deal with them as they come along. There are very famous case studies out there of companies that actively disrupt their own organization so that they can plan better for failures in the future. So, one of the DevOps practices is to accept failures as a normal part of operation and to take precautionary measures, so that when a particular failure occurs you are optimally prepared to deal with it.
A third DevOps practice is to implement gradual change. I always explain that as follows: if you take a lot of baby steps every single day, by the end of the year you will have accomplished major change. A large change is nothing more than a number of very small steps combined, and by taking these gradual steps over time your organization will improve accordingly.
The fourth, and maybe one of the biggest topics within DevOps, is automation: how can you leverage tools and toolchains to move all the way from code to production in an almost fully automated fashion? Today, we don’t have the time to cover the concepts of continuous delivery and continuous deployment, but these are very strongly related to the automation aspect of DevOps. What we would like to have in a DevOps environment is technology in place with which we can quickly ship code to production. By focusing on these tools and toolchains, we build the infrastructure that is required to bring things live quickly but also securely.
Last but not least, the final practice of DevOps is to make sure that you measure everything. It’s almost impossible to make any kind of improvement if you don’t know where you are today, so by measuring everything on a day-to-day basis you can really set goals and work towards the next stage. These are the five core DevOps practices, and basically this is what DevOps is in a nutshell. This of course raises a question. Most organizations are already doing DevOps, and if you look a little more closely at the slide you see here, you see that quite a number of organizations are already fully working with DevOps. So you could say: if this is a model that works fine to improve the collaboration between Dev and Ops, why would we need SRE? What is the additional benefit of having Site Reliability Engineering over DevOps if we have a model that is working well? This is of course a very valid question, so let me try to answer it in the next couple of minutes.
The reason for SRE
What is the core reason for SRE? Well, although the 5 basic principles that I just outlined sound very good on paper, a lot of organizations are really struggling with implementing these practices. What is the reason for that? Since DevOps is a cultural movement, or sometimes referred to as a philosophy, and is not prescriptive, it’s quite difficult to say when you have implemented DevOps successfully. Every organization can implement DevOps in its own way, so a DevOps implementation can by definition always be called successful, because it means different things to different people. What a number of especially the larger enterprises have found in recent years is that this aspect of DevOps is quite difficult to work with, because how do you implement a philosophy? How do you implement something like a cultural movement? At Google in particular, they were looking for more measurable targets to define when they are successful: when have we established a successful collaboration between the development department and the operations department? What they came up with is to focus mainly on Reliability as a concept.
I would like you to take a moment to think about what Reliability means for an organization, because isn’t this, in the end, what the customer is willing to pay for when they consume a service? If you log in to your Netflix account or to your Office 365, the main thing that you are concerned with is that it works, and that it works when you actually want to use it. This term “Reliability” is at the core of SRE; it’s actually the second word in Site Reliability Engineering. By focusing on Reliability, Google found a way to make DevOps a bit more practical and also a bit more prescriptive, so that you can measure whether your collaboration between Dev and Ops is actually successful. Another way to look at it is to see SRE as a more prescriptive or accomplished way of implementing the DevOps philosophy. That also makes the point that DevOps and SRE are very closely related. It’s not that one conflicts with the other or that one replaces the other. It’s just that SRE looks a little deeper into the more practical aspects of DevOps and how we can measure them successfully.
For the Programmers
What you see on the next slide, for the people who have more of a programming background, I think really explains what SRE is. It borrows from the Java language, so for the Java programmers here in the room: you could say that SRE is an implementation of the DevOps practices. I think this is one of the strongest definitions out there. You could look at SRE as a class that implements the DevOps interface, in other words the functionality that DevOps describes. I think this is a really strong way of looking at the combination of the two: DevOps is the central set of practices, and SRE is a practical implementation of them. Let’s see what is meant by that by exploring it in a little more detail.
Class SRE Implements DevOps
At the top of this table, which you see here on the next slide, we have outlined the DevOps practices that you have seen on the previous couple of slides: reducing organizational silos, accepting failure as normal, implementing gradual change, tooling and automation, and measuring everything.
These are the 5 core DevOps practices that we discussed earlier. What you see below them is basically how SRE implements DevOps. If we look, for instance, at organizational silos, we see that SRE provides guidance to share ownership with developers, which creates a shared responsibility within the SRE teams, and to use the same tools and technology across the different departments, so across operations and development, so that they can continuously keep working together. If we look at the acceptance of failure, we see something that I will cover in more detail in the next couple of slides, which is the definition of SLOs, Service Level Objectives. What we are really trying to do there is define reliability as a measurable target, so that we can apply it on a consistent basis. I’m not going to read out all of the different aspects of SRE, but the key thing I want you to remember from this slide is that SRE provides a number of additional practices that make DevOps practical. This is how you can actually measure and work with it in your own organization. So, you could say that the class SRE implements the DevOps philosophy or the DevOps cultural movement. I hope this makes a bit of sense.
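For readers who want to see the analogy as actual code, the following is a minimal Java sketch and is not taken from the webinar slides. The interface name, the method names and the returned strings are assumptions made for illustration. Only the first two implementations paraphrase what is said in the talk; the remaining three are placeholders, because the talk does not read out those rows of the table.

```java
// Illustrative sketch only: names and strings are assumptions, not an official API.
// The five methods mirror the five DevOps practices listed in the talk.
interface DevOps {
    String reduceOrganizationalSilos();
    String acceptFailureAsNormal();
    String implementGradualChange();
    String leverageToolingAndAutomation();
    String measureEverything();
}

// SRE gives one concrete answer for each abstract practice.
class SRE implements DevOps {
    // Paraphrased from the talk: shared ownership and shared tooling.
    @Override
    public String reduceOrganizationalSilos() {
        return "Share ownership and use the same tools across Dev and Ops";
    }

    // Paraphrased from the talk: reliability defined through SLOs.
    @Override
    public String acceptFailureAsNormal() {
        return "Define SLOs so that reliability is a measurable target";
    }

    // Placeholder: this row is not read out in the talk.
    @Override
    public String implementGradualChange() {
        return "Ship small changes so that a failure stays cheap to fix";
    }

    // Placeholder: this row is not read out in the talk.
    @Override
    public String leverageToolingAndAutomation() {
        return "Automate the path from code to production";
    }

    // Placeholder: this row is not read out in the talk.
    @Override
    public String measureEverything() {
        return "Measure reliability against the SLO";
    }
}

class SreAnalogyDemo {
    public static void main(String[] args) {
        // An SRE instance can be used wherever a DevOps is expected.
        DevOps practices = new SRE();
        System.out.println(practices.reduceOrganizationalSilos());
        System.out.println(practices.acceptFailureAsNormal());
    }
}
```

The design point of the analogy is that the interface only says what has to be done, while the implementing class has to commit to how it is done.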
How to define an SLO?
In the next section, I think it will be good to zoom in a little further on one of the core concepts within Site Reliability Engineering, and that is the definition of service reliability targets, which are typically referred to as SLOs. I think most people here on the call will be familiar with how SLAs, or Service Level Agreements, are defined, coming from best practices in ITIL for IT Service Management and even in DevOps. Service Level Agreements define the key targets that an IT organization strives for, and I think we are all familiar with things like 99.98% availability or 100% uptime. Most contracts that are still signed today define Service Level Agreements, so those are the minimum targets that an organization needs to achieve. It also frequently means that SLAs are coupled to monetary rewards: as long as you keep meeting your SLA, you are going to get a particular payment.
However, when we look more closely at the concept of Reliability, you could say that 100% uptime or 100% reliability is the wrong target to begin with, because we have just established in one of the five DevOps principles that failures will happen. A 100% reliability target is basically wrong for anything, because it leaves no room for improvement and no room for error. A better way, instead of looking at an SLA, is to define a Service Level Objective, which is basically a measure of how reliable the service should be.
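To make the point about 100% targets concrete, here is a minimal Java sketch, not part of the webinar, that converts an availability target into the amount of downtime it permits. The class and method names and the 30-day period are assumptions chosen for illustration. The SRE books refer to this permitted amount of unreliability as an error budget.

```java
// Illustrative sketch: converts an availability target into permitted downtime.
// Class name, method name and the 30-day period are assumptions for this example.
class DowntimeAllowance {

    // Minutes of downtime that a target (in percent) permits over the given number of days.
    static double allowedDowntimeMinutes(double targetPercent, int days) {
        if (targetPercent < 0.0 || targetPercent > 100.0) {
            throw new IllegalArgumentException("targetPercent must be between 0 and 100");
        }
        double totalMinutes = days * 24 * 60;
        return totalMinutes * (100.0 - targetPercent) / 100.0;
    }

    public static void main(String[] args) {
        int days = 30; // assumption: a 30-day month
        double[] targets = {100.0, 99.98, 99.9, 99.0};
        for (double target : targets) {
            System.out.printf("%.2f%% target: %.1f minutes of downtime allowed%n",
                    target, allowedDowntimeMinutes(target, days));
        }
    }
}
```

Under these assumptions, a 99.98% target leaves roughly 8.6 minutes per 30 days, while a 100% target leaves exactly zero, which is why it leaves no room for error or for change.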
A very good example to think about is your online banking service. Do you really require it to be available 24 hours a day, 7 days a week, always ready to process your transactions? The answer of course is no. As you know, even banks have downtime windows in which they do updates. It’s not very likely that you need to make a lot of transactions between 3:00 and 3:15 in the morning, so you are fine if your banking service is down for a couple of hours during the month, as long as you can use it during the critical peak hours in which you need to process a lot of transactions. That’s exactly what Service Level Objectives try to specify.
An SLO, or Service Level Objective, should capture the performance and availability levels that, if you barely meet them, will keep your typical customers happy, where typical customers means the customers of a particular service. If we meet our SLO targets, it means that we have happy customers. If you don’t meet your SLO targets, the opposite is true: you don’t have happy customers at all. The key question of course is: what is a good SLO target? By focusing your target on what will keep your customers happy, you make room for improvement, because the target will never be 100%. Most customers don’t require their services to be up and running 24/7, and that would also be very, very expensive. By defining SLO targets you take a different approach towards defining targets within your organization, and you can work further to achieve those. I thought it would be good to share this example because this is something that you will learn when you dive into SRE in more detail. Obviously, as we have seen in the previous slide, there are a number of other practices within SRE, but in this webinar we don’t have the time to cover all of them.
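As a rough illustration of what meeting or missing an SLO target could look like in practice, the Java sketch below compares a measured success rate with a target. It is not from the webinar: the class and method names, the choice of successful requests as the measurement, and all of the numbers are hypothetical.

```java
// Illustrative sketch: checks a measured success rate against an SLO target.
// All names and numbers are hypothetical and chosen only for this example.
class SloCheck {

    // Percentage of requests that were served successfully.
    static double measuredPercent(long goodRequests, long totalRequests) {
        if (totalRequests <= 0 || goodRequests < 0 || goodRequests > totalRequests) {
            throw new IllegalArgumentException("need 0 <= goodRequests <= totalRequests and totalRequests > 0");
        }
        return 100.0 * goodRequests / totalRequests;
    }

    // True when the measured level is at or above the SLO target.
    static boolean meetsSlo(long goodRequests, long totalRequests, double sloTargetPercent) {
        return measuredPercent(goodRequests, totalRequests) >= sloTargetPercent;
    }

    public static void main(String[] args) {
        double sloTargetPercent = 99.9; // hypothetical target
        long total = 1_000_000L;        // hypothetical monthly request count

        // Hypothetical month A: 150 failed requests.
        System.out.println("Month A meets SLO: " + meetsSlo(999_850L, total, sloTargetPercent));
        // Hypothetical month B: 1,500 failed requests.
        System.out.println("Month B meets SLO: " + meetsSlo(998_500L, total, sloTargetPercent));
    }
}
```

The useful property of such a check is that it gives a yes or no answer against a target below 100%, so the gap between the measured level and the target shows how much room for change is left.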
Let’s recap, because we are nearing the half-hour mark. What have we learned so far in this webinar? There are 5 key lessons that I want you to take away from this session; these are the five core learning points that I have tried to talk you through in the last 30 minutes.
First of all, the definition of Site Reliability Engineering: I want you to remember that SRE is an extension of DevOps with more measurable targets, and that makes it a more practical approach for most organizations. Second, I explained where SRE comes from. SRE was originally developed by Google, and I think it is growing as quickly as it is today because Google uses it as their operating model. Third, the key objective of SRE is reliability, the second word in SRE. Reliability is expressed as a service level target that, if you meet it, will keep the typical customers of a service happy. There are a number of different ways in which you can define your SLO targets, and if you would like more information on that, I suggest you contact me after the webinar. Fourth, we talked a little bit about the definition of a Service Level Objective. SLOs are a different way of looking at customer satisfaction and meeting targets, different from the traditional Service Level Agreements that most organizations still operate on today. Last but not least, I think a very good way to memorize what SRE is, is to look at the programming analogy: you could say that the class SRE implements DevOps. I think that’s a really good summary of everything that we tried to cover in this webinar.
So, what if you would like to know more about SRE? Well, the good news is that Cybiant is the first company in Asia to launch the SRE courses that have been developed by DevOps Institute. The first course is already in one week’s time in Kuala Lumpur, and after that we will be in Singapore in March to teach the SRE course as well. It is a 2-day course in which we dive into far more detail on the things we just covered in this webinar. If you would like to know a little more about what is in the program, what the learning objectives are, and what the examination looks like, I highly encourage you to visit the link below, where you can find out more about the SRE Foundation course and where you can also register for the course.
Other than that, I have also covered SRE in written form. That is available on the website as a blog post, which explains what SRE is in a little more detail. I also highly encourage you to read the books that are available online, which provide far more background information about what SRE is and how it can benefit your organization. That’s basically the main content that we cover in the webinar today. I think it’s time for some questions, so if you have questions about SRE or anything else that we covered in this webinar, please feel free to ask them in the box that you see on the left. I will try to moderate these right now.
The first question that we received is: what is the format of the SRE examination, and will I get a certificate? The answer is yes, absolutely. We have launched the official SRE course from DevOps Institute, which is a 2-day course, and at the end of the second day you will immediately sit for your examination. After that, within 2 to 3 days you will get the results, and you also get an official certificate. The next question that I received is: are there any companies in Malaysia that you know of that already practice SRE? The answer is yes. We are talking to a number of larger enterprise organizations that are already practicing SRE, or that have started their SRE implementation at this moment. It’s definitely not only Google that does this; there are a number of other large companies out there that already do it. I also know for a fact that in Singapore a number of companies that I work with are already establishing SRE within their own practices.
Let’s look at the other questions that are coming in. Can you move towards SRE if you already have DevOps practices in place? The answer is definitely yes, because I regard SRE as an extension of DevOps with more practices, more practical aspects and more measurable targets. I would even say that it’s an easier journey than if you don’t have DevOps in place, because fundamentally the philosophy is the same. The next question coming in is: what makes a good SRE, a Site Reliability Engineer, especially as a job role? I would say an extreme focus on doing things right, so making sure that everything you do from beginning to end, all the way from code development to production, is done with high quality, and also having a mindset for continuous improvement. One of the things that I really like about the SRE framework is that it allows for continual improvement. It says that you don’t need to achieve 100%, and that also means that you have room for further improvement. A good SRE as a person is somebody who is collaborative, who has a very strong focus on quality, and who also wants to improve over time.
Thank you so much for all the questions that are coming in; that really makes it fun to moderate all of this. One more question. Somebody says: I have started to look at SRE and found some materials online, however I’m still looking for templates and materials that can provide some additional guidance. That’s a very good question. We have a couple of workshops specifically for SRE, and they also contain all of the templates. Those materials are available online, so if you are not able to find them, I can send them to you afterwards. I think these are very good presentations and workshop templates with which you can start to define your own Service Level Objectives, your own SLOs. By doing a workshop like that you can really start to embed SRE in your organization, or at least take a first step. Well, I think that is the last one that we received in the queue, so this is a good moment to end the webinar.
Thanks again to all of you for tuning in today and joining this session. It’s always really fun when there are lots of questions and lots of interaction during these kinds of sessions. I would really like to thank you for participating in this webinar. If you have any questions, please feel free to contact us. Our contact details are on the next slide, and otherwise you can find them on our website. I will be teaching the first couple of SRE courses myself. If you are interested, please reach out to us, and I hope to see you in one of the upcoming classes. If not, we keep hosting webinars on a very regular basis, so then I hope to see you in one of our upcoming webinars. Again, thank you very much for watching today’s session. You will receive the recording of this session within the next hour. Thank you so much, and I will see you next time. Bye bye.