The 5 Pillars of the AWS Well-Architected Framework: III – Reliability
26.08.2020 - Read in 16 min.
We’d like to present the 3rd part of the “Five Pillars of the AWS Well-Architected Framework” series. In this article, you’ll find out why your infrastructure should be as reliable as possible, giving you a sense of stability even in the face of major failures, and what best practices you can implement to make your system as reliable as possible.
According to AWS documentation, a list of pillars the architecture of your projects can be based on may look like this:
- Operational Excellence
- Security
- Reliability
- Performance Efficiency
- Cost Optimization
This list can be used as a sort of foundation for the infrastructure you will build your services upon. Following these guidelines (but treating them as suggestions, of course) will give you a form that’s stable, secure, and efficient both functionally and financially.
This chapter is about: Reliability
Here is the official description of this pillar:
“The ability of a workload to perform its intended function correctly and consistently when it’s expected to, this includes the ability to operate and test the workload through its total lifecycle”
In short, your infrastructure must ensure maximum reliability and must give you the sense of stability even in the case of serious failures. This reliability is based on scaling and self-healing.
Here are the main principles of this pillar:
- Automatically recover from failures
Your system will fail from time to time; things break. If you think that’s not the case, you probably don’t monitor your system closely enough and just haven’t found out yet. Obviously, it is impossible to configure and monitor everything properly right away, but it can be done step by step. As we said in the chapter on Security, every post-mortem gives you a huge amount of knowledge about your system and its shortcomings. Thanks to this process, you can adjust your monitoring and the required automation, as Łukasz Husarz explained in the IT monitoring – your good friend! article. It is automation that is particularly useful here, as it allows quick recovery from a failure. Some system components can self-heal on their own, e.g. a Kubernetes Deployment, whose ReplicaSet maintains a pre-defined number of Pods.
- Test recovery procedures
The rule “we back up our system, so we’re safe” is only true if you can use that archive quickly and seamlessly. If you don’t have a well-described and properly tested recovery procedure, it’s almost as if you had nothing to support the recovery of system components like e.g. a database. AWS gives you multiple options to take snapshots that can be used to restore your data to a previous version: EBS disks can be archived using AWS Data Lifecycle Manager, RDS databases have a minimum daily backup retention, and even S3 supports object versioning. There are many possibilities, but remember to test restoring the system to a required version, and to test your system in general, e.g. using Chaos Monkey.
- Scale horizontally to increase aggregate workload availability
Your system becomes more reliable if, for example, you replace a single large EC2 instance with a few smaller ones. Obviously, you need to use common sense. Some large instances have their benefits, e.g. a larger number of supported IP addresses, much needed in a Kubernetes cluster (thanks to the AWS VPC CNI, Container Network Interface, plugin). However, you can often pick smaller instances in such a way that reliability and performance stay at the same or an even higher level. Various setups also allow you to look for savings, e.g. by splitting a larger number of instances between On-Demand and Spot ones.
- Stop guessing capacity
Your system should match your needs: it cannot be too small (or you will suffer from excessive vCPU or RAM usage) or too large (or you will pay for unused resources). This requires proper monitoring of load, resource usage, etc. AWS CloudWatch enables you to view metrics for particular resources. It can be complemented by e.g. Prometheus with Grafana, which together with Kubernetes (AWS EKS) or On-Premise systems can significantly lower the cost of metrics (Custom Metrics in CloudWatch increase it quite noticeably). Regardless of which metrics you use, you simply have to collect them and respond to underestimated resources. AWS Auto Scaling may be particularly useful here. If need be (e.g. higher vCPU usage), it can add new EC2 instances to increase the amount of available cloud resources. Once the load lessens, the mechanism will decrease the number of instances. This way, you don’t pay for unused resources.
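As a rough illustration of how this removes the guesswork, here is a minimal Python sketch of a proportional, target-tracking-style scaling decision. It mirrors the spirit of AWS target tracking, but the function name and the numbers are illustrative assumptions, not AWS’s implementation:

```python
import math

def desired_capacity(current_instances: int, current_cpu: float, target_cpu: float) -> int:
    """Scale the fleet proportionally to how far the observed metric
    is from the target, never dropping below one instance."""
    return max(1, math.ceil(current_instances * current_cpu / target_cpu))

# CPU at 80% with a 50% target: scale 4 instances out to 7
print(desired_capacity(4, 80.0, 50.0))   # → 7
# Load drops to 20%: scale back in to 2
print(desired_capacity(4, 20.0, 50.0))   # → 2
```

The same proportional idea drives scaling in when load falls, which is exactly what stops you paying for idle capacity.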
- Manage change in automation
You should automate everything: from writing your infrastructure as code, through building AMI images or containers and automated testing, to deploying new versions of the services in your system.
This fits perfectly into the notion of GitOps, i.e. maintaining the system by implementing changes through git. Each change should go through a pipeline, where different automation components can be added at different stages, usually to test the change and the system. You should be informed about any problems so that you can respond appropriately. The system is even better when a problem can be resolved without your input.
AWS defines a series of good practices to be implemented in your environment to ensure the highest possible reliability of your system. They are:
- Workload Architecture
- Change Management
- Failure Management
When it comes to reliability, it is key to understand the foundations behind it.
- Resiliency: the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues
- Availability: the percentage of time that a workload is available for use
The percentage is calculated for a defined period of time, e.g. a month or a year. The formula is pretty straightforward:

Availability = (Total Time − Downtime) / Total Time × 100%
Usually, this is called “the number of nines”, as indicated below:

- 99% (“two nines”): up to ~3 days 15 hours of downtime per year
- 99.9% (“three nines”): up to ~8 hours 46 minutes
- 99.99% (“four nines”): up to ~53 minutes
- 99.999% (“five nines”): up to ~5 minutes
In interdependent systems, i.e. connected in series, the resultant availability is the product of the availabilities of the individual parts of the setup:

A = A1 × A2 × … × An
So, for three connected components with availability of 99.99% the final result is 99.97%.
For redundant systems connected in parallel, e.g. with components repeated in different Availability Zones, the resultant availability is calculated as follows:

A = 1 − (1 − A1) × (1 − A2) × … × (1 − An)
So, for the system presented below, the resultant availability is 99.9999%
If you trust mathematics, adding a redundant component in the second AWS AZ will lead to a greater availability. That sounds logical!
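The series and parallel formulas above are easy to turn into code. Here is a minimal Python sketch (the function names are mine, chosen for illustration):

```python
from functools import reduce

def series_availability(*components: float) -> float:
    """Components in series: all must work, so availabilities multiply."""
    return reduce(lambda a, b: a * b, components)

def parallel_availability(*components: float) -> float:
    """Redundant components (e.g. replicas in different AZs): the system
    is down only if every copy is down at the same time."""
    unavailability = reduce(lambda a, b: a * b, (1 - c for c in components))
    return 1 - unavailability

# Three 99.99% components in series -> 99.97%
print(round(series_availability(0.9999, 0.9999, 0.9999) * 100, 2))   # → 99.97
# Two 99.9% components in parallel -> 99.9999%
print(round(parallel_availability(0.999, 0.999) * 100, 4))           # → 99.9999
```

The two calls reproduce the 99.97% and 99.9999% figures from the examples above.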
The above formulas are very simple and allow you to quickly calculate the availability of a given system. However, if you don’t know the availability of an individual component, it is going to be hard. To calculate it, you can use the following metrics:
- Mean Time Between Failures (MTBF), i.e. the time between two failures within the system
- Mean Time To Recovery (MTTR), i.e. the time after which your system is restored to its full efficiency
The availability of your component can then be calculated as follows:

Availability = MTBF / (MTBF + MTTR)
Assuming that the app had been working correctly for 92 days, and restoring its full effectiveness took 15 minutes (minutes must be converted into days!), the availability is:

92 / (92 + 0.0104) ≈ 99.99%
Not bad, even though the system fails every three months. Naturally, by placing a second copy of the application in a different Availability Zone or by adding a replica to a Kubernetes cluster, we can raise availability to 99.9996%. This means it’s well worth the effort to factor redundancy into your application so that your system can work in multiple Availability Zones. However, each additional replica means increased cost, e.g. for another EC2 instance or a separate deployment of a new version. You need to bear that in mind if you are aiming for the right balance between availability and cost.
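The MTBF/MTTR calculation from the example above can be sketched in Python (the helper name and the unit conversion are illustrative):

```python
def availability(mtbf_days: float, mttr_minutes: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), with both in the same unit."""
    mttr_days = mttr_minutes / (24 * 60)   # convert minutes to days
    return mtbf_days / (mtbf_days + mttr_days)

# 92 days between failures, 15 minutes to recover
print(round(availability(92, 15) * 100, 2))   # → 99.99
```

Converting MTTR into the same unit as MTBF before dividing is the step most easily forgotten when doing this by hand.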
An efficient and reliable architecture is a combination of the infrastructure and the software it hosts. As far as the former goes, there are usually no problems: you create additional networks and use additional Availability Zones or Regions. It is the software created by developers that has to operate correctly in this environment, e.g. when scaled to five copies. Some people find it perfectly normal that a service operates in several copies. However, during a conference on Kubernetes, I was surprised by a certain anecdote.
The speaker described joining a company where Kubernetes was about to be deployed. After some introduction, he asked a fundamental question: why do you want to use this technology? The answer was simple: it is a modern, currently popular, and commonly used technology. It’s hard to argue with that. However, a new problem appeared after a while, when the client asked: “How come? Our app is supposed to operate in several different copies? This can’t be done!” Well, so much for scaling, replication, and high availability.
Breaking a monolith into microservices is the first step towards high availability, as you can read in our free e-book, Microservices or Monolith. The components become smaller and more manageable. Deployments are safer and quicker, as long as you maintain Continuous Integration / Continuous Deployment (CI/CD) discipline. Obviously, this introduces new challenges, such as proper synchronisation, cooperation, additional tests, etc. High availability comes at a certain price that has to be paid.
The same applies to infrastructure.
Instead of a single monolithic AWS account, let’s use several and connect them into an Organisation. Instead of a single VPC, let’s use several and connect them using VPC peering or AWS Transit Gateway. Instead of a single network, let’s use at least two, Public and Private, or possibly add a Database Network, a Frontend Network, a Backend Network, etc. (not too many, though, or the system will become difficult to configure, and in case of failure your MTTR will grow. Everything in moderation.)
By separating infrastructure components and microservices, you eliminate interdependencies, which gives you more freedom of operation. This way, you eliminate potential cascades like “when Service_A fails, Service_B does not receive information from it and reports an error, due to which Service_C…”, and so on. In the modern microservices approach, such a situation is unacceptable. The same applies to infrastructure components, which should not be interdependent. An example of such a dependency is the “1 Client – 1 Server” pair, where in case of a server failure the client cannot access the service hosted on that server. This problem can be solved by connecting multiple servers and using a Load Balancer for communication with them. Another way to decouple the Client – Server pair is to introduce AWS Simple Queue Service (SQS) for EC2 instances to fetch messages from. Should any problem with processing occur, the message stays in the queue and can be received and appropriately processed by others.
Tightly and Loosely Coupled systems
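To make the decoupling idea concrete without any AWS dependencies, here is a toy Python sketch that mimics, in a very simplified way, the SQS behaviour described above: a message is removed only after successful processing, so a crashed worker does not lose it. This uses Python’s in-process queue as a stand-in, not the real SQS API:

```python
import queue

broker: "queue.Queue[str]" = queue.Queue()

def receive_and_process(process):
    """Simplified SQS-like semantics: delete the message only after
    successful processing; on failure, put it back so another worker
    can pick it up (as if the visibility timeout expired)."""
    msg = broker.get()
    try:
        return process(msg)
    except Exception:
        broker.put(msg)   # message becomes visible again
        raise

broker.put("order-1234")

try:
    receive_and_process(lambda m: 1 / 0)   # this worker crashes mid-processing
except ZeroDivisionError:
    pass

# A healthy worker still finds the message waiting in the queue
result = receive_and_process(lambda m: m)
print(result)   # → order-1234
```

The point is the contract, not the implementation: the producer and consumer never talk directly, so a failing consumer does not propagate its failure back through the system.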
A great example of separating microservices (actually, enforcing such a separation) is placing them in Docker containers and operating them via Kubernetes (AWS EKS). A system comprising such services should, by definition, be resistant to the sudden disappearance of a container. One microservice should not expect persistent data to be delivered by another service, and should be prepared for sudden changes. By definition, such containers are stateless and keep state in external services like a cache; in AWS, you can use AWS ElastiCache. Since session and service information is stored there, a freshly started container does not have to “know” anything about any previous events, nor does it need to store such information permanently.
In one of my projects, the main goals of the client’s service transformation were to move services to Kubernetes and to break a monolith app into smaller parts. The process was quite smooth, the developers were up to the task, and thanks to their professionalism and expertise, our services were independent from one another. The microservices were run in AWS Elastic Kubernetes Service (EKS), which was a natural environment for them. How nice it was that we didn’t have to worry whether a service had four copies or whether a restart could cause problems. This is all in compliance with the notion of microservices.
Microservices with Lazy Loading caching strategy
Your system undergoes continuous changes. Increased traffic from an End Client results in AWS Autoscaling or Kubernetes Pod Autoscaling, and new business requirements force including such solutions in your infrastructure. Such changes should be conducted purposely and subjected to monitoring and tests. This chapter describes basic practices that should be considered for any infrastructure that “wants to” be efficient, which are:
- Resource monitoring and appropriate scaling
- Appropriate management of implemented changes
Ad 1. Resource monitoring
Every reliable and efficient system has to be appropriately monitored. Monitoring should cover resources and their usage, as well as failures that decrease the reliability of your product. Metrics and logs should provide maximum information about the running app and its auxiliary components. They should work as a knowledge base not only for technical teams, but also for the business, so it knows how the functionality of the end product should change. This is crucial for increasing its value and becoming more competitive on the market.
When you begin building a system, you are not sure what to measure or where failures may occur. Rely on your intuition and experience to collect as much information as possible at the early stages. You will collect even more information with time, and (unfortunately) much of it will originate from failures. As mentioned in the previous chapters, a Post Mortem should provide you with the details on what kind of monitoring and logs are still missing from your system. Obviously, in the case of a hybrid infrastructure, where the Cloud is connected with On-Premise systems, you should also monitor resources on the side you’re migrating from. The cloud system shares the load with the On-Premise one, so they influence each other. This means that any delays in request processing on either of the sides will affect the stability of your product.
You can use AWS CloudWatch or Prometheus to monitor your resources or the number of requests in AWS. Logs can easily be collected using services like AWS CloudWatch or AWS Elasticsearch with Kibana, and then analysed in AWS Athena or AWS EMR. It all depends on the amount of data and the expected complexity of the queries. In addition, you can trace requests between your apps using AWS X-Ray or Jaeger, a graduated CNCF project. Whenever a given metric exceeds its defined threshold, a process should be invoked to bring the system state back to normal. In case of high strain on the resources, they should be expanded by an automated Autoscaling process, and if a system admin is required to react, they should be notified via a Slack channel or by phone, for instance thanks to PagerDuty.
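The “metric exceeds its defined threshold” logic can be illustrated with a tiny Python sketch. Requiring several consecutive breaching periods, as CloudWatch alarms do, avoids paging anyone over a single spike; the function and the sample data below are hypothetical:

```python
def evaluate_alarm(samples: list, threshold: float, periods: int) -> bool:
    """Fire only when the metric breaches the threshold for N
    consecutive evaluation periods (simplified CloudWatch-style logic)."""
    if len(samples) < periods:
        return False
    return all(s > threshold for s in samples[-periods:])

cpu = [42, 55, 91, 93, 95]           # percent CPU, one sample per minute
print(evaluate_alarm(cpu, 90, 3))    # → True: three consecutive breaches
print(evaluate_alarm(cpu, 90, 4))    # → False: the fourth-last sample was fine
```

In a real setup the `True` branch would trigger the scaling action or the PagerDuty/Slack notification described above.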
Ad 2. Appropriate management of implemented changes
If you control and purposely introduce changes, your ecosystem is stable, and in case of any problems you can quickly revert a given change. Using Terraform or Pulumi to code your infrastructure, and then describing VM instance components using Puppet or Ansible, ensures repeatability when deploying additional components. Furthermore, app configuration placed in a Kubernetes cluster using Helm, or in AWS ECS using AWS AppConfig, gives you a high level of control over its operation. All these tools will help you restore your system to a previous version in case of a failure after a change is implemented.
Obviously, by testing those changes, you get a lot of information about your system operations. Canary Deployment or Blue/Green Deployment enable early verification of changes with no significant impact on the End Client.
Canary Deployment allows you to test new changes “using” certain users of your product or system. You can deploy a new version and direct e.g. 10% of your traffic there to see if it is stable, what errors occur, or whether the users like the way it works. If all tests are completed with success, you can switch the remaining traffic to the new version.
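A common way to get a stable 10% split is deterministic, hash-based bucketing, so the same user always lands on the same version. Here is a minimal Python sketch; the routing function is an illustrative assumption, not a specific AWS feature:

```python
import hashlib

def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Hash the user id into one of 100 buckets; the low buckets go to
    the canary. Deterministic: the same user always gets the same answer."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

users = [f"user-{i}" for i in range(10_000)]
share = sum(routes_to_canary(u, 10) for u in users) / len(users)
print(f"{share:.1%} of traffic on the canary")   # close to 10%
```

Once the canary version proves stable, raising `canary_percent` to 100 switches the remaining traffic over, matching the rollout described above.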
In one of our previous projects, we had a similar situation while migrating a monolith app into a new one based on microservices. In order to know which users were to be directed to a new system, we marked them with a special tag. A dedicated service, the so-called Application Router, was added to detect the tag in user queries and direct those users to the new system. A series of tests to monitor the efficiency of the new system showed whether it was stable or not. Once we collected positive results, the remaining users were switched to the new system.
A Blue/Green deployment lets you control the switching of the system to a new version. In this case, you assume that the infrastructure or app is already duplicated. At first, traffic is directed to the part containing the existing version, marked Blue. An admin or developer releases a new version of the product, which is put in the duplicated part of the system, marked Green. After deployment, you can run initial tests to see if everything works as intended. If so, you redirect the traffic from Blue to Green, which contains the new version. Obviously, since the actual traffic from the End Client is now directed there, you have to ensure that tests and measurements are continuously conducted. If something goes wrong, you can instantly restore the previous version by switching back to the Blue part.
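Conceptually, the Blue/Green switch is just a pointer flip between two identical environments, which is what makes rollback instant. A minimal Python sketch (the class and URLs are hypothetical):

```python
class BlueGreenRouter:
    """Two identical environments; only the 'live' one receives traffic.
    Rolling back is just flipping the pointer back."""

    def __init__(self, blue_url: str, green_url: str):
        self.environments = {"blue": blue_url, "green": green_url}
        self.live = "blue"

    def endpoint(self) -> str:
        return self.environments[self.live]

    def cut_over(self) -> None:
        self.live = "green" if self.live == "blue" else "blue"

router = BlueGreenRouter("https://blue.example.com", "https://green.example.com")
print(router.endpoint())   # → https://blue.example.com
router.cut_over()          # new version verified on Green: switch traffic
print(router.endpoint())   # → https://green.example.com
router.cut_over()          # something went wrong: instant rollback to Blue
print(router.endpoint())   # → https://blue.example.com
```

In AWS this pointer is typically a load balancer target group or a DNS weight, but the mechanics are the same.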
Proper maintenance and management of a system provides huge value in terms of its stability and reliability. Nevertheless, even the best systems or apps can fail. It’s not that bad when you know what can possibly fail due to implemented, documented, and monitored technical debt. However, when a failure occurs in the least expected moment or area of the system, that’s a problem. The key is to get the information before the failure impacts a wide group of End Clients. In the article on Security, I mentioned the shared responsibility model characteristic of cloud infrastructures. When you decide to go with AWS, you don’t have to worry about failures of the physical servers that your EC2 instances run on, because by definition that is Amazon’s role. They also govern and are responsible for the Availability Zone infrastructure. As the user of their cloud, you have no control over those two items. However, it’s up to you to decide whether you want to go with one or two Availability Zones for the sake of High Availability (HA). If your systems require high availability, you cannot afford the lack of it. Thus, your services, which I’ve already mentioned several times, should be able to work in several copies.
Multi-tier architecture across three Availability Zones
The same principle applies to utilising multiple AWS Regions, which provide you with even more resilience and faster data access for your end clients if your service is used in many places around the world. Provisioning resources for your service in the Ireland region will ensure fast access for clients from Europe, and utilising the Sydney region will take care of those from Australia and Oceania. The Amazon Global Edge Network works on a similar basis; it’s used by AWS CloudFront to offer quicker access to content, such as objects held in AWS S3. If you use several regions, keep in mind that you can route traffic between them using AWS Route 53 and its Geolocation Routing mechanism, with which requests are sent to an appropriate destination based on the location of the End Client.
Another way to minimise losses in case of a system failure is to constantly back up data. “Data” in the broad meaning: database entries, S3 objects, app and infrastructure configuration, etc. A very telling failure case is one in which a whole test environment was (literally) deleted. As you know, there are many safeguards that can prevent such situations from happening in the first place; however, in this very case they couldn’t be used. Everything was lost, except for the configuration of the servers, the app, and the database contents. Members of the team spent a whole day restoring the system to its previous state. Then, they introduced new monitoring and safeguard solutions to avoid such situations in the future, especially in the Production environment.
RTO / RPO relationship
In such cases, the right indicators are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). The former is the company-defined acceptable delay between a system failure and the restoration of its full operability. The RPO, in turn, tells you how much time has passed since the last full backup of the system or its component, i.e. how much data you can afford to lose. Obviously, the shorter these times, the better, but that’s not always possible. Therefore, it’s important to pay attention to how often you back up your system, how efficient its failure monitoring is, and the scope of the automated solutions responsible for restoring the system as soon as possible.
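The two indicators can be illustrated with a short Python sketch using hypothetical timestamps:

```python
from datetime import datetime, timedelta

def rpo(last_backup: datetime, failure: datetime) -> timedelta:
    """RPO for a concrete incident: everything written after the last
    successful backup is lost."""
    return failure - last_backup

def rto(failure: datetime, restored: datetime) -> timedelta:
    """RTO for a concrete incident: how long the service stays down,
    from failure to full recovery."""
    return restored - failure

failure_at = datetime(2020, 8, 26, 14, 30)
print(rpo(datetime(2020, 8, 26, 3, 0), failure_at))    # → 11:30:00 of data lost
print(rto(failure_at, datetime(2020, 8, 26, 16, 0)))   # → 1:30:00 of downtime
```

With nightly backups, the worst-case RPO is a full day; shrinking either number means backing up more often or automating more of the recovery, which is exactly the cost trade-off behind the DR strategies below.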
The aforementioned indicators are the basis for determining your Disaster Recovery (DR) strategy. There are numerous variants of the process, depending on how quickly you want your system back up and running, and with how much data.
- Backup and Restore (RPO measured in a few hours, RTO shorter than 24 hours): data and app backup to a different AWS Region.
- Pilot light (RPO measured in minutes, RTO expressed in hours): keeping a basic version of the system/environment or the app, containing the most critical components, in a different AWS Region. In case of emergency you’ll redirect users to those parts of the system, and meanwhile you’ll be repairing the basic one, to which you will return once it’s stable again.
- Warm standby (RPO measured in seconds, RTO measured in minutes): you keep a full version of your environment in a different AWS Region, however with a smaller number of active instances. Your business critical components are always on standby, and once you switch over to them, the system is momentarily scaled up to keep up with the network traffic.
- Multi-region active-active (RPO not calculated, RTO measured in seconds): the whole system is replicated in several AWS Regions. You keep full data synchronisation between the instances, etc. When a disaster strikes, a service like AWS Route 53 switches the whole network traffic over to a different region.
As you can see, several options are available and each of them has its price in terms of the time it takes to restore full operability of the system and the cost of e.g. keeping a full backup in a different region. Naturally, none of these solutions may work as it should if you don’t test it during the period of normal system operation. Even the best strategy may not work without prior tests.
As you can see, ensuring high efficiency and availability is not that hard. You need to remember a few things: decouple interdependent services, enable your services to run in multiple copies, use several Availability Zones, and use automation for monitoring and data archiving. Obviously, don’t forget to test your system restore mechanisms.
We encourage you to read part 4 – “5 Pillars of AWS Well-Architected Framework”: IV – Performance Efficiency