

11.08.2020 - Read in 12 min.

The 5 Pillars of the AWS Well-Architected Framework: I – Operational Excellence


We encourage you to read the first article in “The 5 Pillars of the AWS Well-Architected Framework” series. It is based on our cooperation with clients and on AWS best practices. We begin with Operational Excellence, one of the foundations our solutions are built upon.


According to the AWS documentation, the list of pillars your project architecture can be based on looks like this:

  • Operational Excellence
  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimization

This list can be used as a sort of foundation for the infrastructure you will build your services upon. Following these guidelines (but treating them as suggestions rather than rigid rules, of course) will give you a solution that’s stable, secure, and efficient both functionally and financially.

This chapter is about: Operational Excellence

 

Here is the official description of this pillar:

“The ability to support development and run workloads effectively, gain insight into their operations, and to continuously improve supporting processes and procedures to deliver business value.”

In short, you should support implementation work (service code and infrastructure) and system maintenance to meet business requirements, mainly delivering value to the end user.

 

This pillar has a lot in common with the broad definition of DevOps that is about removing barriers between Developers (Dev), who create services for the end user, and Admins (Ops), who create space for Devs’ code in the infrastructure. This cooperation (admittedly, maybe not reflected very well in the name) also includes the broadly understood Business (Biz) and other groups responsible for different project domains like Network (Net), Quality Assurance (QA), or Security (Sec). This whole ensemble aims at achieving the highest transparency and the greatest level of cooperation possible among the groups. It’s this agreement and understanding of everyone’s common goal that creates the most value. It’s the first step of a project.

 

Design Principles

Here are the main principles of this pillar:

  • Describe everything with code.
    The word “code” is usually associated with the programming of an app that runs on a server and is reserved for Developers. However, in this new infrastructure management approach, you should also use code—or at the very least a list of variables interpreted by a designated management system—to describe your infrastructure. In my experience, looking at infrastructure and its components, there are three layers described with code:

      • AWS infrastructure — comprised of AWS VPC, Security Groups, AWS S3, AWS EC2, and other components. This is mostly managed with Terraform and Terragrunt. You can also use other tools like AWS CloudFormation, AWS CDK, or Pulumi. Sometimes, for automation purposes (for example, creating and rotating AMI Golden Images), the AWS SDK in the form of the Python boto3 library and Packer are used.

 

Terraform code used to configure AWS EFS

        
#
# https://www.terraform.io/docs/providers/aws/r/efs_file_system.html
# Provides an Elastic File System (EFS) resource
#
resource "aws_efs_file_system" "this" {
  count = var.enable ? 1 : 0

  encrypted        = var.encrypted
  kms_key_id       = var.kms_key_id
  performance_mode = var.performance_mode
  throughput_mode  = var.throughput_mode

  lifecycle_policy {
    transition_to_ia = var.transition_to_ia
  }

  tags = var.tags
}
        
    
      • Operating system (OS) — here, you describe what’s installed in Linux systems with code. Puppet CM with Hieradata is the best choice for this job. With a plethora of well-proven and advanced modules available to the public, plus the ability to write your own classes, it’s perfect for describing the services installed on servers.

 

Puppet code used to configure the pyjojo service

        
# === Class pyjojo::run_service
#
# The class which starts the pyjojo service
#
class pyjojo::run_service {
  if true == $pyjojo::manage_service {
    service { 'pyjojo':
      ensure   => $pyjojo::service_ensure,
      name     => 'pyjojo',
      enable   => $pyjojo::service_enable,
      provider => $pyjojo::init_style,
    }
  }
}
        
    
      • Docker containers — the orchestration of containers in Kubernetes, a service that’s part of the area you work in every day, should also be described with code. The settings of all apps and supporting services should be described in YAML files. Helm is perfect for this task: it greatly supports the process of describing a containerisation system with code and works with CI/CD tools like Flux or Helm Operator.

 

Description of a service implemented with Helm (excerpt)

        
---
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: kuard
  namespace: apps
spec:
  releaseName: kuard
  chart:
    git: git@gitlab.com:kuard/devops/gitops.git
    path: charts/kuard-services
    ref: master
  values:
    master:
      deployment:
        enabled: true
        image:
          registry: gcr.io
          repository: kuar-demo/kuard-amd64
          tag: 2
          imagePullPolicy: Always
        replicaCount: 1
        envVars:
          DEBUG: true
      service:
        enabled: true
        annotations:
          alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "120"
          alb.ingress.kubernetes.io/healthcheck-interval-seconds: "300"
        targetPort: 8080
        type: NodePort
        protocol: TCP
        port: 80
        
    
  • Make frequent, small changes. When you describe an infrastructure with code (as shown above), you often discover that the changes you’ve made are no good and must be fixed or completely reverted. We are often tempted to just sit down and write a lot of code all at once and then implement it in the environment a few days later. However, if you do that, instead of enjoying a new, perfectly working configuration, you will often have to fix several issues or even revert it to a previous version. It’s not a problem if you find the errors quickly, but if you made a lot of changes at once, it could be rather difficult. To avoid such situations, make very frequent changes in the code and environment, keep them small, and fix any issues as soon as possible. You must get a feel for what a “small change” means to you. For some it could be 10 lines of code, while for others it could be 500 lines.
    When it comes to Puppet CM code, I’m in favour of limiting changes to e.g. one class, checking its functionality, and only then implementing subsequent changes. When it comes to Terraform, the first implementation is usually limited to the simplest functionality, e.g. creating an AWS S3 bucket, which is then expanded with e.g. versioning or ACL policies in subsequent steps, as in the sketch below.
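To illustrate, here is a minimal, hypothetical Terraform sketch of that iterative approach (the bucket name and resource labels are made up, and the versioning resource assumes AWS provider v4 or newer): the first small change creates just the bucket, and versioning is added as a separate change afterwards.

# Step 1 – the simplest possible functionality: just create the bucket
resource "aws_s3_bucket" "assets" {
  bucket = "example-project-assets"
}

# Step 2 – a separate, small change added later: enable versioning
resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id

  versioning_configuration {
    status = "Enabled"
  }
}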

 

  • Review procedures. Just because something works great today and matches your ecosystem, it doesn’t mean it will be the same tomorrow.
    Every procedure related to deployment, system launch, coding concept, base infrastructure look, etc. can be improved.
    The first AWS services we built for our client were rather simple: one shared AWS account in which we implemented VPCs reflecting environments like DEV, STG, PROD, etc. Despite its limitations, this approach worked wonderfully. However, our subsequent projects were already a step forward: they used several AWS accounts connected in AWS Organizations, topped off with a different approach to organising access and launching services. As part of that shift, we moved from deploying services to Kubernetes with Jenkins to a GitOps approach based on gitlab-ci and FluxCD.
    Are these changes final? Definitely not. They will most likely get replaced soon with something more efficient.

 

  • Practice failure scenarios and learn from actual ones. If you’re creating an infrastructure where a developer app or its supporting services (such as databases) will be launched, you must keep in mind the possibility of a failure within that system. We all want to rest easy and not be forced to fix database replication or restart services after OOM Killer gets triggered—at least not at night. Keep in mind that behind every system—even the most automated one—is a person who must know what to do in case of failure, after everything else has already failed. I believe that every system should be as simple and transparent as possible. However, that doesn’t mean it should comprise e.g. just one AWS EC2 instance and one S3 bucket. “Simplicity” means having logical connections between components, everything described with code and documented, and no special hacks (if they are necessary, they should be very well documented). The following figures show two systems in separate AWS Regions. Which of them is easier to use? That’s not a trivial question, and there is only one correct answer: “it depends”. If you have well-maintained documentation, you will take to both systems like a duck to water. However, if the “smaller” system has secret configurations and specific dependencies that only its author knows of, then maintaining it and reacting to failures will be much more difficult than in the case of a properly configured and documented “bigger” system.

 

Which system is easier to use?


Once you have a system like this, the first thing you should figure out is what can fail in it. Will it be a DNS issue? Maybe something could go wrong with the Internet connection? What if one of the services disappears? Will it affect any other services? How severely? And what if an admin accidentally deletes the AWS RDS? Are you prepared for that? Do you have data archiving? You can keep going like this. Asking such questions shows that you’re mature and aware of the system and its risks. Of course, even in simple systems it’s impossible to predict everything.
Once you have a list of what can go wrong, you are better equipped to properly configure the infrastructure, add additional safeguards, or document shortcomings and threats that may arise.
When a failure happens and gets resolved, take some time to prepare a Post Mortem. It should include a description of the failure, gathered logs, metrics, a description of all the steps taken to resolve the failure, etc. In the end, you should implement fixes in the defective part of the system and improve it to minimise the risk of the failure happening again.
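For example, part of the answer to “what if an admin accidentally deletes the AWS RDS?” can live in the code itself. The following is a hypothetical Terraform sketch (identifiers and values are made up for illustration, and var.db_master_password is assumed to come from a secrets store) that enables deletion protection and automated backups:

resource "aws_db_instance" "app" {
  identifier        = "example-app-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  username          = "app"
  password          = var.db_master_password # e.g. injected from a secrets store

  # Safeguards for the "what if someone deletes the RDS?" scenario
  deletion_protection       = true  # deletion fails until this flag is turned off
  backup_retention_period   = 14    # days of automated backups to keep
  skip_final_snapshot       = false # take a final snapshot before deletion
  final_snapshot_identifier = "example-app-db-final"
}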

 


Best Practices

AWS defines four best practice areas within the Operational Excellence pillar. Of course, anyone can add more, but these four describe it rather accurately. They are:

  • Organization
  • Prepare
  • Operate
  • Evolve

 

Organization

The most important thing is to keep the business goal in mind and follow the right path to achieve it. All teams should cooperate, not disrupt one another. Maintaining transparency between team members, teams, and the business is key. However, being transparent with the business can be problematic. Let’s face it: the reason is fear. In “pathological” companies/projects, there is sometimes this belief that if you estimate your work will take a long time, you will get “punished” in some way. The table below shows different types of companies/projects based on how they process information. Take a look around and see where you stand…

[Table: types of companies/projects by how they process information – from Pathological to Generative]

In my opinion, truth always wins. You should lay your cards on the table. If you feel that something cannot be done in a week and it will take three weeks instead, the Business should know that. The same applies to potential threats caused by faulty services connected with your app or even by faulty architecture assumptions adopted during its planning phase. All of the above should be communicated clearly so that everyone is aware of the system’s stability and the ways it can affect the end customer.

 

During one of the projects I recently participated in, our team was fairly standard: we had Developers, Testers, a Scrum Master (SM), a Product Owner (PO), and a Cloud Engineer (me). The Product Owner was the contact person for the Business and the only team member not sitting in the room with everyone else. The rest of the team worked in a single room, which fostered communication. Each of us knew the goal we had to achieve to meet the Business’ requirements. Of course, goals were modified, added, and removed during the project. With respect to the table above, this customer definitely belongs in the Generative column – working with them was a real pleasure. There were no misunderstandings. Everything was fully transparent and clear.

During my first meetings with the PO, who represented the Business, we gained knowledge about the project and established what we wanted to achieve and—most importantly—when we would achieve it. During that first talk, the image of the desired AWS architecture had already started to take shape. The next meeting was with the Developers. They listed their current issues, the things they were missing, etc. Their problems mostly concerned the way code was deployed and constantly running out of disk space. That conversation was the true “cherry on top” for the infrastructure architecture.

The main assumptions were as follows:

  • environments should be separated, so Development (DEV), Staging (STG), and Production (PROD) should not influence each other
  • additional environments can be added to act as Shared (SHR) intermediaries, e.g. for a shared DNS system
  • the connection between AWS Cloud and On-Premise should be stable and secure
  • apps should run as Docker images in Kubernetes (AWS EKS)
  • microservice deployments should be automated, with manual promotion to the PROD environment after a series of tests
  • services should be scalable and accessible through Service Discovery via DNS
  • customer data must be stored in a way that ensures its confidentiality and safety, and should be accessible within reasonable time frames
  • security is a high priority
  • metrics should be collected from AWS services as well as from microservices run in Kubernetes
  • services should be scaled horizontally
  • every configuration should be described as code
  • everything should be documented; README and CHANGELOG files should describe how the system is used and how it changes

The prerequisites listed above satisfied both the business and the development assumptions for the infrastructure.

 

Prepare

For everything to work as intended, you need to know what everybody is responsible for, what they can expect from the system, and what the task priorities are. From the infrastructure’s perspective, the previous stage of our project comprised basic Bare Metal servers, with the deployed services interconnected with one another and usually acting as Single Points of Failure (SPOFs). All of this had to be transferred to AWS, with scalability, security, archiving, and full cost management ensured. We knew that we wanted to stick with Docker, now run in Kubernetes. The further plan assumed separation between accounts, strict requirements for user rights (least privilege), and appropriately collected metrics and system logs.

 

On-Premise


AWS Cloud


Obviously, all of this was described in code, using Terraform with Terragrunt. With these assumptions in mind, we proceeded to build the infrastructure by creating a few AWS accounts and connecting them in AWS Organizations. Next, user groups were created in AWS IAM, and IAM Roles with IAM Policies were assigned to them. Obviously, Developers had different access rights than Testers. In subsequent steps, further services were added to support the deployment and maintenance of the product, and additional components were prepared along with their documentation and a list of potential technical debt.
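A minimal, hypothetical Terraform sketch of that kind of IAM setup could look like the one below (the group names are illustrative, and the attached AWS-managed ReadOnlyAccess policy is just an example; the real policies were far more granular):

# Separate groups for Developers and Testers, with different access rights
resource "aws_iam_group" "developers" {
  name = "developers"
}

resource "aws_iam_group" "testers" {
  name = "testers"
}

# Illustrative only: attach an AWS-managed read-only policy to the Developers group
resource "aws_iam_group_policy_attachment" "developers_read_only" {
  group      = aws_iam_group.developers.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}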

 

Operate

Visibility of system operation is very important, as it helps you quickly identify and eliminate problems and shortcomings. Naturally, this has a high impact on the business goals and on how the end customer perceives the product. The first iteration was running the metrics system in AWS CloudWatch. This service is very good when it comes to the visibility of metrics connected with e.g. RDS or AWS Load Balancer. You can easily check whether you are approaching a resource threshold, and quickly generate an event to trigger AWS Lambda or AWS Auto Scaling. We also used CloudWatch to monitor on-premise systems at one of our clients, via the so-called CloudWatch Agent, and we wanted to share these metrics with non-technical staff. After a while, it turned out that this service was not as cheap or as good as we had thought.
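As a rough illustration of that pattern, a CloudWatch alarm can be wired to an Auto Scaling policy along these lines (the Auto Scaling group name, thresholds, and periods are assumptions made up for this sketch):

# Scale out by one instance when the alarm fires
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "example-scale-out"
  autoscaling_group_name = "example-asg"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

# Alarm on high average CPU across the Auto Scaling group
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "example-asg-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 75
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    AutoScalingGroupName = "example-asg"
  }

  # Trigger the scaling policy (this could equally be an SNS topic or AWS Lambda)
  alarm_actions = [aws_autoscaling_policy.scale_out.arn]
}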

 

AWS CloudWatch and collecting metrics


Using it for on-premise system monitoring meant collecting so-called Custom Metrics, which significantly increased the cost of the service. Additionally, sharing it with less technical people required them to sign in to the AWS Console (Single Sign-On was not implemented at the time), which seemed inefficient and not particularly user-friendly. After evaluating the options, we stopped using CloudWatch for on-premise systems and abandoned the plan of sharing those metrics with the business. The service was kept for visualising standard AWS system metrics, while app metrics were handled by Prometheus with Grafana as the visualisation layer. Jackpot! With that, we had a full view of our services running in Docker containers on the Kubernetes cluster.
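For reference, such a Prometheus-plus-Grafana stack can be installed on the cluster with the Terraform Helm provider, for example using the community kube-prometheus-stack chart (a hypothetical, minimal sketch; the namespace and values are assumptions, not what we ran in the project):

resource "helm_release" "monitoring" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true

  # Grafana is bundled with the chart; keep it internal to the cluster
  set {
    name  = "grafana.enabled"
    value = "true"
  }
  set {
    name  = "grafana.service.type"
    value = "ClusterIP"
  }
}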

 

Prometheus supports AWS CloudWatch


Evolve

As you can see from the descriptions above, the initial approach did not always turn out to be the final solution. The system must evolve in order to meet the expectations of its users and remain comfortable to use. Any logs pointing to possible issues should be collected and analysed. Feedback, e.g. from business representatives, is also important, as they have a different perspective from technical employees. In our project, one of the key questions came from the Finance Dept. and was quite prosaic: “What exactly are we paying for in AWS?”. It seems basic and trivial, but while answering it, we learned which components were misused and generated excessive costs. Obviously, everything was then corrected to ensure the optimal cost-to-efficiency ratio. However, we will get back to this in the part regarding Cost Optimization.
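One simple way to make “what exactly are we paying for?” answerable is consistent cost-allocation tagging. A hedged sketch using the AWS provider’s default_tags (tag keys and region below are made up for illustration, and a reasonably recent AWS provider is assumed) would stamp every resource Terraform creates so costs can be filtered by project and environment:

provider "aws" {
  region = "eu-west-1"

  # Applied automatically to every taggable resource created by this provider
  default_tags {
    tags = {
      Project     = "example-project"
      Environment = "prod"
      CostCenter  = "platform-team"
    }
  }
}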

 

Cost Optimization Tools


It turned out to be a great idea to sit in one room with the Developers and the Tester, as they could instantly communicate their needs and report issues with the current implementation of the infrastructure. One interesting result of that was an issue with a JPEG image loading slowly into the app via a web page. We found an image that loaded very slowly, even over a relatively fast Internet connection. I performed tests on the infrastructure side using iperf, AWS Access Logs, and AWS CloudWatch. The data transmission rates in the AWS infrastructure were compliant with the documentation, so the issue had to be somewhere else. Meanwhile, the Developers ran tests on the libraries used in their microservices’ code and found that one of them contained an error. Fixing it required changing not only the library, but also the programming language. The new microservice, written in Node.js, did not cause similar problems.

To sum up, I think it’s good to be open and prepared for changes, instead of clinging to a single approach. There may be numerous ways to implement a given business requirement. Each problem and its solution is a lesson that should be analysed, so that you can avoid repeating the error or quickly solve it if it reappears.

 


 

Summary

As you can see, the scope of the Operational Excellence pillar is quite broad and covers numerous aspects of the system and the infrastructure. You need to keep an eye on the ultimate goal and the resources required to achieve it. Each of the areas mentioned above can be explored more deeply and described in greater detail, which leads us to the remaining pillars, to be covered in the next chapters. Read the article: The 5 Pillars of the AWS Well-Architected Framework: II – Security.


RST Software Masters

Kamil Herbik

AWS Cloud Engineer

An experienced Cloud Engineer who loves DevOps, new technologies, and building new things. When involved in a project, he is not afraid to use cutting-edge technologies and open new paths for its development. A few years ago, he was one of the people who initiated the DevOps transformation at RST, which has had a significant influence on the style of work at the company. He does not believe in the phrase “it cannot be done”, and he always tries to convince himself and others that everything is possible. In his free time, he goes bouldering and spends time with his family.
