The 5 Pillars of the AWS Well-Architected Framework: IV – Performance Efficiency
31.08.2020 - Read in 15 min.
We encourage you to read the fourth article in “The 5 Pillars of the AWS Well-Architected Framework” series. It is based on our cooperation with our clients and on AWS best practices. Read on to learn more about Performance Efficiency.
Make sure to read the three previous parts of “The 5 Pillars of the AWS Well-Architected Framework” series, where we described Operational Excellence, Security, and Reliability in detail.
Today, it’s time to take a closer look at the fourth pillar i.e. Performance Efficiency.
According to the AWS documentation, the list of pillars the architecture of your projects can be based on looks like this:
- Operational Excellence
- Security
- Reliability
- Performance Efficiency
- Cost Optimization
This list can be used as a sort of foundation for the infrastructure you will build your services upon. Following these guidelines (but treating them as suggestions, of course) will give you a framework that’s stable, secure, and efficient both functionally and financially.
This chapter is about: Performance Efficiency
Here is the official description of this pillar:
“It includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve”
In short: you need to make sure that the resources you use in the AWS Cloud are correctly sized to meet your current demands, and that they can be easily scaled when needed. This is also a reasonable approach to costs, as it’s pointless to pay for things you don’t need. I will elaborate on this in the next article, on Cost Optimization.
Here are the main principles of this pillar:
- Democratize advanced technologies
A few years back, apps were created as monoliths. They used a basic architecture with a single HTTP server and, optionally, a MySQL database. The system was simple and clear. Over time, however, this solution turned out to be poorly scalable and inefficient in actual deployments, and consequently harder to maintain. Additional leader-follower or even leader-leader services introduced yet another component to an already complicated jigsaw puzzle of a product. Moving to microservices made it easier to write app code, but also introduced new components that had to be maintained and looked after in case of failure. In the past, one monolith was connected to one database server, while now almost every service in a Service-Oriented Architecture (SOA) has its own SQL or noSQL server that also requires full monitoring, archiving automation, data recovery, etc. What’s more, containerisation and maintaining a Kubernetes Control Plane or an Elasticsearch cluster do not make an admin’s life any easier.
When moving to a cloud like AWS, you should consider the “as a Service” model. Instead of building a complex PostgreSQL or MySQL cluster, use AWS RDS, or, for higher performance and reliability, choose AWS Aurora. Instead of manually maintaining noSQL servers, use AWS DynamoDB; and instead of “worrying” whether the etcd service in a Kubernetes cluster is operating correctly and stably, use the ready-made AWS Elastic Kubernetes Service (EKS), etc. This frees up time for more interesting tasks, like process automation, the deployment system, tests, etc.
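As a rough illustration of the “as a Service” approach, the parameters for a managed PostgreSQL instance in RDS might look like this. This is only a sketch: the identifier, sizes, and retention values are made-up assumptions, while the parameter names follow the boto3 `create_db_instance` API.

```python
# Provisioning a managed PostgreSQL instance via AWS RDS instead of
# running the cluster yourself. All values below are illustrative.
db_params = {
    "DBInstanceIdentifier": "orders-db",  # hypothetical name
    "Engine": "postgres",
    "DBInstanceClass": "db.t3.medium",
    "AllocatedStorage": 50,               # GiB
    "MultiAZ": True,                      # standby replica for HA
    "BackupRetentionPeriod": 7,           # automated backups, in days
}

# In a real script you would pass these to a boto3 client:
#   boto3.client("rds").create_db_instance(**db_params)
```

Everything that this dictionary implies (replication, backups, patching) is then handled by AWS rather than by your team.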
Then there’s the other side of the coin: not all “as a Service” solutions are as efficient as their manually run counterparts. There were times when AWS Elasticsearch in certain situations was not as stable or efficient as one run on AWS EC2 instances. Similarly, with AWS EKS you cannot update the internal components of a cluster on your own schedule; you need to wait for official AWS support. Still, for the majority of products and deployments, the “as a Service” model is sufficient and effectively increases the comfort of working with a given infrastructure. It also makes the transition from On-Premise to Cloud more convenient.
- Go global in minutes
In the previous article, on Reliability, I mentioned many times that your configuration should ensure High Availability (HA). To get an even more reliable and efficient environment and a globally available product, you should consider placing copies of your service in various regions, e.g. Ireland, Tokyo, and São Paulo. This not only ensures higher availability, but also better performance, as services are located closer to clients in various time zones. Naturally, this means additional work to code such an infrastructure and to manage data synchronisation. However, the above-mentioned AWS “as a Service” functionality will help you out with many of the complex tasks.
- Serverless architecture
Using serverless services, i.e. those that free you from looking after the infrastructure, takes the responsibility of maintenance off your hands. Using AWS S3 for static web hosting eliminates the need to own a web server, and DynamoDB, a fully AWS-managed service, simplifies the creation of key-value databases. Finally, there is the very essence of serverless: AWS Lambda, a service that allows you to run microservices as individual functions and lets you easily and quickly meet business requirements without any traditional server instances running a Python, Go, or PHP engine.
- Experiment more often
Clearly, this refers to the continuous evolution of your system, as well as to testing both existing and new solutions. It’s good to try new things, experiment more, and draw conclusions. Obviously, this requires “additional” time, but it pays off. This approach is great for selecting new technologies, when you have several options available and need to choose the one you will ultimately implement. In such situations, it is a good idea to hold a meeting and write down the requirements for the new component in the form of a Mind Map. Then, select three (if available) solutions and perform a Proof-of-Concept (PoC) for each. After completing the PoCs, you should have a summary and the target solution. Experimenting pays off!
- Consider mechanical sympathy
Using the technologies best suited to your requirements is something you should pay special attention to. This may seem like a truism, but let’s take the example of selecting a database. In the majority of cases, MySQL is selected regardless of what it is intended for. This happens because it is a well-known technology and its libraries are proven in action. Not many people ask themselves: “Do I really need a relational database this time?”, “Wouldn’t a key-value solution like AWS DynamoDB increase the efficiency of my queries?”, “Shouldn’t I use an actual document database here instead of forcing PostgreSQL into that role?”, etc. Consider your choice of technology together with your actual app.
AWS defines a series of good practices to be implemented in your environment to ensure the highest possible efficiency of your system. They are: Selection, Review, Monitoring, and Trade-offs.
Selection means choosing a technology, and it is a significant step towards good and efficient system performance, both in terms of technology and costs. When moving to the Cloud, you should learn more about the technologies available there. It’s not easy to select an optimal technology based only on the header of its documentation, so run implementation and operational tests. Let’s take the example of a message queuing system. On-Premise, a very popular option is the RabbitMQ service, which could be replaced by AWS Simple Queue Service (SQS). However, in some cases you cannot directly switch from RabbitMQ to its AWS counterpart; I mean certain features of the queuing system used by your app that are not available in the cloud. In such a case, you have two options: re-write your app to support the technologies available in the cloud, or deploy the On-Premise technologies on EC2 instances and manage them yourself. Each option has pros and cons. Moving the app may take too long, or the lack of a RabbitMQ feature in SQS may lower the quality of your service, and you may be unable to accept either outcome. On the other hand, maintaining your own queuing cluster may cost a lot of time.
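One way to keep the RabbitMQ-to-SQS decision reversible is to hide the broker behind a narrow interface in your app. The sketch below is a minimal illustration of that idea; the class names are hypothetical, and the in-memory backend stands in for a real broker client.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional


class MessageQueue(ABC):
    """The narrow queue interface the application codes against."""

    @abstractmethod
    def send(self, body: str) -> None: ...

    @abstractmethod
    def receive(self) -> Optional[str]: ...


class InMemoryQueue(MessageQueue):
    """Stand-in backend for local tests. A hypothetical SqsQueue or
    RabbitQueue would implement the same two methods (using boto3 or
    pika respectively), so the app can switch brokers without a
    rewrite."""

    def __init__(self) -> None:
        self._messages = deque()

    def send(self, body: str) -> None:
        self._messages.append(body)

    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None
```

The trade-off is that the interface can only expose features both brokers share; anything RabbitMQ-specific still has to be redesigned before a move to SQS.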
Another example is too much confidence that Cloud and Cloud Native are remedies for your On-Premise problems, and that, since they’re extremely popular at the moment, “everything will be fine”. Take the example of moving a database server cluster from On-Premise not to AWS RDS or Aurora, but… to a Kubernetes cluster in AWS EKS. To me, this is an example of overhyping new technologies. Clearly, it’s not a problem to put a database (even a leader-leader one) in Docker containers and manage it in Kubernetes. Yet to me this is more demanding in terms of service time, implementation, and experience than a traditional setup kept on EC2 instances. In the majority of demanding cases, using AWS Aurora is enough and frees you from low-level system maintenance. Therefore, you need to consider everything and make the right choice.
Another important part of Selection is deciding how you are going to migrate your services to the cloud. Will you rely as much as possible on built-in “as a Service” solutions, or stick to traditional EC2 instances and install your components there, just like On-Premise? This is important, as it determines the future of your services in the cloud.
If you select the former approach and decide to use services provided by the AWS Cloud, you get a maintenance-free solution, e.g. database cluster management in AWS RDS or Puppet server configuration in AWS OpsWorks. AWS will be responsible for that, and you can focus on other issues, like how to deploy your app, optimise SQL queries, etc. To get there, you need to describe everything as code, as per the rule described in the article on Operational Excellence. Regardless of the tool used, describing those new technologies will take a lot of time. Ask yourself: do you have room for that in your project?
Another issue is the performance of “as a Service” resources. It may happen that they don’t satisfy your needs in this area. If you add operational costs, it may turn out that they compare poorly with typical EC2 instances and their maintenance. For instance, let’s take AWS RDS and microservices. Creating a separate database instance for each microservice may be inefficient when using small instance classes like db.t3.micro in a TEST environment. A characteristic feature of these servers is that they offer high power (so-called Burst CPU) for a short while, and then their performance drops. This works in a cycle, so after a while you can use the full performance again. As you can see, to handle numerous SQL queries well over a longer period, you first need to purchase more capable (and more expensive) instances. Looking at the performance / price ratio, buying two EC2 instances (in a leader-follower configuration) and configuring multiple databases on them can come out ahead of RDS. Obviously, I do not recommend that for a PROD environment, as the number of SQL queries will probably be higher, which means the ratio is going to be different than in TEST.
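The burst-credit behaviour described above can be sketched with a toy model: credits accrue at a fixed hourly rate, and load above the baseline spends them. The numeric rates below are illustrative assumptions, not exact AWS figures for any instance class.

```python
def simulate_burst(load_pct, baseline_pct=10.0, earn_per_hour=12.0,
                   start_credits=0.0, max_credits=288.0):
    """Toy model of burstable-instance CPU credits.

    One credit is treated as one vCPU-minute. Each hour the instance
    earns `earn_per_hour` credits; any average load above
    `baseline_pct` spends them. Returns the credit balance after each
    hour. The default rates are illustrative only.
    """
    credits = start_credits
    history = []
    for pct in load_pct:                      # hourly average CPU, in %
        credits += earn_per_hour
        burst = max(pct - baseline_pct, 0.0)  # % of a vCPU above baseline
        credits -= burst * 60 / 100           # vCPU-minutes spent bursting
        credits = min(max(credits, 0.0), max_credits)
        history.append(credits)
    return history
```

With these assumed rates, a sustained 40% load spends 18 credits per hour while earning only 12, so the balance drains and performance eventually drops to the baseline, which is exactly the cycle described above.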
Selection: Compute Nodes
If you move all your On-Premise instances 1:1 (the so-called Lift and Shift) to EC2 instances, although it is a rather quick process, you will still spend a lot of time servicing and maintaining the entire system. When choosing this option, you need to carefully select the instances that will run your services. Make sure to select the appropriate size to avoid overpaying or overloading your machines. You can use metrics from AWS CloudWatch to see the level of utilisation of your EC2 resources. The next question is whether you should use On-Demand instances or Spot ones, which allow you to save up to 70%. The low price has its drawbacks, as such instances can be reclaimed by AWS at any moment, and you have only 120 seconds to handle such an event. You have to ensure appropriate automation, so that when an instance is terminated, a new one is launched. Naturally, your services must be resilient to such situations. A good practice is to divide your system into parts, with the primary group placed on EC2 On-Demand instances, while any add-ons run on EC2 Spot instances. This alone will lower your costs.
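The On-Demand/Spot split can be reasoned about with simple arithmetic. The sketch below takes the “up to 70%” discount mentioned above as a fixed assumption; in reality Spot prices fluctuate per instance type, region, and market.

```python
def blended_hourly_cost(on_demand_hours, spot_hours,
                        on_demand_rate, spot_discount=0.70):
    """Cost of splitting a fleet between On-Demand instances (the
    primary group) and Spot instances (interruptible add-ons).

    spot_discount=0.70 reflects the 'up to 70% cheaper' figure used
    in the text; treat it as an assumption, not a quoted price.
    """
    spot_rate = on_demand_rate * (1 - spot_discount)
    return on_demand_hours * on_demand_rate + spot_hours * spot_rate
```

For example, at an assumed $0.10/h On-Demand rate, running 10 instance-hours On-Demand plus 10 on Spot costs about $1.30 instead of $2.00 for an all-On-Demand fleet, a 35% saving from moving just half the load.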
When it comes to AWS, it offers multiple types of EC2 instances with different properties, some of which are:
- Graphics Processing Units (GPU): instances with high computing power from GPUs, recommended for 3D rendering and video compression
- Field Programmable Gate Arrays (FPGA): enable hardware acceleration of custom solutions
- AWS Inferentia (Inf1): instances dedicated to Machine Learning, supporting image recognition, speech recognition, or abuse detection. Perfect for apps using TensorFlow or PyTorch
- Burstable instance families: enable the use of a large amount of resources for a moment, when needed. Perfect for general-purpose applications without high, sustained vCPU demands
Another approach to running your microservices is Docker containers. AWS offers two services that can manage them: AWS Elastic Container Service (ECS) and AWS Elastic Kubernetes Service (EKS). Both will do the job, but the latter runs Kubernetes, which is more widespread and is maintained under the CNCF, guaranteeing stability and broad community support. Switching to containers requires major changes in your approach to app operation and to supporting connected peripherals. However, fully automating your deployment processes and following the GitOps way means that your system is going to be very stable and efficient.
Another solution is to go serverless and choose AWS Lambda, which can be used to create micro-functionalities that satisfy your business goals and thus further reduce the maintenance and utilisation costs of app servers. AWS API Gateway complements this service, allowing you to create, monitor, and manage API calls to your functions.
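A minimal Lambda function behind API Gateway might look like the sketch below. It uses the API Gateway proxy-integration event shape; the greeting logic itself is just a placeholder.

```python
import json


def handler(event, context):
    """Minimal AWS Lambda handler for an API Gateway proxy
    integration: reads a name from the query string and returns an
    HTTP-style response dict in the format API Gateway expects."""
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

No server instance exists here at all: AWS invokes `handler` per request, and you pay only for the execution time.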
You already know that selecting where to run your apps is important. The same precision must be applied to selecting data storage. You need to understand what types of data you use. Is shared space required? Does the data have to be handled by a file system, or can it simply be stored in archival storage? The table below shows the impact on performance.
Another very important piece of the puzzle is selecting the right database system. As mentioned before, the most commonly chosen is the traditional, relational database. Whether you actually need transactions and ACID or not, the selection is usually based on the simplicity of communication, broad experience, and already acquired knowledge. If you took a closer look at your data and query optimisation, you would see that you don’t have to stick to the rigid framework of relational databases.
- Relational DBs: include pre-defined data schemas and relations between them. They support so-called ACID transactions (atomicity, consistency, isolation, durability). In AWS available as RDS, Aurora, and Redshift
- Key-Value DBs: optimised to handle similar queries and large amounts of data. They offer short response times even with a large number of simultaneous requests. One example is AWS DynamoDB, fully managed by AWS
- In-memory DBs: perfect for apps that require very short response times. Data is stored in memory, which means access in microseconds. An example of such a database is Amazon ElastiCache, compatible with Redis and Memcached
- Document DBs: designed to support JSON data structures. AWS provides Amazon DocumentDB, which offers functionality similar to MongoDB
- Graph DBs: perfect for apps that have to operate on and query huge numbers of relationships in connected, graph-like data, where the response time must not exceed milliseconds. AWS Neptune meets those requirements
- Timeseries DBs: allow you to effectively collect data that varies over time, in areas like IoT apps, telemetry, etc. Available as Amazon Timestream
- Ledger DBs: the latest trend in databases, enabling verification of data sent between entities like banks or other financial institutions. They store information about any changes to entries and protect them against deliberate deletion or modification. AWS provides such a service in the form of Amazon Quantum Ledger Database (QLDB)
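To make the relational vs key-value contrast concrete, here is the same access pattern (“a user’s ten latest orders”) expressed both ways. The table and attribute names are hypothetical; the query parameters follow the boto3 DynamoDB `query` API.

```python
# Relational form: a query planner resolves the filter, sort, and limit.
sql = ("SELECT * FROM orders WHERE user_id = %s "
       "ORDER BY created_at DESC LIMIT 10")

# Key-value form: the same pattern becomes a single Query against a
# composite key (partition key user_id, sort key created_at). There is
# no planner; the access pattern must be designed into the key schema.
query_params = {
    "TableName": "orders",
    "KeyConditionExpression": "user_id = :u",
    "ExpressionAttributeValues": {":u": {"S": "user-42"}},
    "ScanIndexForward": False,  # newest first, by the sort key
    "Limit": 10,
}

# In a real script:
#   boto3.client("dynamodb").query(**query_params)
```

If your app only ever asks questions shaped like this, the key-value form can serve them with predictable, single-digit-millisecond latency; if your queries are ad hoc, the relational model keeps its advantage.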
As you can see, there are numerous databases to choose from depending on your needs. You should study what each of them can do, and select the one that’s best for your product.
Once you decide what technologies and resources you are going to use in the cloud, it is a good idea to review them once more and confirm your selection. Obviously, as stated in the previous articles, this is just an initial selection that you are most likely to change in the future. Your environment keeps evolving. To embrace those changes and implement them as soon as possible, you need to follow certain practices. You should:
- describe your infrastructure as code, which ensures repeatability of deployments and a systematic approach. Changes will be visible in the so-called Deployment Pipeline and can be monitored for unauthorised activity.
- define metrics that describe your system. These should include system metrics (like CPU or storage used) and business metrics that monitor the usage of the end product by the end client.
- conduct load tests, which show you whether your system could survive high loads, e.g. during sales events.
- ensure visibility of metrics, to be able to track e.g. the CI/CD process or the number of failures in the system.
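As a toy version of such a load test, the sketch below drives a function in a closed loop and reports the two numbers you typically want out of one: throughput and p95 latency. Real tests need concurrency and realistic traffic shapes (dedicated tools exist for that); this only illustrates the metrics.

```python
import time


def load_test(fn, requests=1000):
    """Naive closed-loop load test: call fn repeatedly and report
    requests per second plus the 95th-percentile latency in
    milliseconds. Single-threaded by design; a sketch, not a tool."""
    latencies = []
    start = time.perf_counter()
    for _ in range(requests):
        t0 = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]  # nearest-rank p95
    return {"rps": requests / elapsed, "p95_ms": p95}


# Example: exercise a stand-in workload.
result = load_test(lambda: sum(range(100)), requests=200)
```

Tracking the p95 (rather than the average) is what reveals whether a small fraction of requests would blow past your latency budget during a sales event.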
Monitoring and visualisation should also cover the visibility of new versions of your system components. This is especially important when using Kubernetes in AWS and the services connected to this system. When you take a closer look at the Cloud Native ecosystem, you will find hundreds of projects that can be used in your containerisation system. A common feature of all those components is their rapid pace of change, which is sometimes hard to keep up with. However, to keep the contents of your Kubernetes cluster stable and relatively up to date, you need to update your services as often as possible. Naturally, you should test new versions prior to deployment and, if possible, review the CHANGELOG.
In order to maintain the high effectiveness of your system, you need to monitor it. Yes, this is yet another time we mention monitoring: you should know at all times what is happening in your system. Comprehensive monitoring comprises five stages:
- Generation: indicates the scope of monitoring or the number of metrics
- Aggregation: indicates the metrics aggregation method, filters, and visibility from various data collection points
- Real-time processing and alarming: indicates data processing and interpretation, and the way information is provided in case of abnormalities
- Storage: indicates data storage location, the scope of retention, etc.
- Analytics: indicates the way of displaying, reporting, and analysing monitoring data
As is widely known, AWS CloudWatch is a monitoring tool built into the AWS ecosystem. It can also trigger other AWS components, like AWS Lambda, when a certain event is detected, so you can properly address the event and communicate it to system admins. Another tool is Prometheus, which, together with components like Alertmanager (alerting) and Grafana (visualisation), supports the monitoring process. As mentioned before, this tool is just perfect for so-called Custom Metrics, for connecting with On-Premise, or for observing the Kubernetes ecosystem. Selecting the tool is up to you, but one thing is certain: you must keep a close eye on your system to ensure it meets business requirements and offers high efficiency and availability.
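As an illustration of the “Real-time processing and alarming” stage, a CloudWatch alarm for sustained high CPU might be defined like this. The keys follow the boto3 `put_metric_alarm` API; the instance id and SNS topic ARN are placeholders.

```python
# Alarm when average CPU on one EC2 instance stays above 80% for two
# consecutive 5-minute datapoints (10 minutes total). Placeholder ids.
alarm = {
    "AlarmName": "ec2-high-cpu",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    "Statistic": "Average",
    "Period": 300,              # seconds per datapoint
    "EvaluationPeriods": 2,     # two consecutive datapoints must breach
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
}

# In a real script:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

The SNS topic in `AlarmActions` is what connects detection to the notification path mentioned above; it could just as well invoke a Lambda function that remediates the problem automatically.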
Clearly, to ensure the highest efficiency of your system and to make sure that the technologies were selected correctly, you need to have as many performance metrics as possible. These include e.g. the number of transactions, slow queries, I/O latency, the number of HTTP(S) requests, etc. We once had an interesting performance issue in AWS, involving the combination of AWS CloudFront, AWS S3, and AWS EKS Pods, where users of one of our clients were experiencing problems with loading JPEG files in the client app. Before the service was moved to AWS, everything worked just fine; after the move, loading times increased drastically, reaching as high as a few minutes. After measuring bandwidth, throughput, latency, etc., nothing suggested an issue on AWS’s side. It turned out that the huge delay was caused by the library used; replacing it with a different one solved the problem. This example shows how important it is to have good system metrics and to ask the right questions: What? Where? Why?
While building your system, there will be many situations requiring a trade-off. You won’t be able to implement everything just the way you envisaged it when sketching your idea on a piece of paper. You need to realise that, so you can avoid the “something went wrong” feeling when something cannot be implemented right away. Building an infrastructure takes a lot of time, and there are sometimes plot twists that require a change in your approach, or even abandoning the initial concept altogether. To a large extent, you can control your trade-offs, and you should accept them only when their priority justifies it.
At the beginning, you need to identify the most needed and critical components of the system. Consider both business and technical requirements, and compare them with your own view on the implementation of your solutions. When you’re making a trade-off, e.g. “we don’t use encryption inside the local network”, you should immediately ask yourself: “How will that impact the End Client and their data?”. Being aware of the threats in this case will let you answer another question, i.e. “How important is this trade-off?”.
Fully focusing on the plan of your infrastructure and its components has a significant impact on its operation in the future. This can be a make-or-break for its efficiency, costs of maintenance, or tech debt. You should be aware of the needs from the very beginning, and select technologies and solutions that are best suited to meet them. Although certain situations may require a trade-off, let your system evolve and keep modifying it, as this is the only way to perfection.
We encourage you to read part 5 – “5 Pillars of AWS Well-Architected Framework: V – Cost Optimisation”.