Cloud Cost Questions for Engineering Managers
You have assumed leadership of a team that operates in a cloud environment. It’s a new beginning, and you are (hopefully) excited about the future, the team members, and most of all the thrill of a new challenge. After the excitement settles down, you start asking questions to better understand the work and the team. That list should include questions about cloud cost and cost optimization.
This article was originally published on Medium. Link to the Medium article can be found here.
In this article, you will find a set of questions worth exploring with your team. I have found them useful in the past, and I believe they will be useful to you too. Without further ado, let’s dive in.
Q: Do we have any budget alarms established?
This is a simple question, but the answer will reveal a lot about the team, the organization, and the emphasis placed on cost management. The ideal answer is “yes”, and depending on the maturity of the team and the organization, you might find that there are several layers of budget alarms. Sometimes these budgets cover different services and environments. If you are in the “yes” camp, give the team kudos 👏
For those of you in the “no” camp ⛺️, don’t despair. Yes, there is a lot of work to do here, but it’s also an opportunity to stand out and raise the standard. All major cloud providers offer the ability to set budget alarms. There are many ways to use them, but the primary reasons are cost awareness and behavior change. Yes, behavior change. You want you and your team to take a moment and ask, “What impact will this change have on the budget?” The alarms remind the team to act more responsibly from a financial perspective. If there is no set budget, review the billing information for the past three to six months and identify a baseline or average.
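As a rough illustration of turning past bills into a starting budget, here is a minimal Python sketch. The monthly figures, the 10% headroom, and the alert levels are all hypothetical values chosen for the example, not recommendations.

```python
from statistics import mean

# Hypothetical monthly totals (USD) from the last six billing statements.
past_monthly_costs = [4200.0, 4350.0, 4100.0, 4800.0, 4500.0, 4650.0]


def budget_thresholds(costs, headroom=0.10, alert_levels=(0.5, 0.8, 1.0)):
    """Return (budget, thresholds).

    The budget is the average monthly cost plus some headroom, and the
    thresholds are the spend amounts at which an alarm should fire.
    """
    baseline = mean(costs)
    budget = round(baseline * (1 + headroom), 2)
    thresholds = [round(budget * level, 2) for level in alert_levels]
    return budget, thresholds


budget, thresholds = budget_thresholds(past_monthly_costs)
print(f"Monthly budget: ${budget:,.2f}")
for amount in thresholds:
    print(f"Alert when spend reaches ${amount:,.2f}")
```

The resulting amounts can then be entered into whichever budget alarm feature your cloud provider offers.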
Budgets are not a one-and-done kind of deal. Budgets are a moving goal and should be reviewed often. You want to aim for a goal, but the reality is that accurately forecasting cost is difficult, and forecasts often change due to external factors. As you and your team develop a good understanding of the major cost drivers, you can start having conversations about how to reduce expenses. But it all starts with measuring. As the saying goes, “What gets measured gets done.”
Q: Is there a tagging policy in place?
Tagging is important for teams to more accurately understand their cost, but it’s critical at the organizational level to understand where financial resources are being allocated. Let’s break this down further, starting at the team level.
By tagging resources, you and your team can more accurately understand expense reports generated by the cloud provider. Let’s use a real-world example. Assume you and your team have a fleet of virtual machines (VMs). Each VM can belong to a different part of the application architecture. Without tagging, how would you identify which part of the architecture is costing X dollars in a given month? Take this example a bit further and assume it’s a multi-tenant environment. If various teams are consuming VMs without tagging, it would be very difficult to understand the cost of each team. You could use a naming convention to identify different teams or parts of your application architecture, but that will not help you break down the cost of VMs at the end of the month when you are reviewing the bill. Cloud service costs can be broken down by tags, and that ability is why tagging is important from a cost management perspective. The two screenshots below illustrate this. The bottom image showcases how the EC2 cost for the month of August is narrowed down to the resources with the tag 12345.
Let’s go back to the fleet of VMs example. By understanding the cost for the different components of the architecture, you and your team can now have a meaningful discussion related to the expenses. Perhaps the cost is justified, or maybe there is room for optimization?
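To make the tag-based breakdown concrete, here is a minimal Python sketch that groups cost line items by a tag. The line items, the `component` tag key, and the record layout are hypothetical; a real billing export will have its own format.

```python
from collections import defaultdict

# Hypothetical line items, shaped loosely like rows in a billing export.
line_items = [
    {"service": "EC2", "cost": 120.50, "tags": {"component": "api"}},
    {"service": "EC2", "cost": 310.00, "tags": {"component": "worker"}},
    {"service": "EC2", "cost": 75.25, "tags": {"component": "api"}},
    {"service": "EC2", "cost": 42.00, "tags": {}},  # untagged resource
]


def cost_by_tag(items, tag_key):
    """Sum cost per value of tag_key; untagged items land in 'untagged'."""
    totals = defaultdict(float)
    for item in items:
        value = item["tags"].get(tag_key, "untagged")
        totals[value] += item["cost"]
    return dict(totals)


for component, total in sorted(cost_by_tag(line_items, "component").items()):
    print(f"{component}: ${total:.2f}")
```

The "untagged" bucket is worth watching on its own: the larger it is, the less the rest of the breakdown can be trusted.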
Let’s go up a layer and look at it from an organizational perspective. Just as an application is made up of numerous components, an organization is made up of teams. Organizations develop budget plans in order to remain profitable and responsibly manage their expenses. From a cloud consumption perspective, this includes understanding how the teams are utilizing cloud resources. Below are a few things organizations commonly ask about:
- Development costs vs production costs
- The total storage cost and growth rate
- Fixed cloud costs vs team variable cost (primary infrastructure vs application infrastructure)
- Compute resource utilization
Without tagging, it’s difficult to break down costs for different environments. Another reason tagging is important to organizations is that it helps hold leaders accountable for their team’s actions. Leaders play a critical role in helping the organization meet budget goals. If leaders are not held accountable for the cost their team is incurring, it’s hardly surprising if the organization fails to stay within the planned budget. Tagging allows the organization to better understand the cost of each team and how much to budget for future expenses.
If organizations have a good grasp of their cloud expenses and resource utilization, they can start looking into purchasing reserved compute instances or prepaying for resources in exchange for a discount. Committing to spend in order to save may sound counterintuitive at first, but it can go a long way in helping organizations reduce their costs.
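The arithmetic behind that trade-off can be sketched in a few lines of Python. The hourly prices below are hypothetical and not taken from any provider’s price list, and the sketch assumes a commitment that is billed for every hour of the month whether or not the instance runs.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


def compare_monthly_cost(on_demand_hourly, reserved_hourly, hours_used):
    """Return (on_demand, reserved) monthly cost in dollars.

    On-demand is billed per hour used; the reserved commitment is
    billed for every hour of the month.
    """
    on_demand = on_demand_hourly * hours_used
    reserved = reserved_hourly * HOURS_PER_MONTH
    return on_demand, reserved


def break_even_utilization(on_demand_hourly, reserved_hourly):
    """Fraction of the month an instance must run before reserving is cheaper."""
    return reserved_hourly / on_demand_hourly


# Hypothetical prices for an instance that runs all month.
on_demand, reserved = compare_monthly_cost(0.10, 0.06, hours_used=HOURS_PER_MONTH)
print(f"Always on: on-demand ${on_demand:.2f} vs reserved ${reserved:.2f}")
print(f"Break-even utilization: {break_even_utilization(0.10, 0.06):.0%}")
```

This is why utilization data matters: an instance that runs well below the break-even point is cheaper on demand, and a commitment would only add cost.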
Tagging is not a perfect solution, but it goes a long way in helping demystify the cost incurred in public cloud environments. There are more benefits to tagging besides financial reporting. Tagging can also help drive automation, data labeling, security, and other metadata purposes. AWS has an example of tagging categories that can be found here.
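A tagging policy is only useful if it is checked. As a minimal sketch, the Python below reports which resources are missing required tags. The required tag keys, the resource IDs, and the inventory layout are all hypothetical; in practice the inventory would come from your cloud provider’s API or a billing export.

```python
# Hypothetical policy: every resource must carry these tag keys.
REQUIRED_TAGS = {"team", "environment", "cost-center"}


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or have an empty value."""
    present = {key for key, value in resource_tags.items() if value}
    return sorted(required - present)


# Hypothetical inventory: resource ID -> tags.
inventory = {
    "vm-001": {"team": "payments", "environment": "prod", "cost-center": "12345"},
    "vm-002": {"team": "payments", "environment": ""},
    "vm-003": {},
}

for resource_id, tags in inventory.items():
    gaps = missing_tags(tags)
    if gaps:
        print(f"{resource_id} is missing: {', '.join(gaps)}")
```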
Q: How are resources primarily provisioned?
How is your team deploying cloud resources? Are they doing “ClickOps”, meaning they utilize a user interface and click their way through deploying the required infrastructure? Or are they writing everything down, and deploying resources through infrastructure as code (IaC)? This could be Terraform, CloudFormation, ARM Templates, or other IaC flavors.
The preferred behavior is to leverage infrastructure as code and avoid manually deploying resources. Manually deploying resources increases the opportunity for mistakes during both the deployment process and the cleanup process. Someone could forget to deploy a required resource, assign a required permission, or set a value for a required environment variable. It’s also likely that something will be missed during the cleanup process. It would be a shame if expensive resources such as managed Kubernetes clusters, data warehouse clusters, or MapReduce clusters were left running unused over the weekend.
There are many reasons why teams should leverage infrastructure as code, but that’s outside the scope of this article. Many of the popular IaC tools help with cost management, such as Terraform Cloud’s cost estimation and Sentinel policies. There are also free solutions available, such as terraform-cost-estimation, aws cloudformation estimate-template-cost, and infracost, to mention a few.
Terraform Cloud Estimation feature example
The neat aspect of these IaC tools is that you can use them with an open-source tool called Open Policy Agent (OPA). OPA is a policy engine that allows administrators to write rules for what may and may not be deployed. This can be used to enforce standards in teams, but it can also be used to prevent mismanagement of cloud resources. As the leader of the team, this greatly helps you prevent accidents or misconfigurations from being deployed.
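Real policies for this tool are written in its own policy language, Rego, so the following is not a working policy. It is only a Python sketch of the kind of rule such a policy expresses: reject a plan that contains a disallowed instance type. The allowed list and the sample plan are hypothetical, and the plan is a trimmed-down, hand-written stand-in for the JSON that `terraform show -json` produces.

```python
# Hypothetical rule: only these EC2 instance types may be deployed.
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small", "t3.medium"}


def violations(plan):
    """Return a message for each aws_instance with a disallowed type."""
    problems = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_instance":
            continue
        after = change.get("change", {}).get("after")
        if after is None:  # resource is being destroyed, nothing to check
            continue
        instance_type = after.get("instance_type")
        if instance_type not in ALLOWED_INSTANCE_TYPES:
            problems.append(f"{change['address']}: {instance_type} is not allowed")
    return problems


# Hand-written sample standing in for a real plan.
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["create"], "after": {"instance_type": "t3.small"}}},
        {"address": "aws_instance.batch", "type": "aws_instance",
         "change": {"actions": ["create"], "after": {"instance_type": "p4d.24xlarge"}}},
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"bucket": "example-logs"}}},
    ]
}

for message in violations(sample_plan):
    print(message)
```

Run as a step in a deployment pipeline, a check like this fails the build before an expensive resource is ever created.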
Q: Is there automation in place to remove idle resources?
This is an investment that all organizations and teams should make. Accidents happen, priorities change, and resources can be left behind on even the best of teams. An automation script that pauses compute instances at a specific time (cron based) is a simple investment that can help reduce unnecessary expenses. Below is a simple example for pausing EC2 instances I’ve used in the past. The script is executed by an AWS Lambda function and triggered by a CloudWatch rule.
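A minimal sketch of what such a handler might look like follows. It is an illustration rather than the script referenced above. It assumes the boto3 library (available in the AWS Lambda Python runtime), an execution role allowed to describe and stop instances, and a hypothetical `auto-stop=true` tag that marks instances as safe to stop.

```python
def running_instance_ids(describe_response):
    """Collect instance IDs from an EC2 describe_instances response."""
    return [
        instance["InstanceId"]
        for reservation in describe_response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def lambda_handler(event, context):
    """Stop running instances that carry the (hypothetical) auto-stop tag."""
    import boto3  # imported here so the helper above runs without the AWS SDK

    ec2 = boto3.client("ec2")
    instance_ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:auto-stop", "Values": ["true"]},
        ]
    ):
        instance_ids.extend(running_instance_ids(page))

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```

Filtering on an opt-in tag is a deliberate choice in this sketch: it keeps the automation from stopping instances that are meant to run around the clock.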
However, many more powerful tools can help you with this type of automation. The open-source tool Cloud Custodian is an excellent resource that many organizations and teams use to maintain their cloud environments. The key is to be proactive and address the most likely scenario that could lead to unnecessary expenses.
In this article, we looked at a few cost-related questions that all engineering managers operating in cloud environments should ask themselves and their teams. These questions are only the beginning of the cost optimization journey. It is a journey because changing behavior and habits takes time and effort. The work may appear mountainous and intimidating in the beginning. Start small and celebrate the early wins. The early successes will help build confidence and encourage behavior changes. If you enjoyed this topic, drop a comment and share ideas of what you would like to learn more about.