FinOps Challenges in Google Cloud Platform

Alexander Goida
5 min read · Jul 18, 2023

This article discusses the challenges posed by financial operations in Google Cloud Platform (GCP) and provides strategies for managing expenses effectively. It emphasizes the need for a dedicated team or individual to perform FinOps operations.

One of the biggest challenges that our team faces on GCP is managing FinOps routines.

FinOps, as described here: https://cloud.google.com/learn/what-is-finops, refers to the financial management of cloud operations, which requires precise tracking of expenses. However, several complications prevent us from achieving full coverage and understanding of our expenses.

  • Firstly, in some cases it is impossible to obtain detailed billing information that describes the specific operations behind the expenses we see in the bill.
  • Secondly, billing data arrives with a delay of almost two days, which creates the risk of a high bill when we experiment on the cloud platform, and there is no easy way to avoid it proactively.
  • Lastly, controlling expenses with quotas can be over-complicated, as there is no clear relationship between the records observed in billing and the quotas, nor any documentation of that mapping.

Nonetheless, we have found approaches to tackle these problems, although they all require experimentation and tuning. Some of them were kindly described by our partner, who is helping us with GCP operations. Below I outline approaches to two of the issues mentioned above:

  1. Addressing delayed billing
  2. Addressing the lack of detailed expenses

Delayed Billing

There are several ways of dealing with delayed billing. GCP offers budgets and alerts, and alerts can be set on forecasted spend. This may be good enough to see that the expense trend is heading toward an overrun, but it does not automatically stop any operations, so if no human acts on the alerts, you will receive a high bill anyway. Standard alerts also won't cover spikes in expenses: the trend line won't react to an unexpected rise, and you will still be billed heavily. This leaves us with two technical problems:

  1. To stop operations which incur high expenses automatically
  2. To spot a spike in expenses immediately, rather than two days later, when forecasting and actual billing would finally reflect it

To address these problems, we need to focus on the actual processes happening on our cloud, rather than just billing. Before implementing any changes, we need to do our homework and identify cloud operations that could result in high expenses, as well as the related resources, such as VMs or other cloud services. Then, every budget alert should be accompanied by a cloud function that will be called once the alert is raised. This function can modify quotas, stop processes, and change infrastructure configurations according to your mitigation strategy. However, it’s important to note that this approach is only effective for addressing known issues and may not be helpful for unknown problems.
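
As an illustration, here is a minimal sketch of such a Pub/Sub-triggered Cloud Function in Python. The payload fields follow the format of programmatic budget notifications; the stop_expensive_jobs() helper is hypothetical and stands in for whatever your mitigation strategy requires.

    import base64
    import json

    def handle_budget_alert(event, context):
        """Triggered by a budget notification published to a Pub/Sub topic."""
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        cost = payload["costAmount"]      # spend so far in the budget period
        budget = payload["budgetAmount"]  # the configured budget
        if cost <= budget:
            return  # still within budget, nothing to do
        # Mitigation strategy goes here: adjust quotas, stop VMs, pause pipelines.
        stop_expensive_jobs()  # hypothetical helper implementing your strategy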

According to Google, it is not possible to set a budget cap for a GCP project, which means there is no way to enforce a hard limit on expenses. However, there is a workaround that may help achieve this goal: it is possible to programmatically disable billing for a GCP project, which stops billing completely. This approach is not generally recommended, as it can harm the infrastructure and cause data loss and other issues, so it should be seriously assessed and tested before being implemented.
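
For completeness, here is a sketch of that workaround, assuming the google-cloud-billing client library. Detaching the billing account this way stops all paid services in the project, so treat it as a last resort:

    from google.cloud import billing_v1

    def disable_billing(project_id: str) -> None:
        # Detach the billing account by setting it to an empty string.
        client = billing_v1.CloudBillingClient()
        client.update_project_billing_info(
            name=f"projects/{project_id}",
            project_billing_info=billing_v1.ProjectBillingInfo(
                billing_account_name=""  # empty value disables billing
            ),
        )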

Another issue to address is identifying expense spikes. Instead of relying on billing information, we need to identify the operations where spikes are expected and create alerts in Cloud Monitoring that react to elevated usage rates. Cloud Monitoring works in near real time, allowing us to quickly spot increased rates of operations on BigQuery or Cloud Storage and use the same cloud-function technique to stop them, or to perform another procedure according to our mitigation strategy.
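
As a sketch, the following creates such an alert policy with the google-cloud-monitoring library; the metric, threshold, and project ID are illustrative assumptions and need tuning to your workload.

    from google.protobuf import duration_pb2
    from google.cloud import monitoring_v3

    client = monitoring_v3.AlertPolicyServiceClient()
    policy = monitoring_v3.AlertPolicy(
        display_name="BigQuery scanned-bytes spike",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[
            monitoring_v3.AlertPolicy.Condition(
                display_name="Scanned bytes above threshold",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    # Illustrative metric: bytes scanned by BigQuery queries.
                    filter='metric.type="bigquery.googleapis.com/query/scanned_bytes"'
                           ' AND resource.type="bigquery_project"',
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=5 * 1024**4,  # e.g. 5 TiB within the window
                    duration=duration_pb2.Duration(seconds=300),
                    aggregations=[
                        monitoring_v3.Aggregation(
                            alignment_period=duration_pb2.Duration(seconds=300),
                            per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                        )
                    ],
                ),
            )
        ],
    )
    client.create_alert_policy(name="projects/my-gcp-project", alert_policy=policy)

A notification channel (or Pub/Sub topic feeding the cloud function above) would then be attached to the policy to trigger the automated response.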

Detailed Expenses

We have found only one way to obtain detailed expenses on GCP. Billing provides quite detailed information using SKUs and services, but some cloud services are represented by several SKUs that have no obvious relationship to each other and are not convenient for monitoring. For example, Dataproc billing consists of several parts: the licensing fee for Dataproc, Compute Engine cores, and RAM. Without additional effort, these expenses cannot be distinguished from those of other VMs running for different purposes. Another example is Cloud Functions used for different APIs: in the standard billing they are all represented by the same SKUs. A further example, for which we have no solution yet, is BigQuery expenses, which sometimes come in higher than estimated even when we use a pricing model that enforces an upper limit on resources. In short, SKUs alone won't provide transparency about how much expense each of our business services incurs.

To improve the situation, you can assign labels to the resources you create, although not all resources support labels. After assigning labels, you need to export expenses to BigQuery; you can configure periodic export to get the most recent billing data (https://cloud.google.com/billing/docs/how-to/export-data-bigquery). To gain insights from this export, we recommend building a pivot table that uses each label key as a dimension. This lets you see expenses by department, system, and the components that serve a business purpose, and understand how much expense each of your business operations incurs. Additionally, you can create custom expense alerts based on this pivot table, on top of the standard budget alerts offered by Google.
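
For example, the following sketch queries the export table for a label-based cost breakdown, assuming a hypothetical table name and labels named department and system:

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT
          (SELECT value FROM UNNEST(labels) WHERE key = 'department') AS department,
          (SELECT value FROM UNNEST(labels) WHERE key = 'system') AS system,
          SUM(cost) AS total_cost
        FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`  -- your export table
        WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
        GROUP BY department, system
        ORDER BY total_cost DESC
    """
    for row in client.query(query).result():
        print(row.department, row.system, row.total_cost)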

To manage all labels efficiently, we follow the IaC approach and create all our cloud infrastructure elements from scripts. The scripts also let us easily recreate every component we need in new GCP projects.
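
As a small illustration of labeling at creation time, here is a sketch that creates a labeled Cloud Storage bucket with the google-cloud-storage library; the project, bucket name, and label values are hypothetical:

    from google.cloud import storage

    client = storage.Client(project="my-gcp-project")
    bucket = client.bucket("my-datalake-staging")
    bucket.labels = {
        "department": "data-eng",
        "system": "ingestion",
        "component": "staging",
    }
    client.create_bucket(bucket, location="EU")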

Bottom Line

The main point of this article is that FinOps requires a serious, ongoing effort and a dedicated team or person to perform various analyses, develop mitigation strategies based on domain knowledge, and execute FinOps operations. Unfortunately, the current level of support for such operations in GCP leaves us wanting more, though I believe it will improve in the future. I also suspect the situation is no better with other vendors such as Azure and AWS, and I would be happy to hear different stories about them. Meanwhile, my team and I will continue executing FinOps for our DataOps.


Alexander Goida

Software Architect in Cloud Services and Data Solutions