Introduction
The rise of SaaS data platforms has given users and admins the ability to create on-demand resources such as compute, RAM, and GPU. While this is beneficial, it can also lead to situations where costs spiral out of control. That is why it is crucial to proactively monitor costs, a practice known as "FinOps."
FinOps, short for Financial Operations, is not just about managing cloud finances. It also promotes a cultural shift within the organization, fostering collaboration between finance, engineering, and business teams. One key aspect of FinOps is shared responsibility. Another important aspect is making data-driven decisions. By actively monitoring costs, organizations can identify areas for optimization and make informed choices.
The goal of FinOps is to achieve a balance between cost optimization and meeting business needs. It is about managing your cloud finances effectively to catch waste early, optimize resource utilization, stay within budget, and make informed decisions.
Databricks and FinOps Phases
There are three phases of FinOps, namely Inform, Optimize, and Operate, which can be applied to Databricks, too. These phases allow organizations to gain visibility into costs, optimize resource usage, and align their plans with business objectives.
- During the Inform phase of FinOps in Databricks, organizations need to gather and analyze data to gain a comprehensive understanding of fully loaded costs and benchmark performance. This involves identifying all the costs associated with running workloads on Databricks, including compute, storage, and data transfer costs.
- In the Optimize phase, the focus shifts toward enabling real-time decision-making and optimizing usage. This involves implementing cost optimization strategies such as rightsizing instances, leveraging spot instances, and optimizing data storage.
- Lastly, in the Operate phase, organizations align their plans with the business objectives and optimize rates. This involves negotiating contracts with Databricks and other cloud service providers to ensure favorable pricing terms.
FinOps in Databricks
To effectively manage expenses in a Databricks workspace, consider the following strategies:
- Understand and verify DBUs: DBUs represent computational resource usage on Databricks. The number of DBUs used depends on the cluster's node count and VM instance capabilities. Different cloud providers have different DBU rates, so verify the rates for your specific cloud service.
- Implement Cluster Policies: Use cluster policies to control DBU usage. Set parameters such as maximum node count, instance types, and auto-termination rules to optimize resource allocation and prevent unnecessary expenses.
- Monitor Compute Costs: Keep track of compute consumption, including VM expenses. Identify underutilized or overutilized resources and adjust cluster sizes accordingly. Consider using auto-scaling to dynamically adjust cluster capacity based on workload requirements.
- Manage Storage Costs: Monitor storage expenses within the workspace. Understand data growth patterns and optimize storage distribution. Make use of managed storage solutions provided by the cloud service.
- Regular Evaluation and Monitoring: Continuously assess DBU, compute, and storage utilization. Look for opportunities to reduce costs and adjust resource allocation as needed. Ensure that the workspace operates within budget constraints.
Understanding DBU
The fundamental unit of consumption is the Databricks Unit (DBU). Except for SQL Warehouses, the number of DBUs consumed is determined by the number of nodes and the computational capabilities of the underlying VM instance types that form the cluster. For SQL Warehouses, the DBU rate is the cumulative sum of the DBU rates of the clusters comprising the endpoint.
At a high level, each cloud provider will have slightly different DBU rates for similar clusters due to variations in node types across clouds. The Databricks website provides DBU calculators for each supported cloud provider (AWS | Azure | GCP).
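The DBU arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the per-node DBU rates and the price per DBU are made-up placeholder numbers, and the names NodeGroup, cluster_dbus, and dbu_cost are hypothetical helpers, not part of any Databricks API. Real rates should come from the DBU calculator for your cloud.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    """A set of identical nodes in a cluster (driver or workers)."""
    count: int
    dbu_per_hour: float  # placeholder rate per node, not a real published rate


def cluster_dbus(groups, hours):
    """DBUs consumed = sum over node groups of (node count * DBU rate) * runtime."""
    return sum(g.count * g.dbu_per_hour for g in groups) * hours


def dbu_cost(dbus, price_per_dbu):
    """Databricks charge for the DBUs only; cloud VM charges are billed separately."""
    return dbus * price_per_dbu


# Illustrative cluster: 1 driver and 4 workers running for 2 hours.
driver = NodeGroup(count=1, dbu_per_hour=1.0)
workers = NodeGroup(count=4, dbu_per_hour=0.75)

total_dbus = cluster_dbus([driver, workers], hours=2.0)
print(f"DBUs consumed: {total_dbus}")
print(f"DBU charge at a placeholder $0.50/DBU: ${dbu_cost(total_dbus, 0.5):.2f}")
```

Keeping the DBU charge separate from the VM charge in a model like this makes it easier to see which of the two a given optimization actually affects.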
Cluster policies: Managing Costs
An admin can use a cluster policy to control configurations for newly created clusters. These policies can be assigned to either individual users or groups. By default, all users can create clusters without restrictions, but caution should be exercised to prevent excessive costs.
Policies limit configuration settings. A policy can fix an attribute to a specific value, restrict it to an allowed set of values, a numeric range, or a regex pattern, or leave it open with a default value. Policies can also cap the number of DBUs (Databricks Units) consumed by a cluster, allowing control over settings such as VM instance types, maximum DBUs per hour, and workload type.
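As a rough sketch of what such a policy might look like, the Python snippet below builds a policy definition and serializes it to JSON. The attribute names and limit types follow the cluster policy definition format as we understand it, but the instance types and numeric limits are purely illustrative assumptions; check the attribute names against the current policy reference before using them.

```python
import json

# Illustrative policy definition. Every value below is an example, not a recommendation.
policy_definition = {
    # Restrict worker instance types to an approved set (example AWS types).
    "node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge"],
        "defaultValue": "i3.xlarge",
    },
    # Cap how far auto-scaling may grow the cluster.
    "autoscale.max_workers": {"type": "range", "maxValue": 8, "defaultValue": 4},
    # Require auto-termination between 10 and 60 idle minutes.
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 30,
    },
    # Cap the hourly DBU consumption of any cluster created under this policy.
    "dbus_per_hour": {"type": "range", "maxValue": 20},
}

# The policy definition is submitted as a JSON document.
policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```

The resulting JSON can be pasted into the policy editor in the admin UI or sent through whatever tooling the team already uses to manage policies.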
Node count limits, auto-termination, and auto-scaling
Clusters that are underutilized or inactive are a frequent problem with computing expenses. Databricks provides solutions to address these issues dynamically without requiring direct user intervention. By implementing auto-scaling and auto-termination functionalities through policies, users can optimize their computational resources without any hindrance to their access.
Policies can enforce node count limits and require auto-scaling, so that every cluster starts with a small minimum number of worker nodes and grows only when the workload demands it.
Also, when setting up a cluster on the Databricks platform, users can configure the auto-termination time attribute. This attribute automatically shuts down the cluster after a specified period of inactivity. Inactivity is determined by the absence of any activity related to Spark jobs, Structured Streaming, or JDBC calls.
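To illustrate how these settings might be audited, here is a minimal Python sketch that checks a cluster specification for missing auto-scaling or auto-termination. The function find_violations and its limits are hypothetical. The field names autoscale, max_workers, and autotermination_minutes mirror the cluster configuration fields as we understand them, with a value of 0 (or no value) for autotermination_minutes taken to mean that auto-termination is off.

```python
def find_violations(cluster_spec, max_workers, max_idle_minutes):
    """Return a list of human-readable cost-control problems for one cluster spec."""
    problems = []

    autoscale = cluster_spec.get("autoscale")
    if autoscale is None:
        problems.append("autoscaling is not enabled")
    elif autoscale.get("max_workers", 0) > max_workers:
        problems.append(
            f"max_workers {autoscale['max_workers']} exceeds limit {max_workers}"
        )

    idle = cluster_spec.get("autotermination_minutes", 0)
    if idle == 0:
        problems.append("auto-termination is disabled")
    elif idle > max_idle_minutes:
        problems.append(
            f"autotermination_minutes {idle} exceeds limit {max_idle_minutes}"
        )

    return problems


# Example specs: one well-behaved cluster and one fixed-size cluster that never stops.
compliant = {
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "autotermination_minutes": 30,
}
risky = {"num_workers": 16}

print(find_violations(compliant, max_workers=8, max_idle_minutes=60))
print(find_violations(risky, max_workers=8, max_idle_minutes=60))
```

A check like this is only a reporting aid. Enforcing the same limits through a cluster policy prevents the misconfiguration in the first place.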
Spot Instances and Cloud instance types
When configuring a cluster, you can choose VM instance types for the driver and worker nodes, each with its own DBU rate. To simplify instance management, a policy can use the "allowlist" type to restrict users to an approved set of instance types, or the "fixed" type to restrict them to a single type. For smaller data workloads, lower-memory instance types are recommended. GPU clusters are well suited to training deep learning models but carry higher DBU rates, so it is important to balance instance type restrictions against workload needs.
If a team needs more resources than the policy allows, its jobs may take longer to run and end up costing more.
Spot instances are discounted VMs offered by cloud providers. However, they can be reclaimed with minimal notice (2 minutes for AWS and 30 seconds for Azure and GCP).
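The effect of mixing spot and on-demand workers on the VM bill can be estimated with simple arithmetic. The sketch below is an assumption-laden illustration: worker_vm_cost is a hypothetical helper, the hourly price and the discount are placeholders rather than real cloud prices, and the model ignores time lost when spot nodes are reclaimed. It covers VM charges only, not DBU charges.

```python
def worker_vm_cost(workers, spot_workers, on_demand_price, spot_discount, hours):
    """Estimate VM charges for a worker pool that mixes on-demand and spot nodes.

    spot_discount is the fraction taken off the on-demand price (0.5 means 50% off).
    """
    if not 0 <= spot_workers <= workers:
        raise ValueError("spot_workers must be between 0 and workers")
    on_demand_workers = workers - spot_workers
    spot_price = on_demand_price * (1 - spot_discount)
    hourly = on_demand_workers * on_demand_price + spot_workers * spot_price
    return hourly * hours


# Placeholder numbers: 4 workers at $1.00/hour on demand, a 50% spot discount, 2 hours.
all_on_demand = worker_vm_cost(4, 0, on_demand_price=1.0, spot_discount=0.5, hours=2.0)
half_spot = worker_vm_cost(4, 2, on_demand_price=1.0, spot_discount=0.5, hours=2.0)

print(f"All on-demand: ${all_on_demand:.2f}")
print(f"Half spot:     ${half_spot:.2f}")
```

Because reclaimed nodes can force work to be repeated, a realistic estimate would also account for longer runtimes on spot-heavy clusters.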
Efficient Storage Attribution
Databricks offers a significant advantage by seamlessly integrating with cost-effective cloud storage options such as ADLS Gen2 on Azure, S3 on AWS, or GCS on GCP. This becomes particularly beneficial when utilizing the Delta Lake format, as it not only facilitates data governance for a complex storage layer but also enhances performance when combined with Databricks.
One common mistake in storage optimization is the failure to implement lifecycle management effectively. While it is recommended to remove outdated objects from your cloud storage, it is crucial to synchronize this process with your Delta Vacuum cycle. If your storage lifecycle removes objects before they can be vacuumed by Delta, it may result in table disruptions. Therefore, it is essential to thoroughly test any lifecycle policies on non-production data before implementing them extensively.
Networking Optimization
Data in Databricks can come from various sources, but the main opportunity to save bandwidth lies in the traffic to and from storage layers like S3 or ADLS. To reduce network costs, deploy Databricks workspaces in the same region as your data and consider regional workspaces if needed. AWS customers using a VPC can decrease networking costs by using VPC Endpoints, which allow connectivity between the VPC and AWS services without an Internet Gateway or NAT device. Gateway endpoints connect to S3 and DynamoDB, while interface endpoints lower costs for compute instances connecting to the Databricks control plane. These endpoints are available when using Secure Cluster Connectivity.
Leveraging Serverless compute
For analytics workloads, consider using a SQL Warehouse with the Serverless option enabled. With Serverless SQL, Databricks manages a pool of compute instances that can be allocated to a user as needed. This means the cost of the underlying instances is included in what Databricks charges, rather than being billed separately as DBU charges plus cloud compute charges. Serverless offers a cost advantage by providing immediate compute resources for query execution, reducing idle costs from underutilized clusters.
Monitoring Usage
Databricks offers tools for administrators to monitor costs, such as the Account Console for an overview of usage and the Budgets API for active notifications when budgets are exceeded. The console's usage page visualizes consumption by DBU or dollar amount, with options to group by workspace or SKU. The upcoming Budgets API simplifies budgeting by sending notifications when budget thresholds are reached, based on custom timeframes and filters.
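The grouping and threshold logic described above can be approximated offline on exported usage data. The following Python sketch assumes a made-up record layout (workspace, sku, dbus, price_per_dbu) and hypothetical helpers spend_by and over_threshold. It does not call the Account Console or the Budgets API, and the numbers are invented for illustration.

```python
from collections import defaultdict


def spend_by(records, key):
    """Total dollar spend grouped by a record field such as 'workspace' or 'sku'."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["dbus"] * record["price_per_dbu"]
    return dict(totals)


def over_threshold(totals, budget, threshold=0.75):
    """Names whose spend has reached the given fraction of the budget."""
    return sorted(name for name, spent in totals.items() if spent >= budget * threshold)


# Invented usage records standing in for an exported usage report.
records = [
    {"workspace": "analytics", "sku": "JOBS", "dbus": 100.0, "price_per_dbu": 0.5},
    {"workspace": "analytics", "sku": "SQL", "dbus": 60.0, "price_per_dbu": 0.5},
    {"workspace": "ml", "sku": "JOBS", "dbus": 40.0, "price_per_dbu": 0.5},
]

by_workspace = spend_by(records, "workspace")
print(by_workspace)
print(over_threshold(by_workspace, budget=100.0))
```

Grouping the same records by "sku" instead of "workspace" shows which kind of workload drives the spend.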
Conclusion
The emergence of SaaS data platforms has led to on-demand resource provisioning, which can sometimes result in unmanageable expenses, necessitating proactive cost monitoring, known as "FinOps." FinOps fosters collaboration between finance, engineering, and business teams, emphasizing shared responsibility and data-driven decision-making. Its three phases (Inform, Optimize, and Operate) are applicable to Databricks too. Cluster policies are crucial for controlling configurations and optimizing computational resources, enabling auto-scaling and auto-termination functionalities to prevent excessive costs.