Efficient Storage Attribution
Databricks offers a significant advantage by integrating seamlessly with cost-effective cloud object storage: ADLS Gen2 on Azure, S3 on AWS, or GCS on GCP. This is particularly beneficial with the Delta Lake format, which not only brings data governance to the storage layer but also enhances performance when combined with Databricks.
One common mistake in storage optimization is failing to implement lifecycle management correctly. Removing outdated objects from cloud storage is recommended, but the process must be synchronized with your Delta VACUUM cycle: if a lifecycle policy deletes objects that the Delta transaction log still references (that is, before Delta has vacuumed them), the table can break. Always test lifecycle policies on non-production data before rolling them out broadly.
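For example, here is a minimal PySpark sketch of a VACUUM run whose retention window is deliberately shorter than the storage lifecycle expiry; the table name and the 7-day/30-day windows are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep 7 days (168 hours) of history for time travel, then vacuum.
# The cloud storage lifecycle rule should only expire objects *older* than
# this window (e.g., 30 days), so Delta always removes files from its
# transaction log before the cloud provider deletes them underneath it.
spark.sql("VACUUM my_catalog.my_schema.events RETAIN 168 HOURS")
```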
Networking Optimization
Data in Databricks can come from many sources, but most network traffic flows between compute and the storage layer (S3 or ADLS), so the largest savings come from colocating them. Deploy Databricks workspaces in the same region as your data, and consider separate regional workspaces if your data lives in multiple regions. AWS customers running Databricks in a VPC can further decrease networking costs with VPC Endpoints, which provide connectivity between the VPC and AWS services without an Internet Gateway or NAT device. Gateway endpoints serve S3 and DynamoDB traffic, while interface endpoints lower the cost of compute instances connecting to the Databricks control plane. Both are available when using Secure Cluster Connectivity.
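As a hedged sketch, an S3 gateway endpoint can be created with boto3 so that cluster traffic to S3 stays on the AWS network instead of routing through a NAT gateway; the VPC ID, route table ID, and region below are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0abc1234def567890",            # VPC hosting the Databricks data plane
    ServiceName="com.amazonaws.us-east-1.s3",  # S3 in the workspace's region
    RouteTableIds=["rtb-0abc1234def567890"],   # route tables used by cluster subnets
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```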
Leveraging Serverless compute
For analytics workloads, consider a SQL Warehouse with the Serverless option enabled. With Serverless SQL, Databricks manages a pool of compute instances that are allocated to users on demand: the underlying instances run in Databricks' account, so you pay a single DBU-based rate instead of separate DBU and cloud-infrastructure charges. Serverless also reduces cost by providing compute for query execution almost immediately, cutting the idle time that underutilized, always-on clusters accumulate.
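A minimal sketch using the databricks-sdk Python package, assuming serverless is enabled for the account; the warehouse name and sizing are illustrative:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import CreateWarehouseRequestWarehouseType

w = WorkspaceClient()  # reads auth from the environment or .databrickscfg

warehouse = w.warehouses.create(
    name="finops-serverless-wh",
    cluster_size="Small",
    auto_stop_mins=10,               # release compute quickly when idle
    enable_serverless_compute=True,  # use the Databricks-managed pool
    warehouse_type=CreateWarehouseRequestWarehouseType.PRO,
    max_num_clusters=2,              # cap concurrency-driven scale-out
).result()
print(warehouse.id)
```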
Monitoring Usage
Databricks provides administrators with cost-monitoring tools: the Account Console gives an overview of usage, and the Budgets API sends active notifications when budgets are exceeded. The console's usage page visualizes consumption by DBU or dollar amount, with options to group by workspace or by SKU. The Budgets API, an upcoming feature at the time of writing, simplifies budgeting by sending notifications when budget thresholds are reached, based on custom timeframes and filters.
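Below is a hedged sketch of listing budgets at the account level with plain requests; because the API was still upcoming at the time of writing, the endpoint path and response fields are assumptions to verify against the current documentation:

```python
import os
import requests

ACCOUNT_ID = os.environ["DATABRICKS_ACCOUNT_ID"]  # account, not workspace, scope
TOKEN = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"https://accounts.cloud.databricks.com/api/2.1/accounts/{ACCOUNT_ID}/budgets",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

# Field names are assumptions; inspect resp.json() for the actual schema.
for budget in resp.json().get("budgets", []):
    print(budget.get("budget_configuration_id"), budget.get("display_name"))
```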
Conclusion
The emergence of SaaS data platforms has brought on-demand resource provisioning, which can result in unmanageable expenses and therefore calls for proactive cost management, a practice known as "FinOps." FinOps fosters collaboration between finance, engineering, and business teams, emphasizing shared responsibility and data-driven decision-making. Its three phases (Inform, Optimize, and Operate) apply to Databricks as well. Cluster policies are a key control: they constrain cluster configurations and, through auto-scaling and auto-termination settings, prevent excessive costs, as the sketch below illustrates.
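A minimal sketch of such a guardrail policy created via the databricks-sdk package; the specific limits are illustrative assumptions, not recommendations:

```python
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

policy = {
    # Force clusters to auto-terminate after at most 30 idle minutes.
    "autotermination_minutes": {"type": "range", "maxValue": 30, "defaultValue": 30},
    # Cap autoscaling so a runaway job cannot claim unlimited workers.
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
}

created = w.cluster_policies.create(
    name="finops-guardrails",
    definition=json.dumps(policy),
)
print(created.policy_id)
```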
Explore more: FinOps for Data Mesh and Data Fabric, and the Importance of FinOps for Google Cloud.