Introduction
Managing cloud data expenses has emerged as a crucial concern for teams running data operations on platforms such as Databricks, Snowflake, BigQuery, Dataproc, and Amazon EMR. Uncontrolled spending can quickly inflate monthly bills, making it hard for organizations to forecast and manage costs accurately. Many organizations struggle to understand where their money is going and routinely exceed their budgets.
What is DataFinOps?
DataFinOps is a combination of two practices, as the name suggests: "DataOps + FinOps = DataFinOps".
DataOps can be interpreted differently depending on who you ask. In this context, we are focusing on a mindset that involves an Agile, proactive, self-service, and automation-driven approach to addressing issues before they escalate.
According to the FinOps Foundation, FinOps is a dynamic discipline and cultural practice in cloud financial management that facilitates organizations in maximizing business value through collaborative decision-making on data-driven expenditures.
DataFinOps essentially aims to align all stakeholders in making informed decisions regarding cloud data budget utilization. By integrating financial discipline, performance engineering, and business priorities, a DataFinOps framework streamlines the process of balancing financial, engineering, and business interests to optimize the cost-efficiency of running data workloads in the cloud.
The role of DataFinOps
DataFinOps goes beyond cost reduction. By eliminating waste, it can cut cloud data costs by 30-50%, saving some organizations millions. Its focus is maximizing the value of investments in a modern data stack while keeping costs optimal. It integrates performance and financial data to provide a holistic understanding of spending and a per-unit cost view of cloud resources.
Principles of DataFinOps
- Cloud cost governance requires collaboration for effective cost management and conflict prevention. Teams that work independently fall out of alignment, so they must work together toward a common goal.
- Spending decisions are driven by business value, since different workloads carry different priorities and investment levels. Informed trade-offs between performance, quality, and cost should be based on the value they bring to the business.
- Individuals are accountable for their cloud usage, which shifts budget responsibility to those who incur the expenses. This is crucial for controlling cloud costs at scale.
- Accessible and timely reports are crucial for data-driven decisions. All stakeholders should have access to the same information for a unified understanding.
The DataFinOps Team
A common question about DataFinOps is how to form a team to implement this approach. Different organizations handle data workloads differently. Some follow the "you built it, you own it" principle, where one person is responsible for data engineering, operations, and cost management. Others divide these roles and responsibilities.
Regardless of your company's structure, a DataFinOps team needs representation from Finance, Data Engineering (including Operations), business individuals like product owners, and technical stakeholders like data architects and platform owners. Each group has a vested interest in cloud data workload performance and cost and approaches it from different perspectives.
The three-stage DataFinOps lifecycle
Often, an organization operates in multiple stages concurrently, with various departments, teams, and individual users at varying levels of maturity.
- Inform: Gain insight into current activity and spending, measure and monitor costs, and pinpoint where funds are going.
- Optimize: Identify and reduce waste and inefficiency, take advantage of cost-effective cloud pricing options, and improve the efficiency of cloud data operations.
- Operate: Move from reactive issue resolution to proactive issue prevention, enable individuals to optimize costs on their own, establish automated guardrails, and keep improving continuously.
Pitfalls to DataFinOps
Many organizations often make the understandable mistake of attempting to utilize the same observability and cost-management tools for DataOps teams that have proven effective for DevOps software teams.
However, it is important to recognize that data applications are fundamentally distinct from web applications. They are constructed and function differently, operate within a unique environment, utilize entirely different technologies, and serve different purposes.
Some application tools were never specifically designed or intended to meet the unique requirements of data applications:
- They do not capture the specific type of telemetry data generated by modern data stacks, including detailed information on sub-parts of parallel job processing.
- They do not apply the necessary analysis techniques to understand, identify, and resolve the distinct class of problems, root causes, and remediations associated with data apps. These issues include factors such as degree of parallelism, load imbalance, skew, and code execution.
Even the point tools offered by cloud providers and platform-native solutions like Overwatch, AWS Cost Explorer, and Microsoft Cost Management still require significant effort on your part to make informed decisions regarding your spending habits and resource allocation.
Inform stage
The Inform stage focuses on empowering the different personas involved in DataFinOps. Cloud data platforms and providers generate millions of cost-billing data points, providing detailed insights into the resources rented, their costs, and duration of usage. This information ultimately contributes to the final bill.
It is crucial to have access to the necessary information to make informed decisions based on data. This involves combining financial data with performance data to clearly understand where the expenses are allocated.
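As a minimal sketch of what combining financial and performance data looks like, the snippet below joins per-job billing figures with per-job throughput to derive a per-unit cost. The job names, dollar figures, and GB-processed numbers are all invented for illustration, not taken from any real platform export.

```python
# Hedged sketch: join billing data with performance data to get a
# per-unit cost view (cost per GB processed). All values are made up.
billing = {"job_a": 120.0, "job_b": 45.0}   # USD per job run
performance = {"job_a": 600, "job_b": 90}   # GB processed per job run

unit_cost = {
    job: round(billing[job] / performance[job], 3)
    for job in billing
    if job in performance  # only jobs present in both feeds
}
# job_a works out cheaper per GB than job_b despite a larger bill
```

The per-unit view is what makes an expensive job defensible (high throughput) or a cheap-looking job suspect (tiny workload, high unit cost).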
Allocating costs
To control costs effectively, it's crucial to understand the financial aspects of your cloud data. This means knowing who is spending how much, where, and when. A DataFinOps approach can visualize cloud costs from different business perspectives by using tagging and data observability. This includes analyzing costs by user, job, team, project, and more.
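Tag-based allocation can be sketched as a simple roll-up over billing records. The record schema below (user, team, job, cost) is illustrative only and does not match any particular provider's billing export.

```python
from collections import defaultdict

# Hypothetical tagged billing records; field names are invented.
billing_records = [
    {"user": "ana", "team": "etl",       "job": "daily_load",  "cost_usd": 42.10},
    {"user": "ben", "team": "etl",       "job": "backfill",    "cost_usd": 310.75},
    {"user": "cai", "team": "analytics", "job": "dashboard",   "cost_usd": 18.40},
    {"user": "dee", "team": "analytics", "job": "ml_training", "cost_usd": 129.99},
]

def allocate_costs(records, tag):
    """Roll up spend by a tag dimension (user, team, job, ...)."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[tag]] += rec["cost_usd"]
    return dict(totals)

by_team = allocate_costs(billing_records, "team")
by_user = allocate_costs(billing_records, "user")
```

The same function answers "costs by user, job, team, project" just by switching the tag, which is why consistent tagging is the foundation of this stage.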
Cost visualization
Cost visualization involves the following operations.
- Creating accurate show-back/chargeback reports
- Identifying top spenders
- Tracking actual spending against the budget
- Forecasting usage/costs
Optimize stage
The Optimize phase focuses on identifying and rectifying areas of overspending, such as waste, inefficiency, variable pricing decisions, and unnecessary expenses. There are numerous opportunities to optimize cloud data workloads for cost efficiency, ranging from minor adjustments to significant changes.
Cost optimization opportunities exist at various levels:
- Application level: This is where costs are initially incurred, occurring frequently throughout the week. The configuration and code details of each job or group of sub-jobs determine the overall cost of execution.
- User level: Another perspective on cost optimization at the job level is considering individual users' impact on expenses.
- Platform level: The platform on which jobs are executed, such as Databricks, Snowflake, Amazon EMR, BigQuery, or Dataproc, also contributes to the overall cost.
- Pipeline level: The efficiency of the entire data pipeline significantly affects costs.
- Cluster level: This level focuses on cloud service provider expenses, encompassing platform, pipeline, and cluster costs to simplify DataFinOps practices.
Cost optimization involves the following operations:
- Shutting down idle clusters
- Right-sizing resources at the job level
- Fixing inefficient code
- Choosing the right cloud pricing model for the task at hand
- Leveraging spot instance discounts
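The first operation above, shutting down idle clusters, can be sketched as a simple last-activity scan. The cluster records and timestamps are placeholders; in practice this data would come from your platform's API, and the termination call would go back through it.

```python
# Minimal sketch of idle-cluster detection; cluster data is invented
# and no real provider API is called.
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(minutes=30)  # policy choice, tune per team

clusters = [
    {"id": "c-101", "last_activity": datetime(2024, 5, 1, 9, 0)},
    {"id": "c-102", "last_activity": datetime(2024, 5, 1, 11, 45)},
]

def find_idle(clusters, now):
    """Return IDs of clusters with no activity for longer than the threshold."""
    return [c["id"] for c in clusters
            if now - c["last_activity"] > IDLE_THRESHOLD]

now = datetime(2024, 5, 1, 12, 0)
to_shut_down = find_idle(clusters, now)  # candidates for termination
```

A scheduled job running this check (and then calling the platform's terminate endpoint) is often the single fastest-payback optimization on this list.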
Operate stage
Cost optimization in cloud data workloads requires continuous attention and monitoring to manage expenses effectively. A DataFinOps approach shifts responsibility for controlling costs to individuals, who are held accountable for their cloud usage. Automation and AI provide real-time insights and recommendations for optimizing cloud data budgets. This proactive approach ensures effective governance of costs through the following steps for cost management.
Automated Guardrails
Establish cost governance policies by clearly defining them. Determine what constitutes excessive costs. Afterwards, implement automated guardrails to ensure compliance with these policies. Establish specific boundaries or thresholds for usage and cost based on factors such as DBU (for Databricks jobs), job duration, job size, resource consumption, and various other detailed metrics that impact expenses.
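A guardrail of this kind can be sketched as a set of per-metric thresholds evaluated against each job run. The policy limits and metric names below ("dbu", "duration_min", "cost_usd") are illustrative assumptions, not a standard schema.

```python
# Hedged sketch of automated guardrails: each policy caps one metric;
# jobs breaching any cap are flagged. Limits here are arbitrary.
POLICIES = {
    "dbu": 500,           # max Databricks units per job run
    "duration_min": 120,  # max runtime in minutes
    "cost_usd": 250,      # max cost per job run
}

def check_guardrails(job_metrics, policies=POLICIES):
    """Return (metric, observed_value, limit) for every breached policy."""
    return [
        (metric, job_metrics[metric], limit)
        for metric, limit in policies.items()
        if job_metrics.get(metric, 0) > limit
    ]

violations = check_guardrails({"dbu": 820, "duration_min": 95, "cost_usd": 310})
```

Defining the policies explicitly, as data, is what makes them auditable and lets teams adjust their own limits without code changes.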
Alerts
When a cost/usage guardrail is breached, the response can vary from a simple alert to more proactive corrective action. Alerts should be customizable to meet the needs of different users, teams, applications, and projects. They must also provide meaningful information to avoid alert storms and alarm fatigue. In some cases, violations may require preemptive actions like terminating jobs or requesting configuration changes.
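The graduated response described above, from a simple alert up to terminating a job, can be sketched as a tiered check. The 1.5x escalation factor is an arbitrary illustrative choice.

```python
# Illustrative tiered response: small breaches notify, large breaches
# trigger corrective action. The escalation factor is made up.
def respond(metric, value, limit, escalation_factor=1.5):
    """Map a metric reading to 'ok', an alert, or a corrective action."""
    if value <= limit:
        return "ok"
    if value > limit * escalation_factor:
        return f"terminate: {metric} at {value} (limit {limit})"
    return f"alert: {metric} at {value} (limit {limit})"
```

Routing the two outcomes to different channels (a chat message vs. an automated kill) is one way to avoid alert storms while still catching runaway spend.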
Self-service tools
Automation and artificial intelligence collaborate to provide engineers with targeted cost-saving suggestions directly. Whenever the AI detects excessive spending on a task, workload, or cluster, it offers a recommendation to the engineer. This eliminates the need for engineers to sift through charts, graphs, metrics, and billing details, as they receive clear, actionable insights on the necessary steps to take.
Conclusion
Managing cloud data expenses is a pressing concern for anyone overseeing data operations on cloud platforms. Unchecked spending leads to high bills and makes costs hard to predict and manage. DataFinOps, a fusion of DataOps and FinOps, aims to align stakeholders in making informed decisions about cloud data budgets, maximizing value while keeping costs optimal. The DataFinOps lifecycle consists of three stages: inform, optimize, and operate. A common pitfall is reusing tools designed for DevOps instead of observability and analysis tools built for data applications. Across the lifecycle, merging financial and performance data is what lets every persona involved understand cloud costs and act on them.
- Explore more about FinOps for Data Mesh and Data Fabric
- Importance of FinOps: Optimizing the Cloud Cost