Kubernetes Operators and Framework | Quick Guide

Gursimran Singh | 23 September 2024

Kubernetes Operators and Framework Overview

Kubernetes Operators - Streamline Cluster Management

Kubernetes adoption keeps growing, and most organizations are now keen to adopt containerization. Kubernetes has matured to the point where stateful applications such as databases can be deployed on it quickly, but on its own it offers little help in managing them afterwards. Deploying a stateful application is the easy part. As time passes, the application may need updates such as resizing, reconfiguration, backup, or healing, and there are many possible reasons for such updates, for example a security issue.

This article covers Kubernetes Operators in detail. Suppose an application has a master/slave architecture, or is a database running as a multi-node cluster, and we want to add or remove instances. That may require preparation and post-provisioning steps that are done manually, which puts a burden on the DevOps team. Operators solve this problem through automation: an Operator is an automated software manager for Kubernetes (k8s) applications that handles installation and lifecycle management. The Operator pattern was introduced by CoreOS in 2016.

What are Kubernetes Operators?

"An Operator is a means of packaging, deploying, and maintaining a Kubernetes application" (CoreOS). Anyone can build an Operator for their application, provided they have expertise in both the application and Kubernetes. Many open-source Operators are now available for complex applications such as Prometheus and Kafka. Operators are purpose-built to run a Kubernetes application, with operational knowledge baked in.

In this blog, we are going to see:

  1. How Kubernetes can be used to manage and automate the training process for machine learning tasks.
  2. Why TensorFlow matters here: it is widely used in the ML field and lets developers train and deploy their models across many platforms and languages.
  3. Which operators are available to ease the training process, such as tf-operator and mpi-operator.
  4. How tf-operator and mpi-operator are used to train models.

Kubernetes, the open source software for container orchestration, has become a popular choice for automating the deployment and management of workloads on containers. Inspired by Google's own orchestration system, Borg, Kubernetes follows the principles of simplicity and flexibility, allowing for scalability and usability in diverse contexts.

However, the initial functionality of Kubernetes is limited to a set of commands and operations exposed through APIs. To perform more complex tasks, additional automations are needed. This is where Operators come into play.

Kubernetes Operators are controllers that manage application logic and ensure the desired state of the cluster. By executing loops to check the actual state and reconcile it with the desired state, Operators streamline the management process. These Operators are built on the principles of simplicity, flexibility, and automation, which are the foundation of the Kubernetes architecture.
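The reconcile loop described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real Operator is implemented: the `Cluster` class and the `reconcile` function are hypothetical names invented for this example, and the "cluster" is just an in-memory list rather than the Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a real cluster: just a list of running replica names."""
    replicas: list = field(default_factory=list)


def reconcile(desired_replicas: int, cluster: Cluster) -> list:
    """Run one pass of the loop: compare actual state with desired state and act.

    Returns the list of actions taken so the caller can log them.
    """
    actions = []
    # Scale up: create replicas until the actual count matches the desired count.
    while len(cluster.replicas) < desired_replicas:
        name = f"replica-{len(cluster.replicas)}"
        cluster.replicas.append(name)
        actions.append(f"create {name}")
    # Scale down: remove the most recently created replicas first.
    while len(cluster.replicas) > desired_replicas:
        name = cluster.replicas.pop()
        actions.append(f"delete {name}")
    return actions


if __name__ == "__main__":
    cluster = Cluster()
    print(reconcile(3, cluster))  # first pass creates the missing replicas
    print(reconcile(3, cluster))  # second pass finds nothing to do
    print(reconcile(1, cluster))  # desired state changed, so replicas are removed
```

The point of the sketch is that the loop is idempotent: running it again when the actual state already matches the desired state results in no actions.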

Kubernetes Operator Framework

The Operator Framework is an incredible open source project that offers a range of developer and runtime Kubernetes tools, empowering you to expedite the development of an operator.

The Operator Framework consists of:

1. Operator SDK

This innovative tool empowers developers to build operators based on their expertise, without the need for in-depth knowledge of Kubernetes API complexities.

2. Operator Lifecycle Manager

A crucial component that oversees the installation, updates, and management of the lifecycle of all operators running across a Kubernetes cluster.

3. Operator Metering

An essential feature that enables the reporting of usage for operators providing specialized services.

With the Operator Framework, you have all the necessary tools and resources to streamline the development and management of operators, making your Kubernetes experience even more efficient and effective.

Kubernetes Operator's Goals for Machine Learning Tasks

ML needs solid infrastructure to handle all the operations required to train a model. The exact requirements depend on the models being trained, but teams also have to take care of resource management and monitoring, because in the end they must make sure that their ML models are scalable and portable, which is painful to do by hand.

We have covered a lot about Operators and Kubernetes, but why move training workloads to Kubernetes? As noted above, containerization makes applications easier to deploy, manage, and monitor. That is exactly what ML developers need today, along with automation through DevOps.

What is the TensorFlow Operator?

The TensorFlow operator (tf-operator) is a Kubernetes operator that is part of Kubeflow. It makes it easy to run TensorFlow jobs on Kubernetes, whether distributed or non-distributed. A TFJob is the Kubernetes custom resource used to run TensorFlow training jobs, and the tf-operator monitors and keeps track of each TFJob. Kubeflow maintains these operators; it is a collection of components that make it easy to run machine learning code in various forms within Kubernetes. For example, we can deploy Kubeflow, which brings up the tf-operator deployment, then define our TFJobs and run as many training jobs as we want on the Kubernetes cluster.

To deploy the tf-operator, check out the Kubeflow GitHub repository for details: "https://github.com/google/kubeflow.git". After deploying the tf-operator, we have to define and deploy TFJobs. Let's look at an example to understand this in detail. Generally, a TF cluster consists of workers and parameter servers: workers run copies of the training, while parameter servers maintain the model parameters. You can find more at Distributed TF Kubeflow. For a sample TFJob, you can jump here. Once your TFJob YAML is ready, deploy it on the Kubernetes cluster to start training the model.

The TFJob custom resource defines a TFJob resource for K8s. The TFJob YAML file below consists of TfReplicas. Each TfReplica defines a set of TF processes performing one role in the job, i.e., master, worker, or ps. Each TfReplica contains a Kubernetes pod template, which specifies the process that runs in each replica. TFJob can handle distributed as well as non-distributed training.

    apiVersion: kubeflow.org/v1beta1
    kind: TFJob
    metadata:
      generateName: tfjob
      namespace: kubeflow
    spec:
      tfReplicaSpecs:
        # Parameter servers maintain the model parameters.
        PS:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: tensorflow
                  image: your_imagename
                  command:
                    - python
                    - -m
                    - trainer.task
                    - --batch_size=32
                    - --training_steps=1000
        # Workers run copies of the training.
        Worker:
          replicas: 3
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: tensorflow
                  image: your_imagename
                  command:
                    - python
                    - -m
                    - trainer.task
                    - --batch_size=32
                    - --training_steps=1000
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: tensorflow
                  image: your_imagename
                  command:
                    - python
                    - -m
                    - trainer.task
                    - --batch_size=32
                    - --training_steps=1000
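
Because the three replica types in the manifest above differ only in their replica count, the same structure could also be generated programmatically. The following is a minimal sketch using only the Python standard library; `replica_spec` and `build_tfjob` are hypothetical helper names made up for this example, and the field values simply mirror the YAML above.

```python
import json


def replica_spec(replicas: int, image: str, command: list) -> dict:
    """Build one entry of tfReplicaSpecs: a replica count plus a pod template."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    {"name": "tensorflow", "image": image, "command": list(command)}
                ]
            }
        },
    }


def build_tfjob(image: str, workers: int = 3, ps: int = 1) -> dict:
    """Assemble a TFJob manifest with PS, Worker, and Master replicas."""
    command = [
        "python", "-m", "trainer.task",
        "--batch_size=32", "--training_steps=1000",
    ]
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "TFJob",
        "metadata": {"generateName": "tfjob", "namespace": "kubeflow"},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica_spec(ps, image, command),
                "Worker": replica_spec(workers, image, command),
                "Master": replica_spec(1, image, command),
            }
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML, so JSON output is enough here.
    print(json.dumps(build_tfjob("your_imagename"), indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl create -f`. Since the manifest uses `generateName` rather than `name`, `kubectl create` is the appropriate command rather than `kubectl apply`.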

The example above trains TensorFlow models using tf-operator, which relies on centralized parameter servers for coordination between workers. An alternative is a decentralized approach in which workers communicate with each other directly via the MPI allreduce primitive, without using parameter servers. Let's see how we can use mpi-operator to train our models on Kubernetes.
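
To make the difference between the two coordination styles concrete, here is a small single-process simulation in Python. It is a sketch under stated assumptions, not MPI code: the function names are hypothetical, gradients are plain lists, and the allreduce is implemented as a ring (one common way to implement allreduce, chosen here for illustration; the article itself only refers to the MPI allreduce primitive). For simplicity the sketch assumes each gradient has exactly as many elements as there are workers, so every chunk is a single element.

```python
def parameter_server_average(grads):
    """Centralized: every worker sends its gradient to one server, which averages."""
    n = len(grads)
    averaged = [sum(column) / n for column in zip(*grads)]
    # The server sends the same averaged gradient back to every worker.
    return [list(averaged) for _ in range(n)]


def ring_allreduce_average(grads):
    """Decentralized: workers only talk to their neighbour in a ring."""
    n = len(grads)
    data = [list(g) for g in grads]  # data[i] is worker i's local buffer
    if any(len(buf) != n for buf in data):
        raise ValueError("this sketch assumes one element per worker")

    # Phase 1 (reduce-scatter): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the sends in a step are simultaneous.
        messages = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for sender, chunk, value in messages:
            data[(sender + 1) % n][chunk] += value

    # Phase 2 (all-gather): completed chunks are passed around the ring
    # until every worker holds every summed chunk.
    for step in range(n - 1):
        messages = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for sender, chunk, value in messages:
            data[(sender + 1) % n][chunk] = value

    # Each worker divides locally to turn the sums into averages.
    return [[value / n for value in buf] for buf in data]


if __name__ == "__main__":
    gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(parameter_server_average(gradients))
    print(ring_allreduce_average(gradients))
```

Both functions leave every worker with the same averaged gradient. The difference is in the communication pattern: the first routes all traffic through one central node, while the second only exchanges data between neighbouring workers.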

Guide to Mpi-operator

Like tf-operator, the mpi-operator is a Kubernetes operator; it is used to run allreduce-style distributed training. Before deploying Kubeflow and the mpi-operator on your Kubernetes cluster, you can check this link for more details. To deploy the mpi-operator, follow this link, git clone the repo, and then, from the root of the cloned repo, run this command:

    kubectl create -f deploy/

This command deploys the mpi-operator with the default values. Now check the CRD with the command below:

    kubectl get crd

Now that we have successfully deployed the mpi-operator, it is time to define an MPIJob for our machine learning training models. The YAML file below is the simplest MPIJob:

    apiVersion: kubeflow.org/v1alpha1
    kind: MPIJob
    metadata:
      name: tensorflow-benchmarks-16
    spec:
      gpus: 16
      template:
        spec:
          containers:
            - image: mpioperator/tensorflow-benchmarks:latest
              name: tensorflow-benchmarks

To verify the status of your MPIJobs, run this kubectl command:

    kubectl get -o yaml mpijobs

Automate Workloads with Kubernetes Operators

The rise of Kubernetes and containerization has made deploying applications and services more accessible: we can scale up as many services as we want and monitor all of them without difficulty. And thanks to Operators, it has become easier to manage and automate even stateful applications.

Empower your Kubernetes experience with the automation capabilities of Kubernetes Operators. These powerful controllers streamline the deployment and management of workloads, ensuring the desired state of your cluster. With the Operator Framework, you have all the necessary tools and resources to expedite the development and management of operators, making your Kubernetes experience even more efficient and effective.

With the automation capabilities of Kubernetes Operators, developers can focus on training their models without the hassle of infrastructure management. By leveraging the power of Kubernetes and Operators, machine learning tasks become more scalable, portable, and automated. Experience the benefits of automating your workloads with Kubernetes Operators today.
