Model offloading
Offload burst traffic from your model deployments to Beamlit.
When you deploy applications on your own clusters, these apps experience diverse and varying traffic throughout the day, yet all consumers expect to be served without downtime. If your applications are GPU-accelerated AI workloads, this can result in critical failures, because booting additional worker machines is time-consuming and often requires manual intervention.
In this situation, IT teams like to let software (think: a model) overflow onto additional clusters. Beamlit provides extensive offloading capabilities so you can use Beamlit’s Global Inference Network as a standby computing platform in case of traffic bursts or outages.
With Beamlit, you can reference your own private Kubernetes cluster as the origin for a replica of your model on Beamlit. This defines a hybrid deployment between your cluster and Beamlit, where load-balancing is handled automatically based on a triggering metric. It is also possible to define the replica on another one of your own private clusters, so you can effectively federate your own clusters across multiple regions.
In short, there are two supported modes for model offloading:
- Offload from your own Kubernetes cluster to Beamlit Global Inference Network: this gives you access to Beamlit’s smart routing of inferences, highly available edge computing regions and complete observability.
- Offload between two of your own Kubernetes clusters: this gives you total control over the execution clusters themselves, but requires more upfront work from you (setting up the clusters). This is completely open-source and can be done at no extra cost using Beamlit’s open-source controller.
Activating offloading
Offloading is controlled by an open-source controller that must be installed in your own Kubernetes cluster. This controller is responsible for remotely creating and controlling the model on the destination cluster, as well as monitoring the health of your app. Meanwhile, a Beamlit gateway (installed alongside the controller) load-balances requests between your deployment and the one on the destination cluster when needed.
Offloading is triggered when a metric (called the offloading metric) hits a certain threshold; you specify both the metric and the threshold. Once offloading is triggered, the Beamlit controller automatically routes part of the inbound traffic to the destination cluster, using a configurable strategy. This is completely transparent to your application consumers, who keep calling the app the same way.
Offloading remains active until the offloading metric has stayed back under the threshold for a certain buffer duration. When offloading is not active (meaning the offloading metric has not hit the threshold), all inbound requests go to your own pods on your cluster.
To summarize, offloading is a two-phase process:
- First, you configure one of your models or AI applications to be offloaded to a destination, based on a metric and threshold value. At this stage, all requests are handled by your cluster as normal but the Beamlit controller is in Standby (i.e. just watching) state.
- Then, if offloading is triggered, the Beamlit gateway load-balances between your own pods and the destination. It keeps doing so until the offloading condition has no longer been met for the buffer duration.
Offload from your own Kubernetes cluster to Beamlit Global Inference Network
To set up model offloading to Beamlit, you need to deploy a model on Beamlit that uses a model from your cluster as its source. This is done using the open-source Beamlit controller.
Prerequisites
- A Kubernetes cluster, on which you have a Deployment (or StatefulSet, ReplicaSet, DaemonSet) that is the model or AI application you want to offload to Beamlit
- One of the supported metric servers, for monitoring the health of your app (an example installation is shown after this list)
- The Beamlit controller installed on your cluster. The controller is fully open-source and lets you manage Beamlit resources in a Kubernetes-native way.
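For example, if you plan to use Kubernetes metrics-server as the metric source, it can be installed from its official release manifest (shown here as an illustration; use whichever supported metric server fits your setup):

```bash
# Install Kubernetes metrics-server (one possible metric source)
# from its official release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Check that resource metrics are being reported
kubectl top pods
```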
This operation uses a CRD (Custom Resource Definition) which gets installed at the same time as the controller. You will create a CR (Custom Resource) pointing at your origin Deployment and apply it via the controller to deploy a ‘copy’ of the model on Beamlit.
Create the following CR in a file my-deployment.yaml, and edit the values as needed:
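A minimal sketch of what this CR can look like is shown below. The apiVersion, kind and field names (deployment.beamlit.com/v1alpha1, ModelDeployment, modelSourceRef, offloadingConfig, and so on) are illustrative assumptions, not the authoritative schema; refer to the controller’s CRD reference for the exact fields:

```yaml
# Illustrative sketch only: the apiVersion, kind and field names below are
# assumptions, not the authoritative schema of the Beamlit controller CRD.
apiVersion: deployment.beamlit.com/v1alpha1
kind: ModelDeployment
metadata:
  name: my-model
  namespace: default
spec:
  # The Deployment on your own cluster used as the origin for the Beamlit replica
  modelSourceRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-model
    namespace: default
  offloadingConfig:
    # Share of inbound traffic routed to Beamlit while offloading is active
    behavior:
      percentage: 50
    # Offloading condition: trigger when average CPU utilization exceeds 80%
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
```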
The offloading metric in the example above is a resource metric (CPU utilization) from Kubernetes metrics-server. Read our guide about how to set up this offloading metric.
Apply the CR using kubectl:
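```bash
kubectl apply -f my-deployment.yaml
```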
The model is now deployed on Beamlit and the controller is in watching state for the offloading condition to be met. ✅
To un-deploy the model, simply delete the deployment from Beamlit using kubectl (via the controller):
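```bash
kubectl delete -f my-deployment.yaml
```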
Offload between two of your own Kubernetes clusters
This deployment mode exclusively uses the open-source Beamlit controller. It’s important to note that you’re responsible for deploying your service on both clusters. The Beamlit controller and gateway will then handle the traffic offloading and routing from the origin cluster to the destination when the specified offloading condition is met.
Prerequisites
- A destination Kubernetes cluster. We can provide documentation for correctly setting up your clusters. The original Deployment must also be deployed in this second cluster and reachable at a specific address.
This operation uses a CRD (Custom Resource Definition) which gets installed at the same time as the controller. You will create a CR (Custom Resource) pointing at your origin Deployment and apply it via the controller to make offloading available.
Create the following CR in a file my-deployment.yaml, and edit the values as needed:
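As before, the sketch below is only illustrative: the apiVersion, kind and field names (including the remoteBackend block holding the destination address, and the example host and port) are assumptions rather than the controller’s authoritative schema:

```yaml
# Illustrative sketch only: apiVersion, kind and field names are assumptions,
# not the authoritative schema of the Beamlit controller CRD.
apiVersion: deployment.beamlit.com/v1alpha1
kind: ModelDeployment
metadata:
  name: my-model
  namespace: default
spec:
  # The Deployment on the origin cluster
  modelSourceRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-model
    namespace: default
  offloadingConfig:
    # Share of inbound traffic routed to the destination while offloading is active
    behavior:
      percentage: 50
    # Address at which the same Deployment is reachable on the destination cluster
    # (hypothetical host and port for this example)
    remoteBackend:
      host: my-model.destination.example.com
      port: 8080
    # Offloading condition: trigger when average CPU utilization exceeds 80%
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
```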
Apply the CR using kubectl:
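```bash
kubectl apply -f my-deployment.yaml
```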
The controller is now in watching state for the offloading condition to be met. ✅
To un-deploy the model offloading, simply delete the model deployment using kubectl (via the controller):
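```bash
kubectl delete -f my-deployment.yaml
```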