When you deploy applications on your own clusters, they experience varying traffic throughout the day, yet all consumers expect to be served without downtime. For GPU-accelerated AI workloads, traffic spikes can result in critical failures, because booting additional worker machines is time-consuming and often requires manual intervention.

In this situation, IT teams typically want to let workloads (think: a model) overflow onto additional clusters. Beamlit provides extensive offloading capabilities so you can use Beamlit’s Global Inference Network as a standby computing platform in case of a traffic burst or an outage.

With Beamlit, you can reference your own private Kubernetes cluster as the origin for a replica of your model on Beamlit. This defines a hybrid deployment between your cluster and Beamlit, where load-balancing is handled automatically based on a triggering metric. It is also possible to define a replica on another one of your own private clusters, so you can effectively federate your clusters across multiple regions.

In short, there are two supported modes for model offloading:

  • Offload from your own Kubernetes cluster to Beamlit Global Inference Network: this gives you access to Beamlit’s smart routing of inferences, highly available edge computing regions and complete observability.
  • Offload between two of your own Kubernetes clusters: this gives you total control over the execution clusters themselves but requires more upfront work from you (setting up the clusters). This mode relies entirely on open-source software and can be used at no extra cost with the Beamlit open-source controller.

Activating offloading

Offloading is controlled by an open-source controller that must be installed in your own Kubernetes cluster. This controller is responsible for remotely creating and controlling the model on the destination cluster, as well as monitoring the health of your app. Meanwhile, a Beamlit gateway (installed alongside the controller) load-balances requests between your own deployment and the one on the destination cluster when needed.

Offloading is triggered when a metric (called the offloading metric) hits a certain threshold; you specify both the metric and the threshold. Once offloading is triggered, the Beamlit controller automatically routes part of the inbound traffic to the destination cluster, using a configurable strategy. This is completely transparent to your application consumers, who keep calling the app the same way.

The offloading state persists until the offloading metric has stayed back below the threshold for a certain buffer duration. When offloading is not active (meaning the offloading metric has not hit the threshold), all inbound requests go to your own pods on your cluster.

To summarize, offloading is a two-phase process:

  1. First, you configure one of your models or AI applications to be offloaded to a destination, based on a metric and threshold value. At this stage, all requests are handled by your cluster as normal but the Beamlit controller is in Standby (i.e. just watching) state.
  2. Then, if offloading is triggered, the Beamlit gateway will load-balance between your own pods and the destination. It will keep doing so until the offloading condition is no longer met and remains so for long enough.

Offload from your own Kubernetes cluster to Beamlit Global Inference Network

To set up model offloading to Beamlit, you need to deploy a model on Beamlit that uses a model from your cluster as its source. This is done using the open-source Beamlit controller.

The ‘source of truth’ for these models is your cluster. As such, creating and managing them can only be done through the Beamlit controller, not from the GUI or APIs.

Prerequisites

  • A Kubernetes cluster running a Deployment (or StatefulSet, ReplicaSet, DaemonSet) for the model or AI application you want to offload to Beamlit

  • One of the supported metric servers, for monitoring the health of your app.

  • The Beamlit controller installed on your cluster. The controller is fully open-source and lets you manage Beamlit resources in a Kubernetes-native way (see the installation sketch after this list).
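
For example, here is a minimal installation sketch. The metrics-server command is the official one from the metrics-server project; the Helm repository URL and chart name for the controller are placeholders, so use the ones from the controller’s own documentation:

# Install the standard Kubernetes metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Install the Beamlit controller (repository URL and chart name below are placeholders)
helm repo add beamlit https://example.com/beamlit-charts
helm install beamlit-controller beamlit/beamlit-controller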

This operation uses a CRD (Custom Resource Definition) which gets installed at the same time as the controller. You will create a CR (Custom Resource) pointing at your origin Deployment and apply it via the controller to deploy a ‘copy’ of the model on Beamlit.

Create the following CR in a file my-deployment.yaml, and edit the values as needed:

apiVersion: deployment.beamlit.com/v1alpha1
kind: ModelDeployment
metadata:
  name: my-model
spec:
  model: my-model-on-Beamlit # Name the model will have on Beamlit
  environment: production # Can also be development
  modelSourceRef: # The origin workload on your cluster (the 'source of truth')
    apiVersion: apps/v1
    kind: Deployment
    name: my-model
    namespace: default
  serviceRef: # The Service exposing your model inside your cluster
    name: my-model
    namespace: default
    targetPort: 8080 # Port your model listens on
  offloadingConfig:
    behavior:
      percentage: 50 # Share of traffic routed to the destination while offloading is active
    metrics: # Offloading metric and threshold that trigger offloading
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 50 # Trigger when average CPU utilization crosses 50%

The offloading metric in the example above is a resource metric (CPU utilization) from the Kubernetes metrics-server. Read our guide on how to set up this offloading metric.
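
As a quick sanity check before relying on this metric, you can use standard kubectl commands to verify that the resource metrics API is available and that CPU metrics are being reported for your pods:

# Check that the resource metrics API is served in the cluster
kubectl get apiservice v1beta1.metrics.k8s.io

# Confirm that CPU metrics are reported for your model's pods
kubectl top pods -n default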

Read our full reference on ModelDeployment for a complete description of all other parameters.

Apply the CR using kubectl:

kubectl apply -f my-deployment.yaml

The model is now deployed on Beamlit and the controller is in watching state for the offloading condition to be met. ✅
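
You can inspect the resource afterwards with kubectl. The lowercase-plural resource name modeldeployments below is an assumption based on the usual Kubernetes naming convention; adjust it if the CRD registers a different name:

# List ModelDeployment resources managed by the controller
kubectl get modeldeployments

# Inspect the status and events of a specific model
kubectl describe modeldeployment my-model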

To un-deploy the model, simply delete the deployment from Beamlit using kubectl (via the controller):

kubectl delete -f my-deployment.yaml

Offload between two of your own Kubernetes clusters

This deployment mode exclusively uses the open-source Beamlit controller. It’s important to note that you’re responsible for deploying your service on both clusters. The Beamlit controller and gateway will then handle the traffic offloading and routing from the origin cluster to the destination when the specified offloading condition is met.

Prerequisites

  • A destination Kubernetes cluster. We can provide documentation for setting up your clusters correctly. The original Deployment must also be deployed in this second cluster and be reachable at a specific address (see the sketch below).
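
As a minimal sketch, one way to make the Deployment on the destination cluster reachable is to expose it with a Service there. All names, labels and the LoadBalancer type below are illustrative; the resulting address is what you will later reference as the remote backend:

apiVersion: v1
kind: Service
metadata:
  name: my-model
  namespace: default
spec:
  type: LoadBalancer # Any exposure mechanism works, as long as the origin cluster can reach it
  selector:
    app: my-model # Must match the pod labels of your Deployment on this cluster
  ports:
    - port: 80
      targetPort: 8080 # Port your model listens on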

This operation uses a CRD (Custom Resource Definition) which gets installed at the same time as the controller. You will create a CR (Custom Resource) pointing at your origin Deployment and apply it via the controller to make offloading available.

Create the following CR in a file my-deployment.yaml, and edit the values as needed:

apiVersion: deployment.beamlit.com/v1alpha1
kind: ModelDeployment
metadata:
  name: my-model
spec:
  model: my-model
  environment: production # Can also be development
  modelSourceRef:
    apiVersion: apps/v1 
    kind: Deployment 
    name: my-model 
    namespace: default
  serviceRef:
    name: my-model
    namespace: default
    targetPort: 8080 # Port your model listens on
  offloadingConfig:
    remoteBackend:
      host: my-model-on-another-cluster:80 # Address where the destination cluster serves the model
      scheme: http
    behavior:
      percentage: 50
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 50
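
Before applying the CR, it can be worth checking that the remote backend address is reachable from the origin cluster. This is just a sketch; the /health path is illustrative, so use whatever endpoint your model actually exposes:

# Run a throwaway pod in the origin cluster and probe the remote backend
kubectl run reachability-test --rm -it --restart=Never \
  --image=curlimages/curl -- curl -s http://my-model-on-another-cluster:80/health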

Apply the CR using kubectl:

kubectl apply -f my-deployment.yaml

The controller is now in watching state for the offloading condition to be met. ✅

To un-deploy the model offloading, simply delete the model deployment using kubectl (via the controller):

kubectl delete -f my-deployment.yaml