Offload burst inference from a self-hosted model
Learn how to create a hybrid replica of a model deployed on your own Kubernetes cluster.
This tutorial demonstrates how to create a hybrid deployment of a model hosted on your own private Kubernetes cluster with Beamlit, making your app more resilient to events such as burst traffic or hardware failure.
Prerequisites
- A Beamlit workspace
- A running Kubernetes cluster where you have deployed an ML model (as a Deployment, StatefulSet, ReplicaSet, or DaemonSet).
- Helm. Version 3.8.0 or later is recommended in order to use OCI-based registries. Check out the Installation Guide.
- A Prometheus server containing the metrics used to trigger offloading. Make sure you have its address, as you will need it later.
Guide
On a conceptual level, model offloading works by defining a model deployment on Beamlit that references a Kubernetes Deployment on your own cluster. This is done via the open-source Beamlit controller. You interact with the controller using Kubernetes Custom Resources (CRs) in order to:
- create the corresponding resources on Beamlit
- monitor metrics on Prometheus
- route traffic based on trigger conditions, via a Beamlit gateway installed alongside the controller
Create a service account
Service accounts act as virtual users in your Beamlit workspace, representing external systems that need to control Beamlit resources. The Beamlit controller authenticates using a service account in your workspace.
Open the Beamlit console. In Workspace Settings > Service Accounts, create a new service account. Make sure to retrieve its client ID and client secret.
Install the Beamlit controller
The next step is to install the open-source controller in your cluster. This tutorial installs the controller in a `beamlit` namespace, which is a recommended best practice.
Run the following commands to add the Helm repository and install the package. Make sure to edit the variables `CLIENT_ID` and `CLIENT_SECRET` with the values previously retrieved, and `config.metricInformer.prometheus.address` with the address of your Prometheus server:
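The snippet below is a sketch: the chart repository URL and the credential value keys are assumptions (check the controller's documentation for the authoritative chart reference and values), while `config.metricInformer.prometheus.address` is the setting named above.

```sh
# Service account credentials retrieved earlier
export CLIENT_ID=<your-client-id>
export CLIENT_SECRET=<your-client-secret>

# Repository URL and credential value keys are illustrative placeholders;
# consult the beamlit-controller chart's values.yaml for the exact names.
helm repo add beamlit https://beamlit.github.io/helm-charts
helm repo update
helm install beamlit-controller beamlit/beamlit-controller \
  --namespace beamlit \
  --create-namespace \
  --set beamlitApiTokenSecret.clientId="$CLIENT_ID" \
  --set beamlitApiTokenSecret.clientSecret="$CLIENT_SECRET" \
  --set config.metricInformer.prometheus.address="http://prometheus.monitoring.svc.cluster.local:9090"
```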
You can verify that the controller was successfully installed by running:
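A simple check is to list the pods in the namespace; the controller pod should be in the `Running` state (its exact name depends on the chart's naming):

```sh
kubectl get pods --namespace beamlit
```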
Make your model overflow
On Beamlit, a Kubernetes Deployment is mapped to a ModelDeployment. These are fully serverless and consume resources only when they are actively processing requests. Offloaded models activate only when a metric reaches a threshold. At all other times, requests go to your Deployment, while the model on Beamlit remains idle.
Create a ModelDeployment custom resource in a file `my-deployment.yaml`; the controller will reconcile it and create the corresponding resources remotely on Beamlit. Use the template YAML below and replace the placeholders with your specific values:
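The sketch below illustrates the shape such a resource can take. The API group, version, and field names are assumptions based on this tutorial's description (a reference to your local workload, plus an offloading metric and threshold); check the controller's CRD reference for the authoritative schema:

```yaml
# Field names below are illustrative placeholders; see the
# beamlit-controller CRD documentation for the exact schema.
apiVersion: deployment.beamlit.com/v1alpha1
kind: ModelDeployment
metadata:
  name: my-model
  namespace: beamlit
spec:
  # Name of the model deployment to create on Beamlit
  model: my-model
  # The workload already running in your cluster
  modelSourceRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-ml-model
    namespace: default
  # Offload traffic when this Prometheus metric crosses the threshold
  offloading:
    metric: <your-prometheus-metric-or-query>
    threshold: "80"
```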
With this resource, the Beamlit controller will monitor a Prometheus server for the value of the offloading metric. If you want to use Kubernetes’ metrics-server instead, follow our guide here.
Create the model deployment by running the following command in your cluster:
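Applying the manifest hands it to the controller for reconciliation; you can then list the custom resources to confirm creation (the plural resource name is defined by the controller's CRD):

```sh
kubectl apply -f my-deployment.yaml

# Confirm the custom resource exists
kubectl get modeldeployments --namespace beamlit
```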
A model deployment is now created on Beamlit. It remains in standby mode until the activation metric reaches the threshold; when it does, the model becomes active and requests start being routed to Beamlit, ensuring all your consumers are served.
Monitor offloaded requests
You can verify when requests are being routed to Beamlit by opening the Beamlit console and navigating to the Models page. There, whenever offloading is active, you'll see the model marked as actively handling offloaded requests, along with real-time metrics.
For further reference, read our documentation about model offloading.