# Adapter Rollout

The goal of this guide is to demonstrate how to roll out a new adapter version.
## Prerequisites

Follow the steps in the main guide.
## Safely roll out v2 adapter

### Load the new adapter version to the model servers
This guide leverages the LoRA syncer sidecar to dynamically manage adapters within a vLLM deployment, enabling users to add or remove them through a shared ConfigMap.
Modify the LoRA syncer ConfigMap to initiate loading of the new adapter version.
```bash
kubectl edit configmap vllm-llama2-7b-adapters
```
Change the ConfigMap to match the following (note the new entry under `models`):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama2-7b-adapters
      port: 8000
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-1
          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-2
          source: mahimairaja/tweet-summarization-llama-2-finetuned
```
The new adapter version is applied to the model servers live, without requiring a restart.
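To confirm the load, you can query a model server's OpenAI-compatible `/v1/models` endpoint, which lists dynamically loaded LoRA adapters alongside the base model in recent vLLM versions. The sketch below assumes the pool's pods carry an `app=vllm-llama2-7b-pool` label; adjust the selector to match your deployment:

```bash
# Pick one model server pod from the pool (label selector is an assumed example).
POD=$(kubectl get pods -l app=vllm-llama2-7b-pool -o name | head -n 1)
kubectl port-forward "$POD" 8000:8000 &
sleep 2
# The new adapter id should appear in the model list.
curl -s localhost:8000/v1/models | grep tweet-summary-2
kill %1
```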
### Direct traffic to the new adapter version
Modify the InferenceModel to configure a canary rollout with traffic splitting. In this example, 10% of traffic for the `tweet-summary` model will be sent to the new `tweet-summary-2` adapter.
```bash
kubectl edit inferencemodel inferencemodel-sample
```
Change the `targetModels` list in the InferenceModel to match the following:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: tweet-summary
  criticality: Critical
  poolRef:
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary-1
    weight: 90
  - name: tweet-summary-2
    weight: 10
```
The above configuration means that, on average, one in every ten requests will be sent to the new version. Try it out:
- Get the gateway IP:

  ```bash
  IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=8081
  ```

- Send a few requests as follows:

  ```bash
  curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
    "model": "tweet-summary",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }'
  ```
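To get a rough sense of the split, you can send a batch of requests and tally the `model` field echoed in each response. This sketch assumes the model server reports the adapter id that served each request, in which case the counts should land near 90/10:

```bash
# Send 100 requests and count responses per adapter (approximate check).
for i in $(seq 1 100); do
  curl -s ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' \
    -d '{"model": "tweet-summary", "prompt": "San Francisco is", "max_tokens": 5, "temperature": 0}' \
    | grep -o '"model": *"[^"]*"'
done | sort | uniq -c
```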
### Finish the rollout
Modify the InferenceModel to direct 100% of the traffic to the latest version of the adapter:

```yaml
spec:
  modelName: tweet-summary
  criticality: Critical
  poolRef:
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary-2
    weight: 100
```
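If you prefer a non-interactive update over `kubectl edit`, a merge patch that replaces the `targetModels` list should be equivalent (the resource name below matches the sample manifest shown earlier):

```bash
# Replace the targetModels list in one step.
kubectl patch inferencemodel inferencemodel-sample --type merge \
  -p '{"spec":{"targetModels":[{"name":"tweet-summary-2","weight":100}]}}'
```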
Unload the older versions from the servers by updating the LoRA syncer ConfigMap to list the older version under the `ensureNotExist` list:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama2-7b-adapters
      port: 8000
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-2
          source: mahimairaja/tweet-summarization-llama-2-finetuned
      ensureNotExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-1
          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
```
With this, all requests should be served by the new adapter version.
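As a final check, you can query `/v1/models` on a model server again and verify that the old adapter id is gone. As before, the pod label selector is an assumed example:

```bash
POD=$(kubectl get pods -l app=vllm-llama2-7b-pool -o name | head -n 1)
kubectl port-forward "$POD" 8000:8000 &
sleep 2
# Expect no match for the unloaded adapter.
curl -s localhost:8000/v1/models | grep tweet-summary-1 || echo "tweet-summary-1 unloaded"
kill %1
```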