Inference Pool

Alpha since v0.1.0

The InferencePool resource is alpha and may have breaking changes in future releases of the API.

Background

The InferencePool resource is a logical grouping of compute resources (e.g., Pods) that run model servers. The InferencePool deploys its own routing and offers administrative configuration to the Platform Admin.
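To make this concrete, the following is a minimal sketch of what an InferencePool manifest might look like. The API group/version and the field names (selector, targetPortNumber) are illustrative assumptions; the spec linked at the bottom of this page is authoritative.

```yaml
# Illustrative sketch of an InferencePool; field names are assumptions,
# consult the authoritative spec before use.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: my-inference-pool
spec:
  # Label selector for the model-server Pods grouped by this pool.
  selector:
    app: my-model-server
  # Port on which the selected Pods serve inference requests.
  targetPortNumber: 8000
```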

It is expected for the InferencePool to:

  • Enforce fair consumption of resources across competing workloads
  • Efficiently route requests across shared compute (as demonstrated by the PoC)

It is not expected for the InferencePool to:

  • Enforce that any common set of adapters or base models is available on the Pods
  • Manage Deployments of Pods within the Pool
  • Manage the lifecycle of Pods within the Pool

Additionally, any Pod that seeks to join an InferencePool must support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.
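For illustration, a Pod joins a pool simply by carrying labels that match the pool's selector. The manifest below is a hypothetical sketch (names and image are placeholders); such a Pod would additionally need to implement the project-defined protocol, e.g. by exposing the information the Pool's routing relies on.

```yaml
# Hypothetical model-server Pod that the InferencePool above would
# select via its labels; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: my-model-server-0
  labels:
    app: my-model-server      # matches the pool's selector
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # placeholder
      ports:
        - containerPort: 8000  # matches the pool's targetPortNumber
```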

InferencePool has some small overlap with Service, displayed here:

[Figure: Comparing InferencePool with Service]

The InferencePool is not intended to be a mask of the Service object; it exposes only the bare minimum of networking configuration, allowing the Platform Admin to focus less on networking and more on Pool management.
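For comparison, the closest plain Service analog of the pool sketched above would look roughly like this. Both select Pods by label and forward traffic to a target port, which is where the overlap lies, while the InferencePool layers its own routing and Pool-level administration on top.

```yaml
# Plain Service covering the same Pods as the pool sketched above;
# the overlap with InferencePool is the label selector and target port.
apiVersion: v1
kind: Service
metadata:
  name: my-model-server
spec:
  selector:
    app: my-model-server
  ports:
    - port: 8000
      targetPort: 8000
```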

Spec

The full spec of the InferencePool is defined here.