Overview
With our Containers service, you can create your own inference endpoints to serve your models while paying only for the compute that is in active use.
We support loading containers from any registry and are quite flexible about how the container is built.
You can deploy your first container by following this guide: Tutorial: Deploy with vLLM
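As a quick illustration (separate from the tutorial itself), below is a minimal sketch of calling a deployed endpoint from Python, assuming the container serves an OpenAI-compatible completions API as vLLM does; the URL, API key, and model name are placeholders to replace with the values from your own deployment.

```python
# Hedged sketch: send a completion request to a deployed container endpoint.
# ENDPOINT_URL, API_KEY, and the model name are placeholders, not real values.
import requests

ENDPOINT_URL = "https://your-container-endpoint.example/v1/completions"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder credential

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "my-model", "prompt": "Hello!", "max_tokens": 32},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```

If your container exposes a different API, adjust the path and request body accordingly.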
Features
Scale to hundreds of GPUs when needed with our battle-tested inference cluster
Scale to zero when idle, so you only pay while your container is running
Support for any container registry, using either registry-specific authentication methods or a vanilla Docker config.json-style auth (see the sketch after this list)
Both manual and request queue-based autoscaling, with adjustable scaling sensitivity
Logging and metrics in the dashboard
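As a small illustration of the config.json-style auth mentioned above, the sketch below builds a standard Docker credential block in Python; the registry host, username, and password are placeholders.

```python
# Hedged sketch: produce a Docker config.json-style credential block.
# Docker's config.json stores base64("username:password") under "auths",
# keyed by registry hostname. All values below are placeholders.
import base64
import json

registry = "registry.example.com"  # placeholder registry host
username = "my-user"               # placeholder credentials
password = "my-token"

auth = base64.b64encode(f"{username}:{password}".encode()).decode()
config = {"auths": {registry: {"auth": auth}}}

print(json.dumps(config, indent=2))
```

The resulting JSON has the same format that a local docker login writes to ~/.docker/config.json.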
Coming soon
RESTful API and Python SDK for managing your deployments
Shared storage between the Containers and Cloud GPU instances
Support for async / polling requests