Overview

With our Containers service, you can create your own inference endpoints to serve your models while paying only for the compute that is actively in use.

We support pulling container images from any registry and are flexible about how the image is built.

You can deploy your first container by following this guide: Tutorial: Deploy with vLLM
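
For reference, a vLLM serving container like the one used in the tutorial is typically started along these lines. This is a minimal sketch, assuming the public vllm/vllm-openai image; the model name, port, and cache mount are placeholders, and the exact deployment settings are covered in the tutorial.

```bash
# Minimal sketch: running the vLLM OpenAI-compatible server image.
# Model name, port, and cache mount are placeholders; adapt them to your setup.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2
```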

Features

  • Scale to hundreds of GPUs when needed with our battle-tested inference cluster

  • Scale to zero when idle, so you only pay while your container is running

  • Support for any container registry, using either registry-specific authentication methods or vanilla Docker config.json-style authentication (see the example after this list)

  • Both manual and request queue-based autoscaling, with adjustable scaling sensitivity

  • Logging and metrics in the dashboard
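
As an illustration of the config.json-style authentication mentioned above, the credentials file follows the standard Docker client format: an auths map keyed by registry host, where auth is the base64 encoding of username:password. The registry host and the encoded credentials below are placeholders (here, base64 of "username:password").

```json
{
  "auths": {
    "registry.example.com": {
      "auth": "dXNlcm5hbWU6cGFzc3dvcmQ="
    }
  }
}
```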

Coming soon

  • A RESTful API for managing your deployments, as well as a Python SDK

  • Shared storage between Containers and Cloud GPU instances

  • Support for async / polling requests
