Overview

With our Containers service, you can create your own inference endpoints to serve your models while paying only for the compute that is actively in use.

We support pulling container images from any registry and are flexible about how the image is built.

You can deploy your first container by following this guide: Tutorial: Deploy with vLLM
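
For reference, a vLLM serving container like the one used in the tutorial is typically started along these lines. This is a minimal sketch, assuming the public vllm/vllm-openai image; the model name, port, and cache mount are placeholders, and the exact deployment settings are covered in the tutorial.

```bash
# Minimal sketch: running the vLLM OpenAI-compatible server image.
# Model name, port, and cache mount are placeholders; adapt them to your setup.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2
```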

Features

  • Scale to hundreds of GPUs when needed with our battle-tested inference cluster

  • Scale to zero when idle, so you only pay while your container is running

  • Support for any container registry, using either registry-specific authentication methods or vanilla Docker config.json-style authentication (see the example after this list)

  • Both manual and request queue-based autoscaling, with adjustable scaling sensitivity

  • Logging and metrics in the dashboard
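
As an illustration of the config.json-style authentication mentioned above, the credentials file follows the standard Docker client format: an auths map keyed by registry host, where auth is the base64 encoding of username:password. The registry host and the encoded credentials below are placeholders (here, base64 of "username:password").

```json
{
  "auths": {
    "registry.example.com": {
      "auth": "dXNlcm5hbWU6cGFzc3dvcmQ="
    }
  }
}
```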

Coming soon

  • A RESTful API for managing your deployments, as well as a Python SDK

  • Shared storage between Containers and Cloud GPU instances

  • Support for async / polling requests
