Quickstart: Deploying GPT-OSS 120B (Ollama) on Serverless Containers

Overview

This tutorial provides step-by-step instructions for deploying OpenAI's GPT-OSS 120B as a scalable API endpoint using the DataCrunch Serverless Containers platform.

This process uses a pre-built Docker image running the Ollama server. The container downloads the model weights on its first run and reuses them across restarts, ensuring fast subsequent startups.
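
The exact contents of that image depend on how it was built, but a minimal entrypoint sketch that behaves this way is shown below. It assumes the image bundles the Ollama binary, pulls the gpt-oss:120b model tag, binds the server to port 8000, and stores weights under /data; these choices match the settings used later in this tutorial, but your own image may differ.

#!/bin/sh
# entrypoint.sh - illustrative sketch only; adapt it to your own image.

# Bind the Ollama server to the port exposed in the deployment (8000 here)
# and keep model weights on the persistent /data volume so they survive restarts.
export OLLAMA_HOST=0.0.0.0:8000
export OLLAMA_MODELS=/data

# Start the server in the background, give it a moment to come up, then pull
# the model. The pull completes almost immediately on restarts because the
# weights are already present in /data.
ollama serve &
sleep 5
ollama pull gpt-oss:120b

# Keep the container alive by waiting on the server process.
wait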

Prerequisites

  • A DataCrunch Cloud Platform account to deploy serverless containers.

  • A Docker image hosted in a container registry, either your own or a pre-made one. If you need to create one, follow our guide on How to Publish Your First Docker Image.
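
If you are building the image yourself, the rough workflow for getting it into Docker Hub looks like the following; the image name gptoss and the tag v1.0 are placeholders, and the commands assume a Dockerfile in the current directory.

# Build the image and tag it with a specific version (see the note on tags below).
docker build -t docker.io/<your-dockerhub-username>/gptoss:v1.0 .

# Push the tagged image to Docker Hub (you must be logged in with docker login).
docker push docker.io/<your-dockerhub-username>/gptoss:v1.0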

Deployment Steps

Follow these instructions carefully in the DataCrunch cloud dashboard to create your deployment.

1. Navigate to New Deployment

Log in to the DataCrunch dashboard and, from the sidebar, navigate to Serverless Containers -> New deployment.

2. Basic Configuration

  • Deployment Name: Give your deployment a unique name (e.g., gpt-oss-<your username>).

  • Compute Type: Select an appropriate GPU type. For this tutorial, I will be using a single NVIDIA H100.

3. Container Image Configuration

This is a critical step. You must provide the full, versioned path to the container image.

  • Container Image: the format is docker.io/<your username>/<the name of the image you pushed>:<the tag>

    • Example: docker.io/datacrunch/gptoss:v1.0

    • Action: Replace datacrunch with your actual Docker Hub username, gptoss with the actual name of your image, and v1.0 with the specific version tag of the image you want to deploy.

Important Note on Image Tags: The platform does not allow the use of the :latest tag for production deployments. This is a best practice to ensure that your deployments are predictable and reproducible. Always use a specific, immutable version tag (e.g., :v1.0, :v1.1.2).
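
If the only tag you have pushed so far is :latest, you can point an immutable tag at the same local image and push that instead; the image name and username below are placeholders.

# Add a versioned tag to the existing image, then push it.
docker tag docker.io/<your-dockerhub-username>/gptoss:latest docker.io/<your-dockerhub-username>/gptoss:v1.0
docker push docker.io/<your-dockerhub-username>/gptoss:v1.0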

4. Networking and Ports

  • Registry Credentials (Public Docker Image): If your Docker image is public, leave Registry Credentials set to None.

  • Registry Credentials (Private Docker Image): If your Docker image is private, you will need to create registry credentials after entering the container image path in the format above. Click Create credentials, give the credentials a name, and select the registry provider; this tutorial uses Docker Hub's registry. Enter your Docker Hub username, paste your Docker Hub access token (it starts with dckr_pat; if you don't know how to create one, refer to this guide), and click Create credential. You can verify the token locally first, as shown in the example after this list.

  • Exposed HTTP Port: Set this to the port the Ollama server listens on.

    • Action: Set to 8000.

  • Delete any Environment Variables, as we don't need them for this tutorial.
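
As a quick sanity check for the private-image case, you can confirm that the access token works before pasting it into the dashboard by logging in with it locally; the username is a placeholder.

# Read the token interactively so it does not end up in your shell history,
# then pass it to docker login on stdin.
read -s DOCKERHUB_TOKEN
echo "$DOCKERHUB_TOKEN" | docker login -u <your-dockerhub-username> --password-stdin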

5. Health Check Configuration

The health check is crucial for the platform to know when your container is ready to receive traffic. An incorrect health check is the most common reason for deployment failure.

  • Healthcheck Port: This is set automatically to the exposed HTTP port unless you change it.

  • Healthcheck Path: Set this to a lightweight API endpoint that indicates the server is running. For Ollama, the /api/tags endpoint is perfect for this.

    • Action: Set to /api/tags.
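
If you want to sanity-check the exposed port and the health-check path before deploying, you can run the image locally and query the same endpoint. Note that a local first run will also start downloading the full model weights, so treat this as optional; the image name is a placeholder and the example assumes the image binds Ollama to port 8000 as described above.

# Run the image locally, mapping container port 8000 to the host.
docker run --rm -p 8000:8000 docker.io/<your-dockerhub-username>/gptoss:v1.0

# In a second terminal: a JSON response listing models means the /api/tags
# health check will succeed once the server is up.
curl http://localhost:8000/api/tags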

6. Storage and Scaling

DataCrunch automatically attaches a persistent storage volume at the /data path inside the container. We will configure Ollama to use this volume to store model weights (the entrypoint sketch in the Overview points OLLAMA_MODELS at /data), so they are not re-downloaded on every container restart. You can leave the Scaling options at their default values.

7. Deploy

Review your settings and click Deploy Container. That's it! You have now created a deployment. You can check the logs of the deployment from the logs tab.


First-Time Startup

The first time the container starts, startup will be slow. You can view the logs to follow the progress of the ollama pull command as it downloads the 120B model to the /data volume. During this time the dashboard will show the container as unavailable; there is no need to worry. This is a one-time operation, and subsequent restarts will be much faster.

Connecting to the Endpoint

Before you can connect to the endpoint, you will need to generate an authentication token by going to Keys -> Inference API Keys and clicking Create.

NOTE: Make sure to immediately copy and save the inference token somewhere safe, as it will not be visible after you close the dialog box.

The base endpoint URL for your deployment is in the API section towards the top left of the screen.
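
For convenience, you can store the base URL and the inference key in shell variables and reuse them in the commands that follow; the variable names are arbitrary and the values are placeholders.

# Replace the placeholders with your deployment's base URL and your inference key.
export CONTAINERS_API_URL="https://<YOUR_CONTAINERS_API_URL>"
export INFERENCE_API_KEY="<YOUR_INFERENCE_API_KEY>"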

Once the container is marked as "Healthy" and the status of the container changes to "running" on the dashboard, you can test it using curl.

Notice the subpath /v1/chat/completions appended to the base endpoint URL.

curl -X POST https://<YOUR_CONTAINERS_API_URL>/v1/chat/completions \
-H 'Authorization: Bearer <YOUR_INFERENCE_API_KEY>' \
-H 'Content-Type: application/json' \
-d '{"model": "gpt-oss:120b",
        "messages": [
          { "role": "system", "content": "You are a helpful writer assistant." },
          { "role": "user", "content": "Briefly describe what is deep learning?" }
        ],
        "stream": false
}' | jq
  • The trailing | jq is optional; it pretty-prints the JSON response and requires jq to be installed.
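
Because the endpoint follows the OpenAI-compatible chat completions format that Ollama exposes, you can also request a streamed response by setting "stream": true. Below is a sketch using the shell variables defined earlier; curl's -N flag disables output buffering so tokens appear as they arrive.

curl -N -X POST "$CONTAINERS_API_URL/v1/chat/completions" \
-H "Authorization: Bearer $INFERENCE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{"model": "gpt-oss:120b",
        "messages": [
          { "role": "user", "content": "Briefly describe what deep learning is." }
        ],
        "stream": true
}'

The streamed response arrives as a series of data: chunks rather than a single JSON object, so the | jq pretty-printer from the previous example is not useful here.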

Congratulations, you have now deployed OpenAI's GPT-OSS on serverless inference!
