Quickstart: Deploying GPT-OSS 120B (Ollama) on Serverless Containers

Overview

This tutorial provides step-by-step instructions for deploying OpenAI's GPT-OSS 120B as a scalable API endpoint on the DataCrunch Serverless Containers platform.

This process uses a pre-built Docker image running the Ollama server. The container will download the model weights on its first run and reuse them across restarts, ensuring fast subsequent startups.

Prerequisites

  • A DataCrunch Cloud Platform account with access to Serverless Containers.

  • A Docker image hosted in a container registry, either your own or a pre-made one. If you need to create one, follow our guide on How to Publish Your First Docker Image.

Preparing a Custom Container Image (Optional)

For advanced control, you can build and publish your own image. This allows you to specify the exact Ollama version and add other tools.

Note: The FROM instruction below uses a specific Ollama version (e.g., ollama/ollama:0.12.6). We recommend checking the Ollama Docker Hub page for the latest version and updating it as needed.

Create a Dockerfile with the following content. It installs curl, which the health check and startup script rely on, and jq, which is optional inside the image since it is mainly used client-side to format API responses. Note that building this Dockerfile requires BuildKit (the default in recent Docker versions) because the startup script is written as a here-document.

FROM ollama/ollama:0.12.6

# Install curl for health checks and jq for JSON processing
RUN apt-get update && \
    apt-get install -y curl jq && \
    rm -rf /var/lib/apt/lists/*

# Create a robust startup script inside the image
RUN cat > /start-ollama.sh <<'EOF'
#!/bin/bash
set -e

echo "=== Ollama Container Starting ==="
echo "Model storage path: ${OLLAMA_MODELS:-/data/.ollama/models}"
echo "Host binding: ${OLLAMA_HOST:-0.0.0.0:8000}"

# Set default values for model storage and host
export OLLAMA_MODELS=${OLLAMA_MODELS:-/data/.ollama/models}
export OLLAMA_HOST=${OLLAMA_HOST:-0.0.0.0:8000}

echo "Creating models directory: ${OLLAMA_MODELS}"
mkdir -p "${OLLAMA_MODELS}"

# Start Ollama server in the background
OLLAMA_PORT=${OLLAMA_HOST##*:}
echo "Starting Ollama server on port ${OLLAMA_PORT}..."
ollama serve &
OLLAMA_PID=$!

# Wait for the Ollama API to become available
echo "Waiting for Ollama API to be ready..."
TIMEOUT=600
ELAPSED=0
while ! curl -s http://localhost:${OLLAMA_PORT}/api/tags >/dev/null 2>&1; do
    if [ $ELAPSED -ge $TIMEOUT ]; then
        echo "ERROR: Ollama failed to start within $TIMEOUT seconds"
        kill $OLLAMA_PID 2>/dev/null
        exit 1
    fi
    sleep 1
    ELAPSED=$((ELAPSED + 1))
done
echo "✓ Ollama API is ready!"

# If a model is specified in the environment variable, download it
if [ -n "$OLLAMA_PULL_MODEL" ]; then
    echo "Model requested: $OLLAMA_PULL_MODEL"
    if ollama list | grep -q "^${OLLAMA_PULL_MODEL}"; then
        echo "✓ Model $OLLAMA_PULL_MODEL already exists."
    else
        echo "→ Downloading model: $OLLAMA_PULL_MODEL. This may take a while..."
        if ollama pull "$OLLAMA_PULL_MODEL"; then
            echo "✓ Model download successful!"
        else
            echo "ERROR: Failed to download model."
        fi
    fi
fi

# Trap signals for graceful shutdown
trap "echo 'Shutting down...'; kill $OLLAMA_PID; exit 0" SIGTERM SIGINT

echo "=== Ollama server is running on ${OLLAMA_HOST} ==="
wait $OLLAMA_PID
EOF

# Make the startup script executable
RUN chmod +x /start-ollama.sh

# Set default environment variables. These can be overridden at runtime.
ENV OLLAMA_MODELS=/data/.ollama/models
ENV OLLAMA_HOST=0.0.0.0:8000
ENV OLLAMA_PULL_MODEL=gpt-oss:120b

# Add a healthcheck to let Docker know when the container is ready
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8000/api/tags || exit 1

LABEL maintainer="[email protected]" \
      version="1.0" \
      description="Ollama with automatic model download support"

# Expose the default port
EXPOSE 8000

# Set the entrypoint to our startup script
ENTRYPOINT ["/start-ollama.sh"]

You can follow the complete tutorial on publishing to a Docker registry here.
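For reference, a typical build-and-push flow looks like the sketch below. The username yourusername, image name gptoss, and tag v1.0 are placeholders; substitute your own values.

# Build the image from the Dockerfile above and give it an immutable version tag
docker build -t docker.io/yourusername/gptoss:v1.0 .

# Log in to Docker Hub and push the tagged image
docker login
docker push docker.io/yourusername/gptoss:v1.0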

Deployment Steps

Follow these instructions carefully in the DataCrunch cloud dashboard to create your deployment.

1. Navigate to New Deployment

Log in to the DataCrunch dashboard and, from the sidebar, navigate to Serverless Containers -> New deployment.

2. Basic Configuration

  • Deployment Name: Choose a unique name for your deployment to avoid conflicts (e.g., gpt-oss-your-initials).

  • Compute Type: Select a GPU with sufficient VRAM. The GPT-OSS 120B model is very large; we recommend selecting a compute type with at least 80 GB of VRAM (e.g., NVIDIA H100 or A100 80GB) to prevent deployment failures.

3. Container Image Configuration

This is a critical step. You must provide the full, versioned path to the container image.

  • Container Image: the format is docker.io/<your username>/<the name of the image you pushed>:<the tag>

    • Example: docker.io/datacrunch/gptoss:v1.0

    • Action: Replace datacrunch with your Docker Hub username, gptoss with the name of your image, and v1.0 with the specific version tag of the image you want to deploy.

Warning: The platform does not allow the use of the :latest tag. This is a best practice to ensure your deployments are predictable. Always use a specific, immutable version tag (e.g., :v1.0).
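If you already have an image tagged :latest locally, you can retag it with an immutable version before pushing; a short sketch with placeholder names:

# Retag the existing image with a specific version and push that tag
docker tag docker.io/yourusername/gptoss:latest docker.io/yourusername/gptoss:v1.0
docker push docker.io/yourusername/gptoss:v1.0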

4. Networking and Ports

  • Registry Credentials (Public Docker Image): If your Docker image is public, leave Registry Credentials set to none; no credentials are needed to pull it.

  • Registry Credentials (Private Docker Image): If your Docker image is private, you will need to create registry credentials after entering the container image path. Click Create credentials, name the credentials, and select the registry provider; this tutorial uses Docker Hub's registry. Enter your Docker Hub username, paste your Docker Hub access token (it starts with dckr_pat; if you don't know how to create one, refer to this guide), and click Create credential.

  • Exposed HTTP Port: Set this to the port the Ollama server listens on.

    • Action: Set to 8000.

  • Delete any Environment Variables; the defaults baked into the image cover this tutorial. (If you need to override the model, see the note below.)
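If your image's baked-in default for OLLAMA_PULL_MODEL points at a different model, you can override it with a single environment variable here instead of rebuilding the image; a minimal sketch, assuming the startup script from the Dockerfile above:

OLLAMA_PULL_MODEL=gpt-oss:120b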

5. Health Check Configuration

The health check is crucial for the platform to know when your container is ready to receive traffic. An incorrect health check is the most common reason for deployment failure.

  • Healthcheck Port: This is automatically set to the exposed HTTP port unless you change it.

  • Healthcheck Path: Set this to a lightweight API endpoint that indicates the server is running. For Ollama, the /api/tags endpoint is well suited for this (see the quick check after this list).

    • Action: Set to /api/tags.
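If you want to confirm that the path responds before relying on the platform's health check, you can hit it on a running copy of the container; a minimal sketch, assuming the server is reachable on localhost:8000:

# Returns a small JSON document listing installed models; exits non-zero if the server is not ready
curl -f http://localhost:8000/api/tags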

6. Storage and Scaling

DataCrunch automatically attaches a persistent storage volume at the /data path inside the container. The Dockerfile above already points OLLAMA_MODELS at this volume, so model weights are not re-downloaded on every container restart. You can leave the Scaling options at their default values.
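If you want to see this persistence behavior before deploying, you can mimic the platform's /data volume locally with a named Docker volume; a minimal sketch, where ollama-data, the image name, and the smaller test model are placeholder assumptions:

# Create a local volume that stands in for the platform's persistent /data mount
docker volume create ollama-data

# Run the image with /data backed by the volume; override the model to a small one for a quick test
docker run --rm -p 8000:8000 \
  -v ollama-data:/data \
  -e OLLAMA_PULL_MODEL=llama3:8b \
  docker.io/yourusername/gptoss:v1.0

# On a second run the model is found under /data/.ollama/models and is not re-downloaded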

7. Deploy

Review your settings and click Deploy Container. That's it! You have now created a deployment. You can check the logs of the deployment from the logs tab.


First-Time Startup

The first startup takes a while. You can view the logs to follow the progress of the ollama pull command as it downloads the 120B model to the /data volume. During this time the dashboard will show the container as unavailable; this is expected and no cause for concern. The download is a one-time operation, and subsequent restarts will be much faster.

Connecting to the Endpoint

Before you can connect to the endpoint, you need to generate an authentication token by going to Keys -> Inference API Keys and clicking Create.

NOTE: Make sure to copy and save the inference token somewhere safe immediately, as it will not be visible after closing the dialog box.

The base endpoint URL for your deployment is in the API section towards the top left of the screen.

Once the container is marked as "Healthy" and its status changes to "Running" on the dashboard, you can test it using curl.
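Before sending a chat request, you can optionally verify that the model has been downloaded by listing the models on your endpoint. This assumes the platform forwards all paths on the exposed port, so Ollama's native /api/tags route is reachable through the base URL:

curl -s https://<YOUR_CONTAINERS_API_URL>/api/tags \
  -H 'Authorization: Bearer <YOUR_INFERENCE_API_KEY>' | jq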

Notice the subpath /v1/chat/completions appended to the base endpoint URL.

curl -X POST https://<YOUR_CONTAINERS_API_URL>/v1/chat/completions \
  -H 'Authorization: Bearer <YOUR_INFERENCE_API_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "gpt-oss:120b",
        "messages": [
          { "role": "system", "content": "You are a helpful writer assistant." },
          { "role": "user", "content": "Briefly describe what is deep learning?" }
        ],
        "stream": false
      }' | jq
  • The trailing | jq is optional; it pretty-prints the JSON response and requires jq to be installed on your machine. The variant below prints only the reply text.
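To print only the assistant's reply instead of the full JSON document, you can let jq extract the message content; this assumes the OpenAI-compatible response shape returned by the endpoint above:

curl -s -X POST https://<YOUR_CONTAINERS_API_URL>/v1/chat/completions \
  -H 'Authorization: Bearer <YOUR_INFERENCE_API_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{"model": "gpt-oss:120b", "messages": [{"role": "user", "content": "Briefly describe what is deep learning?"}], "stream": false}' \
  | jq -r '.choices[0].message.content'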

Congratulations, you have now deployed OpenAI's GPT-OSS 120B as a serverless inference endpoint!
