Scaling and health-checks
The DataCrunch Containers service comes with autoscaling support. Scaling rules are applied whenever the maximum number of replicas (i.e. worker nodes) per deployment is set higher than the minimum.
We use an internal queue to handle incoming requests.
The default scaling metric is Queue load only.
You can adjust the scaling sensitivity based on the queue load, i.e. the queue length (number of messages in the queue) per replica.
Small values indicate sensitive scaling, while larger values allow queues to fill up before new replicas are created.
Example use cases:
You want to run low-priority batch jobs overnight. Setting the maximum queue load value will keep costs down while using a small number of replicas.
Your service runs image generation for premium paying users. Setting the minimum queue load value makes sure no requests are left idly waiting for a replica.
Please consider your average inference duration when calculating the queue load. If you run a quick image generation algorithm (say 3 seconds per request), a queue load of 0.5 means that the average request will wait 1.5 seconds before being picked up for processing.
If you generate video (say 1 minute per request), a queue load of 0.5 means that the average request will wait 30 seconds before being processed.
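As a rough rule of thumb (an approximation, not an exact guarantee from the scheduler), the expected wait is the queue load multiplied by the average inference duration:

```python
# Rough estimate only; the function and parameter names here are illustrative.
def expected_wait_seconds(queue_load: float, avg_inference_seconds: float) -> float:
    """Requests waiting per replica each take roughly one inference duration to clear."""
    return queue_load * avg_inference_seconds

print(expected_wait_seconds(0.5, 3))   # ~1.5 s for a quick image generation model
print(expected_wait_seconds(0.5, 60))  # ~30 s for a 1-minute video generation model
```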
Additional Scaling Metrics currently available are CPU utilization and GPU utilization (calculated as averages per deployment). In practice, these are not as reliable as queue-based scaling, but depending on the nature of your workload they may prove useful, for example when you have a known CPU-usage pattern for CPU-heavy jobs.
Scaling up occurs when any of the enabled scaling metrics exceeds its threshold; conversely, scaling down occurs only when all metrics are below their thresholds.
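A sketch of that decision logic, purely for illustration (the function and metric names below are assumptions, not the platform's actual implementation):

```python
def scaling_decision(metrics: dict[str, float], thresholds: dict[str, float]) -> str:
    """metrics and thresholds map enabled metric names (e.g. "queue_load",
    "cpu_utilization") to their current values and configured limits."""
    if any(metrics[name] > thresholds[name] for name in thresholds):
        return "scale up"    # a single metric over its threshold is enough
    if all(metrics[name] < thresholds[name] for name in thresholds):
        return "scale down"  # every metric must be below its threshold
    return "hold"
```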
Additional attributes that let you control the scaling behavior are:
Scale-up delay - Time to delay spawning new replicas after the scale-up threshold has been exceeded.
Scale-down delay - Time to delay reducing the number of replicas after all of the scaling metrics have gone below the threshold.
Request message time to live (TTL) - Time before a request is deleted; this covers both the time spent in the queue and the actual inference.
To avoid terminating replicas that are actively doing work, a SIGTERM handler can be used. When a replica has been selected for downscaling, it is sent a SIGTERM and given a grace period (30 seconds) to exit; after this it will be forcefully terminated (with a SIGKILL), losing any work in progress.
In the sample snippet at the end of this section, a lifespan context manager is used to register signal handlers before FastAPI starts serving requests, after which control is yielded to the regular request handlers.
Health checks are an integral part of the system: they tell us when a replica is ready to receive requests. Without them, a newly started container can receive requests before it is ready and return a 500 Internal Server Error to incoming requests.
We do not throttle on errors but pass them through to the caller, so there is a chance that several (or many) requests are picked from the queue and fail processing.
Health checks can also be used to control when the replica gets traffic. Our system records a replica's health status and only sends work to replicas reporting a ready status.
Health check responses are interpreted as follows:

Healthy:
Any non-JSON body with an HTTP 200 OK response
A JSON body with an HTTP 200 OK response, with status having one of the values ok, ready, healthy, running or up, for example: { "status": "ok" }

Unhealthy:
Any other HTTP status code. It's good practice to use a 5xx code here.
A JSON body with an HTTP 200 OK response, with status having the value unhealthy

Busy (currently optional):
A JSON body with an HTTP 200 OK response, with status having the value busy
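A minimal sketch of a health endpoint following this contract, assuming FastAPI; the /health path and the module-level flags are illustrative choices, not requirements:

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# Illustrative module-level state; a real service would update these from its
# own startup, inference and shutdown logic.
model_loaded = False
busy = False


@app.get("/health")
def health():
    if not model_loaded:
        # Unhealthy: any non-200 code; a 5xx code is good practice.
        return JSONResponse(status_code=503, content={"status": "unhealthy"})
    if busy:
        # Busy (currently optional): HTTP 200 OK with status "busy".
        return JSONResponse(content={"status": "busy"})
    # Healthy: HTTP 200 OK with status ok / ready / healthy / running / up.
    return JSONResponse(content={"status": "ok"})
```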
If the 30-second timeout after receiving SIGTERM is not enough for your needs, please contact .
The below snippet contains a sample signal handling implementation for FastAPI. Different methods can be used. The sample stops receiving requests and waits 30 seconds before exiting, but a cost-conscious user can instead implement a mutex that is toggled at the beginning and end of the predict function, allowing the application to exit the instant the mutex is toggled from busy to free.
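A minimal sketch of such an implementation, assuming FastAPI served by uvicorn; the shutting_down flag (which the health endpoint above would check in order to report unhealthy or busy) and the SIGINT re-raise used to trigger the server's own graceful shutdown are illustrative choices, not requirements:

```python
import os
import signal
import threading
from contextlib import asynccontextmanager

from fastapi import FastAPI

shutting_down = False  # checked by the health endpoint so no new work is routed here


def handle_sigterm(signum, frame):
    """On SIGTERM, stop advertising readiness, then exit after 30 seconds."""
    global shutting_down
    shutting_down = True
    # Wait out the grace period so in-flight requests can finish, then raise
    # SIGINT so the server performs its normal graceful shutdown.
    threading.Timer(30, lambda: os.kill(os.getpid(), signal.SIGINT)).start()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Register the handler before FastAPI starts serving requests...
    signal.signal(signal.SIGTERM, handle_sigterm)
    yield  # ...then hand control over to the regular request handlers.


app = FastAPI(lifespan=lifespan)
```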