Scaling and health-checks
The DataCrunch Containers service comes with autoscaling support. Scaling rules are applied whenever the maximum number of replicas (i.e. worker nodes) per deployment is set higher than the minimum.
We use an internal queue to handle incoming requests.
The default scaling metric is Queue load only.
You can adjust the scaling sensitivity based on the queue load, i.e. the queue length (number of messages in the queue) per replica.
Small values indicate sensitive scaling, while larger values allow queues to fill up before new replicas are created.
Example use cases:
You want to run low-priority batch jobs overnight. Setting the maximum queue load value will keep costs down while using a small number of replicas.
Your service runs image generation for premium paying users. Setting the minimum queue load value makes sure no requests are left idly waiting for a replica.
Please consider your average inference duration when calculating the queue load. If you run a quick image generation algorithm (say 3 seconds per request), a queue load of 0.5 means that the average request will wait 1.5 seconds before being picked up for processing.
If you generate video (say 1 minute per request), a queue load of 0.5 means that the average request will wait 30 seconds before being processed.
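As a rough rule of thumb (an approximation, not an exact guarantee from the scheduler), the expected wait is the queue load multiplied by the average inference duration:

```python
# Rough estimate only; the function and parameter names here are illustrative.
def expected_wait_seconds(queue_load: float, avg_inference_seconds: float) -> float:
    """Requests waiting per replica each take roughly one inference duration to clear."""
    return queue_load * avg_inference_seconds

print(expected_wait_seconds(0.5, 3))   # ~1.5 s for a quick image generation model
print(expected_wait_seconds(0.5, 60))  # ~30 s for a 1-minute video generation model
```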
Additional Scaling Metrics currently available are CPU utilization and GPU utilization (calculated as averages per deployment). In practice, these are not as reliable as queue-based scaling, but depending on the nature of your workload they may prove useful, for example when you have a known CPU-usage pattern for CPU-heavy jobs.
Scaling up occurs when any of the enabled scaling metrics exceeds its threshold; conversely, scaling down occurs only when all metrics are below their thresholds.
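A sketch of that decision logic, purely for illustration (the function and metric names below are assumptions, not the platform's actual implementation):

```python
def scaling_decision(metrics: dict[str, float], thresholds: dict[str, float]) -> str:
    """metrics and thresholds map enabled metric names (e.g. "queue_load",
    "cpu_utilization") to their current values and configured limits."""
    if any(metrics[name] > thresholds[name] for name in thresholds):
        return "scale up"    # a single metric over its threshold is enough
    if all(metrics[name] < thresholds[name] for name in thresholds):
        return "scale down"  # every metric must be below its threshold
    return "hold"
```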
Additional attributes that let you control the scaling behavior are:
Scale-up delay - Time to delay spawning new replicas after the scale-up threshold has been exceeded.
Scale-down delay - Time to delay reducing the number of replicas after all of the scaling metrics have gone below the threshold.
Request message time to live (TTL) - Time before a request is deleted; this covers both the time spent in the queue and the actual inference.
To avoid terminating replicas that are actively doing work, a SIGTERM handler can be used. When a replica has been selected for downscaling, it is sent a SIGTERM and given a grace period (30 seconds) to exit; after this it will be forcefully terminated (with a SIGKILL), losing any work in progress.
In the sample snippet at the end of this section, a lifespan context manager is used to register signal handlers before FastAPI starts serving requests, after which control is yielded to the regular request handlers.
Health checks are an integral part of the system: they tell us when a replica is ready to receive requests. Without them, a newly started container can receive requests before it is ready and return a 500 Internal Server Error to incoming requests.
We do not throttle on errors but pass them through to the caller, so there is a chance that several (or many) requests are picked from the queue and fail processing.
Health checks can also be used to control when the replica gets traffic. Our system records a replica's health status and only sends work to replicas reporting a ready status.
Health check responses are interpreted as follows:

Healthy:
Any non-JSON body with an HTTP 200 OK response
A JSON body with an HTTP 200 OK response, with status having one of the values ok, ready, healthy, running or up, for example: { "status": "ok" }

Unhealthy:
Any other HTTP status code. It's good practice to use a 5xx code here.
A JSON body with an HTTP 200 OK response, with status having the value unhealthy

Busy (currently optional):
A JSON body with an HTTP 200 OK response, with status having the value busy
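A minimal sketch of a health endpoint following this contract, assuming FastAPI; the /health path and the module-level flags are illustrative choices, not requirements:

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# Illustrative module-level state; a real service would update these from its
# own startup, inference and shutdown logic.
model_loaded = False
busy = False


@app.get("/health")
def health():
    if not model_loaded:
        # Unhealthy: any non-200 code; a 5xx code is good practice.
        return JSONResponse(status_code=503, content={"status": "unhealthy"})
    if busy:
        # Busy (currently optional): HTTP 200 OK with status "busy".
        return JSONResponse(content={"status": "busy"})
    # Healthy: HTTP 200 OK with status ok / ready / healthy / running / up.
    return JSONResponse(content={"status": "ok"})
```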
If the 30-second timeout after receiving SIGTERM is not enough for your needs, please contact .
The below snippet contains a sample signal handling implementation for FastAPI. Different methods can be used. The sample stops receiving requests and waits 30 seconds before exiting, but a cost-conscious user can instead implement a mutex that is toggled at the beginning and end of the predict function, allowing the application to exit the instant the mutex is toggled from busy to free.
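A minimal sketch of such an implementation, assuming FastAPI served by uvicorn; the shutting_down flag (which the health endpoint above would check in order to report unhealthy or busy) and the SIGINT re-raise used to trigger the server's own graceful shutdown are illustrative choices, not requirements:

```python
import os
import signal
import threading
from contextlib import asynccontextmanager

from fastapi import FastAPI

shutting_down = False  # checked by the health endpoint so no new work is routed here


def handle_sigterm(signum, frame):
    """On SIGTERM, stop advertising readiness, then exit after 30 seconds."""
    global shutting_down
    shutting_down = True
    # Wait out the grace period so in-flight requests can finish, then raise
    # SIGINT so the server performs its normal graceful shutdown.
    threading.Timer(30, lambda: os.kill(os.getpid(), signal.SIGINT)).start()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Register the handler before FastAPI starts serving requests...
    signal.signal(signal.SIGTERM, handle_sigterm)
    yield  # ...then hand control over to the regular request handlers.


app = FastAPI(lifespan=lifespan)
```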