Scaling
The DataCrunch Containers service comes with autoscaling support. Scaling rules are applied whenever the maximum number of replicas (i.e. worker nodes) per deployment is set higher than the minimum.
The default scaling is Queue load only.
You can adjust the scaling sensitivity based on the queue load, i.e. the queue length (number of messages in the queue) per replica.
Small values indicate sensitive scaling, while larger values allow queues to fill up before new replicas are created.
Example use cases:
You want to run low-priority batch jobs overnight. Setting the maximum queue load value will keep costs down while using a small number of replicas.
Your service runs image generation for premium paying users. Setting the minimum queue load value will make sure no requests are idly waiting for a replica.
For the queue load scaling, only messages in queue are counted. If a replica has picked up the message, it is not counted towards the queue length.
Example: A queue load of 2 with 10 replicas means up to 20 messages in the queue plus 10 messages in progress before any scaling happens.
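The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration, not part of the DataCrunch API: the function name is hypothetical, and it assumes each replica processes one message at a time, as the example implies.

```python
def messages_before_scaling(queue_load, replicas):
    """Estimate how many messages a deployment holds before scale-up triggers.

    Hypothetical helper for illustration only.
    Assumes each replica works on exactly one message at a time.
    """
    queued = queue_load * replicas   # only waiting messages count towards queue load
    in_progress = replicas           # messages already picked up are not counted
    return {"queued": queued, "in_progress": in_progress, "total": queued + in_progress}


# Queue load 2 with 10 replicas
print(messages_before_scaling(queue_load=2, replicas=10))
# {'queued': 20, 'in_progress': 10, 'total': 30}
```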
Keep in mind your average inference duration when calculating the queue load. If you run a quick image generation algorithm (say 3 seconds per request), a queue load of 0.5 means that the average request will wait 1.5 seconds before being picked up for processing.
If you generate video (say 1 minute per request), a queue load of 0.5 means that the average request will wait 30 seconds before being processed.
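The two estimates above follow the same rule of thumb: the average wait is roughly the queue load multiplied by the average inference duration. A minimal sketch of that calculation, using a hypothetical helper name and ignoring real-world effects such as uneven request arrival:

```python
def estimated_wait_seconds(queue_load, inference_seconds):
    """Rough average wait before a request is picked up.

    Hypothetical helper; a rule-of-thumb estimate, not a guarantee.
    """
    return queue_load * inference_seconds


print(estimated_wait_seconds(0.5, 3))    # quick image generation: 1.5
print(estimated_wait_seconds(0.5, 60))   # video generation: 30.0
```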
Additional scaling metrics currently available are CPU utilization and GPU utilization (calculated as averages per deployment). In practice, these are not as reliable as queue-based scaling. Depending on the nature of your workload, they may still prove useful, for example when you have a known, specific CPU-usage pattern for CPU-heavy jobs.
Scaling up occurs when any one of the enabled scaling metrics exceeds its threshold. Conversely, scaling down occurs only when all enabled metrics are below their thresholds.
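The any/all rule described above can be expressed as a small decision function. This is a sketch under stated assumptions, not the service's actual implementation: the function and metric names are hypothetical, a metric exactly at its threshold is treated as "hold", and the minimum/maximum replica limits and scaling delays are left out.

```python
def scaling_decision(metrics, thresholds):
    """Return 'scale_up', 'scale_down', or 'hold' for the enabled metrics.

    Hypothetical sketch: a metric is 'enabled' if it has a threshold
    and a current value. Values equal to the threshold cause no change
    (an assumption; the exact boundary behavior is not documented here).
    """
    enabled = [name for name in thresholds if name in metrics]
    if not enabled:
        return "hold"
    # Scale up as soon as ANY enabled metric exceeds its threshold.
    if any(metrics[name] > thresholds[name] for name in enabled):
        return "scale_up"
    # Scale down only when ALL enabled metrics are below their thresholds.
    if all(metrics[name] < thresholds[name] for name in enabled):
        return "scale_down"
    return "hold"


thresholds = {"queue_load": 2, "cpu_utilization": 80}
print(scaling_decision({"queue_load": 3, "cpu_utilization": 40}, thresholds))  # scale_up
print(scaling_decision({"queue_load": 1, "cpu_utilization": 40}, thresholds))  # scale_down
```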
Additional attributes that let you control the scaling behavior are:
Scale-up delay - Time to delay spawning new replicas after the scale-up threshold has been exceeded.
Scale-down delay - Time to delay reducing the number of replicas after all of the scaling metrics have gone below the threshold.
Request message time to live (TTL) - Time before a request is deleted. This covers both the time spent in the queue and the actual inference.
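Because the TTL covers queue time and inference time together, it is worth checking that your expected wait plus inference duration fits within it. The sketch below uses a hypothetical helper name and assumes a request is deleted once its combined time strictly exceeds the TTL; the exact boundary behavior is an assumption.

```python
def request_expires(queue_wait_seconds, inference_seconds, ttl_seconds):
    """Check whether a request would outlive its TTL.

    Hypothetical helper: the TTL budget is shared between waiting
    in the queue and the inference itself.
    """
    return queue_wait_seconds + inference_seconds > ttl_seconds


# Video generation: 30 s average wait plus 60 s inference
print(request_expires(30, 60, ttl_seconds=60))    # True: TTL too short
print(request_expires(30, 60, ttl_seconds=120))   # False: fits within TTL
```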