Batch Jobs
What are batch jobs?
Batch job deployments are an autoscaling containers feature targeted at long-running jobs, ensuring a unique replica for each job and better resource management.
Why use batch jobs instead of continuous deployments?
With long inference durations (over 3 minutes), it becomes difficult to downscale a deployment properly:
We either need to set a high enough Scale-down delay value to make sure the inference call has finished, which can leave a long idle time for the replica, wasting resources and funds.
Setting the value too low can result in scaling down a replica while the inference is still running.
With batch job deployments, each job gets its own replica, which is destroyed as soon as the job is finished.
An app handling batch jobs must exit its process to signal that the job is done and that the replica can be destroyed.
Usage and Examples
Batch job deployments are similar to continuous deployments, with a few important differences:
The containerized app running in a replica must exit its process (with exit code 0 for success, a non-zero code for failure) to signal that the work has ended, resulting in the replica being scaled down (see the sketch after this list).
Unlike continuous deployments, calls to a batch job deployment are always async.
A job has a deadline duration, after which the replica will be destroyed regardless of the job status.
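To illustrate, here is a minimal sketch of such an app in Python with FastAPI. It mirrors the behavior of the example image used below, but the internals are an assumption for illustration, not the image's actual source:
import os
import time

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def shutdown(exit_code: int) -> None:
    # Terminate the whole process so the platform can destroy the replica.
    os._exit(exit_code)

@app.get("/health")
def health():
    # Default health check endpoint polled by the platform.
    return {"status": "ok"}

@app.post("/job")
def run_job(duration: int, background_tasks: BackgroundTasks):
    time.sleep(duration)  # simulate a long-running job
    # BackgroundTasks runs after the response has been sent, so the
    # process exits only once the client has received the reply.
    background_tasks.add_task(shutdown, 0)
    return {"success": True, "message": "Job completed successfully"}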
Let's create an example batch job deployment using an example Python app that simulates a long-running job with success and failure scenarios.
For the container image, we will use the example app's public image ghcr.io/verda-cloud/batch-jobs-example:1.0.1
Use exposed port 8000 and the default health check endpoint.
We can deploy it similarly to a continuous deployment, except for the following parameters:
Max concurrent jobs: the maximum number of replicas; the deployment scales to 0 when there are no jobs in the queue.
Deadline: the maximum duration a replica will stay up.
Let's call the endpoint to trigger a job with a duration of 10 seconds:
The call is async by default; the response contains the job id, a status path to check the job status, and a result path to get the job response if your app returns one.
curl -X POST "https://tasks.datacrunch.io/<DEPLOYMENT_NAME>/job?duration=10"
# Response:
{
"Id": "632c1e18-85e6-4567-ac15-f04749a51b9e",
"StatusPath": "/status/<DEPLOYMENT_NAME>",
"ResultPath": "/result/<DEPLOYMENT_NAME>"
}
Check the job status:
curl -X GET \
--location 'https://containers.datacrunch.io/status/<DEPLOYMENT_NAME>' \
--header 'X-Inference-Id: 632c1e18-85e6-4567-ac15-f04749a51b9e' \
--header 'Authorization: Bearer <INFERENCE_TOKEN>'
# Response:
{
"Id": "f8fff9af-6584-4b7a-a377-f3142037cc3d",
"Status": "Queue"
}
Fetch the result after the job is finished:
curl -X GET \
--location 'https://containers.datacrunch.io/result/<DEPLOYMENT_NAME>' \
--header 'X-Inference-Id: 632c1e18-85e6-4567-ac15-f04749a51b9e' \
--header 'Authorization: Bearer <INFERENCE_TOKEN>'
# Response:
{
"success": true,
"message": "Job completed successfully",
"executionTime": 5,
"timestamp": "2025-11-06 11:03:03"
}
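Putting the three calls together, a small Python client might look like the following sketch (assuming the requests library; only the "Queue" status appears above, so the other status names checked here are assumptions):
import time

import requests

DEPLOYMENT = "<DEPLOYMENT_NAME>"
TOKEN = "<INFERENCE_TOKEN>"

# Trigger a 10-second job; the call returns immediately with the job id.
job = requests.post(
    f"https://tasks.datacrunch.io/{DEPLOYMENT}/job", params={"duration": 10}
).json()

headers = {
    "X-Inference-Id": job["Id"],
    "Authorization": f"Bearer {TOKEN}",
}

# Poll the status path until the job is no longer queued or running.
while True:
    status = requests.get(
        f"https://containers.datacrunch.io/status/{DEPLOYMENT}", headers=headers
    ).json()["Status"]
    if status not in ("Queue", "Running"):  # terminal status names are assumptions
        break
    time.sleep(5)

# Fetch the job's result.
result = requests.get(
    f"https://containers.datacrunch.io/result/{DEPLOYMENT}", headers=headers
).json()
print(result)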
Best Practices
Use this feature for long-running jobs (roughly over 3 minutes)
Remember to exit the process when the job is done, whether it succeeded or failed
Exit the process only after returning a response
Use logging liberally in your app to make debugging easier
Troubleshooting
The replica keeps running after the job was done
Make sure to exit the process, and use the correct exit status code
Unhandled exceptions may cause the app to return an HTTP error status but keep running
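For example, catch exceptions around the job logic and still schedule a non-zero exit, so a failed job does not leave the replica running (a sketch; do_work is a hypothetical stand-in for your job logic):
import os

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def do_work():
    # Hypothetical job logic; replace with your own.
    raise RuntimeError("something went wrong")

@app.post("/job")
def run_job(background_tasks: BackgroundTasks):
    try:
        result = do_work()
    except Exception as exc:
        # Exit with a non-zero code after the error response is sent;
        # returning an HTTP error alone keeps the replica alive.
        background_tasks.add_task(os._exit, 1)
        return {"success": False, "message": str(exc)}
    background_tasks.add_task(os._exit, 0)
    return {"success": True, "result": result}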
The replica was killed before the job was done
Make sure the Deadline duration value is higher than the estimated job duration
No response is returned
Make sure the process is not killed before returning the response
e.g. for Python's FastAPI use BackgroundTasks, or JavaScript's setImmediate, to exit the process after sending the response
The replica isn't accepting jobs
Make sure a GET /health endpoint is implemented