Batching and Streaming
Batching
In the scaling section, you can control the number of concurrent requests. While most diffusion-based models support only one request at a time, large language models (LLMs) can efficiently handle multiple concurrent requests.
By default, the number of concurrent requests is set to 1, with a maximum of 100 concurrent requests per replica.
Modern LLM inference engines are optimized to batch requests with minimal impact on per-request latency. Taking advantage of batching can significantly improve throughput.
Our benchmarks using text-generation-inference demonstrate that token throughput can increase up to 20x with batching, while maintaining reasonable latency.
Below is a benchmark table showing median processing times for different batch sizes. While this data is from an older LLM, the pattern remains consistent across different models.
Note: Output was limited to 100 tokens per request.
| Batch size | Median processing time (ms) | Total output tokens |
|---|---|---|
| 1 | 1776 | 100 |
| 2 | 1981 | 200 |
| 4 | 2067 | 400 |
| 8 | 3767 | 800 |
| 16 | 4431 | 1600 |
| 32 | 4942 | 3200 |
The results show that a batch size of 4 concurrent requests adds only minimal latency (roughly 16% over a single request), making it well suited for real-time applications.
For batch processing jobs, maximum cost efficiency (tokens generated per dollar or euro spent) is achieved with larger batch sizes: 32 or more concurrent requests, up to 100 if supported. Note that your model's token limits and the capabilities of the inference engine may restrict the maximum number of concurrent requests.
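As a rough illustration, the sketch below sends a configurable number of concurrent requests so the inference engine can batch them server-side. It assumes a text-generation-inference-style `/generate` endpoint; the endpoint URL, API token, and prompts are placeholders you would replace with your own deployment details.

```python
# Minimal sketch: issue many requests concurrently so the server can batch them.
# ENDPOINT_URL and API_TOKEN are placeholders for your own deployment.
import concurrent.futures
import requests

ENDPOINT_URL = "https://your-endpoint.example.com/generate"  # placeholder
API_TOKEN = "your-api-token"                                 # placeholder
CONCURRENT_REQUESTS = 32  # match this to the concurrency configured per replica

def generate(prompt: str) -> str:
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"inputs": prompt, "parameters": {"max_new_tokens": 100}},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["generated_text"]

prompts = [f"Summarize document {i}" for i in range(CONCURRENT_REQUESTS)]

# Fire all requests at once; the inference engine batches them on the server.
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENT_REQUESTS) as pool:
    results = list(pool.map(generate, prompts))

print(f"Generated {len(results)} completions")
```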
Streaming
For LLMs, we support streaming for real-time applications. Streaming is not recommended for batch jobs.
Streaming works when the inference server supports Server-Sent Events (SSE). We have tested and support text-generation-inference and vLLM.
To enable streaming, set the standard SSE headers in the request. Receiving these headers instructs TGI and vLLM to stream the response.
Depending on the code you use to call the API, you may need to set additional options. Below is an example of how to call the API using Python and the `requests` library, which requires the `stream=True` option:
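The following is a minimal sketch of consuming the stream with `requests`. The endpoint URL, API token, and request payload are placeholders, and the exact event payload depends on the inference engine; both TGI and vLLM emit SSE lines prefixed with `data:`, and the token-parsing shown here follows TGI's streaming output.

```python
# Minimal sketch: stream tokens from an SSE-capable endpoint with requests.
# ENDPOINT_URL and API_TOKEN are placeholders for your own deployment.
import json
import requests

ENDPOINT_URL = "https://your-endpoint.example.com/generate_stream"  # placeholder
API_TOKEN = "your-api-token"                                        # placeholder

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "text/event-stream",  # ask the server for an SSE response
    },
    json={"inputs": "Tell me a short story", "parameters": {"max_new_tokens": 100}},
    stream=True,  # required so requests yields the response incrementally
    timeout=120,
)
response.raise_for_status()

# SSE events arrive as lines prefixed with "data:"; the payload format
# depends on the inference engine (this parsing follows TGI's output).
for line in response.iter_lines():
    if not line:
        continue
    decoded = line.decode("utf-8")
    if decoded.startswith("data:"):
        payload = decoded[len("data:"):].strip()
        if payload == "[DONE]":  # vLLM's OpenAI-compatible API ends the stream this way
            break
        event = json.loads(payload)
        # TGI emits {"token": {"text": ...}, ...} for each generated token
        token_text = event.get("token", {}).get("text", "")
        print(token_text, end="", flush=True)
```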