Great work! I'm serving a model with vLLM for a real-time TTS use case and noticed that concurrent requests experience much higher latency than sequential requests, even though the model and hardware are the same. This hurts real-time performance.
Environment / Setup:
Hardware: H100
Benchmark:
Average audio duration: ~12.3s
Observed Results:
--- Sequential Benchmark Results ---
TTFB (Mean): 0.130s
Total Time (Mean): 5.807s
RTF (Mean): 0.469
--- Concurrent Benchmark Results ---
TTFB (Mean): 0.667s
Total Time (Mean): 17.228s
RTF (Mean): 1.403
TTFB Difference: +0.537s (+413%)
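For clarity on the metrics above: RTF (real-time factor) here is total generation time divided by the duration of the generated audio (~12.3s on average in these runs), so RTF > 1 means generation is slower than real time. A quick sanity check of the reported numbers (a sketch; the helper name is mine):

```python
# Metric definitions used in the benchmark above.
# TTFB: time from sending the request to receiving the first audio chunk.
# RTF:  total generation time / duration of generated audio;
#       RTF < 1 means audio is produced faster than it plays back.

AVG_AUDIO_DURATION_S = 12.3  # average output duration across benchmark runs

def rtf(total_time_s: float, audio_duration_s: float = AVG_AUDIO_DURATION_S) -> float:
    """Real-time factor: seconds of generation per second of audio."""
    return total_time_s / audio_duration_s

print(f"sequential RTF: {rtf(5.807):.3f}")   # ≈ 0.472 (reported: 0.469)
print(f"concurrent RTF: {rtf(17.228):.3f}")  # ≈ 1.401 (reported: 1.403)
```

The reported means line up with the ~12.3s average, which confirms the concurrent runs are well past real time (RTF ≈ 1.4).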
Clarification:
- My understanding is that concurrent requests should batch efficiently, so TTFB and total time should not increase this dramatically under load. Is that correct?
- Are there recommended settings for improving concurrent throughput (batch size, async workers, streaming configuration)?
- Are there known limitations in vLLM when handling multiple simultaneous TTS requests?
- Is there any advice for optimizing TTFB and RTF under high concurrency?
Thanks in advance!
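For reference, this is the shape of the measurement loop behind the concurrent numbers above (a minimal, self-contained sketch: `fake_tts_stream` is a stand-in I made up so the snippet runs anywhere; the real harness streams chunks from the vLLM endpoint instead):

```python
import asyncio
import time

async def fake_tts_stream(n_chunks: int = 5, chunk_delay: float = 0.01):
    """Stand-in for a streaming TTS response; the real client yields audio
    chunks from the serving endpoint."""
    for _ in range(n_chunks):
        await asyncio.sleep(chunk_delay)
        yield b"\x00" * 1024  # dummy audio chunk

async def timed_request():
    """Measure TTFB (time to first chunk) and total time for one request."""
    start = time.perf_counter()
    ttfb = None
    async for _chunk in fake_tts_stream():
        if ttfb is None:
            ttfb = time.perf_counter() - start
    total = time.perf_counter() - start
    return ttfb, total

async def run_concurrent(n: int = 8):
    """Fire n requests at once, as in the concurrent benchmark."""
    results = await asyncio.gather(*(timed_request() for _ in range(n)))
    mean_ttfb = sum(r[0] for r in results) / n
    mean_total = sum(r[1] for r in results) / n
    print(f"mean TTFB: {mean_ttfb:.3f}s, mean total: {mean_total:.3f}s")
    return results

if __name__ == "__main__":
    asyncio.run(run_concurrent())
```

The sequential run is the same loop with requests awaited one at a time instead of gathered.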