Rate Limits and Concurrency
Find the right request_concurrency for your workload by benchmarking
sequential vs bounded execution across multiple trials.
Run It
python -m cookbook production/rate-limits-and-concurrency \
--input cookbook/data/demo/text-medium --limit 1 \
--prompts 12 --trials 3 --concurrency 4 --mock
Reading the Results
Sequential (c=1):
ok rate: 12/12 | median duration: 8.4s
Bounded (c=4):
ok rate: 12/12 | median duration: 2.6s
Speedup: 3.2x (ok rate held at 100%)
The key metric is ok rate: it should stay at N/N as you increase
concurrency. If the ok rate drops, you're pushing too hard, most likely
past what the API's rate limit allows. Median duration across trials is more
reliable than a single-run timing because latency is noisy.
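As a rough illustration of how the summary numbers relate, the sketch below takes the median over trials and divides the two medians to get the speedup. The `summarize` helper and the per-trial timings are hypothetical, chosen only to reproduce the sample output above; this is not the cookbook's own reporting code.

```python
from statistics import median


def summarize(durations_s, ok_counts, total):
    """Summarize the trials for one concurrency level (hypothetical helper)."""
    return {
        # The median is robust to a single unusually slow or fast trial.
        "median_s": median(durations_s),
        # Reliability held only if every trial completed all prompts.
        "ok_rate_held": all(ok == total for ok in ok_counts),
    }


# Made-up per-trial timings for 3 trials of 12 prompts each.
sequential = summarize([8.1, 8.4, 9.0], [12, 12, 12], total=12)
bounded = summarize([2.5, 2.6, 3.1], [12, 12, 12], total=12)

speedup = sequential["median_s"] / bounded["median_s"]
print(f"Speedup: {speedup:.1f}x")  # prints "Speedup: 3.2x"
```

A speedup is only meaningful when `ok_rate_held` is true for both levels; a faster run that drops requests is not a valid comparison.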
Tuning
- Raise --concurrency stepwise until reliability drops.
- Increase --trials to reduce noise in timing comparisons.
- Keep --limit and --prompts constant while tuning concurrency.
- Start conservative (2-4) in real API mode.
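The stepwise approach in the list above can be sketched with a semaphore-bounded fan-out. Everything here is an assumption for illustration: `MockApi` is a hypothetical stand-in for a rate-limited service, and `ok_count` and `find_max_safe` are made-up helpers, not part of the cookbook.

```python
import asyncio


class MockApi:
    """Hypothetical service that rejects calls beyond max_in_flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    async def call(self, prompt):
        self.in_flight += 1
        try:
            if self.in_flight > self.max_in_flight:
                return False  # simulated rate-limit rejection
            await asyncio.sleep(0.01)  # simulated request latency
            return True
        finally:
            self.in_flight -= 1


async def ok_count(api, prompts, concurrency):
    """Run all prompts with at most `concurrency` calls in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(prompt):
        async with sem:
            return await api.call(prompt)

    results = await asyncio.gather(*(bounded(p) for p in prompts))
    return sum(results)


async def find_max_safe(api, prompts, levels):
    """Raise concurrency stepwise; stop at the first drop in ok rate."""
    best = None
    for c in levels:
        if await ok_count(api, prompts, c) < len(prompts):
            break
        best = c
    return best


prompts = [f"prompt-{i}" for i in range(12)]  # same prompt set at every level
api = MockApi(max_in_flight=4)
best = asyncio.run(find_max_safe(api, prompts, [1, 2, 4, 8]))
print(best)  # prints 4
```

Against a real API the failure threshold is not a fixed number, so it is usually safer to settle one step below the level where the ok rate first dropped.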
Next Steps
Use Large-Scale Fan-Out for concurrency across files. Combine with Resume on Failure for robust production pipelines.