Resume on Failure¶
Persist a manifest so retries process only unfinished or failed items. This is the durability pattern for long-running jobs where partial failures are expected.
How It Works¶
The recipe maintains a JSON manifest that tracks per-item status:
{
"items": {
"input.txt": {"status": "ok", "output": "outputs/items/input.json"},
"compare.txt": {"status": "error", "error": "RateLimitError: 429"},
"notes.txt": {"status": "pending"}
}
}
On retry with --failed-only, items with status: "ok" are skipped.
Run It¶
Initial run:
python -m cookbook production/resume-on-failure \
--input cookbook/data/demo/text-medium --limit 4 \
--manifest outputs/manifest.json --output-dir outputs/items --mock
Retry only failed/pending items:
python -m cookbook production/resume-on-failure \
--input cookbook/data/demo/text-medium --failed-only \
--manifest outputs/manifest.json --output-dir outputs/items --mock
What You'll See¶
Initial run:
Processing 4 items...
input.txt: ok
compare.txt: ok
notes.txt: error (simulated)
extra.txt: ok
Manifest: 3 ok, 1 error → outputs/manifest.json
Retry run:
Resuming: 1 item pending/failed
notes.txt: ok
Manifest: 4 ok, 0 error → outputs/manifest.json
The manifest updates after each item (not only at run end). Per-item JSON
artifacts in --output-dir preserve answers, usage, and metrics.
Tuning¶
--max-retriesand--backoff-secondscontrol retry aggressiveness.--limitsets workload size for staged production rollout.- Use a stable input directory between retries — changing item identity breaks resumability.
Next Steps¶
Split retries by error category (rate-limit vs validation) for smarter retry logic. Export manifest rollups to dashboards for operational visibility.