Structured Output Extraction¶
Use a Pydantic schema to get typed, validated output instead of parsing JSON by hand.
Defining the Schema¶
The recipe uses a DocumentSummary schema with bullets and risks fields.
In your own code, define whatever structure fits your extraction task:
from pydantic import BaseModel

class DocumentSummary(BaseModel):
    bullets: list[str]
    risks: list[str]
Pollux passes this schema to the provider via Options(response_schema=...).
The provider returns structured JSON; Pollux validates and parses it into your
model.
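As a rough sketch, the wiring looks like this. Only Options(response_schema=...) and the envelope come from this page; the import path and the run call are assumptions to adapt to your own setup:

from pollux import Options, run  # assumed import path; adjust to your project

# response_schema asks the provider for JSON matching DocumentSummary.
options = Options(response_schema=DocumentSummary)

# Illustrative call only; use whichever Run entry point your Pollux
# version exposes (see Run vs RunMany in Next Steps).
envelope = run(
    "Summarize this document as key bullets and risks.",
    input="path/to/file.pdf",
    options=options,
)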
Run It¶
Mock mode (validates flow and schema shape):
python -m cookbook getting-started/structured-output-extraction \
--input cookbook/data/demo/text-medium/input.txt --mock
Real API (returns actual structured data):
python -m cookbook getting-started/structured-output-extraction \
--input path/to/file.pdf --no-mock --provider openai --model gpt-5-nano
What You'll See¶
In --no-mock mode:
Status: ok
Structured output:
bullets (3): ["Context caching reduces cost by 90%", ...]
risks (2): ["Provider lock-in for caching features", ...]
Raw answer (excerpt): The document describes three key findings...
The structured field in the envelope contains a parsed DocumentSummary
object. The raw text answer is still available in answers as a fallback.
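Reading the result could look like the sketch below; the structured and answers field names come from this page, but the exact attribute access on the envelope is an assumption:

summary = envelope.structured  # parsed DocumentSummary, or None when parsing failed
if summary is not None:
    print(f"bullets ({len(summary.bullets)}):", summary.bullets)
    print(f"risks ({len(summary.risks)}):", summary.risks)
else:
    # Fall back to the raw text answer; indexing assumes answers is a list.
    print(envelope.answers[0])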
In --mock mode, you'll see a schema preview instead — the mock provider
doesn't emit real structured payloads.
Tuning¶
- Tighten the schema (required fields, enums, constraints) as you learn edge cases from real output; a tightened example follows this list.
- Make the prompt demand specificity: "source-labeled bullets with concrete facts" works better than "summarize".
- If structured is missing in real mode, the provider/model may not support structured outputs; check Provider Capabilities.
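For instance, a stricter variant of the schema above could look like the following (Pydantic v2 syntax; the specific constraints and the confidence field are illustrative, not part of the recipe):

from typing import Literal

from pydantic import BaseModel, Field

class DocumentSummaryStrict(BaseModel):
    # Require at least one bullet and cap the list so answers stay concise.
    bullets: list[str] = Field(min_length=1, max_length=5)
    # Risks may be empty, but never more than five.
    risks: list[str] = Field(default_factory=list, max_length=5)
    # Literal constrains the value to a fixed vocabulary (an enum in JSON Schema).
    confidence: Literal["low", "medium", "high"] = "medium"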
Next Steps¶
Pair with Run vs RunMany for
multi-question extraction, or add validation pipelines downstream by
writing structured to JSONL.
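A minimal sketch of that last step, assuming results is an iterable of envelopes with a structured field as described above (Pydantic v2's model_dump_json handles the serialization):

with open("summaries.jsonl", "w", encoding="utf-8") as f:
    for envelope in results:  # results: envelopes from a RunMany-style call (assumed)
        summary = envelope.structured
        if summary is None:
            continue  # skip records with no structured payload
        f.write(summary.model_dump_json() + "\n")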