Extract Media Insights

Validate multimodal prompts on a single media source (image, audio, or video) before scaling up to directories or multi-file runs.

Run It

Image:

python -m cookbook getting-started/extract-media-insights \
  --input cookbook/data/demo/multimodal-basic/sample_image.jpg --mock

Video:

python -m cookbook getting-started/extract-media-insights \
  --input cookbook/data/demo/multimodal-basic/sample_video.mp4 --mock

Audio:

python -m cookbook getting-started/extract-media-insights \
  --input cookbook/data/demo/multimodal-basic/sample_audio.mp3 --mock
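The three invocations above differ only in the input file, so they can be generated from one template. The following is a minimal sketch, not part of the cookbook: `build_command`, `RECIPE`, `DEMO_DIR`, and `SAMPLES` are hypothetical names, and only the recipe path, the demo file paths, and the `--input` and `--mock` flags come from the commands shown above.

```python
import shlex

# Values copied from the commands above; the constant names are illustrative.
RECIPE = "getting-started/extract-media-insights"
DEMO_DIR = "cookbook/data/demo/multimodal-basic"
SAMPLES = ("sample_image.jpg", "sample_video.mp4", "sample_audio.mp3")


def build_command(sample: str) -> list[str]:
    """Return the argument list for one mock run of the recipe."""
    return [
        "python", "-m", "cookbook", RECIPE,
        "--input", f"{DEMO_DIR}/{sample}",
        "--mock",
    ]


if __name__ == "__main__":
    # Print one shell-ready command per demo file.
    for sample in SAMPLES:
        print(shlex.join(build_command(sample)))
```

Each argument list can be passed to `subprocess.run` from the repository root to execute the run instead of printing it.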

What You'll See

Source: sample_image.jpg (image/jpeg)
Status: ok

Prompt 1 — "Describe the main subject":
  "A bar chart showing quarterly revenue growth, with Q3 highlighted in blue.
   The y-axis ranges from $0 to $50M."

Prompt 2 — "List all visible text":
  "Title: 'Revenue by Quarter'. Labels: Q1, Q2, Q3, Q4. Values: $12M, $28M,
   $45M, $31M."

Tokens: 890 (prompt: 780 / completion: 110)
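The token total is the sum of the prompt and completion counts (780 + 110 = 890). A quick consistency check can be sketched as follows, assuming the `Tokens:` line always has the shape shown in the sample output; `parse_tokens` is a hypothetical helper, and real runs may format the line differently.

```python
import re

# Pattern mirrors the sample line: "Tokens: 890 (prompt: 780 / completion: 110)".
TOKENS_LINE = re.compile(
    r"Tokens:\s*(\d+)\s*\(prompt:\s*(\d+)\s*/\s*completion:\s*(\d+)\)"
)


def parse_tokens(line: str) -> dict[str, int]:
    """Extract total, prompt, and completion token counts from one output line."""
    match = TOKENS_LINE.search(line)
    if match is None:
        raise ValueError(f"not a token summary line: {line!r}")
    total, prompt, completion = (int(group) for group in match.groups())
    return {"total": total, "prompt": prompt, "completion": completion}


usage = parse_tokens("Tokens: 890 (prompt: 780 / completion: 110)")
# The reported total should equal prompt + completion.
assert usage["total"] == usage["prompt"] + usage["completion"]
```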

Outputs should be specific to the media content rather than generic. Responses to video prompts should cite timestamps when they are visible; responses to audio prompts should quote speech when it is present.
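These expectations can be spot-checked with a rough heuristic. The sketch below is an illustration rather than anything the recipe provides: `missing_evidence` and both patterns are hypothetical, the timestamp pattern only recognizes `mm:ss` and `hh:mm:ss` forms, and a warning means "review this response", since the media may contain no visible timestamps or spoken quotes at all.

```python
import re

# Matches mm:ss or hh:mm:ss, e.g. "00:42" or "1:02:15" (heuristic only).
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
# Matches text inside straight or curly double quotes.
QUOTED = re.compile(r"\"[^\"]+\"|“[^”]+”")


def missing_evidence(kind: str, response: str) -> list[str]:
    """Return warnings for a response, given the media kind (image/video/audio)."""
    warnings = []
    if kind == "video" and not TIMESTAMP.search(response):
        warnings.append("no timestamp found")
    if kind == "audio" and not QUOTED.search(response):
        warnings.append("no quoted speech found")
    return warnings


if __name__ == "__main__":
    print(missing_evidence("video", "A presenter talks about charts."))
```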

Tuning

Keep prompts concrete: ask for objects, attributes, timestamps, quoted text.
Ask for evidence-labeled bullets when descriptions are vague.
Reduce prompt count while iterating, then add prompts once outputs stabilize.
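The tuning advice above can be sketched as a small prompt-selection helper. This is an assumption-heavy illustration: how prompts are actually supplied to the recipe is not shown on this page, so `BASE_PROMPTS`, `with_evidence`, and `iteration_prompts` are hypothetical names, and only the two prompt strings come from the sample output.

```python
# The two prompts from the sample output; the list name is illustrative.
BASE_PROMPTS = [
    "Describe the main subject",
    "List all visible text",
]

# Hypothetical wording for requesting evidence-labeled bullets.
EVIDENCE_SUFFIX = (
    " Answer as bullets, and label each bullet with the evidence that"
    " supports it (object, attribute, timestamp, or quoted text)."
)


def with_evidence(prompt: str) -> str:
    """Append a request for evidence-labeled bullets to a prompt."""
    return prompt.rstrip(". ") + "." + EVIDENCE_SUFFIX


def iteration_prompts(prompts: list[str], limit: int = 1, evidence: bool = False) -> list[str]:
    """Keep the first `limit` prompts while iterating; optionally ask for evidence."""
    selected = prompts[:limit]
    return [with_evidence(p) for p in selected] if evidence else list(selected)


if __name__ == "__main__":
    # Start with one prompt, then widen once outputs stabilize.
    print(iteration_prompts(BASE_PROMPTS))
    print(iteration_prompts(BASE_PROMPTS, limit=2, evidence=True))
```

Starting with `limit=1` keeps each iteration cheap; raising the limit restores the full prompt set once a single prompt produces stable, specific output.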

Next Steps

Scale to directories with Broadcast Process Files, or synthesize across multiple videos with Multi-Video Synthesis.