Extract Media Insights

Validate multimodal prompts on a single media source — image, audio, or video — before scaling up.

Run It

Image:

python -m cookbook getting-started/extract-media-insights \
  --input cookbook/data/demo/multimodal-basic/sample_image.jpg --mock

Video:

python -m cookbook getting-started/extract-media-insights \
  --input cookbook/data/demo/multimodal-basic/sample_video.mp4 --mock

Audio:

python -m cookbook getting-started/extract-media-insights \
  --input cookbook/data/demo/multimodal-basic/sample_audio.mp3 --mock
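
The three commands differ only in the --input path, which suggests the media type is inferred from the file itself. How the recipe does this is not shown here; the following is a minimal sketch that assumes extension-based detection with Python's standard mimetypes module. The helper name media_kind is hypothetical and not part of the cookbook.

```python
import mimetypes

SUPPORTED_KINDS = {"image", "audio", "video"}


def media_kind(path: str) -> str:
    """Return 'image', 'audio', or 'video' based on the file extension.

    Hypothetical helper: the cookbook's own detection logic may differ.
    """
    mime, _encoding = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"cannot infer a media type for {path!r}")
    kind = mime.split("/", 1)[0]  # 'image/jpeg' -> 'image'
    if kind not in SUPPORTED_KINDS:
        raise ValueError(f"unsupported media type {mime!r} for {path!r}")
    return kind


if __name__ == "__main__":
    for name in ("sample_image.jpg", "sample_video.mp4", "sample_audio.mp3"):
        print(name, "->", media_kind(name))
```

A check like this before a run catches a mistyped or unsupported input path early, without spending tokens on a real request.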

What You'll See

Source: sample_image.jpg (image/jpeg)
Status: ok

Prompt 1 — "Describe the main subject":
  "A bar chart showing quarterly revenue growth, with Q3 highlighted in blue.
   The y-axis ranges from $0 to $50M."

Prompt 2 — "List all visible text":
  "Title: 'Revenue by Quarter'. Labels: Q1, Q2, Q3, Q4. Values: $12M, $28M,
   $45M, $31M."

Tokens: 890 (prompt: 780 / completion: 110)

Outputs should be specific to the media content rather than generic. Responses to video prompts should cite timestamps when they are visible; responses to audio prompts should quote speech when it is present.
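
One way to spot-check these expectations is to scan each response for the evidence it should carry. The sketch below is a rough heuristic, not part of the cookbook: the function names are hypothetical, the timestamp pattern assumes m:ss or h:mm:ss notation, and the quote pattern assumes straight double quotes.

```python
import re

# Matches 0:42, 12:05, or 01:15:30 (assumed timestamp notation).
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
# Matches text between straight double quotes.
QUOTE = re.compile(r'"([^"]+)"')


def find_timestamps(text: str) -> list[str]:
    """Return every timestamp-like token in a response."""
    return TIMESTAMP.findall(text)


def find_quotes(text: str) -> list[str]:
    """Return every double-quoted passage in a response."""
    return QUOTE.findall(text)


if __name__ == "__main__":
    video = "At 0:42 the presenter opens the slide deck; at 01:15:30 the demo ends."
    audio = 'The speaker says "we shipped on time" and later "no regressions".'
    print(find_timestamps(video))  # ['0:42', '01:15:30']
    print(find_quotes(audio))      # ['we shipped on time', 'no regressions']
```

An empty result does not prove a response is wrong, since the media may contain no visible timestamps or speech, but it is a cheap signal that a prompt may need to be more concrete.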

Tuning

  • Keep prompts concrete: ask for objects, attributes, timestamps, quoted text.
  • Ask for evidence-labeled bullets when descriptions are vague.
  • Reduce prompt count while iterating, then add prompts once outputs stabilize.
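
The tuning advice above can be sketched as a small prompt-preparation step. This is an illustration under assumptions: evidence_prompt and iteration_set are hypothetical helpers, the wording of the evidence instruction is only one possible phrasing, and the two base prompts are taken from the sample output earlier on this page.

```python
BASE_PROMPTS = [
    "Describe the main subject",
    "List all visible text",
]

EVIDENCE_SUFFIX = (
    "Answer as bullets, and label each bullet with its evidence: "
    "a visible object, an attribute, a timestamp, or quoted text."
)


def evidence_prompt(question: str) -> str:
    """Append a request for evidence-labeled bullets to a prompt."""
    return f"{question.strip().rstrip('.')}. {EVIDENCE_SUFFIX}"


def iteration_set(prompts: list[str], limit: int) -> list[str]:
    """Keep only the first `limit` prompts while iterating."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return prompts[:limit]


if __name__ == "__main__":
    # Start with a single prompt; raise the limit once outputs stabilize.
    for prompt in iteration_set(BASE_PROMPTS, limit=1):
        print(evidence_prompt(prompt))
```

Keeping the limit as an explicit parameter makes it easy to widen the prompt set later without rewriting the prompts themselves.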

Next Steps

Scale to directories with Broadcast Process Files, or synthesize across multiple videos with Multi-Video Synthesis.