Multimodal

Speech, music, images, video — all first-class in MateClaw, not tacked on.

Most AI products treat multimodal generation as a plugin you bolt on later. MateClaw ships with it as core infrastructure: six image providers, six video providers, two music providers, one 3D provider, three TTS backends, and two STT backends. The generation providers sit behind a single tool interface, so agents can call any of them without knowing which vendor is underneath.

Configure once. Use everywhere.

What's in the box

Image generation — six providers

Provider	Model family	Notes
DashScope	Wanxiang	Alibaba's image model, default cloud option
OpenAI	DALL-E 3	Standard DALL-E endpoint
fal.ai	Flux	Fast Flux inference via fal.ai
Google (Nano Banana)	gemini-3-pro-image-preview, gemini-2.5-flash-image	Via the native Gemini path; supports image editing — see Nano Banana below
Zhipu	CogView	Native Chinese prompt support
MiniMax	—	Sync and async both supported

The image-generation tool uses the provider configured as default unless a call forces a specific one. Async generation returns a job id that the backend polls in the background; when the image lands, it attaches to the original assistant message, not a new one.

New in 1.3.0

In v1.3.0, DashScope Wanxiang was wired into the unified multimodal-generation endpoint (multimodal-generation/generation), adding 14 image models, 6 of which support image editing. See Image edit below.

Image edit

New in 1.3.0

Image editing (image-to-image) is supported from v1.3.0. In v1.2.0 and earlier, image_generate was text-to-image only.

The image_generate tool gains two parameters, image and images:

Parameter	Shape	Description
`image`	String: path / `file://` / `data:image/...` / `http(s)://` / `msg:<id>:<idx>`	Single reference image
`images`	Array of the same forms	Multiple reference images (up to 5)

The tool normalizes every reference into an in-memory buffer before forwarding it to the provider. Five reference forms are accepted:

Local path — /abs/path.png / ~/x.png / ./rel.png
file:// URL — absolute-path variant
data:image/png;base64,... — inline base64 / percent-encoded body
http(s)://... — with SSRF guard (rejects internal hosts)
msg:<messageId>[:<partIdx>] — references an image attachment from a message in the same conversation. Works for non-vision models too — the agent doesn't need to "see" the bytes; merely having seen the messageId in conversation history is enough
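The five forms above can be told apart by prefix alone. Here is a minimal sketch of that first classification step, not MateClaw's actual normalizer: `ImageRef` and `classify_image_ref` are hypothetical names, and treating a missing part index as the first attachment is an assumption. Fetching bytes, decoding base64, and the SSRF check would follow in a real implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical helper names; the real normalizer is not shown on this page.
_MSG_RE = re.compile(r"^msg:([^:]+)(?::(\d+))?$")


@dataclass(frozen=True)
class ImageRef:
    kind: str  # "path" | "file_url" | "data_url" | "http_url" | "message"
    value: str  # the raw reference string as passed to the tool
    message_id: Optional[str] = None
    part_index: Optional[int] = None


def classify_image_ref(ref: str) -> ImageRef:
    """Map a reference string onto one of the five documented forms."""
    match = _MSG_RE.match(ref)
    if match:
        # Assumption: an omitted part index means the first attachment.
        index = int(match.group(2)) if match.group(2) is not None else 0
        return ImageRef("message", ref, match.group(1), index)
    lowered = ref.lower()
    if lowered.startswith("data:image/"):
        return ImageRef("data_url", ref)
    if lowered.startswith(("http://", "https://")):
        return ImageRef("http_url", ref)
    if lowered.startswith("file://"):
        return ImageRef("file_url", ref)
    # Everything else is treated as a local path (absolute, ~, or relative).
    return ImageRef("path", ref)
```

Checking the `msg:` form first keeps it from being mistaken for a relative path, which is the fallback for anything without a recognized prefix.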

```text
User: (uploads a sunset image, messageId=12345) Replace the background with a forest.
Agent: image_generate(prompt="replace background with forest",
                     image="msg:12345:0",
                     model="qwen-image-edit")
```

DashScope Wanxiang models that support image editing include:

wan2.7-image / wan2.7-image-pro (T2I + edit)
qwen-image-edit / qwen-image-edit-plus / qwen-image-edit-max (edit-only)

A fuller model catalog lives in Models.

Nano Banana

New in 1.4.0

Google image generation runs through Nano Banana Pro (gemini-3-pro-image-preview) via the native Gemini path, not an OpenAI-compatibility shim.

Because it uses the native generateContent endpoint, the image tool passes input images as inline parts straight to the model — so Nano Banana isn't just text-to-image, it supports image editing (image-to-image) too. It works exactly like Image edit above: pass the image / images parameter to reference one or more source images.

Nano Banana Pro — gemini-3-pro-image-preview (default)
Nano Banana — gemini-2.5-flash-image (another Google image model)

Video generation — six providers

DashScope — Tongyi Wanxiang video
Runway — Gen-2 / Gen-3 via API
MiniMax (Hailuo) — text-to-video and image-to-video
Fal — fast inference pipeline
CogVideo — Zhipu CogVideoX
Kling — Kuaishou Kling video generation

Same async-attach model as image generation. Videos appear inline in the chat once rendering finishes — in the same bubble where the agent first said "working on it".

Music generation — two providers

Google Lyria — high-quality music generation
MiniMax — music generation with lyrics + style prompts

The music-generation tool takes a prompt, an optional style tag, and optional lyrics. Output is an MP3 attached to the message.

3D model generation — one provider

Tencent Hunyuan 3D — HY-3D-3.1 / HY-3D-3.0 (Pro, supports PBR / multi-view / white-model) / HY-3D-Express (rapid)

Text-to-3D and image-to-3D both work; output is a .glb rendered inline by <model-viewer> for drag-to-rotate preview. Full setup walkthrough: 3D Model Generation.

Text-to-speech (TTS) — three providers

DashScope CosyVoice — Chinese + English, natural prosody
OpenAI TTS — alloy, echo, fable, onyx, nova, shimmer
Edge TTS — free, no API key required; wide voice selection

Click the speaker icon on any assistant message to read it aloud. The voice is whichever TTS provider is active in Settings.

Speech-to-text (STT) — two providers

DashScope Paraformer — Chinese-first, low latency
OpenAI Whisper — the standard multilingual benchmark

Hold the mic button in the chat input to speak. Release to transcribe. Edit the result before sending if you want to.

Configuration

All multimodal providers live under Settings → Models → [category]. Add a provider once with its API key, then mark it as default for its category.

```yaml
# application.yml (minimal example)
mate:
  image:
    default-provider: dashscope
  video:
    default-provider: dashscope
  tts:
    default-provider: cosyvoice
  stt:
    default-provider: paraformer
  music:
    default-provider: minimax  # assumed key for MiniMax; check Settings → Models for the exact id
```

Per-agent overrides are available if you want a specific agent to always use, say, Flux for images and CosyVoice for voice.
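To make the layering concrete, here is a rough sketch of how a provider could be resolved for one call. The precedence shown (per-call choice, then per-agent override, then category default) is an assumption about the ordering, and `resolve_provider`, the dict layout, and the `"fal"` key are illustrative rather than MateClaw's real configuration model.

```python
from typing import Mapping, Optional

# Mirrors part of the application.yml example; the dict layout is illustrative.
GLOBAL_DEFAULTS = {"image": "dashscope", "tts": "cosyvoice"}


def resolve_provider(
    category: str,
    global_defaults: Mapping[str, str],
    agent_overrides: Optional[Mapping[str, str]] = None,
    per_call: Optional[str] = None,
) -> str:
    """Pick a provider id for one tool call.

    Assumed precedence: explicit per-call choice, then the agent's
    override, then the category default from configuration.
    """
    if per_call:
        return per_call
    if agent_overrides and category in agent_overrides:
        return agent_overrides[category]
    if category in global_defaults:
        return global_defaults[category]
    raise LookupError(f"no default provider configured for {category!r}")


# An agent pinned to a different image provider ("fal" is a hypothetical key).
flux_agent = {"image": "fal"}
print(resolve_provider("image", GLOBAL_DEFAULTS, flux_agent))  # fal
print(resolve_provider("tts", GLOBAL_DEFAULTS, flux_agent))    # cosyvoice
```

Raising when no default exists, instead of silently picking a vendor, makes a missing configuration entry visible at the first call.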

How agents use it

Every multimodal capability is exposed as a tool:

Tool	Signature
`image_generate`	`(prompt, style?, size?, image?, images?, model?)`
`video_generate`	`(prompt, duration?)`
`music_generate`	`(prompt, style?, lyrics?)`

Agents call them exactly like any other tool. The tool layer handles provider selection, retries, async polling, and attachment binding.
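As an illustration of what a tool layer like this does between the agent and the vendor, the sketch below wraps a provider call in a simple retry loop. All names here (`ProviderError`, `call_with_retries`, the `providers` registry) are hypothetical, and the retry count and backoff are placeholder values, not MateClaw's actual policy.

```python
import time
from typing import Callable, Dict, Optional

# A provider is modeled as a function from prompt to an asset reference.
ProviderFn = Callable[[str], str]


class ProviderError(Exception):
    """Raised by a provider for a failure that may succeed on retry."""


def call_with_retries(
    provider: ProviderFn,
    prompt: str,
    attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> str:
    """Call the provider, retrying on ProviderError with exponential backoff."""
    last_error: Optional[ProviderError] = None
    for attempt in range(attempts):
        try:
            return provider(prompt)
        except ProviderError as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    raise ProviderError(f"all {attempts} attempts failed") from last_error


def image_generate(prompt: str, providers: Dict[str, ProviderFn], provider_name: str) -> str:
    """The agent-facing tool: the vendor is just a key in a registry."""
    return call_with_retries(providers[provider_name], prompt)
```

The agent only ever sees `image_generate`; swapping vendors means changing the registry key, not the call site.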

Async generation and message binding

Image and video generation often take longer than a normal agent turn. MateClaw handles this in four steps:

Agent calls the generate tool.
Tool returns immediately with a job id and a placeholder attachment.
Backend polls the provider in the background.
When the result lands, it's attached to the original assistant message — not a new one.

The effect: the finished image or video appears inside the same bubble where the agent first said "working on it", rather than floating in a new message.
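The four steps can be sketched with an in-memory tracker. This is a simplified model under stated assumptions, not the real backend: `Message`, `Attachment`, and `JobTracker` are hypothetical types, and `check` stands in for whatever status call a given provider exposes.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Attachment:
    job_id: str
    status: str = "pending"  # placeholder until the provider finishes
    url: Optional[str] = None


@dataclass
class Message:
    message_id: str
    text: str
    attachments: List[Attachment] = field(default_factory=list)


class JobTracker:
    """Remembers which message each async job belongs to."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Message] = {}

    def start(self, message: Message) -> str:
        # Steps 1-2: return a job id right away and add a placeholder.
        job_id = uuid.uuid4().hex
        message.attachments.append(Attachment(job_id))
        self._jobs[job_id] = message
        return job_id

    def poll_once(self, check: Callable[[str], Optional[str]]) -> None:
        # Step 3: ask the provider about every open job.
        for job_id, message in list(self._jobs.items()):
            url = check(job_id)
            if url is None:
                continue  # still rendering
            # Step 4: fill in the placeholder on the ORIGINAL message.
            for attachment in message.attachments:
                if attachment.job_id == job_id:
                    attachment.status, attachment.url = "ready", url
            del self._jobs[job_id]
```

Because the placeholder is created on the message at call time, finishing a job is an update to an existing attachment rather than the creation of a new message.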

Where it shows up in the product

Chat — drag an image into the input for vision models; press-and-hold the mic to dictate; click the speaker on any response to read aloud; generated media appears inline.
Agents — enable or disable specific multimodal tools per agent.
Tools page — every provider has a test button so you can verify a key before using it in production.
Desktop app — everything above, plus local filesystem access for batch operations.

When to use what

Image — documentation illustrations, slide graphics, concept visualization, marketing. Start with DashScope or Flux; DALL-E 3 when you need tight text rendering.
Video — short-form demos, social content, product animations. Runway for quality, MiniMax for Chinese scenarios, DashScope for cloud-local.
Music — background tracks, demo jingles, creative exploration. Two providers today; expect the surface to evolve.
TTS — accessibility, audiobook-style reading, multilingual content. CosyVoice for Chinese, OpenAI for English variety.
STT — voice-first input, meeting transcription, dictation workflows. Paraformer for Chinese, Whisper for everything else.

Multimodal input: primary doesn't speak it? Use a sidecar

Added in 1.3.0

This page is about generation (output). The input side — uploading an image to a text-only primary model — runs through a separate "multimodal sidecar" path. See Chat → Primary model can't see images? and Models → Multimodal sidecar (system-wide).

In short: configure a vision model under Settings → Models → Multimodal sidecar. When the primary model can't handle an uploaded image, the runtime captions it via the sidecar first and feeds the description to the primary chat. Primary stays cheap; the routing decision is fully visible in the chat UI (badge on the bubble, hint above the input box).
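The routing decision described above fits in a few lines. The sketch below assumes a `supports_vision` flag on the primary model and a `caption` callable standing in for the sidecar; both are hypothetical names, and the bracketed caption format is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ModelInfo:
    name: str
    supports_vision: bool


@dataclass
class RoutedInput:
    text: str
    images: List[bytes]
    used_sidecar: bool  # lets the UI show a badge when the sidecar ran


def route_input(
    text: str,
    images: List[bytes],
    primary: ModelInfo,
    caption: Optional[Callable[[bytes], str]] = None,
) -> RoutedInput:
    """Send images straight to a vision-capable primary, else caption them first."""
    if not images or primary.supports_vision:
        return RoutedInput(text, list(images), used_sidecar=False)
    if caption is None:
        raise RuntimeError("primary model cannot read images and no sidecar is configured")
    descriptions = [caption(image) for image in images]
    described = "\n".join(
        f"[image {i + 1}: {d}]" for i, d in enumerate(descriptions)
    )
    # The primary model receives text only; the raw bytes stop at the sidecar.
    return RoutedInput(f"{text}\n{described}", [], used_sidecar=True)
```

Returning `used_sidecar` alongside the rewritten text is what would let a chat UI surface the routing decision instead of hiding it.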

Next

Chat & Messaging — attachment input, multimodal sidecar routing, how generated media attaches to messages
Models — provider configuration UI, multimodal sidecar settings
Tools — the tool system that hosts multimodal generation
