Boson AI Releases Higgs Audio v3 TTS

• Boson AI releases Higgs Audio v3 TTS, a 4B-parameter conversational model supporting 100+ languages.
• The model achieves single-digit WER/CER on benchmarks like Seed-TTS (1.11) and MiniMax-Multilingual (2.74).
• SGLang-Omni framework enables multi-stage, real-time speech generation with inline control for emotion and style.

Boson AI has released Higgs Audio v3 TTS, a text-to-speech model built for conversational voice agents and now supported by the SGLang-Omni inference framework. The model uses a roughly 4B-parameter Qwen3-4B backbone and processes interleaved text and audio tokens to generate natural, expressive speech. Higgs Audio v3 is optimized for real-time interaction: it can begin synthesis before a full sentence has arrived while keeping speaker identity and emotion stable as the input stream grows. The model supports 111 languages and dialects and achieves single-digit word error rates (WER) and character error rates (CER) across 100 languages. On benchmarks, Higgs attained macro-averaged WER/CER scores of 1.11 on Seed-TTS, 4.41 on CV3, and 2.74 on MiniMax-Multilingual.
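For readers unfamiliar with the metrics, the sketch below shows one common way WER, CER, and a macro average are computed. It is illustrative only: the announcement does not describe the scoring pipeline (text normalization, the ASR model used for transcription, or how languages are grouped), so the function names and the per-language numbers here are hypothetical.

```python
from typing import Dict, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return WER (unit="word") or CER (unit="char") as a percentage."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        # Assumption: spaces are ignored when scoring characters.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def macro_average(per_language: Dict[str, float]) -> float:
    """Unweighted mean of per-language scores: every language counts equally."""
    return sum(per_language.values()) / len(per_language)


# Hypothetical scores, not taken from the announcement.
scores = {"en": 1.0, "de": 3.0}
print(error_rate("the cat sat down", "the cat sit down"))  # one substitution in four words
print(macro_average(scores))
```

A macro average weights each language equally regardless of how many test utterances it has, which is why a handful of difficult languages can move the headline number noticeably.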

Developers can embed inline control tokens, such as <|emotion:amusement|> or <|sfx:laughter|>, in the input text to adjust emotion, speaking style, speed, pitch, pauses, and sound effects. SGLang-Omni serves the model through a multi-stage generation pipeline in which compute patterns, memory requirements, and latency vary by stage. The framework uses a stage abstraction that coordinates communication between components through layered control and data planes, and it isolates GPU memory so that resource contracts are enforced per stage. The runtime also includes optimizations such as CUDA-Graph-friendly feedback runners and streaming vocoder schedulers to reduce time-to-first-audio.
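As a rough illustration of how such tokens might be composed into input text, the sketch below builds a prompt string from plain text and (kind, value) pairs. The `<|kind:value|>` pattern is inferred from the two examples quoted above; the helper names are hypothetical, the placement of tokens relative to the text is an assumption, and the spellings of other controls (speed, pitch, pauses) are not given in the announcement, so they are not shown.

```python
from typing import Iterable, List, Tuple, Union

# A segment is either plain text or a (kind, value) control pair.
Segment = Union[str, Tuple[str, str]]


def control_token(kind: str, value: str) -> str:
    """Format a control token in the <|kind:value|> shape seen in the quoted examples."""
    return f"<|{kind}:{value}|>"


def build_prompt(segments: Iterable[Segment]) -> str:
    """Join text and control tokens into one input string, separated by single spaces."""
    parts: List[str] = []
    for segment in segments:
        if isinstance(segment, tuple):
            kind, value = segment
            parts.append(control_token(kind, value))
        else:
            parts.append(segment.strip())
    return " ".join(parts)


prompt = build_prompt([
    ("emotion", "amusement"),
    "You will not believe what just happened.",
    ("sfx", "laughter"),
])
print(prompt)
```

Keeping the tokens as structured pairs until the last step makes it easier to validate them against whatever vocabulary the deployed model actually accepts.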

End-to-end optimizations for Higgs include fused preprocessing, batched vocoder decoding, and RadixAttention caching, which lets repeated voice-cloning references reuse prefix caches. In performance testing on a single H100 GPU at a concurrency of 16, the system reached a throughput of 14.74 requests per second with a mean latency of 1079 ms. It produces audio faster than real time, recording an audio_s/s value of 61.84 at the same concurrency. Higgs joins an existing ecosystem of SGLang-Omni models, including Qwen3-Omni and Fish Audio S2-Pro, which share the same infrastructure for organizing heterogeneous, multi-stage generation tasks. The model can be deployed via Docker and supports streaming, zero-shot voice cloning, and direct API interaction.
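The reported figures can be cross-checked with some simple arithmetic. The sketch below assumes that audio_s/s means seconds of generated audio per wall-clock second summed over all in-flight requests, and that the load generator kept 16 requests in flight throughout; neither detail is stated in the announcement, and the function name is hypothetical.

```python
from typing import Dict


def summarize_benchmark(concurrency: int,
                        throughput_rps: float,
                        mean_latency_s: float,
                        audio_s_per_s: float) -> Dict[str, float]:
    """Derive secondary figures from the four reported benchmark numbers."""
    return {
        # Little's law: latency = requests in flight / throughput.
        "latency_from_littles_law_s": concurrency / throughput_rps,
        # Little's law rearranged: requests in flight = throughput * latency.
        "implied_in_flight": throughput_rps * mean_latency_s,
        # Average seconds of audio produced per request.
        "audio_per_request_s": audio_s_per_s / throughput_rps,
        # Real-time factor of each stream if all slots are busy.
        "per_stream_speedup": audio_s_per_s / concurrency,
    }


stats = summarize_benchmark(concurrency=16,
                            throughput_rps=14.74,
                            mean_latency_s=1.079,
                            audio_s_per_s=61.84)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, Little's law gives a latency of about 1.085 s, close to the reported 1079 ms, and each request yields roughly 4.2 seconds of audio, so each of the 16 streams runs at a bit under four times real time.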