Folx
Technical report by Folx — consulting, software development & AI/ML engineering. We partner with you to create, develop and deliver advanced IT solutions — from cutting-edge AI-powered systems to seamless technology integration. folx.it

OmniVoice Real-Time TTS for Polish — Viability Report

Date: 2026-04-09 · GPU: RTX 5090 (32 GB) · Model: k2-fsa/OmniVoice, float16 · Voice: weronika (cloned) · Language: Polish
Why Polish? All tests use Polish, a language that is notoriously difficult for TTS systems. Polish features complex consonant clusters (szcz, prz, trz, dz, dź, dż), nasal vowels (ą, ę), context-dependent palatalization, largely fixed penultimate stress with lexical exceptions, and extensive inflection that changes word endings across cases, genders, and numbers. These phonological challenges make Polish an excellent stress test for any TTS model: if it handles Polish well, most other European languages will be easier.

The Models

OmniVoice

by k2-fsa · Apache-2.0 · arXiv:2604.00688

State-of-the-art massively multilingual zero-shot TTS supporting 600+ languages — the broadest language coverage among zero-shot TTS models. Uses a diffusion language model architecture with an 8-layer hierarchical audio codebook for high-fidelity synthesis.

  • Architecture: Qwen3-0.6B backbone (~600M params), masked diffusion iterative decoding
  • Voice cloning: Zero-shot from reference audio + transcript
  • Voice design: Control gender, age, pitch, accent, dialect, whisper via text instructions
  • Paralinguistics: [laughter], [breath] and pronunciation correction via pinyin/phonemes
  • Speed: RTF as low as 0.025 (40x real-time)
  • Available on: HuggingFace, CLI (omnivoice-infer), Gradio demo

Chatterbox

by Resemble AI · MIT License

Family of three open-source TTS models (Chatterbox, Chatterbox-Multilingual, Chatterbox-Turbo) designed for natural speech generation with voice cloning. Built-in Perth watermarking for AI audio detection.

  • Architecture: Autoregressive speech tokenizer + mel-spectral decoder, 350M–500M params
  • Voice cloning: Zero-shot from reference audio
  • Turbo variant: 350M params, distilled to 1-step decoding for low-latency voice agents
  • Paralinguistics: Native [laugh], [cough], [chuckle] tags
  • Languages: 23+ (English, Spanish, French, German, Polish, Chinese, Japanese, Arabic, Hindi, etc.)
  • Watermarking: Perth watermarks survive MP3 compression with ~100% detection accuracy
  • Available on: pip install chatterbox-tts, HuggingFace

Architecture Overview

              | Chatterbox                      | OmniVoice
Model type    | Autoregressive (token-by-token) | Masked diffusion (iterative unmasking)
Backbone      | Custom T3 decoder               | Qwen3-0.6B LLM (~600M params)
Audio codec   | Single codebook stream          | 8-layer hierarchical (8×1025 tokens)
Streaming     | True per-token streaming        | No native streaming; full sequence per call
Voice cloning | Embedding conditioning          | Reference audio tokenization + prefix
Languages     | Polish + basic multilingual     | 600+ languages native

Key Architectural Difference

Chatterbox is real-time because it streams token-by-token — the autoregressive decoder emits tokens sequentially, each decoded to audio and sent immediately. TTFA = time to generate the first few tokens.

OmniVoice uses masked diffusion — all audio tokens start as [MASK] and are iteratively unmasked over N steps. Each step runs a full forward pass over the entire sequence. Tokens are revealed by confidence score, not position. Partial audio cannot be decoded mid-generation.

However, OmniVoice's raw inference speed is so fast (RTF 0.01–0.04) that text-level chunking achieves comparable or better TTFA than Chatterbox's token streaming.
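The text-level chunking idea can be sketched in plain Python (no OmniVoice dependency): split at sentence boundaries and merge very short fragments so each chunk carries enough context for natural prosody. The min_chars threshold is an illustrative choice, not a value from these tests.

```python
import re

def chunk_sentences(text: str, min_chars: int = 20) -> list[str]:
    """Split text at sentence boundaries; merge fragments shorter than
    min_chars into the preceding chunk so tiny sentences do not become
    standalone generation calls."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    for part in parts:
        if chunks and len(chunks[-1]) < min_chars:
            chunks[-1] = f"{chunks[-1]} {part}"
        else:
            chunks.append(part)
    return chunks
```

Each chunk is then passed to a separate generate() call; the wall time of the first call is the TTFA reported in Test 2 below.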


Test 1: Baseline Latency

Full generation, voice cloned, no chunking. Total wall time from generate() to returned tensor.

Text   | Chars | Audio      | 32 steps          | 16 steps          | 8 steps
tiny   | 12    | 1.2–1.5s   | 269ms (RTF 0.222) | 136ms (RTF 0.094) | 72ms (RTF 0.054)
short  | 66    | 4.3s       | 298ms (RTF 0.069) | 151ms (RTF 0.035) | 80ms (RTF 0.019)
medium | 196   | 12.6s      | 503ms (RTF 0.040) | 255ms (RTF 0.020) | 137ms (RTF 0.011)
long   | 488   | 28.5–28.9s | 849ms (RTF 0.030) | 446ms (RTF 0.015) | 255ms (RTF 0.009)
(Audio samples: baseline generations for tiny, short, medium, and long texts at 32, 16, and 8 steps.)

Test 2: Chunked Streaming (TTFA Measurement)

Simulates streaming: split text at sentence boundaries, generate each chunk independently, measure time to first completed chunk.
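The measurement loop can be sketched as below; model.generate stands in for the OmniVoice call and is an assumption about the interface, not the exact signature.

```python
import time

def measure_chunked(model, chunks: list[str], **gen_kwargs):
    """Generate chunks sequentially; report time-to-first-audio
    (first completed chunk) and total generation time."""
    audios = []
    t0 = time.perf_counter()
    for i, chunk in enumerate(chunks):
        audios.append(model.generate(chunk, **gen_kwargs))
        if i == 0:
            ttfa = time.perf_counter() - t0
    total = time.perf_counter() - t0
    return ttfa, total, audios
```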

Medium text (196 chars, 2 chunks)

Steps | TTFA (chunk 0) | Chunk 0 audio | Chunk 1 gen | Chunk 1 audio | Total gen
32    | 335ms          | 7.00s         | 327ms       | 5.80s         | 662ms
16    | 169ms          | 7.00s         | 165ms       | 5.80s         | 333ms
8     | 88ms           | 7.00s         | 86ms        | 5.80s         | 173ms
(Audio samples: chunked medium-text generations at 32, 16, and 8 steps, 2 chunks each.)

Long text (488 chars, 5 chunks)

Steps | TTFA (chunk 0) | Chunks | Total gen | Total audio | RTF
32    | 331ms          | 5      | 1605ms    | 29.6s       | 0.054
16    | 168ms          | 5      | 813ms     | 29.6s       | 0.027
8     | 87ms           | 5      | 423ms     | 29.6s       | 0.014

Per-chunk breakdown (long text, 16 steps):

Chunk | Gen time | Audio | Content
0     | 168ms    | 7.00s | "Powazny blad w obiegu dokumentow..."
1     | 165ms    | 5.80s | "Przez pomylke dokumentacja..."
2     | 150ms    | 4.08s | "Incydent zostal zgloszony..."
3     | 171ms    | 7.44s | "Linia lotnicza przeprosila..."
4     | 159ms    | 5.32s | "Zwiazki zawodowe domagaja sie..."
(Audio samples: chunked long-text generations at 32, 16, and 8 steps, 5 chunks each.)

Test 3: Voice Clone Prompt Caching

create_voice_clone_prompt() pre-encodes reference audio into reusable tokens.

Mode                                      | Generation time
Raw ref_audio path (re-encodes each call) | 264ms
Pre-cached VoiceClonePrompt               | 255ms
Prompt creation cost                      | 37ms (one-time)
Per-call savings                          | 9ms (~3%)

Prompt encoding is already fast (37ms). Caching is still worthwhile for a server to avoid redundant re-encoding.
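A server-side cache can look like the sketch below. create_voice_clone_prompt and generate(..., voice_clone_prompt=...) are the calls named in this report; the exact signatures and the VoicePromptCache wrapper are assumptions for illustration.

```python
class VoicePromptCache:
    """Cache the output of create_voice_clone_prompt() per voice so the
    ~37ms reference-audio encoding is paid once, not on every request."""

    def __init__(self, model):
        self.model = model
        self._prompts: dict[str, object] = {}

    def get(self, voice_id: str, ref_audio: str, transcript: str):
        if voice_id not in self._prompts:
            self._prompts[voice_id] = self.model.create_voice_clone_prompt(
                ref_audio, transcript
            )
        return self._prompts[voice_id]

    def synthesize(self, text: str, voice_id: str, ref_audio: str,
                   transcript: str, num_steps: int = 16):
        prompt = self.get(voice_id, ref_audio, transcript)
        return self.model.generate(text, voice_clone_prompt=prompt,
                                   num_steps=num_steps)
```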


Test 4: Concurrent Inference

Same text, 3 requests

Mode                                       | Wall time | Per-request    | Speedup
Sequential                                 | 767ms     | 255ms each     | 1.0x
3x concurrent (thread pool + CUDA streams) | 462ms     | 436–460ms each | 1.66x
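The concurrent path can be sketched with a small thread pool; in the actual test each worker additionally ran on its own torch.cuda.Stream so kernels from different requests could overlap on the GPU. The model object is assumed to be thread-safe for inference.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_concurrent(model, texts: list[str], max_workers: int = 3, **kw):
    """Run several generate() calls in parallel threads; results come back
    in the order of `texts`, and wall time approaches the slowest single
    request rather than the sum of all of them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(model.generate, t, **kw) for t in texts]
        return [f.result() for f in futures]
```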

Mixed text lengths, 3 concurrent

Request | Text              | Latency | Audio
0       | tiny (12 chars)   | 408ms   | 1.51s
1       | medium (196 chars)| 502ms   | 12.58s
2       | long (488 chars)  | 555ms   | 28.61s

Wall time: 558ms

Test 5: Pipeline Breakdown

Isolating overhead (averaged over 10 runs, medium text, 16 steps):

Stage                                      | Time   | % of total
Token generation (16 steps)                | ~245ms | 92%
Audio decode (HiggsAudioV2)                | 10ms   | 4%
Post-process (silence removal, fade, norm) | 10ms   | 4%

Token generation dominates. Decode and post-processing are negligible. Optimization should focus on the forward pass.
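The per-stage numbers were collected with wall-clock timers around each stage; a reusable sketch is below. When timing GPU work, the clock must be read only after synchronization, since CUDA kernel launches return before the kernels finish.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str, sync=None):
    """Accumulate wall-clock time per pipeline stage. Pass
    sync=torch.cuda.synchronize when timing GPU work."""
    if sync:
        sync()
    t0 = time.perf_counter()
    try:
        yield
    finally:
        if sync:
            sync()
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - t0)
```

Usage is one `with stage("tokens"): ...` block per stage (token generation, decode, post-process), then dividing each accumulated time by the run count.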


Test 6: First-Chunk-Optimized Streaming

Best strategy: split at first sentence, generate short first chunk for minimum TTFA, generate rest while first chunk plays.

Steps | TTFA  | First chunk plays | Rest gen time | Gap?
32    | 335ms | 7.00s             | 700ms         | No; rest ready 6.3s early
16    | 169ms | 7.00s             | 357ms         | No; rest ready 6.6s early
8     | 87ms  | 7.00s             | 185ms         | No; rest ready 6.8s early
(Audio samples: optimized streaming, long text, at 32, 16, and 8 steps.)

Even for the longest text (488 chars, 29s audio), there is zero playback gap at any step count. The first chunk produces 7s of audio, providing a massive buffer window. Margin is 6–7 seconds — enough to absorb network jitter, encoding, and client buffering.
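The strategy can be sketched as a producer/consumer pair: generate chunk 0 synchronously for minimum TTFA, start playback, and fill the rest of the queue from a background thread. Here model.generate and play are placeholders for the OmniVoice call and the client's audio sink.

```python
import queue
import threading

def stream_first_chunk_optimized(model, chunks: list[str], play, **kw):
    """Chunk 0 is generated on the caller's thread (its wall time is the
    TTFA); remaining chunks are generated in the background while the
    first chunk is already playing."""
    q: queue.Queue = queue.Queue()
    q.put(model.generate(chunks[0], **kw))

    def produce_rest():
        for chunk in chunks[1:]:
            q.put(model.generate(chunk, **kw))
        q.put(None)  # sentinel: generation finished

    threading.Thread(target=produce_rest, daemon=True).start()
    while (audio := q.get()) is not None:
        play(audio)
```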


Test 7: GPU Memory Profile

Scenario                 | Peak VRAM
Model loaded (idle)      | 5.41 GB
1 inference              | 5.61 GB
3 concurrent inferences  | 5.87 GB
Headroom on 32 GB        | 26.1 GB free
Estimated max concurrent | ~5

Incremental cost per concurrent inference is ~150 MB. Substantial headroom for additional model instances or concurrent requests.


OmniVoice vs Chatterbox

Metric            | Chatterbox             | OmniVoice (16 steps) | Winner
TTFA              | ~250ms                 | 169ms                | OmniVoice
RTF (medium text) | 0.05–0.10              | 0.020                | OmniVoice
Streaming type    | True token-level       | Chunk-level          | Chatterbox
Playback gaps     | None                   | None (7s buffer)     | Tie
Voice quality     | Good                   | Excellent            | OmniVoice
Voice cloning     | Embedding conditioning | Ref audio + text     | OmniVoice
Languages         | Polish + limited       | 600+                 | OmniVoice
VRAM usage        | 4–6 GB                 | 5.6 GB               | Tie
Concurrent users  | 3–4 with slot pool     | 3–5 on single GPU    | Tie

Chatterbox vs OmniVoice — Side-by-Side (weronika voice)

Same text, same voice (weronika), same GPU. Chatterbox generated via production server. OmniVoice at 16 steps (recommended) and 32 steps (best quality).

(Audio sample pairs, same text and voice per row.)

Tiny — "Dzien dobry.": Chatterbox vs OmniVoice 16 steps
Short — News sentence (66 chars): Chatterbox vs OmniVoice 16 steps
Medium — PLL LOT incident (196 chars): Chatterbox vs OmniVoice 16 steps; Chatterbox vs OmniVoice 32 steps
Long — Full news story (488 chars): Chatterbox vs OmniVoice 16 steps; Chatterbox vs OmniVoice 32 steps


OmniVoice Internal Comparisons

Step count and chunking tradeoffs within OmniVoice.

(Audio sample pairs.)

32 vs 16 steps (medium text): 32 steps (reference) vs 16 steps (recommended)
Baseline vs chunked (medium text, 16 steps): baseline (single generation) vs chunked (2 chunks, streaming simulation)
8-step quality floor (medium text): 32 steps (best quality) vs 8 steps (fastest, 87ms TTFA)
Optimized streaming vs baseline (long text): baseline 16 steps (single shot) vs optimized 16 steps (chunked stream)


Paralinguistics: Non-Verbal Sound Tags

OmniVoice supports expressive non-verbal tags embedded directly in text. All samples: Polish, voice-cloned (weronika), 16 steps.

Supported tags: [laughter], [sigh], [confirmation-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn]

Tag                   | Text                                                       | Gen   | Audio
[laughter]            | "...to sie stalo [laughter] naprawde nie moge."            | 156ms | 3.91s
[sigh]                | "No coz [sigh] trzeba bylo to przewidziec."                | 150ms | 2.41s
[confirmation-en]     | "[confirmation-en] tak, dokladnie o to mi chodzilo."       | 152ms | 2.77s
[question-ah]         | "Naprawde tak uwazasz [question-ah] bo ja mam watpliwosci."| 153ms | 3.68s
[question-oh]         | "[question-oh] a to ciekawe, kiedy to sie stalo?"          | 151ms | 2.74s
[question-ei]         | "Mowisz powaznie [question-ei] nie zartujesz?"             | 147ms | 2.71s
[surprise-ah]         | "[surprise-ah] nie spodziewalam sie tego!"                 | 146ms | 2.61s
[surprise-oh]         | "[surprise-oh] to niesamowite co sie wydarzylo."           | 151ms | 2.76s
[surprise-wa]         | "[surprise-wa] ale rewelacja, nie do wiary!"               | 150ms | 2.30s
[surprise-yo]         | "Wygralismy konkurs [surprise-yo] fantastycznie!"          | 148ms | 3.08s
[dissatisfaction-hnn] | "[dissatisfaction-hnn] no nie wiem, to mnie nie przekonuje." | 155ms | 3.38s
Mixed (4 tags)        | "...co sie stalo [question-ah] ...wygrali [surprise-oh] ...stracili [sigh] ...[dissatisfaction-hnn] trzeba bylo..." | 247ms | 10.44s
No tags (control)     | "Nie moge uwierzyc, ze to sie stalo, naprawde nie moge."   | 153ms | 3.40s
(Audio samples, grouped.)

Emotions & Reactions: [laughter], [sigh], [confirmation-en], [dissatisfaction-hnn] samples
Questions: [question-ah], [question-oh], [question-ei] samples
Surprise: [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo] samples
Mixed emotions (4 tags in one sentence): "Slyszales co sie stalo [question-ah] okazuje sie ze wygrali [surprise-oh] a potem wszystko stracili [sigh] no i co tu duzo mowic [dissatisfaction-hnn] trzeba bylo lepiej planowac."
Control (no tags): "Nie moge uwierzyc, ze to sie stalo, naprawde nie moge."

All tags produce distinct non-verbal sounds at the marked positions. The mixed sample (4 tags, 10.44s) demonstrates natural flow between speech and emotions. Generation time stays consistent (~150ms) regardless of tag count; the mixed sample is longer (247ms) only because the text itself is longer.
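Since an unsupported bracketed tag would be read out or mangled rather than performed, a server can validate tags before generation. The tag set below is the one documented above; the helper itself is an illustrative addition, not part of the OmniVoice API.

```python
import re

SUPPORTED_TAGS = {
    "laughter", "sigh", "confirmation-en",
    "question-ah", "question-oh", "question-ei", "question-yi",
    "surprise-ah", "surprise-oh", "surprise-wa", "surprise-yo",
    "dissatisfaction-hnn",
}

def unknown_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that OmniVoice does not support."""
    return [t for t in re.findall(r"\[([a-z-]+)\]", text)
            if t not in SUPPORTED_TAGS]
```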


Conclusion

OmniVoice is viable for real-time streaming TTS and outperforms Chatterbox on raw speed metrics. The masked-diffusion architecture prevents true token-level streaming, but sentence-level chunking achieves 169ms TTFA at 16 steps with zero playback gaps. Combined with excellent voice quality, 600+ language support, and low VRAM footprint, OmniVoice is a strong candidate for production TTS deployment.

For consulting on real-time TTS integration, streaming architecture, or AI/ML engineering, contact Folx.