Why a Broadcast-Grade AI Radio Host Isn't Just TTS in a Fancy Wrapper

By the KAVANA engineering team — June 2026

Every few months someone sends us a link to a demo video: a radio station that has replaced its human DJ with an AI voice. The demo sounds impressive. The voice is natural, the transitions are smooth, the audio quality is clean. The person sending the link usually adds: "this is what you should be building."

We have been building AI voice systems for broadcast since before most of those demo projects existed. Our honest response to the comparison is: the demo is doing text-to-speech. We are doing something different. This post explains what that difference actually is, why it matters technically, and what it costs to close the gap.

TTS Reads Sentences. A Broadcast Host Reads a Clock.

This is the most important distinction and the least obvious one if you have not worked in broadcast.

A text-to-speech system has one job: given a string of text, produce plausible audio that represents that text. The input is text. The output is audio. The system does not need to know what time it is, what segment of the program clock is active, whether this is a news break or a music sweep, whether the station is running long or short, or whether the previous segment ended cleanly or with an abrupt cut.

A broadcast host, human or AI, reads a clock. A broadcast program clock is a structured template that defines what kind of content plays when, in what order, and under what timing constraints. A music format might specify: 6 minutes of music, a 30-second liner, 4 minutes of music, 2 minutes of news, a sponsor mention, then music. The host's job is not just to read words. It is to fit within that clock, to adjust length based on where the clock actually is, to deliver content in a way that serves the function of each clock position, and to make the whole thing sound planned even when it is being assembled live.
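To make the idea of a clock concrete, here is a minimal Python sketch of the example format as a data structure. It is an illustration only, not KAVANA's internal representation: the names Slot, CLOCK, slot_at and adjusted_target are hypothetical, the sponsor mention length is an assumed value, and the trailing music block is left out because the example gives it no duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    kind: str      # content type for this clock position
    seconds: int   # planned duration


# The example music format from the text.
CLOCK = [
    Slot("music", 360),
    Slot("liner", 30),
    Slot("music", 240),
    Slot("news", 120),
    Slot("sponsor", 15),  # assumed length; the text does not give one
]


def slot_at(clock, elapsed):
    """Return the active slot and the seconds left in it at `elapsed` seconds."""
    start = 0
    for slot in clock:
        end = start + slot.seconds
        if elapsed < end:
            return slot, end - elapsed
        start = end
    raise ValueError("elapsed time is past the end of the clock")


def adjusted_target(slot, drift_seconds):
    """Target length for a slot given drift (positive means running long)."""
    return max(0, slot.seconds - drift_seconds)
```

For example, slot_at(CLOCK, 370) lands 10 seconds into the liner with 20 seconds left, and a station running 4 seconds long would ask for a 26-second liner instead of a 30-second one.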

That requires different architecture at every level: the system that generates the script must know the clock position and target duration; the voice synthesis must be calibrated to the delivery style appropriate for that position; the output must be validated against the clock before it is committed to playout. A TTS wrapper around a language model can generate plausible-sounding radio content. It cannot read a clock without a significant amount of infrastructure around it.

The KAVANA AI Host system — which we call AI Three Gods internally, reflecting the three synthesis pipelines it orchestrates — is built around the clock as a first-class concept. Every synthesis request includes the clock position, the target duration range, and the adjacent content context. The output is validated against the clock before the file is written to the playout queue.
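A rough sketch of what "validated against the clock" can mean in code follows. The field and function names are assumptions made for illustration and do not describe the actual KAVANA API; the point is only that the request carries the clock position, a duration range, and the neighboring content, and that audio outside the range is rejected before it reaches playout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisRequest:
    text: str
    clock_position: str   # e.g. "liner", "news", "time_call"
    min_seconds: float    # shortest acceptable audio
    max_seconds: float    # longest acceptable audio
    previous_item: str    # adjacent content context
    next_item: str


def validate_against_clock(request, audio_seconds):
    """Return (ok, reason) for a rendered file of `audio_seconds` length."""
    if audio_seconds < request.min_seconds:
        return False, f"short by {request.min_seconds - audio_seconds:.1f}s"
    if audio_seconds > request.max_seconds:
        return False, f"long by {audio_seconds - request.max_seconds:.1f}s"
    return True, "fits"
```

A caller would typically re-synthesize (with a shorter script or a different speaking rate) when validation fails, rather than trimming the audio after the fact.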

Prosody: The Gap Between Readable and Listenable

General-purpose TTS systems are optimized for intelligibility. The voice should be clear, the words should be understandable, and the pacing should suit a reading context, which typically means the rate and prosody of a person reading a document aloud to an attentive listener.

Broadcast prosody is different. It has been different for eighty years, through AM, FM, and digital, because the listening context is different. Radio listeners are not sitting quietly attending to the content. They are driving, working, cooking, exercising. Broadcast prosody is engineered to maintain attention in an environment with competing stimuli, to create energy appropriate to the daypart, to signal transitions between content types, and to create a sense of momentum that keeps the listener from reaching for the dial.

The specific prosodic markers differ by format and daypart. A morning drive host on a Top 40 station uses a delivery pattern that would sound completely wrong on a classical music station at 11 PM. A news reader's prosody is calibrated differently from a music host's. An emergency announcement has specific prosodic requirements that are essentially the opposite of normal broadcast style — slower, more measured, more explicit — because the communication priority shifts from maintaining attention to ensuring comprehension.

None of this is captured in a general-purpose TTS model. The models are trained on a distribution of human speech that includes broadcast content but is dominated by conversational and documentary speech. They produce plausible speech but not broadcast speech.

We address this through what we call scene-level voice design: for each of the nine broadcast scenes we support, we have tuned the synthesis parameters — speaking rate, pitch range, emphasis patterns, pause placement — specifically for that scene's listening context. We also expose these parameters to stations that want to adjust them for their specific format. This is not a simple slider. It is a configuration space that our engineers have spent considerable time calibrating against real broadcast output.
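A per-scene configuration of this kind might be sketched as below. Every number and pattern name is a placeholder chosen for illustration, not one of our calibrated values, and SceneVoice, DEFAULTS and voice_for are hypothetical names. The only relationship taken from the text is that emergency delivery is slower and more measured than normal broadcast style.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneVoice:
    rate: float         # relative speaking rate, 1.0 = model default
    pitch_range: float  # relative pitch excursion, 1.0 = model default
    emphasis: str       # named emphasis pattern
    pause_ms: int       # pause inserted at phrase boundaries


# Placeholder defaults for three of the nine scenes.
DEFAULTS = {
    "music_liner": SceneVoice(1.10, 1.25, "energetic", 120),
    "news": SceneVoice(1.00, 0.95, "neutral", 250),
    "emergency": SceneVoice(0.85, 0.80, "explicit", 400),
}


def voice_for(scene, overrides=None):
    """Scene defaults with optional station-level overrides applied."""
    # dataclasses.replace raises TypeError on an unknown field name,
    # so a typo in a station override fails loudly instead of silently.
    return replace(DEFAULTS[scene], **(overrides or {}))
```

A station that wants a slightly brisker news read could call voice_for("news", {"rate": 1.05}) and leave every other parameter at the scene default.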

Nine Scenes, Nine Different Engineering Problems

Broadcast is not one thing. The engineering requirements for voice synthesis differ significantly across the nine scene types we support in KAVANA AI Host. Here is an honest account of what makes each one technically distinct.

Time call. This sounds like the simplest possible broadcast task: announce the time. It is actually one of the more demanding ones because the time call is a reference point in the listener's experience — it has to be precise to the second, it has to be delivered with a specific confidence and authority, and it has to work within a narrow duration window. A time call that runs 40% long is unusable. The synthesis must also handle the edge cases: top of the hour versus quarter hour versus half hour each carry different conventional phrasing in different broadcast cultures.

Station identification. Legal ID must appear within a specific time window relative to the top of the hour in most broadcast regulatory frameworks. The delivery has to be authoritative. The text is short and fixed, which means any prosody flaw is very audible. This is one of the scenes where voice cloning from the actual station voice talent produces significantly better results than a generic synthesized voice.

News segment. Multiple story items, each with different subject matter and emotional register. The host needs to transition between a story about a local government budget and a story about a regional flood without inappropriate affect carryover. The pacing needs to allow comprehension without feeling slow. Duration management is critical — if stories are running long, the synthesis needs to know and adjust accordingly.

Weather segment. Similar to news but with a more constrained vocabulary and a strong listener expectation about format. The challenge is making a structured data dump — temperature, precipitation probability, wind speed — sound like natural speech rather than a recited list.

Music host liner. This is where broadcast AI voices most often fail. The music liner needs energy, personality, and timing relative to the music it brackets. It needs to land on a specific beat, which means the synthesis duration has to be predictable to within a fraction of a second. General TTS produces variable-length audio with variable prosody; music host liners require consistent duration and calibrated energy.

Sponsor mention. Regulatory and ethical requirements vary by jurisdiction. The synthesized voice must be clearly identified as automated in any jurisdiction that requires this. The content must match the approved copy exactly, with no paraphrasing — this is both a legal requirement and a contractual one.

Traffic and travel. High information density, rapidly changing underlying data, strict duration constraints. The synthesis must handle alphanumeric strings (road designations, junction numbers) correctly. Different broadcast cultures have different conventions for how traffic information is phrased.

Cultural and community content. For the regional and community stations that make up a large part of our customer base, this includes local event announcements, community notices, and format-specific content that may be in a minority language or dialect. This is where generic TTS models fail most obviously: they were not trained on this content and the errors are immediately audible to local listeners.

Emergency announcement. Different delivery requirements, potentially different voice, different prosody. In some regulatory frameworks, emergency content must be identifiably different from normal programming. The synthesis pipeline has a separate configuration for emergency content that deliberately breaks from the station's normal voice characteristics to signal the change in information type.
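Of the scene requirements above, the sponsor mention's exact-copy rule is the easiest to show in code. The sketch below assumes the check is applied to the script text just before synthesis and that whitespace differences are the only tolerated variation; both are assumptions for illustration, and the function names are hypothetical.

```python
def _collapse(text):
    """Collapse runs of whitespace so line wrapping does not count as a change."""
    return " ".join(text.split())


def matches_approved_copy(approved, script):
    """True only if the script is the approved copy, ignoring whitespace."""
    return _collapse(approved) == _collapse(script)


def require_approved_copy(approved, script):
    """Return the text to synthesize, or refuse if the copy was altered."""
    if not matches_approved_copy(approved, script):
        raise ValueError("sponsor script differs from approved copy")
    return _collapse(approved)
```

The check is deliberately strict: a changed word, a dropped brand name, or altered punctuation all cause a refusal, which is the safer failure mode when the copy is contractually fixed.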

What We Actually Use Under the Hood, and Why

We use multiple synthesis backends, and the choice of which backend handles which scene is not arbitrary.

Alibaba Cloud CosyVoice 3 is our primary cloud synthesis backend. It produces high-quality Chinese-language broadcast speech with good prosody control and reliable duration prediction. We use it for scenes where latency is not critical (content can be synthesized in advance) and where the cloud round-trip is acceptable.

For local GPU inference, we use our OmniVoice pipeline — our internal name for the production instance of CosyVoice 2 that runs on GPU hardware at the station or at a regional hub. Local inference eliminates the cloud latency and the data transmission, which matters for content that contains potentially sensitive local information and for stations with poor internet connectivity. The AI listening interface gives stations a way to preview and verify synthesized content before it goes to playout.

MiniMax is our third pipeline, used primarily for voice cloning use cases where the station has provided voice samples and wants synthesis in a cloned voice. MiniMax's multi-speaker synthesis quality for cloned voices is, in our testing, currently ahead of the alternatives for the languages we support.
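The selection logic described in the last three paragraphs can be summarized in a short sketch. The precedence between the rules (cloning first, then local, then cloud) is an assumption made for the example, and the names Backend, Job and choose_backend are hypothetical rather than part of the product.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    COSYVOICE_CLOUD = "cosyvoice-3-cloud"
    OMNIVOICE_LOCAL = "omnivoice-local"
    MINIMAX_CLONE = "minimax-clone"


@dataclass(frozen=True)
class Job:
    cloned_voice: bool = False      # station supplied voice samples
    sensitive: bool = False         # content should not leave the site
    latency_critical: bool = False  # cannot wait for a cloud round-trip
    link_ok: bool = True            # internet connection is usable


def choose_backend(job):
    """Pick a synthesis backend for one job (assumed rule order)."""
    if job.cloned_voice:
        return Backend.MINIMAX_CLONE
    if job.sensitive or job.latency_critical or not job.link_ok:
        return Backend.OMNIVOICE_LOCAL
    return Backend.COSYVOICE_CLOUD
```

Content that can be synthesized in advance over a working connection falls through to the cloud backend; anything sensitive, urgent, or produced behind a poor link stays on the local GPU.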

We do not use ElevenLabs, Azure Speech, OpenAI TTS, Bark, or Coqui in our production pipeline, though we have evaluated all of them. The reasons differ by product. ElevenLabs produces excellent voice quality but its architecture is optimized for on-demand cloud synthesis at the API level, not for the tight integration with a broadcast clock that we need, and its pricing structure is not viable for the station volumes we serve. Azure Speech has good API design and reliable SLAs but its Chinese-language voice quality for broadcast prosody is behind the local models. OpenAI TTS is designed for conversational assistant use cases and shows it in the prosody. Bark and Coqui are interesting research systems; production stability for broadcast is not where they need to be.

The honest comparison is that ElevenLabs is better than us for use cases where you want a high-quality synthesized voice for a one-off production. We are better than ElevenLabs for use cases where you need a synthesis system that understands broadcast clock structure, runs locally, integrates with a playout system, and is priced for a county radio station's budget rather than an enterprise content production team.

Voice Cloning: Where the Ethical and Legal Lines Actually Are

Voice cloning is technically feasible. The question of whether it is appropriate in a given context is more complicated, and we have spent considerable internal time on this.

The straightforward case: a station wants to synthesize content in the voice of a professional voice actor who has been hired specifically for this purpose and has contractually agreed to it. This is a standard commercial transaction with clear legal footing. We support it.

The more complicated case: a station wants to clone the voice of an existing on-air personality — a human DJ or news reader who has been working at the station for years and whose voice is associated with the station's identity. The legal status of this varies significantly by jurisdiction. In most frameworks, the on-air personality owns their voice, and using it without consent for AI synthesis purposes is a rights violation regardless of whether the station "owns" the recordings that could be used for training. The employment contract may or may not address this — most broadcast employment contracts predate voice cloning as a practical technology and are silent on the question.

We do not enable voice cloning of identified individuals in our system without a documented consent chain. This is not purely an ethical position. It is also a legal risk position. A station that clones an employee's voice without clear authorization is exposed to legal claims that are novel but far from certain to fail. We are not going to create that exposure for our customers.
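A default-deny consent gate could look something like the sketch below. It does not describe our actual consent verification architecture: the record fields and the cloning_allowed function are hypothetical, chosen only to show that cloning stays disabled unless a current, unrevoked consent record exists for the speaker.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ConsentRecord:
    speaker_id: str
    document_ref: str             # pointer to the signed agreement
    valid_from: date
    valid_until: Optional[date]   # None means open-ended
    revoked: bool = False


def cloning_allowed(speaker_id, records, today):
    """True only if some unrevoked record covers this speaker on `today`."""
    for record in records:
        if record.speaker_id != speaker_id or record.revoked:
            continue
        started = record.valid_from <= today
        not_expired = record.valid_until is None or today <= record.valid_until
        if started and not_expired:
            return True
    return False  # no record, expired, or revoked: cloning stays off
```

The absence of a record is treated the same as a refusal, so a station cannot enable a cloned voice simply because it holds recordings of the speaker.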

The additional complication is AI disclosure. Several jurisdictions now require that AI-synthesized broadcast content be disclosed as such to listeners. The regulatory landscape here is moving fast and is inconsistent across markets. We have built disclosure capabilities into the system — the ability to insert a standard disclosure announcement at configurable intervals — but the specific requirements are the station's responsibility to understand and configure for their jurisdiction.
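As an illustration of "configurable intervals", the sketch below picks the segment boundaries at which a disclosure announcement would be inserted. Placing disclosures only at segment starts, and always disclosing before the first segment, are assumptions for the example; the function name is hypothetical and the correct interval is whatever the station's jurisdiction requires.

```python
def disclosure_points(segment_starts, interval_seconds):
    """Segment start times (in seconds) that should be preceded by a disclosure.

    A disclosure is scheduled before the first segment, then before the
    first later segment that starts at least `interval_seconds` after the
    previous disclosure.
    """
    points = []
    last = None
    for start in sorted(segment_starts):
        if last is None or start - last >= interval_seconds:
            points.append(start)
            last = start
    return points
```

With segments starting at 0, 360, 390, 630, 750, 1800 and 2000 seconds and a 15-minute interval, disclosures fall before the segments at 0 and 1800 seconds.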

We have published our position on voice ethics and the technical architecture of consent verification as part of our technical documentation. We think the broadcast industry needs clearer standards here and we are willing to participate in developing them.

The Cost Structure Is Genuinely Different

We want to address this directly because it comes up in every evaluation conversation.

ElevenLabs, Azure Speech, and similar cloud TTS services charge per character or per audio hour. At low volumes this is fine. At broadcast production volumes, where a station produces multiple hours of synthesized content per day across multiple voices and segments, the per-unit cost becomes a meaningful ongoing line item.

Our pricing model is different. The software license covers the synthesis pipeline. The incremental cost for synthesis is the cost of running local GPU inference (electricity, amortized hardware cost) plus any cloud API costs for the cloud synthesis pipelines you choose to use. For a station that runs primarily local inference, the marginal cost of synthesis approaches zero after the hardware is paid for.

The upfront cost is higher. The GPU hardware required for local inference is a capital purchase that cloud synthesis does not require. Whether the total cost of ownership is better or worse depends on the station's volume, its internet reliability, its data residency requirements, and its access to capital for hardware purchase versus ongoing OpEx budget for cloud services.

We do not claim to be cheaper in every scenario. We claim that for stations with high production volumes, local data residency requirements, or budget structures that favor capital expenditure over ongoing subscription costs, our cost model is often significantly better. We encourage stations evaluating us to model their specific situation rather than accepting a general claim.
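One simple way to model the comparison is a monthly break-even calculation like the sketch below. All figures in the usage note are invented placeholders, not our prices or any vendor's prices, and the model is deliberately crude: it treats electricity as a flat monthly amount and ignores financing, maintenance, and any cloud API usage on the local side.

```python
def monthly_cloud_cost(audio_hours, price_per_audio_hour):
    """Pure pay-per-use cost for one month of synthesized audio."""
    return audio_hours * price_per_audio_hour


def monthly_local_cost(hardware_cost, amortization_months,
                       power_per_month, license_per_month):
    """Fixed monthly cost of running local inference."""
    return hardware_cost / amortization_months + power_per_month + license_per_month


def breakeven_audio_hours(hardware_cost, amortization_months,
                          power_per_month, license_per_month,
                          price_per_audio_hour):
    """Monthly audio hours above which local inference costs less than cloud."""
    local = monthly_local_cost(hardware_cost, amortization_months,
                               power_per_month, license_per_month)
    return local / price_per_audio_hour
```

With placeholder inputs of 36000 for hardware amortized over 36 months, 50 per month for power, 150 per month for the license, and a cloud price of 10 per audio hour (all in the same currency), the break-even point is 120 audio hours per month. A station producing less than that would pay less with cloud synthesis under these assumptions; one producing more would pay less with local inference.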

What This Actually Sounds Like

The fair question at the end of all of this is: does the output actually sound good?

We have samples on the KAVANA listening interface that represent production output across several of our broadcast scene types. They are real output from the production system, not curated demos. We think the quality for Chinese-language broadcast content is competitive with the best available alternatives. For other languages, the quality depends on the underlying synthesis model; our pipeline can integrate with local language models, but we do not ship voice models for languages outside our current customer base.

The honest answer is: listen for yourself. The samples are there. If you are evaluating AI synthesis for your station, listen to them the way your listeners will — in the car, with the window down, at the normal listening volume — not through studio monitors in a quiet room. Broadcast audio is designed for one context. It should be judged in that context.

Reach us at international@kavanafm.com with questions about specific use cases, languages, or technical integration requirements. We try to give direct answers rather than directing you to a sales process.

KAVANA is developed by Hunan ShengGuang Technology Co., Ltd. (湖南声广科技有限公司), incorporated 2012, team active since 2005. We hold a broadcast production and distribution license (湘字第00565号) and operate under Chinese cybersecurity Level 3 certification. Technical documentation and open specifications: github.com/kavanafm.