When AI Hosts Hallucinate: Failure Modes We've Seen and How Three-Tier Review Catches Them

By the KAVANA engineering team — June 2026

The word hallucination in AI discussions almost always refers to factual hallucination: a language model that confidently states something false. A chatbot that invents a court case. A research assistant that fabricates a citation. This failure mode is real, broadly understood, and the subject of considerable engineering effort.

In broadcast, we deal with a different class of AI failure that is less widely discussed because it does not exist outside of speech synthesis: acoustic and prosodic hallucination. The AI host that says accurate words in a voice that makes them sound wrong. The synthesis that reads a year as a compound number when broadcast convention calls for a digit-by-digit reading. The prosody that collapses on a sentence with three embedded clauses and emerges sounding like it was read by someone who did not understand it. The speaker disfluency that appears from nowhere in the middle of an otherwise clean segment.

These failures do not show up in a text transcript. They cannot be caught by a factual verification pipeline. They require someone to listen, or a system designed specifically to detect acoustic anomalies, or both. This post describes the acoustic and prosodic failure modes we have encountered in production, how they interact with factual failures, and how the three-tier review architecture we use catches them at different stages.

The Specific Incident That Shaped Our Thinking

In the early phase of deploying AI hosts at a regional station, we had an incident that we now use internally as a reference case for what prosodic failure actually sounds like in practice.

The AI host had been performing well for several weeks — clean synthesis, natural rhythm, occasional minor artifacts that reviewers were catching at the first review stage. Then a new voice model was deployed to the synthesis server, a version that had been updated to improve naturalness on conversational content. The update was a genuine improvement for that content type.

What the update did not handle well was a particular pattern that appeared frequently in broadcast scripts: the four-digit year read in context. In Chinese broadcast language, "2025年" is spoken as "二零二五年" (literally "two zero two five year"): four individual digits followed by the year character. The previous model had learned this pattern from training data and executed it reliably. The new model, optimized for a different domain, had a different prior for how to decompose the year sequence. In about one in three occurrences, it rendered "2025年" as a compound number, "两千零二十五年" ("two thousand and twenty-five year"), which is not technically incorrect as a reading of the numeral but is jarring to a broadcast audience, sounds unnatural, and, in the context of a news segment about current events, signals immediately that the voice is not a trained broadcaster.
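One possible mitigation, sketched here under assumptions rather than taken from KAVANA's pipeline, is to spell out years in the script before it reaches the synthesis model, so the digit-by-digit reading no longer depends on the model's prior. The function name `spell_out_years` and the matching pattern are illustrative, and the sketch assumes that every four-digit number followed by 年 in a script is a calendar year.

```python
import re

# Digit-to-character table for digit-by-digit reading; 0 is read as 零 here.
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))

# Exactly four ASCII digits, not preceded by another digit, followed by 年.
# Assumption: in broadcast scripts this shape is always a calendar year.
YEAR_PATTERN = re.compile(r"(?<![0-9])([0-9]{4})年")


def spell_out_years(script: str) -> str:
    """Rewrite each four-digit year as individual digit characters plus 年."""
    def replace(match: re.Match) -> str:
        return "".join(DIGITS[d] for d in match.group(1)) + "年"

    return YEAR_PATTERN.sub(replace, script)


print(spell_out_years("2025年的报告"))  # 二零二五年的报告
```

Normalizing in text moves the decision out of the voice model, but it only covers patterns someone has already thought to write a rule for.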

We caught this before the segment aired, but not because our review process was working as designed. We caught it because the second-tier reviewer happened to be listening closely to a segment that contained three year references and noticed that the first sounded natural and the second did not. When they listened back, the inconsistency was obvious. But if the segment had contained only one year reference and the reviewer had been moving quickly through a review queue, it would likely have passed.

After this incident we made several changes. This post describes the broader framework those changes produced.

A Taxonomy of AI Broadcast Hallucinations

Based on our production experience, we group AI broadcast failures into three categories. The grouping is not academic — it maps directly to which review stage catches them.

Category one: phonetic and rendering failures. These are failures in how the synthesis model converts text to phonemes and phonemes to audio. The year incident above is a phonetic failure. Others in this category: numbers rendered as digit sequences in contexts where a compound number is natural speech ("125" read as "一二五", "one two five", rather than "一百二十五", "one hundred twenty-five"), or rendered as the correct compound but with unnatural pauses at each morpheme boundary; proper nouns rendered with incorrect tones in tonal languages, which changes meaning; abbreviations read as letter sequences rather than as words; units of measurement read in the wrong order relative to the numeric value.

These failures are text-visible in the sense that, if you know what to look for, you can read the transcript and predict where they might occur. A segment with many numbers, dates, technical abbreviations, or specialized vocabulary is a segment with higher phonetic failure risk. First-tier review, which in our system checks the transcript alongside the audio, is the primary catch for this category.
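Because this risk is visible in the text, a rough profile can be computed from the script before anyone listens. The sketch below is illustrative only: the `RiskProfile` class, the regular expressions, and the threshold of three risky tokens are assumptions, not a description of KAVANA's scoring.

```python
import re
from dataclasses import dataclass

YEAR = re.compile(r"(?<![0-9])[0-9]{4}年")
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?")
ABBREVIATION = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")


@dataclass
class RiskProfile:
    numbers: int
    years: int
    abbreviations: int

    @property
    def high_risk(self) -> bool:
        # Illustrative threshold, not a tuned value.
        return self.numbers + self.years + self.abbreviations >= 3


def profile_script(script: str) -> RiskProfile:
    """Count the token types that tend to trigger phonetic failures."""
    years = len(YEAR.findall(script))
    # Count every digit run, then subtract the ones already counted as years.
    numbers = len(NUMBER.findall(script)) - years
    abbreviations = len(ABBREVIATION.findall(script))
    return RiskProfile(numbers, years, abbreviations)


profile = profile_script("2025年GDP增长5.2%，CPI上涨0.3%")
print(profile, profile.high_risk)
```

A profile like this can only say where to listen harder. It cannot say whether the audio is actually wrong.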

Category two: prosodic failures. Prosody is the rhythm, stress, and intonation of spoken language. A prosodic failure is synthesis where the words are correct and the phonetics are correct but the delivery is wrong in a way that makes the content harder to understand or sounds unnatural. Common prosodic failures in current synthesis models: sentence-final intonation rising when it should fall, suggesting a question where there is none; stress falling on the wrong syllable in a multi-syllable compound, causing the listener to parse the compound incorrectly; long sentences with embedded clauses where the synthesis runs out of prosodic structure budget and flattens the second half of the sentence into monotone; unnatural pauses inside a clause where the model has segmented the sentence at a grammatical boundary that is not also a natural speech boundary.

Prosodic failures are not predictable from text alone. They require listening. They are also, in our experience, the failure mode that most undermines listener trust, because a human listener's first response to unnatural prosody is not "this is a technical artifact" — it is "this person doesn't know what they're talking about." For a broadcast host, loss of listener trust in the voice is a significant problem regardless of whether the content is factually accurate.

Category three: disfluency and acoustic artifacts. These are failures in the acoustic output: clicks, pops, level irregularities at segment boundaries, breath artifacts in the wrong places, or abrupt changes in room character that signal an edit point. Most synthesis models produce these at low rates; modern post-processing pipelines reduce them further. But they occur, and in a segment that is otherwise clean, a single click at minute 2:47 is audible and distracting. This category also includes the failure mode where synthesis quality degrades over the course of a long segment — the first 30 seconds are clean, the model's generation quality decreases, and the final 30 seconds sound noticeably different from the opening.

How This Differs From Factual Hallucination

The broadcast industry's concern about factual hallucination in AI-generated content is legitimate. An AI host that reads a news summary and fabricates a quote, invents a statistic, or misattributes a statement is causing real harm that a human host would not cause (a human host reads what they were given; an AI host may generate something different from what the script contained).

But factual hallucination in broadcast AI is a more tractable problem than prosodic hallucination for one reason: the source text is available. An AI news synthesis pipeline generates a script from source material. The factual verification step compares the generated script against the source. Fabricated quotes, invented statistics, and misattributed statements are detectable by text comparison against the source. This does not mean the problem is solved — current automated fact-checking is imperfect and the edge cases are hard — but there is a defined approach with known efficacy.

Prosodic hallucination has no text-comparison equivalent. The transcript of a segment with a prosodic failure is identical to the transcript of the same segment delivered correctly. The failure exists only in the acoustic domain. Automated prosodic evaluation exists — there are models trained to assess naturalness and predict human ratings of synthesis quality — but their reliability on the specific failure modes we encounter in broadcast content (numbers, specialized vocabulary, broadcast script structures) is lower than their general naturalness metrics suggest.

The practical consequence is that human listening is not optional in the review of AI broadcast content. It cannot be replaced by transcript review, factual verification, or general naturalness scoring. The first-tier review in our system requires listening to the audio, not just reading the transcript. This is a time cost that cannot be automated away with current technology.

First-Tier Review: What the Listener Is Actually Checking

The first tier of KAVANA's three-tier review is the stage where the audio is heard by a human reviewer for the first time. For AI-generated content, the first-tier reviewer's checklist is different from the checklist for human-produced content. The difference is not just procedural — it reflects the specific failure modes that AI synthesis produces.

The first-tier checklist for AI content includes explicit prompts for the phonetic and prosodic failure categories. The reviewer is not just listening for "does this sound right overall" — they are checking specific items: are all numbers rendered naturally in context; are all proper nouns and specialized terms pronounced correctly; does the prosody track the intended meaning of each sentence; are there any segments where the delivery suggests the synthesis model did not parse the sentence correctly; are there any acoustic artifacts. The checklist is informed by the content's risk profile: a segment with many numbers and abbreviations gets explicit flagging at the review stage.

This prompted checking is important because the failure modes of AI synthesis are subtle enough that a reviewer listening for a general sense of quality will miss individual instances. A human reviewer listening to a four-minute segment and evaluating it holistically will rate it as "mostly fine" if there is one prosodic failure at the two-minute mark. A reviewer who is checking specifically for prosodic anomalies will catch it.

The first-tier reviewer also checks against the source material for factual content. AI-generated news synthesis is compared against the source feed. AI-generated traffic summaries are compared against the data feed. Discrepancies are flagged. First-tier review is the primary factual gate, but it is not the only one.

Second-Tier Review: Automated Assist and Compliance Scan

The second tier in our system is primarily a compliance review — checking that the content conforms to the broadcast code for the applicable regulatory framework. For AI-generated content it has an additional function: an automated quality scan that runs alongside the human compliance review.

The automated scan does not replace human listening. It assists it. The scanner generates a technical report on the segment: level measurements, silence detection, confidence scores from the acoustic model for segment-level naturalness, flagging of any time positions where the naturalness score drops below a threshold. The second-tier reviewer sees this report alongside the content. The report's value is not as a pass/fail gate — its false-positive rate is too high for that — but as a guide to where in the segment the human reviewer should listen more carefully.

This scan is where the disfluency and acoustic artifact category is most reliably caught. The level measurements detect gain anomalies. The silence detection catches unexpected pauses. The naturalness score drop-offs, while not reliable enough for automated rejection, are a consistent signal for where to focus attention. The click at minute 2:47 that a casual listen might miss will show up as a naturalness score anomaly in the report, and the reviewer's attention will be directed to that point.
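The "guide, not gate" role of the report can be sketched as follows. This is a minimal illustration that assumes the scanner emits one naturalness score per fixed-length window. The threshold of 0.6, the one-second window, and the names `Flag`, `flag_low_naturalness`, and `format_position` are all hypothetical.

```python
from typing import List, NamedTuple


class Flag(NamedTuple):
    start_s: float
    end_s: float
    min_score: float


def flag_low_naturalness(scores: List[float], threshold: float = 0.6,
                         window_s: float = 1.0) -> List[Flag]:
    """Merge consecutive below-threshold windows into flagged time spans."""
    flags: List[Flag] = []
    run_start = None
    run_min = 0.0
    for i, score in enumerate(scores):
        if score < threshold:
            if run_start is None:
                run_start, run_min = i, score
            else:
                run_min = min(run_min, score)
        elif run_start is not None:
            flags.append(Flag(run_start * window_s, i * window_s, run_min))
            run_start = None
    if run_start is not None:  # the segment ended inside a low-score run
        flags.append(Flag(run_start * window_s, len(scores) * window_s, run_min))
    return flags


def format_position(seconds: float) -> str:
    """Render a time offset as m:ss for the reviewer's report."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


scores = [0.9] * 167 + [0.4, 0.5] + [0.9] * 10
for flag in flag_low_naturalness(scores):
    print("listen closely at", format_position(flag.start_s))  # 2:47
```

The output is a list of positions for a human to check, which matches the way the report is used above: nothing in it accepts or rejects a segment.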

The compliance check at second tier also covers a disclosure requirement that is increasingly relevant for AI broadcast content: the requirement in several regulatory frameworks to identify AI-synthesized audio as such. Where this requirement applies, the second-tier compliance checklist includes verification that the appropriate disclosure language is present. The KAVANA review system tracks whether synthesis was used in content production and surfaces this flag for the second-tier reviewer automatically.

Third-Tier Review: What the Final Authorizer Is Actually Confirming

Third-tier review is the final authorization before air. For AI-generated content, the third-tier authorizer is not doing a complete fresh listening pass — that would make the three-tier process prohibitively slow for high-volume AI content production. The third tier is a structured final check: confirm that first and second tier reviews are completed and signed; review any flags or reservations entered at the previous tiers; verify that the scheduled air time and content type match the authorization parameters; listen to the segment opening and any flagged positions.

The third-tier authorizer's job is to take ownership of the air decision. Not to audit the previous reviews, but to confirm that the process was followed and to make a judgment call on any flagged items that the previous reviewers marked as borderline.

In practice, third-tier review for AI content most often catches the failure mode where the first-tier reviewer flags something as "notable but probably acceptable" and the second-tier reviewer agrees without listening as carefully. The third-tier authorizer, seeing a flag from both previous tiers pointing at the same position in the segment, will listen to that position specifically. The combined flagging pattern makes the anomaly visible in a way that neither tier's individual flag would alone.

This escalation function is structurally important for AI content because the individual failure modes — a single prosodic anomaly, a slightly unnatural number rendering — are each ambiguous in isolation. An experienced reviewer can convince themselves that each individual anomaly is within the normal variation of broadcast quality. When three different points in a segment each have minor quality flags, the pattern suggests a systematic synthesis issue that warrants either re-synthesis or additional scrutiny before air.
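The two escalation signals described above, agreement between tiers on one position and several minor flags spread across a segment, can be sketched like this. The `ReviewFlag` structure, the two-second tolerance, and the three-position threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReviewFlag:
    tier: int          # 1 or 2: the tier that entered the flag
    position_s: float  # offset into the segment, in seconds
    note: str


def positions_flagged_by_both(flags: List[ReviewFlag],
                              tolerance_s: float = 2.0) -> List[float]:
    """Tier-one positions that a tier-two flag also points at."""
    tier1 = [f for f in flags if f.tier == 1]
    tier2 = [f for f in flags if f.tier == 2]
    return sorted({
        f1.position_s
        for f1 in tier1
        for f2 in tier2
        if abs(f1.position_s - f2.position_s) <= tolerance_s
    })


def suggests_systematic_issue(flags: List[ReviewFlag], tolerance_s: float = 2.0,
                              min_positions: int = 3) -> bool:
    """True when the flags cover at least min_positions distinct positions."""
    distinct: List[float] = []
    for pos in sorted(f.position_s for f in flags):
        if not distinct or pos - distinct[-1] > tolerance_s:
            distinct.append(pos)
    return len(distinct) >= min_positions


flags = [
    ReviewFlag(1, 47.0, "year reading, probably acceptable"),
    ReviewFlag(2, 48.0, "agree, borderline"),
    ReviewFlag(1, 120.0, "flat delivery on long sentence"),
    ReviewFlag(2, 200.5, "naturalness score dip"),
]
print(positions_flagged_by_both(flags))   # [47.0]
print(suggests_systematic_issue(flags))   # True
```

Either result only tells the authorizer where to listen and whether re-synthesis is worth considering. The air decision stays with the person.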

What Gets Through Despite the Three Tiers

We would not be serving the topic honestly without acknowledging the failure modes of the review system itself.

Time pressure is the primary enemy of careful review at all three tiers. A high-volume AI content pipeline producing content for multiple time slots creates a review queue that can be genuinely difficult to clear before the scheduled air time. Reviewers who are behind on the queue will work faster and listen less carefully. The systematic checks help — a prompted checklist is more resistant to time pressure than unstructured listening — but they do not eliminate it.

Voice model updates are a recurring risk. As we described in the year-rendering incident, a synthesis model update that improves one content type can introduce new failure modes in another. Our current practice is to run a quality check suite against a standard test corpus whenever a voice model is updated, before deploying the update to production. The test corpus is designed to cover the phonetic and prosodic patterns that we know are fragile. This catches some regressions. It does not catch novel failure modes that are not represented in the test corpus.
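A check suite of this kind might look like the sketch below. It is not KAVANA's actual suite: `TEST_CORPUS`, `regression_failure_rates`, and the `render_reading` hook (a callable that returns the reading a model produced for a piece of script text, for example from its text front end or from a transcription of the synthesized audio) are hypothetical. Each case is rendered several times because the year failure appeared in only about one in three occurrences, so a single pass could easily miss it.

```python
from typing import Callable, Dict

# Hypothetical corpus: script text mapped to the reading the station expects.
TEST_CORPUS: Dict[str, str] = {
    "2025年": "二零二五年",
    "1998年": "一九九八年",
}


def regression_failure_rates(render_reading: Callable[[str], str],
                             corpus: Dict[str, str],
                             repeats: int = 10) -> Dict[str, float]:
    """Render each corpus entry `repeats` times and report the miss rate."""
    rates: Dict[str, float] = {}
    for text, expected in corpus.items():
        misses = sum(render_reading(text) != expected for _ in range(repeats))
        rates[text] = misses / repeats
    return rates


def blocking_regressions(rates: Dict[str, float],
                         max_rate: float = 0.0) -> Dict[str, float]:
    """Entries whose miss rate exceeds the allowed rate (assumed 0 here)."""
    return {text: rate for text, rate in rates.items() if rate > max_rate}
```

Repeating each case turns an intermittent failure into a measurable rate, but the suite still only covers the patterns someone has put in the corpus.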

Systematic drift in a given voice model's handling of a specific content pattern can be slow enough to be imperceptible in day-to-day review and visible only when comparing production samples across weeks. We do periodic systematic comparisons for exactly this reason, but the cadence is monthly rather than continuous. A drift that begins three days after the last systematic comparison may not be noticed for weeks.

We maintain a public record of significant production incidents at github.com/kavanafm/kavana-incident-reports. The transparency is deliberate: the broadcast industry benefits from shared knowledge of AI synthesis failure modes, and the discipline of publicly documenting our incidents has been a useful forcing function for thorough post-incident analysis.

The Specific Risk of Unreviewed AI Content at Scale

The failure modes we have described are manageable when the review system is functioning as designed. They become more concerning as AI content volume increases and review capacity does not scale proportionally.

The temptation in high-volume AI content production is to reduce review overhead by automating more of the review process. We understand this temptation because we have felt it internally. The automated quality scan at second tier exists partly because we wanted to reduce the human time required. But we have consistently found that the irreducible human listening requirement at first tier cannot be replaced without a significant increase in the rate of failures reaching air.

The AI host capabilities in KAVANA are designed with the review process as a non-optional component of the production pipeline. Content produced by the AI host enters the review queue; it does not go directly to the schedule. This is a design choice that slows the pipeline but maintains the quality gate. Stations that want to produce AI content at volume need to staff their first-tier review function accordingly, not try to route around it.
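The "queue, not schedule" rule can be expressed as a small gate. This is a sketch under assumptions and not KAVANA's implementation: the `Segment` class, its fields, and the `schedule` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Segment:
    segment_id: str
    synthesized: bool  # surfaced so tier two can check AI disclosure
    signoffs: Set[int] = field(default_factory=set)

    def sign(self, tier: int) -> None:
        """Record a sign-off; a tier may sign only after all earlier tiers."""
        if tier not in (1, 2, 3):
            raise ValueError(f"unknown review tier: {tier}")
        missing = [t for t in range(1, tier) if t not in self.signoffs]
        if missing:
            raise PermissionError(f"tier {tier} blocked, missing tiers {missing}")
        self.signoffs.add(tier)

    @property
    def cleared_for_air(self) -> bool:
        return self.signoffs == {1, 2, 3}


def schedule(segment: Segment, slot: str) -> str:
    """Place a segment on the schedule only if all three tiers have signed."""
    if not segment.cleared_for_air:
        raise PermissionError(f"{segment.segment_id} has not cleared review")
    return f"{segment.segment_id} scheduled for {slot}"


segment = Segment("news-0700", synthesized=True)
for tier in (1, 2, 3):
    segment.sign(tier)
print(schedule(segment, "07:00"))  # news-0700 scheduled for 07:00
```

Putting the check inside the scheduling call, rather than in a separate step that callers are trusted to run, is what makes the gate hard to route around.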

For stations that are new to AI host deployment, the most valuable thing we can offer is not the synthesis quality — that improves continuously — but the review workflow that catches the failures the synthesis quality does not prevent. The AI Three Gods documentation covers the review configuration in detail. Stations with specific questions about configuring the review system for their content type or volume are welcome to contact us at international@kavanafm.com.

KAVANA is developed by Hunan ShengGuang Technology Co., Ltd. (湖南声广科技有限公司), incorporated 2012, team active since 2005. We hold a broadcast production and distribution license (湘字第00565号) and operate under Chinese cybersecurity Level 3 certification. Technical documentation and open specifications: github.com/kavanafm.