What moves become possible when you represent ASR as a noisy observation model?
This explores what you gain by treating speech recognition output not as the truth of what a user said, but as one noisy clue about it — a probability distribution you carry forward instead of a string you commit to.
This reads the question as being about a modeling choice: do you treat the transcript your speech recognizer hands you as fact, or as evidence? The corpus has a clear anchor here. When real-world speech recognition runs at 15-30% error rates in noisy rooms, a system that commits to a single best transcript and follows a flowchart simply breaks. The note "Why do dialogue systems need probabilistic reasoning?" shows the move that opens up instead: maintain a belief distribution over what the user *intended*, update it as more turns arrive, and let the recognizer's mistakes wash out over time rather than derailing the conversation on the first error. The transcript becomes an observation, the intent becomes a hidden state, and the whole interaction can be modeled as a partially observable Markov decision process (POMDP).
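To make the belief update concrete, here is a minimal sketch in Python. It assumes a fixed user intent across turns and a hand-written observation model; the intent names, the `LIKELIHOOD` table, and its probabilities are all invented for illustration and come from none of the cited notes. A full POMDP would also model state transitions and system actions, which this sketch leaves out.

```python
# Hypothetical two-intent domain; the probabilities below are made up for illustration.
# LIKELIHOOD[intent][heard] approximates P(recognizer outputs `heard` | user meant `intent`).
LIKELIHOOD = {
    "book_flight": {"flight": 0.7, "hotel": 0.1, "light": 0.2},
    "book_hotel": {"flight": 0.1, "hotel": 0.8, "light": 0.1},
}

def update_belief(belief, heard, likelihood, floor=1e-6):
    """One Bayes step: posterior(intent) is proportional to P(heard | intent) * prior(intent)."""
    unnormalized = {
        # `floor` keeps an unseen recognizer output from zeroing out an intent.
        intent: likelihood[intent].get(heard, floor) * prior
        for intent, prior in belief.items()
    }
    total = sum(unnormalized.values())
    return {intent: weight / total for intent, weight in unnormalized.items()}

belief = {"book_flight": 0.5, "book_hotel": 0.5}  # uniform prior over intents
for heard in ["light", "flight"]:  # the first turn is a plausible misrecognition
    belief = update_belief(belief, heard, LIKELIHOOD)
    print(heard, {intent: round(p, 3) for intent, p in belief.items()})
```

The ambiguous first observation only nudges the belief, and the second one sharpens it. No single transcript is ever treated as the answer.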
The deeper payoff is that once uncertainty is represented rather than thrown away, *actions* become available that were impossible before. The biggest one is the option to not act. The note "Can models learn to abstain when uncertain about predictions?" shows that small models trained to know when they're unsure, and to abstain on shaky predictions, can match models ten times their size on conversation forecasting. A noisy observation model is what makes "I'm not confident enough; let me ask again" a principled move instead of an error.
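One way to picture the "option to not act" is a confidence-gated policy over the belief. This is a sketch under assumptions: the function name `choose_action`, the three action labels, and both thresholds are hypothetical, and a real system would tune or learn them rather than hard-code them.

```python
def choose_action(belief, act_threshold=0.9, confirm_threshold=0.6):
    """Pick an action from a belief over intents (dict: intent -> probability).

    Thresholds are illustrative assumptions, not values from the cited notes.
    """
    best_intent = max(belief, key=belief.get)
    confidence = belief[best_intent]
    if confidence >= act_threshold:
        return ("act", best_intent)        # confident enough to commit
    if confidence >= confirm_threshold:
        return ("confirm", best_intent)    # plausible: check with the user first
    return ("ask_again", None)             # too uncertain: abstain and re-elicit

print(choose_action({"book_flight": 0.93, "book_hotel": 0.07}))
print(choose_action({"book_flight": 0.67, "book_hotel": 0.33}))
print(choose_action({"book_flight": 0.52, "book_hotel": 0.48}))
```

With a single committed transcript there is no `confidence` to gate on, so "ask again" can only be triggered by a downstream failure.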
It also lets the system spend effort where it's warranted. "Can simple uncertainty estimates beat complex adaptive retrieval?" and "When should language models retrieve external knowledge versus use internal knowledge?" are about retrieval, not speech, but they make the same conceptual move in a different costume: when you treat your own confidence as a signal, you can decide per step whether to gather more information (ask a clarifying question, re-prompt, fetch context) or proceed. Framing the decision as a Markov decision process (act, observe, update, repeat) is exactly what carrying ASR uncertainty forward enables, and it's why these otherwise unrelated papers surface next to this question.
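The observe, update, decide loop can be sketched as follows. This is only an illustration of the per-step "gather more or proceed" decision, not DeepRAG or any method from the cited notes: it uses belief entropy as the uncertainty signal, and the `LIKELIHOOD` table, the `max_entropy` cutoff, and the return labels are all assumptions made for the example.

```python
import math

# Hypothetical observation model: P(recognizer output | true intent). Values are invented.
LIKELIHOOD = {
    "book_flight": {"flight": 0.7, "hotel": 0.1, "light": 0.2},
    "book_hotel": {"flight": 0.1, "hotel": 0.8, "light": 0.1},
}

def entropy_bits(belief):
    """Shannon entropy of the belief in bits; 0 means fully certain."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def run_dialogue(observations, likelihood, prior, max_entropy=0.5):
    """Consume observations one turn at a time; proceed once the belief is sharp enough."""
    belief = dict(prior)
    for turn, heard in enumerate(observations, start=1):
        # Observe and update: Bayes rule with a small floor for unseen outputs.
        unnormalized = {i: likelihood[i].get(heard, 1e-6) * p for i, p in belief.items()}
        total = sum(unnormalized.values())
        belief = {i: w / total for i, w in unnormalized.items()}
        # Decide: stop gathering evidence once uncertainty drops below the cutoff.
        if entropy_bits(belief) <= max_entropy:
            return ("proceed", max(belief, key=belief.get), turn)
    # Ran out of turns while still uncertain: hand off instead of guessing.
    return ("defer", None, len(observations))

prior = {"book_flight": 0.5, "book_hotel": 0.5}
print(run_dialogue(["light", "flight"], LIKELIHOOD, prior))
print(run_dialogue(["light"], LIKELIHOOD, prior))
```

Each extra turn here stands in for any information-gathering step (a clarifying question, a re-prompt, a retrieval call); the cutoff is what decides whether the step is worth taking.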
There's a complementary thread worth pulling. "Do speech models learn language-specific sounds or universal physics?" suggests the noise in the channel isn't arbitrary: self-supervised speech models recover the physical, articulatory process that generated the sound. The cleaner your generative story for *how* observations are produced, the sharper your noisy observation model can be. You're not just smearing probability mass; you're modeling the actual corruption process.
The thing you might not have expected to learn: the value here isn't really about speech at all. "Represent X as a noisy observation" is a portable design pattern. The same shift — from committing to an answer, to maintaining a distribution and deciding when to act on it — is what turns brittle pipelines into systems that can abstain, ask, defer, and self-correct. ASR is just the cleanest place to see why deterministic commitment was the problem in the first place.
Sources (5 notes)
Real-world speech recognition achieves 15-30 percent error rates in noisy environments, making deterministic flowchart dialogue systems unworkable. POMDP-based systems handle this by maintaining belief distributions over user intent rather than committing to single interpretations.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.