Tech At Bloomberg

Bloomberg’s AI Researchers Turn Whisper into a True Streaming ASR Model at Interspeech 2025

August 18, 2025

During the 26th edition of the Interspeech Conference (Interspeech 2025) this week in Rotterdam, Netherlands, researchers from Bloomberg’s AI Engineering group are showcasing their expertise in speech recognition with their paper “Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding.”

In their research, the Bloomberg team worked with a group of researchers from the WeNet Open Source Community to “teach” Whisper, a powerful speech-to-text AI system from OpenAI, to handle live audio streams with minimal accuracy degradation. Since Whisper is designed to work on whole recordings rather than live audio, it usually struggles with tasks like transcribing meetings or phone calls as they happen. To overcome this, the research team added a second “quick-listen” system (called a CTC decoder) that produces fast, partial transcripts while a person is speaking, and then used Whisper’s original “careful-listen” system to clean them up whenever a pause is detected. They also gave the quick-listen part a smaller set of “word pieces” to work with, which made it quicker to train and better at recognizing unusual words. Testing on company earnings calls and public speech datasets demonstrated that the new version can keep up with speech in real time — even on regular CPUs — while still delivering accurate, well-formatted transcripts.
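For readers who want a more concrete picture, here is a minimal sketch of that streaming flow. The function and variable names (stream_transcribe, is_pause, and so on) are illustrative assumptions made for this example, not the actual Bloomberg/WeNet implementation.

```python
# Illustrative sketch only: hypothetical helper names, not the paper's code.
def stream_transcribe(audio_chunks, encoder, ctc_decoder, whisper_decoder, is_pause):
    """Emit fast partial transcripts, then refine them whenever a pause is detected."""
    buffered_states = []   # encoder output accumulated since the last pause
    partial_text = ""

    for chunk in audio_chunks:                    # e.g., a fraction of a second of audio at a time
        buffered_states.append(encoder(chunk))    # causal / chunk-wise encoding

        # First pass: quick "draft" transcript from the CTC decoder.
        partial_text = ctc_decoder(buffered_states)
        yield ("partial", partial_text)

        # Second pass: at a pause, the original Whisper decoder re-reads the
        # buffered audio and polishes the draft into the final transcript.
        if is_pause(chunk):
            yield ("final", whisper_decoder(buffered_states, draft=partial_text))
            buffered_states, partial_text = [], ""
```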

We asked the paper’s lead author to explain why their work is notable in advancing the state-of-the-art in speech science and technology:

Wednesday, August 20, 2025

Session 8 – Streaming ASR
17:20-17:40 CEST

Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding
Haoran Zhou (Bloomberg, WeNet), Xingchen Song (WeNet), Brendan Fahy (Bloomberg), Qiaochu Song (WeNet), Binbin Zhang (WeNet), Zhendong Peng (WeNet), Anshul Wadhawan (Bloomberg), Denglin Jiang (Bloomberg), Apurv Verma (Bloomberg), Vinay Ramesh (Bloomberg), Srivas Prasad (Bloomberg) and Michele M. Franceschini (Bloomberg)

Click to read "Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding," published in the Interspeech 2025 Proceedings.

Please summarize your research. Why are your results notable?

Haoran Zhou: The original Whisper works only on full recordings, so it cannot deliver live captions without modification. We added a quick “draft” pass that uses a newly trained CTC decoder to output rough text as speech arrives. A second pass with the Whisper decoder then polishes it once additional context is available. We also slimmed down the draft stage’s vocabulary so it learns faster using less data. The result is close to Whisper’s transcription quality, but it now runs in real time on CPUs with a clear, tunable delay.
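To make the second pass concrete, here is a minimal sketch of how a draft hypothesis could be rescored by the attention decoder, assuming the first pass has already produced an n-best list of (token_ids, ctc_score) pairs. The interfaces shown (rescore_with_attention, attention_decoder returning per-step logits under teacher forcing) are assumptions for illustration, not the paper’s actual code.

```python
import torch

def rescore_with_attention(encoder_out, nbest, attention_decoder, ctc_weight=0.5):
    """Pick the best first-pass hypothesis after rescoring with the attention decoder.

    `nbest` holds (token_ids, ctc_score) pairs from the first-pass CTC search;
    `attention_decoder` is assumed to return per-step logits under teacher forcing.
    """
    best_ids, best_score = None, float("-inf")
    for token_ids, ctc_score in nbest:
        logits = attention_decoder(encoder_out, token_ids)     # shape: (len(token_ids), vocab)
        log_probs = torch.log_softmax(logits, dim=-1)
        att_score = float(sum(log_probs[t, tok] for t, tok in enumerate(token_ids)))

        # Interpolate the streaming (CTC) score with the rescoring (attention) score.
        score = ctc_weight * ctc_score + (1.0 - ctc_weight) * att_score
        if score > best_score:
            best_ids, best_score = token_ids, score
    return best_ids, best_score
```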

You tested this on earnings calls. What makes financial conversations especially challenging for automatic transcription?

Financial calls pack in dense, domain-specific language – think about company and product names, ticker symbols and acronyms – that ordinary speech models haven’t seen before. Executives speaking on these calls often read out long strings of numbers, dates, or percentages in rapid succession. There’s also frequent back-and-forth Q&A with overlaps, interruptions, and phone-quality audio. All of this adds up to a tough mix of unfamiliar words, fast pacing, and muddy recordings that makes accurate, low-latency transcription especially challenging.

How does your research advance the state-of-the-art in the field of automatic speech recognition (ASR)?

Our work pushes ASR forward by taking Whisper — originally designed for transcribing full recordings — and converting it into a true streaming model that delivers near-offline accuracy with low, predictable delay using standard CPUs. We do this by embedding Whisper in a Unified Two-Pass (U2) framework, in which a lightweight, causally-masked CTC decoder emits draft transcripts as audio arrives, and the original attention decoder then rescores them for high quality. We further introduce a “hybrid” tokenizer that shrinks the CTC token set for data-efficient fine-tuning while retaining Whisper’s full vocabulary for reranking. This is the first work that turns Whisper into a true streaming model.
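As a rough illustration of the hybrid tokenizer idea, the sketch below pairs a compact token set for the CTC branch with Whisper’s full vocabulary for rescoring. The class and method names (HybridTokenizer, encode_for_ctc, to_whisper_ids) are assumptions made for this example, not the paper’s actual interfaces.

```python
class HybridTokenizer:
    """Sketch of the hybrid tokenizer idea (illustrative interfaces only)."""

    def __init__(self, small_tokenizer, whisper_tokenizer):
        self.small = small_tokenizer    # reduced vocabulary for the CTC branch
        self.full = whisper_tokenizer   # Whisper's original vocabulary

    def encode_for_ctc(self, text):
        # Targets for fine-tuning the CTC decoder: a smaller output layer
        # means fewer parameters to learn, so less training data is needed.
        return self.small.encode(text)

    def to_whisper_ids(self, ctc_token_ids):
        # Decode the first-pass draft back to text, then re-tokenize it with
        # Whisper's own tokenizer so the attention decoder can rescore it.
        text = self.small.decode(ctc_token_ids)
        return self.full.encode(text)
```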