AWS Enables Real-Time Voice Applications with SageMaker AI Bidirectional Streaming and vLLM

According to aws.amazon.com, Amazon Web Services launched bidirectional streaming for real-time inference on Amazon SageMaker AI in November 2025. The feature lets data stream continuously in both directions between clients and model containers, addressing the latency limitations of traditional request-response inference for voice applications.

The announcement highlights integration with vLLM, an inference engine that now supports real-time audio transcription through its Realtime API, which uses WebSockets for bidirectional streaming between client and server, according to aws.amazon.com. AWS demonstrated the capability by deploying Voxtral-Mini-4B-Realtime-2602, described as Mistral AI’s compact real-time speech model, to a SageMaker AI endpoint using a vLLM container.

According to aws.amazon.com, this configuration creates “a fully managed, speech-to-text service where audio flows in and transcription flows back in real time.” The company states the technology supports use cases including voice agents, live captioning, contact center analytics, and accessibility tools. A complete implementation example is available in AWS’s GitHub repository.
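The "audio in, transcription back" pattern can be illustrated with a minimal sketch that is independent of the AWS repository and does not call SageMaker AI or vLLM. It assumes two in-memory asyncio queues standing in for the two directions of a WebSocket connection, and a hypothetical `fake_transcriber` that returns a placeholder string per chunk instead of running a speech model. All function names here are illustrative.

```python
import asyncio


async def send_audio(chunks, uplink):
    """Client side: push audio chunks one at a time, as if captured live."""
    for chunk in chunks:
        await uplink.put(chunk)
        await asyncio.sleep(0)  # yield control so other tasks can run mid-stream
    await uplink.put(None)  # end-of-stream sentinel (an assumption of this sketch)


async def fake_transcriber(uplink, downlink):
    """Hypothetical stand-in for the model container: one result per chunk."""
    while True:
        chunk = await uplink.get()
        if chunk is None:
            await downlink.put(None)  # propagate end-of-stream to the client
            return
        await downlink.put(f"partial<{len(chunk)} bytes>")


async def receive_text(downlink, received):
    """Client side: collect partial results as they arrive."""
    while True:
        message = await downlink.get()
        if message is None:
            return
        received.append(message)


async def run_session(chunks):
    # Two queues model the two directions of one bidirectional connection.
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    received = []
    await asyncio.gather(
        send_audio(chunks, uplink),
        fake_transcriber(uplink, downlink),
        receive_text(downlink, received),
    )
    return received


def demo():
    chunks = [b"\x00" * 320 for _ in range(3)]  # three dummy audio chunks
    return asyncio.run(run_session(chunks))
```

The sender and receiver run as separate tasks, so the client does not have to finish uploading before it starts reading results. A real client would replace the queues with a network connection and the placeholder strings with actual transcription output.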

The announcement emphasizes that traditional request-response inference is unsuited to real-time speech applications because “transcription cannot begin until the entire audio recording has been received,” according to aws.amazon.com. That wait adds latency incompatible with real-time requirements.
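The latency argument can be made concrete with a rough back-of-the-envelope model. The sketch below is an illustration under stated assumptions, not a measurement: it assumes processing time is a fixed percentage of the audio duration, ignores network overhead, and uses made-up values for recording length, chunk size, and processing cost.

```python
def first_result_ms_request_response(audio_ms: int, processing_pct: int) -> int:
    """Time until the first text appears when the whole recording is sent first.

    The client must wait for the full recording, then for it to be processed.
    """
    return audio_ms + audio_ms * processing_pct // 100


def first_result_ms_streaming(chunk_ms: int, processing_pct: int) -> int:
    """Time until the first text appears when audio is streamed in chunks.

    Only the first chunk has to be captured and processed.
    """
    return chunk_ms + chunk_ms * processing_pct // 100


# Hypothetical numbers: a 30 s recording, 500 ms chunks,
# and processing that takes 10% of the audio duration.
batch = first_result_ms_request_response(audio_ms=30_000, processing_pct=10)
stream = first_result_ms_streaming(chunk_ms=500, processing_pct=10)
print(batch, stream)  # 33000 550
```

Under these assumed values, the first transcript arrives after about 33 seconds in the request-response case and after about half a second when streaming. The gap also grows with recording length, since the streaming figure depends only on chunk size.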