Streaming Inference
Definition
Generating model output incrementally, delivering tokens to the user as they are produced rather than waiting for the complete response.
Because the model produces one token at a time, each token can be sent to the client as soon as it is decoded instead of being buffered until the full response is complete. This significantly reduces perceived latency: the user starts reading almost immediately, which matters most for long outputs.
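The token-by-token flow can be sketched with a plain Python generator. The model call here is a stand-in (a real backend would run one decode step per iteration); the point is that the consumer receives and displays each token the moment it is yielded.

```python
import time

def generate_tokens(prompt):
    # Stand-in for a real model: yields one token at a time.
    # A real inference server would run a decode step per iteration.
    for token in ["Streaming", " lets", " users", " read", " output", " early", "."]:
        time.sleep(0.05)  # simulate per-token decode latency
        yield token

def stream_response(prompt):
    pieces = []
    for token in generate_tokens(prompt):
        print(token, end="", flush=True)  # deliver immediately, no buffering
        pieces.append(token)
    print()
    return "".join(pieces)
```

With buffering, the user would wait the full 0.35 s before seeing anything; with streaming, the first token appears after roughly 50 ms.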
Ummless uses streaming inference for text refinement — as the language model refines your transcription, you see the polished text appear word by word. This provides immediate feedback and lets you stop generation early if the output is going in the wrong direction. Streaming also enables progressive rendering of the refined text in the UI.
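Early stopping falls out of the same loop: because the consumer sees tokens as they arrive, it can simply stop iterating when the user cancels. This is a hypothetical sketch, not Ummless's actual pipeline; `stop_check` stands in for whatever cancellation signal the UI exposes.

```python
def refine_streaming(tokens, stop_check):
    """Consume a token stream, rendering progressively until stopped.

    tokens     -- any iterable yielding string tokens (hypothetical source)
    stop_check -- callable taking the tokens shown so far; True means
                  the user cancelled generation (hypothetical signal)
    """
    shown = []
    for token in tokens:
        if stop_check(shown):
            break  # abandon generation early; later tokens are never produced
        shown.append(token)
        # a UI would re-render "".join(shown) here after each token
    return "".join(shown)
```

Stopping the consumer also lets the server cancel the remaining decode steps, saving compute that a buffered response would have spent on output the user never wanted.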