Text-to-Speech (TTS)
Definition
The conversion of written text into spoken audio.
Text-to-speech is the inverse of speech-to-text (STT): it synthesizes natural-sounding speech from text input. Modern TTS systems use neural networks to generate realistic speech and can imitate specific voices, emotions, and speaking styles.
TTS is relevant to the STT field because advances in one domain often inform the other. Both rely on similar audio representations (such as mel spectrograms), similar model architectures (such as transformers), and the same kind of training data (recorded speech paired with its text). Understanding TTS helps explain concepts such as phoneme modeling, prosody, and the audio feature representations used throughout speech technology.
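To make the shared audio representation concrete, the following is a minimal sketch of how a log-mel spectrogram can be computed with NumPy alone. It is illustrative rather than the method of any particular TTS or STT system. The function names, the HTK-style mel formula, and the parameter values (16 kHz sample rate, 400-sample window, 160-sample hop, 40 mel bands) are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style mel formula (an assumed convention; other variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr, n_fft=400, hop=160, n_mels=40):
    # Slice the signal into overlapping frames and apply a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum per frame, shape (n_frames, n_fft // 2 + 1)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Pool FFT bins into mel bands, then compress with a log
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # small offset avoids log(0)

# Synthetic input: one second of a 440 Hz tone standing in for recorded speech
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
features = log_mel_spectrogram(tone, sr)
print(features.shape)  # (98, 40): 98 frames, 40 mel bands
```

In this framing, an STT model would take a matrix like `features` as input, while many neural TTS systems predict such a matrix from text and pass it to a vocoder to produce a waveform.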