Glossary
Definitions of key terms in speech recognition, AI language models, audio processing, privacy, and developer voice workflows.
A
Automatic Speech Recognition (ASR): Technology that converts spoken language into written text.
Acoustic Model: A model that maps audio features to phonemes or text tokens.
Attention Mechanism: A neural network component that allows models to focus on relevant parts of the input when producing each part of the output.
Audio Preprocessing: The set of signal processing steps applied to raw audio before it is fed into a speech recognition model.
Audio Codec: A standard for encoding and decoding digital audio, determining file format, compression, and quality.
Automatic Gain Control (AGC): A system that automatically adjusts audio input levels to maintain consistent volume.
B
Beam Search: A decoding strategy that explores multiple candidate transcriptions simultaneously to find the most likely output.
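As an illustrative sketch (toy per-step probabilities, not from a real model), the core of beam search fits in a few lines of Python:

```python
import math

# Toy per-step log-probabilities over a 3-token vocabulary (illustrative values).
STEP_LOGPROBS = [
    {"a": math.log(0.6), "b": math.log(0.3), "</s>": math.log(0.1)},
    {"a": math.log(0.2), "b": math.log(0.5), "</s>": math.log(0.3)},
]

def beam_search(steps, beam_width=2):
    """Keep the beam_width highest-scoring partial hypotheses at each step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for dist in steps:
        candidates = [
            (seq + [tok], score + logp)
            for seq, score in beams
            for tok, logp in dist.items()
        ]
        # Prune to the best beam_width hypotheses before the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

best_sequence, best_score = beam_search(STEP_LOGPROBS)[0]  # → ["a", "b"]
```

Greedy decoding is the special case `beam_width=1`; wider beams trade speed for a better chance of finding the globally most likely transcription.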
C
Connectionist Temporal Classification (CTC): A loss function that allows training sequence-to-sequence models without requiring pre-aligned input-output pairs.
Character Error Rate (CER): A metric that measures transcription accuracy at the character level rather than the word level.
Context Window: The maximum amount of text (measured in tokens) that a language model can process in a single forward pass.
D
Decoder: The component of an ASR model that generates text output from encoded audio representations.
Data Residency: The physical or geographic location where data is stored and processed, relevant to privacy regulations and compliance.
Dictation: The practice of speaking text aloud for transcription into written form.
E
Encoder: The component of an ASR model that processes audio input and produces hidden representations.
End-to-End ASR: An ASR approach where a single neural network directly maps audio input to text output without separate acoustic and language model components.
Embedding: A dense vector representation of text, audio, or other data that captures semantic meaning in a continuous space.
Echo Cancellation: The process of removing acoustic echo from an audio signal, preventing speaker output from being re-captured by the microphone.
Edge Computing: Processing data on or near the device where it is generated, rather than sending it to a remote cloud server.
F
Feature Extraction: The process of converting raw audio waveforms into numerical representations suitable for machine learning models.
Fine-Tuning: The process of further training a pre-trained model on a specific dataset to adapt it to a particular task or domain.
Few-Shot Learning: A model's ability to learn a new task from just a few examples provided in the prompt.
G
Greedy Decoding: A decoding strategy that selects the single most probable token at each time step.
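A minimal sketch over toy per-step distributions (illustrative values):

```python
def greedy_decode(step_probs):
    """Select the single most probable token at each time step."""
    return [max(dist, key=dist.get) for dist in step_probs]

# Toy per-step probability distributions (illustrative values).
steps = [{"h": 0.7, "x": 0.3}, {"i": 0.6, "y": 0.4}]
greedy_decode(steps)  # → ["h", "i"]
```

Greedy decoding is fast but can miss sequences whose overall probability is higher; beam search addresses this by keeping several hypotheses alive.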
H
Hidden Markov Model (HMM): A statistical model that represents sequences of observations as transitions between hidden states.
Hallucination: When a language model generates text that is fluent but factually incorrect or unsupported by the input.
I
Inference: The process of using a trained model to generate predictions or outputs from new input data.
L
Language Model: A model that assigns probabilities to sequences of words, used to improve ASR accuracy by favoring linguistically plausible transcriptions.
Large Language Model (LLM): A neural network with billions of parameters trained on vast text corpora, capable of generating and understanding natural language.
Local Inference: Running a machine learning model on the user's own hardware rather than on a remote cloud server.
M
Mel-Frequency Cepstral Coefficients (MFCC): A compact representation of the short-term power spectrum of an audio signal, designed to approximate human auditory perception.
Mel Spectrogram: A time-frequency representation of audio where the frequency axis is scaled to match human auditory perception.
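Both mel spectrograms and MFCCs rely on the mel scale; a minimal sketch of the common HTK-style conversion formula:

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse conversion: mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

hz_to_mel(1000.0)  # ≈ 1000 mels, by construction of the scale
```

The scale is roughly linear below 1 kHz and logarithmic above it, so higher frequencies are compressed, mirroring how human hearing resolves pitch.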
Model Quantization: Reducing a neural network's numerical precision (e.g., from 32-bit to 8-bit or 4-bit) to decrease model size and increase inference speed.
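A minimal sketch of the core arithmetic, using symmetric 8-bit quantization (function names are illustrative):

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.03]
quantized, scale = quantize_int8(weights)  # → [50, -127, 3]
restored = dequantize(quantized, scale)    # close to the originals
```

Real schemes (per-channel scales, zero points, 4-bit groupings) are more involved; this shows only the size-for-precision trade at the heart of the technique.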
N
Natural Language Processing (NLP): The field of AI focused on enabling computers to understand, interpret, and generate human language.
Neural Network: A computational model inspired by biological neural systems, composed of interconnected layers of nodes that learn patterns from data.
Noise Reduction: Techniques for reducing or eliminating unwanted background sounds from an audio signal.
Neural Processing Unit (NPU): Dedicated hardware in modern processors optimized for running neural network computations efficiently.
O
On-Device Processing: Running machine learning models entirely on the user's local hardware without sending data to external servers.
P
Phoneme: The smallest unit of sound that distinguishes one word from another in a language.
Prompt Engineering: The practice of designing input instructions to guide a language model toward producing desired outputs.
Preset Chaining: Applying multiple refinement presets sequentially, where the output of one preset becomes the input to the next.
R
Recurrent Neural Network (RNN): A neural network architecture that processes sequential data by maintaining a hidden state across time steps.
Real-Time Factor (RTF): The ratio of processing time to audio duration, indicating how fast an ASR system transcribes speech.
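The ratio is simple enough to express directly (a sketch; the function name is illustrative):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means the system transcribes faster than real time."""
    return processing_seconds / audio_seconds

real_time_factor(15.0, 60.0)  # → 0.25: a minute of audio transcribed in 15 s
```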
Refinement Pipeline: A processing chain that transforms raw speech-to-text output into polished, formatted text using AI language models.
S
Speaker Diarization: The process of identifying and segmenting audio by speaker — determining who spoke when.
Spectrogram: A visual representation of the frequency content of an audio signal over time.
Speech-to-Text (STT): The conversion of spoken audio into written text, also known as automatic speech recognition.
Sampling Temperature: A parameter that controls the randomness of a language model's output by scaling the probability distribution over tokens.
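The scaling can be sketched as a temperature-scaled softmax (pure-Python, illustrative):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature before normalizing.
    T < 1 sharpens the distribution; T > 1 flattens it."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = softmax_with_temperature([2.0, 1.0], temperature=0.5)
flat = softmax_with_temperature([2.0, 1.0], temperature=2.0)
```

With low temperature the top token dominates (nearly deterministic output); with high temperature the distribution approaches uniform (more varied, riskier output).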
Streaming: Generating model output incrementally, delivering tokens to the user as they are produced rather than waiting for the complete response.
Sample Rate: The number of audio samples captured per second, measured in Hertz (Hz).
Signal-to-Noise Ratio (SNR): The ratio of desired signal power to background noise power, typically expressed in decibels.
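The decibel conversion is a one-line formula (a sketch; powers here are illustrative numbers, not measured values):

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(signal_power / noise_power)

snr_db(100.0, 1.0)  # → 20.0 dB
```

Each 10 dB step corresponds to a tenfold power ratio, which is why clean dictation audio and noisy café audio can differ by tens of dB.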
T
Text-to-Speech (TTS): The conversion of written text into spoken audio.
Transformer: A neural network architecture based on self-attention mechanisms that processes entire sequences in parallel, powering modern language models and speech recognition systems.
Token: The basic unit of text that language models process — typically a word, subword, or character.
Tokenization: The process of converting raw text into a sequence of tokens that a language model can process.
Top-k Sampling: A decoding strategy that restricts token selection to the k most probable candidates at each step.
Top-p (Nucleus) Sampling: A decoding strategy that selects from the smallest set of tokens whose cumulative probability exceeds a threshold p.
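Both filters can be sketched over a toy next-token distribution (illustrative values; real samplers operate on logits over full vocabularies):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, then renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
top_k_filter(dist, k=2)    # keeps "the" and "a"
top_p_filter(dist, p=0.7)  # also keeps "the" and "a" (0.5 + 0.3 ≥ 0.7)
```

Top-k uses a fixed cutoff regardless of how confident the model is; top-p adapts, keeping fewer tokens when the distribution is peaked and more when it is flat.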
Text Normalization: The process of converting text into a standardized form by handling numbers, abbreviations, punctuation, and formatting.
Transcription: The process of converting spoken audio into a written text document.
Text Refinement: Using AI to improve the quality, clarity, and formatting of text while preserving the original meaning.
V
Voice Activity Detection (VAD): The process of detecting the presence or absence of human speech in an audio signal.
Voice Coding: Writing code using spoken commands and dictation rather than typing on a keyboard.
W
Whisper: An open-source, multitask speech recognition model developed by OpenAI.
Word Error Rate (WER): The standard metric for evaluating speech recognition accuracy, calculated as the ratio of errors to total reference words.
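WER is usually computed from the word-level Levenshtein (edit) distance; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1,  # insertion
                          substitution)     # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

word_error_rate("the cat sat", "the cat sat down")  # → 1/3 (one insertion)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.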
Waveform: A graphical representation of an audio signal showing amplitude changes over time.
Z
Zero-Shot Learning: A model's ability to perform a task it was not explicitly trained on, using only a natural language description of the task.