Glossary

Definitions of key terms in speech recognition, AI language models, audio processing, privacy, and developer voice workflows.

A

Automatic Speech Recognition (ASR)

Technology that converts spoken language into written text.

Acoustic Model

A model that maps audio features to phonemes or text tokens.

Attention Mechanism

A neural network component that allows models to focus on relevant parts of the input when producing each part of the output.

Audio Preprocessing

The set of signal processing steps applied to raw audio before it is fed into a speech recognition model.

Audio Codec

A standard for encoding and decoding digital audio, determining file format, compression, and quality.

Automatic Gain Control (AGC)

A system that automatically adjusts audio input levels to maintain consistent volume.

B

Beam Search

A decoding strategy that explores multiple candidate transcriptions simultaneously to find the most likely output.
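
A minimal sketch of the idea, using a hypothetical `toy_model` that returns per-step log-probabilities (both names are illustrative, not from any real library):

```python
import math

def beam_search(model, steps, beam_width=2):
    """Keep the `beam_width` best hypotheses at each step instead of only one."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = [
            (seq + (tok,), score + logp)
            for seq, score in beams
            for tok, logp in model(seq).items()
        ]
        # Prune back down to the highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

def toy_model(prefix):
    """Hypothetical per-step distributions: "a" looks best locally,
    but "b" leads to a better overall sequence."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"x": math.log(0.5), "y": math.log(0.5)}
    return {"x": math.log(0.9), "y": math.log(0.1)}
```

Here a greedy decoder would commit to "a" at the first step, while a beam of width 2 also keeps "b" alive and finds the higher-probability sequence ("b", "x").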

C

Connectionist Temporal Classification (CTC)

A loss function that allows training sequence-to-sequence models without requiring pre-aligned input-output pairs.

Character Error Rate (CER)

A metric that measures transcription accuracy at the character level rather than the word level.
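
CER is commonly computed as the character-level edit (Levenshtein) distance divided by the reference length; a plain-Python sketch:

```python
def levenshtein(ref, hyp):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character errors divided by the number of reference characters."""
    return levenshtein(reference, hypothesis) / len(reference)
```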

Context Window

The maximum amount of text (measured in tokens) that a language model can process in a single forward pass.

D

Decoder

The component of an ASR model that generates text output from encoded audio representations.

Data Residency

The physical or geographic location where data is stored and processed, relevant to privacy regulations and compliance.

Dictation

The practice of speaking text aloud for transcription into written form.

E

Encoder

The component of an ASR model that processes audio input and produces hidden representations.

End-to-End Speech-to-Text

An ASR approach where a single neural network directly maps audio input to text output without separate acoustic and language model components.

Embedding

A dense vector representation of text, audio, or other data that captures semantic meaning in a continuous space.

Echo Cancellation

The process of removing acoustic echo from an audio signal, preventing speaker output from being re-captured by the microphone.

Edge Computing

Processing data on or near the device where it is generated, rather than sending it to a remote cloud server.

F

Feature Extraction

The process of converting raw audio waveforms into numerical representations suitable for machine learning models.

Fine-Tuning

The process of further training a pre-trained model on a specific dataset to adapt it to a particular task or domain.

Few-Shot Learning

A model's ability to learn a new task from just a few examples provided in the prompt.

G

Greedy Decoding

A decoding strategy that selects the single most probable token at each time step.
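
In its simplest form, greedy decoding is just an argmax over each step's distribution (the list-of-dicts input here is an illustrative stand-in for a model's output):

```python
def greedy_decode(step_probs):
    """Pick the single most probable token at each time step."""
    return [max(dist, key=dist.get) for dist in step_probs]
```

Greedy decoding is fast but can miss sequences whose first token looks slightly less probable; beam search addresses exactly that.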

H

Hidden Markov Model (HMM)

A statistical model that represents sequences of observations as transitions between hidden states.

Hallucination

When a language model generates text that is fluent but factually incorrect or unsupported by the input.

I

Inference

The process of using a trained model to generate predictions or outputs from new input data.

L

Language Model

A model that assigns probabilities to sequences of words, used to improve ASR accuracy by favoring linguistically plausible transcriptions.

Large Language Model(LLM)

A neural network with billions of parameters trained on vast text corpora, capable of generating and understanding natural language.

Local Inference

Running a machine learning model on the user's own hardware rather than on a remote cloud server.

M

Mel-Frequency Cepstral Coefficients (MFCC)

A compact representation of the short-term power spectrum of an audio signal, designed to approximate human auditory perception.

Mel Spectrogram

A time-frequency representation of audio where the frequency axis is scaled to match human auditory perception.

Model Quantization

Reducing a neural network's numerical precision (e.g., from 32-bit to 8-bit or 4-bit) to decrease model size and increase inference speed.
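
A toy sketch of symmetric 8-bit quantization (real frameworks use per-channel scales and calibration, but the core idea is a single scale factor):

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats onto the int8 range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats; rounding error is at most half a scale step."""
    return [q * scale for q in quantized]
```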

N

Natural Language Processing (NLP)

The field of AI focused on enabling computers to understand, interpret, and generate human language.

Neural Network

A computational model inspired by biological neural systems, composed of interconnected layers of nodes that learn patterns from data.

Noise Cancellation

Techniques for reducing or eliminating unwanted background sounds from an audio signal.

Neural Engine

Dedicated hardware in modern processors optimized for running neural network computations efficiently.

O

On-Device Processing

Running machine learning models entirely on the user's local hardware without sending data to external servers.

P

Phoneme

The smallest unit of sound that distinguishes one word from another in a language.

Prompt Engineering

The practice of designing input instructions to guide a language model toward producing desired outputs.

Preset Stacking

Applying multiple refinement presets sequentially, where the output of one preset becomes the input to the next.

R

Recurrent Neural Network (RNN)

A neural network architecture that processes sequential data by maintaining a hidden state across time steps.

Real-Time Factor (RTF)

The ratio of processing time to audio duration, indicating how fast an ASR system transcribes speech.
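
The calculation itself is a single division; for example, transcribing 60 seconds of audio in 12 seconds gives an RTF of 0.2:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1 means faster than real time; RTF > 1 means slower."""
    return processing_seconds / audio_seconds
```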

Refinement Pipeline

A processing chain that transforms raw speech-to-text output into polished, formatted text using AI language models.

S

Speaker Diarization

The process of identifying and segmenting audio by speaker — determining who spoke when.

Spectrogram

A visual representation of the frequency content of an audio signal over time.

Speech-to-Text (STT)

The conversion of spoken audio into written text, also known as automatic speech recognition.

Sampling Temperature

A parameter that controls the randomness of a language model's output by scaling the probability distribution over tokens.
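
Concretely, logits are divided by the temperature before the softmax; a temperature below 1 sharpens the distribution and above 1 flattens it (the function name here is illustrative):

```python
import math

def sample_distribution(logits, temperature=1.0):
    """Softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```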

Streaming Inference

Generating model output incrementally, delivering tokens to the user as they are produced rather than waiting for the complete response.

Sample Rate

The number of audio samples captured per second, measured in Hertz (Hz).
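
Sample rate relates audio duration directly to data size; for raw (uncompressed) PCM audio, duration is bytes divided by rate, channel count, and sample width:

```python
def pcm_duration_seconds(num_bytes, sample_rate_hz, channels=1, bytes_per_sample=2):
    """Duration of raw PCM audio: bytes / (rate * channels * sample width)."""
    return num_bytes / (sample_rate_hz * channels * bytes_per_sample)
```

For instance, one second of 16 kHz mono 16-bit audio occupies 32,000 bytes.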

Signal-to-Noise Ratio (SNR)

The ratio of desired signal power to background noise power, typically expressed in decibels.
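
Using mean squared amplitude as the power estimate, the decibel formula is 10·log10(P_signal / P_noise):

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from two sample sequences."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)
```

A signal with 100 times the power of the background noise corresponds to 20 dB.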

T

Text-to-Speech (TTS)

The conversion of written text into spoken audio.

Transformer Model

A neural network architecture based on self-attention mechanisms that processes entire sequences in parallel.

Token

The basic unit of text that language models process — typically a word, subword, or character.

Tokenization

The process of converting raw text into a sequence of tokens that a language model can process.
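
One simple scheme, sketched here with an illustrative fixed vocabulary, is greedy longest-match segmentation into subword pieces (production tokenizers such as BPE learn the vocabulary from data):

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation into subword tokens from `vocab`."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")  # no vocabulary entry covers this character
            i += 1
    return tokens
```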

Top-k Sampling

A decoding strategy that restricts token selection to the k most probable candidates at each step.
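
The filtering step can be sketched in a few lines: keep the k largest probabilities, then renormalize before sampling:

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, then renormalize to sum to 1."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}
```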

Top-p Sampling

A decoding strategy that selects from the smallest set of tokens whose cumulative probability exceeds a threshold p.
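
Unlike top-k, the size of the kept set adapts to the shape of the distribution; a sketch:

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}
```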

Text Normalization

The process of converting text into a standardized form by handling numbers, abbreviations, punctuation, and formatting.

Transcription

The process of converting spoken audio into a written text document.

Text Refinement

Using AI to improve the quality, clarity, and formatting of text while preserving the original meaning.

V

Voice Activity Detection (VAD)

The process of detecting the presence or absence of human speech in an audio signal.

Voice Coding

Writing code using spoken commands and dictation rather than typing on a keyboard.

W

Whisper Model

An open-source, multitask speech recognition model developed by OpenAI.

Word Error Rate (WER)

The standard metric for evaluating speech recognition accuracy: the number of word substitutions, deletions, and insertions divided by the number of words in the reference transcript.
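
WER is the word-level edit (Levenshtein) distance normalized by reference length; a plain-Python sketch:

```python
def wer(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over whole words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, one inserted word against a three-word reference gives a WER of 1/3. Note that WER can exceed 1.0 when the hypothesis contains many insertions.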

Waveform

A graphical representation of an audio signal showing amplitude changes over time.

Z

Zero-Shot Learning

A model's ability to perform a task it was not explicitly trained on, using only a natural language description of the task.