Speaker Diarization

Definition

The process of partitioning an audio recording by speaker: determining who spoke when.

Speaker diarization answers the question 'who spoke when?' in a multi-speaker audio recording. It involves detecting speaker changes, clustering speech segments by speaker, and labeling each segment with a speaker tag. The tags are typically anonymous (for example, 'Speaker 1' and 'Speaker 2'), which distinguishes diarization from speaker identification (matching a voice to a specific known person) and from speech recognition (determining what was said).
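As a rough illustration of the labeling step, the sketch below turns a sequence of per-window speaker labels into 'who spoke when' turns by merging consecutive windows that share a label. The `Turn` class, the `labels_to_turns` function, and the fixed one-second window are hypothetical choices made for this example, not part of any particular diarization toolkit.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # anonymous speaker tag, e.g. "A" or "B"
    start: float   # turn start time in seconds
    end: float     # turn end time in seconds


def labels_to_turns(labels, window_s=1.0):
    """Merge consecutive windows with the same label into speaker turns.

    Assumes labels[i] covers the interval [i * window_s, (i + 1) * window_s).
    """
    turns = []
    for i, label in enumerate(labels):
        start = i * window_s
        end = start + window_s
        if turns and turns[-1].speaker == label:
            # Same speaker as the previous window: extend the current turn.
            turns[-1].end = end
        else:
            # Speaker change (or first window): open a new turn.
            turns.append(Turn(label, start, end))
    return turns


# Toy input: six one-second windows with made-up labels.
turns = labels_to_turns(["A", "A", "B", "B", "B", "A"])
for t in turns:
    print(f"{t.start:4.1f}-{t.end:4.1f}s  speaker {t.speaker}")
```

Each boundary between two turns corresponds to a detected speaker change. A real system would derive the per-window labels from the audio rather than take them as given.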

Modern diarization systems combine neural speaker embeddings (such as x-vectors or embeddings from ECAPA-TDNN models) with clustering algorithms such as agglomerative hierarchical clustering or spectral clustering. Diarization is essential for meeting transcription, interview processing, and any scenario where multiple speakers need to be distinguished in the output transcript.
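The embedding-plus-clustering idea can be sketched in a few lines. The example below is a minimal illustration, not a production method: it uses hand-written 2-D vectors in place of real neural embeddings (which have hundreds of dimensions), and a simple centroid-based agglomerative clustering with cosine similarity. The function names and the 0.8 similarity threshold are assumptions chosen for this toy data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy agglomerative clustering on cluster centroids.

    Repeatedly merges the two clusters whose centroids are most similar,
    stopping when the best similarity falls below `threshold`.
    Returns one integer cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            ci = centroid([embeddings[k] for k in clusters[i]])
            for j in range(i + 1, len(clusters)):
                cj = centroid([embeddings[k] for k in clusters[j]])
                sim = cosine(ci, cj)
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break  # remaining clusters are treated as different speakers
        i, j = pair
        clusters[i] += clusters.pop(j)

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for k in members:
            labels[k] = label
    return labels


# Toy "embeddings" for five speech segments (illustrative values only).
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = cluster_embeddings(segments)
print(labels)
```

Because the number of speakers is usually unknown in advance, this sketch stops merging at a similarity threshold instead of requiring a fixed cluster count. The right threshold depends on the embedding model and would need tuning on real data.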
