Text Normalization
Definition
The process of converting text into a standardized form by handling numbers, abbreviations, punctuation, and formatting.
Text normalization transforms inconsistent text into a canonical form. In speech recognition, this includes converting numbers between spelled-out and digit forms ('twenty-three' to '23', or the reverse), expanding abbreviations, standardizing punctuation, and handling special cases such as dates, times, currencies, and URLs.
In text refinement, normalization ensures consistent formatting throughout the output. Ummless applies normalization as part of its refinement pipeline, converting spoken forms to the appropriate written forms based on context and the preset configuration.
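To make the idea of preset-dependent output concrete, here is a minimal Python sketch in which a preset decides whether small integers appear as words or digits. The Preset class, its spell_out_below field, and the apply_number_style function are hypothetical names invented for this example; they are not Ummless's actual configuration or API.

```python
import re
from dataclasses import dataclass

SMALL_WORDS = ["zero", "one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine", "ten"]


@dataclass(frozen=True)
class Preset:
    """Hypothetical preset; the fields are illustrative, not real Ummless settings."""
    name: str
    spell_out_below: int  # integers smaller than this are written as words


# Standalone integers only: skips decimals such as 3.5 and tokens such as v2.
_INTEGER_RE = re.compile(r"(?<![\w.,])\d+(?!\w|[.,]\d)")


def apply_number_style(text: str, preset: Preset) -> str:
    """Rewrite standalone integers according to the preset's number style."""
    def render(match: re.Match) -> str:
        value = int(match.group())
        if value < preset.spell_out_below and value < len(SMALL_WORDS):
            return SMALL_WORDS[value]
        return match.group()  # leave larger numbers as digits
    return _INTEGER_RE.sub(render, text)


prose = Preset(name="prose", spell_out_below=10)
technical = Preset(name="technical", spell_out_below=0)
```

With these two presets, apply_number_style("We shipped 3 builds and fixed 12 bugs.", prose) gives "We shipped three builds and fixed 12 bugs.", while the technical preset leaves the sentence unchanged. The same input therefore yields different canonical forms depending on the configuration in effect.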