Local Inference
Definition
Local inference is the execution of model predictions on the user's device — their laptop, desktop, or phone. This contrasts with cloud inference, where data is sent to a remote server for processing. Local inference offers privacy (data stays on-device), lower latency (no network round-trip), and offline capability.
The feasibility of local inference depends on model size, hardware capabilities, and the accuracy trade-off acceptable for the task. Apple's Neural Engine, for example, enables high-quality local speech recognition. For language models, smaller quantized models can run locally, though the largest LLMs still require cloud infrastructure. Ummless uses local inference for speech-to-text and cloud inference for AI text refinement, balancing privacy with capability.
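The model-size constraint above can be made concrete with a back-of-the-envelope memory check. The sketch below is illustrative only: the function names, the 7B/70B example sizes, and the 50% RAM headroom are assumptions, and the estimate counts only quantized weights, ignoring activations and KV cache.

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: params x bits / 8 bits-per-byte.

    Ignores activation memory and KV cache, so it is a lower bound.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


def fits_locally(params_billions: float, bits_per_weight: int,
                 device_ram_gb: float, headroom: float = 0.5) -> bool:
    """True if the quantized weights fit within a fraction of device RAM.

    The 0.5 headroom factor is a hypothetical default, leaving room
    for the OS and for inference-time buffers.
    """
    return model_memory_gb(params_billions, bits_per_weight) <= device_ram_gb * headroom


# A 7B model quantized to 4 bits needs ~3.5 GB of weights and fits
# comfortably on a 16 GB laptop; a 70B model at 16 bits (~140 GB) does not.
print(fits_locally(7, 4, 16))    # → True
print(fits_locally(70, 16, 16))  # → False
```

This is why quantization is the usual lever for local inference: halving bits per weight halves the memory footprint, turning a cloud-only model into one that fits on consumer hardware.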