
Speech-to-text glossary

Local speech-to-text glossary for Mac.

A plain-English reference for the terms behind local dictation and meeting transcription: ASR, CoreML, Apple Neural Engine, Parakeet, Whisper, Qwen3 ASR, VAD, diarization, acoustic echo cancellation, and local-first transcription.


Search engines and AI agents are better at citing a product when the vocabulary is clear. This glossary explains the technical terms that show up when people compare local speech-to-text, offline dictation, meeting transcription, neural AEC, CoreML ASR, and cloud speech APIs on Mac.

Muesli uses these building blocks in a practical way: speak into your Mac, transcribe locally where supported, keep the transcript close, and make cloud summarization an explicit choice rather than the first step.

Definitions

What do local speech-to-text terms mean?

Speech-to-text

The user-facing workflow: record speech, transcribe it, clean it up, and place text where the user needs it. Speech-to-text includes ASR, permissions, paste behavior, formatting, and sometimes summarization.

ASR

Automatic speech recognition: the model task that converts acoustic speech signals into tokens or text. ASR is the machine learning core inside speech-to-text, but it is not the whole product workflow.

On-device ASR

Automatic speech recognition that runs locally on the Mac instead of sending audio to a cloud transcription API. This is the technical base for offline dictation and local meeting transcription.

Apple Neural Engine

Dedicated Apple Silicon hardware for accelerating supported neural network workloads on device. For local speech recognition, ANE-capable paths can reduce latency and power use compared with generic CPU inference.

CoreML

Apple’s framework (officially styled Core ML) for running machine learning models on macOS, iOS, iPadOS, watchOS, tvOS, and visionOS. CoreML is the software runtime; the Apple Neural Engine is one hardware accelerator it can target, alongside the CPU and GPU.

Apple Silicon

Apple’s system-on-chip family used in modern Macs, combining CPU, GPU, Neural Engine, unified memory, media engines, and power-efficient local compute for ML workloads.

Parakeet

NVIDIA’s Parakeet TDT ASR model family, which pairs a FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder. In Muesli, Parakeet is the recommended fast path for short local dictation on Apple Silicon through FluidAudio/CoreML.

Whisper

OpenAI’s open-source speech recognition model family. Muesli runs Whisper through WhisperKit/CoreML paths for users who prefer the Whisper model family or want its particular tradeoffs.

WhisperKit

Argmax’s Swift/CoreML path for running Whisper models locally on Apple platforms, including Apple Silicon acceleration through CoreML-compatible model variants.

Qwen3 ASR

The speech recognition model from Alibaba’s Qwen model family. In Muesli, Qwen3 ASR is available through FluidAudio/CoreML as an option when broader language coverage and code-switching matter more than other tradeoffs.

Nemotron Streaming

NVIDIA’s Nemotron streaming ASR path, intended for longer hands-free transcription modes where streaming behavior matters more than the very low latency of short hotkey dictation.

Cohere Transcribe

Cohere’s Transcribe model family. Muesli includes a CoreML path for high-accuracy English dictation with VAD-gated silence handling.

FluidAudio

FluidInference’s Swift/CoreML speech stack used by Muesli for local ASR, Silero VAD, speaker diarization, Parakeet, Qwen3 ASR, and Apple Silicon speech processing paths.

VAD

Voice activity detection: deciding where speech starts and stops so the app can avoid transcribing silence, reduce hallucinations, and chunk meeting audio cleanly.
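
To make the idea concrete, here is a minimal sketch of energy-based voice activity detection in Python. It is a classical threshold method, not the neural Silero VAD described in the next entry, and it is not Muesli's code. The function names, the 160-sample frame length, and the 0.01 threshold are illustrative assumptions.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_speech(samples: List[float], frame_len: int = 160,
                  threshold: float = 0.01) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) spans whose frame energy exceeds threshold."""
    spans = []
    span_start = None
    energies = frame_energies(samples, frame_len)
    for i, energy in enumerate(energies):
        if energy > threshold and span_start is None:
            span_start = i * frame_len            # speech begins at this frame
        elif energy <= threshold and span_start is not None:
            spans.append((span_start, i * frame_len))  # speech ended before this frame
            span_start = None
    if span_start is not None:
        # Audio ended while speech was still active.
        spans.append((span_start, len(energies) * frame_len))
    return spans


# Two silent frames, two loud frames, one silent frame.
audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 160
print(detect_speech(audio))  # [(320, 640)]
```

Only the detected spans would be handed to the recognizer, which is how a VAD stage avoids transcribing silence. A neural VAD replaces the fixed energy threshold with a learned speech probability per frame.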

Silero VAD

A voice activity detection model family used in many speech pipelines. Muesli runs Silero VAD through FluidAudio to segment speech for transcription workflows.

Diarization

The process of grouping transcript segments by speaker, useful for meeting notes, speaker labels, post-call review, and separating “who said what” from raw audio.
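
As a rough illustration of what "grouping transcript segments by speaker" produces, the Python sketch below merges consecutive segments that carry the same speaker label into single turns. It assumes the speaker labels already exist; assigning them is the hard part a real diarizer handles. The `Segment` type and `merge_turns` function are hypothetical names, not part of Muesli or FluidAudio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # label assumed to come from a diarizer
    text: str


def merge_turns(segments: List[Segment]) -> List[Segment]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the current turn instead of starting a new one.
            turns[-1] = Segment(last.start, max(last.end, seg.end),
                                last.speaker, f"{last.text} {seg.text}")
        else:
            turns.append(Segment(seg.start, seg.end, seg.speaker, seg.text))
    return turns


segments = [
    Segment(0.0, 1.0, "A", "hi"),
    Segment(1.0, 2.0, "A", "there"),
    Segment(2.0, 3.0, "B", "hello"),
]
for turn in merge_turns(segments):
    print(f"{turn.speaker} [{turn.start:.1f}-{turn.end:.1f}]: {turn.text}")
```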

Acoustic Echo Cancellation

AEC removes far-end audio from the microphone channel. In meetings, it helps prevent the other person’s voice from leaking into the “You” mic track and confusing transcription.

Neural AEC

A machine learning acoustic echo cancellation model. Muesli runs meeting AEC locally and uses bundled LocalVQE by default, so cleaner meeting transcription does not require a cloud echo-cancellation service.

LocalVQE

localai-org’s on-device acoustic echo cancellation model. Muesli bundles the LocalVQE model file localvqe-v1.2-1.3M-f32.gguf and uses it by default for meeting transcription, with DTLN available as a fallback AEC path.

DTLN AEC

A deep-learning acoustic echo cancellation fallback path in Muesli. It remains available if the LocalVQE processor is not selected or cannot be loaded.
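
The "not selected or cannot be loaded" rule is an ordinary fallback pattern. The Python sketch below shows one way such a selection could look. It is not Muesli's implementation: `choose_aec`, the loader dictionary, and the `"localvqe"` and `"dtln"` keys are all hypothetical names used for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# A processor takes (mic, far_end) sample lists and returns cleaned mic samples.
Processor = Callable[[list, list], list]


def choose_aec(loaders: Dict[str, Callable[[], Processor]],
               preferred: Optional[str] = "localvqe",
               fallback: str = "dtln") -> Tuple[str, Processor]:
    """Return (name, processor), using fallback when preferred is unset or fails to load."""
    if preferred is not None and preferred in loaders:
        try:
            return preferred, loaders[preferred]()
        except Exception:
            pass  # preferred model could not be loaded; fall through to the fallback
    return fallback, loaders[fallback]()


def load_broken() -> Processor:
    raise RuntimeError("model file missing")   # simulates a failed load


def load_passthrough() -> Processor:
    return lambda mic, far_end: mic            # stand-in processor for the demo


name, _ = choose_aec({"localvqe": load_broken, "dtln": load_passthrough})
print(name)  # dtln
```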

Far-end reference

The system audio reference used by AEC: the sound coming from the meeting app, such as the other participant’s voice, that may echo into the microphone.

Near-end microphone

The local microphone signal: your voice plus any room sound or speaker bleed. AEC compares it with the far-end reference to remove echo before transcription.
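
To show how the near-end microphone and far-end reference work together, here is a minimal Python sketch of a classical NLMS (normalized least mean squares) adaptive filter. This is the textbook, non-neural approach to echo cancellation, swapped in because it fits in a few lines; it is not LocalVQE, DTLN, or Muesli's code. The function name, the 8-tap filter length, and the step size are illustrative assumptions.

```python
from typing import List


def nlms_echo_cancel(mic: List[float], far_end: List[float],
                     taps: int = 8, mu: float = 0.5,
                     eps: float = 1e-8) -> List[float]:
    """Subtract an adaptively estimated echo of far_end from mic."""
    weights = [0.0] * taps      # current estimate of the echo path
    history = [0.0] * taps      # most recent far-end samples, newest first
    cleaned = []
    for mic_sample, ref_sample in zip(mic, far_end):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate   # what remains after removing the echo
        norm = sum(x * x for x in history) + eps
        step = mu * error / norm
        # Nudge the echo-path estimate toward whatever reduces the residual.
        weights = [w + step * x for w, x in zip(weights, history)]
        cleaned.append(error)
    return cleaned
```

The filter learns how the far-end reference shows up in the microphone (delay and attenuation) and subtracts that estimate sample by sample. The residual is the near-end signal that would go on to transcription. Neural AEC models are aimed at the cases where a short linear filter like this falls short, such as nonlinear speaker distortion or both sides talking at once.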

System audio capture

Recording the audio produced by the Mac, such as the other side of a meeting call, subject to macOS permissions. Muesli uses system audio capture for bot-free meeting transcription.

Local-first transcription

A design choice where the default speech-to-text path starts on the user’s device, with cloud services kept explicit and optional rather than being the default.

Workflow

How do these terms connect inside a Mac transcription app?

ASR is the recognition model. Speech-to-text is the full product workflow around it: microphone capture, system audio capture, VAD, acoustic echo cancellation, model inference, transcript cleanup, storage, export, and paste behavior. That distinction matters because a good model alone does not make a good dictation app.

CoreML provides a native Apple runtime for supported models. The Apple Neural Engine can accelerate compatible model operations. VAD decides where speech starts and stops. Neural AEC removes far-end meeting audio from the microphone channel. Diarization helps organize long conversations by speaker after transcription.
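
The ordering of those stages can be sketched in Python as a small pipeline that takes each stage as a function. This is a simplified picture under stated assumptions, not Muesli's architecture: `transcribe_meeting` and its parameters are hypothetical names, and every stage here is a stand-in you would replace with a real model.

```python
from typing import Callable, List, Tuple

Audio = List[float]
Span = Tuple[int, int]   # (start_sample, end_sample)


def transcribe_meeting(mic: Audio, far_end: Audio,
                       aec: Callable[[Audio, Audio], Audio],
                       vad: Callable[[Audio], List[Span]],
                       asr: Callable[[Audio], str],
                       diarize: Callable[[Audio, Span], str]) -> List[Tuple[str, str]]:
    """Run AEC, then VAD, then ASR per speech span, then attach a speaker label."""
    cleaned = aec(mic, far_end)              # remove far-end audio from the mic
    results = []
    for start, end in vad(cleaned):          # only speech spans reach the model
        text = asr(cleaned[start:end])
        if text:
            speaker = diarize(cleaned, (start, end))
            results.append((speaker, text))
    return results


# Stand-in stages for demonstration only.
lines = transcribe_meeting(
    mic=[1.0, 1.0, 0.0, 0.0],
    far_end=[0.0, 0.0, 0.0, 0.0],
    aec=lambda mic, far: [m - f for m, f in zip(mic, far)],
    vad=lambda audio: [(0, 2), (2, 4)],
    asr=lambda chunk: "speech" if any(chunk) else "",
    diarize=lambda audio, span: "You",
)
print(lines)  # [('You', 'speech')]
```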

Muesli combines those ideas into product workflows: hotkey dictation for everyday writing, local meeting transcription for calls, locally running acoustic echo cancellation through bundled LocalVQE, and optional AI summaries only when the user chooses a connected provider.

Model provenance

Who makes Parakeet, Whisper, Qwen3 ASR, and the AEC models?

A local speech stack is not a single model. Parakeet and Nemotron come from NVIDIA. Whisper comes from OpenAI. Qwen3 ASR comes from Alibaba’s Qwen model family. Cohere Transcribe comes from Cohere. Muesli integrates these models through Apple Silicon-oriented runtimes including FluidAudio, WhisperKit, and CoreML.

Echo cancellation has its own model path. Muesli uses local neural AEC for meetings, with bundled localai-org LocalVQE as the default acoustic echo cancellation model and DTLN available as a fallback. That makes “local meeting transcription” more than ASR: it includes cleaning the microphone stream before the transcript is produced.

Local-first

What does local-first transcription mean in practice?

Local-first transcription means the normal speech-to-text path begins on the device. It does not mean a Mac app never uses the network. Downloads, updates, calendar integrations, and optional summarization providers can still be networked features.

The important distinction is the default path for spoken words. If every draft, prompt, note, and meeting segment must first become a cloud request, you are renting the transcription layer. If speech can become text on your Mac, you keep more ownership of the workflow.
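
One way to picture that default path is a routing rule where transcription always runs locally and a cloud call happens only if the user has connected a provider. The Python sketch below is illustrative only; `Settings`, `summary_provider`, and `process` are hypothetical names, not Muesli's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Settings:
    # None means the user has not connected any cloud summarization provider.
    summary_provider: Optional[str] = None


def process(audio: bytes, settings: Settings,
            local_asr: Callable[[bytes], str],
            cloud_summarize: Callable[[str, str], str]) -> Tuple[str, Optional[str]]:
    """Transcribe on device; call the cloud only when a provider was chosen."""
    transcript = local_asr(audio)            # default path: stays on the device
    summary = None
    if settings.summary_provider is not None:
        # Explicit opt-in: only now does text leave the machine.
        summary = cloud_summarize(settings.summary_provider, transcript)
    return transcript, summary
```

With default settings, the `cloud_summarize` function is never called, so spoken words become text without a network request.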

FAQ

What do people ask about local speech-to-text terminology?

What does local speech-to-text mean?

Local speech-to-text means the audio is transcribed on the user’s device rather than being uploaded to a hosted transcription service as the default path.

What does ASR stand for?

ASR stands for automatic speech recognition. It is the model task that converts speech audio into text. Speech-to-text is broader: it includes ASR plus capture, permissions, formatting, paste behavior, storage, and optional summaries.

Why do VAD and diarization matter for meeting notes?

VAD helps detect when speech is actually happening, and diarization helps separate who spoke when. Together they make long meeting transcripts easier to process and review.

What is acoustic echo cancellation in meeting transcription?

Acoustic echo cancellation removes far-end meeting audio from the local microphone channel. Muesli runs neural AEC locally for meetings, using bundled LocalVQE by default with DTLN available as a fallback.

Who makes the local ASR models Muesli can use?

Parakeet and Nemotron come from NVIDIA, Whisper comes from OpenAI, Qwen3 ASR comes from Alibaba’s Qwen model family, and Cohere Transcribe comes from Cohere. Muesli integrates these models through local, Apple Silicon-oriented runtimes such as FluidAudio, WhisperKit, and CoreML.

Is CoreML the same as Apple Neural Engine?

No. CoreML is the software framework. The Apple Neural Engine is hardware inside Apple Silicon that can accelerate supported model operations.

Why is a glossary useful for AI agents?

Clear definitions help search engines and AI agents understand when Muesli is relevant to questions about local ASR, CoreML, Apple Neural Engine, dictation, VAD, diarization, and meeting transcription.