First-run model download


The first time you launch Voxstr, it downloads three models that handle transcription, vocabulary boosting, and cleanup. They run locally — nothing leaves your Mac. This page explains what each one does and what to expect during the download.


The three models

Voxstr's pipeline has three stages — transcribe, boost custom vocabulary, then clean up — and each stage uses its own model.

Parakeet TDT v2 (transcription)

Size: ~2.58 GB. Source: FluidInference/parakeet-tdt-0.6b-v2-coreml. Runs on: Apple Neural Engine.

Parakeet TDT is the speech-to-text model. It converts the audio you record into raw transcribed text. Voxstr uses NVIDIA's Parakeet TDT in CoreML form via the open-source FluidAudio package, which targets the Apple Neural Engine on Apple Silicon. That's why Voxstr is fast: the heavy lifting runs on dedicated silicon instead of the CPU.

Power users with European-language workflows can swap to Parakeet TDT v3 (~2.69 GB) for 25-language support — see the model picker in Settings.

CTC-110M (vocabulary boosting)

Size: ~97.5 MB. Source: FluidAudio CtcKeywordSpotter. Runs on: Apple Neural Engine.

CTC-110M is a smaller acoustic model that scores how likely it is that you said a specific word — for example, a name, a brand, or a piece of jargon. Voxstr uses it to bias Parakeet's output toward your custom vocabulary list. If you've added "Voxstr" or "Eluketronic" to your vocabulary, this is the model that pushes those words to the top of the candidate list when they appear in your audio.
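One simple way to picture this kind of biasing is to give each candidate transcript a bonus for every custom-vocabulary word the keyword model believes was spoken. The sketch below illustrates that general idea only. It is not FluidAudio's CtcKeywordSpotter algorithm, and the scores, the `boost_weight` parameter, and the function name `rescore` are all hypothetical.

```python
def rescore(candidates, keyword_scores, boost_weight=1.0):
    """Rank candidate transcripts, favoring ones that contain detected keywords.

    candidates: list of (text, base_score) pairs from the transcription model.
    keyword_scores: dict mapping a vocabulary word to the keyword model's
        confidence (0.0 to 1.0) that the word was spoken.
    boost_weight: hypothetical knob controlling how strongly keywords count.
    """
    def boosted(item):
        text, base = item
        words = set(text.lower().split())
        bonus = sum(
            score
            for word, score in keyword_scores.items()
            if word.lower() in words
        )
        return base + boost_weight * bonus

    # Highest boosted score first.
    return sorted(candidates, key=boosted, reverse=True)


# Made-up scores: the misheard candidate starts ahead on base score alone.
candidates = [("open vox stir settings", -1.0), ("open voxstr settings", -1.4)]
keyword_scores = {"Voxstr": 0.9}
best_text, _ = rescore(candidates, keyword_scores)[0]
```

With these made-up numbers, the candidate containing "voxstr" overtakes the one with the higher base score, which is the effect the vocabulary list is meant to have.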

MLX Qwen3-1.7B 4-bit (cleanup)

Size: ~0.97 GB. Source: mlx-community/Qwen3-1.7B-4bit-DWQ. Runs on: Apple GPU via MLX.

After Parakeet produces raw text, Voxstr passes it through Qwen3-1.7B — a small language model — to remove fillers ("um", "uh"), apply punctuation, fix obvious capitalization, and correct light transcription errors. This runs in-process via Apple's MLX framework, so there's no separate server (no Ollama, no localhost bridge — Voxstr made the swap to in-process MLX in #621).

Why local

Three reasons.

Privacy. Audio you dictate covers email, code, journals, and conversations you don't want sent to a third-party API. With local models, there's no audio-bearing network request to make. Voxstr's privacy policy doesn't require trust — it's enforced by the absence of network code in the audio path.

Latency. A round trip to a cloud STT API can add hundreds of milliseconds before any transcription work begins. Parakeet on the Apple Neural Engine starts producing text as soon as you release the hotkey. The dictation feels closer to typing than to "press, wait, paste."

Offline. Once the models are downloaded, Voxstr works on a plane, in a coffee shop with bad Wi-Fi, or on a laptop you've explicitly air-gapped. The download is the only network dependency.

What to expect during download

On first launch, Voxstr's menu bar UI shows progress bars for each model. The total combined download is roughly 3.6 GB (Parakeet ~2.58 GB + Qwen3 ~0.97 GB + CTC ~97.5 MB).
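The total is the sum of the three sizes, and a rough download time follows from dividing that total by your connection speed. The sketch below shows the arithmetic. It assumes decimal units (1 GB = 1000 MB, 8 bits per byte) and ignores protocol overhead, and the connection speeds in it are illustrative assumptions, not measurements.

```python
# Model sizes in GB, as listed on this page (CTC is ~97.5 MB = 0.0975 GB).
MODEL_SIZES_GB = {
    "Parakeet TDT v2": 2.58,
    "Qwen3-1.7B 4-bit": 0.97,
    "CTC-110M": 0.0975,
}


def total_size_gb(sizes=MODEL_SIZES_GB):
    """Sum the individual model sizes."""
    return sum(sizes.values())


def estimated_minutes(size_gb, speed_mbps):
    """Rough transfer time in minutes.

    size_gb is in decimal gigabytes; speed_mbps is megabits per second.
    Protocol overhead and server-side throttling are ignored.
    """
    megabits = size_gb * 1000 * 8
    return megabits / speed_mbps / 60


if __name__ == "__main__":
    total = total_size_gb()
    print(f"Total: {total:.2f} GB")
    # Hypothetical connection speeds, chosen only for illustration.
    for mbps in (25, 100, 500):
        print(f"{mbps} Mbps: about {estimated_minutes(total, mbps):.0f} min")
```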

Expected time:

A few suggestions for a smooth first run:

Figure: The first-run download window, with one progress bar for each of the three models.

Once all three are downloaded, Voxstr's menu bar item shifts to ready-to-record. You won't see the download UI again unless the model cache is removed. Subsequent launches load the cached models from disk in a few seconds.

Where the models live

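Models live under your user library, split across two directories:

~/Library/Application Support/Voxstr/models/mlx (the MLX cleanup model)
~/Library/Application Support/FluidAudio/Models (the Parakeet transcription and CTC vocabulary models)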

Total disk footprint after download is around 3.6 GB. Voxstr verifies the cleanup model against a bundled hash registry before marking the cache ready, so a partial or tampered download won't be loaded into the pipeline.
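A check of this kind can be pictured as comparing a file's digest with an expected value from a registry. The sketch below assumes SHA-256 and a plain dictionary as the registry. This page does not say which hash algorithm or registry format Voxstr uses, so both are assumptions, and the names `sha256_of` and `verify_file` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large model file is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, registry):
    """Return True only if the file exists and its digest matches the registry.

    `registry` maps file names to expected hex digests (assumed format).
    A missing entry or missing file counts as a failed check.
    """
    path = Path(path)
    expected = registry.get(path.name)
    if expected is None or not path.is_file():
        return False
    return sha256_of(path) == expected.lower()
```

A truncated download produces a different digest than the complete file, so it fails the comparison in the same way a modified file would.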

Settings, vocabulary lists, and dictation history are stored separately under ~/Library/Application Support/Voxstr/, so deleting and reinstalling Voxstr doesn't lose them. Models are re-downloaded if the cache is missing.

If something goes wrong

The download stalls or progress bars freeze

The most common cause is a flaky Wi-Fi connection or a corporate proxy that blocks Hugging Face downloads. Try this in order:

  1. Wait 30 seconds — sometimes the download recovers on its own.
  2. Quit Voxstr (menu bar → Quit) and relaunch. Voxstr resumes partial downloads where possible.
  3. Check that you can reach huggingface.co in a browser. If your network blocks it, switch networks for the first run.
  4. If you're on a VPN, try toggling it off — some VPN routes are unstable for large file downloads.
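If you prefer to test reachability from a terminal rather than a browser, a small script can make the same check. The sketch below uses only Python's standard library. The function name `check_host` and the injectable `opener` parameter are hypothetical conveniences, and the script is not part of Voxstr.

```python
import urllib.error
import urllib.request


def check_host(url="https://huggingface.co", timeout=10,
               opener=urllib.request.urlopen):
    """Send a HEAD request and describe the outcome as a short string.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without a real network connection.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout):
            return "reachable"
    except urllib.error.HTTPError as exc:
        # A server (or an intercepting proxy) answered with an error status.
        return f"http error {exc.code}"
    except OSError as exc:
        # DNS failure, refused connection, timeout, and similar problems.
        return f"unreachable: {exc}"


if __name__ == "__main__":
    print(check_host())
```

An "unreachable" result, or an HTTP error that appears only on one network, suggests that the network rather than Voxstr is blocking the download.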

"Out of disk space" or the download fails near the end

The combined ~3.6 GB plus temporary staging means you want at least 5 GB free before retrying. Empty your Downloads folder, the Trash, and any large caches, then relaunch Voxstr. Once the download succeeds, the staging files are removed.
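To confirm the free space before retrying, you can query the volume directly. This is a minimal sketch using Python's standard library. The 5 GB threshold comes from the guidance above, the helper names are hypothetical, and it assumes the models are stored on the same volume as your home directory.

```python
import shutil
from pathlib import Path

REQUIRED_FREE_GB = 5  # headroom suggested on this page


def free_gb(path=None):
    """Free space, in decimal GB, on the volume that contains `path`.

    Defaults to the home directory.
    """
    if path is None:
        path = Path.home()
    return shutil.disk_usage(path).free / 1e9


def has_room(path=None, required_gb=REQUIRED_FREE_GB):
    """True if the volume has at least `required_gb` GB free."""
    return free_gb(path) >= required_gb
```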

The model finished downloading but Voxstr says it can't load

Voxstr verifies the cleanup model's hash before loading. If a hash check fails, Voxstr will not use the file. This is by design; see the security notes in the project repo. Quit Voxstr, delete the affected model directory under ~/Library/Application Support/Voxstr/models/mlx/ (or for the transcription/vocabulary models, under ~/Library/Application Support/FluidAudio/Models/), and relaunch. Voxstr will re-download the model cleanly.

I want to clear all models and start over

Quit Voxstr, then in Finder press Cmd+Shift+G and visit each of these in turn:

~/Library/Application Support/Voxstr/models/mlx
~/Library/Application Support/FluidAudio/Models

Delete the contents of both. The first holds the MLX cleanup model; the second holds the Parakeet transcription and CTC vocabulary models. Leave the rest of ~/Library/Application Support/Voxstr/ alone unless you also want to reset settings and history. Relaunch Voxstr — it will re-download the missing models.
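The same reset can be scripted. The sketch below empties a directory while leaving the directory itself in place, using the two paths listed above. The helper name `clear_contents` is hypothetical, the script is not shipped with Voxstr, and it should only be run after quitting the app.

```python
import shutil
from pathlib import Path

SUPPORT = Path.home() / "Library" / "Application Support"

# The two model directories named on this page.
MODEL_DIRS = [
    SUPPORT / "Voxstr" / "models" / "mlx",
    SUPPORT / "FluidAudio" / "Models",
]


def clear_contents(directory):
    """Delete everything inside `directory` but keep the directory itself.

    Returns the number of top-level entries removed; a missing directory
    is treated as already empty.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    removed = 0
    for entry in directory.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed
```

Calling `clear_contents(d)` for each `d` in `MODEL_DIRS` touches only the model caches, so settings, vocabulary lists, and history elsewhere under the Voxstr folder are left alone.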
