TL;DR: I built VoiceFlow, a local-first macOS voice dictation app that runs whisper.cpp for speech-to-text entirely on-device. Hold a key, talk, release, and the transcription gets typed into whatever app you're focused on. No cloud, no API keys, no $25/month subscription. Transcribes 9 seconds of audio in under a second on M-series chips.
Why I Built This
I'd been using Wispr Flow for voice dictation on macOS. It's good — hold a key, talk, text appears. But it's $24.99/month, it sends your audio to the cloud, and I kept thinking: whisper.cpp exists, it runs locally on Apple Silicon, why am I paying for this?
I'd already explored voice input in my Tauri app using ElevenLabs' cloud STT. That worked, but it was still cloud-dependent and tied to a specific app. I wanted something system-wide — a menu bar utility that works in any text field, runs entirely offline, and costs nothing after the initial build.
So I built VoiceFlow.
The Architecture
The interesting constraint was performance. Voice dictation needs to feel instant — you release the key and the text should appear within a second. That rules out most "easy" approaches and forces you into C++ territory.
Swift (macOS Shell) ←→ Obj-C++ Bridge ←→ C API ←→ C++ Core Engine
Four layers, three threads, one lock-free ring buffer connecting them:
- Audio capture thread (real-time priority) — AVAudioEngine grabs mic input at 48kHz, resamples to 16kHz, feeds raw floats into the ring buffer
- Processing thread — drains the ring buffer, accumulates audio while you're holding the key
- Main thread — handles UI, hotkey detection, and text injection into the focused app
The C++ core is a static library with exactly 7 extern "C" functions. That's the entire bridge surface between Swift and C++. Everything else is hidden.
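To give a sense of scale, a bridge that narrow can be sketched like this. The `vf_*` names and the four functions shown are illustrative stand-ins, not VoiceFlow's actual seven-function API; the point is the shape: opaque handle in, plain C types across the boundary, everything else hidden.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

namespace {
// Opaque engine state lives entirely on the C++ side.
struct Engine {
    std::vector<float> samples;
    std::string last_text;
};
}

extern "C" {
// Swift only ever sees a void pointer.
typedef void* vf_handle;

vf_handle vf_create() { return new Engine(); }
void vf_destroy(vf_handle h) { delete static_cast<Engine*>(h); }

void vf_push_audio(vf_handle h, const float* data, size_t count) {
    auto* e = static_cast<Engine*>(h);
    e->samples.insert(e->samples.end(), data, data + count);
}

// Stand-in for "run whisper.cpp on the accumulated buffer".
size_t vf_transcribe(vf_handle h, char* out, size_t out_cap) {
    if (out == nullptr || out_cap == 0) return 0;
    auto* e = static_cast<Engine*>(h);
    e->last_text = "transcribed " + std::to_string(e->samples.size()) + " samples";
    const size_t n = std::min(e->last_text.size(), out_cap - 1);
    std::memcpy(out, e->last_text.data(), n);
    out[n] = '\0';
    e->samples.clear();
    return n;
}
}
```

Because every function takes and returns only C types, Swift can import the header directly and the C++ internals can change freely underneath.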
The Hard Parts
AVAudioEngine doesn't do what you think it does
My first attempt at audio capture looked reasonable:
// core/src/AudioCapture.swift — what I tried first (broken)
inputNode.installTap(onBus: 0, bufferSize: 4096, format: targetFormat) { buffer, _ in
    // ... never receives valid audio
}
I set the tap format to 16kHz mono (what whisper.cpp expects) and assumed AVAudioEngine would handle the resampling. It didn't. On Apple Silicon, the hardware runs at 48kHz, and asking for a different format in the tap just... silently fails or produces garbage.
The fix was to tap at the hardware's native format and resample manually with AVAudioConverter:
// platform/macos/VoiceFlow/Audio/AudioCapture.swift
let hardwareFormat = inputNode.outputFormat(forBus: 0)
let converter = AVAudioConverter(from: hardwareFormat, to: targetFormat)

inputNode.installTap(onBus: 0, bufferSize: 4096, format: hardwareFormat) { [weak self] buffer, _ in
    let ratio = 16000.0 / hardwareFormat.sampleRate
    let outputFrameCount = AVAudioFrameCount(Double(buffer.frameLength) * ratio)
    guard let output = AVAudioPCMBuffer(pcmFormat: targetFormat,
                                        frameCapacity: outputFrameCount) else { return }
    var error: NSError?
    converter?.convert(to: output, error: &error) { _, outStatus in
        outStatus.pointee = .haveData
        return buffer
    }
    // output.floatChannelData now holds 16kHz mono samples for the ring buffer
}
Not documented anywhere obvious. I burned a lot of time watching empty buffers before figuring this out.
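To make the resampling step concrete, here is a rough C++ stand-in for what the converter is doing conceptually: mapping 48kHz mono floats down to 16kHz by linear interpolation. This is a sketch, not the app's code — the real path goes through AVAudioConverter, and a proper downsampler applies an anti-aliasing low-pass filter first, which this deliberately skips.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive linear-interpolation resampler (mono). For 48000 -> 16000,
// ratio is 1/3, so 4800 input samples become 1600 output samples.
std::vector<float> resample(const std::vector<float>& in,
                            double in_rate, double out_rate) {
    const double ratio = out_rate / in_rate;
    const size_t out_len = static_cast<size_t>(std::floor(in.size() * ratio));
    std::vector<float> out(out_len);
    for (size_t i = 0; i < out_len; ++i) {
        const double src = i / ratio;               // position in the input
        const size_t i0 = static_cast<size_t>(src);
        const size_t i1 = std::min(i0 + 1, in.size() - 1);
        const double frac = src - static_cast<double>(i0);
        out[i] = static_cast<float>(in[i0] * (1.0 - frac) + in[i1] * frac);
    }
    return out;
}
```

The frame-count arithmetic is the same as the `outputFrameCount` line in the Swift snippet: output length is input length times the rate ratio.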
The hotkey problem is actually a permissions problem
macOS has two ways to listen for global keyboard events: CGEvent taps and NSEvent monitors. I started with CGEvent taps because they give you more control — you can intercept and modify events. But they require Accessibility permission, which is painful during development because the permission is tied to the specific binary, and Xcode rebuilds produce a new binary every time.
I switched to NSEvent.addGlobalMonitorForEvents for key detection (no Accessibility needed), and only use Accessibility for the text injection step. I also switched the hotkey from Right Option to the Fn/Globe key — it's less likely to conflict with other shortcuts.
Text injection on macOS is a minefield
Getting transcribed text into the focused app sounds trivial. It's not. macOS gives you three options, and they all have problems:
- Accessibility API (AXUIElement) — the "correct" way: find the focused text field and set its value. But it requires Accessibility permission, and some apps (Electron-based ones especially) don't expose their text fields properly.
- CGEvent key simulation — synthesize individual keystrokes. Works broadly but is slow for long text and requires the same Accessibility permission.
- Clipboard paste — copy the text to the pasteboard, simulate Cmd+V. Fast and works everywhere, but it clobbers whatever the user had copied.
I ended up implementing all three as a fallback chain. The clipboard approach turned out to be the most reliable for an MVP — it's what Wispr Flow itself appears to do based on how it behaves.
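The fallback chain itself is just "try each strategy in order until one succeeds." A minimal C++ sketch of the pattern — the strategies here are placeholder lambdas, whereas the real implementations call the macOS APIs described above:

```cpp
#include <functional>
#include <string>
#include <vector>

// A strategy takes the text and reports whether injection succeeded.
using Injector = std::function<bool(const std::string&)>;

// Try each injector in priority order; return the index of the one
// that succeeded, or -1 if all of them failed.
int inject_text(const std::string& text, const std::vector<Injector>& chain) {
    for (size_t i = 0; i < chain.size(); ++i) {
        if (chain[i](text)) return static_cast<int>(i);
    }
    return -1;
}
```

In VoiceFlow's case the chain is ordered AXUIElement, then key simulation, then clipboard paste — and in practice the chain usually bottoms out at the clipboard.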
Bypassing VAD for hold-to-talk
The C++ engine originally ran all audio through Silero VAD (Voice Activity Detection) to find speech boundaries. This makes sense for always-on listening, but for hold-to-talk it was actually getting in the way — VAD would sometimes clip the beginning of speech or add latency waiting for its confidence threshold.
The fix was simple but required rethinking the architecture: in hold-to-talk mode, bypass VAD entirely. Every sample that comes in while the key is held gets accumulated directly. The user's key press is the segmentation signal.
// core/src/engine.cpp — processing_loop()
// In hold-to-talk mode: accumulate ALL audio while recording.
// The user's key press/release controls segmentation, not VAD.
speech_buffer_.insert(speech_buffer_.end(), frame, frame + VAD_FRAME_SAMPLES);
VAD is still initialized and ready — it'll be needed when I add an always-on listening mode. But for now, the simplest approach turned out to be the right one.
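The whole hold-to-talk contract fits in a few lines. A hypothetical sketch of the idea — class and method names here are mine, not the engine's:

```cpp
#include <cstddef>
#include <vector>

// Hold-to-talk segmentation: the key press/release is the only
// boundary signal. No VAD anywhere in this path.
class HoldToTalk {
public:
    void key_down() {
        recording_ = true;
        buffer_.clear();  // each press starts a fresh utterance
    }

    // Called from the processing loop for every incoming frame.
    void push(const float* frame, size_t n) {
        if (recording_) buffer_.insert(buffer_.end(), frame, frame + n);
    }

    // On release, hand the whole utterance to the transcriber.
    std::vector<float> key_up() {
        recording_ = false;
        return std::move(buffer_);
    }

private:
    bool recording_ = false;
    std::vector<float> buffer_;
};
```

Frames that arrive while the key is up are simply dropped, which is exactly the behavior you want from a push-to-talk microphone.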
The Lock-Free Ring Buffer
This is the piece I'm most happy with. The audio capture thread runs at real-time priority — you absolutely cannot block it with a mutex, or you get audio glitches. So the ring buffer connecting capture to processing is a single-producer single-consumer (SPSC) lock-free design:
// core/src/audio_ring_buffer.cpp
size_t AudioRingBuffer::write(const float* data, size_t count) {
    const size_t w = write_pos_.load(std::memory_order_relaxed);
    const size_t r = read_pos_.load(std::memory_order_acquire);
    const size_t space = capacity_ - (w - r);
    const size_t to_write = std::min(count, space);
    for (size_t i = 0; i < to_write; ++i) {
        buffer_[(w + i) & mask_] = data[i];
    }
    write_pos_.store(w + to_write, std::memory_order_release);
    return to_write;
}
Power-of-2 capacity means the index masking (& mask_) handles wraparound without branches. The memory ordering is carefully chosen — relaxed for the thread's own position, acquire/release for cross-thread synchronization. No mutexes, no locks, no allocation on the hot path.
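The consumer side mirrors this exactly: a relaxed load of its own `read_pos_`, an acquire load of `write_pos_` so the producer's writes are visible, and a release store when freeing space. Here is a self-contained sketch of both sides together — the constructor and member declarations are my assumptions, filled in so the snippet compiles on its own:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

class AudioRingBuffer {
public:
    // Capacity must be a power of 2 so (index & mask_) wraps correctly.
    explicit AudioRingBuffer(size_t capacity_pow2)
        : capacity_(capacity_pow2), mask_(capacity_pow2 - 1),
          buffer_(capacity_pow2) {}

    // Producer (real-time audio thread).
    size_t write(const float* data, size_t count) {
        const size_t w = write_pos_.load(std::memory_order_relaxed);
        const size_t r = read_pos_.load(std::memory_order_acquire);
        const size_t space = capacity_ - (w - r);
        const size_t to_write = std::min(count, space);
        for (size_t i = 0; i < to_write; ++i)
            buffer_[(w + i) & mask_] = data[i];
        write_pos_.store(w + to_write, std::memory_order_release);
        return to_write;
    }

    // Consumer (processing thread).
    size_t read(float* out, size_t count) {
        const size_t r = read_pos_.load(std::memory_order_relaxed);
        const size_t w = write_pos_.load(std::memory_order_acquire); // see writes
        const size_t available = w - r;
        const size_t to_read = std::min(count, available);
        for (size_t i = 0; i < to_read; ++i)
            out[i] = buffer_[(r + i) & mask_];
        read_pos_.store(r + to_read, std::memory_order_release); // free space
        return to_read;
    }

private:
    const size_t capacity_;
    const size_t mask_;
    std::vector<float> buffer_;
    std::atomic<size_t> write_pos_{0};
    std::atomic<size_t> read_pos_{0};
};
```

Note that the positions are monotonically increasing counters, not wrapped indices — `w - r` gives the fill level directly even after the counters pass the buffer size, and masking only happens at the point of array access.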
What I'd Do Differently
- Start with clipboard-based injection from day one. I spent too long trying to get AXUIElement working reliably across apps before accepting that clipboard paste is what actually ships.
- Skip VAD for the initial version. I built the whole VAD pipeline before realizing hold-to-talk doesn't need it. Should have started with the simpler interaction model.
- Use XPC or a Launch Agent instead of a standalone app. The Accessibility permission dance is easier if the binary doesn't change between builds. An XPC service gets permission once and keeps it.
Key Takeaways
- AVAudioEngine's tap format parameter doesn't do resampling for you. Always tap at the hardware format and convert manually. This is barely documented and will waste your time.
- Lock-free data structures aren't just for high-frequency trading. Any time you have a real-time audio thread, you need lock-free communication. SPSC ring buffers are the simplest correct solution.
- macOS permissions are the hardest part of macOS development. The actual audio processing, speech recognition, and text injection code is straightforward. The permission model — Accessibility, Microphone, Input Monitoring — is what makes you question your life choices.
- whisper.cpp on Apple Silicon is absurdly fast. 9.3 seconds of audio transcribed in under 1 second on an M5. There's genuinely no reason to send audio to a cloud API for basic dictation anymore.
- A 7-function C API is the right bridge width. The temptation is to expose more of the C++ engine. Resist it. A minimal extern "C" surface means the Swift shell and C++ core can evolve independently.
- Build the obvious simple thing first. Hold-to-talk with clipboard paste is "boring" compared to always-on VAD with Accessibility injection. It's also what actually works reliably.
VoiceFlow is open source and runs entirely on your Mac. No cloud, no subscription, no audio leaves your machine.