TL;DR: I built VoiceFlow, a local-first macOS voice dictation app that runs whisper.cpp for speech-to-text entirely on-device. Hold a key, talk, release, and the transcription gets typed into whatever app you're focused on. No cloud, no API keys, no $25/month subscription. Transcribes 9 seconds of audio in under a second on M-series chips.
Why I Built This
I'd been using Wispr Flow for voice dictation on macOS. It's good — hold a key, talk, text appears. But it's $24.99/month, it sends your audio to the cloud, and I kept thinking: whisper.cpp exists, it runs locally on Apple Silicon, why am I paying for this?
I'd already explored voice input in my Tauri app using ElevenLabs' cloud STT. That worked, but it was still cloud-dependent and tied to a specific app. I wanted something system-wide — a menu bar utility that works in any text field, runs entirely offline, and costs nothing after the initial build.
So I built VoiceFlow.
The Architecture
The interesting constraint was performance. Voice dictation needs to feel instant — you release the key and the text should appear within a second. That rules out most "easy" approaches and forces you into C++ territory.
Swift (macOS Shell) ←→ Obj-C++ Bridge ←→ C API ←→ C++ Core Engine
Four layers, three threads, one lock-free ring buffer connecting them:
- Audio capture thread (real-time priority) — AVAudioEngine grabs mic input at 48kHz, resamples to 16kHz, feeds raw floats into the ring buffer
- Processing thread — drains the ring buffer, accumulates audio while you're holding the key
- Main thread — handles UI, hotkey detection, and text injection into the focused app
The C++ core is a static library with exactly 7 extern "C" functions. That's the entire bridge surface between Swift and C++. Everything else is hidden.
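To give a sense of scale, a bridge that narrow can be sketched like this. The `vf_*` names and the four functions shown are illustrative stand-ins, not VoiceFlow's actual seven-function API; the point is the shape: opaque handle in, plain C types across the boundary, everything else hidden.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

namespace {
// Opaque engine state lives entirely on the C++ side.
struct Engine {
    std::vector<float> samples;
    std::string last_text;
};
}

extern "C" {
// Swift only ever sees a void pointer.
typedef void* vf_handle;

vf_handle vf_create() { return new Engine(); }
void vf_destroy(vf_handle h) { delete static_cast<Engine*>(h); }

void vf_push_audio(vf_handle h, const float* data, size_t count) {
    auto* e = static_cast<Engine*>(h);
    e->samples.insert(e->samples.end(), data, data + count);
}

// Stand-in for "run whisper.cpp on the accumulated buffer".
size_t vf_transcribe(vf_handle h, char* out, size_t out_cap) {
    if (out == nullptr || out_cap == 0) return 0;
    auto* e = static_cast<Engine*>(h);
    e->last_text = "transcribed " + std::to_string(e->samples.size()) + " samples";
    const size_t n = std::min(e->last_text.size(), out_cap - 1);
    std::memcpy(out, e->last_text.data(), n);
    out[n] = '\0';
    e->samples.clear();
    return n;
}
}
```

Because every function takes and returns only C types, Swift can import the header directly and the C++ internals can change freely underneath.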
The Hard Parts
AVAudioEngine doesn't do what you think it does
My first attempt at audio capture looked reasonable:
// core/src/AudioCapture.swift — what I tried first (broken)
inputNode.installTap(onBus: 0, bufferSize: 4096, format: targetFormat) { buffer, _ in
    // ... never receives valid audio
}
I set the tap format to 16kHz mono (what whisper.cpp expects) and assumed AVAudioEngine would handle the resampling. It didn't. On Apple Silicon, the hardware runs at 48kHz, and asking for a different format in the tap just... silently fails or produces garbage.
The fix was to tap at the hardware's native format and resample manually with AVAudioConverter:
// platform/macos/VoiceFlow/Audio/AudioCapture.swift
let hardwareFormat = inputNode.outputFormat(forBus: 0)
let converter = AVAudioConverter(from: hardwareFormat, to: targetFormat)

inputNode.installTap(onBus: 0, bufferSize: 4096, format: hardwareFormat) { [weak self] buffer, _ in
    let ratio = 16000.0 / hardwareFormat.sampleRate
    let outputFrameCount = AVAudioFrameCount(Double(buffer.frameLength) * ratio)
    guard let output = AVAudioPCMBuffer(pcmFormat: targetFormat,
                                        frameCapacity: outputFrameCount) else { return }
    var error: NSError?
    converter?.convert(to: output, error: &error) { _, outStatus in
        outStatus.pointee = .haveData
        return buffer
    }
    // output.floatChannelData now holds 16kHz mono samples for the ring buffer
}
Not documented anywhere obvious. I burned a lot of time watching empty buffers before figuring this out.
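To make the resampling step concrete, here is a rough C++ stand-in for what the converter is doing conceptually: mapping 48kHz mono floats down to 16kHz by linear interpolation. This is a sketch, not the app's code — the real path goes through AVAudioConverter, and a proper downsampler applies an anti-aliasing low-pass filter first, which this deliberately skips.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive linear-interpolation resampler (mono). For 48000 -> 16000,
// ratio is 1/3, so 4800 input samples become 1600 output samples.
std::vector<float> resample(const std::vector<float>& in,
                            double in_rate, double out_rate) {
    const double ratio = out_rate / in_rate;
    const size_t out_len = static_cast<size_t>(std::floor(in.size() * ratio));
    std::vector<float> out(out_len);
    for (size_t i = 0; i < out_len; ++i) {
        const double src = i / ratio;               // position in the input
        const size_t i0 = static_cast<size_t>(src);
        const size_t i1 = std::min(i0 + 1, in.size() - 1);
        const double frac = src - static_cast<double>(i0);
        out[i] = static_cast<float>(in[i0] * (1.0 - frac) + in[i1] * frac);
    }
    return out;
}
```

The frame-count arithmetic is the same as the `outputFrameCount` line in the Swift snippet: output length is input length times the rate ratio.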
The hotkey problem is actually a permissions problem
macOS has two ways to listen for global keyboard events: CGEvent taps and NSEvent monitors. I started with CGEvent taps because they give you more control — you can intercept and modify events. But they require Accessibility permission, which is painful during development because the permission is tied to the specific binary, and Xcode rebuilds produce a new binary every time.
I switched to NSEvent.addGlobalMonitorForEvents for key detection (no Accessibility needed), and only use Accessibility for the text injection step. I also switched the hotkey from Right Option to the Fn/Globe key — it's less likely to conflict with other shortcuts.
Text injection on macOS is a minefield
Getting transcribed text into the focused app sounds trivial. It's not. macOS gives you three options, and they all have problems:
- Accessibility API (AXUIElement) — the "correct" way: find the focused text field and set its value. But it requires Accessibility permission, and some apps (Electron-based ones especially) don't expose their text fields properly.
- CGEvent key simulation — synthesize individual keystrokes. Works broadly but is slow for long text and requires the same Accessibility permission.
- Clipboard paste — copy the text to the pasteboard, simulate Cmd+V. Fast and works everywhere, but it clobbers whatever the user had copied.
I ended up implementing all three as a fallback chain. The clipboard approach turned out to be the most reliable for an MVP — it's what Wispr Flow itself appears to do based on how it behaves.
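The fallback chain itself is just "try each strategy in order until one succeeds." A minimal C++ sketch of the pattern — the strategies here are placeholder lambdas, whereas the real implementations call the macOS APIs described above:

```cpp
#include <functional>
#include <string>
#include <vector>

// A strategy takes the text and reports whether injection succeeded.
using Injector = std::function<bool(const std::string&)>;

// Try each injector in priority order; return the index of the one
// that succeeded, or -1 if all of them failed.
int inject_text(const std::string& text, const std::vector<Injector>& chain) {
    for (size_t i = 0; i < chain.size(); ++i) {
        if (chain[i](text)) return static_cast<int>(i);
    }
    return -1;
}
```

In VoiceFlow's case the chain is ordered AXUIElement, then key simulation, then clipboard paste — and in practice the chain usually bottoms out at the clipboard.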
Bypassing VAD for hold-to-talk
The C++ engine originally ran all audio through Silero VAD (Voice Activity Detection) to find speech boundaries. This makes sense for always-on listening, but for hold-to-talk it was actually getting in the way — VAD would sometimes clip the beginning of speech or add latency waiting for its confidence threshold.
The fix was simple but required rethinking the architecture: in hold-to-talk mode, bypass VAD entirely. Every sample that comes in while the key is held gets accumulated directly. The user's key press is the segmentation signal.
// core/src/engine.cpp — processing_loop()
// In hold-to-talk mode: accumulate ALL audio while recording.
// The user's key press/release controls segmentation, not VAD.
speech_buffer_.insert(speech_buffer_.end(), frame, frame + VAD_FRAME_SAMPLES);
VAD is still initialized and ready — it'll be needed when I add an always-on listening mode. But for now, the simplest approach turned out to be the right one.
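The whole hold-to-talk contract fits in a few lines. A hypothetical sketch of the idea — class and method names here are mine, not the engine's:

```cpp
#include <cstddef>
#include <vector>

// Hold-to-talk segmentation: the key press/release is the only
// boundary signal. No VAD anywhere in this path.
class HoldToTalk {
public:
    void key_down() {
        recording_ = true;
        buffer_.clear();  // each press starts a fresh utterance
    }

    // Called from the processing loop for every incoming frame.
    void push(const float* frame, size_t n) {
        if (recording_) buffer_.insert(buffer_.end(), frame, frame + n);
    }

    // On release, hand the whole utterance to the transcriber.
    std::vector<float> key_up() {
        recording_ = false;
        return std::move(buffer_);
    }

private:
    bool recording_ = false;
    std::vector<float> buffer_;
};
```

Frames that arrive while the key is up are simply dropped, which is exactly the behavior you want from a push-to-talk microphone.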
The Lock-Free Ring Buffer
This is the piece I'm most happy with. The audio capture thread runs at real-time priority — you absolutely cannot block it with a mutex, or you get audio glitches. So the ring buffer connecting capture to processing is a single-producer single-consumer (SPSC) lock-free design:
// core/src/audio_ring_buffer.cpp
size_t AudioRingBuffer::write(const float* data, size_t count) {
    const size_t w = write_pos_.load(std::memory_order_relaxed);
    const size_t r = read_pos_.load(std::memory_order_acquire);
    const size_t space = capacity_ - (w - r);
    const size_t to_write = std::min(count, space);
    for (size_t i = 0; i < to_write; ++i) {
        buffer_[(w + i) & mask_] = data[i];
    }
    write_pos_.store(w + to_write, std::memory_order_release);
    return to_write;
}
Power-of-2 capacity means the index masking (& mask_) handles wraparound without branches. The memory ordering is carefully chosen — relaxed for the thread's own position, acquire/release for cross-thread synchronization. No mutexes, no locks, no allocation on the hot path.
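The consumer side mirrors this exactly: a relaxed load of its own `read_pos_`, an acquire load of `write_pos_` so the producer's writes are visible, and a release store when freeing space. Here is a self-contained sketch of both sides together — the constructor and member declarations are my assumptions, filled in so the snippet compiles on its own:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

class AudioRingBuffer {
public:
    // Capacity must be a power of 2 so (index & mask_) wraps correctly.
    explicit AudioRingBuffer(size_t capacity_pow2)
        : capacity_(capacity_pow2), mask_(capacity_pow2 - 1),
          buffer_(capacity_pow2) {}

    // Producer (real-time audio thread).
    size_t write(const float* data, size_t count) {
        const size_t w = write_pos_.load(std::memory_order_relaxed);
        const size_t r = read_pos_.load(std::memory_order_acquire);
        const size_t space = capacity_ - (w - r);
        const size_t to_write = std::min(count, space);
        for (size_t i = 0; i < to_write; ++i)
            buffer_[(w + i) & mask_] = data[i];
        write_pos_.store(w + to_write, std::memory_order_release);
        return to_write;
    }

    // Consumer (processing thread).
    size_t read(float* out, size_t count) {
        const size_t r = read_pos_.load(std::memory_order_relaxed);
        const size_t w = write_pos_.load(std::memory_order_acquire); // see writes
        const size_t available = w - r;
        const size_t to_read = std::min(count, available);
        for (size_t i = 0; i < to_read; ++i)
            out[i] = buffer_[(r + i) & mask_];
        read_pos_.store(r + to_read, std::memory_order_release); // free space
        return to_read;
    }

private:
    const size_t capacity_;
    const size_t mask_;
    std::vector<float> buffer_;
    std::atomic<size_t> write_pos_{0};
    std::atomic<size_t> read_pos_{0};
};
```

Note that the positions are monotonically increasing counters, not wrapped indices — `w - r` gives the fill level directly even after the counters pass the buffer size, and masking only happens at the point of array access.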
What I'd Do Differently
- Start with clipboard-based injection from day one. I spent too long trying to get AXUIElement working reliably across apps before accepting that clipboard paste is what actually ships.
- Skip VAD for the initial version. I built the whole VAD pipeline before realizing hold-to-talk doesn't need it. Should have started with the simpler interaction model.
- Use XPC or a Launch Agent instead of a standalone app. The Accessibility permission dance is easier if the binary doesn't change between builds. An XPC service gets permission once and keeps it.
Key Takeaways
- AVAudioEngine's tap format parameter doesn't do resampling for you. Always tap at the hardware format and convert manually. This is barely documented and will waste your time.
- Lock-free data structures aren't just for high-frequency trading. Any time you have a real-time audio thread, you need lock-free communication. SPSC ring buffers are the simplest correct solution.
- macOS permissions are the hardest part of macOS development. The actual audio processing, speech recognition, and text injection code is straightforward. The permission model — Accessibility, Microphone, Input Monitoring — is what makes you question your life choices.
- whisper.cpp on Apple Silicon is absurdly fast. 9.3 seconds of audio transcribed in under 1 second on an M5. There's genuinely no reason to send audio to a cloud API for basic dictation anymore.
- A 7-function C API is the right bridge width. The temptation is to expose more of the C++ engine. Resist it. A minimal extern "C" surface means the Swift shell and C++ core can evolve independently.
- Build the obvious simple thing first. Hold-to-talk with clipboard paste is "boring" compared to always-on VAD with Accessibility injection. It's also what actually works reliably.
VoiceFlow is open source and runs entirely on your Mac. No cloud, no subscription, no audio leaves your machine.