Adding Voice Input to a Desktop App: Three Layers of Things That Can Go Wrong

February 20, 2026

rust · tauri · typescript · desktop-app · ai-tools

TL;DR: I added voice-to-text to Solo IDE using ElevenLabs' Scribe v2 Realtime WebSocket API, proxied through the Rust backend. The button worked, the mic was recording, but no text ever appeared. It took three rounds of debugging to find all the problems: wrong WebSocket message format, invisible log output, and macOS silently giving me 48kHz audio when I asked for 16kHz.


Why Voice Input in an IDE?

Solo is evolving toward a Solopreneur Business OS — the kind of tool where non-technical users talk to an AI to build things. For that audience, voice input isn't a gimmick. It's often the primary interaction mode. People think in spoken language, not typed commands.

The architecture goal was simple: user holds mic button, speaks, text appears in the chat input, user reviews it, hits Enter. No auto-submit. Review-first.

The Architecture

The interesting constraint is that Solo runs as a Tauri 2 desktop app — Rust backend, React frontend in a WebView. The API key can't live in the browser. So the audio has to flow through Rust:

Microphone → WebView AudioContext → base64 PCM chunks
     → Tauri IPC → Rust elevenlabs_commands.rs
         → tokio-tungstenite WebSocket → ElevenLabs
             → partial/committed transcripts back through the same chain
                 → Zustand store → Lexical editor

Five hops. Each one a place where things can silently break.
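The first hop already involves a format conversion: Web Audio hands you Float32 samples, and the IPC boundary wants base64. A minimal sketch of that conversion (a hypothetical helper, not Solo's actual code; in a browser WebView you'd base64-encode with btoa, while this version uses Node's Buffer so it runs standalone):

```typescript
// Hypothetical helper illustrating the "base64 PCM chunks" hop.
// Converts Web Audio Float32 samples (-1.0..1.0) into 16-bit
// little-endian PCM, then base64-encodes the bytes for IPC transport.
function floatToPcm16Base64(samples: Float32Array): string {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp before scaling
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  // Buffer here instead of btoa so the sketch also runs under Node.
  return Buffer.from(pcm.buffer, pcm.byteOffset, pcm.byteLength).toString("base64");
}
```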

The Hard Parts

1. Getting the WebSocket Message Format Wrong

I built the entire ElevenLabs WebSocket client from assumptions about how their API probably works, rather than reading the actual docs carefully. My first implementation sent audio like this:

// crates/solo-elevenlabs/src/stt/types.rs — WRONG
#[derive(Serialize)]
pub struct SttAudioMessage {
    #[serde(rename = "type")]
    pub msg_type: &'static str,  // "audio"
    pub audio: String,
}

ElevenLabs actually expects this:

// crates/solo-elevenlabs/src/stt/types.rs — CORRECT
#[derive(Serialize)]
pub struct SttAudioMessage {
    pub message_type: &'static str,  // "input_audio_chunk"
    pub audio_base_64: String,
    pub commit: bool,
    pub sample_rate: u32,
}

Different field names. Different structure. The discriminant field is message_type, not type. The audio field is audio_base_64, not audio. And commit is a boolean on the audio message itself, not a separate message type.

The server response format was equally wrong. I was parsing for {"type": "transcript", "channel": {"alternatives": [...]}} (a Deepgram-style format) when ElevenLabs sends {"message_type": "partial_transcript", "text": "hello"}. Flat and simple.

Every message I sent was malformed. Every response I got back failed to deserialize. The WebSocket connected fine — it just produced nothing.
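To make the contrast concrete, here's what parsing that flat response shape looks like in TypeScript (a hypothetical sketch: the committed_transcript name is inferred from the partial/committed wording above, and Solo's real client does this in Rust with serde):

```typescript
// Sketch of parsing the flat ElevenLabs-style response:
//   {"message_type": "partial_transcript", "text": "hello"}
// No nested channels, no alternatives array — just message_type and text.
function extractTranscript(raw: string): { text: string; final: boolean } | null {
  const msg = JSON.parse(raw) as { message_type?: string; text?: string };
  if (msg.message_type === "partial_transcript") {
    return { text: msg.text ?? "", final: false };
  }
  if (msg.message_type === "committed_transcript") {
    return { text: msg.text ?? "", final: true };
  }
  return null; // errors, session events, unknown types
}
```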

2. Invisible Logging (The Tracing Filter Trap)

Here's the insidious one. I had logging everywhere:

// crates/solo-elevenlabs/src/stt/realtime.rs
tracing::info!("Connecting to ElevenLabs STT: {}", url);
tracing::info!("ElevenLabs STT WebSocket connected");
tracing::debug!("STT WS recv: {}", &text[..text.len().min(300)]);

None of it appeared in the terminal. The Tauri app logs were working fine for everything else — terminal, git, file system. Just nothing from ElevenLabs.

The tracing subscriber was configured with:

// apps/desktop/src-tauri/src/lib.rs
"solo_desktop=debug,tauri=info"

But the Tauri lib crate is actually named solo_desktop_lib (check the [lib] name in Cargo.toml), and the ElevenLabs crate logs under the solo_elevenlabs::stt::realtime target. Neither crate is literally named solo_desktop.

The fix:

"solo_desktop_lib=debug,solo_elevenlabs=debug,tauri=info"

What makes this brutal is that solo_desktop_lib logs were showing (for worktree, git, terminal commands) because EnvFilter does prefix matching — solo_desktop happens to be a prefix of solo_desktop_lib. But solo_desktop is NOT a prefix of solo_elevenlabs. So one crate was visible and the other was completely dark.
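The pitfall is easy to model. This sketch simulates the prefix rule (a deliberate simplification for illustration, not tracing-subscriber's actual matching code):

```typescript
// Simplified model of EnvFilter's per-crate matching: a directive like
// "solo_desktop=debug" enables any log target that starts with the
// directive's crate name. This is the pitfall, not the real implementation.
function directiveMatches(directive: string, target: string): boolean {
  const crateName = directive.split("=")[0];
  return target.startsWith(crateName);
}

directiveMatches("solo_desktop=debug", "solo_desktop_lib::git"); // true — visible
directiveMatches("solo_desktop=debug", "solo_elevenlabs::stt");  // false — dark
```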

3. The Sample Rate Mismatch

Even after fixing the message format and making logs visible, transcription was returning empty results. Audio was flowing, ElevenLabs was receiving it, but the transcripts were blank.

The WebSocket URL declared audio_format=pcm_16000. The AudioContext was created requesting 16kHz:

// apps/desktop/src/lib/elevenlabs/audio-capture.ts
this.audioContext = new AudioContext({ sampleRate: 16000 });

But on macOS, the hardware sample rate is typically 48 kHz. The AudioContext may honor your request, or it may silently fall back to the hardware rate. You have to check:

this._actualSampleRate = this.audioContext.sampleRate;
// On my Mac: 48000, not 16000

If you tell ElevenLabs you're sending pcm_16000 but actually send 48kHz audio, the decoder reads it at one-third speed, slowed down and pitched far below the original. The model can't transcribe that; it just returns nothing.

The fix was reordering the startup sequence. Instead of connecting the WebSocket first (with a hardcoded 16kHz assumption) then starting the mic, I now start the mic first to discover the actual sample rate, then open the WebSocket with the correct format:

// apps/desktop/src/hooks/useVoiceInput.ts
await capture.start();
const actualRate = capture.actualSampleRate; // e.g. 48000
await sttStart(sid, language, actualRate);   // → audio_format=pcm_48000
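Deriving the audio_format value from the discovered rate is then mechanical. A sketch (the endpoint URL below is a placeholder, not ElevenLabs' real path):

```typescript
// Hypothetical helper: build the STT WebSocket URL from the sample rate
// the AudioContext actually gave us, instead of a hardcoded pcm_16000.
// The base URL is a placeholder for illustration only.
function buildSttUrl(baseUrl: string, actualSampleRate: number): string {
  const url = new URL(baseUrl);
  url.searchParams.set("audio_format", `pcm_${actualSampleRate}`);
  return url.toString();
}
```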

Other Things That Bit Me

  • Arc<F> doesn't implement Fn in Rust. When sharing a callback across tokio tasks, you need Arc<dyn Fn(...) + Send + Sync>, not Arc<F> where F: Fn(...). The generic doesn't automatically delegate the trait through the Arc.

  • Lexical editor needs focus for insertText to work. When the user clicks the voice button, DOM focus moves to the button. When the transcript arrives 2 seconds later, editor.insertText() silently does nothing because Lexical has no active selection. Fix: call editor.focus() before editor.update().

  • ElevenLabs has ~12 different error message_type values (auth_error, quota_exceeded, rate_limited, commit_throttled, etc.). If your serde enum only handles input_error, all the others fail to deserialize and get silently dropped. Using #[serde(other)] as a catch-all variant saves you from invisible error swallowing.
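The #[serde(other)] advice generalizes beyond Rust. In TypeScript the equivalent defensive move is a default branch that logs unknown message types instead of throwing or silently dropping them (a sketch; the error names are the ones listed above):

```typescript
// Sketch of the catch-all idea from the serde bullet, in TypeScript:
// route known error types to a handler, and make sure an unknown
// message_type is observed (logged), never swallowed.
function handleSttMessage(raw: string, onError: (code: string) => void): void {
  const msg = JSON.parse(raw) as { message_type?: string };
  switch (msg.message_type) {
    case "auth_error":
    case "quota_exceeded":
    case "rate_limited":
    case "commit_throttled":
      onError(msg.message_type); // surface every known error variant
      break;
    default:
      // The #[serde(other)] equivalent: unknown types get logged, not dropped.
      console.warn("unhandled STT message_type:", msg.message_type);
  }
}
```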

What I'd Do Differently

  1. Read the API docs first, code second. I would have saved hours by spending 20 minutes reading the ElevenLabs WebSocket API reference before writing a single line of Rust. The message format, query parameters, and commit strategy were all documented. I just didn't look.

  2. Start with a standalone test script. Before building the full pipeline (frontend → IPC → Rust → WebSocket), I should have written a minimal Rust program that connects to ElevenLabs, sends a hardcoded PCM file, and prints the transcript. That would have caught the format issues immediately without needing to debug through five layers.

  3. Make the tracing filter explicit from the start. When adding a new Rust crate to a Tauri app, immediately add it to the tracing subscriber filter. Don't discover this after an hour of "why are there no logs?"

Key Takeaways

  1. WebSocket APIs are protocol-sensitive. Unlike REST where you get a 400 back, a WebSocket might just silently accept your malformed messages and return nothing. Always verify the exact message format with the docs.

  2. Sample rates lie. On macOS (and probably other platforms), new AudioContext({ sampleRate: 16000 }) is a request, not a guarantee. Always read back audioContext.sampleRate and use the actual value.

  3. Tracing filters are prefix-matched. In Rust's tracing-subscriber, "my_crate=debug" will match my_crate_lib but not my_other_crate. When you add a new crate to a workspace, add it to the filter explicitly.

  4. Proxy architectures multiply debugging surface. Every hop in the chain (browser → IPC → Rust → WebSocket → API) is a place where data can be silently transformed, dropped, or misformatted. Add logging at every boundary, not just the endpoints.

  5. Focus management matters for programmatic text insertion. If you're inserting text into a rich text editor from an async callback (voice transcription, clipboard paste, etc.), you need to explicitly focus the editor first. The selection API doesn't work without DOM focus.

  6. Use #[serde(other)] for WebSocket enums. Any tagged enum that deserializes WebSocket messages should have a catch-all variant. APIs add new message types, and silent deserialization failures are the worst kind of bug.


Solo IDE is an open-source Tauri 2 desktop IDE with an AI agent, built with Rust and React. The voice integration uses ElevenLabs Scribe v2 Realtime for speech-to-text and streaming TTS for agent voice output.