
Add scribe stream command for live microphone transcription#1

Draft
javiertoledo wants to merge 13 commits into main from feature/stream

Conversation

@javiertoledo

Summary

  • Add scribe stream command for live microphone transcription
  • Two engines: default (Parakeet TDT v3, multilingual, ~11s latency) and Nemotron (English-only, ~560ms latency)
  • README updated with streaming docs and engine trade-offs

Status

Draft — streaming not working reliably yet. Known issues:

  • Nemotron engine: output shows mixed/repeated text from accumulated transcript diffing
  • Default engine: gets stuck when switching languages mid-stream
  • Default engine: ~11s latency (inherent to SlidingWindow approach with batch model)
  • No system audio capture yet (mic only)

What works

  • scribe stream starts and captures microphone audio
  • scribe stream --engine nemotron downloads and loads the Nemotron model
  • Partial text preview on stderr
  • Both text and JSONL output formats
  • Model download retry on partial/corrupt cache

Architecture decisions

  • Nemotron 560ms via StreamingAsrEngine protocol (true cache-aware streaming)
  • Parakeet TDT v3 via SlidingWindowAsrManager (batch model in sliding windows)
  • Actor-based state for thread safety (Swift 6 sendability)

Test plan

  • Nemotron: speak English continuously, verify clean incremental output
  • Default: speak Spanish, verify transcription appears
  • Default: speak English then Spanish, verify no hang
  • --format jsonl produces valid JSON per line
  • --output file.txt saves to file
  • Ctrl+C exits cleanly

🤖 Generated with Claude Code

javiertoledo and others added 7 commits April 9, 2026 19:22
New command: scribe stream — captures microphone audio and transcribes
in real-time using FluidAudio's SlidingWindowAsrManager (Parakeet).

Features:
- Live transcription from microphone with timestamps
- Text and JSONL output formats
- Save to file with --output
- Ctrl+C to stop cleanly
- Uses streaming ASR config (11s chunks, 1s hypothesis updates)

Usage:
  scribe stream                      # listen and transcribe
  scribe stream --format jsonl       # JSONL output
  scribe stream --output meeting.txt # save to file

System audio capture (--source) will be added in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce chunk size from 11s to 3s for ~3-4s latency (was ~13s)
- Lower confirmation threshold from 0.8 to 0.5 for faster output
- Reduce right context from 2s to 0.5s
- Fix speaker label: remove "Others" tag for mic input
- Add text dedup to avoid repeating same hypothesis
- Remove --mic flag (mic is default and only source for now)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pdates

The 3s chunk config was too short for Parakeet — model needs ~10s context.
Reverted to the library's .streaming preset (11s chunks, 1s hypothesis).

Now shows two types of updates:
- Volatile (hypothesis): shown as ephemeral line on stderr with \r overwrite
  Gives immediate ~1-2s feedback while speaking
- Confirmed: printed as permanent line to stdout
  Stable, final text after sufficient context

Also fixes:
- Stream getting stuck on longer utterances (was breaking model state)
- Text format shows live preview on stderr, final on stdout
- JSONL emits both volatile and confirmed (with "confirmed" field)
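A hedged sketch of what one JSONL line might look like under this scheme. Only the "confirmed" field name comes from the commit message above; the other keys ("text", "start") are illustrative assumptions about scribe's schema, not its actual output format.

```python
import json

# Hypothetical JSONL emission: one JSON object per line, flagged volatile
# or confirmed. Only "confirmed" is named in the PR; other keys are assumed.
def emit_jsonl(text: str, start_secs: float, confirmed: bool) -> str:
    return json.dumps({"text": text, "start": start_secs, "confirmed": confirmed})

print(emit_jsonl("Can you help me", 18.0, False))
print(emit_jsonl("Can you help me with something?", 18.0, True))
```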

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace SlidingWindowAsrManager (batch TDT in sliding windows, ~11s latency)
with StreamingAsrEngine protocol using Nemotron 560ms:

- True cache-aware streaming: each 560ms chunk inherits full context
- 2.12% WER (better than TDT v3's 2.5% on LibriSpeech)
- Includes punctuation and capitalization
- ~560ms to first text (was ~11s)
- Partial transcript callback for live preview on stderr
- Confirmed text printed to stdout

Architecture:
- Mic audio → appendAudio() → processBufferedAudio() → getPartialTranscript()
- Partial callback fires on every chunk for live preview (\r overwrite on stderr)
- Main loop polls at 20Hz, emits new confirmed text to stdout
- Actor-based state management for thread safety (Swift 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Default: Parakeet TDT v3 via SlidingWindow (25 languages, higher latency)
- --engine nemotron: Nemotron 560ms (English-only, ~560ms latency, punctuation)

Usage:
  scribe stream                    # multilingual (default)
  scribe stream --engine nemotron  # English-only, low latency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Nemotron: retry with cache cleanup on failed model load (fixes partial download)
- Both engines: show download progress messages (not just --verbose)
- README: add streaming section with engine comparison and trade-offs
- README: update performance table with streaming latencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Nemotron engine's partial callback returns the full accumulated
transcript each time, which grows and revises. The previous code tried
to diff via getPartialTranscript() polling, causing repeated/mixed output.

Fix: Track printed length in StreamState actor. The partial callback
fires after each 560ms chunk — we diff to find only the new portion
and emit that. Live preview shows the tail of the transcript on stderr
(ephemeral, overwritten). New confirmed text goes to stdout.

Also simplified SlidingWindow engine to only emit to stdout on confirmed
text (volatile goes to stderr preview only).
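The printed-length diffing described above can be modeled in a few lines. This is an illustrative Python sketch, not the actual Swift actor's API: the callback hands us the full accumulated transcript each time, and we emit only the portion past what was already printed.

```python
# Sketch of the StreamState-style delta tracking: remember how many
# characters were already emitted and return only the new tail.
# If the model revises earlier text (transcript shrinks), reset the
# printed length rather than emit stale content.
class DeltaTracker:
    def __init__(self) -> None:
        self.printed = 0

    def new_text(self, full_transcript: str) -> str:
        if len(full_transcript) < self.printed:
            self.printed = len(full_transcript)
            return ""
        delta = full_transcript[self.printed:]
        self.printed = len(full_transcript)
        return delta

tracker = DeltaTracker()
print(tracker.new_text("Hello"))        # "Hello"
print(tracker.new_text("Hello world"))  # " world"
```

Note that this length-based diff only ever appends; as the later eval commit observes, it can silently discard mid-transcript revisions from the model.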

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
javiertoledo and others added 6 commits April 9, 2026 20:02
Adds a test/dev mode that feeds pre-recorded audio files into the
streaming pipeline (instead of microphone), making streaming output
deterministic and reproducible. Required to validate streaming quality
without manual mic testing.

Stream.swift:
- New --audio-file <path> option (Nemotron engine only for now)
- Reads via FluidAudio.AudioConverter.resampleAudioFile() (16kHz mono Float32)
- Chunks at 4096 samples to mimic the live mic tap
- Calls engine.finish() to flush tail audio before exit
- Refactored Nemotron path: split mic vs file, extracted callback setup
- SlidingWindow path explicitly rejects --audio-file (not yet supported)

eval.py:
- New --mode batch|streaming flag
- Streaming mode dispatches to scribe stream --audio-file --format jsonl
- Parses JSONL deltas and concatenates for WER computation
- Auto-skips MLS Spanish in streaming mode (Nemotron is English-only)
- New "mode" column in CSVs to differentiate batch vs streaming runs
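The JSONL-concatenation step in eval.py could be sketched as follows. The field names ("text", "confirmed") are assumptions about scribe's JSONL schema; the idea is simply to keep the confirmed deltas and join them into the hypothesis string handed to the WER computation.

```python
import json

# Hedged sketch: parse one JSON object per line of `scribe stream
# --format jsonl` output, keep confirmed deltas, and concatenate them
# into the hypothesis text for WER scoring.
def hypothesis_from_jsonl(jsonl: str) -> str:
    parts = []
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("confirmed"):
            parts.append(rec["text"])
    return " ".join(parts).strip()
```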

summarize.py:
- Splits proper noun recall section by mode (batch vs streaming)

Baseline results (scribe v0.2.1):

| Dataset    | Batch (Parakeet) | Streaming (Nemotron) | Delta   |
|------------|-----------------:|---------------------:|--------:|
| TED-LIUM   | 6.0%             | 23.2%                | +17.2pp |
| Earnings-21| 12.9%            | 39.9%                | +27.0pp |
| PN recall  | 70.8%            | 58.4%                | -12.4pp |

Streaming is 3-4x worse than batch with deletions dominating the error
budget (TED 1327 D vs 759 I, Earnings-21 3640 D vs 1469 I) — streaming
is dropping content. Likely culprits:
1. Delta-text logic in StreamState.getNewText() may discard model revisions
2. engine.finish() may not flush the final partial-transcript callback

This commit establishes the regression baseline. Next phase: investigate
the delta-text dropping and re-run the same eval to confirm fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs were causing streaming to drop ~30% of content vs batch:

Bug 1: Discarded finish()'s authoritative return value
-----------------------------------------------------
The previous code accumulated text via the per-chunk callback only and
just logged finalText.count after engine.finish(). Two failure modes:
  (a) finish() pads any trailing partial chunk and decodes additional
      tokens, but those never reach the callback path.
  (b) The callback dispatched its emission via Task { } — those tasks
      could still be pending when the program exited, losing output.

Fix: in file mode, do NOT set up the per-chunk callback. Use
engine.finish() as the source of truth and emit it once via a new
emitFinalTranscript() helper. The callback pattern is kept only for
mic mode where live preview is the actual feature.

This alone reduced WER from 23.2% → 13.2% on TED-LIUM.

Bug 2: Used the .nemotron560ms variant
---------------------------------------
NemotronChunkSize.swift comments claim "560ms - same accuracy" as
1120ms, but empirically 560ms is much worse:
  - 560ms on TED talk_0: 15.3% WER, 363 deletions
  - 1120ms on TED talk_0:  7.5% WER, 133 deletions

The published FluidAudio benchmark (Documentation/Benchmarks.md) only
covers 1120ms (2.51% on LibriSpeech), and the 560ms variant has no
published number. The "same accuracy" claim isn't supported by data.

Fix: switch to .nemotron1120ms. The latency cost (1.12s vs 0.56s)
is acceptable for live use and gives a meaningful accuracy boost.

Also matched the official NemotronTranscribe.swift CLI pattern: feed
the whole file as one AVAudioPCMBuffer instead of chunking into 4096
samples. (No accuracy difference, but matches the reference pattern.)

Final results (scribe v0.2.1, Nemotron 1120ms streaming vs Parakeet TDT v3 batch):

| Dataset    | Batch | Streaming | Delta |
|------------|------:|----------:|------:|
| TED-LIUM   |  6.0% |      7.2% | +1.2pp |
| Earnings-21| 12.9% |     18.5% | +5.6pp |
| PN recall  | 70.8% |     65.7% | -5.1pp |

Streaming is now within striking distance of batch quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…changes

Confirms the streaming work in this branch did not affect the batch path:

| Dataset    | Before | After  |
|------------|-------:|-------:|
| TED-LIUM   |   6.0% |   6.0% |
| Earnings-21|  12.9% |  12.9% |
| MLS Spanish|   8.5% |   8.5% |

Earnings-21 proper noun recall also unchanged at 70.8% (126/178 entities).
S/D/I counts byte-identical to the prior baseline; only processing_secs
varies (wall-clock noise).

CSV schema updated with the new "mode" column (= "batch") and the
proper_noun_* columns added in the prior session, so all batch + streaming
CSVs now share a uniform schema.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both modes now flow through a single drain loop. The only difference is
the source adapter — file mode pushes one buffer and finishes the
continuation; mic mode keeps yielding from the AVAudioEngine tap until
SIGINT finishes the continuation. This eliminates two parallel code
paths and a class of mic-only bugs.

Architecture (matches FluidAudio's own SlidingWindowAsrManager pattern):

  source ──► AsyncStream<AVAudioPCMBuffer> ──► drain loop
                                                ├─ engine.appendAudio
                                                ├─ engine.processBufferedAudio
                                                ├─ engine.getPartialTranscript
                                                └─ emit delta via StreamState
  stream ends (file done OR SIGINT)
       └────► engine.finish() ──► emit tail
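The unified pipeline above can be modeled in Python with an async stream and a stand-in engine (the real engine is FluidAudio's Swift API; everything named here is illustrative). Both source adapters produce the same stream of buffers; one drain loop feeds the engine and emits deltas; `finish()` flushes the tail once the stream ends.

```python
import asyncio

# Stand-in for the streaming engine: accumulates "audio" (strings here)
# and exposes a growing transcript. Not FluidAudio's API.
class FakeEngine:
    def __init__(self) -> None:
        self.transcript = ""

    def append_audio(self, buf: str) -> None:
        self.transcript += buf

    def partial_transcript(self) -> str:
        return self.transcript

    def finish(self) -> str:
        return self.transcript + "."

async def file_source(chunks):
    # File mode: push the buffers, then let the stream end.
    for chunk in chunks:
        yield chunk

async def drain(source, engine) -> list:
    emitted, printed = [], 0
    async for buf in source:
        engine.append_audio(buf)
        partial = engine.partial_transcript()
        emitted.append(partial[printed:])  # emit only the new delta
        printed = len(partial)
    final = engine.finish()                # stream ended: flush the tail
    emitted.append(final[printed:])
    return emitted

deltas = asyncio.run(drain(file_source(["hello ", "world"]), FakeEngine()))
print("".join(deltas))  # hello world.
```

Mic mode would differ only in the source adapter: the generator keeps yielding tap buffers until a SIGINT handler finishes the stream, and the same drain loop then falls through to `finish()`.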

Specific bugs fixed in mic mode by sharing the proven file pipeline:

1. No more detached Task { } in the per-chunk callback. The drain loop
   polls getPartialTranscript() in the main task. No race, no lost
   emissions at exit.

2. Darwin.exit(0) → DispatchSource.makeSignalSource(SIGINT). The signal
   handler runs on a regular dispatch queue and calls
   continuation.finish(). The drain loop falls through to engine.finish()
   and emits the tail audio that used to be silently dropped.

3. Mic tap pre-resamples to 16kHz mono Float32 via AudioConverter.
   resampleBuffer() (the same converter the file path uses for whole
   files), so the engine sees identical input format regardless of source.

Code changes:
- New: runNemotronDrainLoop, feedNemotronFromFile, startMicSource,
  emitDelta, emitLivePreview, makeMonoFloat32Buffer, NemotronMicResources
- Removed: setupNemotronCallback, runNemotronFromMic, runNemotronFromFile,
  emitFinalTranscript
- runNemotronWithEngine becomes a thin orchestrator that picks the source
  adapter and runs the drain loop

Eval verification (same scribe v0.2.1, same audio, same reference):

| Dataset    | Streaming WER (before) | Streaming WER (after) |
|------------|-----------------------:|----------------------:|
| TED-LIUM   |                   7.2% |                  7.2% |
| Earnings-21|                  18.5% |                 18.5% |

Per-file S/D/I counts are byte-identical to the previous baseline. The
unified pipeline now emits multiple deltas per file (one from the drain
poll, one from the finish() tail) instead of one big block, but jiwer's
text normalization makes the result equivalent.

Out of scope:
- runSlidingWindow (multilingual) path stays as-is — still rejects
  --audio-file, still uses setupSignalHandler. The SlidingWindow engine
  has its own concurrency model and is a separate refactor.
- System audio capture via CATapDescription — plugs into the same
  pipeline as a third source adapter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first processBufferedAudio() call triggers CoreML model compilation
(~10-20s). Without warmup, this happens on the first real audio chunk,
causing all live output to batch up and appear at once when speech
pauses or Ctrl+C is pressed.

Fix: feed one silent chunk (17920 samples = 1120ms) through the engine
during startup, before the mic tap begins. CoreML compiles during the
"Warming up..." log message. Then reset() to discard the silence tokens
so they don't pollute the real transcript.
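The 17920-sample figure follows directly from the engine's expected 16 kHz mono input; a quick arithmetic check:

```python
# Warmup chunk size: 1120 ms of silence at the 16 kHz sample rate
# the engine expects.
sample_rate_hz = 16_000
chunk_ms = 1120
samples = sample_rate_hz * chunk_ms // 1000
print(samples)  # 17920
```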

File-mode eval unchanged: talk_0 still 7.5% WER, S/D/I 72/133/11.
The reset() correctly clears warmup state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two UX improvements to scribe stream:

1. Text mode now flows naturally without timestamps:
   Before: [00:18] . Can you help me
           [00:19] with something
   After:  Can you help me with something?
           I just need to say something.
   Tokens append inline; newlines on sentence boundaries (. ? !).
   JSONL mode unchanged (keeps timestamps for machine consumption).

2. New --save-audio <path> flag records the mic audio to a WAV file.
   After the stream ends (Ctrl+C), the saved audio is automatically
   re-transcribed with the batch engine (Parakeet TDT v3) which is
   significantly more accurate (6% vs 7.2% on TED, 13% vs 18.5% on
   Earnings-21). Users get real-time assistance during the meeting
   AND a polished transcript at the end.

   Usage: scribe stream --engine nemotron --save-audio meeting.wav
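The inline text flow in (1) can be sketched as follows; this is an illustration of the rule (tokens append on one line, newline after sentence-final punctuation), not the actual Swift implementation.

```python
# Append tokens inline; insert a newline after sentence-final
# punctuation (. ? !), otherwise a space between tokens.
def flow(tokens: list) -> str:
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok.endswith((".", "?", "!")):
            out.append("\n")
        elif i + 1 < len(tokens):
            out.append(" ")
    return "".join(out)

print(flow(["Can", "you", "help", "me", "with", "something?",
            "I", "just", "need", "to", "say", "something."]))
```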

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>