Add scribe stream command for live microphone transcription #1
Draft
javiertoledo wants to merge 13 commits into main from
Conversation
New command: scribe stream — captures microphone audio and transcribes in real-time using FluidAudio's SlidingWindowAsrManager (Parakeet).

Features:
- Live transcription from microphone with timestamps
- Text and JSONL output formats
- Save to file with --output
- Ctrl+C to stop cleanly
- Uses streaming ASR config (11s chunks, 1s hypothesis updates)

Usage:
  scribe stream                       # listen and transcribe
  scribe stream --format jsonl        # JSONL output
  scribe stream --output meeting.txt  # save to file

System audio capture (--source) will be added in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce chunk size from 11s to 3s for ~3-4s latency (was ~13s)
- Lower confirmation threshold from 0.8 to 0.5 for faster output
- Reduce right context from 2s to 0.5s
- Fix speaker label: remove "Others" tag for mic input
- Add text dedup to avoid repeating same hypothesis
- Remove --mic flag (mic is default and only source for now)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pdates

The 3s chunk config was too short for Parakeet — the model needs ~10s of context. Reverted to the library's .streaming preset (11s chunks, 1s hypothesis).

Now shows two types of updates:
- Volatile (hypothesis): shown as an ephemeral line on stderr with \r overwrite. Gives immediate ~1-2s feedback while speaking.
- Confirmed: printed as a permanent line to stdout. Stable, final text after sufficient context.

Also fixes:
- Stream getting stuck on longer utterances (was breaking model state)
- Text format shows live preview on stderr, final on stdout
- JSONL emits both volatile and confirmed (with "confirmed" field)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace SlidingWindowAsrManager (batch TDT in sliding windows, ~11s latency) with the StreamingAsrEngine protocol using Nemotron 560ms:
- True cache-aware streaming: each 560ms chunk inherits full context
- 2.12% WER (better than TDT v3's 2.5% on LibriSpeech)
- Includes punctuation and capitalization
- ~560ms to first text (was ~11s)
- Partial transcript callback for live preview on stderr
- Confirmed text printed to stdout

Architecture:
- Mic audio → appendAudio() → processBufferedAudio() → getPartialTranscript()
- Partial callback fires on every chunk for live preview (\r overwrite on stderr)
- Main loop polls at 20Hz, emits new confirmed text to stdout
- Actor-based state management for thread safety (Swift 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Default: Parakeet TDT v3 via SlidingWindow (25 languages, higher latency)
- --engine nemotron: Nemotron 560ms (English-only, ~560ms latency, punctuation)

Usage:
  scribe stream                    # multilingual (default)
  scribe stream --engine nemotron  # English-only, low latency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Nemotron: retry with cache cleanup on failed model load (fixes partial download)
- Both engines: show download progress messages (not just --verbose)
- README: add streaming section with engine comparison and trade-offs
- README: update performance table with streaming latencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Nemotron engine's partial callback returns the full accumulated transcript each time, which grows and revises. The previous code tried to diff via getPartialTranscript() polling, causing repeated/mixed output.

Fix: track printed length in the StreamState actor. The partial callback fires after each 560ms chunk — we diff to find only the new portion and emit that. Live preview shows the tail of the transcript on stderr (ephemeral, overwritten). New confirmed text goes to stdout.

Also simplified the SlidingWindow engine to only emit to stdout on confirmed text (volatile goes to the stderr preview only).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
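The printed-length diffing described above can be sketched independently of the Swift actor. This is a hypothetical Python analogue, not the actual StreamState code; the class and method names are stand-ins:

```python
class StreamState:
    """Tracks how much of the accumulated transcript has been emitted.

    The engine's partial callback hands over the FULL transcript each
    time; we emit only the suffix that has not been printed yet.
    """

    def __init__(self) -> None:
        self.printed_len = 0

    def new_text(self, full_transcript: str) -> str:
        # The model may revise earlier text; if the transcript shrank,
        # rewind the printed length (simplest recovery policy).
        if len(full_transcript) < self.printed_len:
            self.printed_len = len(full_transcript)
        delta = full_transcript[self.printed_len:]
        self.printed_len = len(full_transcript)
        return delta


state = StreamState()
print(state.new_text("Hello"))        # emits "Hello"
print(state.new_text("Hello world"))  # emits only " world"
```

The rewind branch is one possible answer to the "may discard model revisions" concern raised later in this PR; the real fix in the branch took a different route (using finish() as the source of truth in file mode).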
Adds a test/dev mode that feeds pre-recorded audio files into the streaming pipeline (instead of the microphone), making streaming output deterministic and reproducible. Required to validate streaming quality without manual mic testing.

Stream.swift:
- New --audio-file <path> option (Nemotron engine only for now)
- Reads via FluidAudio.AudioConverter.resampleAudioFile() (16kHz mono Float32)
- Chunks at 4096 samples to mimic the live mic tap
- Calls engine.finish() to flush tail audio before exit
- Refactored Nemotron path: split mic vs file, extracted callback setup
- SlidingWindow path explicitly rejects --audio-file (not yet supported)

eval.py:
- New --mode batch|streaming flag
- Streaming mode dispatches to scribe stream --audio-file --format jsonl
- Parses JSONL deltas and concatenates for WER computation
- Auto-skips MLS Spanish in streaming mode (Nemotron is English-only)
- New "mode" column in CSVs to differentiate batch vs streaming runs

summarize.py:
- Splits proper noun recall section by mode (batch vs streaming)

Baseline results (scribe v0.2.1):

| Dataset    | Batch (Parakeet) | Streaming (Nemotron) |   Delta |
|------------|-----------------:|---------------------:|--------:|
| TED-LIUM   |             6.0% |                23.2% | +17.2pp |
| Earnings-21|            12.9% |                39.9% | +27.0pp |
| PN recall  |            70.8% |                58.4% | -12.4pp |

Streaming is 3-4x worse than batch, with deletions dominating the error budget (TED 1327 D vs 759 I, Earnings-21 3640 D vs 1469 I) — streaming is dropping content. Likely culprits:
1. Delta-text logic in StreamState.getNewText() may discard model revisions
2. engine.finish() may not flush the final partial-transcript callback

This commit establishes the regression baseline. Next phase: investigate the delta-text dropping and re-run the same eval to confirm fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
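The JSONL-to-WER step above ("parses JSONL deltas and concatenates") can be sketched as follows. This is a minimal illustration, not eval.py itself; the field names ("text", "confirmed") are assumptions based on the JSONL format described earlier in this PR:

```python
import json


def concat_confirmed(jsonl_lines):
    """Join confirmed deltas from scribe's JSONL output into a single
    hypothesis string suitable for WER scoring.

    Volatile (unconfirmed) records are skipped: they are ephemeral
    previews that the confirmed records supersede.
    """
    parts = []
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("confirmed"):
            parts.append(rec["text"].strip())
    return " ".join(p for p in parts if p)


lines = [
    '{"text": "hello", "confirmed": true}',
    '{"text": "hello wor", "confirmed": false}',
    '{"text": "world", "confirmed": true}',
]
print(concat_confirmed(lines))  # "hello world"
```

The actual eval then hands this concatenated hypothesis and the reference text to the WER library; any text normalization (casing, punctuation) happens there, which is why emitting multiple small deltas versus one block scores identically.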
Two bugs were causing streaming to drop ~30% of content vs batch:
Bug 1: Discarded finish()'s authoritative return value
-----------------------------------------------------
The previous code accumulated text via the per-chunk callback only and
just logged finalText.count after engine.finish(). Two failure modes:
(a) finish() pads any trailing partial chunk and decodes additional
tokens, but those never reach the callback path.
(b) The callback dispatched its emission via Task { } — those tasks
could still be pending when the program exited, losing output.
Fix: in file mode, do NOT set up the per-chunk callback. Use
engine.finish() as the source of truth and emit it once via a new
emitFinalTranscript() helper. The callback pattern is kept only for
mic mode where live preview is the actual feature.
This alone reduced WER from 23.2% → 13.2% on TED-LIUM.
Bug 2: Used the .nemotron560ms variant
---------------------------------------
NemotronChunkSize.swift comments claim "560ms - same accuracy" as
1120ms, but empirically 560ms is much worse:
- 560ms on TED talk_0: 15.3% WER, 363 deletions
- 1120ms on TED talk_0: 7.5% WER, 133 deletions
The published FluidAudio benchmark (Documentation/Benchmarks.md) only
covers 1120ms (2.51% on LibriSpeech), and the 560ms variant has no
published number. The "same accuracy" claim isn't supported by data.
Fix: switch to .nemotron1120ms. The latency cost (1.12s vs 0.56s)
is acceptable for live use and gives a meaningful accuracy boost.
Also matched the official NemotronTranscribe.swift CLI pattern: feed
the whole file as one AVAudioPCMBuffer instead of chunking into 4096
samples. (No accuracy difference, but matches the reference pattern.)
Final results (scribe v0.2.1, Nemotron 1120ms streaming vs Parakeet TDT v3 batch):
| Dataset | Batch | Streaming | Delta |
|------------|------:|----------:|------:|
| TED-LIUM | 6.0% | 7.2% | +1.2pp |
| Earnings-21| 12.9% | 18.5% | +5.6pp |
| PN recall | 70.8% | 65.7% | -5.1pp |
Streaming is now within striking distance of batch quality.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…changes

Confirms the streaming work in this branch did not affect the batch path:

| Dataset    | Before |  After |
|------------|-------:|-------:|
| TED-LIUM   |   6.0% |   6.0% |
| Earnings-21|  12.9% |  12.9% |
| MLS Spanish|   8.5% |   8.5% |

Earnings-21 proper noun recall also unchanged at 70.8% (126/178 entities). S/D/I counts byte-identical to the prior baseline; only processing_secs varies (wall-clock noise).

CSV schema updated with the new "mode" column (= "batch") and the proper_noun_* columns added in the prior session, so all batch + streaming CSVs now share a uniform schema.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both modes now flow through a single drain loop. The only difference is
the source adapter — file mode pushes one buffer and finishes the
continuation; mic mode keeps yielding from the AVAudioEngine tap until
SIGINT finishes the continuation. This eliminates two parallel code
paths and a class of mic-only bugs.
Architecture (matches FluidAudio's own SlidingWindowAsrManager pattern):
source ──► AsyncStream<AVAudioPCMBuffer> ──► drain loop
├─ engine.appendAudio
├─ engine.processBufferedAudio
├─ engine.getPartialTranscript
└─ emit delta via StreamState
stream ends (file done OR SIGINT)
└────► engine.finish() ──► emit tail
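The drain-loop shape in the diagram above can be illustrated with a Python asyncio analogue (the real code uses Swift's AsyncStream<AVAudioPCMBuffer>; the engine here is a stand-in, and method names mirror but do not reproduce the FluidAudio API):

```python
import asyncio


class FakeEngine:
    """Stand-in for the streaming ASR engine; just counts samples."""

    def __init__(self):
        self.samples = []

    def append_audio(self, buf):       # analogue of engine.appendAudio
        self.samples.extend(buf)

    def process_buffered_audio(self):  # analogue of engine.processBufferedAudio
        pass

    def finish(self):                  # analogue of engine.finish: flush the tail
        return len(self.samples)


async def file_source(queue, chunks):
    # File adapter: push every buffer, then signal end-of-stream with a
    # sentinel — mirroring "pushes one buffer and finishes the continuation".
    for chunk in chunks:
        await queue.put(chunk)
    await queue.put(None)


async def drain_loop(queue, engine):
    # Single consumer shared by mic and file sources. A mic adapter would
    # keep putting buffers until a SIGINT handler enqueues the sentinel.
    while (buf := await queue.get()) is not None:
        engine.append_audio(buf)
        engine.process_buffered_audio()
    return engine.finish()  # runs for BOTH sources, so the tail is never dropped


async def main():
    queue = asyncio.Queue()
    engine = FakeEngine()
    chunks = [[0.0] * 4096, [0.0] * 4096]  # two mic-tap-sized buffers
    total, _ = await asyncio.gather(
        drain_loop(queue, engine), file_source(queue, chunks)
    )
    return total


print(asyncio.run(main()))  # 8192 samples drained, then flushed
```

The key property this sketch demonstrates is the one the commit relies on: because the loop falls through to finish() whenever the stream ends, the tail flush is structural rather than source-specific.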
Specific bugs fixed in mic mode by sharing the proven file pipeline:
1. No more detached Task { } in the per-chunk callback. The drain loop
polls getPartialTranscript() in the main task. No race, no lost
emissions at exit.
2. Darwin.exit(0) → DispatchSource.makeSignalSource(SIGINT). The signal
handler runs on a regular dispatch queue and calls
continuation.finish(). The drain loop falls through to engine.finish()
and emits the tail audio that used to be silently dropped.
3. Mic tap pre-resamples to 16kHz mono Float32 via AudioConverter.
resampleBuffer() (the same converter the file path uses for whole
files), so the engine sees identical input format regardless of source.
Code changes:
- New: runNemotronDrainLoop, feedNemotronFromFile, startMicSource,
emitDelta, emitLivePreview, makeMonoFloat32Buffer, NemotronMicResources
- Removed: setupNemotronCallback, runNemotronFromMic, runNemotronFromFile,
emitFinalTranscript
- runNemotronWithEngine becomes a thin orchestrator that picks the source
adapter and runs the drain loop
Eval verification (same scribe v0.2.1, same audio, same reference):
| Dataset | Streaming WER (before) | Streaming WER (after) |
|------------|-----------------------:|----------------------:|
| TED-LIUM | 7.2% | 7.2% |
| Earnings-21| 18.5% | 18.5% |
Per-file S/D/I counts are byte-identical to the previous baseline. The
unified pipeline now emits multiple deltas per file (one from the drain
poll, one from the finish() tail) instead of one big block, but jiwer's
text normalization makes the result equivalent.
Out of scope:
- runSlidingWindow (multilingual) path stays as-is — still rejects
--audio-file, still uses setupSignalHandler. The SlidingWindow engine
has its own concurrency model and is a separate refactor.
- System audio capture via CATapDescription — plugs into the same
pipeline as a third source adapter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first processBufferedAudio() call triggers CoreML model compilation (~10-20s). Without warmup, this happens on the first real audio chunk, causing all live output to batch up and appear at once when speech pauses or Ctrl+C is pressed. Fix: feed one silent chunk (17920 samples = 1120ms) through the engine during startup, before the mic tap begins. CoreML compiles during the "Warming up..." log message. Then reset() to discard the silence tokens so they don't pollute the real transcript. File-mode eval unchanged: talk_0 still 7.5% WER, S/D/I 72/133/11. The reset() correctly clears warmup state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
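The warmup chunk size above follows directly from the engine's input format. A small sketch of the arithmetic (constants taken from this PR: 16 kHz mono input, 1120ms Nemotron chunks):

```python
SAMPLE_RATE_HZ = 16_000  # engine input format: 16 kHz mono Float32
CHUNK_MS = 1120          # .nemotron1120ms chunk duration

# One full chunk of silence, fed before the mic tap starts so CoreML
# compilation happens during "Warming up..." rather than on real speech.
warmup_samples = SAMPLE_RATE_HZ * CHUNK_MS // 1000
print(warmup_samples)  # 17920, matching the commit message

silence = [0.0] * warmup_samples
```

Using exactly one chunk duration matters: a shorter buffer might not trigger processBufferedAudio()'s decode path, while the subsequent reset() makes any decoded silence tokens harmless.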
Two UX improvements to scribe stream:
1. Text mode now flows naturally without timestamps:
Before: [00:18] . Can you help me
[00:19] with something
After: Can you help me with something?
I just need to say something.
Tokens append inline; newlines on sentence boundaries (. ? !).
JSONL mode unchanged (keeps timestamps for machine consumption).
2. New --save-audio <path> flag records the mic audio to a WAV file.
After the stream ends (Ctrl+C), the saved audio is automatically
re-transcribed with the batch engine (Parakeet TDT v3) which is
significantly more accurate (6% vs 7.2% on TED, 13% vs 18.5% on
Earnings-21). Users get real-time assistance during the meeting
AND a polished transcript at the end.
Usage: scribe stream --engine nemotron --save-audio meeting.wav
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
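The inline-append behavior in point 1 above can be sketched as follows. This is a hypothetical illustration of the formatting rule (newline after . ? !), not the actual Stream.swift code:

```python
def flow_tokens(deltas):
    """Append transcript deltas inline, starting a new line after
    sentence-ending punctuation (. ? !)."""
    out = []
    for delta in deltas:
        delta = delta.strip()
        if not delta:
            continue
        # Separate from the previous token with a space, unless we just
        # started a new line at a sentence boundary.
        if out and not out[-1].endswith("\n"):
            out.append(" ")
        out.append(delta)
        if delta[-1] in ".?!":
            out.append("\n")
    return "".join(out)


text = flow_tokens(
    ["Can you help me", "with something?", "I just need", "to say something."]
)
print(text)
# Can you help me with something?
# I just need to say something.
```

JSONL mode bypasses this entirely, which is why it keeps per-delta timestamps for machine consumption.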
Summary

scribe stream command for live microphone transcription

Status

Draft — streaming not working reliably yet. Known issues:

What works
- scribe stream starts and captures microphone audio
- scribe stream --engine nemotron downloads and loads the Nemotron model

Architecture decisions
- StreamingAsrEngine protocol (true cache-aware streaming)
- SlidingWindowAsrManager (batch model in sliding windows)

Test plan
- --format jsonl produces valid JSON per line
- --output file.txt saves to file

🤖 Generated with Claude Code