
Add scribe stream command for live microphone transcription#1

Draft
javiertoledo wants to merge 13 commits into main from feature/stream

Conversation

@javiertoledo

Summary

  • Add scribe stream command for live microphone transcription
  • Two engines: default (Parakeet TDT v3, multilingual, ~11s latency) and Nemotron (English-only, ~560ms latency)
  • README updated with streaming docs and engine trade-offs

Status

Draft — streaming not working reliably yet. Known issues:

  • Nemotron engine: output shows mixed/repeated text from accumulated transcript diffing
  • Default engine: gets stuck when switching languages mid-stream
  • Default engine: ~11s latency (inherent to SlidingWindow approach with batch model)
  • No system audio capture yet (mic only)

What works

  • scribe stream starts and captures microphone audio
  • scribe stream --engine nemotron downloads and loads the Nemotron model
  • Partial text preview on stderr
  • Both text and JSONL output formats
  • Model download retry on partial/corrupt cache

Architecture decisions

  • Nemotron 560ms via StreamingAsrEngine protocol (true cache-aware streaming)
  • Parakeet TDT v3 via SlidingWindowAsrManager (batch model in sliding windows)
  • Actor-based state for thread safety (Swift 6 sendability)

Test plan

  • Nemotron: speak English continuously, verify clean incremental output
  • Default: speak Spanish, verify transcription appears
  • Default: speak English then Spanish, verify no hang
  • --format jsonl produces valid JSON per line
  • --output file.txt saves to file
  • Ctrl+C exits cleanly

🤖 Generated with Claude Code

javiertoledo and others added 7 commits April 9, 2026 19:22
New command: scribe stream — captures microphone audio and transcribes
in real-time using FluidAudio's SlidingWindowAsrManager (Parakeet).

Features:
- Live transcription from microphone with timestamps
- Text and JSONL output formats
- Save to file with --output
- Ctrl+C to stop cleanly
- Uses streaming ASR config (11s chunks, 1s hypothesis updates)

Usage:
  scribe stream                      # listen and transcribe
  scribe stream --format jsonl       # JSONL output
  scribe stream --output meeting.txt # save to file

System audio capture (--source) will be added in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce chunk size from 11s to 3s for ~3-4s latency (was ~13s)
- Lower confirmation threshold from 0.8 to 0.5 for faster output
- Reduce right context from 2s to 0.5s
- Fix speaker label: remove "Others" tag for mic input
- Add text dedup to avoid repeating same hypothesis
- Remove --mic flag (mic is default and only source for now)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pdates

The 3s chunk config was too short for Parakeet — model needs ~10s context.
Reverted to the library's .streaming preset (11s chunks, 1s hypothesis).

Now shows two types of updates:
- Volatile (hypothesis): shown as ephemeral line on stderr with \r overwrite
  Gives immediate ~1-2s feedback while speaking
- Confirmed: printed as permanent line to stdout
  Stable, final text after sufficient context

Also fixes:
- Stream getting stuck on longer utterances (was breaking model state)
- Text format shows live preview on stderr, final on stdout
- JSONL emits both volatile and confirmed (with "confirmed" field)
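A hedged sketch of what one JSONL line might look like under this scheme. Only the "confirmed" field name comes from the commit message above; the other keys ("text", "start") are illustrative assumptions about scribe's schema, not its actual output format.

```python
import json

# Hypothetical JSONL emission: one JSON object per line, flagged volatile
# or confirmed. Only "confirmed" is named in the PR; other keys are assumed.
def emit_jsonl(text: str, start_secs: float, confirmed: bool) -> str:
    return json.dumps({"text": text, "start": start_secs, "confirmed": confirmed})

print(emit_jsonl("Can you help me", 18.0, False))
print(emit_jsonl("Can you help me with something?", 18.0, True))
```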

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace SlidingWindowAsrManager (batch TDT in sliding windows, ~11s latency)
with StreamingAsrEngine protocol using Nemotron 560ms:

- True cache-aware streaming: each 560ms chunk inherits full context
- 2.12% WER (better than TDT v3's 2.5% on LibriSpeech)
- Includes punctuation and capitalization
- ~560ms to first text (was ~11s)
- Partial transcript callback for live preview on stderr
- Confirmed text printed to stdout

Architecture:
- Mic audio → appendAudio() → processBufferedAudio() → getPartialTranscript()
- Partial callback fires on every chunk for live preview (\r overwrite on stderr)
- Main loop polls at 20Hz, emits new confirmed text to stdout
- Actor-based state management for thread safety (Swift 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Default: Parakeet TDT v3 via SlidingWindow (25 languages, higher latency)
- --engine nemotron: Nemotron 560ms (English-only, ~560ms latency, punctuation)

Usage:
  scribe stream                    # multilingual (default)
  scribe stream --engine nemotron  # English-only, low latency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Nemotron: retry with cache cleanup on failed model load (fixes partial download)
- Both engines: show download progress messages (not just --verbose)
- README: add streaming section with engine comparison and trade-offs
- README: update performance table with streaming latencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Nemotron engine's partial callback returns the full accumulated
transcript each time, which grows and revises. The previous code tried
to diff via getPartialTranscript() polling, causing repeated/mixed output.

Fix: Track printed length in StreamState actor. The partial callback
fires after each 560ms chunk — we diff to find only the new portion
and emit that. Live preview shows the tail of the transcript on stderr
(ephemeral, overwritten). New confirmed text goes to stdout.

Also simplified SlidingWindow engine to only emit to stdout on confirmed
text (volatile goes to stderr preview only).
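The printed-length diffing described above can be modeled in a few lines. This is an illustrative Python sketch, not the actual Swift actor's API: the callback hands us the full accumulated transcript each time, and we emit only the portion past what was already printed.

```python
# Sketch of the StreamState-style delta tracking: remember how many
# characters were already emitted and return only the new tail.
# If the model revises earlier text (transcript shrinks), reset the
# printed length rather than emit stale content.
class DeltaTracker:
    def __init__(self) -> None:
        self.printed = 0

    def new_text(self, full_transcript: str) -> str:
        if len(full_transcript) < self.printed:
            self.printed = len(full_transcript)
            return ""
        delta = full_transcript[self.printed:]
        self.printed = len(full_transcript)
        return delta

tracker = DeltaTracker()
print(tracker.new_text("Hello"))        # "Hello"
print(tracker.new_text("Hello world"))  # " world"
```

Note that this length-based diff only ever appends; as the later eval commit observes, it can silently discard mid-transcript revisions from the model.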

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
javiertoledo and others added 6 commits April 9, 2026 20:02
Adds a test/dev mode that feeds pre-recorded audio files into the
streaming pipeline (instead of microphone), making streaming output
deterministic and reproducible. Required to validate streaming quality
without manual mic testing.

Stream.swift:
- New --audio-file <path> option (Nemotron engine only for now)
- Reads via FluidAudio.AudioConverter.resampleAudioFile() (16kHz mono Float32)
- Chunks at 4096 samples to mimic the live mic tap
- Calls engine.finish() to flush tail audio before exit
- Refactored Nemotron path: split mic vs file, extracted callback setup
- SlidingWindow path explicitly rejects --audio-file (not yet supported)

eval.py:
- New --mode batch|streaming flag
- Streaming mode dispatches to scribe stream --audio-file --format jsonl
- Parses JSONL deltas and concatenates for WER computation
- Auto-skips MLS Spanish in streaming mode (Nemotron is English-only)
- New "mode" column in CSVs to differentiate batch vs streaming runs
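The JSONL-concatenation step in eval.py could be sketched as follows. The field names ("text", "confirmed") are assumptions about scribe's JSONL schema; the idea is simply to keep the confirmed deltas and join them into the hypothesis string handed to the WER computation.

```python
import json

# Hedged sketch: parse one JSON object per line of `scribe stream
# --format jsonl` output, keep confirmed deltas, and concatenate them
# into the hypothesis text for WER scoring.
def hypothesis_from_jsonl(jsonl: str) -> str:
    parts = []
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("confirmed"):
            parts.append(rec["text"])
    return " ".join(parts).strip()
```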

summarize.py:
- Splits proper noun recall section by mode (batch vs streaming)

Baseline results (scribe v0.2.1):

| Dataset    | Batch (Parakeet) | Streaming (Nemotron) | Delta   |
|------------|-----------------:|---------------------:|--------:|
| TED-LIUM   | 6.0%             | 23.2%                | +17.2pp |
| Earnings-21| 12.9%            | 39.9%                | +27.0pp |
| PN recall  | 70.8%            | 58.4%                | -12.4pp |

Streaming is 3-4x worse than batch with deletions dominating the error
budget (TED 1327 D vs 759 I, Earnings-21 3640 D vs 1469 I) — streaming
is dropping content. Likely culprits:
1. Delta-text logic in StreamState.getNewText() may discard model revisions
2. engine.finish() may not flush the final partial-transcript callback

This commit establishes the regression baseline. Next phase: investigate
the delta-text dropping and re-run the same eval to confirm fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs were causing streaming to drop ~30% of content vs batch:

Bug 1: Discarded finish()'s authoritative return value
-----------------------------------------------------
The previous code accumulated text via the per-chunk callback only and
just logged finalText.count after engine.finish(). Two failure modes:
  (a) finish() pads any trailing partial chunk and decodes additional
      tokens, but those never reach the callback path.
  (b) The callback dispatched its emission via Task { } — those tasks
      could still be pending when the program exited, losing output.

Fix: in file mode, do NOT set up the per-chunk callback. Use
engine.finish() as the source of truth and emit it once via a new
emitFinalTranscript() helper. The callback pattern is kept only for
mic mode where live preview is the actual feature.

This alone reduced WER from 23.2% → 13.2% on TED-LIUM.

Bug 2: Used the .nemotron560ms variant
---------------------------------------
NemotronChunkSize.swift comments claim "560ms - same accuracy" as
1120ms, but empirically 560ms is much worse:
  - 560ms on TED talk_0: 15.3% WER, 363 deletions
  - 1120ms on TED talk_0:  7.5% WER, 133 deletions

The published FluidAudio benchmark (Documentation/Benchmarks.md) only
covers 1120ms (2.51% on LibriSpeech), and the 560ms variant has no
published number. The "same accuracy" claim isn't supported by data.

Fix: switch to .nemotron1120ms. The latency cost (1.12s vs 0.56s)
is acceptable for live use and gives a meaningful accuracy boost.

Also matched the official NemotronTranscribe.swift CLI pattern: feed
the whole file as one AVAudioPCMBuffer instead of chunking into 4096
samples. (No accuracy difference, but matches the reference pattern.)

Final results (scribe v0.2.1, Nemotron 1120ms streaming vs Parakeet TDT v3 batch):

| Dataset    | Batch | Streaming | Delta |
|------------|------:|----------:|------:|
| TED-LIUM   |  6.0% |      7.2% | +1.2pp |
| Earnings-21| 12.9% |     18.5% | +5.6pp |
| PN recall  | 70.8% |     65.7% | -5.1pp |

Streaming is now within striking distance of batch quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…changes

Confirms the streaming work in this branch did not affect the batch path:

| Dataset    | Before | After  |
|------------|-------:|-------:|
| TED-LIUM   |   6.0% |   6.0% |
| Earnings-21|  12.9% |  12.9% |
| MLS Spanish|   8.5% |   8.5% |

Earnings-21 proper noun recall also unchanged at 70.8% (126/178 entities).
S/D/I counts byte-identical to the prior baseline; only processing_secs
varies (wall-clock noise).

CSV schema updated with the new "mode" column (= "batch") and the
proper_noun_* columns added in the prior session, so all batch + streaming
CSVs now share a uniform schema.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both modes now flow through a single drain loop. The only difference is
the source adapter — file mode pushes one buffer and finishes the
continuation; mic mode keeps yielding from the AVAudioEngine tap until
SIGINT finishes the continuation. This eliminates two parallel code
paths and a class of mic-only bugs.

Architecture (matches FluidAudio's own SlidingWindowAsrManager pattern):

  source ──► AsyncStream<AVAudioPCMBuffer> ──► drain loop
                                                ├─ engine.appendAudio
                                                ├─ engine.processBufferedAudio
                                                ├─ engine.getPartialTranscript
                                                └─ emit delta via StreamState
  stream ends (file done OR SIGINT)
       └────► engine.finish() ──► emit tail
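The unified pipeline above can be modeled in Python with an async stream and a stand-in engine (the real engine is FluidAudio's Swift API; everything named here is illustrative). Both source adapters produce the same stream of buffers; one drain loop feeds the engine and emits deltas; `finish()` flushes the tail once the stream ends.

```python
import asyncio

# Stand-in for the streaming engine: accumulates "audio" (strings here)
# and exposes a growing transcript. Not FluidAudio's API.
class FakeEngine:
    def __init__(self) -> None:
        self.transcript = ""

    def append_audio(self, buf: str) -> None:
        self.transcript += buf

    def partial_transcript(self) -> str:
        return self.transcript

    def finish(self) -> str:
        return self.transcript + "."

async def file_source(chunks):
    # File mode: push the buffers, then let the stream end.
    for chunk in chunks:
        yield chunk

async def drain(source, engine) -> list:
    emitted, printed = [], 0
    async for buf in source:
        engine.append_audio(buf)
        partial = engine.partial_transcript()
        emitted.append(partial[printed:])  # emit only the new delta
        printed = len(partial)
    final = engine.finish()                # stream ended: flush the tail
    emitted.append(final[printed:])
    return emitted

deltas = asyncio.run(drain(file_source(["hello ", "world"]), FakeEngine()))
print("".join(deltas))  # hello world.
```

Mic mode would differ only in the source adapter: the generator keeps yielding tap buffers until a SIGINT handler finishes the stream, and the same drain loop then falls through to `finish()`.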

Specific bugs fixed in mic mode by sharing the proven file pipeline:

1. No more detached Task { } in the per-chunk callback. The drain loop
   polls getPartialTranscript() in the main task. No race, no lost
   emissions at exit.

2. Darwin.exit(0) → DispatchSource.makeSignalSource(SIGINT). The signal
   handler runs on a regular dispatch queue and calls
   continuation.finish(). The drain loop falls through to engine.finish()
   and emits the tail audio that used to be silently dropped.

3. Mic tap pre-resamples to 16kHz mono Float32 via AudioConverter.
   resampleBuffer() (the same converter the file path uses for whole
   files), so the engine sees identical input format regardless of source.

Code changes:
- New: runNemotronDrainLoop, feedNemotronFromFile, startMicSource,
  emitDelta, emitLivePreview, makeMonoFloat32Buffer, NemotronMicResources
- Removed: setupNemotronCallback, runNemotronFromMic, runNemotronFromFile,
  emitFinalTranscript
- runNemotronWithEngine becomes a thin orchestrator that picks the source
  adapter and runs the drain loop

Eval verification (same scribe v0.2.1, same audio, same reference):

| Dataset    | Streaming WER (before) | Streaming WER (after) |
|------------|-----------------------:|----------------------:|
| TED-LIUM   |                   7.2% |                  7.2% |
| Earnings-21|                  18.5% |                 18.5% |

Per-file S/D/I counts are byte-identical to the previous baseline. The
unified pipeline now emits multiple deltas per file (one from the drain
poll, one from the finish() tail) instead of one big block, but jiwer's
text normalization makes the result equivalent.

Out of scope:
- runSlidingWindow (multilingual) path stays as-is — still rejects
  --audio-file, still uses setupSignalHandler. The SlidingWindow engine
  has its own concurrency model and is a separate refactor.
- System audio capture via CATapDescription — plugs into the same
  pipeline as a third source adapter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first processBufferedAudio() call triggers CoreML model compilation
(~10-20s). Without warmup, this happens on the first real audio chunk,
causing all live output to batch up and appear at once when speech
pauses or Ctrl+C is pressed.

Fix: feed one silent chunk (17920 samples = 1120ms) through the engine
during startup, before the mic tap begins. CoreML compiles during the
"Warming up..." log message. Then reset() to discard the silence tokens
so they don't pollute the real transcript.
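The 17920-sample figure follows directly from the engine's expected 16 kHz mono input; a quick arithmetic check:

```python
# Warmup chunk size: 1120 ms of silence at the 16 kHz sample rate
# the engine expects.
sample_rate_hz = 16_000
chunk_ms = 1120
samples = sample_rate_hz * chunk_ms // 1000
print(samples)  # 17920
```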

File-mode eval unchanged: talk_0 still 7.5% WER, S/D/I 72/133/11.
The reset() correctly clears warmup state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two UX improvements to scribe stream:

1. Text mode now flows naturally without timestamps:
   Before: [00:18] . Can you help me
           [00:19] with something
   After:  Can you help me with something?
           I just need to say something.
   Tokens append inline; newlines on sentence boundaries (. ? !).
   JSONL mode unchanged (keeps timestamps for machine consumption).

2. New --save-audio <path> flag records the mic audio to a WAV file.
   After the stream ends (Ctrl+C), the saved audio is automatically
   re-transcribed with the batch engine (Parakeet TDT v3) which is
   significantly more accurate (6% vs 7.2% on TED, 13% vs 18.5% on
   Earnings-21). Users get real-time assistance during the meeting
   AND a polished transcript at the end.

   Usage: scribe stream --engine nemotron --save-audio meeting.wav
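The inline text flow in (1) can be sketched as follows; this is an illustration of the rule (tokens append on one line, newline after sentence-final punctuation), not the actual Swift implementation.

```python
# Append tokens inline; insert a newline after sentence-final
# punctuation (. ? !), otherwise a space between tokens.
def flow(tokens: list) -> str:
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok.endswith((".", "?", "!")):
            out.append("\n")
        elif i + 1 < len(tokens):
            out.append(" ")
    return "".join(out)

print(flow(["Can", "you", "help", "me", "with", "something?",
            "I", "just", "need", "to", "say", "something."]))
```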

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>