
fix(stt): support tokenizer.json for cohere-transcribe-03-2026 (mlx_audio ≤ 0.4.2 compat) #24

Open
guglxni wants to merge 1 commit into Layr-Labs:master from guglxni:fix/cohere-asr-tokenizer-json

Conversation


@guglxni guglxni commented Apr 15, 2026

Summary

  • mlx_audio ≤ 0.4.2 post_load_hook hardcodes tokenizer.model (SentencePiece), but CohereLabs/cohere-transcribe-03-2026 ships only tokenizer.json (HuggingFace fast tokenizer), causing a crash at server startup.
  • Adds _CohereAsrTokenizerHF — a drop-in shim over HuggingFace tokenizers implementing the same interface (eos_token_id, build_prompt_tokens, batch_decode) as the existing CohereAsrTokenizer.
  • Monkey-patches Model.post_load_hook inside load_model() to probe tokenizer.model first, then fall back to tokenizer.json — same pattern already used for load_buffers_from_checkpoint.

Root cause

```python
# mlx_audio/stt/models/cohere_asr/cohere_asr.py  line 705
tokenizer_path = model_path / "tokenizer.model"   # ← always SentencePiece
model._tokenizer = CohereAsrTokenizer(str(tokenizer_path), ...)
# CohereAsrTokenizer.__init__ calls spm.SentencePieceProcessor().load(path)
# → OSError / RuntimeError because the file doesn't exist
```

CohereLabs/cohere-transcribe-03-2026 (released March 2026) moved to the HuggingFace tokenizers JSON format; mlx_audio 0.4.2 was not updated to match.

Test plan

  • python stt_server.py --model CohereLabs/cohere-transcribe-03-2026 --port 8101 starts without error
  • POST /v1/audio/transcriptions returns a transcript for a WAV file
  • Models that still ship tokenizer.model (SentencePiece) continue to load via the original path

Notes

The R2 bundle ships mlx_audio 0.2.7 (no cohere_asr module at all). This fix is only exercised on installs that have upgraded to 0.4.2 and are targeting cohere-transcribe-03-2026. A proper upstream fix in mlx_audio would be preferable long-term.


vercel bot commented Apr 15, 2026

@guglxni is attempting to deploy a commit to the EigenLabs Team on Vercel.

A member of the Team first needs to authorize it.

mlx_audio ≤ 0.4.2's post_load_hook hardcodes a SentencePiece
tokenizer.model lookup, but CohereLabs/cohere-transcribe-03-2026 ships
only tokenizer.json (HuggingFace fast tokenizer format).

Add _CohereAsrTokenizerHF — a drop-in shim backed by `tokenizers`
that implements the same interface (eos_token_id, build_prompt_tokens,
batch_decode) as CohereAsrTokenizer. Monkey-patch post_load_hook in
load_model() to probe for tokenizer.model first, then fall back to
tokenizer.json, following the same pattern already used for
load_buffers_from_checkpoint.

The R2 bundle ships mlx_audio 0.2.7 (no cohere_asr at all), so this
fix is only needed on installs that have upgraded to 0.4.2.
guglxni force-pushed the fix/cohere-asr-tokenizer-json branch from 5127d22 to 05757e7 on April 15, 2026 at 14:38

guglxni commented Apr 15, 2026

@Gajesh2007 could you approve the workflow run for this PR? Thanks!
