## Context

This issue tracks enhancements based on feedback from ReturningTarzan on Reddit:

> The main thing about tokenization is always correctness. Throughput is nice but secondary.

While splintr is already fast and functional, these enhancements will ensure production-grade correctness and robustness.
## Status Summary

### Already Implemented ✅

| Feature | Status | Reference |
|---------|--------|-----------|
| Control tokens (multiple BPE runs) | ✅ Implemented | `src/core/tokenizer.rs:863-899` - `encode_with_special()` splits at special tokens and runs BPE on each segment separately |
| Decoding to byte strings | ✅ Implemented | `src/core/tokenizer.rs:905` - `decode_bytes(&[u32]) -> Vec<u8>` |
| Buffer/queue for incomplete chars | ✅ Implemented | `src/core/streaming.rs` - `StreamingDecoder` and `ByteLevelStreamingDecoder` (commit `9e45c14`) |
| Basic test coverage | ✅ Implemented | ~1,537 lines in `tests/` covering 4 tokenizers (commit `d6d836f`) |
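The incomplete-char buffering that the streaming decoders provide can be illustrated with a minimal sketch. The `Utf8Buffer` below is a simplified stand-in, not splintr's actual `StreamingDecoder`: bytes arrive in arbitrary chunks, and only complete UTF-8 characters are emitted while any trailing partial sequence is held back until the next chunk completes it.

```rust
// Minimal sketch of buffering incomplete UTF-8 sequences during streaming
// decode. A real decoder would also replace genuinely invalid bytes with
// U+FFFD; this sketch only handles the incomplete-suffix case.
struct Utf8Buffer {
    pending: Vec<u8>,
}

impl Utf8Buffer {
    fn new() -> Self {
        Utf8Buffer { pending: Vec::new() }
    }

    /// Append a chunk of bytes and return the longest valid UTF-8 prefix,
    /// keeping any incomplete trailing sequence buffered for later.
    fn push(&mut self, chunk: &[u8]) -> String {
        self.pending.extend_from_slice(chunk);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                let out = s.to_string();
                self.pending.clear();
                out
            }
            Err(e) => {
                // valid_up_to() marks the end of the longest valid prefix.
                let valid = e.valid_up_to();
                let out = String::from_utf8(self.pending[..valid].to_vec()).unwrap();
                self.pending.drain(..valid);
                out
            }
        }
    }
}

fn main() {
    let mut buf = Utf8Buffer::new();
    // "é" is 0xC3 0xA9; split it across two chunks.
    assert_eq!(buf.push(&[b'a', 0xC3]), "a");
    assert_eq!(buf.push(&[0xA9, b'b']), "éb");
}
```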
## Proposed Enhancements

### 1. Comprehensive Correctness Testing

**Priority:** HIGH | **Complexity:** MEDIUM

**Current State:** Integration tests cover basic functionality and exact token IDs (~1,537 lines).

**Gaps:**

- No fuzz or property-based testing (`cargo-fuzz` or `proptest`)

**Why it matters:** LLMs are robust to small differences, but those differences can accumulate and silently degrade performance over large datasets.
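The kind of property `cargo-fuzz` or `proptest` would exercise can be sketched by hand. The `ToyTokenizer` below is an illustrative stand-in (one token per byte), not splintr's API; the property under test is that `decode_bytes(encode(s))` reproduces the input bytes exactly.

```rust
// Hedged sketch of a round-trip correctness property. A fuzzer or
// property-test harness would generate the inputs automatically; here a
// few hand-picked edge cases stand in for generated data.
struct ToyTokenizer;

impl ToyTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(|b| b as u32).collect()
    }
    fn decode_bytes(&self, ids: &[u32]) -> Vec<u8> {
        ids.iter().map(|&id| id as u8).collect()
    }
}

/// The round-trip property: decoding the encoding must give back the input.
fn round_trips(tok: &ToyTokenizer, text: &str) -> bool {
    tok.decode_bytes(&tok.encode(text)) == text.as_bytes()
}

fn main() {
    let tok = ToyTokenizer;
    // Edge cases a fuzzer tends to find: empty input, multi-byte chars,
    // emoji with modifiers, combining marks, special-token-like strings.
    for case in ["", "héllo", "👍🏽", "a\u{0301}", "<|endoftext|>"] {
        assert!(round_trips(&tok, case), "round-trip failed for {:?}", case);
    }
}
```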
### 2. Trimming/Padding/Normalization for Added Tokens

**Priority:** MEDIUM | **Complexity:** LOW-MEDIUM

**Current State:** No normalization layer exists.

**Tasks:**

- Add a `Normalizer` trait and implementations
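One possible shape for the proposed `Normalizer` trait is sketched below. The trait name comes from the task list above; the concrete implementations (`Trim`, `Sequence`) and the method signature are assumptions, not a final design.

```rust
// Hedged sketch of a normalization layer for added tokens; names and
// structure are illustrative only.
trait Normalizer {
    fn normalize(&self, text: &str) -> String;
}

/// Strips leading/trailing whitespace, e.g. around added-token text.
struct Trim;
impl Normalizer for Trim {
    fn normalize(&self, text: &str) -> String {
        text.trim().to_string()
    }
}

/// Applies a chain of normalizers in order.
struct Sequence(Vec<Box<dyn Normalizer>>);
impl Normalizer for Sequence {
    fn normalize(&self, text: &str) -> String {
        self.0.iter().fold(text.to_string(), |t, n| n.normalize(&t))
    }
}

fn main() {
    let chain = Sequence(vec![Box::new(Trim)]);
    assert_eq!(chain.normalize("  <pad>  "), "<pad>");
}
```

A trait-object chain like `Sequence` keeps the layer composable, so padding or Unicode normalization could later slot in as additional implementations.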
### 3. Preprocessing/Postprocessing Validation

**Priority:** HIGH | **Complexity:** LOW

**Current State:** Uses PCRE2 regex, ByteLevel preprocessing, and Aho-Corasick; no formal validation against reference implementations yet.

**Tasks:**
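Validation against reference implementations usually takes the form of fixture-based parity tests. The sketch below illustrates the idea under stated assumptions: the expected IDs would in practice be generated offline by a reference tokenizer (e.g. tiktoken or HuggingFace tokenizers) and stored in fixture files, and `encode` here is a one-token-per-byte stand-in for splintr's encoder.

```rust
// Hedged sketch of fixture-based parity testing against a reference
// implementation. The encoder and the inlined fixtures are illustrative
// stand-ins, not splintr's real API or real reference data.
fn encode(text: &str) -> Vec<u32> {
    text.bytes().map(|b| b as u32).collect() // stand-in for the real encoder
}

fn main() {
    // (input, reference token IDs) pairs, as a fixture file would store them.
    let fixtures: &[(&str, &[u32])] = &[
        ("abc", &[97, 98, 99]),
        ("Hi!", &[72, 105, 33]),
    ];
    for (input, expected) in fixtures {
        assert_eq!(&encode(input), expected, "parity mismatch on {:?}", input);
    }
}
```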
### 4. Ergonomic API Improvements for Single Token Decoding

**Priority:** LOW | **Complexity:** LOW

**Current State:** `decode_bytes(&[u32]) -> Vec<u8>` and the streaming decoders work well, but single-token convenience methods could be added.

**Tasks:**

- Add a `decode_token_bytes(u32) -> Option<Vec<u8>>` convenience method
- Add a `decode_token(u32) -> Option<String>` method
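The two proposed methods could be thin wrappers over a vocabulary lookup, roughly as sketched below. The `Tokenizer` struct and its `vocab` field are assumptions for illustration; only the two method signatures come from the task list above.

```rust
// Hedged sketch of the proposed single-token decoding conveniences.
use std::collections::HashMap;

struct Tokenizer {
    vocab: HashMap<u32, Vec<u8>>, // token id -> raw bytes (illustrative)
}

impl Tokenizer {
    /// Raw bytes for a single token, or None for an unknown id.
    fn decode_token_bytes(&self, id: u32) -> Option<Vec<u8>> {
        self.vocab.get(&id).cloned()
    }

    /// UTF-8 text for a single token; None if the id is unknown or the
    /// token is not valid UTF-8 on its own (e.g. it holds only part of a
    /// multi-byte character).
    fn decode_token(&self, id: u32) -> Option<String> {
        self.decode_token_bytes(id)
            .and_then(|b| String::from_utf8(b).ok())
    }
}

fn main() {
    let mut vocab = HashMap::new();
    vocab.insert(1, b"hello".to_vec());
    vocab.insert(2, vec![0xC3]); // first byte of a split multi-byte char
    let tok = Tokenizer { vocab };
    assert_eq!(tok.decode_token(1), Some("hello".to_string()));
    assert_eq!(tok.decode_token_bytes(2), Some(vec![0xC3]));
    assert_eq!(tok.decode_token(2), None); // bytes alone aren't valid UTF-8
}
```

Returning `Option` rather than panicking keeps the byte-level method usable for tokens that only form valid text when combined with their neighbors.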
## Success Criteria

## Resources