Smart prefill: disaggregating prefill across the Mac fleet #74
Draft
Gajesh2007 wants to merge 7 commits into cursor/disaggregated-compute-small-providers-40ff from
Conversation
Adds a new message pair for smart-prefill compression:
prompt_compression_request (coordinator → provider)
prompt_compression_complete (provider → coordinator)
Same E2E envelope as embeddings/rerank: encrypted_body on the request
and encrypted_data on the response under the consumer's session key,
so the coordinator never sees the plaintext prompt or the compressed
result on either leg.
Symmetric Go ↔ Rust types with round-trip tests on both sides:
PromptCompressionRequestBody {compressor_model, prompt, target_ratio,
min_keep_tokens, max_keep_tokens,
preserve_boundaries}
PromptCompressionUsage {original_tokens, compressed_tokens,
total_tokens}
Phase 1 returns the kept tokens as plain text in original order — no
target-engine modifications required, works with every model in the
catalog today. Phase 2 (planned) will return position IDs alongside
the text so the big-tier provider can run sparse prefill against the
original positional schema (cf. arXiv 2603.02631 cross-family
speculative prefill, vllm-mlx#179 prototype, ~5x TTFT at 128k).
Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Three small wires for the new compressor model_type:

* registry.PreferredTiersForModelType now treats 'compressor' the same as 'embedding' and 'rerank' — routing prefers the tiny/small tier so big Macs stay free for memory-bandwidth-bound decode.
* payments.CalculateCompressorCost + DefaultCompressorPrices. Default rate is 4,000 micro-USD per 1M tokens for Qwen3-0.6B (~half of our embedding rate). Net economic impact for the consumer is ~19x positive — see docs/smart-prefill.md.
* seedModelCatalog (coordinator) and fallback_catalog (provider) gain Qwen3-0.6B and Qwen3-1.7B with model_type='compressor'. Cross-family drafts work (arXiv 2603.02631), so a single 0.6B serves the entire big-tier catalog — we don't need per-target compressors.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Standalone smart-prefill endpoint for consumers who want to pre-compress a corpus once and reuse it as a stable system prompt.

Mirrors the embeddings handler 1:1 — same dispatchEmbedding-shaped helper, same retry-with-excludeProviders (3 attempts), same E2E flow, same pre-flight reservation + clamp + refund + free-credit guard. The shared dispatch path (runCompression) is also called from the smart-prefill middleware in a follow-up commit, so retry/billing/encryption are guaranteed identical between the two surfaces.

Tests in compress_test.go:
- TestCompressE2E — full encrypted round-trip through a 16 GB tiny provider
- TestCompressNoFreeCreditWhenBillingDisabled — refund cannot mint balance when billing was never charged
- TestCompressInvalidRatio — 400 on out-of-range target_ratio

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Opt-in compression baked into the chat-completions handler. Consumer
sets either:
"smart_prefill": true (use defaults)
or the object form for overrides:
"smart_prefill": {
"enabled": true,
"compressor_model": "mlx-community/Qwen3-1.7B",
"target_ratio": 0.3,
"min_keep_tokens": 128,
"min_prompt_tokens": 4000,
"preserve_boundaries": true
}
Middleware:
* Strips the smart_prefill field from the body so the provider backend
never sees our extension.
* Picks the longest user/system message (longest = dominates prefill
cost; compressing a 50-token chat turn is wasted overhead).
* Skips silently for short prompts (< 2_000 estimated tokens by default).
* Calls runCompression (shared with /v1/compress) — same retry, billing,
E2E, free-credit guard.
* On failure, refunds the reservation and falls through to full prefill
rather than failing the consumer's chat request. Smart prefill is a
best-effort optimization, never an availability hazard.
* On success, swaps the compressed prompt back into the message and
surfaces stats via response headers:
X-SmartPrefill-Compressor: mlx-community/Qwen3-0.6B
X-SmartPrefill-Original-Tokens: 32000
X-SmartPrefill-Compressed-Tokens: 8000
Tests in compress_test.go:
- TestSmartPrefillMiddlewareSwapsLongestMessage — register a tiny
compressor + a standard chat provider; assert the chat provider
receives the *compressed* prompt and the response carries the
X-SmartPrefill headers
- TestSmartPrefillFallsThroughOnShortPrompt — middleware no-ops below
the min-tokens threshold; chat provider sees the original prompt
Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
CoordinatorClient learns a new event type (PromptCompressionRequest), decrypting the body and capturing the coordinator's ephemeral session pubkey so the proxy can encrypt the compressed prompt back over the session key.

proxy::handle_compression_request forwards to a local HTTP backend on $EIGENINFERENCE_COMPRESSOR_PORT (default embedding_port + 1) that exposes POST /v1/compress with the same body schema, then mirrors the response back over the existing ProviderMessage envelope.

Same launcher contract as the embedding sidecar — if it is not running, compression requests fail with connect-refused and the coordinator routes to another tiny provider (or, for the smart_prefill middleware, falls through to full prefill). A first-party MLX-based sidecar will ship with the next provider bundle.

The compressor is a single forward pass through a small draft LLM (Qwen3-0.6B fits in 8 GB), which gives low-RAM Macs that can't host a quality decoder model their first real production workload.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Two real bugs caught by independent reviewers (PR #74):

1. **smart_prefill field leaked to chat provider on fall-through.** The middleware always strips the smart_prefill field, but the caller only re-marshaled rawBody when compression actually applied. Combined with an explicit max_tokens (which skips the ensureMaxTokensBound re-marshal), this leaked our extension into the request body seen by the chat provider's vllm-mlx backend — and vllm-mlx rejects unknown fields. Fix: always re-marshal after applySmartPrefill returns. Regression test: TestSmartPrefillStripsFieldOnFallThrough — sets max_tokens explicitly and asserts the chat provider's received body has no smart_prefill field.

2. **Reservation not refunded on success-but-no-swap.** When the compressor returned an empty CompressedPrompt (or replaceMessageContent failed), the middleware silently fell through but kept the compression charge on the consumer's ledger — billing them for work they got no benefit from. Fix: factor out a refundCompression closure and call it on every fall-through path that didn't apply the compressed prompt. Regression test: TestSmartPrefillRefundsOnEmptyResult — registers a compressor that returns CompressedPrompt='', asserts the consumer's ledger shows a smart_prefill refund entry.

Plus one cleanup:

3. **max_keep_tokens was defined in PromptCompressionRequestBody but never wired through smart_prefill settings.** Now it parses from the object form and is forwarded to the compressor request body, so consumers can cap the result size when they have a context budget.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Summary
Stacks on top of #71 (the embeddings/rerank disaggregated-compute layer). Adds attention-based prompt compression so low-RAM Macs can absorb a meaningful chunk of the prefill load that would otherwise pin the big-RAM fleet.
A small-tier provider runs a tiny draft LLM (default Qwen3-0.6B, 700 MB), captures attention scores over the consumer's prompt, and returns the top-K% of tokens in order. The big-tier provider then runs normal prefill on a 4× shorter prompt with no engine modifications required. Expected: 2-3× lower TTFT today, 5×+ at 128 K context once we graduate to true sparse prefill via vllm-mlx#179 in phase 2.
This is the answer to "how do we disaggregate prefill across the consumer internet?" The naive answer (ship the KV cache) is dead on arrival — ~8 GB of KV for a 32 K-token Qwen 27B prefill versus a 125 MB/s residential pipe. The right answer is: don't ship KV, ship a shorter prompt. Same compute leverage, roughly five orders of magnitude less bandwidth.
Linked issue
Closes #
What ships
Two surfaces, one shared dispatch path:
- POST /v1/compress — explicit. Useful for pre-compressing a RAG corpus once and storing the compressed chunks.
- smart_prefill field on /v1/chat/completions — opt-in middleware. Coordinator dispatches the longest user/system message to a tiny-tier compressor, swaps the result back in, then routes to the consumer's chosen big model. Surfaces stats via X-SmartPrefill-* response headers.

Both call into the same runCompression helper so retry, billing, and E2E encryption are guaranteed identical.

- Wire protocol (coordinator/internal/protocol/messages.go ↔ provider/src/protocol.rs): prompt_compression_request / prompt_compression_complete
- Tier routing (coordinator/internal/registry/registry.go): the compressor model_type joins embedding/rerank in PreferredTiersForModelType — routes to the tiny/small tier first, falls back to standard
- Catalog (coordinator/cmd/coordinator/main.go + provider/src/main.rs): mlx-community/Qwen3-0.6B (700 MB, 8 GB min RAM) and mlx-community/Qwen3-1.7B (1.8 GB, 8 GB min RAM, higher quality)
- Pricing (coordinator/internal/payments/pricing.go)
- Provider proxy (provider/src/proxy.rs, coordinator.rs, main.rs): handle_compression_request forwards to a local HTTP sidecar at EIGENINFERENCE_COMPRESSOR_PORT (default embedding_port + 1) exposing POST /v1/compress
- Docs (docs/smart-prefill.md) — full design, citing the recent literature: SpecPrefill (NVIDIA/MS 2025), Cross-Family Speculative Prefill (arXiv 2603.02631, March 2026), PrfaaS (arXiv 2604.15039, April 2026), BEAVER (arXiv 2603.19635, March 2026), and the open vllm-mlx#179 prototype on our exact runtime that demonstrates 5.45× TTFT reduction at 128 K.

Review-round-2 fixes
Two independent reviewers caught real bugs in the first revision; commit 6de0a851 fixes all of them:

- smart_prefill field leaked into the chat-provider request body when the middleware fell through AND the consumer set an explicit max_tokens. Fix: re-marshal rawBody after applySmartPrefill — the middleware always strips the field, so the re-marshal must be unconditional too. Regression test: TestSmartPrefillStripsFieldOnFallThrough.
- Reservation not refunded on success-but-no-swap. Fix: a refundCompression closure called from every post-billing fall-through path (compression error, empty result, swap failure). Regression test: TestSmartPrefillRefundsOnEmptyResult.
- max_keep_tokens was defined in the protocol but never plumbed through the middleware settings. Fix: added to smartPrefillSettings, parsed from the object form, forwarded to the compressor request body.

Reviewer 2 also questioned the cleanup pattern in dispatchCompression's decrypt-failure path; verified against the embeddings/rerank canonical flow — it's correct (pending was already removed by handlePromptCompressionComplete before the channel send; decrypt failure only needs SetProviderIdle + RecordJobFailure, both of which are called). Reviewer 1 independently reached the same conclusion.

Test plan
- cd coordinator && go test ./... — all green (8 new tests, including 2 regression tests)
- cd coordinator && gofmt -l . — clean
- cd coordinator && go vet ./... — clean
- rustfmt --check --edition 2024 on the Rust files I touched — clean

New tests:

- TestCompressE2E — full encrypted round-trip on the standalone endpoint through a 16 GB tiny provider
- TestSmartPrefillMiddlewareSwapsLongestMessage — full encrypted round-trip through the middleware: register a tiny compressor + a standard chat provider, assert the chat provider receives the compressed prompt and the response carries the X-SmartPrefill-* headers
- TestSmartPrefillFallsThroughOnShortPrompt — middleware no-ops below the threshold; chat provider sees the original prompt
- TestSmartPrefillStripsFieldOnFallThrough — regression test for review-round-2 fix #1
- TestSmartPrefillRefundsOnEmptyResult — regression test for review-round-2 fix #2, with the billing service active
- TestCompressNoFreeCreditWhenBillingDisabled — same regression test as embeddings (refund cannot mint balance when billing was never charged)
- TestCompressInvalidRatio — input validation
- TestPreferredTiersIncludesCompressor — tier preference wired up
- Round-trip tests in provider/src/protocol.rs

Components touched
Protocol / interface changes
Both sides updated symmetrically; round-trip tests on both sides catch drift. The smart_prefill field on /v1/chat/completions is a new request extension; we strip it from the body before forwarding to the provider, so the existing OpenAI-compatible providers never see it.

No bundle script changes — the compressor sidecar runs at a deterministic port (embedding_port + 1, or EIGENINFERENCE_COMPRESSOR_PORT) so the launcher and the proxy agree without extra IPC.

Notes for reviewers
Why now and why this design:
I asked Exa for the latest literature before writing a line of code. The field has moved past LLMLingua-2 — the new winner is attention-based "Speculative Prefill" using a small draft LLM (NVIDIA/MS 2025), with cross-family drafts proven viable in March 2026. There's an open prototype of exactly this on the vllm-mlx fork our providers run (waybarrios/vllm-mlx#179) that reports 5.45× TTFT reduction on Qwen3.5-122B at 128 K context on an M2 Ultra. So this PR ships the protocol envelope + tier routing + billing now, and the next iteration graduates from text-level compression (phase 1, this PR) to true sparse prefill against the original positional schema (phase 2, follow-up). Clients don't change between phases — the same smart_prefill flag governs both.

What this is NOT: mandatory. Compression is opt-in and best-effort; any consumer who needs the untouched prompt simply leaves smart_prefill off.

Stacking note: This PR's base branch is cursor/disaggregated-compute-small-providers-40ff (PR #71), not master. The diff against #71 is what this PR adds; the diff against master is the union. Once #71 lands, GitHub will auto-rebase this against master.