Smart prefill: disaggregating prefill across the Mac fleet#74

Draft
Gajesh2007 wants to merge 7 commits into cursor/disaggregated-compute-small-providers-40ff from cursor/smart-prefill-disaggregation-40ff
Conversation


Gajesh2007 commented Apr 19, 2026

Summary

Stacks on top of #71 (the embeddings/rerank disaggregated-compute layer). Adds attention-based prompt compression so low-RAM Macs can absorb a meaningful chunk of the prefill load that would otherwise pin the big-RAM fleet.

A small-tier provider runs a tiny draft LLM (default Qwen3-0.6B, 700 MB), captures attention scores over the consumer's prompt, and returns the top-K% of tokens in order. The big-tier provider then runs normal prefill on a 4× shorter prompt with no engine modifications required. Expected: 2-3× lower TTFT today, 5×+ at 128 K context once we graduate to true sparse prefill via vllm-mlx#179 in phase 2.

This is the answer to "how do we disaggregate prefill across consumer internet?" The naive answer (ship KV cache) is dead — 8 GB per 32 K-token Qwen 27B prefill vs 125 MB/s residential pipe. The right answer is don't ship KV, ship a shorter prompt. Same compute leverage, ~6 orders of magnitude less bandwidth.

Linked issue

Closes #

What ships

Two surfaces, one shared dispatch path:

  1. POST /v1/compress — explicit. Useful for pre-compressing a RAG corpus once and storing the compressed chunks.
  2. smart_prefill field on /v1/chat/completions — opt-in middleware. Coordinator dispatches the longest user/system message to a tiny-tier compressor, swaps the result back in, then routes to the consumer's chosen big model. Surfaces stats via X-SmartPrefill-* response headers.

Both call into the same runCompression helper so retry, billing, and E2E encryption are guaranteed identical.

Wire protocol (coordinator/internal/protocol/messages.go, provider/src/protocol.rs)

  • New message pair: prompt_compression_request / prompt_compression_complete
  • E2E encrypted under the same NaCl box session-key flow as embeddings/chat — coordinator never sees plaintext on either leg
  • Round-trip tests on both sides

Tier routing (coordinator/internal/registry/registry.go)

  • New compressor model_type joins embedding / rerank in PreferredTiersForModelType — routes to tiny/small tier first, falls back to standard

Catalog (coordinator/cmd/coordinator/main.go + provider/src/main.rs)

  • mlx-community/Qwen3-0.6B (700 MB, 8 GB min RAM)
  • mlx-community/Qwen3-1.7B (1.8 GB, 8 GB min RAM, higher quality)
  • Cross-family drafts work (arXiv 2603.02631) → a single 0.6 B serves the entire big-tier catalog. No per-target compressors needed.

Pricing (coordinator/internal/payments/pricing.go)

  • $0.004 / 1 M tokens (Qwen3-0.6B), $0.006 / 1 M (1.7B)
  • Net economic impact for the consumer: ~19× positive — compression cost is dwarfed by the prefill cycles it saves on the big-tier model
  • Same 95 / 5 provider / platform split

Provider proxy (provider/src/proxy.rs, coordinator.rs, main.rs)

  • New handle_compression_request forwards to a local HTTP sidecar at EIGENINFERENCE_COMPRESSOR_PORT (default embedding_port + 1) exposing POST /v1/compress
  • Same E2E session-key flow as embeddings — provider seals the compressed prompt back over the coordinator's session pubkey
  • A first-party MLX-based sidecar will ship with the next provider bundle; until then the proxy will return 502 (connect refused) and the coordinator will route to another tiny provider via the existing retry loop

Docs (docs/smart-prefill.md) — full design, citing the recent literature: SpecPrefill (NVIDIA/MS 2025), Cross-Family Speculative Prefill (arXiv 2603.02631, March 2026), PrfaaS (arXiv 2604.15039, April 2026), BEAVER (arXiv 2603.19635, March 2026), and the open vllm-mlx#179 prototype on our exact runtime that demonstrates 5.45× TTFT reduction at 128 K.

Review-round-2 fixes

Two independent reviewers caught real bugs in the first revision; commit 6de0a851 fixes all of them:

| Bug | Severity | Fix | Regression test |
| --- | --- | --- | --- |
| smart_prefill field leaked into the chat-provider request body when the middleware fell through AND the consumer set an explicit max_tokens | Real (vllm-mlx would reject the unknown field on the fall-through path) | Always re-marshal rawBody after applySmartPrefill — the middleware always strips the field, so the re-marshal must be unconditional too | TestSmartPrefillStripsFieldOnFallThrough |
| Reservation not refunded when compression succeeded but produced an empty result or the swap failed → consumer billed for compression they got no benefit from | Real | Single refundCompression closure called from every post-billing fall-through path (compression error, empty result, swap failure) | TestSmartPrefillRefundsOnEmptyResult |
| max_keep_tokens defined in the protocol but never plumbed through middleware settings | Nit | Added to smartPrefillSettings, parsed from the object form, forwarded to the compressor request body | Covered by structural tests |

Reviewer 2 also questioned the cleanup pattern in dispatchCompression's decrypt-failure path. Verified against the embeddings/rerank canonical flow — it's correct: pending was already removed by handlePromptCompressionComplete before the channel send, so the decrypt failure only needs SetProviderIdle + RecordJobFailure, both of which are called. Reviewer 1 independently reached the same conclusion.

Test plan

  • cd coordinator && go test ./... — all green (8 new tests, including 2 regression tests)
  • cd coordinator && gofmt -l . — clean
  • cd coordinator && go vet ./... — clean
  • rustfmt --check --edition 2024 on the Rust files I touched — clean

New tests:

  • TestCompressE2E — full encrypted round-trip on the standalone endpoint through a 16 GB tiny provider
  • TestSmartPrefillMiddlewareSwapsLongestMessage — full encrypted round-trip through the middleware: register a tiny compressor + a standard chat provider, assert the chat provider receives the compressed prompt and the response carries the X-SmartPrefill-* headers
  • TestSmartPrefillFallsThroughOnShortPrompt — middleware no-ops below the threshold; chat provider sees the original prompt
  • TestSmartPrefillStripsFieldOnFallThrough — regression test for review-round-2 fix 1
  • TestSmartPrefillRefundsOnEmptyResult — regression test for review-round-2 fix 2, with the billing service active
  • TestCompressNoFreeCreditWhenBillingDisabled — same regression test as embeddings (refund cannot mint balance when billing was never charged)
  • TestCompressInvalidRatio — input validation
  • TestPreferredTiersIncludesCompressor — tier preference wired up
  • 4 new Rust protocol round-trip tests in provider/src/protocol.rs

Components touched

  • coordinator (Go)
  • provider (Rust)
  • console-ui (Next.js)
  • image-bridge (Python)
  • app (macOS Swift)
  • enclave (Swift)
  • infra / CI / release
  • docs

Protocol / interface changes

  • Yes — described above and matching side updated

Both sides updated symmetrically; round-trip tests on both sides catch drift. The smart_prefill field on /v1/chat/completions is a new request extension; we strip it from the body before forwarding to the provider so the existing OpenAI-compatible providers never see it.

No bundle script changes — the compressor sidecar runs at a deterministic port (embedding_port + 1 or EIGENINFERENCE_COMPRESSOR_PORT) so the launcher and the proxy agree without extra IPC.

Notes for reviewers

Why now and why this design:

I asked Exa for the latest literature before writing a line of code. The field moved past LLMLingua-2 — the new winner is attention-based "Speculative Prefill" using a small draft LLM (NVIDIA/MS 2025), with cross-family drafts proven viable in March 2026. There's an open prototype of exactly this on the vllm-mlx fork our providers run (waybarrios/vllm-mlx#179) that reports 5.45× TTFT reduction on Qwen3.5-122B at 128 K context on M2 Ultra. So this PR ships the protocol envelope + tier routing + billing now, and the next iteration graduates from text-level compression (phase 1, this PR) to true sparse prefill against the original positional schema (phase 2, follow-up). Clients don't change between phases — the same smart_prefill flag governs both.

What this is NOT:

  • Not bit-exact prefill disaggregation. That requires shipping KV cache, which is bandwidth-bound across consumer internet and dead on arrival.
  • Not speculative decoding. That accelerates decode using a draft on the same machine. Both will be supported, neither replaces the other.
  • Not lossless. Phase 1 drops ~75 % of input tokens. Quality on LongBench/RULER at 4× is consistently >90 % across the cited research, but consumers needing verbatim recall (legal, code, exact-quote retrieval) should leave smart_prefill off.

Stacking note: This PR's base branch is cursor/disaggregated-compute-small-providers-40ff (PR #71), not master. The diff against #71 is what this PR adds; the diff against master is the union. Once #71 lands, GitHub will auto-rebase this against master.


cursoragent and others added 7 commits April 19, 2026 21:11
Adds a new message pair for smart-prefill compression:

  prompt_compression_request  (coordinator → provider)
  prompt_compression_complete (provider → coordinator)

Same E2E envelope as embeddings/rerank: encrypted_body on the request
and encrypted_data on the response under the consumer's session key,
so the coordinator never sees the plaintext prompt or the compressed
result on either leg.

Symmetric Go ↔ Rust types with round-trip tests on both sides:

  PromptCompressionRequestBody {compressor_model, prompt, target_ratio,
                                min_keep_tokens, max_keep_tokens,
                                preserve_boundaries}
  PromptCompressionUsage      {original_tokens, compressed_tokens,
                                total_tokens}

Phase 1 returns the kept tokens as plain text in original order — no
target-engine modifications required, works with every model in the
catalog today. Phase 2 (planned) will return position IDs alongside
the text so the big-tier provider can run sparse prefill against the
original positional schema (cf. arXiv 2603.02631 cross-family
speculative prefill, vllm-mlx#179 prototype, ~5x TTFT at 128k).

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Three small wires for the new compressor model_type:

* registry.PreferredTiersForModelType now treats 'compressor' the same
  as 'embedding' and 'rerank' — routing prefers tiny/small tier so big
  Macs stay free for memory-bandwidth-bound decode.

* payments.CalculateCompressorCost + DefaultCompressorPrices. Default
  rate is 4_000 micro-USD per 1M tokens for Qwen3-0.6B (~half of our
  embedding rate). Net economic impact for the consumer is ~19x
  positive — see docs/smart-prefill.md.

* seedModelCatalog (coordinator) and fallback_catalog (provider) gain
  Qwen3-0.6B and Qwen3-1.7B with model_type='compressor'. Cross-family
  drafts work (arXiv 2603.02631) so a single 0.6B serves the entire
  big-tier catalog — we don't need per-target compressors.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Standalone smart-prefill endpoint for consumers who want to pre-compress
a corpus once and reuse it as a stable system prompt.

Mirrors the embeddings handler 1:1 — same dispatchEmbedding-shaped
helper, same retry-with-excludeProviders (3 attempts), same E2E flow,
same pre-flight reservation + clamp + refund + free-credit guard.

The shared dispatch path (runCompression) is also called from the
smart-prefill middleware in a follow-up commit, so retry/billing/
encryption are guaranteed identical between the two surfaces.

Tests in compress_test.go:
- TestCompressE2E — full encrypted round-trip through a 16 GB tiny
  provider
- TestCompressNoFreeCreditWhenBillingDisabled — refund cannot mint
  balance when billing was never charged
- TestCompressInvalidRatio — 400 on out-of-range target_ratio

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Opt-in compression baked into the chat-completions handler. Consumer
sets either:

  "smart_prefill": true                       (use defaults)

or the object form for overrides:

  "smart_prefill": {
    "enabled": true,
    "compressor_model": "mlx-community/Qwen3-1.7B",
    "target_ratio": 0.3,
    "min_keep_tokens": 128,
    "min_prompt_tokens": 4000,
    "preserve_boundaries": true
  }

Middleware:

* Strips the smart_prefill field from the body so the provider backend
  never sees our extension.
* Picks the longest user/system message (longest = dominates prefill
  cost; compressing a 50-token chat turn is wasted overhead).
* Skips silently for short prompts (< 2_000 estimated tokens by default).
* Calls runCompression (shared with /v1/compress) — same retry, billing,
  E2E, free-credit guard.
* On failure, refunds the reservation and falls through to full prefill
  rather than failing the consumer's chat request. Smart prefill is a
  best-effort optimization, never an availability hazard.
* On success, swaps the compressed prompt back into the message and
  surfaces stats via response headers:

    X-SmartPrefill-Compressor:        mlx-community/Qwen3-0.6B
    X-SmartPrefill-Original-Tokens:   32000
    X-SmartPrefill-Compressed-Tokens: 8000

Tests in compress_test.go:
- TestSmartPrefillMiddlewareSwapsLongestMessage — register a tiny
  compressor + a standard chat provider; assert the chat provider
  receives the *compressed* prompt and the response carries the
  X-SmartPrefill headers
- TestSmartPrefillFallsThroughOnShortPrompt — middleware no-ops below
  the min-tokens threshold; chat provider sees the original prompt

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
CoordinatorClient learns a new event type (PromptCompressionRequest),
decrypting the body and capturing the coordinator's ephemeral session
pubkey so the proxy can encrypt the compressed prompt back over the
session key.

proxy::handle_compression_request forwards to a local HTTP backend on
$EIGENINFERENCE_COMPRESSOR_PORT (default embedding_port + 1) that
exposes POST /v1/compress with the same body schema, then mirrors the
response back over the existing ProviderMessage envelope.

Same launcher contract as the embedding sidecar — if not running,
compression requests fail with connect-refused and the coordinator
routes to another tiny provider (or, for the smart_prefill middleware,
falls through to full prefill).

A first-party MLX-based sidecar will ship with the next provider
bundle. The compressor is a single forward pass through a small draft
LLM (Qwen3-0.6B fits in 8 GB), which gives low-RAM Macs that can't
host a quality decoder model their first real production workload.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Two real bugs caught by independent reviewers (PR #74):

1. **smart_prefill field leaked to chat provider on fall-through.**
   The middleware always strips the smart_prefill field but the caller
   only re-marshaled rawBody when compression actually applied. Combined
   with an explicit max_tokens (which skips the ensureMaxTokensBound
   re-marshal), this leaked our extension into the request body the
   chat provider's vllm-mlx backend sees — and vllm-mlx rejects unknown
   fields. Fix: always re-marshal after applySmartPrefill returns.
   Regression test: TestSmartPrefillStripsFieldOnFallThrough — sets
   max_tokens explicitly and asserts the chat provider's received body
   has no smart_prefill field.

2. **Reservation not refunded on success-but-no-swap.** When the
   compressor returned an empty CompressedPrompt (or replaceMessageContent
   failed), the middleware silently fell through but kept the
   compression charge on the consumer's ledger — billing them for work
   they got no benefit from. Fix: factor out a refundCompression
   closure and call it on every fall-through path that didn't apply the
   compressed prompt. Regression test: TestSmartPrefillRefundsOnEmptyResult
   — registers a compressor that returns CompressedPrompt='', asserts
   the consumer's ledger shows a smart_prefill refund entry.

Plus one cleanup:

3. **max_keep_tokens was defined in PromptCompressionRequestBody but
   never wired through smart_prefill settings.** Now it parses from the
   object form and is forwarded to the compressor request body, so
   consumers can cap the result size when they have a context budget.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>