Smart prefill: disaggregating prefill across the Mac fleet#74

Draft
Gajesh2007 wants to merge 7 commits into cursor/disaggregated-compute-small-providers-40ff from cursor/smart-prefill-disaggregation-40ff
Conversation


Gajesh2007 commented Apr 19, 2026

Summary

Stacks on top of #71 (the embeddings/rerank disaggregated-compute layer). Adds attention-based prompt compression so low-RAM Macs can absorb a meaningful chunk of the prefill load that would otherwise pin the big-RAM fleet.

A small-tier provider runs a tiny draft LLM (default Qwen3-0.6B, 700 MB), captures attention scores over the consumer's prompt, and returns the top-K% of tokens in order. The big-tier provider then runs normal prefill on a 4× shorter prompt with no engine modifications required. Expected: 2-3× lower TTFT today, 5×+ at 128 K context once we graduate to true sparse prefill via vllm-mlx#179 in phase 2.

This is the answer to "how do we disaggregate prefill across consumer internet?" The naive answer (ship KV cache) is dead — 8 GB per 32 K-token Qwen 27B prefill vs 125 MB/s residential pipe. The right answer is don't ship KV, ship a shorter prompt. Same compute leverage, ~6 orders of magnitude less bandwidth.

Linked issue

Closes #

What ships

Two surfaces, one shared dispatch path:

  1. POST /v1/compress — explicit. Useful for pre-compressing a RAG corpus once and storing the compressed chunks.
  2. smart_prefill field on /v1/chat/completions — opt-in middleware. Coordinator dispatches the longest user/system message to a tiny-tier compressor, swaps the result back in, then routes to the consumer's chosen big model. Surfaces stats via X-SmartPrefill-* response headers.

Both call into the same runCompression helper so retry, billing, and E2E encryption are guaranteed identical.

Wire protocol (coordinator/internal/protocol/messages.go, provider/src/protocol.rs)

  • New message pair: prompt_compression_request / prompt_compression_complete
  • E2E encrypted under the same NaCl box session-key flow as embeddings/chat — coordinator never sees plaintext on either leg
  • Round-trip tests on both sides

Tier routing (coordinator/internal/registry/registry.go)

  • New compressor model_type joins embedding / rerank in PreferredTiersForModelType — routes to tiny/small tier first, falls back to standard

Catalog (coordinator/cmd/coordinator/main.go + provider/src/main.rs)

  • mlx-community/Qwen3-0.6B (700 MB, 8 GB min RAM)
  • mlx-community/Qwen3-1.7B (1.8 GB, 8 GB min RAM, higher quality)
  • Cross-family drafts work (arXiv 2603.02631) → a single 0.6 B serves the entire big-tier catalog. No per-target compressors needed.

Pricing (coordinator/internal/payments/pricing.go)

  • $0.004 / 1 M tokens (Qwen3-0.6B), $0.006 / 1 M (1.7B)
  • Net economic impact for the consumer: ~19× positive — compression cost is dwarfed by the prefill cycles it saves on the big-tier model
  • Same 95 / 5 provider / platform split

Provider proxy (provider/src/proxy.rs, coordinator.rs, main.rs)

  • New handle_compression_request forwards to a local HTTP sidecar at EIGENINFERENCE_COMPRESSOR_PORT (default embedding_port + 1) exposing POST /v1/compress
  • Same E2E session-key flow as embeddings — provider seals the compressed prompt back over the coordinator's session pubkey
  • A first-party MLX-based sidecar will ship with the next provider bundle; until then the proxy will return 502 (connect refused) and the coordinator will route to another tiny provider via the existing retry loop

Docs (docs/smart-prefill.md) — full design, citing the recent literature: SpecPrefill (NVIDIA/MS 2025), Cross-Family Speculative Prefill (arXiv 2603.02631, March 2026), PrfaaS (arXiv 2604.15039, April 2026), BEAVER (arXiv 2603.19635, March 2026), and the open vllm-mlx#179 prototype on our exact runtime that demonstrates 5.45× TTFT reduction at 128 K.

Review-round-2 fixes

Two independent reviewers caught real bugs in the first revision; commit 6de0a851 fixes all of them:

| Bug | Severity | Fix | Regression test |
| --- | --- | --- | --- |
| smart_prefill field leaked into the chat-provider request body when the middleware fell through AND the consumer set an explicit max_tokens | Real (vllm-mlx would reject the unknown field on the fall-through path) | Always re-marshal rawBody after applySmartPrefill — the middleware always strips the field, so the re-marshal must be unconditional too | TestSmartPrefillStripsFieldOnFallThrough |
| Reservation not refunded when compression succeeded but produced an empty result or the swap failed → consumer billed for compression they got no benefit from | Real | Single refundCompression closure called from every post-billing fall-through path (compression error, empty result, swap failure) | TestSmartPrefillRefundsOnEmptyResult |
| max_keep_tokens defined in the protocol but never plumbed through middleware settings | Nit | Added to smartPrefillSettings, parsed from the object form, forwarded to the compressor request body | Covered by structural tests |

Reviewer 2 also questioned the cleanup pattern in dispatchCompression's decrypt-failure path. Verified against the embeddings/rerank canonical flow — it's correct: pending was already removed by handlePromptCompressionComplete before the channel send, so the decrypt failure only needs SetProviderIdle + RecordJobFailure, both of which are called. Reviewer 1 independently reached the same conclusion.

Test plan

  • cd coordinator && go test ./... — all green (8 new tests, including 2 regression tests)
  • cd coordinator && gofmt -l . — clean
  • cd coordinator && go vet ./... — clean
  • rustfmt --check --edition 2024 on the Rust files I touched — clean

New tests:

  • TestCompressE2E — full encrypted round-trip on the standalone endpoint through a 16 GB tiny provider
  • TestSmartPrefillMiddlewareSwapsLongestMessage — full encrypted round-trip through the middleware: register a tiny compressor + a standard chat provider, assert the chat provider receives the compressed prompt and the response carries the X-SmartPrefill-* headers
  • TestSmartPrefillFallsThroughOnShortPrompt — middleware no-ops below the threshold; chat provider sees the original prompt
  • TestSmartPrefillStripsFieldOnFallThrough — regression test for review-round-2 fix 1
  • TestSmartPrefillRefundsOnEmptyResult — regression test for review-round-2 fix 2, with the billing service active
  • TestCompressNoFreeCreditWhenBillingDisabled — same regression test as embeddings (refund cannot mint balance when billing was never charged)
  • TestCompressInvalidRatio — input validation
  • TestPreferredTiersIncludesCompressor — tier preference wired up
  • 4 new Rust protocol round-trip tests in provider/src/protocol.rs

Components touched

  • coordinator (Go)
  • provider (Rust)
  • console-ui (Next.js)
  • image-bridge (Python)
  • app (macOS Swift)
  • enclave (Swift)
  • infra / CI / release
  • docs

Protocol / interface changes

  • Yes — described above and matching side updated

Both sides updated symmetrically; round-trip tests on both sides catch drift. The smart_prefill field on /v1/chat/completions is a new request extension; we strip it from the body before forwarding to the provider so the existing OpenAI-compatible providers never see it.

No bundle script changes — the compressor sidecar runs at a deterministic port (embedding_port + 1 or EIGENINFERENCE_COMPRESSOR_PORT) so the launcher and the proxy agree without extra IPC.

Notes for reviewers

Why now and why this design:

I asked Exa for the latest literature before writing a line of code. The field moved past LLMLingua-2 — the new winner is attention-based "Speculative Prefill" using a small draft LLM (NVIDIA/MS 2025), with cross-family drafts proven viable in March 2026. There's an open prototype of exactly this on the vllm-mlx fork our providers run (waybarrios/vllm-mlx#179) that reports 5.45× TTFT reduction on Qwen3.5-122B at 128 K context on M2 Ultra. So this PR ships the protocol envelope + tier routing + billing now, and the next iteration graduates from text-level compression (phase 1, this PR) to true sparse prefill against the original positional schema (phase 2, follow-up). Clients don't change between phases — the same smart_prefill flag governs both.

What this is NOT:

  • Not bit-exact prefill disaggregation. That requires shipping KV cache, which is bandwidth-bound across consumer internet and dead on arrival.
  • Not speculative decoding. That accelerates decode using a draft on the same machine. Both will be supported, neither replaces the other.
  • Not lossless. Phase 1 drops ~75 % of input tokens. Quality on LongBench/RULER at 4× is consistently >90 % across the cited research, but consumers needing verbatim recall (legal, code, exact-quote retrieval) should leave smart_prefill off.

Stacking note: This PR's base branch is cursor/disaggregated-compute-small-providers-40ff (PR #71), not master. The diff against #71 is what this PR adds; the diff against master is the union. Once #71 lands, GitHub will auto-rebase this against master.


cursoragent and others added 7 commits April 19, 2026 21:11
Adds a new message pair for smart-prefill compression:

  prompt_compression_request  (coordinator → provider)
  prompt_compression_complete (provider → coordinator)

Same E2E envelope as embeddings/rerank: encrypted_body on the request
and encrypted_data on the response under the consumer's session key,
so the coordinator never sees the plaintext prompt or the compressed
result on either leg.

Symmetric Go ↔ Rust types with round-trip tests on both sides:

  PromptCompressionRequestBody {compressor_model, prompt, target_ratio,
                                min_keep_tokens, max_keep_tokens,
                                preserve_boundaries}
  PromptCompressionUsage      {original_tokens, compressed_tokens,
                                total_tokens}

Phase 1 returns the kept tokens as plain text in original order — no
target-engine modifications required, works with every model in the
catalog today. Phase 2 (planned) will return position IDs alongside
the text so the big-tier provider can run sparse prefill against the
original positional schema (cf. arXiv 2603.02631 cross-family
speculative prefill, vllm-mlx#179 prototype, ~5x TTFT at 128k).

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Three small wires for the new compressor model_type:

* registry.PreferredTiersForModelType now treats 'compressor' the same
  as 'embedding' and 'rerank' — routing prefers tiny/small tier so big
  Macs stay free for memory-bandwidth-bound decode.

* payments.CalculateCompressorCost + DefaultCompressorPrices. Default
  rate is 4_000 micro-USD per 1M tokens for Qwen3-0.6B (~half of our
  embedding rate). Net economic impact for the consumer is ~19x
  positive — see docs/smart-prefill.md.

* seedModelCatalog (coordinator) and fallback_catalog (provider) gain
  Qwen3-0.6B and Qwen3-1.7B with model_type='compressor'. Cross-family
  drafts work (arXiv 2603.02631) so a single 0.6B serves the entire
  big-tier catalog — we don't need per-target compressors.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Standalone smart-prefill endpoint for consumers who want to pre-compress
a corpus once and reuse it as a stable system prompt.

Mirrors the embeddings handler 1:1 — same dispatchEmbedding-shaped
helper, same retry-with-excludeProviders (3 attempts), same E2E flow,
same pre-flight reservation + clamp + refund + free-credit guard.

The shared dispatch path (runCompression) is also called from the
smart-prefill middleware in a follow-up commit, so retry/billing/
encryption are guaranteed identical between the two surfaces.

Tests in compress_test.go:
- TestCompressE2E — full encrypted round-trip through a 16 GB tiny
  provider
- TestCompressNoFreeCreditWhenBillingDisabled — refund cannot mint
  balance when billing was never charged
- TestCompressInvalidRatio — 400 on out-of-range target_ratio

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Opt-in compression baked into the chat-completions handler. Consumer
sets either:

  "smart_prefill": true                       (use defaults)

or the object form for overrides:

  "smart_prefill": {
    "enabled": true,
    "compressor_model": "mlx-community/Qwen3-1.7B",
    "target_ratio": 0.3,
    "min_keep_tokens": 128,
    "min_prompt_tokens": 4000,
    "preserve_boundaries": true
  }

Middleware:

* Strips the smart_prefill field from the body so the provider backend
  never sees our extension.
* Picks the longest user/system message (longest = dominates prefill
  cost; compressing a 50-token chat turn is wasted overhead).
* Skips silently for short prompts (< 2_000 estimated tokens by default).
* Calls runCompression (shared with /v1/compress) — same retry, billing,
  E2E, free-credit guard.
* On failure, refunds the reservation and falls through to full prefill
  rather than failing the consumer's chat request. Smart prefill is a
  best-effort optimization, never an availability hazard.
* On success, swaps the compressed prompt back into the message and
  surfaces stats via response headers:

    X-SmartPrefill-Compressor:        mlx-community/Qwen3-0.6B
    X-SmartPrefill-Original-Tokens:   32000
    X-SmartPrefill-Compressed-Tokens: 8000

Tests in compress_test.go:
- TestSmartPrefillMiddlewareSwapsLongestMessage — register a tiny
  compressor + a standard chat provider; assert the chat provider
  receives the *compressed* prompt and the response carries the
  X-SmartPrefill headers
- TestSmartPrefillFallsThroughOnShortPrompt — middleware no-ops below
  the min-tokens threshold; chat provider sees the original prompt

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
CoordinatorClient learns a new event type (PromptCompressionRequest),
decrypting the body and capturing the coordinator's ephemeral session
pubkey so the proxy can encrypt the compressed prompt back over the
session key.

proxy::handle_compression_request forwards to a local HTTP backend on
$EIGENINFERENCE_COMPRESSOR_PORT (default embedding_port + 1) that
exposes POST /v1/compress with the same body schema, then mirrors the
response back over the existing ProviderMessage envelope.

Same launcher contract as the embedding sidecar — if not running,
compression requests fail with connect-refused and the coordinator
routes to another tiny provider (or, for the smart_prefill middleware,
falls through to full prefill).

A first-party MLX-based sidecar will ship with the next provider
bundle. The compressor is a single forward pass through a small draft
LLM (Qwen3-0.6B fits in 8 GB), which gives low-RAM Macs that can't
host a quality decoder model their first real production workload.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>
Two real bugs caught by independent reviewers (PR #74):

1. **smart_prefill field leaked to chat provider on fall-through.**
   The middleware always strips the smart_prefill field but the caller
   only re-marshaled rawBody when compression actually applied. Combined
   with an explicit max_tokens (which skips the ensureMaxTokensBound
   re-marshal), this leaked our extension into the request body the
   chat provider's vllm-mlx backend sees — and vllm-mlx rejects unknown
   fields. Fix: always re-marshal after applySmartPrefill returns.
   Regression test: TestSmartPrefillStripsFieldOnFallThrough — sets
   max_tokens explicitly and asserts the chat provider's received body
   has no smart_prefill field.

2. **Reservation not refunded on success-but-no-swap.** When the
   compressor returned an empty CompressedPrompt (or replaceMessageContent
   failed), the middleware silently fell through but kept the
   compression charge on the consumer's ledger — billing them for work
   they got no benefit from. Fix: factor out a refundCompression
   closure and call it on every fall-through path that didn't apply the
   compressed prompt. Regression test: TestSmartPrefillRefundsOnEmptyResult
   — registers a compressor that returns CompressedPrompt='', asserts
   the consumer's ledger shows a smart_prefill refund entry.

Plus one cleanup:

3. **max_keep_tokens was defined in PromptCompressionRequestBody but
   never wired through smart_prefill settings.** Now it parses from the
   object form and is forwarded to the compressor request body, so
   consumers can cap the result size when they have a context budget.

Co-authored-by: Gajesh Naik <Gajesh2007@users.noreply.github.com>