feat: Minions — BullMQ-inspired Postgres-native job queue #130

Open
garrytan wants to merge 6 commits into master from garrytan/minions-jobs

@garrytan
Owner

Summary

Minions is a durable job queue built directly into GBrain. No Redis. No external dependencies. Postgres transactions replace BullMQ's Lua scripts while maintaining the same correctness guarantees.

Why: sync/embed/enrich/lint run synchronously today, exit after one cycle, and can't handle 14K+ page bulk operations. Minions makes them durable background jobs with automatic retry, stall detection, and progress tracking.

What shipped (5 commits):

  • Schema + migration v5: minion_jobs table (20 columns) with CHECK constraints, 5 partial indexes, RLS. executeRaw<T>() added to BrainEngine interface.
  • Minions core library: MinionQueue (15 methods), MinionWorker (handler registry, lock renewal, graceful SIGTERM), backoff calculation (exponential/fixed with jitter), UnrecoverableError.
  • 43 unit tests: Queue CRUD, 8-state machine, backoff, stall detection (atomic CTE), parent-child dependencies, worker lifecycle, lock management, claim mechanics.
  • CLI commands: gbrain jobs submit/list/get/cancel/retry/prune/stats/work with --follow (inline execution), --dry-run, --params.
  • 6 MCP operations: submit_job, get_job, list_jobs, cancel_job, retry_job, get_job_progress. Contract-first, auto-exposed to AI clients.
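
A minimal sketch of the backoff calculation named in the core-library bullet, assuming a symmetric jitter fraction and a doubling schedule. The names `calculateBackoff`, `BackoffOptions`, `baseDelayMs`, and `jitter` are hypothetical; the PR does not show the actual signature or formula.

```typescript
// Hypothetical names throughout; the real Minions API may differ.
type BackoffType = "exponential" | "fixed";

interface BackoffOptions {
  type: BackoffType;
  baseDelayMs: number; // delay before the first retry
  jitter: number; // fraction in [0, 1]; 0.2 means +/-20%
}

// Returns the delay before the next attempt, in milliseconds.
// `random` is injectable so the jitter can be tested deterministically.
function calculateBackoff(
  opts: BackoffOptions,
  attemptsMade: number,
  random: () => number = Math.random,
): number {
  // Treat attemptsMade <= 0 as the first retry so the exponent is never negative.
  const attempt = Math.max(1, attemptsMade);
  const base =
    opts.type === "exponential"
      ? opts.baseDelayMs * 2 ** (attempt - 1)
      : opts.baseDelayMs;
  // Map random() in [0, 1) onto a factor in [1 - jitter, 1 + jitter).
  const factor = 1 + opts.jitter * (random() * 2 - 1);
  return Math.max(0, Math.round(base * factor));
}
```

Jitter spreads retries out so that jobs which failed together do not all retry at the same instant.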

Patterns stolen from:

  • BullMQ: lock tokens, FOR UPDATE SKIP LOCKED claim, stall detection, parent-child flows
  • Sidekiq: exponential backoff with jitter, dead set (exhausted retries)
  • Inngest: checkpoint/resume, durable execution concepts
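
The `FOR UPDATE SKIP LOCKED` claim pattern can be sketched as a single parameterized statement. This is an illustration only: `minion_jobs` is the table named in this PR, but the column names (`queue`, `status`, `priority`, `lock_token`, `lock_until`), the ascending priority order, and the `buildClaimQuery` helper are assumptions.

```typescript
// Hypothetical helper; column names are assumed, not taken from the migration.
interface SqlQuery {
  text: string;
  params: unknown[];
}

function buildClaimQuery(
  queue: string,
  lockToken: string,
  lockDurationMs: number,
): SqlQuery {
  // The inner SELECT locks one waiting row and skips rows already locked by
  // other workers, so concurrent claimers never block on or double-claim a job.
  const text = `
    UPDATE minion_jobs
       SET status = 'active',
           lock_token = $2,
           lock_until = now() + ($3::int * interval '1 millisecond')
     WHERE id = (
             SELECT id
               FROM minion_jobs
              WHERE queue = $1
                AND status = 'waiting'
              ORDER BY priority ASC, id ASC
              LIMIT 1
              FOR UPDATE SKIP LOCKED
           )
    RETURNING *`;
  return { text, params: [queue, lockToken, lockDurationMs] };
}
```

Because the select and the update run as one statement, the claim is atomic without a separate coordination layer.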

PGLite: Inline execution only (--follow). Worker daemon is Postgres-only (PGLite exclusive file lock blocks concurrent processes).

Test Coverage

861 unit tests pass, 0 fail
98 E2E tests pass, 0 fail
43 new Minions tests covering all code paths

Pre-Landing Review

All reviews cleared:

  • CEO Review: CLEAR (5/5 scope proposals accepted)
  • Eng Review: CLEAR (4 issues found, all resolved)
  • DX Review: CLEAR (6/10 → 8/10)
  • Codex Outside Voice: 4 rounds, 12 findings, all resolved

TODOS

No TODO items completed in this PR. No new TODOs created.

Test plan

  • All unit tests pass (861 pass, 0 fail)
  • All E2E tests pass (98 pass, 0 fail)
  • Minions tests pass (43 pass, 0 fail)
  • Migration v5 applies cleanly on PGLite
  • Schema includes CHECK constraints, partial indexes, RLS

🤖 Generated with Claude Code

garrytan and others added 6 commits April 14, 2026 23:39
…gine

Foundation for the Minions job queue system. Adds:
- minion_jobs table (20 columns) with CHECK constraints, partial indexes,
  and RLS. Inspired by BullMQ's job model, adapted for Postgres.
- Migration v5 creates the table for existing databases.
- executeRaw<T>() method on BrainEngine interface for raw SQL access,
  needed by the Minions module for claim queries (FOR UPDATE SKIP LOCKED),
  token-fenced writes, and atomic stall detection.
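
To illustrate what a token-fenced write protects against, here is an in-memory model, assuming each claim stamps the row with a fresh lock token. In SQL the same check would be a `WHERE id = ... AND lock_token = ...` guard on the `UPDATE`. The `JobRow` shape and `completeJob` signature are hypothetical.

```typescript
// In-memory stand-in for the minion_jobs table; field names are assumed.
interface JobRow {
  id: number;
  status: "waiting" | "active" | "completed";
  lockToken: string | null;
  result?: unknown;
}

// Succeeds only when the caller still holds the current lock token.
// A worker whose job was reclaimed after a stall holds a stale token,
// so its late write is rejected instead of overwriting newer state.
function completeJob(
  jobs: Map<number, JobRow>,
  id: number,
  token: string,
  result: unknown,
): boolean {
  const job = jobs.get(id);
  if (!job || job.status !== "active" || job.lockToken !== token) {
    return false;
  }
  job.status = "completed";
  job.lockToken = null;
  job.result = result;
  return true;
}
```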

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BullMQ-inspired Postgres-native job queue built into GBrain. No Redis.
No external dependencies. Postgres transactions replace Lua scripts.

- MinionQueue: submit, claim (FOR UPDATE SKIP LOCKED), complete/fail
  (token-fenced), atomic stall detection (CTE), delayed promotion,
  parent-child resolution, prune, stats
- MinionWorker: handler registry, lock renewal, graceful SIGTERM,
  exponential backoff with jitter, UnrecoverableError bypass
- MinionJobContext: updateProgress(), log(), isActive() for handlers
- 8-state machine: waiting/active/completed/failed/delayed/dead/
  cancelled/waiting-children
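
The eight states can be modeled as a transition table. The state names below come from this commit message; the set of allowed transitions is an assumption pieced together from the test descriptions (waiting→active→completed/failed, retry→delayed→waiting, stall max→dead, cancel active, retry dead) and may not match the real implementation.

```typescript
const STATES = [
  "waiting",
  "active",
  "completed",
  "failed",
  "delayed",
  "dead",
  "cancelled",
  "waiting-children",
] as const;

type JobState = (typeof STATES)[number];

// Assumed transitions; the actual rules live in MinionQueue.
const TRANSITIONS: Record<JobState, readonly JobState[]> = {
  waiting: ["active", "cancelled"],
  active: ["completed", "failed", "delayed", "waiting", "dead", "cancelled"],
  delayed: ["waiting", "cancelled"],
  "waiting-children": ["waiting", "failed", "cancelled"],
  failed: ["waiting"], // manual retry
  dead: ["waiting"], // manual retry
  completed: [],
  cancelled: [],
};

function canTransition(from: JobState, to: JobState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```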

Patterns stolen from: BullMQ (lock tokens, stall detection, flows),
Sidekiq (dead set, backoff formula), Inngest (checkpoint/resume).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full coverage of the Minions module against PGLite in-memory:
- Queue CRUD (9): submit, get, list, remove, cancel, retry, duplicate
- State machine (6): waiting→active→completed/failed, retry→delayed→waiting
- Backoff (4): exponential, fixed, jitter range, attempts_made=0 edge
- Stall detection (3): detect stalled, counter increment, max→dead
- Dependencies (5): parent waits, fail_parent, continue, remove_dep, orphan
- Worker lifecycle (5): register, start-without-handlers, claim+execute,
  non-Error throws, UnrecoverableError bypass
- Lock management (3): renewal, token mismatch, claim sets lock fields
- Claim mechanics (4): empty queue, priority ordering, name filtering,
  delayed promotion timing
- Cancel & retry (2): cancel active, retry dead
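
The stall-detection cases listed above (detect stalled, counter increment, max→dead) boil down to one decision per expired lock. The sketch below models that decision in memory; the PR performs it in a single atomic CTE, which is not reproduced here. `ActiveJob`, `StallOutcome`, and `maxStalled` are hypothetical names.

```typescript
interface ActiveJob {
  id: number;
  lockUntil: number; // epoch milliseconds when the lock expires
  stalledCount: number;
}

interface StallOutcome {
  id: number;
  next: "waiting" | "dead";
  stalledCount: number;
}

// A job is stalled when its lock expired without renewal. Each stall bumps
// the counter; once the counter exceeds maxStalled the job goes to the dead
// set instead of being requeued.
function detectStalled(
  jobs: ActiveJob[],
  now: number,
  maxStalled: number,
): StallOutcome[] {
  return jobs
    .filter((job) => job.lockUntil < now)
    .map((job): StallOutcome => {
      const stalledCount = job.stalledCount + 1;
      return {
        id: job.id,
        next: stalledCount > maxStalled ? "dead" : "waiting",
        stalledCount,
      };
    });
}
```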

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wire Minions into the GBrain CLI and MCP layer:

CLI (gbrain jobs):
  submit <name> [--params JSON] [--follow] [--dry-run]
  list [--status S] [--queue Q] [--limit N]
  get <id> — detailed view with attempt history
  cancel/retry/delete <id>
  prune [--older-than 30d]
  stats — job health dashboard
  work [--queue Q] [--concurrency N] — Postgres-only worker daemon

6 MCP operations (contract-first, auto-exposed via MCP server):
  submit_job, get_job, list_jobs, cancel_job, retry_job, get_job_progress

Built-in handlers: sync, embed, lint, import. --follow runs inline.
Worker daemon blocked on PGLite (exclusive file lock).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CLAUDE.md: added Minions files to key files, updated operation count (36),
BrainEngine method count (38), test file count (45), added jobs CLI commands.
CHANGELOG.md: added Minions entry to v0.10.0 (background jobs, retry, stall
detection, worker daemon).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…x, tokens, replay)

Adds the foundation for Minions as universal agent orchestration infrastructure.
GBrain's Postgres-native job queue now supports durable, observable, steerable
background agents. The OpenClaw plugin (separate repo) will consume these via
library import, not MCP, for zero-latency local integration.

## New capabilities

- **Concurrent worker** — Promise pool replaces sequential loop. Per-job
  AbortController for cooperative cancellation. Graceful shutdown waits for
  all in-flight jobs via Promise.allSettled.
- **Pause/resume** — pauseJob clears the lock and fires AbortSignal on active
  jobs. Handlers check ctx.signal.aborted and exit cleanly. resumeJob returns
  paused jobs to waiting. Catch block skips failJob when signal.aborted.
- **Inbox (separate table)** — minion_inbox table for sidechannel messages.
  sendMessage with sender validation (parent job or admin). readInbox is
  token-fenced and marks read_at atomically. Separate table avoids row bloat
  from rewriting JSONB on every send.
- **Token accounting** — tokens_input/tokens_output/tokens_cache_read columns.
  updateTokens accumulates; completeJob rolls child tokens up to parent.
  USD cost computed at read time (no cost_usd column — pricing too volatile).
- **Job replay** — replayJob clones a terminal job with optional data overrides.
  New job, fresh attempts, no parent link.

## Handler contract additions

MinionJobContext now provides:
- `signal: AbortSignal` — cooperative cancellation
- `updateTokens(tokens)` — accumulate token usage
- `readInbox()` — check for sidechannel messages
- `log()` — now accepts string or TranscriptEntry
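
As an illustration of how a handler might use these additions, here is a sketch against a pared-down, hypothetical `JobContext` interface. The method names mirror the list above, but the parameter and return shapes (`TokenUsage`, `string[]` inbox messages, string-only `log`) are assumptions, and `summarizePages` is an invented example handler.

```typescript
// Assumed shapes; the real MinionJobContext is defined in the Minions module.
interface TokenUsage {
  input?: number;
  output?: number;
  cacheRead?: number;
}

interface JobContext {
  signal: AbortSignal;
  updateTokens(tokens: TokenUsage): Promise<void>;
  readInbox(): Promise<string[]>;
  log(entry: string): Promise<void>;
}

// Processes pages one at a time and stops cleanly between pages when the
// job is paused or cancelled. Returns the number of pages processed.
async function summarizePages(ctx: JobContext, pages: string[]): Promise<number> {
  let processed = 0;
  for (const page of pages) {
    if (ctx.signal.aborted) {
      await ctx.log(`stopping after ${processed} pages: signal aborted`);
      break;
    }
    // Sidechannel messages let a parent job or admin steer the handler.
    const messages = await ctx.readInbox();
    if (messages.indexOf("stop") !== -1) break;
    // Placeholder accounting: a real handler would report model token usage.
    await ctx.updateTokens({ input: page.length });
    processed += 1;
  }
  return processed;
}
```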

## MCP operations added

pause_job, resume_job, replay_job, send_job_message — all auto-generate CLI
commands and MCP server endpoints.

## Library exports

package.json exports map adds ./minions and ./engine-factory paths so plugins
can `import { MinionQueue } from 'gbrain/minions'` for direct library use.

## Instruction layer (the teaching)

- skills/minion-orchestrator/SKILL.md — when/how to use Minions, decision
  matrix, lifecycle management, anti-patterns
- skills/conventions/subagent-routing.md — cross-cutting rule: all background
  work goes through Minions
- RESOLVER.md — trigger entries for agent orchestration
- manifest.json — registered

## Schema migration v6

Additive: 3 token columns, paused status, minion_inbox table with unread index.
Full Postgres + PGLite support. No backfill needed.
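
Since the migration stores only raw token counts, a USD figure has to be derived whenever a job is read. A minimal sketch of the roll-up and the read-time cost calculation follows, assuming per-million-token pricing supplied by the caller. The field names, `rollUp`, `costUsd`, and the `PricePerMillion` shape are hypothetical.

```typescript
// Field names loosely mirror the three token columns; exact types are assumed.
interface TokenCounts {
  tokensInput: number;
  tokensOutput: number;
  tokensCacheRead: number;
}

// Caller-supplied prices in USD per one million tokens (hypothetical shape).
interface PricePerMillion {
  input: number;
  output: number;
  cacheRead: number;
}

// Adds a child's counts to its parent's, as completeJob is described as doing.
function rollUp(parent: TokenCounts, child: TokenCounts): TokenCounts {
  return {
    tokensInput: parent.tokensInput + child.tokensInput,
    tokensOutput: parent.tokensOutput + child.tokensOutput,
    tokensCacheRead: parent.tokensCacheRead + child.tokensCacheRead,
  };
}

// Computed on read, so a price change never requires rewriting stored rows.
function costUsd(tokens: TokenCounts, price: PricePerMillion): number {
  return (
    (tokens.tokensInput * price.input +
      tokens.tokensOutput * price.output +
      tokens.tokensCacheRead * price.cacheRead) /
    1000000
  );
}
```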

## Tests

65 tests (was 43): pause/resume (5), inbox (6), tokens (4), replay (4),
concurrent worker context (3), plus all existing coverage.

## What's NOT in this commit

Deferred to follow-up PRs:
- LISTEN/NOTIFY subscribe (needs real Postgres E2E)
- Resource governor (depends on concurrent worker stress testing)
- Routing eval harness (needs API keys + benchmark data)
- OpenClaw plugin (separate @gbrain/openclaw-minions-plugin repo)

See docs/designs/MINIONS_AGENT_ORCHESTRATION.md for full CEO-approved design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>