Train cheap models on expensive ones. Automatically. With receipts.
Apprentice is an adaptive model distillation framework. It starts by routing every request to a frontier API (Claude, GPT, etc.), collects the outputs as training data, fine-tunes a local model, then progressively shifts traffic to it — while continuously verifying quality. The goal: replace a $15/M-token API with a $0 local model that produces equivalent results for your specific tasks.
The caller sends a request and gets a response. They don't know whether it came from Claude, a LoRA on Llama, or both. That's the point.
Apprentice is for recurring, domain-specific tasks where a general-purpose frontier model is overkill once you have enough examples. If you're making 10 API calls a day, the cost savings don't justify the infrastructure. If you're making 10,000, they do.
Use Apprentice when:
- You have repetitive tasks with consistent structure (classification, extraction, routing, summarization)
- You're spending real money on API calls that a fine-tuned 8B model could handle
- You need quality guarantees — not "hope the local model is good enough" but measurable correlation scores
- You want phased rollout — not a hard cutover, but a gradual, data-driven transition
- The task has a clear evaluation criterion (exact match, structured field comparison, semantic similarity)
Don't use Apprentice when:
- Every request is unique and creative (novel writing, open-ended brainstorming)
- You need the frontier model's full reasoning capability, not a specialized subset
- Your volume is low enough that API costs don't matter
- You're already happy with prompt engineering alone
The frontier model is a teacher, not a dependency. Apprentice treats API calls as training signals — every response is simultaneously a result and a labeled example. Over time, the local model absorbs the teacher's behavior for your specific domain. The API bill goes down. The local model gets better. The quality metrics prove it.
This is the opposite of prompt engineering. Prompt engineering optimizes the question you ask a smart model. Apprentice optimizes which model you ask, based on evidence that a cheaper one gives the same answer.
The confidence engine doesn't trust the local model by default. It earns trust through a sliding window of comparison scores. Phase transitions happen mechanically: 50 examples collected → start comparing; 0.85 correlation sustained → start routing locally. If correlation drops, traffic shifts back. No manual intervention required.
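The mechanism above can be sketched in a few lines. This is a minimal illustration, not Apprentice's actual implementation — the class and method names here are hypothetical, and the real engine presumably uses richer correlation statistics than a window mean:

```python
from collections import deque


class ConfidenceWindow:
    """Sketch of a sliding-window confidence tracker (hypothetical names)."""

    def __init__(self, size=100, promote_at=0.85, min_examples=50):
        self.scores = deque(maxlen=size)   # rolling window of comparison scores
        self.promote_at = promote_at       # sustained score needed to route locally
        self.min_examples = min_examples   # examples needed before comparing

    def record(self, score: float) -> None:
        self.scores.append(score)

    @property
    def correlation(self) -> float:
        # Simplified: mean agreement over the window stands in for correlation
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def next_phase(self, current: int, examples_collected: int) -> int:
        if current == 1 and examples_collected >= self.min_examples:
            return 2   # enough data collected: start dual routing and comparing
        if (current == 2 and len(self.scores) == self.scores.maxlen
                and self.correlation >= self.promote_at):
            return 3   # sustained quality over a full window: go local
        if current == 3 and self.correlation < self.promote_at:
            return 2   # quality regressed: shift traffic back
        return current
```

The key property is that promotion requires a *full* window above threshold, so one lucky comparison never triggers a transition, while regression is immediate once the window average dips.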
```shell
pip install apprentice-ai
```

```python
from apprentice import Apprentice

app = await Apprentice.create("apprentice.yaml")

# Routing is automatic — you don't choose the model
response = await app.run("classify_ticket", {
    "text": "My payment didn't go through",
    "metadata": {"source": "email"}
})
print(response.result)  # {"category": "billing", "priority": 2}
print(response.source)  # "local" or "remote" or "dual"

await app.close()
```

```
Request
   │
   ▼
Router ─────────────────────────────────────────────────────────────┐
   │                                                                │
   │  Phase 1: COLD START        Phase 2: REINFORCEMENT             │  Phase 3: STEADY STATE
   │  ┌──────────────────┐       ┌──────────────────────┐           │  ┌────────────────────┐
   │  │ Remote API only  │       │ Dual: local + remote │           │  │ Local model + spot │
   │  │ Collect examples │       │ Compare via evaluator│           │  │ checks via sampler │
   │  └────────┬─────────┘       └──────────┬───────────┘           │  └────────┬───────────┘
   │           │                            │                       │           │
   │           ▼                            ▼                       │           ▼
   │     Training Data              Confidence Engine               │   Sampling Scheduler
   │           │                    (rolling window)                │   (adaptive frequency)
   │           ▼                            │                       │           │
   │     Fine-Tuning                       ▼                       │           ▼
   │     Orchestrator             Phase transition? ────────────────┘   Correlation check
   │           │                  (0.85 correlation)                            │
   │           ▼                                                                ▼
   │     Model Validator                                        Regress? → back to Phase 2
   └────────────────────────────────────────────────────────────────────────────┘
```
Three phases, all data-driven:
- Cold Start — Every request goes to the remote API. Responses are stored as training examples. After enough examples accumulate (configurable threshold), fine-tuning begins.
- Reinforcement — Both models process each request. The evaluator scores local vs. remote output. A rolling window tracks correlation. When sustained correlation exceeds the threshold, the system promotes to Phase 3.
- Steady State — The local model handles most traffic. An adaptive sampler periodically sends requests to both models to verify quality hasn't degraded. If it has, the system automatically regresses to Phase 2.
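During Phase 2, the `structured_match` evaluator (configured per task below) scores how closely local and remote outputs agree on named fields. A minimal sketch of that comparison — the function signature here is an assumption for illustration, not Apprentice's actual API:

```python
def structured_match(local: dict, remote: dict, fields: list[str]) -> float:
    """Score agreement between local and remote output on named fields.

    Returns the fraction of fields where both outputs agree exactly;
    1.0 means the local model matched the remote model on every field.
    (Hypothetical signature — the real evaluator may weight fields differently.)
    """
    if not fields:
        return 0.0
    agree = sum(1 for f in fields if local.get(f) == remote.get(f))
    return agree / len(fields)


score = structured_match(
    {"category": "billing", "priority": 2},   # local model output
    {"category": "billing", "priority": 1},   # remote model output
    fields=["category", "priority"],
)
# category agrees, priority doesn't → 0.5
```

Each request's score feeds the rolling window; the phase transition looks at the window, never at a single comparison.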
Apprentice includes built-in PII detection and tokenization middleware that scrubs sensitive data before it reaches models, training stores, or audit logs. The system uses a hybrid approach that combines fast regex patterns with optional NER model inference.
| Mode | Strategies | Latency | Dependencies |
|---|---|---|---|
| `regex_only` (default) | Regex patterns + field heuristics | ~0.1ms | None |
| `hybrid` | Regex + field heuristics + NER model | ~50ms | `pip install apprentice-ai[ml]` |
| `ner_only` | NER model only | ~50ms | `pip install apprentice-ai[ml]` |
- **Regex:** Emails, phone numbers, SSNs, credit cards, IP addresses, API keys, dates of birth
- **Field heuristics:** Sensitive field names (`email`, `phone`, `ssn`, `password`, etc.) + learned patterns from prior detections
- **NER model:** Person names, locations, organizations, miscellaneous entities — unstructured PII that regex can't catch
```
Input data (may contain PII)
        │
        ├─ RegexDetectionStrategy   [confidence=1.0]
        ├─ FieldHeuristicStrategy   [confidence=0.9]
        └─ NERDetectionStrategy     [confidence=varies]
        │
        ▼
Merge + deduplicate (union, highest confidence wins overlaps)
        │
        ▼
Replace spans with opaque tokens → model sees __PII_EMAIL_a1b2c3__
        │
        ▼
Post-process: restore tokens → original PII for end user
```
The system learns over time — fields that repeatedly contain PII get auto-flagged, and user feedback (false positives/negatives) adjusts confidence.
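The tokenize/restore round trip above can be illustrated with a minimal sketch. This handles only one PII type (email) via regex and is not Apprentice's implementation — function names are hypothetical, and the real middleware merges multiple strategies before substituting:

```python
import re
import secrets

# Simplified email pattern for illustration only
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def tokenize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each email span with an opaque token; return scrubbed text
    plus the token → original mapping needed to restore it later."""
    mapping: dict[str, str] = {}

    def repl(m: re.Match) -> str:
        token = f"__PII_EMAIL_{secrets.token_hex(3)}__"
        mapping[token] = m.group(0)
        return token

    return EMAIL.sub(repl, text), mapping


def detokenize(text: str, mapping: dict[str, str]) -> str:
    """Restore original PII for the end user after the model responds."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


scrubbed, mapping = tokenize("Contact ada@example.com about the refund")
# Models, training stores, and logs only ever see the scrubbed text
assert "ada@example.com" not in scrubbed
assert detokenize(scrubbed, mapping) == "Contact ada@example.com about the refund"
```

Because the tokens are opaque and random, the model cannot reconstruct the original value, yet the mapping lets the caller-facing response contain the real data.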
```yaml
pii:
  enabled: true
  detection_mode: hybrid  # regex_only | hybrid | ner_only
  ner_model: dslim/bert-base-NER
  ner_device: cpu  # cpu | cuda
  ner_confidence_threshold: 0.7
  sensitive_fields:
    - email
    - phone
    - ssn
    - password
```

Apprentice includes a built-in evaluation harness for measuring PII detection quality against labeled datasets:
```shell
# Ingest the ai4privacy/pii-masking-200k dataset
apprentice pii-ingest --dataset ai4privacy/pii-masking-200k --limit 1000

# Evaluate regex baseline
apprentice pii-evaluate --mode regex_only

# Evaluate hybrid (regex + NER)
apprentice pii-evaluate --mode hybrid
```

| Command | Purpose |
|---|---|
| `apprentice run <config>` | Start the system (interactive or as HTTP server) |
| `apprentice serve <config>` | Start HTTP server with REST API |
| `apprentice status <config>` | Show phase, confidence, budget for each task |
| `apprentice report <config>` | Generate summary report with metrics |
| `apprentice ingest <file>` | Bulk ingest training data from file |
| `apprentice pii-ingest` | Download and ingest PII evaluation dataset |
| `apprentice pii-evaluate` | Evaluate PII detection against labeled data |
When running `apprentice serve`, the following endpoints are available:

| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/run` | POST | Submit a task request (core routing) |
| `/v1/status` | GET | System status |
| `/v1/status/{skill}` | GET | Per-skill phase and confidence |
| `/v1/report` | GET | Generate metrics report |
| `/v1/events` | POST | Ingest external events (fire-and-forget) |
| `/v1/feedback` | POST | Submit feedback on recommendations |
| `/v1/recommendations` | POST | Request a recommendation for a skill |
| `/v1/skills` | GET | List configured skills with phase info |
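A typical `/v1/run` request over HTTP, mirroring the Python quickstart. The endpoint and port are documented here, but the exact JSON field names (`task`, `payload`) are assumptions and may differ in your version:

```shell
# Submit a classify_ticket task to a running `apprentice serve` instance.
# NOTE: the "task"/"payload" body shape is assumed, not confirmed by the docs.
curl -s -X POST http://localhost:8787/v1/run \
  -H "Content-Type: application/json" \
  -d '{
        "task": "classify_ticket",
        "payload": {
          "text": "My payment did not go through",
          "metadata": {"source": "email"}
        }
      }'
```

As with the Python API, the response indicates which source (local, remote, or dual) produced the result.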
See `examples/apprentice.yaml` for a complete example.
```yaml
tasks:
  - name: classify_ticket
    prompt_template: "Classify: {text}"
    evaluator: structured_match
    match_fields: [category, priority]
    confidence_thresholds:
      phase2: 50    # examples before Phase 2 begins
      phase3: 0.85  # sustained correlation to enter Phase 3

  - name: extract_entities
    prompt_template: "Extract entities from: {text}"
    evaluator: semantic_similarity
    confidence_thresholds:
      phase2: 100
      phase3: 0.90

remote:
  provider: anthropic
  model: claude-sonnet-4-5-20250929
  api_key: env:ANTHROPIC_API_KEY

local:
  backend: ollama
  base_model: llama3.1:8b

budget:
  monthly_limit_usd: 150.00
  alert_threshold_pct: 80

server:
  host: 0.0.0.0
  port: 8787
```

Each task gets its own phase progression, confidence window, and evaluator. A single Apprentice instance can manage dozens of tasks simultaneously — one might be in Phase 3 (local model proven) while another is still collecting examples in Phase 1.
28 components organized in two layers — 21 leaf implementations with zero cross-dependencies, wired together by 7 integration compositions:
| Component | Purpose |
|---|---|
| `config_loader` | Load and validate YAML configuration |
| `task_registry` | Manage task type definitions and schemas |
| `data_models` | Shared Pydantic models across all components |
| `remote_api_client` | Multi-provider API abstraction (Anthropic, OpenAI, etc.) |
| `local_model_server` | Local model inference (Ollama, vLLM, llama.cpp) |
| `evaluators` | Response quality scoring (exact match, semantic, structured) |
| `phase_manager` | Phase 1/2/3 lifecycle and transitions |
| `rolling_window` | Sliding window correlation tracking |
| `sampling_scheduler` | Adaptive sampling frequency control |
| `training_data_store` | Training example collection and management |
| `fine_tuning_orchestrator` | Fine-tuning pipeline (LoRA, OpenAI, HuggingFace) |
| `model_validator` | Pre-promotion model quality validation |
| `budget_manager` | Multi-window spend tracking and enforcement |
| `router` | Request routing (local, remote, dual) |
| `apprentice_class` | Core Apprentice class — run, status, report |
| `cli` | Command-line interface and HTTP server |
| `audit_log` | Structured event logging (JSONL) |
| `report_generator` | Reports, metrics, and observability |
| `pii_tokenizer` | PII detection middleware with learned patterns |
| `pii_detection` | Multi-strategy PII detection (regex, NER, heuristic) |
| `pii_evaluation` | Span-level PII detection evaluation harness |
| Composition | Children | Purpose |
|---|---|---|
| `config_and_registry` | config_loader, task_registry, data_models | Configuration + type system |
| `confidence_engine` | evaluators, phase_manager, rolling_window | Quality tracking pipeline |
| `external_interfaces` | remote_api_client, local_model_server | External service adapters |
| `training_pipeline` | training_data_store, fine_tuning_orchestrator, model_validator | Training lifecycle |
| `unified_interface` | apprentice_class, cli | User-facing API + CLI |
| `reporting` | audit_log, report_generator | Observability layer |
| `root` | all 6 compositions above | Full system composition root |
```shell
git clone https://github.com/jmcentire/apprentice.git
cd apprentice

make dev         # Install with dev + lint dependencies
make test        # Run all 2,486 tests
make test-quick  # Stop on first failure
make lint        # Run ruff linter
make lint-fix    # Auto-fix lint issues
make clean       # Remove build artifacts
```

Requires Python 3.12+. Core dependencies: `pydantic`, `pyyaml`, `httpx`. Optional: `pip install apprentice-ai[ml]` for NER-based PII detection (adds `transformers`, `torch`, `datasets`).
This project was built using Pact — a contract-first multi-agent software engineering framework. Pact decomposed the task into 25 components, generated contracts and tests for each, then implemented them using iterative Claude Code sessions that write code, run tests, and fix failures autonomously.
Apprentice is one of three systems (alongside Pact and Emergence) built to test the ideas in Beyond Code: Context, Constraints, and the New Craft of Software. The book covers the coordination, verification, and specification problems that motivated these designs.
MIT