Train cheap models on expensive ones. Automatically. With receipts.
Apprentice is an adaptive model distillation framework. It starts by routing every request to a frontier API (Claude, GPT, etc.), collects the outputs as training data, fine-tunes a local model, then progressively shifts traffic to it — while continuously verifying quality. The goal: replace a $15/M-token API with a $0 local model that produces equivalent results for your specific tasks.
The caller sends a request and gets a response. They don't know whether it came from Claude, a LoRA on Llama, or both. That's the point.
Apprentice is for recurring, domain-specific tasks where a general-purpose frontier model is overkill once you have enough examples. If you're making 10 API calls a day, the cost savings don't justify the infrastructure. If you're making 10,000, they do.
Use Apprentice when:
- You have repetitive tasks with consistent structure (classification, extraction, routing, summarization)
- You're spending real money on API calls that a fine-tuned 8B model could handle
- You need quality guarantees — not "hope the local model is good enough" but measurable correlation scores
- You want phased rollout — not a hard cutover, but a gradual, data-driven transition
- The task has a clear evaluation criterion (exact match, structured field comparison, semantic similarity)
Don't use Apprentice when:
- Every request is unique and creative (novel writing, open-ended brainstorming)
- You need the frontier model's full reasoning capability, not a specialized subset
- Your volume is low enough that API costs don't matter
- You're already happy with prompt engineering alone
The frontier model is a teacher, not a dependency. Apprentice treats API calls as training signals — every response is simultaneously a result and a labeled example. Over time, the local model absorbs the teacher's behavior for your specific domain. The API bill goes down. The local model gets better. The quality metrics prove it.
This is the opposite of prompt engineering. Prompt engineering optimizes the question you ask a smart model. Apprentice optimizes which model you ask, based on evidence that a cheaper one gives the same answer.
The confidence engine doesn't trust the local model by default. It earns trust through a sliding window of comparison scores. Phase transitions happen mechanically: 50 examples collected → start comparing; 0.85 correlation sustained → start routing locally. If correlation drops, traffic shifts back. No manual intervention required.
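The mechanism above can be sketched in a few lines. This is a minimal illustration, not Apprentice's actual implementation — the class and method names here are hypothetical, and the real engine presumably uses richer correlation statistics than a window mean:

```python
from collections import deque


class ConfidenceWindow:
    """Sketch of a sliding-window confidence tracker (hypothetical names)."""

    def __init__(self, size=100, promote_at=0.85, min_examples=50):
        self.scores = deque(maxlen=size)   # rolling window of comparison scores
        self.promote_at = promote_at       # sustained score needed to route locally
        self.min_examples = min_examples   # examples needed before comparing

    def record(self, score: float) -> None:
        self.scores.append(score)

    @property
    def correlation(self) -> float:
        # Simplified: mean agreement over the window stands in for correlation
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def next_phase(self, current: int, examples_collected: int) -> int:
        if current == 1 and examples_collected >= self.min_examples:
            return 2   # enough data collected: start dual routing and comparing
        if (current == 2 and len(self.scores) == self.scores.maxlen
                and self.correlation >= self.promote_at):
            return 3   # sustained quality over a full window: go local
        if current == 3 and self.correlation < self.promote_at:
            return 2   # quality regressed: shift traffic back
        return current
```

The key property is that promotion requires a *full* window above threshold, so one lucky comparison never triggers a transition, while regression is immediate once the window average dips.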
```shell
pip install apprentice-ai
```

```python
from apprentice import Apprentice

app = await Apprentice.create("apprentice.yaml")

# Routing is automatic — you don't choose the model
response = await app.run("classify_ticket", {
    "text": "My payment didn't go through",
    "metadata": {"source": "email"}
})
print(response.result)  # {"category": "billing", "priority": 2}
print(response.source)  # "local" or "remote" or "dual"

await app.close()
```

```
Request
   │
   ▼
Router ─────────────────────────────────────────────────────────────┐
   │                                                                │
   │  Phase 1: COLD START        Phase 2: REINFORCEMENT             │  Phase 3: STEADY STATE
   │  ┌──────────────────┐       ┌──────────────────────┐           │  ┌────────────────────┐
   │  │ Remote API only  │       │ Dual: local + remote │           │  │ Local model + spot │
   │  │ Collect examples │       │ Compare via evaluator│           │  │ checks via sampler │
   │  └────────┬─────────┘       └──────────┬───────────┘           │  └────────┬───────────┘
   │           │                            │                       │           │
   │           ▼                            ▼                       │           ▼
   │     Training Data              Confidence Engine               │   Sampling Scheduler
   │           │                    (rolling window)                │   (adaptive frequency)
   │           ▼                            │                       │           │
   │     Fine-Tuning                       ▼                       │           ▼
   │     Orchestrator             Phase transition? ────────────────┘   Correlation check
   │           │                  (0.85 correlation)                            │
   │           ▼                                                                ▼
   │     Model Validator                                        Regress? → back to Phase 2
   └────────────────────────────────────────────────────────────────────────────┘
```
Three phases, all data-driven:
- Cold Start — Every request goes to the remote API. Responses are stored as training examples. After enough examples accumulate (configurable threshold), fine-tuning begins.
- Reinforcement — Both models process each request. The evaluator scores local vs. remote output. A rolling window tracks correlation. When sustained correlation exceeds the threshold, the system promotes to Phase 3.
- Steady State — The local model handles most traffic. An adaptive sampler periodically sends requests to both models to verify quality hasn't degraded. If it has, the system automatically regresses to Phase 2.
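During Phase 2, the `structured_match` evaluator (configured per task below) scores how closely local and remote outputs agree on named fields. A minimal sketch of that comparison — the function signature here is an assumption for illustration, not Apprentice's actual API:

```python
def structured_match(local: dict, remote: dict, fields: list[str]) -> float:
    """Score agreement between local and remote output on named fields.

    Returns the fraction of fields where both outputs agree exactly;
    1.0 means the local model matched the remote model on every field.
    (Hypothetical signature — the real evaluator may weight fields differently.)
    """
    if not fields:
        return 0.0
    agree = sum(1 for f in fields if local.get(f) == remote.get(f))
    return agree / len(fields)


score = structured_match(
    {"category": "billing", "priority": 2},   # local model output
    {"category": "billing", "priority": 1},   # remote model output
    fields=["category", "priority"],
)
# category agrees, priority doesn't → 0.5
```

Each request's score feeds the rolling window; the phase transition looks at the window, never at a single comparison.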
Apprentice includes built-in PII detection and tokenization middleware that scrubs sensitive data before it reaches models, training stores, or audit logs. The system uses a hybrid approach that combines fast regex patterns with optional NER model inference.
| Mode | Strategies | Latency | Dependencies |
|---|---|---|---|
| `regex_only` (default) | Regex patterns + field heuristics | ~0.1ms | None |
| `hybrid` | Regex + field heuristics + NER model | ~50ms | `pip install apprentice-ai[ml]` |
| `ner_only` | NER model only | ~50ms | `pip install apprentice-ai[ml]` |
- **Regex:** Emails, phone numbers, SSNs, credit cards, IP addresses, API keys, dates of birth
- **Field heuristics:** Sensitive field names (`email`, `phone`, `ssn`, `password`, etc.) + learned patterns from prior detections
- **NER model:** Person names, locations, organizations, miscellaneous entities — unstructured PII that regex can't catch
```
Input data (may contain PII)
        │
        ├─ RegexDetectionStrategy   [confidence=1.0]
        ├─ FieldHeuristicStrategy   [confidence=0.9]
        └─ NERDetectionStrategy     [confidence=varies]
        │
        ▼
Merge + deduplicate (union, highest confidence wins overlaps)
        │
        ▼
Replace spans with opaque tokens → model sees __PII_EMAIL_a1b2c3__
        │
        ▼
Post-process: restore tokens → original PII for end user
```
The system learns over time — fields that repeatedly contain PII get auto-flagged, and user feedback (false positives/negatives) adjusts confidence.
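The tokenize/restore round trip above can be illustrated with a minimal sketch. This handles only one PII type (email) via regex and is not Apprentice's implementation — function names are hypothetical, and the real middleware merges multiple strategies before substituting:

```python
import re
import secrets

# Simplified email pattern for illustration only
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def tokenize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each email span with an opaque token; return scrubbed text
    plus the token → original mapping needed to restore it later."""
    mapping: dict[str, str] = {}

    def repl(m: re.Match) -> str:
        token = f"__PII_EMAIL_{secrets.token_hex(3)}__"
        mapping[token] = m.group(0)
        return token

    return EMAIL.sub(repl, text), mapping


def detokenize(text: str, mapping: dict[str, str]) -> str:
    """Restore original PII for the end user after the model responds."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


scrubbed, mapping = tokenize("Contact ada@example.com about the refund")
# Models, training stores, and logs only ever see the scrubbed text
assert "ada@example.com" not in scrubbed
assert detokenize(scrubbed, mapping) == "Contact ada@example.com about the refund"
```

Because the tokens are opaque and random, the model cannot reconstruct the original value, yet the mapping lets the caller-facing response contain the real data.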
```yaml
pii:
  enabled: true
  detection_mode: hybrid  # regex_only | hybrid | ner_only
  ner_model: dslim/bert-base-NER
  ner_device: cpu  # cpu | cuda
  ner_confidence_threshold: 0.7
  sensitive_fields:
    - email
    - phone
    - ssn
    - password
```

Apprentice includes a built-in evaluation harness for measuring PII detection quality against labeled datasets:
```shell
# Ingest the ai4privacy/pii-masking-200k dataset
apprentice pii-ingest --dataset ai4privacy/pii-masking-200k --limit 1000

# Evaluate regex baseline
apprentice pii-evaluate --mode regex_only

# Evaluate hybrid (regex + NER)
apprentice pii-evaluate --mode hybrid
```

| Command | Purpose |
|---|---|
| `apprentice run <config>` | Start the system (interactive or as HTTP server) |
| `apprentice serve <config>` | Start HTTP server with REST API |
| `apprentice status <config>` | Show phase, confidence, budget for each task |
| `apprentice report <config>` | Generate summary report with metrics |
| `apprentice ingest <file>` | Bulk ingest training data from file |
| `apprentice pii-ingest` | Download and ingest PII evaluation dataset |
| `apprentice pii-evaluate` | Evaluate PII detection against labeled data |
When running `apprentice serve`, the following endpoints are available:

| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/run` | POST | Submit a task request (core routing) |
| `/v1/status` | GET | System status |
| `/v1/status/{skill}` | GET | Per-skill phase and confidence |
| `/v1/report` | GET | Generate metrics report |
| `/v1/events` | POST | Ingest external events (fire-and-forget) |
| `/v1/feedback` | POST | Submit feedback on recommendations |
| `/v1/recommendations` | POST | Request a recommendation for a skill |
| `/v1/skills` | GET | List configured skills with phase info |
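A typical `/v1/run` request over HTTP, mirroring the Python quickstart. The endpoint and port are documented here, but the exact JSON field names (`task`, `payload`) are assumptions and may differ in your version:

```shell
# Submit a classify_ticket task to a running `apprentice serve` instance.
# NOTE: the "task"/"payload" body shape is assumed, not confirmed by the docs.
curl -s -X POST http://localhost:8787/v1/run \
  -H "Content-Type: application/json" \
  -d '{
        "task": "classify_ticket",
        "payload": {
          "text": "My payment did not go through",
          "metadata": {"source": "email"}
        }
      }'
```

As with the Python API, the response indicates which source (local, remote, or dual) produced the result.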
See `examples/apprentice.yaml` for a complete example.
```yaml
tasks:
  - name: classify_ticket
    prompt_template: "Classify: {text}"
    evaluator: structured_match
    match_fields: [category, priority]
    confidence_thresholds:
      phase2: 50    # examples before Phase 2 begins
      phase3: 0.85  # sustained correlation to enter Phase 3

  - name: extract_entities
    prompt_template: "Extract entities from: {text}"
    evaluator: semantic_similarity
    confidence_thresholds:
      phase2: 100
      phase3: 0.90

remote:
  provider: anthropic
  model: claude-sonnet-4-5-20250929
  api_key: env:ANTHROPIC_API_KEY

local:
  backend: ollama
  base_model: llama3.1:8b

budget:
  monthly_limit_usd: 150.00
  alert_threshold_pct: 80

server:
  host: 0.0.0.0
  port: 8787
```

Each task gets its own phase progression, confidence window, and evaluator. A single Apprentice instance can manage dozens of tasks simultaneously — one might be in Phase 3 (local model proven) while another is still collecting examples in Phase 1.
28 components organized in two layers — 21 leaf implementations with zero cross-dependencies, wired together by 7 integration compositions:
| Component | Purpose |
|---|---|
| `config_loader` | Load and validate YAML configuration |
| `task_registry` | Manage task type definitions and schemas |
| `data_models` | Shared Pydantic models across all components |
| `remote_api_client` | Multi-provider API abstraction (Anthropic, OpenAI, etc.) |
| `local_model_server` | Local model inference (Ollama, vLLM, llama.cpp) |
| `evaluators` | Response quality scoring (exact match, semantic, structured) |
| `phase_manager` | Phase 1/2/3 lifecycle and transitions |
| `rolling_window` | Sliding window correlation tracking |
| `sampling_scheduler` | Adaptive sampling frequency control |
| `training_data_store` | Training example collection and management |
| `fine_tuning_orchestrator` | Fine-tuning pipeline (LoRA, OpenAI, HuggingFace) |
| `model_validator` | Pre-promotion model quality validation |
| `budget_manager` | Multi-window spend tracking and enforcement |
| `router` | Request routing (local, remote, dual) |
| `apprentice_class` | Core Apprentice class — run, status, report |
| `cli` | Command-line interface and HTTP server |
| `audit_log` | Structured event logging (JSONL) |
| `report_generator` | Reports, metrics, and observability |
| `pii_tokenizer` | PII detection middleware with learned patterns |
| `pii_detection` | Multi-strategy PII detection (regex, NER, heuristic) |
| `pii_evaluation` | Span-level PII detection evaluation harness |
| Composition | Children | Purpose |
|---|---|---|
| `config_and_registry` | config_loader, task_registry, data_models | Configuration + type system |
| `confidence_engine` | evaluators, phase_manager, rolling_window | Quality tracking pipeline |
| `external_interfaces` | remote_api_client, local_model_server | External service adapters |
| `training_pipeline` | training_data_store, fine_tuning_orchestrator, model_validator | Training lifecycle |
| `unified_interface` | apprentice_class, cli | User-facing API + CLI |
| `reporting` | audit_log, report_generator | Observability layer |
| `root` | all 6 compositions above | Full system composition root |
```shell
git clone https://github.com/jmcentire/apprentice.git
cd apprentice

make dev         # Install with dev + lint dependencies
make test        # Run all 2,486 tests
make test-quick  # Stop on first failure
make lint        # Run ruff linter
make lint-fix    # Auto-fix lint issues
make clean       # Remove build artifacts
```

Requires Python 3.12+. Core dependencies: `pydantic`, `pyyaml`, `httpx`. Optional: `pip install apprentice-ai[ml]` for NER-based PII detection (adds `transformers`, `torch`, `datasets`).
This project was built using Pact — a contract-first multi-agent software engineering framework. Pact decomposed the task into 25 components, generated contracts and tests for each, then implemented them using iterative Claude Code sessions that write code, run tests, and fix failures autonomously.
Apprentice is one of three systems (alongside Pact and Emergence) built to test the ideas in Beyond Code: Context, Constraints, and the New Craft of Software. The book covers the coordination, verification, and specification problems that motivated these designs.
MIT