This document explains how CarePath AI manages Large Language Models (LLMs), including downloading, deploying, switching between models, and rollback procedures.
CarePath AI's chat service (service_chat) supports multiple LLM modes:
- mock: Returns hardcoded responses (for testing, MVP)
- qwen / Qwen3-4B-Thinking-2507: Uses the Qwen3-4B-Thinking-2507 model from Hugging Face
The active mode is controlled via the LLM_MODE environment variable.
The model manager (`service_chat/services/model_manager.py`) handles automatic download and caching of models from Hugging Face.
Key Functions:
- `download_model_if_needed()`: Downloads the model if it is not already cached
- `load_qwen_model()`: Loads the model and tokenizer into memory
- `get_model_cache_dir()`: Returns the path to the model cache directory
Model Caching:
- Default cache directory: `/app/models` (configurable via the `MODEL_CACHE_DIR` env var)
- Models are cached to avoid re-downloading on every container restart
- Cache is per-pod ephemeral storage by default (cleared when pod is deleted)
- For persistence, mount a shared PersistentVolume (see Persistent Model Cache)
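The caching behaviour above can be sketched as follows. Function names mirror the document, but the body is illustrative: the injected `download` hook stands in for the real Hugging Face download (e.g. `huggingface_hub.snapshot_download`).

```python
import os
from pathlib import Path

MODEL_NAME = "Qwen/Qwen3-4B-Thinking-2507"

def get_model_cache_dir() -> Path:
    """Return the cache directory, honouring the MODEL_CACHE_DIR env var."""
    return Path(os.environ.get("MODEL_CACHE_DIR", "/app/models"))

def download_model_if_needed(download) -> Path:
    """Download the model only when nothing is cached yet.

    `download(name, target)` is expected to write the model files into
    `target`; in practice it would wrap huggingface_hub.snapshot_download.
    """
    target = get_model_cache_dir() / MODEL_NAME.split("/")[-1]
    if not target.exists() or not any(target.iterdir()):
        target.mkdir(parents=True, exist_ok=True)
        download(MODEL_NAME, target)  # slow path: runs once per empty cache
    return target
```

Because the check is purely directory-based, mounting a shared PersistentVolume at the cache path makes the download a one-time cost across pods.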
The LLM client (`service_chat/services/llm_client.py`) provides a unified interface for generating responses across the different LLM modes.
Key Functions:
- `generate_response_mock()`: Returns a mock response
- `generate_response_qwen()`: Uses the Qwen model for inference
- `generate_response()`: Dispatcher that routes to the appropriate implementation based on mode
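The dispatch pattern can be sketched as below. The function and mode names follow the document; the bodies are illustrative stand-ins, not the real service code.

```python
import os

def generate_response_mock(query: str) -> str:
    # Hardcoded reply used in testing / MVP mode.
    return f"[mock] You asked: {query}"

def generate_response_qwen(query: str) -> str:
    # The real implementation tokenizes `query` and calls model.generate()
    # on the cached Qwen model; omitted here to keep the sketch runnable.
    raise NotImplementedError("requires the downloaded Qwen model")

def generate_response(query: str) -> str:
    """Route to the implementation selected by the LLM_MODE env var."""
    mode = os.environ.get("LLM_MODE", "mock")
    if mode == "mock":
        return generate_response_mock(query)
    if mode in ("qwen", "Qwen3-4B-Thinking-2507"):
        return generate_response_qwen(query)
    raise ValueError(f"Unknown LLM_MODE: {mode!r}")
```

Reading the mode at call time (rather than at import) is what makes `kubectl set env` switches take effect after a pod restart without a rebuild.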
Model Caching:
- Model and tokenizer are loaded once per container and cached in memory (`_model_cache`)
- Subsequent requests reuse the cached model (no re-loading)
- The model is only loaded when the first request with `LLM_MODE=qwen` is received
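A minimal sketch of this lazy in-memory cache is shown below. The dict name matches the `_model_cache` mentioned above; the injectable `loader` is an illustrative device so the pattern is visible without downloading anything (in the real service it would wrap `AutoModelForCausalLM`/`AutoTokenizer.from_pretrained`).

```python
_model_cache: dict = {}

def get_model_and_tokenizer(loader):
    """Load the model and tokenizer on first call, then reuse them.

    `loader()` must return a (model, tokenizer) pair. The slow path runs
    at most once per container; every later call hits the dict.
    """
    if "model" not in _model_cache:
        _model_cache["model"], _model_cache["tokenizer"] = loader()
    return _model_cache["model"], _model_cache["tokenizer"]
```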
- Python 3.11+
- Virtual environment activated (see `CLAUDE.md`)
- At least 10GB free disk space for model download
- Optional: Hugging Face account and token (if model requires authentication)
```bash
# Install base chat service dependencies
make install-chat

# Install LLM-specific dependencies (torch, transformers, huggingface-hub)
make install-chat-llm
```

Download the model using one of the following options:

Option 1: Using the Makefile

```bash
# Downloads Qwen3-4B-Thinking-2507 to ./models/ directory
make download-llm-model
```

Option 2: Using Python directly

```bash
python -c "from service_chat.services.model_manager import download_model_if_needed; download_model_if_needed()"
```

Option 3: With a Hugging Face token (if the model requires authentication)

```bash
export HUGGINGFACE_TOKEN="hf_your_token_here"
make download-llm-model
```
1. Create a `.env` file (if it does not exist):

   ```bash
   cp .env.example .env
   ```

2. Update `.env`:

   ```bash
   LLM_MODE=Qwen3-4B-Thinking-2507
   MODEL_CACHE_DIR=./models  # Optional: use local directory
   ```

3. Start the service:

   ```bash
   make run-chat
   ```

4. Test the endpoint:

   ```bash
   curl -X POST http://localhost:8002/triage \
     -H "Content-Type: application/json" \
     -d '{ "patient_mrn": "MRN-001", "query": "What are my current medications?" }'
   ```
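For scripted checks, the same request can be sent from Python using only the standard library. This is a sketch of a client for the `/triage` endpoint shown above; the response is assumed to be JSON.

```python
import json
import urllib.request

def triage(base_url: str, patient_mrn: str, query: str) -> dict:
    """POST a triage query to the chat service and return the parsed JSON reply."""
    body = json.dumps({"patient_mrn": patient_mrn, "query": query}).encode()
    req = urllib.request.Request(
        f"{base_url}/triage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example, with the service running locally:
# triage("http://localhost:8002", "MRN-001", "What are my current medications?")
```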
There are two approaches to deploying models to Kubernetes:
How it works:
- Docker image does NOT include the model
- Model is downloaded when first pod starts up
- Each pod downloads model independently to its ephemeral storage
Pros:
- Smaller Docker image (~500MB vs ~8GB)
- Faster image builds and pushes
- Less storage required in ECR
Cons:
- First pod startup takes 5-15 minutes (model download time)
- Each pod downloads the model independently (network bandwidth)
- Model is lost when pod is deleted (re-download needed)
Build Command:
```bash
# Default - model NOT included
docker build -t carepath-chat-api:latest -f service_chat/Dockerfile service_chat/
```

The second approach embeds the model in the Docker image.

How it works:
- Model is downloaded during Docker image build
- Model is embedded in the Docker image
- Pods start immediately with model already available
Pros:
- Fast pod startup (~30 seconds)
- No download time or network usage during startup
- Consistent across all pods
Cons:
- Large Docker image (~8GB)
- Slow image builds (10-20 minutes for first build)
- Slow image pushes to ECR
- Higher ECR storage costs
Build Command:
```bash
docker build \
  --build-arg DOWNLOAD_MODEL=true \
  --build-arg HUGGINGFACE_TOKEN="$HUGGINGFACE_TOKEN" \
  -t carepath-chat-api:latest \
  -f service_chat/Dockerfile \
  service_chat/
```

See `notes/ai-service-upgrade.md` for detailed step-by-step deployment instructions.
Quick Steps:

1. Build and push the image:

   ```bash
   make docker-build-chat
   make docker-push-chat
   ```

2. Update the Terraform config:

   ```hcl
   # infra/terraform/envs/demo/terraform.tfvars
   chat_api_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/carepath-chat-api:v2.0"
   ```

3. Update the ConfigMap to enable Qwen mode:

   ```hcl
   # infra/terraform/modules/app/main.tf
   env {
     name  = "LLM_MODE"
     value = "Qwen3-4B-Thinking-2507"
   }
   ```

4. Apply the changes:

   ```bash
   make tf-apply
   ```

5. Monitor the rollout:

   ```bash
   kubectl rollout status deployment/chat-api -n carepath-demo
   ```
You can switch between mock and real LLM modes without rebuilding the image.
```bash
# Update environment variable in Kubernetes
kubectl set env deployment/chat-api LLM_MODE=mock -n carepath-demo

# This triggers a rolling restart of pods
kubectl set env deployment/chat-api LLM_MODE=Qwen3-4B-Thinking-2507 -n carepath-demo
```

For a persistent change, update `infra/terraform/modules/app/main.tf`:

```hcl
env {
  name  = "LLM_MODE"
  value = "Qwen3-4B-Thinking-2507" # or "mock"
}
```

Then apply:

```bash
make tf-apply
```

If the new model deployment causes issues:
```bash
# Quick rollback to previous version
kubectl rollout undo deployment/chat-api -n carepath-demo

# Check rollback status
kubectl rollout status deployment/chat-api -n carepath-demo

# Switch LLM_MODE back to mock (keeps current image)
kubectl set env deployment/chat-api LLM_MODE=mock -n carepath-demo
```

- Revert changes in `terraform.tfvars` (old image tag)
- Revert changes in `app/main.tf` (LLM_MODE=mock)
- Apply: `make tf-apply`
To avoid re-downloading models on every pod restart, use a PersistentVolume.
Add to infra/terraform/modules/app/main.tf:
```hcl
resource "kubernetes_persistent_volume_claim" "model_cache" {
  metadata {
    name      = "model-cache"
    namespace = kubernetes_namespace.app.metadata[0].name
  }
  spec {
    access_modes = ["ReadWriteMany"] # Shared across pods
    resources {
      requests = {
        storage = "20Gi" # Enough for multiple models
      }
    }
    storage_class_name = "efs-sc" # Use EFS for shared access
  }
}
```

Update the deployment spec:
```hcl
resource "kubernetes_deployment" "chat_api" {
  spec {
    template {
      spec {
        # Add volume
        volume {
          name = "model-cache"
          persistent_volume_claim {
            claim_name = kubernetes_persistent_volume_claim.model_cache.metadata[0].name
          }
        }
        container {
          # Mount volume
          volume_mount {
            name       = "model-cache"
            mount_path = "/app/models"
          }
          env {
            name  = "MODEL_CACHE_DIR"
            value = "/app/models"
          }
        }
      }
    }
  }
}
```

EKS doesn't support ReadWriteMany by default; you need EFS:
1. Install the EFS CSI driver:

   ```bash
   kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.5"
   ```

2. Create an EFS file system via Terraform (add to `infra/terraform/modules/efs/`)

3. Create a StorageClass:

   ```yaml
   kind: StorageClass
   apiVersion: storage.k8s.io/v1
   metadata:
     name: efs-sc
   provisioner: efs.csi.aws.com
   ```
Note: EFS adds complexity and cost. For MVP, ephemeral storage (no persistence) is acceptable.
Located in service_chat/services/llm_client.py:
```python
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=512,    # Maximum response length
    temperature=0.7,       # Randomness (0.0 = deterministic, 1.0 = creative)
    top_p=0.9,             # Nucleus sampling
    do_sample=True,        # Enable sampling (vs greedy)
    pad_token_id=tokenizer.eos_token_id
)
```

- max_new_tokens: Increase for longer responses (512 → 1024), but this increases latency
- temperature: Lower (0.3) for factual responses, higher (0.9) for creative responses
- top_p: Nucleus sampling threshold (0.9 is standard)

Changes require code modification and redeployment.
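The snippet above assumes `model`, `tokenizer`, and `inputs` already exist. A minimal end-to-end sketch, assuming the standard `transformers` API (objects returned by `AutoModelForCausalLM`/`AutoTokenizer.from_pretrained`), ties the parameters together:

```python
def run_inference(model, tokenizer, prompt: str,
                  max_new_tokens: int = 512,
                  temperature: float = 0.7,
                  top_p: float = 0.9) -> str:
    """Tokenize the prompt, generate with the parameters above, decode the result."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # outputs[0] is the first (and only) generated sequence
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```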
- Instance Type: t3.medium or t3.large
- Expected Latency: 5-15 seconds per request
- Concurrency: Limited (1-2 requests per pod simultaneously)
- Cost: Low (~$30-60/month for node group)
For better performance, consider GPU instances:
- Instance Type: g4dn.xlarge (1 GPU, 4 vCPUs, 16GB RAM)
- Expected Latency: 1-3 seconds per request
- Concurrency: Higher (5-10 requests per pod)
- Cost: Higher (~$300-500/month)
Migration: Update node_instance_types in Terraform to include GPU instances, update Dockerfile to use GPU-enabled PyTorch.
Symptom: Pod logs show "Failed to download model"
Causes:
- No internet access from pods
- Hugging Face rate limiting
- Model requires authentication
Solutions:
```bash
# Check NAT gateway (pods need internet access)
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl https://huggingface.co

# Add Hugging Face token as Secret
kubectl create secret generic huggingface-token \
  --from-literal=token=$HUGGINGFACE_TOKEN \
  -n carepath-demo
```

Update the deployment to use the secret:

```hcl
env {
  name = "HUGGINGFACE_TOKEN"
  value_from {
    secret_key_ref {
      name = "huggingface-token"
      key  = "token"
    }
  }
}
```

Symptom: Pods crash with `OOMKilled` status
Cause: Model requires more memory than pod limit
Solutions:
```hcl
# Increase memory limits in deployment
resources {
  limits = {
    memory = "8Gi" # Increase from 512Mi
  }
  requests = {
    memory = "4Gi"
  }
}

# Or upgrade node instance type
node_instance_types = ["t3.xlarge"] # 16GB RAM
```

Symptom: Latency >30 seconds
Causes:
- CPU inference is slow
- Node CPU throttling
- Too many concurrent requests
Solutions:
- Reduce `max_new_tokens` (512 → 256)
- Lower `temperature` for faster generation
- Increase pod CPU limits
- Add more replicas to distribute load
- Consider GPU instances
Symptom: "Model not found" or "Unable to load model"
Check:
```bash
# SSH into pod
kubectl exec -it <pod-name> -n carepath-demo -- /bin/bash

# Check model directory
ls -lh /app/models/

# Check environment
env | grep MODEL
env | grep LLM_MODE

# Try manual download
python -c "from service_chat.services.model_manager import download_model_if_needed; download_model_if_needed()"
```

- ECR Image Storage: ~$0.10/GB/month
  - Mock mode image: ~0.5GB = $0.05/month
  - Qwen embedded image: ~8GB = $0.80/month
- EFS (if using persistent cache): ~$0.30/GB/month
  - Model cache: ~10GB = $3/month
- t3.medium (2 vCPU, 4GB): ~$0.0416/hour = $30/month
- t3.large (2 vCPU, 8GB): ~$0.0832/hour = $60/month
- g4dn.xlarge (4 vCPU, 16GB, 1 GPU): ~$0.526/hour = $380/month
Recommendation: Start with t3.medium or t3.large for MVP, upgrade to GPU if latency becomes an issue.
- Model Versioning: Support multiple model versions simultaneously
- A/B Testing: Route traffic to different models for comparison
- Quantization: Use 4-bit or 8-bit quantized models for faster inference
- Model Serving: Use Triton Inference Server or TorchServe for optimized serving
- Multi-GPU: Distribute model across multiple GPUs for very large models
- Fine-tuning: Fine-tune Qwen on healthcare-specific data
- Qwen3-4B-Thinking-2507: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507
- Hugging Face Hub: https://huggingface.co/docs/hub/index
- Transformers Library: https://huggingface.co/docs/transformers/index
- PyTorch: https://pytorch.org/docs/stable/index.html
- Deployment Options: `notes/rollout-options.md`
- Upgrade Guide: `notes/ai-service-upgrade.md`