feat: multi-Python worker images with startup version check (AE-2827)#89

Draft
deanq wants to merge 2 commits into main from deanq/ae-2827-multi-python-versions

deanq (Contributor) commented Apr 21, 2026

Summary

Sibling to runpod/flash#322. Adds Python 3.10 and 3.11 to GPU and LB images via side-by-side torch install in the runpod/pytorch base, expands CI to build the full {3.10, 3.11, 3.12} × {gpu, cpu, lb, lb-cpu} matrix, and fails the worker fast at boot if the interpreter doesn't match the image's advertised version.

Marked draft until CI has run a full green matrix — I don't have a GPU Docker daemon locally to validate the non-3.12 side-by-side torch install path.

Design

  • 3.12 keeps the existing fast path: use the base image's pre-installed Python + torch. No reinstall cost.
  • 3.10 / 3.11 install torch 2.9.1+cu128 from download.pytorch.org/whl/cu128 and repoint /usr/local/bin/python + python3 to the target interpreter. Cost: a ~7 GB torch reinstall, paid once per cold start per DC.
  • CPU / LB-CPU already accepted ARG PYTHON_VERSION; added ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION} so the worker startup check can read it uniformly.
  • Default flip on prod-cpu / prod-lb-cpu: the is-default: true entry moves from 3.11 to 3.12 to match flash's DEFAULT_PYTHON_VERSION. Tags like runpod/flash-cpu:latest now resolve to 3.12; per-version tags like runpod/flash-cpu:py3.11-latest are unchanged.
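
The default flip can be illustrated with a small sketch of how tags resolve. This is not the CI workflow itself: `tags_for` is a hypothetical helper, and it assumes every matrix entry publishes a `py${VERSION}-${TAG}` tag while only the is-default entry also publishes the bare tag.

```python
# Assumption: mirrors flash's DEFAULT_PYTHON_VERSION after this PR.
DEFAULT_PYTHON_VERSION = "3.12"
PYTHON_VERSIONS = ("3.10", "3.11", "3.12")


def tags_for(repo: str, python_version: str, tag: str = "latest") -> list[str]:
    """Return the tags one matrix entry would publish (hypothetical helper)."""
    tags = [f"{repo}:py{python_version}-{tag}"]
    if python_version == DEFAULT_PYTHON_VERSION:
        # Only the is-default entry also gets the bare tag.
        tags.append(f"{repo}:{tag}")
    return tags


for version in PYTHON_VERSIONS:
    print(version, tags_for("runpod/flash-cpu", version))
```

Under these assumptions, `runpod/flash-cpu:latest` is produced only by the 3.12 entry, while `runpod/flash-cpu:py3.11-latest` is still produced by the 3.11 entry.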

Worker-side guardrail

version.assert_python_version_matches_image() raises PythonVersionMismatchError when the QB and LB handlers boot and sys.version_info disagrees with FLASH_PYTHON_VERSION. This catches mis-tagged images before user code runs. The check is skipped when the env var is unset (local dev / pytest harness).
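
A minimal sketch of what the guardrail might look like, for readers who have not opened src/version.py. The function and exception names come from this PR. The optional `env` and `version_info` parameters, the `RuntimeError` base class, the message wording, and treating an empty value as unset are all assumptions made for illustration.

```python
import os
import sys


class PythonVersionMismatchError(RuntimeError):
    """Running interpreter differs from the image's stamped version."""


def assert_python_version_matches_image(env=None, version_info=None) -> None:
    # The injectable arguments are an assumption to keep the sketch testable;
    # called with no arguments it reads the real environment and interpreter.
    env = os.environ if env is None else env
    version_info = sys.version_info if version_info is None else version_info

    expected = env.get("FLASH_PYTHON_VERSION")
    if not expected:
        return  # unset (local dev / pytest harness): skip the check

    actual = f"{version_info[0]}.{version_info[1]}"
    if actual != expected.strip():
        # Message format is illustrative, not the PR's actual wording.
        raise PythonVersionMismatchError(
            f"image advertises Python {expected.strip()} "
            f"but the worker is running Python {actual}"
        )
```

Comparing only major.minor matches the granularity of the image tags (3.10, 3.11, 3.12), so patch releases of the same minor version do not trip the check.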

Test plan

  • make quality-check passes locally (all 271 unit tests + 14 handler smoke tests)
  • 4 new unit tests in test_version.py cover env-unset skip, match, mismatch raise, error-message contents
  • test_lb_handler.py mock refreshed so fresh-import tests continue to pass
  • CI: full image matrix builds green (draft until this is confirmed — particularly the GPU side-by-side torch install for 3.10 / 3.11)
  • Smoke test one non-3.12 image end-to-end: deploy a 3.11-targeted FlashApp once the image tags land on Docker Hub

Rollout

  • Merge after flash#322 so the SDK can resolve to the new image tags.
  • No env-var or flag flip required for existing 3.12 users — they stay on the fast path.

Add Python 3.10 and 3.11 support to GPU worker images via side-by-side
torch install in the existing runpod/pytorch base. 3.12 keeps the fast
path (torch pre-installed) to avoid the ~7 GB reinstall cost on hot
deployments; 3.10/3.11 images pay that cost once per cold start per DC.

Sibling to flash#322 which landed the SDK-level plumbing. Tags follow
the same ``py${VERSION}-${TAG}`` scheme already in use for CPU images.

- Dockerfile / Dockerfile-lb (GPU): accept PYTHON_VERSION build arg;
  install torch from download.pytorch.org/whl/cu128 and repoint
  /usr/local/bin/python for non-3.12 targets; validate interpreter
  matches the arg during build.
- Dockerfile-cpu / Dockerfile-lb-cpu (CPU): surface PYTHON_VERSION at
  runtime via FLASH_PYTHON_VERSION env so the worker's startup check
  can read it.
- src/version.py: new ``assert_python_version_matches_image`` — raises
  PythonVersionMismatchError at handler boot when ``sys.version_info``
  disagrees with the image's stamped FLASH_PYTHON_VERSION. Caught
  before user code runs; skipped when the env var is unset (local dev).
- src/handler.py / src/lb_handler.py: call the assertion immediately
  after logging setup, before ``maybe_unpack()`` and handler import.
- tests/unit/test_version.py: 4 new cases covering env-unset skip,
  match, mismatch raise, and message contents.
- tests/unit/test_lb_handler.py: extend the mocked ``version`` module
  with ``assert_python_version_matches_image`` so fresh-import tests
  don't break.
- .github/workflows/ci.yml: expand CI to build GPU and LB images
  across {3.10, 3.11, 3.12}; align prod CPU and LB-CPU default to
  3.12 (matches flash's DEFAULT_PYTHON_VERSION).
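
The boot ordering described for src/handler.py and src/lb_handler.py can be sketched as below. `boot` and the stand-in steps are hypothetical. Only the order (logging setup, version assertion, ``maybe_unpack()``, handler import) is taken from this commit message.

```python
def boot(setup_logging, check_version, maybe_unpack, import_handler):
    """Run the boot steps in the order the commit describes."""
    setup_logging()
    check_version()  # raises on mismatch, so the steps below never run
    maybe_unpack()
    return import_handler()


# Demonstration with stand-in steps that only record that they ran.
calls = []


def failing_check():
    calls.append("check")
    raise RuntimeError("interpreter does not match FLASH_PYTHON_VERSION")


try:
    boot(
        setup_logging=lambda: calls.append("logging"),
        check_version=failing_check,
        maybe_unpack=lambda: calls.append("unpack"),
        import_handler=lambda: calls.append("import"),
    )
except RuntimeError:
    pass

print(calls)  # ['logging', 'check']
```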

deanq force-pushed the deanq/ae-2827-multi-python-versions branch from d939c16 to b34f132 on April 22, 2026 04:32

Ubuntu 22.04's system python3.10 has ensurepip disabled by Debian
policy, which broke the side-by-side torch install for 3.10 GPU images
(CI: docker-test-gpu (3.10), docker-test-lb (3.10)). python3.11 is a
separate interpreter without the disable, so only 3.10 was affected.

Use urllib+get-pip.py instead of ensurepip — works for any interpreter
regardless of distro patching, and urllib is stdlib so no curl dep.
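
A rough sketch of the urllib + get-pip.py approach follows. It is not the Dockerfile's actual command. `bootstrap_pip` and its injectable `fetch` / `run` parameters are hypothetical, and the URL is the standard PyPA bootstrap location, assumed here to be the one the build uses.

```python
import os
import subprocess
import tempfile
import urllib.request

# Assumption: the standard PyPA bootstrap script location.
GET_PIP_URL = "https://bootstrap.pypa.io/get-pip.py"


def bootstrap_pip(interpreter, fetch=urllib.request.urlretrieve,
                  run=subprocess.check_call):
    """Download get-pip.py with urllib and run it under `interpreter`.

    `fetch` and `run` are injectable only so the sketch can be exercised
    without network access or a real target interpreter.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "get-pip.py")
        fetch(GET_PIP_URL, script)   # stdlib download, no curl needed
        run([interpreter, script])   # installs pip for that interpreter
```

Calling `bootstrap_pip("python3.10")` would install pip for the system 3.10 interpreter without going through ensurepip, which is the property this commit relies on.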

Also corrects the outdated deadsnakes comment on both Dockerfiles: the
runpod/pytorch base image layers alt-Python 3.11/3.12 on top of the
system 3.10, not via deadsnakes.