feat: multi-Python worker images with startup version check (AE-2827)#89
Draft
Add Python 3.10 and 3.11 support to GPU worker images via side-by-side
torch install in the existing runpod/pytorch base. 3.12 keeps the fast
path (torch pre-installed) to avoid the ~7 GB reinstall cost on hot
deployments; 3.10/3.11 images pay that cost once per cold start per DC.
Sibling to flash#322 which landed the SDK-level plumbing. Tags follow
the same ``py${VERSION}-${TAG}`` scheme already in use for CPU images.
- Dockerfile / Dockerfile-lb (GPU): accept PYTHON_VERSION build arg;
install torch from download.pytorch.org/whl/cu128 and repoint
/usr/local/bin/python for non-3.12 targets; validate interpreter
matches the arg during build.
- Dockerfile-cpu / Dockerfile-lb-cpu (CPU): surface PYTHON_VERSION at
runtime via FLASH_PYTHON_VERSION env so the worker's startup check
can read it.
- src/version.py: new ``assert_python_version_matches_image`` — raises
PythonVersionMismatchError at handler boot when ``sys.version_info``
disagrees with the image's stamped FLASH_PYTHON_VERSION. Caught
before user code runs; skipped when the env var is unset (local dev).
- src/handler.py / src/lb_handler.py: call the assertion immediately
after logging setup, before ``maybe_unpack()`` and handler import.
- tests/unit/test_version.py: 4 new cases covering env-unset skip,
match, mismatch raise, and message contents.
- tests/unit/test_lb_handler.py: extend the mocked ``version`` module
with ``assert_python_version_matches_image`` so fresh-import tests
don't break.
- .github/workflows/ci.yml: expand CI to build GPU and LB images
across {3.10, 3.11, 3.12}; align prod CPU and LB-CPU default to
3.12 (matches flash's DEFAULT_PYTHON_VERSION).
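The startup check described for src/version.py can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes FLASH_PYTHON_VERSION holds a "major.minor" string such as "3.11", and the ``environ`` and ``version_info`` parameters are hypothetical additions that make the sketch testable without touching the real environment.

```python
import os
import sys


class PythonVersionMismatchError(RuntimeError):
    """Raised when the running interpreter differs from the image's stamped version."""


def assert_python_version_matches_image(environ=None, version_info=None):
    # Hypothetical parameters: default to the real process state when omitted.
    environ = os.environ if environ is None else environ
    version_info = sys.version_info if version_info is None else version_info

    expected = environ.get("FLASH_PYTHON_VERSION")
    if not expected:
        # Env var unset (local dev, test harness): skip the check.
        return

    # Assumption: the image stamps "major.minor", e.g. "3.11".
    expected = expected.strip()
    actual = f"{version_info[0]}.{version_info[1]}"
    if actual != expected:
        raise PythonVersionMismatchError(
            f"image advertises Python {expected} "
            f"but the worker is running Python {actual}"
        )
```

A handler would call this once, right after logging setup and before importing user code, so a mis-tagged image fails at boot instead of partway through a job.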
Ubuntu 22.04's system python3.10 has ensurepip disabled by Debian policy, which broke the side-by-side torch install for 3.10 GPU images (CI: docker-test-gpu (3.10), docker-test-lb (3.10)). python3.11 is a separate interpreter without the disable, so only 3.10 was affected. Use urllib+get-pip.py instead of ensurepip — works for any interpreter regardless of distro patching, and urllib is stdlib so no curl dep. Also corrects the outdated deadsnakes comment on both Dockerfiles: the runpod/pytorch base image layers alt-Python 3.11/3.12 on top of the system 3.10, not via deadsnakes.
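The urllib + get-pip.py approach can be sketched in Python as below. This is an illustration under assumptions, not the Dockerfile step from this PR: the helper names are hypothetical, the get-pip.py URL is the standard PyPA bootstrap location, and the torch command assumes the wheel index is passed via pip's ``--index-url`` option.

```python
import subprocess
import tempfile
import urllib.request
from pathlib import Path

# Standard PyPA bootstrap script location.
GET_PIP_URL = "https://bootstrap.pypa.io/get-pip.py"
# Wheel index named in the PR description.
TORCH_INDEX_URL = "https://download.pytorch.org/whl/cu128"


def get_pip_command(interpreter, script_path):
    """Command that runs get-pip.py under the target interpreter."""
    return [interpreter, str(script_path)]


def torch_install_command(interpreter):
    """Command that installs torch for the target interpreter (assumed form)."""
    return [interpreter, "-m", "pip", "install", "torch",
            "--index-url", TORCH_INDEX_URL]


def bootstrap_pip(interpreter, url=GET_PIP_URL):
    """Download get-pip.py with urllib (stdlib only) and run it.

    Does not rely on ensurepip, so it works even where the distro
    has disabled that module for the interpreter.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "get-pip.py"
        urllib.request.urlretrieve(url, str(script))
        subprocess.run(get_pip_command(interpreter, script), check=True)
```

Calling ``bootstrap_pip("python3.10")`` followed by ``subprocess.run(torch_install_command("python3.10"), check=True)`` would mirror the two steps; both need network access, so they are left as functions rather than run at import time.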
Summary
Sibling to runpod/flash#322. Adds Python 3.10 and 3.11 to GPU and LB images via side-by-side torch install in the runpod/pytorch base, expands CI to build the full {3.10, 3.11, 3.12} × {gpu, cpu, lb, lb-cpu} matrix, and fails the worker fast at boot if the interpreter doesn't match the image's advertised version.
Marked draft until CI has run a full green matrix: I don't have a GPU Docker daemon locally to validate the non-3.12 side-by-side torch install path.
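To make the size of that matrix concrete, the sketch below enumerates it and derives per-version tags using the ``py${VERSION}-${TAG}`` scheme. The dictionary layout and function names are hypothetical and do not reflect the actual ci.yml. Marking 3.12 as the default follows the PR's stated alignment with flash's DEFAULT_PYTHON_VERSION.

```python
from itertools import product

PYTHON_VERSIONS = ("3.10", "3.11", "3.12")
IMAGE_KINDS = ("gpu", "cpu", "lb", "lb-cpu")
DEFAULT_PYTHON_VERSION = "3.12"


def version_tag(python_version, tag):
    """Per-version tag following the py${VERSION}-${TAG} scheme."""
    return f"py{python_version}-{tag}"


def build_matrix(tag="latest"):
    """One entry per (Python version, image kind) pair."""
    return [
        {
            "kind": kind,
            "python": version,
            "tag": version_tag(version, tag),
            # Hypothetical flag: the default version also gets the bare tag.
            "is_default": version == DEFAULT_PYTHON_VERSION,
        }
        for version, kind in product(PYTHON_VERSIONS, IMAGE_KINDS)
    ]
```

With three versions and four image kinds this yields twelve build entries, four of which (one per kind) carry the default flag.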
Design
- … download.pytorch.org/whl/cu128 and repoint /usr/local/bin/python + python3 to the target interpreter. Paid cold-start cost: ~7 GB one-time per DC.
- … ``ARG PYTHON_VERSION``; added ``ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}`` so the worker startup check can read it uniformly.
- … ``is-default: true`` entry moves from 3.11 to 3.12 to match flash's ``DEFAULT_PYTHON_VERSION``. Tags like runpod/flash-cpu:latest now resolve to 3.12; per-version tags like runpod/flash-cpu:py3.11-latest are unchanged.
Worker-side guardrail
``version.assert_python_version_matches_image()`` raises ``PythonVersionMismatchError`` at QB + LB handler boot when ``sys.version_info`` disagrees with ``FLASH_PYTHON_VERSION``. Catches mis-tagged images before user code runs. Skipped when the env var is unset (local dev / pytest harness).
Test plan
- ``make quality-check`` passes locally (all 271 unit tests + 14 handler smoke tests)
- … ``test_version.py`` cover env-unset skip, match, mismatch raise, error-message contents
- ``test_lb_handler.py`` mock refreshed so fresh-import tests continue to pass
Rollout