feat(build): per-app python_version with cross-resource validation#322
feat(build): per-app python_version with cross-resource validation#322
Conversation
|
Promptless prepared a documentation update related to this change. Triggered by runpod/flash#322 This PR introduces per-app Python version selection for Flash, allowing users to target Python 3.10, 3.11, or 3.12. The docs update covers the new |
…E-2827) Expand GPU and CPU worker images to support Python 3.10, 3.11, and 3.12. Flash apps ship as one tarball so every resource must share a Python version; the build step now reconciles per-resource python_version declarations into a single app-level value or accepts an explicit --python-version override. - constants.py: expand GPU_PYTHON_VERSIONS / CPU_PYTHON_VERSIONS tuples - manifest.py: add _reconcile_python_version; stamp target_python_version on every resource; raise on conflicting declarations - build.py / deploy.py: add --python-version CLI flag, thread through run_build and ManifestBuilder - docs: document per-app python_version, cold-start tradeoff for non-3.12 GPU images, and the 3.10 EOL window - tests/unit/test_dotenv_loading.py: add preserve_runpod_flash_modules fixture so module-deletion tests don't leak stale module references into sibling test files (unblocks deterministic test ordering)
2548f64 to
6138ae2
Compare
runpod-Henrik
left a comment
There was a problem hiding this comment.
QA Review — PR 322
Tested against v1.14.0 baseline. Four findings.
Finding 1: SDK ships before worker images exist
GPU_PYTHON_VERSIONS and CPU_PYTHON_VERSIONS now include "3.10" and "3.11", but these image tags don't exist until a separate flash-worker PR merges. A user who deploys with --python-version 3.10 or --python-version 3.11 today will silently receive the 3.12 image (or the deploy will fail with an opaque image-pull error), depending on how RunPod handles missing tags. The CLI accepts the flag and emits no warning.
Question: Is there a plan to block --python-version 3.10 and --python-version 3.11 at the CLI until the worker images exist? Or will this ship as-is?
Finding 2: First deploy after SDK upgrade may trigger a spurious rolling release
python_version is now extracted from resource configs and stamped into the manifest (via target_python_version). For any project that had python_version set on a resource before this PR, the manifest fingerprint will differ from the deployed fingerprint even though the user changed nothing. This will trigger the rolling release recycle on first deploy after upgrade.
Whether this is acceptable depends on whether the behaviour is documented. Nothing in the changelog or the _reconcile_python_version docstring mentions it.
Question: Is this a known and acceptable tradeoff? If so, it should be called out in the release notes.
Finding 3: self.python_version is set twice in ManifestBuilder
In manifest.py, self.python_version is assigned in __init__ and then overwritten inside build(). Any code that reads self.python_version between construction and build() would see the wrong value. Currently nothing does, but the pattern is fragile. The __init__ assignment is dead assignment — if build() always overwrites it, the __init__ line should be removed.
Finding 4: Isolation fix is scoped to two tests only
preserve_runpod_flash_modules is added to two tests in test_dotenv_loading.py. If the same dotenv re-import issue surfaces in other test files (the underlying cause is module-level side effects in runpod_flash/__init__.py), those tests will still be fragile. This fix is targeted, not structural — that's fine for now but worth noting if isolation artifacts appear elsewhere.
No blockers on the core logic. _reconcile_python_version() reads cleanly, the validation cases are thorough (9 new tests cover the main paths), and the CLI flag wiring is straightforward.
Summary
Expand GPU and CPU worker images to support Python 3.10, 3.11, and 3.12. Flash apps ship as one tarball so every resource in an app must share a Python version; the build step now reconciles per-resource
python_versiondeclarations into a single app-level value, or accepts an explicit--python-versionoverride onflash build/flash deploy.This is the SDK half of AE-2827. The flash-worker half (GPU Dockerfile parameterization for side-by-side torch install + CI matrix expansion + worker startup assertion) ships as a sibling PR once image builds can be validated end-to-end.
Changes
constants.py—GPU_PYTHON_VERSIONS/CPU_PYTHON_VERSIONSexpanded from("3.12",)to("3.10", "3.11", "3.12").DEFAULT_PYTHON_VERSIONstays 3.12.build_utils/manifest.py— new_reconcile_python_version(). Stamps the reconciled version onto every resource'starget_python_version; removes the now-redundant GPU/CPU special-case that hardcoded 3.12.build.py/deploy.py—--python-versionCLI flag threaded throughrun_buildintoManifestBuilder. Existing pip-wheel threading via_resolve_pip_python_versionkeeps working.docs/Flash_Deploy_Guide.md— new "Python version selection" section documenting per-resource declarations, the app-level override, the ~7 GB GPU cold-start tax for non-3.12, and the 3.10 EOL (2026-10-31) warning.tests/unit/test_dotenv_loading.py— addpreserve_runpod_flash_modulesfixture. Two existing tests force-reloadrunpod_flashviadel sys.modules[...]without restoring, leaking stale module references into sibling test files and causing flakes inTestRemoteClassWrapperPickle. The fixture snapshots and restores. Pre-existingmainhad 22 order-dependent failures in a-p no:randomlyrun; this branch now has 0 in the pickle cluster.Reconciliation rules
Resolution order when building the manifest:
--python-versionoverride (validated againstSUPPORTED_PYTHON_VERSIONS)python_versiondeclared across resource configsDEFAULT_PYTHON_VERSIONwhen no resource declares oneRaises
ValueErrorwhen resources declare conflicting versions, or when the override conflicts with a resource's explicit declaration.Test plan
make quality-checkpasses (85.7% coverage)TestReconcilePythonVersioncases cover override/conflict/default pathstest_constants.py+test_live_serverless.pypytest -p no:randomly tests/unit/pickle-cluster failures drop from 6 → 0