
feat(build): per-app python_version with cross-resource validation#322

Open
deanq wants to merge 1 commit into main from deanq/ae-2827-multi-python-versions

Conversation

Member

@deanq deanq commented Apr 21, 2026

Summary

Expand GPU and CPU worker images to support Python 3.10, 3.11, and 3.12. Flash apps ship as one tarball so every resource in an app must share a Python version; the build step now reconciles per-resource python_version declarations into a single app-level value, or accepts an explicit --python-version override on flash build / flash deploy.

This is the SDK half of AE-2827. The flash-worker half (GPU Dockerfile parameterization for side-by-side torch install + CI matrix expansion + worker startup assertion) ships as a sibling PR once image builds can be validated end-to-end.

Changes

  • constants.py — GPU_PYTHON_VERSIONS / CPU_PYTHON_VERSIONS expanded from ("3.12",) to ("3.10", "3.11", "3.12"). DEFAULT_PYTHON_VERSION stays 3.12.
  • build_utils/manifest.py — new _reconcile_python_version(). Stamps the reconciled version onto every resource's target_python_version; removes the now-redundant GPU/CPU special-case that hardcoded 3.12.
  • build.py / deploy.py — --python-version CLI flag threaded through run_build into ManifestBuilder. Existing pip-wheel threading via _resolve_pip_python_version keeps working.
  • docs/Flash_Deploy_Guide.md — new "Python version selection" section documenting per-resource declarations, the app-level override, the ~7 GB GPU cold-start tax for non-3.12, and the 3.10 EOL (2026-10-31) warning.
  • tests/unit/test_dotenv_loading.py — add preserve_runpod_flash_modules fixture. Two existing tests force-reload runpod_flash via del sys.modules[...] without restoring, leaking stale module references into sibling test files and causing flakes in TestRemoteClassWrapperPickle. The fixture snapshots and restores. Pre-existing main had 22 order-dependent failures in a -p no:randomly run; this branch now has 0 in the pickle cluster.

Reconciliation rules

Resolution order when building the manifest:

  1. Explicit --python-version override (validated against SUPPORTED_PYTHON_VERSIONS)
  2. Exactly one distinct python_version declared across resource configs
  3. DEFAULT_PYTHON_VERSION when no resource declares one

Raises ValueError when resources declare conflicting versions, or when the override conflicts with a resource's explicit declaration.
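The resolution order above can be sketched as follows. This is a hypothetical reconstruction: names like SUPPORTED_PYTHON_VERSIONS and DEFAULT_PYTHON_VERSION mirror the PR description, but the body is illustrative rather than the actual manifest.py code.

```python
SUPPORTED_PYTHON_VERSIONS = ("3.10", "3.11", "3.12")
DEFAULT_PYTHON_VERSION = "3.12"

def reconcile_python_version(resource_versions, override=None):
    """Resolve one app-level Python version from per-resource declarations."""
    # Versions explicitly declared by resources (None means "not declared").
    declared = {v for v in resource_versions if v is not None}

    # Rule 1: an explicit override wins, but is validated and must not
    # contradict any resource's explicit declaration.
    if override is not None:
        if override not in SUPPORTED_PYTHON_VERSIONS:
            raise ValueError(f"Unsupported python_version: {override}")
        conflicts = declared - {override}
        if conflicts:
            raise ValueError(
                f"--python-version {override} conflicts with resource "
                f"declarations {sorted(conflicts)}"
            )
        return override

    # Conflicting declarations across resources are an error.
    if len(declared) > 1:
        raise ValueError(
            f"Conflicting python_version declarations: {sorted(declared)}"
        )

    # Rule 2: exactly one distinct declared version is used for the app.
    if declared:
        return declared.pop()

    # Rule 3: nothing declared anywhere, fall back to the default.
    return DEFAULT_PYTHON_VERSION
```

Note that rule 2 tolerates a mix of declared and undeclared resources as long as every declaration agrees; only genuinely distinct versions conflict.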

Test plan

  • make quality-check passes (85.7% coverage)
  • 9 new TestReconcilePythonVersion cases cover override/conflict/default paths
  • Parametrized 3.10 / 3.11 / 3.12 across test_constants.py + test_live_serverless.py
  • Test isolation fix verified: pytest -p no:randomly tests/unit/ pickle-cluster failures drop from 6 → 0
  • Follow-up PR: flash-worker GPU Dockerfile side-by-side torch install + CI matrix + worker startup assertion
  • Integration: end-to-end deploy of a 3.11-targeted FlashApp once the flash-worker PR lands


promptless Bot commented Apr 21, 2026

Promptless prepared a documentation update related to this change.

Triggered by runpod/flash#322

This PR introduces per-app Python version selection for Flash, allowing users to target Python 3.10, 3.11, or 3.12. The docs update covers the new --python-version CLI flag, per-resource python_version configuration, reconciliation rules, and cold-start performance implications for GPU workers.

Review: Document per-app Python version selection for Flash

@deanq deanq changed the title from feat(build): per-app python_version with cross-resource validation (AE-2827) to feat(build): per-app python_version with cross-resource validation on Apr 21, 2026

feat(build): per-app python_version with cross-resource validation (AE-2827)

Expand GPU and CPU worker images to support Python 3.10, 3.11, and 3.12.
Flash apps ship as one tarball so every resource must share a Python
version; the build step now reconciles per-resource python_version
declarations into a single app-level value or accepts an explicit
--python-version override.

- constants.py: expand GPU_PYTHON_VERSIONS / CPU_PYTHON_VERSIONS tuples
- manifest.py: add _reconcile_python_version; stamp target_python_version
  on every resource; raise on conflicting declarations
- build.py / deploy.py: add --python-version CLI flag, thread through
  run_build and ManifestBuilder
- docs: document per-app python_version, cold-start tradeoff for non-3.12
  GPU images, and the 3.10 EOL window
- tests/unit/test_dotenv_loading.py: add preserve_runpod_flash_modules
  fixture so module-deletion tests don't leak stale module references
  into sibling test files (unblocks deterministic test ordering)
@deanq deanq force-pushed the deanq/ae-2827-multi-python-versions branch from 2548f64 to 6138ae2 on April 22, 2026 04:32
Contributor

@runpod-Henrik runpod-Henrik left a comment


QA Review — PR 322

Tested against v1.14.0 baseline. Four findings.


Finding 1: SDK ships before worker images exist

GPU_PYTHON_VERSIONS and CPU_PYTHON_VERSIONS now include "3.10" and "3.11", but these image tags don't exist until a separate flash-worker PR merges. A user who deploys with --python-version 3.10 or --python-version 3.11 today will silently receive the 3.12 image (or the deploy will fail with an opaque image-pull error), depending on how RunPod handles missing tags. The CLI accepts the flag and emits no warning.

Question: Is there a plan to block --python-version 3.10 and --python-version 3.11 at the CLI until the worker images exist? Or will this ship as-is?


Finding 2: First deploy after SDK upgrade may trigger a spurious rolling release

python_version is now extracted from resource configs and stamped into the manifest (via target_python_version). For any project that had python_version set on a resource before this PR, the manifest fingerprint will differ from the deployed fingerprint even though the user changed nothing. This will trigger the rolling release recycle on first deploy after upgrade.

Whether this is acceptable depends on whether the behaviour is documented. Nothing in the changelog or the _reconcile_python_version docstring mentions it.

Question: Is this a known and acceptable tradeoff? If so, it should be called out in the release notes.
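The drift mechanism is generic: stamping a new field into the manifest changes any content-addressed fingerprint even when the user's config is unchanged. A minimal sketch, assuming the fingerprint is a hash over canonical manifest JSON (the real fingerprinting scheme is not shown in this PR):

```python
import hashlib
import json

def fingerprint(manifest: dict) -> str:
    """Hypothetical content fingerprint: SHA-256 of canonical manifest JSON."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Same user-facing config; the new build additionally stamps
# target_python_version onto the resource, so the hash moves.
before = {"resources": [{"name": "gpu-a", "python_version": "3.12"}]}
after = {"resources": [{"name": "gpu-a", "python_version": "3.12",
                        "target_python_version": "3.12"}]}
```

Any scheme that hashes the full manifest behaves this way, so the spurious first-deploy recycle follows from the stamping itself rather than from a bug in the hashing.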


Finding 3: self.python_version is set twice in ManifestBuilder

In manifest.py, self.python_version is assigned in __init__ and then overwritten inside build(). Any code that reads self.python_version between construction and build() would see the wrong value. Currently nothing does, but the pattern is fragile. The __init__ assignment is a dead assignment — if build() always overwrites it, the __init__ line should be removed.
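In miniature, the hazard looks like this (a hypothetical class to illustrate the pattern, not the actual ManifestBuilder code):

```python
class ManifestBuilderSketch:
    """Minimal illustration of the double-assignment hazard."""

    def __init__(self, python_version=None):
        # Dead assignment if build() always overwrites it: any reader
        # between __init__ and build() sees a possibly-wrong value.
        self.python_version = python_version

    def build(self, resource_versions):
        # The reconciled value silently replaces whatever __init__ stored.
        declared = {v for v in resource_versions if v is not None}
        self.python_version = declared.pop() if len(declared) == 1 else "3.12"
        return self.python_version
```

Dropping the __init__ assignment (or making build() the only writer) removes the window in which the attribute holds a stale value.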


Finding 4: Isolation fix is scoped to two tests only

preserve_runpod_flash_modules is added to two tests in test_dotenv_loading.py. If the same dotenv re-import issue surfaces in other test files (the underlying cause is module-level side effects in runpod_flash/__init__.py), those tests will still be fragile. This fix is targeted, not structural — that's fine for now but worth noting if isolation artifacts appear elsewhere.


No blockers on the core logic. _reconcile_python_version() reads cleanly, the validation cases are thorough (9 new tests cover the main paths), and the CLI flag wiring is straightforward.
