feat(e2e): add CPU E2E test suite with provisioner and rolling release tests#326

Open
runpod-Henrik wants to merge 5 commits into main from Henrik/e2e-cpu-smoke

Conversation

@runpod-Henrik runpod-Henrik commented Apr 23, 2026

Summary

Adds the full E2E test infrastructure built and validated during v1.14.0 QA. All 15 CPU tests confirmed passing locally.
AE-2168

New files:

  • e2e/provisioner.py — session-scoped endpoint pool with parallel provisioning
  • e2e/test_cpu_suite.py — QB function (smoke, empty string, unicode, concurrent), deps (numpy/pandas), class, LB endpoint (9 pass, 1 xfail AE-2744)
  • e2e/test_rolling_release.py — no-spurious-release and config-change-triggers-drift
  • e2e/test_redeploy.py — scale-to-zero and multi-worker recycle tests (3 pass)
  • e2e/test_gpu_smoke.py — GPU deploy → invoke → undeploy
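
As a rough illustration of what "session-scoped endpoint pool with parallel provisioning" could look like, here is a minimal sketch. It is not the code in e2e/provisioner.py: `provision_pool`, `fake_deploy`, and the `ep-` id format are hypothetical names made up for this example, and the real deploy call is replaced by a stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def provision_pool(
    names: Iterable[str],
    deploy: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Deploy every named endpoint concurrently and return name -> endpoint id."""
    names = list(names)
    if not names:
        return {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(names))) as pool:
        # pool.map preserves input order, so zip pairs each name with its id.
        # An exception raised by any deploy call propagates when the results
        # are consumed, which fails the whole provisioning step.
        return dict(zip(names, pool.map(deploy, names)))


def fake_deploy(name: str) -> str:
    # Stand-in for a real deploy call (assumption); returns a made-up id.
    return f"ep-{name}"


if __name__ == "__main__":
    print(provision_pool(["qb", "deps", "lb"], fake_deploy))
```

In a pytest suite, a helper like this would typically be called once from a fixture declared with `@pytest.fixture(scope="session")`, so that every test in the run shares the same endpoints instead of redeploying per test.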

Updated:

  • e2e/conftest.py — better error messages, sys.path fix, sweep prefix filter
  • e2e/test_cpu_smoke.py — updated for provisioner
  • .github/workflows/e2e.yml — inject FLASH_SDK_GIT_REF

Note on GPU smoke: test_gpu_smoke.py will time out in CI when GPU inventory is constrained.

Test plan

  • All 15 CPU tests confirmed passing locally (v1.14.0): cpu_smoke, cpu_suite (9+1 xfail), rolling_release (2), redeploy (3)
  • GPU smoke: requires GPU inventory; expected to time out when inventory is constrained

🤖 Generated with Claude Code

…e tests

Adds the full E2E test infrastructure built and validated during v1.14.0 QA:

- provisioner.py: session-scoped endpoint pool with parallel provisioning
- test_cpu_smoke.py: updated deploy → invoke → undeploy smoke test
- test_cpu_suite.py: QB function (smoke, empty string, unicode, concurrent),
  deps (numpy/pandas), class, and LB endpoint tests (9 pass, 1 xfail AE-2744)
- test_rolling_release.py: no-spurious-release and config-change-triggers-drift
- test_redeploy.py: scale-to-zero and multi-worker (scale-to-zero + always-on)
  recycle tests; single-slot always-on failures split to test_redeploy_always_on.py
- e2e.yml: enable push/PR CI triggers; inject FLASH_SDK_GIT_REF

All 15 CPU tests confirmed passing locally (v1.14.0). GPU smoke included;
may time out in CI when GPU inventory is constrained.

Excluded from this PR (tracked separately):
- test_redeploy_always_on.py: single-slot always-on recycle (AE-2940/2941/2942)
- test_source_fingerprint.py: needs assertion update
- test_concurrency_modifier.py: inconclusive — needs redesign

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
runpod-Henrik and others added 2 commits April 23, 2026 14:22
Keep workflow_dispatch-only trigger; schedule can be added back once
the E2E account quota and test batching are sorted out.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds a new end-to-end (E2E) testing harness under e2e/ to validate Flash deploy/redeploy/rolling-release behaviors against live RunPod endpoints, and wires CI to run the E2E suite against the exact commit under test.

Changes:

  • Add new CPU/GPU E2E test suites covering smoke, rolling release drift detection, and redeploy/worker recycle scenarios.
  • Introduce an E2E provisioner + improved E2E conftest.py helpers for credential restoration, endpoint state parsing, and prefix-scoped endpoint sweeping.
  • Update the E2E GitHub Actions workflow to inject FLASH_SDK_GIT_REF=${{ github.sha }} for worker-side dependency pinning.
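
To make the dependency pinning concrete, the sketch below shows one way a helper could turn FLASH_SDK_GIT_REF or FLASH_SDK_LOCAL_PATH into a pip requirement string. This is an assumption-laden illustration, not the PR's `flash_dep()`: the function name `flash_dep_sketch`, the repository URL, and the precedence between the two variables are all hypothetical.

```python
import os
from typing import Mapping, Optional

# Placeholder URL (assumption): the real repository location is not stated here.
HYPOTHETICAL_REPO = "https://example.com/org/flash.git"


def flash_dep_sketch(env: Optional[Mapping[str, str]] = None) -> str:
    """Return a pip requirement string for the SDK under test."""
    env = os.environ if env is None else env

    # Assumed precedence: a local checkout wins over a git ref.
    local_path = env.get("FLASH_SDK_LOCAL_PATH")
    if local_path:
        return local_path

    git_ref = env.get("FLASH_SDK_GIT_REF")
    if git_ref:
        # PEP 508 direct reference: "name @ git+<url>@<ref>".
        return f"runpod-flash @ git+{HYPOTHETICAL_REPO}@{git_ref}"

    # No pin requested: fall back to the published package.
    return "runpod-flash"


if __name__ == "__main__":
    print(flash_dep_sketch({"FLASH_SDK_GIT_REF": "abc123"}))
```

With the workflow exporting `FLASH_SDK_GIT_REF=${{ github.sha }}`, a helper of this shape would make workers install the exact commit being tested rather than the latest release.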

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| uv.lock | Updates editable package version entry for runpod-flash. |
| e2e/conftest.py | Adds sys.path fixups, better errors for state parsing, prefix-filtered endpoint sweeping, and an api_key fixture. |
| e2e/provisioner.py | Adds helper to deploy workers with dependency pinning via FLASH_SDK_GIT_REF / FLASH_SDK_LOCAL_PATH. |
| e2e/test_cpu_smoke.py | Switches worker dependency to flash_dep() and strengthens cleanup messaging. |
| e2e/test_cpu_suite.py | Adds session-scoped CPU endpoint pool and tests for QB/LB behavior, deps, concurrency, and auth. |
| e2e/test_redeploy.py | Adds redeploy / rolling release / recycle verification tests across worker configurations. |
| e2e/test_rolling_release.py | Adds drift detection tests for “no-op redeploy” and “config change triggers update”. |
| e2e/test_gpu_smoke.py | Adds GPU deploy → invoke → undeploy smoke test. |
| .github/workflows/e2e.yml | Injects FLASH_SDK_GIT_REF into E2E job environment. |
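
The "prefix-filtered endpoint sweeping" mentioned for e2e/conftest.py can be pictured roughly as follows. This is a sketch under assumptions, not the conftest code: `sweep_endpoints`, the `name`/`id` dictionary keys, and the `e2e-` prefix are hypothetical, and the delete call is injected so the example runs without any API access.

```python
from typing import Callable, Iterable, List, Mapping


def sweep_endpoints(
    endpoints: Iterable[Mapping[str, str]],
    prefix: str,
    delete: Callable[[str], None],
) -> List[str]:
    """Delete only endpoints whose name starts with prefix; return deleted ids."""
    if not prefix:
        # An empty prefix would match every endpoint on the account.
        raise ValueError("refusing to sweep with an empty prefix")

    removed: List[str] = []
    for endpoint in endpoints:
        if endpoint["name"].startswith(prefix):
            delete(endpoint["id"])
            removed.append(endpoint["id"])
    return removed


if __name__ == "__main__":
    listing = [
        {"id": "1", "name": "e2e-cpu-smoke"},
        {"id": "2", "name": "prod-api"},
    ]
    print(sweep_endpoints(listing, "e2e-", delete=lambda endpoint_id: None))
```

Scoping the sweep to a test-owned prefix keeps cleanup from touching endpoints that other jobs or users created on the same account.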

Comment thread e2e/test_rolling_release.py
Comment thread e2e/test_rolling_release.py Outdated
runpod-Henrik and others added 2 commits April 27, 2026 18:06
- Correct TestRollingReleaseNoSpuriousRelease docstring: remove false
  claim about 'cached' in output; describe actual worker_id comparison
- Make LOG_LEVEL=INFO explicit in _deploy_env so the "Updating endpoint"
  log.info assertion is reliable regardless of caller environment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
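
The LOG_LEVEL fix above can be sketched as follows. This is an illustration only: `deploy_env_sketch` and `saw_update` are hypothetical stand-ins for `_deploy_env` and the test's assertion, and the sample log line format is invented.

```python
import os
from typing import Dict, Mapping, Optional


def deploy_env_sketch(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the caller's environment and force LOG_LEVEL=INFO."""
    env = dict(os.environ if base is None else base)
    # Set explicitly so INFO-level messages are emitted even if the caller
    # exported a quieter level such as WARNING.
    env["LOG_LEVEL"] = "INFO"
    return env


def saw_update(output: str) -> bool:
    """Report whether the captured deploy output mentions an endpoint update."""
    return "Updating endpoint" in output


if __name__ == "__main__":
    print(deploy_env_sketch({"LOG_LEVEL": "WARNING"})["LOG_LEVEL"])
    print(saw_update("INFO Updating endpoint demo"))
```

Without the explicit override, a test that asserts on an INFO-level message would pass or fail depending on the shell it was launched from.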