feat: v0.6.5 bidirectional CUDA IPC with Python exporter and receiver mode#3

Merged
forkni merged 7 commits into master from feat/4k-performance-analysis
Feb 22, 2026

@forkni (Owner) commented Feb 22, 2026

Summary

  • Add CUDAIPCExporter (Python → TD direction) for sending AI-generated frames back to TouchDesigner via zero-copy CUDA IPC
  • Add dual Sender/Receiver mode to CUDAIPCExtension TD component, enabling both directions
  • Add build_wheel.cmd for PEP 517 wheel distribution (cuda_link-0.6.5-py3-none-any.whl)
  • Add a test suite of 7 test files, including roundtrip and Python exporter tests
  • Update README with architecture docs, wheel install guide, and troubleshooting
  • Add CI/CD infrastructure: Charlie CI and GitHub Actions (Claude + Gemini review, branch protection, docs validation)

Changes

  • src/cuda_link/cuda_ipc_exporter.py — New Python-side exporter class
  • td_exporter/CUDAIPCExtension.py — Dual mode (Sender/Receiver) with ring buffer
  • td_exporter/parexecute_callbacks.py — Mode parameter callback
  • tests/test_roundtrip.py — End-to-end Python↔TD roundtrip tests
  • tests/test_cuda_ipc_exporter_python.py — Python exporter unit tests
  • build_wheel.cmd — Wheel builder script
  • pyproject.toml — v0.6.5 with dev extras
  • TOXES/ — All component versions v0.6.0–v0.6.5
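
The ring buffer mentioned for `td_exporter/CUDAIPCExtension.py` is not described further in this PR. As a rough illustration of the general technique only, the sketch below rotates frames through a fixed number of slots. `SlotRing`, `publish`, and `latest` are hypothetical names, and host `bytearray` objects stand in for GPU buffers; none of this is taken from the repository.

```python
class SlotRing:
    """Hypothetical stand-in: host bytearrays play the role of GPU buffers."""

    def __init__(self, num_slots: int, slot_size: int):
        self.slot_size = slot_size
        self.slots = [bytearray(slot_size) for _ in range(num_slots)]
        self.write_index = 0  # total number of frames published so far

    def publish(self, frame: bytes) -> int:
        """Copy a frame into the next slot and return that slot's number."""
        if len(frame) > self.slot_size:
            raise ValueError("frame does not fit in a slot")
        slot = self.write_index % len(self.slots)
        self.slots[slot][: len(frame)] = frame
        self.write_index += 1  # bump only after the copy is complete
        return slot

    def latest(self):
        """Return the most recently published frame, or None if empty."""
        if self.write_index == 0:
            return None
        slot = (self.write_index - 1) % len(self.slots)
        return bytes(self.slots[slot])


ring = SlotRing(num_slots=3, slot_size=4)
for i in range(4):
    slot = ring.publish(bytes([i] * 4))
print(slot, ring.latest())  # the fourth frame wraps around to slot 0
```

With more than one slot, a reader can keep using the previous frame while the writer fills the next one, which is the usual reason to prefer a ring over a single shared buffer.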

Test Plan

  • pytest tests/ -v -m "not requires_cuda" — protocol and unit tests (no GPU needed)
  • pytest tests/ -v -m "requires_cuda" — CUDA integration tests (needs GPU)
  • Verify Claude Code review posts a comment
  • Verify Gemini review posts a comment
  • Verify branch-protection validate/test/lint all pass
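
The two pytest commands above select tests by a `requires_cuda` marker. A minimal sketch of how such a marker could be registered and auto-skipped in a `conftest.py` follows. It assumes a hypothetical `cuda_driver_available()` helper that probes for the CUDA driver library with `ctypes`; the repository's actual configuration may differ.

```python
# conftest.py (illustrative sketch, not the repository's actual file)
import ctypes
import sys


def cuda_driver_available() -> bool:
    """Hypothetical helper: True if the CUDA driver library can be loaded."""
    name = "nvcuda.dll" if sys.platform == "win32" else "libcuda.so.1"
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


def pytest_configure(config):
    # Register the marker so `-m "requires_cuda"` does not raise warnings.
    config.addinivalue_line(
        "markers", "requires_cuda: test needs a CUDA-capable GPU"
    )


def pytest_collection_modifyitems(config, items):
    # Local import keeps cuda_driver_available() reusable outside pytest.
    import pytest

    if cuda_driver_available():
        return
    skip = pytest.mark.skip(reason="CUDA driver not available")
    for item in items:
        if "requires_cuda" in item.keywords:
            item.add_marker(skip)
```

With a hook like this, a plain `pytest tests/ -v` would also behave sensibly on a machine without a GPU, since marked tests are reported as skipped rather than failing.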

🤖 Generated with Claude Code

web-flow and others added 5 commits February 12, 2026 11:48
…ation

- Added cudaMemory() timing instrumentation; the call takes 215-275us, so it is not the source of the 8ms bottleneck
- Implemented TD 2025+ modoutsidecook receiver path for Execute DAT direct import
- Fixed receiver resolution delay by swapping callback order (import_frame before update_receiver_resolution)
- Added backward compatibility for TD 2023 (force-cook fallback)
- Updated TOX_BUILD_GUIDE.md with Step 6b for modoutsidecook setup
- Added comprehensive stat files for sender and receiver analysis
- Updated .toe file with latest component state
- Updated package version in pyproject.toml
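
The timing instrumentation mentioned above is not shown in this PR. A minimal sketch of one way to collect per-call timings in microseconds is given below; `timed_us` and the `samples` dictionary are hypothetical names, and `sum(range(1000))` is only a stand-in for the call being measured.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_us(label, samples):
    """Append the block's elapsed wall-clock time (microseconds) to samples[label]."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        elapsed_us = (time.perf_counter_ns() - start) / 1_000
        samples.setdefault(label, []).append(elapsed_us)


samples = {}
for _ in range(5):
    with timed_us("stand_in_call", samples):
        sum(range(1000))  # stand-in for the call being measured

print(min(samples["stand_in_call"]), max(samples["stand_in_call"]))
```

A host-side timer like this only measures how long the CPU spends inside the call. Work that the GPU executes asynchronously needs a synchronization point or CUDA events to be timed accurately.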

Performance findings:
- cudaMemory(): 215-275us regardless of resolution (1024² to 4K)
- GPU D2D memcpy: 65us (1024²), 1045-1067us (4K)
- TD Execute DAT overhead: ~8ms (resolution-independent, not optimizable)
- Receiver modoutsidecook: 0.167ms vs 0.149ms force-cook
- Resolution change delay: reduced from 3-4 frames to 1-2 frames
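
To put the findings above in proportion, the arithmetic below adds up the worst-case 4K figures and compares them with a frame budget. The 60 fps target and the assumption that the three stages run back to back are illustrative and not stated in the source.

```python
# Figures copied from the performance findings (worst case of each range).
CUDA_MEMORY_US = 275       # cudaMemory(), resolution-independent
D2D_MEMCPY_4K_US = 1067    # GPU D2D memcpy at 4K
EXECUTE_DAT_US = 8000      # TD Execute DAT overhead (~8 ms)

TARGET_FPS = 60            # assumption, not stated in the source

frame_budget_us = 1_000_000 / TARGET_FPS
total_us = CUDA_MEMORY_US + D2D_MEMCPY_4K_US + EXECUTE_DAT_US  # assumes serial stages
share = EXECUTE_DAT_US / total_us

print(f"frame budget: {frame_budget_us / 1000:.2f} ms")
print(f"measured stages: {total_us / 1000:.2f} ms")
print(f"Execute DAT share: {share:.0%}")
```

Under these assumptions the measured stages sum to 9.34 ms of a 16.67 ms budget, and the Execute DAT overhead accounts for about 86% of that total.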

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@charliecreates

The pull request is too large to review automatically due to GitHub's line limit. Please consider breaking it into smaller PRs for a more effective review.

@forkni forkni merged commit 6861c0e into master Feb 22, 2026
5 checks passed