
CUDA illegal memory access in dflash_worker_v2.py on startup and for certain requests #61

@liangjason87

Description

Similar to [Sporadic CUDA illegal memory access in dflash_worker_v2.py:335 on A6000 (SM86)](https://github.com//issues/38), I noticed an illegal memory access when trying to capture CUDA graphs on startup if I set the mem fraction to 0.7 or 0.9. The inference server starts up when set at 0.8, but certain requests then trigger another illegal memory access error:

Startup error:

```
[2026-04-15 02:12:14 TP2] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3293, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1179, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3173, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1240, in event_loop_overlap
    batch_result = self.run_batch(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2410, in run_batch
    batch_result = self.model_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/dflash_worker_v2.py", line 135, in forward_batch_generation
    batch_output = self.target_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 467, in forward_batch_generation
    out = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2482, in forward
    output = self._forward_raw(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2584, in _forward_raw
    ret, can_run_graph = self.forward_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2419, in forward_extend
    self.model.forward(
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/gpt_oss.py", line 648, in forward
    return self.logits_processor(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 341, in forward
    logits = self._get_logits(pruned_states, lm_head, logits_metadata)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 826, in _get_logits
    logits = self._compute_lm_head(hidden_states, lm_head, embedding_bias)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 881, in _compute_lm_head
    logits = torch.matmul(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
```

After starting up and serving requests with 0.8 mem fraction:

```
[2026-04-12 21:16:41 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3293, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1179, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3173, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1229, in event_loop_overlap
    batch = self.get_next_batch_to_run()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2036, in get_next_batch_to_run
    self.running_batch = self.update_running_batch(self.running_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2350, in update_running_batch
    batch.prepare_for_decode()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1949, in prepare_for_decode
    draft_input.prepare_for_decode(self)
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/dflash_info_v2.py", line 137, in prepare_for_decode
    cur_kv_lens = cur_kv_lens_cpu_t.to(device=batch.device)
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
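Since the error message above notes that CUDA kernel errors can be reported asynchronously at a later API call, the traceback may not point at the kernel that actually faulted. A debug-only relaunch following the log's own suggestion (same flags as in the launch config below) should make the traceback land on the real faulting call:

```shell
# Debug-only relaunch: CUDA_LAUNCH_BLOCKING=1 makes every kernel launch
# synchronous, so the Python traceback points at the kernel that actually
# faulted rather than at a later API call. Much slower; not for production.
export CUDA_LAUNCH_BLOCKING=1
python3 -m sglang.launch_server --model-path /models/gpt-oss-120b --tp-size 4  # plus the remaining flags from the launch config below
```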

For this scenario, I notice the crash is usually preceded by a run of decode batches where the accept rate is 1.00, which doesn't seem correct.

```
2026-04-12T21:16:04.363657306Z [2026-04-12 21:16:04 TP0] Decode batch, #running-req: 1, #full token: 87023, full token usage: 0.04, #swa token: 87023, swa token usage: 0.05, accept len: 10.00, accept rate: 1.00, cuda graph: True, gen throughput (token/s): 1296.44, #queue-req: 0
```
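For context on why a constant 1.00 looks wrong: with `--speculative-num-draft-tokens 10`, an accept length of 10.00 would mean the target model accepted every drafted token at every step, which is implausible for a real draft model over many batches. A toy sketch of that bookkeeping (my own illustration of one plausible way such metrics are computed, not SGLang's actual code):

```python
# Toy illustration of speculative-decoding accept metrics (not SGLang's code).
# Each decode step drafts `num_draft_tokens` tokens; the target model then
# accepts some prefix of them and rejects the rest.
def accept_stats(accepted_per_step, num_draft_tokens):
    steps = len(accepted_per_step)
    accept_len = sum(accepted_per_step) / steps   # avg tokens accepted per step
    accept_rate = accept_len / num_draft_tokens   # fraction of drafts kept
    return accept_len, accept_rate

# A healthy draft model accepts only part of each draft:
print(accept_stats([6, 4, 7, 5], 10))       # -> (5.5, 0.55)

# The logs above report accept len 10.00 / rate 1.00, i.e. all 10 draft
# tokens accepted at every single step -- the suspicious case:
print(accept_stats([10, 10, 10, 10], 10))   # -> (10.0, 1.0)
```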

Server Launch Config

```shell
python3 -m sglang.launch_server \
  --model-path /models/gpt-oss-120b \
  --served-model-name gpt-oss-120b \
  --tp-size 4 \
  --chunked-prefill-size 131072 \
  --context-length 131072 \
  --reasoning-parser gpt-oss \
  --tool-call-parser gpt-oss \
  --enable-metrics \
  --max-running-requests 64 \
  --mem-fraction-static 0.9 \
  --attention-backend fa3 \
  --speculative-num-draft-tokens 10 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path /models/gpt-oss-120b-DFlash
```
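In case it helps triage, two launch variations that could isolate where the fault lives (sketches only; `--disable-cuda-graph` is a standard SGLang flag, and I am assuming the rest of the config stays unchanged):

```shell
# Variation A: same config minus CUDA graph capture, to test whether the
# startup crash is tied to graph capture at mem fraction 0.7/0.9:
python3 -m sglang.launch_server --model-path /models/gpt-oss-120b --disable-cuda-graph  # plus the remaining flags from the config above

# Variation B: drop the three --speculative-* flags entirely, to confirm
# the non-speculative baseline is stable at the same mem fraction.
```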

Environment

GPU: NDH100
SGLang: 0.5.6.post2, installed from refs/pull/20547/head (up to commit sgl-project/sglang@4926ca2), so it should already include the "fix" mentioned in the other issue
flashinfer: 0.6.4
PyTorch: 2.9.1
sgl-kernel: 0.3.21
Model: GPTOSS120B + GPTOSS120B-Dflash

Appreciate your help here, thank you!
