Similar to what is happening in [Sporadic CUDA illegal memory access in dflash_worker_v2.py:335 on A6000 (SM86)](https://github.com//issues/38), I see an illegal memory access during CUDA graph capture at startup if I set the memory fraction to 0.7 or 0.9. The inference server does start up at 0.8, but certain requests then trigger another illegal memory access error:
Startup error:
```
[2026-04-15 02:12:14 TP2] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3293, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1179, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3173, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1240, in event_loop_overlap
    batch_result = self.run_batch(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2410, in run_batch
    batch_result = self.model_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/dflash_worker_v2.py", line 135, in forward_batch_generation
    batch_output = self.target_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 467, in forward_batch_generation
    out = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2482, in forward
    output = self._forward_raw(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2584, in _forward_raw
    ret, can_run_graph = self.forward_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2419, in forward_extend
    self.model.forward(
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/gpt_oss.py", line 648, in forward
    return self.logits_processor(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 341, in forward
    logits = self._get_logits(pruned_states, lm_head, logits_metadata)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 826, in _get_logits
    logits = self._compute_lm_head(hidden_states, lm_head, embedding_bias)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 881, in _compute_lm_head
    logits = torch.matmul(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
```
After starting up and serving requests with 0.8 mem fraction:
```
[2026-04-12 21:16:41 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3293, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1179, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3173, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1229, in event_loop_overlap
    batch = self.get_next_batch_to_run()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2036, in get_next_batch_to_run
    self.running_batch = self.update_running_batch(self.running_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2350, in update_running_batch
    batch.prepare_for_decode()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1949, in prepare_for_decode
    draft_input.prepare_for_decode(self)
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/dflash_info_v2.py", line 137, in prepare_for_decode
    cur_kv_lens = cur_kv_lens_cpu_t.to(device=batch.device)
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
In this scenario, the crash is usually preceded by a run of decode batches where the accept rate is 1.00, which doesn't seem correct:
```
2026-04-12T21:16:04.363657306Z [2026-04-12 21:16:04 TP0] Decode batch, #running-req: 1, #full token: 87023, full token usage: 0.04, #swa token: 87023, swa token usage: 0.05, accept len: 10.00, accept rate: 1.00, cuda graph: True, gen throughput (token/s): 1296.44, #queue-req: 0
```
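For context on why this looks wrong: with `--speculative-num-draft-tokens 10`, an accept length of 10.00 means every draft token is accepted on every decode step. A quick sanity check (assuming accept rate is simply accepted tokens over proposed draft tokens; the exact formula inside SGLang may differ):

```python
# Hypothetical accept-rate check based on the values in the log line above.
num_draft_tokens = 10   # from --speculative-num-draft-tokens
accept_len = 10.00      # "accept len" reported by the decode-batch log

accept_rate = accept_len / num_draft_tokens
print(accept_rate)      # 1.0 -> the verifier rejects nothing, which is suspicious
```

A sustained 1.00 over many batches suggests the target-model verification step may not actually be rejecting any draft tokens, which could leave KV/metadata bookkeeping out of sync.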
Server Launch Config
```shell
python3 -m sglang.launch_server \
  --model-path /models/gpt-oss-120b \
  --served-model-name gpt-oss-120b \
  --tp-size 4 \
  --chunked-prefill-size 131072 \
  --context-length 131072 \
  --reasoning-parser gpt-oss \
  --tool-call-parser gpt-oss \
  --enable-metrics \
  --max-running-requests 64 \
  --mem-fraction-static 0.9 \
  --attention-backend fa3 \
  --speculative-num-draft-tokens 10 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path /models/gpt-oss-120b-DFlash
```
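Since illegal-memory-access errors are reported asynchronously, the faulting frame in the tracebacks above may not be the real culprit. One way to localize it, as the error message itself suggests, is to relaunch with synchronous kernel launches; a sketch using the same flags (expect a significant throughput hit, so only for debugging):

```shell
# Relaunch with CUDA_LAUNCH_BLOCKING=1 so kernel errors surface at the call
# site that actually faulted rather than at a later API call.
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server \
  --model-path /models/gpt-oss-120b \
  --tp-size 4 \
  --mem-fraction-static 0.8 \
  --attention-backend fa3 \
  --speculative-num-draft-tokens 10 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path /models/gpt-oss-120b-DFlash
```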
Environment
GPU: NDH100
SGLang: 0.5.6.post2, installed from refs/pull/20547/head (up to commit sgl-project/sglang@4926ca2), so it should already include the "fix" mentioned in the other issue
flashinfer: 0.6.4
PyTorch: 2.9.1
sgl-kernel: 0.3.21
Model: GPTOSS120B + GPTOSS120B-Dflash
Appreciate your help here, thank you!