Similar to what is happening in [Sporadic CUDA illegal memory access in dflash_worker_v2.py:335 on A6000 (SM86)](https://github.com//issues/38), I see an illegal memory access during CUDA graph capture at startup if I set the memory fraction to 0.7 or 0.9. The inference server does start up at 0.8, but certain requests then trigger another illegal memory access error:
Startup error:
```
[2026-04-15 02:12:14 TP2] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3293, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1179, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3173, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1240, in event_loop_overlap
    batch_result = self.run_batch(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2410, in run_batch
    batch_result = self.model_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/dflash_worker_v2.py", line 135, in forward_batch_generation
    batch_output = self.target_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 467, in forward_batch_generation
    out = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2482, in forward
    output = self._forward_raw(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2584, in _forward_raw
    ret, can_run_graph = self.forward_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2419, in forward_extend
    self.model.forward(
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/gpt_oss.py", line 648, in forward
    return self.logits_processor(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 341, in forward
    logits = self._get_logits(pruned_states, lm_head, logits_metadata)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 826, in _get_logits
    logits = self._compute_lm_head(hidden_states, lm_head, embedding_bias)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 881, in _compute_lm_head
    logits = torch.matmul(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
```
After starting up and serving requests with 0.8 mem fraction:
```
[2026-04-12 21:16:41 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3293, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1179, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3173, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1229, in event_loop_overlap
    batch = self.get_next_batch_to_run()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2036, in get_next_batch_to_run
    self.running_batch = self.update_running_batch(self.running_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2350, in update_running_batch
    batch.prepare_for_decode()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1949, in prepare_for_decode
    draft_input.prepare_for_decode(self)
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/dflash_info_v2.py", line 137, in prepare_for_decode
    cur_kv_lens = cur_kv_lens_cpu_t.to(device=batch.device)
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
In this scenario, the crash is usually preceded by a run of decode batches where the accept rate is 1.00, which doesn't seem correct:
```
2026-04-12T21:16:04.363657306Z [2026-04-12 21:16:04 TP0] Decode batch, #running-req: 1, #full token: 87023, full token usage: 0.04, #swa token: 87023, swa token usage: 0.05, accept len: 10.00, accept rate: 1.00, cuda graph: True, gen throughput (token/s): 1296.44, #queue-req: 0
```
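For context on why this looks wrong: with `--speculative-num-draft-tokens 10`, an accept length of 10.00 means every draft token is accepted on every decode step. A quick sanity check (assuming accept rate is simply accepted tokens over proposed draft tokens; the exact formula inside SGLang may differ):

```python
# Hypothetical accept-rate check based on the values in the log line above.
num_draft_tokens = 10   # from --speculative-num-draft-tokens
accept_len = 10.00      # "accept len" reported by the decode-batch log

accept_rate = accept_len / num_draft_tokens
print(accept_rate)      # 1.0 -> the verifier rejects nothing, which is suspicious
```

A sustained 1.00 over many batches suggests the target-model verification step may not actually be rejecting any draft tokens, which could leave KV/metadata bookkeeping out of sync.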
Server Launch Config
```shell
python3 -m sglang.launch_server \
  --model-path /models/gpt-oss-120b \
  --served-model-name gpt-oss-120b \
  --tp-size 4 \
  --chunked-prefill-size 131072 \
  --context-length 131072 \
  --reasoning-parser gpt-oss \
  --tool-call-parser gpt-oss \
  --enable-metrics \
  --max-running-requests 64 \
  --mem-fraction-static 0.9 \
  --attention-backend fa3 \
  --speculative-num-draft-tokens 10 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path /models/gpt-oss-120b-DFlash
```
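Since illegal-memory-access errors are reported asynchronously, the faulting frame in the tracebacks above may not be the real culprit. One way to localize it, as the error message itself suggests, is to relaunch with synchronous kernel launches; a sketch using the same flags (expect a significant throughput hit, so only for debugging):

```shell
# Relaunch with CUDA_LAUNCH_BLOCKING=1 so kernel errors surface at the call
# site that actually faulted rather than at a later API call.
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server \
  --model-path /models/gpt-oss-120b \
  --tp-size 4 \
  --mem-fraction-static 0.8 \
  --attention-backend fa3 \
  --speculative-num-draft-tokens 10 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path /models/gpt-oss-120b-DFlash
```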
Environment
GPU: NDH100
SGLang: 0.5.6.post2, installed from refs/pull/20547/head (up to commit sgl-project/sglang@4926ca2), so it should already include the "fix" mentioned in the other issue
flashinfer: 0.6.4
PyTorch: 2.9.1
sgl-kernel: 0.3.21
Model: GPTOSS120B + GPTOSS120B-Dflash
Appreciate your help here, thank you!