
SpecDec Bench Tweaks #939

Open
benchislett wants to merge 8 commits into main from bchislett/specbench-tweaks

Conversation

@benchislett
Contributor

@benchislett benchislett commented Feb 26, 2026

What does this PR do?

  • Add HumanEval to SpecDec Bench
    • Skip the first token, since prefill doesn't draft tokens. This matters for HumanEval because completions are short, so even one step can affect the average AR. Also skip the last token, since EOS interrupts speculation
  • Add parallel drafting support, which vLLM uses e.g. for P-EAGLE
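To illustrate why trimming matters for short completions, here is a toy average-acceptance calculation (the numbers are invented for illustration; they are not from the benchmark):

```python
# Toy per-turn acceptance counts for a short HumanEval-style completion.
# Each entry is the number of tokens produced in one decoding step:
# the first entry (prefill) drafts nothing, the last is cut short by EOS.
turn = [1, 4, 4, 4, 2]

raw_ar = sum(turn) / len(turn)  # includes prefill and the EOS-truncated step

trimmed = turn[1:] if turn[0] <= 1 else turn             # drop prefill step
trimmed = trimmed[:-1] if len(trimmed) > 1 else trimmed  # drop EOS-truncated step
trimmed_ar = sum(trimmed) / len(trimmed)

print(raw_ar, trimmed_ar)  # 3.0 vs 4.0 on this toy sequence
```

On a 256-token completion the two extra entries barely move the average; on a short HumanEval completion they distort it noticeably, which is the motivation stated above.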

Example usage:

mkdir results
python3 run.py \
  --model_dir openai/gpt-oss-20b \
  --tokenizer openai/gpt-oss-20b \
  --draft_model_dir amazon/GPT-OSS-20B-P-EAGLE \
  --dataset humaneval \
  --dataset_path openai/openai_humaneval \
  --tp_size 1 --ep_size 1 \
  --draft_length 15 --output_length 256 \
  --engine VLLM --concurrency 1 --num_requests 164 \
  --show_progress --speculative_algorithm EAGLE3 \
  --postprocess gptoss --save_dir results/ \
  --parallel_drafting > results/log.txt 2>&1

Summary by CodeRabbit

  • New Features

    • HumanEval dataset is now available for use in benchmarks
    • Added a --parallel_drafting command-line option to enable parallel drafting during model inference
  • Bug Fixes

    • Acceptance-rate metric preprocessing refined to trim extraneous tokens for more accurate per-turn measurements

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett benchislett requested a review from a team as a code owner February 26, 2026 15:03
@coderabbitai
Contributor

coderabbitai bot commented Feb 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 79292c5e-f888-4a89-bbb1-8d5875a2cb14

📥 Commits

Reviewing files that changed from the base of the PR and between 6c5c507 and 8815a70.

📒 Files selected for processing (2)
  • examples/specdec_bench/specdec_bench/datasets/humaneval.py
  • examples/specdec_bench/specdec_bench/models/vllm.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/specdec_bench/specdec_bench/models/vllm.py
  • examples/specdec_bench/specdec_bench/datasets/humaneval.py

📝 Walkthrough

Walkthrough

Adds HumanEval dataset support, a --parallel_drafting CLI flag wired into model/speculative-decoding, and trims turn data in acceptance rate computation before per-turn calculations.

Changes

Cohort / File(s) / Summary

  • HumanEval dataset — examples/specdec_bench/specdec_bench/datasets/humaneval.py, examples/specdec_bench/specdec_bench/datasets/__init__.py: New HumanEval Dataset adapter, a format_prompt helper, and export of HumanEval in the package __all__. Loads the dataset, builds Request objects, and limits samples by num_samples.
  • CLI & wiring — examples/specdec_bench/run.py: Added the --parallel_drafting CLI flag, registered "humaneval": datasets.HumanEval in the available datasets, and passed parallel_drafting=args.parallel_drafting to the model constructor.
  • Model / speculative decoding — examples/specdec_bench/specdec_bench/models/vllm.py: Propagates parallel_drafting into the speculative decoding config when present (sets specdec["parallel_drafting"] = True if kwargs.get("parallel_drafting") is truthy and specdec exists).
  • Metrics adjustment — examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py: process_final now trims per-turn sequences: it drops the prefill entry when the first element is <= 1 and removes the final element (EOS) before computing the per-turn acceptance rate and histogram lengths.
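The per-turn trimming described above can be sketched as follows (trim_turn is a hypothetical helper name for illustration; the real logic sits inline in process_final):

```python
def trim_turn(turn: list[int]) -> list[int]:
    """Drop the prefill and EOS-truncated steps before computing acceptance rate.

    `turn` holds the number of tokens emitted per decoding step.
    """
    if len(turn) > 1 and turn[0] <= 1:
        turn = turn[1:]   # prefill step drafts no tokens
    if len(turn) > 1:
        turn = turn[:-1]  # final step is truncated by EOS/stop
    return turn

print(trim_turn([1, 4, 4, 2]))  # [4, 4]
```

Both guards keep at least one element, so a single-step turn is never trimmed to an empty list.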

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (run.py)
    participant Dataset as HumanEval dataset loader
    participant Model as Model constructor (vllm)
    participant SpecDec as SpeculativeDecoding config
    CLI->>Dataset: select dataset "humaneval" and load samples
    Dataset-->>CLI: list of Request objects
    CLI->>Model: instantiate with parallel_drafting flag
    Model->>SpecDec: if specdec present and parallel_drafting true
    SpecDec-->>Model: specdec config includes parallel_drafting
    Model-->>CLI: model ready (with speculative decoding config)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: The title 'SpecDec Bench Tweaks' is vague and generic. While it references the SpecDec benchmark, it does not clearly indicate the main changes: HumanEval dataset addition, parallel drafting support, and acceptance rate calculation adjustments. Resolution: use a more specific title that captures the primary changes, such as 'Add HumanEval dataset and parallel drafting support to SpecDec Bench'.

✅ Passed checks (2)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Security Anti-Patterns — ✅ Passed: The PR does not introduce any security anti-patterns. The HumanEval dataset loads safely without unsafe deserialization, parallel_drafting is properly configurable, no eval()/exec() on untrusted input is present, and all dependencies use permissive licenses.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/specdec_bench/specdec_bench/datasets/humaneval.py (1)

24-28: Minor: Inaccurate inline comment.

The comment # list of list of questions. is misleading—self.data is actually a list[Request], not a nested list. Consider updating or removing the comment.

📝 Suggested fix
 class HumanEval(Dataset):
     def __init__(self, path, num_samples=164, **kwargs):
-        self.data: list[Request] = []  # list of list of questions.
+        self.data: list[Request] = []
         self.num_samples = num_samples
         self._preprocess(path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/datasets/humaneval.py` around lines 24 -
28, Update the inaccurate inline comment on the attribute in class HumanEval:
the field self.data is declared as a list[Request] (not a nested list), so
change or remove the comment "// list of list of questions." to accurately
reflect that self.data is a flat list of Request objects; locate the declaration
in the __init__ of HumanEval and adjust the comment to something like "list of
Request" or remove it entirely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/specdec_bench/specdec_bench/models/vllm.py`:
- Around line 68-70: The code currently sets specdec["parallel_drafting"] = True
when kwargs.get("parallel_drafting") is truthy but doesn't handle the case where
specdec is None; update the block around kwargs.get("parallel_drafting") to
first ensure specdec is a dict (e.g., if specdec is None, assign an empty dict)
or only perform the item assignment when specdec is not None, so that the
assignment to specdec["parallel_drafting"] cannot raise a TypeError; reference
the specdec variable and the kwargs.get("parallel_drafting") check to locate
where to add the guard/initialization.

---

Nitpick comments:
In `@examples/specdec_bench/specdec_bench/datasets/humaneval.py`:
- Around line 24-28: Update the inaccurate inline comment on the attribute in
class HumanEval: the field self.data is declared as a list[Request] (not a
nested list), so change or remove the comment "// list of list of questions." to
accurately reflect that self.data is a flat list of Request objects; locate the
declaration in the __init__ of HumanEval and adjust the comment to something
like "list of Request" or remove it entirely.
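The None-guard suggested for vllm.py above could look like the following sketch (apply_parallel_drafting is a hypothetical helper for illustration; the real assignment sits inline in the model constructor):

```python
def apply_parallel_drafting(specdec, **kwargs):
    """Guarded version of the specdec["parallel_drafting"] assignment."""
    if kwargs.get("parallel_drafting"):
        if specdec is None:
            specdec = {}  # initialize so the item assignment cannot raise TypeError
        specdec["parallel_drafting"] = True
    return specdec

print(apply_parallel_drafting(None, parallel_drafting=True))
# {'parallel_drafting': True}
```

Alternatively, the assignment can simply be skipped when specdec is None, if parallel drafting is only meaningful alongside an existing speculative config.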

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef5a2df and 6c5c507.

📒 Files selected for processing (5)
  • examples/specdec_bench/run.py
  • examples/specdec_bench/specdec_bench/datasets/__init__.py
  • examples/specdec_bench/specdec_bench/datasets/humaneval.py
  • examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
  • examples/specdec_bench/specdec_bench/models/vllm.py

@codecov

codecov bot commented Feb 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.09%. Comparing base (1070d89) to head (8815a70).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #939      +/-   ##
==========================================
- Coverage   70.10%   70.09%   -0.02%     
==========================================
  Files         221      221              
  Lines       25541    25541              
==========================================
- Hits        17905    17902       -3     
- Misses       7636     7639       +3     

☔ View full report in Codecov by Sentry.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

if len(turn) > 1 and turn[0] <= 1:
    turn = turn[1:]  # Skip prefill if it is 1 or less, indicating no specdec
if len(turn) > 1:
    turn = turn[:-1]  # Skip final acceptance due to EOS truncating speculation
Contributor

Only skip if EOS is present?

Contributor Author

I think it's reasonable to skip anyway, since truncation for any reason (EOS, length limit, stop token) might misrepresent the AR, because the truncated step is not aligned with the draft size.

