
Refactor: Clean up EAGLE training dataset preparation#684

Open
benchislett wants to merge 3 commits into main from bchislett/eagle-training-qol

Conversation

@benchislett (Contributor) commented Dec 12, 2025

What does this PR do?

Type of change: Refactor

Overview:

  • Consolidate input dataset preparation into make_dataset.py
  • Read dataset mix spec from a YAML file
    • Can now specify how many samples to take from each split
    • Can no longer easily split a dataset into train/test sections. I don't think this feature was very useful to begin with: most datasets are already separated into train/val/test at the split level, and those that aren't will usually be split by the training framework anyway.
  • Add support for a few new dataset types: Magpie 300k/500k/1M and the Nemotron Post-Training Dataset v2.
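For a concrete picture, a mix spec of roughly this shape could drive the new script. The `sources`/`name`/`splits` keys follow the `example_data_config.yaml` template; the top-level output field name, the dataset names, and the quotas shown here are illustrative assumptions:

```yaml
# Hypothetical sketch of a dataset mix spec; field names outside
# sources/name/splits/global_limit are assumptions, not the exact schema.
output_file: train.jsonl   # assumed name for the train.jsonl output target
global_limit: 50000        # cap on total samples across all sources
sources:
  - name: "sharegpt"
    splits:
      all: 0               # 0 means take no samples from this split
  - name: "ultrachat"
    splits:
      train_sft: 25000     # take up to 25k samples from this split
```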

Usage

See README for detailed example

Testing

Ran it locally on all dataset modes, works successfully and output looks good. Checked shuffling, conversation IDs, and output contents were all unique and usable.

  • Make sure you have read and followed the Contributor guidelines and that your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: No

Summary by CodeRabbit

  • Documentation

    • Updated speculative decoding example documentation with new dataset references and standardized file paths.
  • New Features

    • Introduced configuration-driven dataset preparation supporting multiple dataset sources with centralized configuration files.
  • Refactor

    • Simplified dataset preparation workflow with unified tooling and updated default data paths throughout the training pipeline.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett requested a review from a team as a code owner December 12, 2025 22:31
@benchislett requested a review from h-guo18 December 12, 2025 22:31

codecov bot commented Dec 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.09%. Comparing base (1070d89) to head (3e90c5e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #684      +/-   ##
==========================================
- Coverage   70.10%   70.09%   -0.02%     
==========================================
  Files         221      221              
  Lines       25541    25541              
==========================================
- Hits        17905    17902       -3     
- Misses       7636     7639       +3     


```yaml
sources:
  - name: "sharegpt"
    splits:
      all: 0
```
Contributor commented:

what does 0 mean

benchislett (author) replied:

0 means "no samples". This is just an example template file showing which datasets are supported, so it includes most of the options even if they aren't used by default.

@yeyu-nvidia (Contributor) commented:

Can you rebase to main as there are some conflicts?

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

coderabbitai bot commented Mar 15, 2026

📝 Walkthrough

This change consolidates multiple dataset-specific preparation scripts into a single unified dataset builder. Individual scripts for Daring-Anteater, MTBench, ShareGPT, and UltraChat are removed and replaced with make_dataset.py, which uses YAML configuration to mix samples from multiple dataset sources. Supporting utilities are reorganized accordingly.

Changes

  • Documentation Updates (examples/speculative_decoding/README.md): Updated dataset references from Daring-Anteater to UltraChat-200k, standardized file paths to train.jsonl, replaced the data preparation instructions to use the new make_dataset.py with a YAML config, and updated the available datasets list.
  • Removed Dataset-Specific Preparation Scripts (examples/speculative_decoding/prepare_input_conversations/add_daring_anteater.py, add_mtbench.py, add_sharegpt.py, add_ultrachat.py, example_make_prompt_dataset.sh): Deleted the dataset-specific integrations (Daring-Anteater, MTBench, ShareGPT, UltraChat) and the shell example that orchestrated them, consolidating functionality into a unified configuration-driven builder.
  • Unified Dataset Builder (examples/speculative_decoding/prepare_input_conversations/make_dataset.py): New Python script implementing configuration-driven dataset mixing. Introduces DataMixingConfig, SourceDatasetSpec, and OutputConfig classes, async loaders for six dataset types (MTBench, ShareGPT, UltraChat, Daring-Anteater, Magpie, Nemotron Post-Training v2), constraint-based sampling, deduplication, and JSONL output writing. Supports --config-file and --full-conversations CLI flags.
  • Configuration Example (examples/speculative_decoding/prepare_input_conversations/example_data_config.yaml): New YAML template defining the data mixing configuration, with a train.jsonl output target, a global_limit of 50000, and per-dataset source specs with split-specific quotas.
  • Utility Changes (examples/speculative_decoding/prepare_input_conversations/utils.py): Removed four functions (add_conversations_to_split, mix_conversations_and_add_to_splits, update_dataset_file_with_conversations, dataset_splits_explanation) that handled dataset splitting and file persistence; retained id_for_conversation and download_file.
  • Training Script Update (examples/speculative_decoding/train_eagle3_and_export.sh): Changed the default input data path from input_conversations/daring-anteater.jsonl to input_conversations/train.jsonl.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant MakeDataset as make_dataset.py
    participant ConfigLoader as Config Loader
    participant AsyncLoaders as Async Loaders
    participant Dedup as Deduplication
    participant OutputWriter as Output Writer

    User->>MakeDataset: python make_dataset.py --config-file example_data_config.yaml
    MakeDataset->>ConfigLoader: load_conversations_for_split() for each source
    ConfigLoader->>AsyncLoaders: Initialize dataset-specific loaders<br/>(MTBench, ShareGPT, UltraChat, etc.)

    loop For each dataset source in config
        AsyncLoaders->>AsyncLoaders: Download & parse dataset
        AsyncLoaders->>AsyncLoaders: Filter/normalize records
        AsyncLoaders->>MakeDataset: Yield conversation objects
    end

    MakeDataset->>Dedup: Aggregate samples from all sources
    Dedup->>Dedup: Deduplicate by truncated ID
    Dedup->>Dedup: Apply per-split constraints
    Dedup->>Dedup: Apply global_limit subsampling
    Dedup->>OutputWriter: Shuffle and prepare batches
    OutputWriter->>OutputWriter: Generate conversation_id if missing
    OutputWriter->>User: Write JSONL to train.jsonl
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main objective of the PR: consolidating dataset preparation scripts into a single script and updating the README to reflect these changes.
  • Security Anti-Patterns ✅ Passed: The pull request contains no CRITICAL security anti-patterns; it uses yaml.safe_load(), avoids trust_remote_code=True hardcoding, and employs safe deserialization methods.


@coderabbitai bot left a comment:

Actionable comments posted: 4

🧹 Nitpick comments (1)
examples/speculative_decoding/prepare_input_conversations/make_dataset.py (1)

75-82: Normalize splits into a dict in the type system too.

After __post_init__, the rest of the file only treats splits as a mapping, but the field is still annotated as list[str] | dict[...]. Line 468 calls source.splits.items() unconditionally, so the runtime normalization works while the static contract stays wrong. A factory/classmethod that converts list input before instantiation would keep this API honest and avoid a mypy false branch here.

As per coding guidelines, "Use mypy for type checking on Python code (configured in pyproject.toml)."

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/speculative_decoding/prepare_input_conversations/make_dataset.py`:
- Around line 427-457: The current deduplication_ids set is local to
load_conversations_for_split (created where deduplication_ids is assigned) so
identical conversations from different splits/datasets can bypass dedup when
later concatenated; to fix, move the seen-set to the per-output scope and pass
it into load_conversations_for_split (or deduplicate after gathering split
results) so id_for_conversation(truncated_conversations) is checked against a
shared deduplication_ids before adding to unique_samples and before applying
global_limit/max_samples_for_constraint; update callers to accept a shared
deduplication_ids and ensure the check/update occurs prior to enforcing
global_limit.
- Around line 114-121: The function check_row_constraint performs a numeric
comparison before validating the type, causing TypeError for string typos; fix
it by first handling the special string "all", then validate that constraint is
an instance of (int, float) before doing constraint < 0, and if it's not a
numeric type raise a ValueError with a clear message; ensure
check_row_constraint still returns None for "all", returns the numeric value for
valid ints/floats (after the negative check), and raises for any other types
instead of letting a TypeError bubble up.
- Around line 272-287: In _load_ultrachat_conversations, the function currently
ignores the dataset's "messages" field and always emits a single-turn
conversation from "prompt"; update it to check ds[i].get("messages") and when
present and when the caller requested full conversations (honor the
--full-conversations flag by reading the relevant boolean variable or
parameter), convert those messages into the same [{"role": "...", "content":
"..."}] structure (mapping dataset speaker names to "user"/"assistant" as
appropriate), use that list as msgs, and only fall back to using the single
prompt (as today) when "messages" is missing/empty or full-conversations is
false; keep using id_for_conversation to generate prompt_id when needed and
preserve the prompt_id naming convention f"ultrachat-{split_name}-{prompt_id}"
and the existing length/yield semantics and final logger call.

In `@examples/speculative_decoding/README.md`:
- Around line 204-205: Update the README sentence that points readers to
make_dataset.py so it instead directs them to the example_data_config.yaml (the
config used by the new CLI flow) and briefly note that dataset selection is
managed via that YAML config and the CLI rather than by editing the script;
change the reference text in examples/speculative_decoding/README.md to mention
example_data_config.yaml and the CLI workflow.


📥 Commits

Reviewing files that changed from the base of the PR and between 1070d89 and 3e90c5e.

📒 Files selected for processing (10)
  • examples/speculative_decoding/README.md
  • examples/speculative_decoding/prepare_input_conversations/add_daring_anteater.py
  • examples/speculative_decoding/prepare_input_conversations/add_mtbench.py
  • examples/speculative_decoding/prepare_input_conversations/add_sharegpt.py
  • examples/speculative_decoding/prepare_input_conversations/add_ultrachat.py
  • examples/speculative_decoding/prepare_input_conversations/example_data_config.yaml
  • examples/speculative_decoding/prepare_input_conversations/example_make_prompt_dataset.sh
  • examples/speculative_decoding/prepare_input_conversations/make_dataset.py
  • examples/speculative_decoding/prepare_input_conversations/utils.py
  • examples/speculative_decoding/train_eagle3_and_export.sh
💤 Files with no reviewable changes (6)
  • examples/speculative_decoding/prepare_input_conversations/add_ultrachat.py
  • examples/speculative_decoding/prepare_input_conversations/add_sharegpt.py
  • examples/speculative_decoding/prepare_input_conversations/add_mtbench.py
  • examples/speculative_decoding/prepare_input_conversations/utils.py
  • examples/speculative_decoding/prepare_input_conversations/add_daring_anteater.py
  • examples/speculative_decoding/prepare_input_conversations/example_make_prompt_dataset.sh

Comment on lines +114 to +121
```python
def check_row_constraint(constraint) -> int | float | None:
    if constraint == "all":
        return None
    if constraint < 0:
        raise ValueError("Number of samples to use for a split cannot be negative.")
    if isinstance(constraint, (float, int)):
        return constraint
    return 0
```

⚠️ Potential issue | 🟡 Minor

Validate the constraint type before comparing it.

constraint < 0 runs before the type check, so a typo like "50" or "half" in the YAML raises TypeError instead of a clear config error. This is user-facing input, so fail fast with explicit validation here.
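A minimal repro of that failure mode. The function body copies the snippet quoted above; the quoted-number input is a hypothetical YAML typo:

```python
# Repro: the numeric comparison runs before any type validation, so a
# quoted number from the YAML config surfaces as a TypeError rather than
# a clear configuration error.
def check_row_constraint(constraint):
    if constraint == "all":
        return None
    if constraint < 0:  # reached with a str, raising TypeError
        raise ValueError("Number of samples to use for a split cannot be negative.")
    if isinstance(constraint, (float, int)):
        return constraint
    return 0


try:
    check_row_constraint("50")  # user wrote "50" instead of 50 in the YAML
except TypeError as exc:
    print(f"unhelpful error surfaced to the user: {exc}")
```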

🛠️ Proposed fix

```diff
-def check_row_constraint(constraint) -> int | float | None:
+def check_row_constraint(constraint: int | float | str | None) -> int | float | None:
     if constraint == "all":
         return None
+    if isinstance(constraint, bool) or not isinstance(constraint, (int, float)):
+        raise ValueError("Split constraint must be a non-negative int, float, or 'all'.")
     if constraint < 0:
         raise ValueError("Number of samples to use for a split cannot be negative.")
-    if isinstance(constraint, (float, int)):
-        return constraint
-    return 0
+    return constraint
```

Comment on lines +272 to +287
```python
async def _load_ultrachat_conversations(
    split_name: str,
) -> AsyncGenerator[int | dict[str, Any], None]:
    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=split_name)
    ds = ds.shuffle(seed=42)
    yield len(ds)
    for i in range(len(ds)):
        prompt = ds[i]["prompt"].strip()
        prompt_id = ds[i]["prompt_id"].strip()
        if prompt:
            msgs = [{"role": "user", "content": prompt}]
            if not prompt_id:
                prompt_id = id_for_conversation(msgs)
            prompt_id = f"ultrachat-{split_name}-{prompt_id}"
            yield {"conversation_id": prompt_id, "conversations": msgs}
    logger.info(f"Finished loading UltraChat {split_name} conversations.")
```

⚠️ Potential issue | 🟠 Major

🌐 Web query: HuggingFaceH4/ultrachat_200k dataset schema

💡 Result: HuggingFaceH4/ultrachat_200k entries are stored (Parquet) with three fields:

  • prompt: string (the instruction / opening user prompt)
  • messages: list of message objects, each with content (string) and role (string, e.g. "user", "assistant")
  • prompt_id: string (a 64-char id / hash)

The dataset provides 4 splits: train_sft, test_sft, train_gen, test_gen.



Use the messages field from UltraChat dataset when available to honor --full-conversations flag.

The --full-conversations flag controls whether to preserve full conversation history, but _load_ultrachat_conversations ignores the dataset's messages field and always emits only the initial prompt as a single-turn conversation. This makes the flag ineffective for UltraChat. Extract multi-turn conversations from the messages field when present, and fall back to prompt-only when messages is empty or unavailable.

Suggested fix

```diff
     ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=split_name)
     ds = ds.shuffle(seed=42)
     yield len(ds)
     for i in range(len(ds)):
-        prompt = ds[i]["prompt"].strip()
-        prompt_id = ds[i]["prompt_id"].strip()
-        if prompt:
-            msgs = [{"role": "user", "content": prompt}]
-            if not prompt_id:
-                prompt_id = id_for_conversation(msgs)
-            prompt_id = f"ultrachat-{split_name}-{prompt_id}"
-            yield {"conversation_id": prompt_id, "conversations": msgs}
+        row = ds[i]
+        msgs = [
+            {"role": msg["role"], "content": msg["content"].strip()}
+            for msg in row.get("messages", [])
+            if msg.get("content", "").strip()
+        ]
+        if not msgs:
+            prompt = row["prompt"].strip()
+            if not prompt:
+                continue
+            msgs = [{"role": "user", "content": prompt}]
+
+        prompt_id = row.get("prompt_id")
+        if not prompt_id:
+            prompt_id = id_for_conversation(msgs)
+        prompt_id = f"ultrachat-{split_name}-{prompt_id}"
+        yield {"conversation_id": prompt_id, "conversations": msgs}
```

Comment on lines +427 to +457
```python
deduplication_ids = set()
unique_samples = []
max_num_samples = max_samples_for_constraint(num_samples, row_constraint)
async for sample in samples_it:
    assert isinstance(sample, dict) and "conversations" in sample, (
        "Each conversation sample must be a dict with a 'conversations' field."
    )

    # Strip the last turn of the conversation as long as it is an assistant completion,
    # since we want to use these conversations as prompts only.
    if strip_last_completion:
        while sample["conversations"] and sample["conversations"][-1]["role"] != "user":
            sample["conversations"].pop()

    if not sample["conversations"]:
        continue

    sample["source_dataset"] = dataset_name
    sample["source_split"] = split_name

    # Deduplicate based on the first 512 characters from each turn.
    # To avoid too many similar conversations with minor differences.
    truncated_conversations = [
        {"role": msg["role"], "content": msg["content"][0:512]}
        for msg in sample["conversations"]
    ]
    dedup_id = id_for_conversation(truncated_conversations)
    if dedup_id not in deduplication_ids:
        deduplication_ids.add(dedup_id)
        unique_samples.append(sample)
        if len(unique_samples) >= max_num_samples:
```

⚠️ Potential issue | 🟠 Major

Deduplicate across the whole output, not just one split.

deduplication_ids is recreated on every load_conversations_for_split() call. Line 481 later concatenates many split results into one output file, so identical conversations from different splits or datasets can still survive into the same JSONL. If output uniqueness is a goal here, move the seen-set to the per-output scope and apply it before global_limit.
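A minimal sketch of the shared-set approach, under assumed names (conv_id and merge_unique are illustrative, not the actual make_dataset.py API; the real fix would thread the set through load_conversations_for_split or deduplicate after gathering split results):

```python
# Output-scoped deduplication: conv_id mirrors the PR's scheme of keying on
# each turn truncated to 512 characters; merge_unique keeps ONE seen-set
# for the whole output instead of one per split.
def conv_id(sample):
    return hash(
        tuple((m["role"], m["content"][:512]) for m in sample["conversations"])
    )


def merge_unique(split_results, global_limit=None):
    seen = set()  # shared across every split, not recreated per call
    merged = []
    for samples in split_results:
        for sample in samples:
            key = conv_id(sample)
            if key not in seen:
                seen.add(key)
                merged.append(sample)
    # Apply the global limit only after cross-split deduplication.
    return merged if global_limit is None else merged[:global_limit]
```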


Comment on lines +204 to 205
In addition to the default dataset, we support adding several other commonly used datasets in `prepare_input_conversations/make_dataset.py`:


⚠️ Potential issue | 🟡 Minor

Point users to the YAML config here.

This sentence sends readers to make_dataset.py, but dataset selection in this workflow now lives in prepare_input_conversations/example_data_config.yaml. Pointing to the config file will match the new CLI flow and keep users out of the script.

📝 Suggested wording

```diff
-In addition to the default dataset, we support adding several other commonly used datasets in `prepare_input_conversations/make_dataset.py`:
+In addition to the default dataset, you can enable several other commonly used datasets in `prepare_input_conversations/example_data_config.yaml`:
```
