Skip to content

Marker gene validation#81

Open
suu-yi wants to merge 6 commits intoNygenAnalytics:masterfrom
suu-yi:marker-gene-validation
Open

Marker gene validation#81
suu-yi wants to merge 6 commits intoNygenAnalytics:masterfrom
suu-yi:marker-gene-validation

Conversation

@suu-yi
Copy link
Copy Markdown
Contributor

@suu-yi suu-yi commented Apr 8, 2026

Implement simple checks on the marker genes before payload construction:

  • Check for gene id to gene symbols mapping errors, warning for genes lost in mapping
  • Raise value error if marker genes are empty, prevent sending empty marker gene lists
  • Validation check if marker genes from adata.uns are gene id like

@suu-yi suu-yi marked this pull request as ready for review April 8, 2026 14:09
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add marker gene validation and ID detection checks

✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
• Add validation for marker gene extraction with lost genes detection
• Raise error when all marker gene lists are empty after mapping
• Detect and warn when rank_genes_groups contain gene IDs instead of symbols
• Implement helper functions to identify gene ID-like patterns
Diagram
flowchart LR
  A["Marker Gene Extraction"] --> B["Check Lost Genes"]
  B --> C["Log Warning if Lost"]
  A --> D["Validate Non-Empty Results"]
  D --> E["Raise Error if Empty"]
  F["Rank Genes Validation"] --> G["Sample Gene Names"]
  G --> H["Detect ID-like Patterns"]
  H --> I["Warn if IDs Found"]
Loading

Grey Divider

File Changes

1. cytetype/preprocessing/extraction.py ✨ Enhancement +13/-0

Add marker gene loss tracking and empty validation

• Track and log genes lost during ID-to-name mapping with warning messages
• Add validation to raise ValueError when all marker gene lists are empty
• Provide informative error message about var_names inconsistencies

cytetype/preprocessing/extraction.py


2. cytetype/preprocessing/validation.py ✨ Enhancement +22/-0

Add gene ID pattern detection and validation warning

• Sample genes from rank_genes_groups results for pattern analysis
• Implement gene ID-like detection using helper functions _id_like_percentage and
 _is_gene_id_like
• Warn users when gene IDs are detected instead of gene symbols
• Suggest re-running rank_genes_groups with current adata

cytetype/preprocessing/validation.py


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Apr 8, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

Grey Divider


Remediation recommended

1. Wrong empty-markers error🐞
Description
In extract_marker_genes, the new "all marker gene lists are empty" exception is raised before the
existing "no marker genes found" check, so cases like n_top_genes=0 or empty rank_genes_groups
columns will throw a mapping-mismatch error even though no top genes were extracted.
Code

cytetype/preprocessing/extraction.py[R88-96]

+    if not any(markers.values()):
+        raise ValueError(
+            "All marker gene lists are empty. Gene names in rank_genes_groups "
+            "could not be matched to adata.var_names. This typically happens "
+            "when var_names were changed after rank_genes_groups was run."
+        )
   if not any_genes_found:
       raise ValueError(
           "No marker genes found for any group. This could indicate issues with the "
Evidence
any_genes_found is only set when top_genes is non-empty, but the new emptiness check runs first and
will trigger whenever all extracted marker lists are empty (including when top_genes was empty for
every group). CyteType does not validate that n_top_genes is > 0, so this misleading error path is
user-reachable.

cytetype/preprocessing/extraction.py[69-99]
cytetype/main.py[76-170]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`extract_marker_genes()` raises the new “All marker gene lists are empty…” error before it checks whether any top genes were present at all (`any_genes_found`). This can produce a misleading mapping-mismatch error for cases where no top genes were extracted (e.g., `n_top_genes=0`, or rank_genes_groups produced empty columns).
## Issue Context
- `any_genes_found` only tracks whether `top_genes` was non-empty for at least one group, not whether mapping succeeded.
- The new `if not any(markers.values())` check currently executes before `if not any_genes_found`.
## Fix Focus Areas
- cytetype/preprocessing/extraction.py[69-99]
- cytetype/main.py[76-86]
## Suggested change
1) Validate `n_top_genes >= 1` early (either in `CyteType.__init__` or inside `extract_marker_genes`).
2) Reorder error checks at the end of `extract_marker_genes`:
 - First: if `not any_genes_found`: raise the existing “No marker genes found…” error.
 - Then: if `not any(markers.values())`: raise the new “All marker gene lists are empty…” error.
 - (Optional) If desired, add a more specific message when `any_genes_found` is true but mapping produced empty lists for all clusters.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

2. Noisy gene-ID warning🐞
Description
validate_adata warns whenever rank_genes_groups names look ID-like, but in common setups
rank_genes_groups legitimately returns var_names (often Ensembl IDs) while marker extraction maps
IDs→symbols using gene_symbols_col, making the warning misleading and likely to fire on normal runs.
Code

cytetype/preprocessing/validation.py[R341-361]

+    rank_names = adata.uns[rank_genes_key]["names"]
+    try:
+        sample_genes = [
+            str(g)
+            for field in rank_names.dtype.names[:3]
+            for g in rank_names[field][:20]
+        ]
+    except Exception:
+        sample_genes = []
+
+    if sample_genes:
+        id_pct = _id_like_percentage(sample_genes)
+        if id_pct > 50:
+            examples = [g for g in sample_genes if _is_gene_id_like(g)][:3]
+            logger.warning(
+                f"rank_genes_groups results contain gene IDs rather than gene symbols "
+                f"(e.g. {examples}). This typically happens when var_names were Ensembl "
+                f"IDs at the time rank_genes_groups was run but have since been replaced "
+                f"with gene symbols. Marker gene extraction may fail or produce empty "
+                f"results. Consider re-running sc.tl.rank_genes_groups on the current adata."
+            )
Evidence
The new warning logic checks only the rank_genes_groups 'names' sample and does not consider whether
a gene-symbol mapping column is present. The repo’s own integration fixture uses Ensembl-like var
index values and runs scanpy rank_genes_groups (producing ID-like marker names), while CyteType
passes a resolved gene_symbols_column into validate_adata and later maps those IDs to symbols during
extract_marker_genes—so the warning can be expected even when everything is configured correctly.

cytetype/preprocessing/validation.py[330-362]
cytetype/main.py[157-170]
cytetype/preprocessing/extraction.py[49-79]
tests/conftest.py[34-48]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`validate_adata()` emits a warning when rank_genes_groups marker names look like gene IDs, but it doesn’t account for the (valid) case where IDs are expected and will be mapped to gene symbols via `gene_symbols_col`. This can create noisy warnings and can mislead users (current message suggests var_names were replaced).
## Issue Context
- Rank genes results are sampled from `adata.uns[rank_genes_key]['names']`.
- CyteType commonly runs with Ensembl IDs in `var_names` and a separate gene-symbol column; extraction maps IDs → symbols.
## Fix Focus Areas
- cytetype/preprocessing/validation.py[341-361]
- cytetype/main.py[157-170]
## Suggested change
- Either gate the warning (e.g., only warn when `gene_symbols_col is None`, or when there is evidence of a mismatch such as: `rank_names` are ID-like but `adata.var_names` are not ID-like).
- Alternatively (or additionally), revise the warning text to avoid claiming var_names were replaced; instead state that IDs were detected and marker extraction requires consistent identifiers or a correct mapping column.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant