feat: add Dutch text normalization with number conversion, clitic expansion, and colloquial replacements#14
feat: add Dutch text normalization with number conversion, clitic expansion, and colloquial replacements#14egenthon-cmd wants to merge 5 commits intomainfrom
Conversation
📝 WalkthroughWalkthroughAdds a new Dutch language package to normalization: registers Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (5)
normalization/languages/dutch/operators.py (1)
108-111: Consider moving import to module level.The lazy import inside
get_word_replacements()works but is unconventional. If there's no circular import issue, consider moving it to the top of the file for clarity.♻️ Suggested refactor
At module level (after line 11):
from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTSThen simplify the method:
def get_word_replacements(self) -> dict[str, str]: - from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS - return DUTCH_REPLACEMENTS🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/operators.py` around lines 108 - 111, The get_word_replacements method currently does a lazy import of DUTCH_REPLACEMENTS inside the function; if there is no circular-import constraint, move the import "from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS" to the module top (after existing imports) and then simplify get_word_replacements to just return DUTCH_REPLACEMENTS directly, keeping the method name and signature unchanged.normalization/languages/dutch/number_normalizer.py (2)
451-455: Minor: redundant None check.The
output()function always returns astr, soword is not Noneis always true. This filter is harmless but unnecessary.♻️ Suggested simplification
def __call__(self, s: str) -> str: s = self.preprocess(s) - s = " ".join(word for word in self.process_words(s.split()) if word is not None) + s = " ".join(self.process_words(s.split())) s = self.postprocess(s) return s🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 451 - 455, In __call__, the comprehension filters out None unnecessarily because process_words returns strings; remove the "if word is not None" check and simply join the iterator returned by self.process_words(s.split()) (e.g., s = " ".join(self.process_words(s.split()))), keeping preprocess and postprocess calls and the same function name __call__ to locate the change.
5-13: Move module docstring before imports.The module docstring should be at the very top of the file (before imports) to be recognized as the module's
__doc__attribute.♻️ Suggested fix
+""" +Dutch number normalizer: spelled-out numbers to digits. + +- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21). +- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ... +- Multipliers: honderd, duizend, miljoen, miljard, biljoen. +- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma). +- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros"). +""" + import re from fractions import Fraction from typing import Iterator, Match - -""" -Dutch number normalizer: spelled-out numbers to digits. -... -"""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 5 - 13, The module docstring is placed after imports; move the existing triple-quoted Dutch number normalizer docstring to the very top of the file so it becomes the module's __doc__ (ensure it sits before any import statements), preserving its exact content and formatting; no other changes to functions like any normalization functions or class names in number_normalizer.py are required.tests/unit/languages/dutch_operators_test.py (1)
54-62: Consider adding edge case tests for number normalization.The current tests cover basic euro and dollar cases. Consider adding tests for:
- Negative amounts (
min tien euro)- Decimal amounts with
komma(tien komma vijf euro)- Combined amounts with cents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/languages/dutch_operators_test.py` around lines 54 - 62, Add edge-case unit tests for negative and decimal number normalization by extending the existing test functions (or adding new ones) that call operators.expand_written_numbers: include cases like "min tien euro" -> "-10 euros", "tien komma vijf euro" -> "10.5 euros", and combined cents such as "tien euro en vijf cent" -> "10.05 euros" (and their symbol forms like "-€10", "€10,50"). Ensure you assert the expected normalized English-style outputs for each input.normalization/languages/dutch/sentence_replacements.py (1)
4-5: Verify intent: English phrases in Dutch sentence replacements."fifty fifty" and "fiftyfifty" are English phrases. If these are intentional for handling code-switching or borrowed expressions in Dutch/Flemish contexts, consider adding a brief comment explaining this. Otherwise, these may belong in the English module or be removed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/sentence_replacements.py` around lines 4 - 5, The two mapping entries "fifty fifty" and "fiftyfifty" in sentence_replacements.py appear to be English phrases; either move these entries into the English sentence replacements module or, if they are intentionally kept here to handle Dutch/Flemish code-switching, add a brief inline comment above the mappings explaining that intent (e.g., "handle common English code-switching in Dutch/Flemish input"); alternatively remove them if they are not needed. Ensure the change references the exact keys "fifty fifty" and "fiftyfifty" so reviewers can verify the adjustment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/dutch/number_normalizer.py`:
- Around line 214-215: The current check that raises ValueError when
to_fraction() returns None (if f is None) will crash the pipeline; instead, in
the function that calls to_fraction() (look for the to_fraction() call and the
variable f in number_normalizer.py) return/yield the original token unchanged
and optionally emit a debug/warn log about the failed conversion to avoid
interrupting normalization; replace the raise ValueError branch with a non-fatal
fallback that preserves the input token and logs the failure so processing
continues.
In `@normalization/languages/dutch/operators.py`:
- Line 11: The import in operators.py uses the misspelled module name
`registery`; change the import to the correct module `registry` (i.e., import
register_language from normalization.languages.registry) and update any other
occurrences in this file that reference `registery` to `registry` so the module
resolves correctly at runtime.
In `@tests/unit/languages/dutch_operators_test.py`:
- Line 4: The test imports the wrong module name: change the import from
normalization.languages.registery to normalization.languages.registry so
get_language_registry resolves correctly; update the import statement that
currently references "registery" (and any other occurrences in the file) to the
correctly spelled "registry" so the get_language_registry symbol is imported
from the real module.
---
Nitpick comments:
In `@normalization/languages/dutch/number_normalizer.py`:
- Around line 451-455: In __call__, the comprehension filters out None
unnecessarily because process_words returns strings; remove the "if word is not
None" check and simply join the iterator returned by
self.process_words(s.split()) (e.g., s = "
".join(self.process_words(s.split()))), keeping preprocess and postprocess calls
and the same function name __call__ to locate the change.
- Around line 5-13: The module docstring is placed after imports; move the
existing triple-quoted Dutch number normalizer docstring to the very top of the
file so it becomes the module's __doc__ (ensure it sits before any import
statements), preserving its exact content and formatting; no other changes to
functions like any normalization functions or class names in
number_normalizer.py are required.
In `@normalization/languages/dutch/operators.py`:
- Around line 108-111: The get_word_replacements method currently does a lazy
import of DUTCH_REPLACEMENTS inside the function; if there is no circular-import
constraint, move the import "from normalization.languages.dutch.replacements
import DUTCH_REPLACEMENTS" to the module top (after existing imports) and then
simplify get_word_replacements to just return DUTCH_REPLACEMENTS directly,
keeping the method name and signature unchanged.
In `@normalization/languages/dutch/sentence_replacements.py`:
- Around line 4-5: The two mapping entries "fifty fifty" and "fiftyfifty" in
sentence_replacements.py appear to be English phrases; either move these entries
into the English sentence replacements module or, if they are intentionally kept
here to handle Dutch/Flemish code-switching, add a brief inline comment above
the mappings explaining that intent (e.g., "handle common English code-switching
in Dutch/Flemish input"); alternatively remove them if they are not needed.
Ensure the change references the exact keys "fifty fifty" and "fiftyfifty" so
reviewers can verify the adjustment.
In `@tests/unit/languages/dutch_operators_test.py`:
- Around line 54-62: Add edge-case unit tests for negative and decimal number
normalization by extending the existing test functions (or adding new ones) that
call operators.expand_written_numbers: include cases like "min tien euro" ->
"-10 euros", "tien komma vijf euro" -> "10.5 euros", and combined cents such as
"tien euro en vijf cent" -> "10.05 euros" (and their symbol forms like "-€10",
"€10,50"). Ensure you assert the expected normalized English-style outputs for
each input.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fa4367d7-b17a-4ba2-9534-62f38e38cc82
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csvis excluded by!**/*.csv
📒 Files selected for processing (7)
normalization/languages/__init__.pynormalization/languages/dutch/__init__.pynormalization/languages/dutch/number_normalizer.pynormalization/languages/dutch/operators.pynormalization/languages/dutch/replacements.pynormalization/languages/dutch/sentence_replacements.pytests/unit/languages/dutch_operators_test.py
| if f is None: | ||
| raise ValueError("Converting the fraction failed") |
There was a problem hiding this comment.
Consider graceful handling instead of raising ValueError.
If to_fraction() returns None, raising ValueError will crash the normalization pipeline. Consider yielding the original token unchanged to maintain robustness against unexpected input.
🛡️ Proposed fix
if re.match(r"^\d+(\.\d+)?$", current_without_prefix):
f = to_fraction(current_without_prefix)
if f is None:
- raise ValueError("Converting the fraction failed")
+ # Could not parse as fraction; emit unchanged
+ if value is not None:
+ yield output(value)
+ if pending_ones is not None:
+ yield output(pending_ones)
+ yield output(current)
+ continue📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if f is None: | |
| raise ValueError("Converting the fraction failed") | |
| if f is None: | |
| # Could not parse as fraction; emit unchanged | |
| if value is not None: | |
| yield output(value) | |
| if pending_ones is not None: | |
| yield output(pending_ones) | |
| yield output(current) | |
| continue |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/dutch/number_normalizer.py` around lines 214 - 215,
The current check that raises ValueError when to_fraction() returns None (if f
is None) will crash the pipeline; instead, in the function that calls
to_fraction() (look for the to_fraction() call and the variable f in
number_normalizer.py) return/yield the original token unchanged and optionally
emit a debug/warn log about the failed conversion to avoid interrupting
normalization; replace the raise ValueError branch with a non-fatal fallback
that preserves the input token and logs the failure so processing continues.
| from normalization.languages.dutch.sentence_replacements import ( | ||
| DUTCH_SENTENCE_REPLACEMENTS, | ||
| ) | ||
| from normalization.languages.registery import register_language |
There was a problem hiding this comment.
Fix typo: registery → registry.
This is the root cause of the pipeline failure for this module.
🐛 Proposed fix
-from normalization.languages.registery import register_language
+from normalization.languages.registry import register_language📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from normalization.languages.registery import register_language | |
| from normalization.languages.registry import register_language |
🧰 Tools
🪛 GitHub Actions: CI
[error] 11-11: ImportError during test collection: ModuleNotFoundError: No module named 'normalization.languages.registery'. (from normalization.languages.dutch.operators import register_language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/dutch/operators.py` at line 11, The import in
operators.py uses the misspelled module name `registery`; change the import to
the correct module `registry` (i.e., import register_language from
normalization.languages.registry) and update any other occurrences in this file
that reference `registery` to `registry` so the module resolves correctly at
runtime.
| import pytest | ||
|
|
||
| from normalization.languages.dutch.operators import DutchOperators | ||
| from normalization.languages.registery import get_language_registry |
There was a problem hiding this comment.
Fix typo: registery → registry.
Same typo as in other files - this needs to be registry to match the actual module name.
🐛 Proposed fix
-from normalization.languages.registery import get_language_registry
+from normalization.languages.registry import get_language_registry📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from normalization.languages.registery import get_language_registry | |
| from normalization.languages.registry import get_language_registry |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/languages/dutch_operators_test.py` at line 4, The test imports the
wrong module name: change the import from normalization.languages.registery to
normalization.languages.registry so get_language_registry resolves correctly;
update the import statement that currently references "registery" (and any other
occurrences in the file) to the correctly spelled "registry" so the
get_language_registry symbol is imported from the real module.
What does this PR do?
This PR add the dutch language to normalization.
Type of change
languages/{lang}/)steps/text/orsteps/word/)presets/)Checklist
New language
languages/{lang}/withoperators.py,replacements.py,__init__.pyreplacements.py, not inline inoperators.py@register_languagelanguages/__init__.pytests/unit/languages/tests/e2e/files/New step
nameclass attribute is unique and matches the YAML key@register_stepsteps/text/__init__.pyorsteps/word/__init__.pyoperators.config.*, no hardcoded language-specific valuesif operators.config.field is None: return textsteps/text/placeholders.pyandpipeline/base.py'svalidate()is updatedtests/unit/steps/uv run scripts/generate_step_docs.pyto regeneratedocs/steps.mdPreset change
Tests
Summary by CodeRabbit
New Features
Tests