feat: add Dutch text normalization with number conversion, clitic expansion, and colloquial replacements by egenthon-cmd · Pull Request #14 · gladiaio/normalization

egenthon-cmd · 2026-04-10T08:36:15Z

What does this PR do?

This PR adds the Dutch language to normalization.

Type of change

Checklist

New language

Created languages/{lang}/ with operators.py, replacements.py, __init__.py
All word-level substitutions are in replacements.py, not inline in operators.py
Decorated operators class with @register_language
Added one import line to languages/__init__.py
Added unit tests in tests/unit/languages/
Added e2e test rows in tests/e2e/files/
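
The checklist items above rely on a decorator-based registry. The sketch below only illustrates that general pattern. `register_language`, `get_language_registry`, `DutchOperators`, and the `nl` code are named in this PR, but their real signatures are not shown on this page, so the decorator argument and the dict-backed store are assumptions.

```python
from typing import Callable, Dict, Type

# Hypothetical in-memory store; the real registry lives in
# normalization.languages.registry and may be implemented differently.
_REGISTRY: Dict[str, Type] = {}


def register_language(code: str) -> Callable[[Type], Type]:
    """Class decorator that records an operators class under a language code."""

    def decorator(cls: Type) -> Type:
        if code in _REGISTRY:
            raise ValueError(f"language {code!r} is already registered")
        _REGISTRY[code] = cls
        return cls

    return decorator


def get_language_registry() -> Dict[str, Type]:
    """Return a copy so callers cannot mutate the registry by accident."""
    return dict(_REGISTRY)


@register_language("nl")
class DutchOperators:
    """Stand-in for the PR's class; only the registration is illustrated."""
```

With this shape, importing the language module is what triggers registration, which is why the checklist asks for one import line in languages/__init__.py.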

New step

name class attribute is unique and matches the YAML key
Decorated with @register_step
Added one import line to steps/text/__init__.py or steps/word/__init__.py
Algorithm reads data from operators.config.*, no hardcoded language-specific values
Optional config fields are guarded with if operators.config.field is None: return text
Placeholder protect/restore pairs are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
Added unit tests in tests/unit/steps/
Added step name to relevant preset YAMLs (new preset file if existing presets are affected)
If the class docstring was added or changed, ran uv run scripts/generate_step_docs.py to regenerate docs/steps.md
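
The guard for optional config fields mentioned in the checklist can be sketched as follows. Everything here is hypothetical: `LanguageConfig`, `Operators`, the `filler_words` field, and the `remove_fillers` step name are invented for illustration and are not taken from the repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LanguageConfig:
    # Hypothetical optional field; None means the language defines no data.
    filler_words: Optional[List[str]] = None


@dataclass
class Operators:
    config: LanguageConfig


class RemoveFillersStep:
    # Hypothetical step; `name` would match the key used in a preset YAML.
    name = "remove_fillers"

    def __call__(self, text: str, operators: Operators) -> str:
        fillers = operators.config.filler_words
        if fillers is None:
            # Guard: the language has no data for this step, so do nothing.
            return text
        # The data comes from the config, not from hardcoded values.
        return " ".join(w for w in text.split() if w not in fillers)


step = RemoveFillersStep()
unchanged = step("uh ik weet het", Operators(LanguageConfig()))
cleaned = step("uh ik weet het", Operators(LanguageConfig(filler_words=["uh"])))
```

The early return keeps the step a no-op for languages that do not supply the field, so a shared preset can list the step without breaking those languages.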

Preset change

Existing preset files are not modified — new behavior uses a new preset version file

Tests

Summary by CodeRabbit

New Features
- Dutch language added to normalization registry and UI-facing language list
- Automatic conversion of Dutch written numbers into digit form, including euro/cents handling
- Expansion of Dutch apostrophe clitics and temporal "'s" forms
- Dutch-specific single-word and phrase replacements for colloquial/Flemish variants
Tests
- Added unit tests covering Dutch normalization, contractions, number expansion, and replacements
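
To make the clitic expansion concrete, here is a minimal sketch of one way to do it. It is not the PR's implementation: the function name `expand_clitics`, the apostrophe list, and the clitic table are assumptions chosen for illustration, and the table covers only a few common forms.

```python
import re

# Apostrophe look-alikes that often appear in transcripts (assumed list).
_APOSTROPHES = "\u2019\u2018\u02bc`\u00b4"
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in _APOSTROPHES})

# Illustrative subset of Dutch clitics; not the PR's actual table.
_CLITICS = {
    "'t": "het",
    "'k": "ik",
    "'n": "een",
    "m'n": "mijn",
    "z'n": "zijn",
}

# Match a clitic only as a whole whitespace-delimited token.
_PATTERN = re.compile(
    r"(?<!\S)(" + "|".join(re.escape(k) for k in _CLITICS) + r")(?!\S)",
    re.IGNORECASE,
)


def expand_clitics(text: str) -> str:
    """Unify apostrophe variants, then expand known clitics."""
    text = text.translate(_APOSTROPHE_TABLE)
    return _PATTERN.sub(lambda m: _CLITICS[m.group(1).lower()], text)
```

Normalizing the apostrophe first keeps the lookup table small. A token with trailing punctuation (for example `'t.`) is left alone by this sketch, so a real pipeline would need to run it after punctuation handling or widen the boundaries.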

coderabbitai · 2026-04-10T08:36:32Z

📝 Walkthrough

Walkthrough

Adds a new Dutch language package to normalization: registers DutchOperators, adds Dutch number normalization, contraction expansion, token and sentence replacement dictionaries, exposes dutch from the languages package, and includes unit tests validating operators, replacements, and number/currency handling.

Changes

Cohort / File(s)	Summary
Module Export `normalization/languages/__init__.py`	Import and export the new `dutch` submodule; update `__all__` to include `dutch`.
Dutch Package Init `normalization/languages/dutch/__init__.py`	Expose `DutchOperators` and `DUTCH_REPLACEMENTS` via `__all__`.
Operators / Config `normalization/languages/dutch/operators.py`	Add `DutchOperators` (registered language handler `nl`) and `DUTCH_CONFIG`; implements contraction expansion, delegates written-number expansion, and returns Dutch word/sentence replacements.
Number Normalizer `normalization/languages/dutch/number_normalizer.py`	Add `DutchNumberNormalizer` with preprocess/process/postprocess stages to parse spelled-out Dutch numbers, handle multipliers, ordinals, decimals, currency prefixes/suffixes, and produce digit-normalized output.
Replacements `normalization/languages/dutch/replacements.py`, `normalization/languages/dutch/sentence_replacements.py`	Add `DUTCH_REPLACEMENTS` (token-level mappings) and `DUTCH_SENTENCE_REPLACEMENTS` (phrase-level mappings) for dialect, clitics, and orthographic variants.
Tests `tests/unit/languages/dutch_operators_test.py`	Add unit tests verifying language registration, contraction expansion (multiple apostrophe variants), word/sentence replacements, and written-number / currency normalization cases.
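
The Number Normalizer row is the densest part of the change. Dutch compounds put the ones before the tens, joined by "en" (the module docstring quoted later in the review gives "een en twintig" -> 21). The toy sketch below handles only that pattern for 21 to 99; `convert_compounds` and its tables are hypothetical and much simpler than the PR's `DutchNumberNormalizer`.

```python
from typing import Iterator, List

ONES = {"een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5,
        "zes": 6, "zeven": 7, "acht": 8, "negen": 9}
TENS = {"twintig": 20, "dertig": 30, "veertig": 40, "vijftig": 50,
        "zestig": 60, "zeventig": 70, "tachtig": 80, "negentig": 90}


def convert_compounds(words: List[str]) -> Iterator[str]:
    """Yield tokens, replacing <ones> "en" <tens> and bare tens with digits."""
    i = 0
    while i < len(words):
        w = words[i].lower()
        # Pattern: <ones> "en" <tens>, e.g. "een en twintig" -> 21
        if (w in ONES and i + 2 < len(words)
                and words[i + 1].lower() == "en"
                and words[i + 2].lower() in TENS):
            yield str(TENS[words[i + 2].lower()] + ONES[w])
            i += 3
        elif w in TENS:
            yield str(TENS[w])
            i += 1
        else:
            # Bare ones are left alone: "een" is also the article "a".
            yield words[i]
            i += 1
```

Leaving a bare "een" untouched is a deliberate choice in this sketch, since the same word is the indefinite article; a full normalizer needs more context to decide.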

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client as Client
participant Ops as DutchOperators
participant Num as DutchNumberNormalizer
participant Repl as Replacements
Client->>Ops: call expand_written_numbers(text) / expand_contractions(text)
Ops->>Num: preprocess and parse written numbers
Num-->>Ops: normalized numeric tokens
Ops->>Repl: get_word_replacements() / sentence_replacements
Ops-->>Client: final normalized text

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through words and numbers bright,
Turned "een en twintig" into digits right,
Split clitics, mended cents and euro song,
Dutch now marches tidy, swift, and strong!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description check	✅ Passed	The PR description addresses the template requirements: it states the purpose, marks the 'New language' type as selected, and checks all relevant checklist items for adding a new language (operators/replacements/init.py creation, word substitutions in replacements.py, `@register_language` decorator, import line added, unit tests, and e2e tests). However, the 'Tests' section lacks detail about actual test execution.
Title check	✅ Passed	The title accurately and specifically summarizes the main feature added in the changeset: Dutch language support with number conversion, clitic expansion, and colloquial replacements.


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (5)

normalization/languages/dutch/operators.py (1)

108-111: Consider moving import to module level.

The lazy import inside get_word_replacements() works but is unconventional. If there's no circular import issue, consider moving it to the top of the file for clarity.

♻️ Suggested refactor

At module level (after line 11):

from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS

Then simplify the method:

     def get_word_replacements(self) -> dict[str, str]:
-        from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
-
         return DUTCH_REPLACEMENTS

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/operators.py` around lines 108 - 111, The
get_word_replacements method currently does a lazy import of DUTCH_REPLACEMENTS
inside the function; if there is no circular-import constraint, move the import
"from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS" to
the module top (after existing imports) and then simplify get_word_replacements
to just return DUTCH_REPLACEMENTS directly, keeping the method name and
signature unchanged.

normalization/languages/dutch/number_normalizer.py (2)

451-455: Minor: redundant None check.

The output() function always returns a str, so word is not None is always true. This filter is harmless but unnecessary.

♻️ Suggested simplification

     def __call__(self, s: str) -> str:
         s = self.preprocess(s)
-        s = " ".join(word for word in self.process_words(s.split()) if word is not None)
+        s = " ".join(self.process_words(s.split()))
         s = self.postprocess(s)
         return s

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 451 - 455,
In __call__, the comprehension filters out None unnecessarily because
process_words returns strings; remove the "if word is not None" check and simply
join the iterator returned by self.process_words(s.split()) (e.g., s = "
".join(self.process_words(s.split()))), keeping preprocess and postprocess calls
and the same function name __call__ to locate the change.

5-13: Move module docstring before imports.

The module docstring should be at the very top of the file (before imports) to be recognized as the module's __doc__ attribute.

♻️ Suggested fix

+"""
+Dutch number normalizer: spelled-out numbers to digits.
+
+- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
+- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
+- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
+- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
+- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
+"""
+
 import re
 from fractions import Fraction
 from typing import Iterator, Match
-
-"""
-Dutch number normalizer: spelled-out numbers to digits.
-...
-"""

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 5 - 13, The
module docstring is placed after imports; move the existing triple-quoted Dutch
number normalizer docstring to the very top of the file so it becomes the
module's __doc__ (ensure it sits before any import statements), preserving its
exact content and formatting; no other changes to functions like any
normalization functions or class names in number_normalizer.py are required.
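
The effect described in this comment can be checked with the standard library. This sketch uses `ast.get_docstring` on two tiny source strings (both invented for the demonstration) to show that a string literal placed after the imports is not treated as the module docstring.

```python
import ast

# Two hypothetical module bodies that differ only in docstring placement.
DOCSTRING_FIRST = '"""Dutch number normalizer."""\nimport re\n'
DOCSTRING_AFTER_IMPORTS = 'import re\n\n"""Dutch number normalizer."""\n'


def module_doc(source: str):
    """Return the docstring Python would assign to the module's __doc__."""
    return ast.get_docstring(ast.parse(source))


first = module_doc(DOCSTRING_FIRST)
late = module_doc(DOCSTRING_AFTER_IMPORTS)
```

Here `first` holds the docstring text while `late` is `None`, because only a string literal that is the first statement of the module counts.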

tests/unit/languages/dutch_operators_test.py (1)

54-62: Consider adding edge case tests for number normalization.

The current tests cover basic euro and dollar cases. Consider adding tests for:

- Negative amounts (min tien euro)
- Decimal amounts with komma (tien komma vijf euro)
- Combined amounts with cents

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/dutch_operators_test.py` around lines 54 - 62, Add
edge-case unit tests for negative and decimal number normalization by extending
the existing test functions (or adding new ones) that call
operators.expand_written_numbers: include cases like "min tien euro" -> "-10
euros", "tien komma vijf euro" -> "10.5 euros", and combined cents such as "tien
euro en vijf cent" -> "10.05 euros" (and their symbol forms like "-€10",
"€10,50"). Ensure you assert the expected normalized English-style outputs for
each input.

normalization/languages/dutch/sentence_replacements.py (1)

4-5: Verify intent: English phrases in Dutch sentence replacements.

"fifty fifty" and "fiftyfifty" are English phrases. If these are intentional for handling code-switching or borrowed expressions in Dutch/Flemish contexts, consider adding a brief comment explaining this. Otherwise, these may belong in the English module or be removed.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/sentence_replacements.py` around lines 4 - 5,
The two mapping entries "fifty fifty" and "fiftyfifty" in
sentence_replacements.py appear to be English phrases; either move these entries
into the English sentence replacements module or, if they are intentionally kept
here to handle Dutch/Flemish code-switching, add a brief inline comment above
the mappings explaining that intent (e.g., "handle common English code-switching
in Dutch/Flemish input"); alternatively remove them if they are not needed.
Ensure the change references the exact keys "fifty fifty" and "fiftyfifty" so
reviewers can verify the adjustment.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa4367d7-b17a-4ba2-9534-62f38e38cc82

📥 Commits

Reviewing files that changed from the base of the PR and between 8974532 and a066eb6.

⛔ Files ignored due to path filters (1)

tests/e2e/files/gladia-3.csv is excluded by !**/*.csv

📒 Files selected for processing (7)

normalization/languages/__init__.py
normalization/languages/dutch/__init__.py
normalization/languages/dutch/number_normalizer.py
normalization/languages/dutch/operators.py
normalization/languages/dutch/replacements.py
normalization/languages/dutch/sentence_replacements.py
tests/unit/languages/dutch_operators_test.py

coderabbitai · 2026-04-10T08:40:32Z

normalization/languages/dutch/number_normalizer.py

+                if f is None:
+                    raise ValueError("Converting the fraction failed")


⚠️ Potential issue | 🟡 Minor

Consider graceful handling instead of raising ValueError.

If to_fraction() returns None, raising ValueError will crash the normalization pipeline. Consider yielding the original token unchanged to maintain robustness against unexpected input.

🛡️ Proposed fix

 if re.match(r"^\d+(\.\d+)?$", current_without_prefix):
     f = to_fraction(current_without_prefix)
     if f is None:
-        raise ValueError("Converting the fraction failed")
+        # Could not parse as fraction; emit unchanged
+        if value is not None:
+            yield output(value)
+        if pending_ones is not None:
+            yield output(pending_ones)
+        yield output(current)
+        continue


🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 214 - 215, The current check that raises ValueError when to_fraction() returns None (if f is None) will crash the pipeline; instead, in the function that calls to_fraction() (look for the to_fraction() call and the variable f in number_normalizer.py) return/yield the original token unchanged and optionally emit a debug/warn log about the failed conversion to avoid interrupting normalization; replace the raise ValueError branch with a non-fatal fallback that preserves the input token and logs the failure so processing continues.
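
The non-fatal fallback suggested here can be sketched in isolation. The code below is a simplified stand-in, not the PR's `process_words`: `to_fraction` is reimplemented hypothetically on top of `fractions.Fraction`, and `process_tokens` exists only for this illustration.

```python
import logging
import re
from fractions import Fraction
from typing import Iterable, Iterator, Optional

logger = logging.getLogger(__name__)


def to_fraction(s: str) -> Optional[Fraction]:
    """Hypothetical helper: return None instead of raising on bad input."""
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def process_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Normalize numeric-looking tokens; pass everything else through."""
    for token in tokens:
        if re.match(r"^\d", token):
            f = to_fraction(token)
            if f is None:
                # Non-fatal fallback: log, keep the token, and carry on.
                logger.debug("could not parse %r as a number", token)
                yield token
                continue
            yield f"{float(f):g}"
        else:
            yield token


result = list(process_tokens(["10.50", "1.2.3", "euro"]))
```

The malformed token `1.2.3` passes through unchanged instead of aborting the whole generator, which is the robustness property the comment asks for. Logging at debug level keeps the failure visible without adding noise to normal runs.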

coderabbitai · 2026-04-10T08:40:32Z

normalization/languages/dutch/operators.py

+from normalization.languages.dutch.sentence_replacements import (
+    DUTCH_SENTENCE_REPLACEMENTS,
+)
+from normalization.languages.registery import register_language


⚠️ Potential issue | 🔴 Critical

Fix typo: registery → registry.

This is the root cause of the pipeline failure for this module.

🐛 Proposed fix

-from normalization.languages.registery import register_language
+from normalization.languages.registry import register_language


🧰 Tools

🪛 GitHub Actions: CI

[error] 11-11: ImportError during test collection: ModuleNotFoundError: No module named 'normalization.languages.registery'. (from normalization.languages.dutch.operators import register_language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/operators.py` at line 11, The import in operators.py uses the misspelled module name `registery`; change the import to the correct module `registry` (i.e., import register_language from normalization.languages.registry) and update any other occurrences in this file that reference `registery` to `registry` so the module resolves correctly at runtime.

coderabbitai · 2026-04-10T08:40:33Z

tests/unit/languages/dutch_operators_test.py

+import pytest
+
+from normalization.languages.dutch.operators import DutchOperators
+from normalization.languages.registery import get_language_registry


⚠️ Potential issue | 🔴 Critical

Fix typo: registery → registry.

Same typo as in other files - this needs to be registry to match the actual module name.

🐛 Proposed fix

-from normalization.languages.registery import get_language_registry
+from normalization.languages.registry import get_language_registry


🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unit/languages/dutch_operators_test.py` at line 4, The test imports the wrong module name: change the import from normalization.languages.registery to normalization.languages.registry so get_language_registry resolves correctly; update the import statement that currently references "registery" (and any other occurrences in the file) to the correctly spelled "registry" so the get_language_registry symbol is imported from the real module.

egenthon-cmd added 4 commits March 18, 2026 11:24

feat: number normalizer dutch language init

f46bbd2

feat: added replacements and operators

c645166

feat: extend dutch number normalizer and filler words

c4e4474

test: added basic operators tests

a066eb6

egenthon-cmd requested a review from Karamouche April 10, 2026 08:36

coderabbitai bot reviewed Apr 10, 2026

View reviewed changes

fix: ci pipeline try dutch normalizer

f6cf6af

egenthon-cmd changed the title ~~feat: add dutch language for normalizer~~ feat: add Dutch text normalization with number conversion, clitic expansion, and colloquial replacements Apr 10, 2026

egenthon-cmd wants to merge 5 commits into main from feat/dutch-normalizer
