feat: add Dutch text normalization with number conversion, clitic expansion, and colloquial replacements #14

Open
egenthon-cmd wants to merge 5 commits into main from feat/dutch-normalizer

Conversation

@egenthon-cmd egenthon-cmd commented Apr 10, 2026

What does this PR do?

This PR adds the Dutch language to normalization.

Type of change

  • New language (languages/{lang}/)
  • New step (steps/text/ or steps/word/)
  • New preset version (presets/)
  • Bug fix
  • Refactor / internal cleanup
  • Docs / CI

Checklist

New language

  • Created languages/{lang}/ with operators.py, replacements.py, __init__.py
  • All word-level substitutions are in replacements.py, not inline in operators.py
  • Decorated operators class with @register_language
  • Added one import line to languages/__init__.py
  • Added unit tests in tests/unit/languages/
  • Added e2e test rows in tests/e2e/files/

New step

  • name class attribute is unique and matches the YAML key
  • Decorated with @register_step
  • Added one import line to steps/text/__init__.py or steps/word/__init__.py
  • Algorithm reads data from operators.config.*, no hardcoded language-specific values
  • Optional config fields are guarded with if operators.config.field is None: return text
  • Placeholder protect/restore pairs are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
  • Added unit tests in tests/unit/steps/
  • Added step name to relevant preset YAMLs (new preset file if existing presets are affected)
  • If the class docstring was added or changed, ran uv run scripts/generate_step_docs.py to regenerate docs/steps.md

Preset change

  • Existing preset files are not modified — new behavior uses a new preset version file

Tests

Summary by CodeRabbit

  • New Features

    • Dutch language added to normalization registry and UI-facing language list
    • Automatic conversion of Dutch written numbers into digit form, including euro/cents handling
    • Expansion of Dutch apostrophe clitics and temporal "'s" forms
    • Dutch-specific single-word and phrase replacements for colloquial/Flemish variants
  • Tests

    • Added unit tests covering Dutch normalization, contractions, number expansion, and replacements

coderabbitai bot commented Apr 10, 2026

Walkthrough

Adds a new Dutch language package to normalization: registers DutchOperators, adds Dutch number normalization, contraction expansion, token and sentence replacement dictionaries, exposes dutch from the languages package, and includes unit tests validating operators, replacements, and number/currency handling.
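
The clitic expansion mentioned in the walkthrough can be illustrated with a minimal sketch. This is not the PR's `DutchOperators.expand_contractions` implementation: the clitic table, the list of temporal words, the `'s` to `des` expansion, and the function name `expand_clitics` are all assumptions made for illustration.

```python
import re

# Apostrophe variants: straight, right curly, left curly, backtick.
_APOS = "['\u2019\u2018`]"

# Hypothetical clitic table, for illustration only.
CLITICS = {"t": "het", "k": "ik", "n": "een"}

# Hypothetical list of temporal words that can follow "'s".
_TEMPORAL_WORDS = "morgens|middags|avonds|nachts"

# "'s morgens" -> "des morgens"; the apostrophe must not follow a word character.
_TEMPORAL_RE = re.compile(rf"(?<!\w){_APOS}s\s+({_TEMPORAL_WORDS})\b", re.IGNORECASE)

# Standalone "'t", "'k", "'n". Plurals such as "auto's" never match
# because their apostrophe follows a letter.
_CLITIC_RE = re.compile(rf"(?<!\w){_APOS}([tkn])(?!\w)", re.IGNORECASE)


def expand_clitics(text: str) -> str:
    """Expand temporal "'s" forms first, then single-letter clitics."""
    text = _TEMPORAL_RE.sub(lambda m: f"des {m.group(1)}", text)
    return _CLITIC_RE.sub(lambda m: CLITICS[m.group(1).lower()], text)


print(expand_clitics("'t is 's morgens koud"))
```

Handling the temporal pattern before the single-letter clitics keeps the two rules from competing over the same apostrophe.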

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Module Export: `normalization/languages/__init__.py` | Import and export the new dutch submodule; update `__all__` to include dutch. |
| Dutch Package Init: `normalization/languages/dutch/__init__.py` | Expose DutchOperators and DUTCH_REPLACEMENTS via `__all__`. |
| Operators / Config: `normalization/languages/dutch/operators.py` | Add DutchOperators (registered language handler nl) and DUTCH_CONFIG; implements contraction expansion, delegates written-number expansion, and returns Dutch word/sentence replacements. |
| Number Normalizer: `normalization/languages/dutch/number_normalizer.py` | Add DutchNumberNormalizer with preprocess/process/postprocess stages to parse spelled-out Dutch numbers, handle multipliers, ordinals, decimals, currency prefixes/suffixes, and produce digit-normalized output. |
| Replacements: `normalization/languages/dutch/replacements.py`, `normalization/languages/dutch/sentence_replacements.py` | Add DUTCH_REPLACEMENTS (token-level mappings) and DUTCH_SENTENCE_REPLACEMENTS (phrase-level mappings) for dialect, clitics, and orthographic variants. |
| Tests: `tests/unit/languages/dutch_operators_test.py` | Add unit tests verifying language registration, contraction expansion (multiple apostrophe variants), word/sentence replacements, and written-number / currency normalization cases. |
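
To make the number-normalizer row concrete, here is a minimal sketch of the Dutch compound order (ones + "en" + tens). It covers only spaced two-digit compounds and bare tens; it is not the PR's `DutchNumberNormalizer`, and the function name `convert_compound_numbers` and both lookup tables are assumptions for illustration.

```python
# Hypothetical lookup tables; the PR's vocabulary is larger.
ONES = {
    "een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5,
    "zes": 6, "zeven": 7, "acht": 8, "negen": 9,
}
TENS = {
    "twintig": 20, "dertig": 30, "veertig": 40, "vijftig": 50,
    "zestig": 60, "zeventig": 70, "tachtig": 80, "negentig": 90,
}


def convert_compound_numbers(text: str) -> str:
    """Rewrite "<ones> en <tens>" and bare tens as digits; leave other words alone."""
    words = text.split()
    out: list[str] = []
    i = 0
    while i < len(words):
        word = words[i]
        is_compound = (
            word in ONES
            and i + 2 < len(words)
            and words[i + 1] == "en"
            and words[i + 2] in TENS
        )
        if is_compound:
            # Dutch says the ones first: "een en twintig" is 1 + 20.
            out.append(str(TENS[words[i + 2]] + ONES[word]))
            i += 3
        elif word in TENS:
            out.append(str(TENS[word]))
            i += 1
        else:
            # Bare ones are left untouched here because "een" is also the article "a".
            out.append(word)
            i += 1
    return " ".join(out)


print(convert_compound_numbers("ik ben vijf en dertig jaar"))
```

Leaving bare ones unconverted is a simplification of this sketch, not a statement about how the PR treats them.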

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client as Client
participant Ops as DutchOperators
participant Num as DutchNumberNormalizer
participant Repl as Replacements
Client->>Ops: call expand_written_numbers(text) / expand_contractions(text)
Ops->>Num: preprocess and parse written numbers
Num-->>Ops: normalized numeric tokens
Ops->>Repl: get_word_replacements() / sentence_replacements
Ops-->>Client: final normalized text

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through words and numbers bright,
Turned "een en twintig" into digits right,
Split clitics, mended cents and euro song,
Dutch now marches tidy, swift, and strong!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description addresses the template requirements: it states the purpose, marks the 'New language' type as selected, and checks all relevant checklist items for adding a new language (operators.py/replacements.py/`__init__.py` creation, word substitutions in replacements.py, @register_language decorator, import line added, unit tests, and e2e tests). However, the 'Tests' section lacks detail about actual test execution. |
| Title check | ✅ Passed | The title accurately and specifically summarizes the main feature added in the changeset: Dutch language support with number conversion, clitic expansion, and colloquial replacements. |

@egenthon-cmd egenthon-cmd requested a review from Karamouche April 10, 2026 08:36

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (5)
normalization/languages/dutch/operators.py (1)

108-111: Consider moving import to module level.

The lazy import inside get_word_replacements() works but is unconventional. If there's no circular import issue, consider moving it to the top of the file for clarity.

♻️ Suggested refactor

At module level (after line 11):

from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS

Then simplify the method:

     def get_word_replacements(self) -> dict[str, str]:
-        from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
-
         return DUTCH_REPLACEMENTS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/operators.py` around lines 108 - 111, The
get_word_replacements method currently does a lazy import of DUTCH_REPLACEMENTS
inside the function; if there is no circular-import constraint, move the import
"from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS" to
the module top (after existing imports) and then simplify get_word_replacements
to just return DUTCH_REPLACEMENTS directly, keeping the method name and
signature unchanged.
normalization/languages/dutch/number_normalizer.py (2)

451-455: Minor: redundant None check.

The output() function always returns a str, so word is not None is always true. This filter is harmless but unnecessary.

♻️ Suggested simplification
     def __call__(self, s: str) -> str:
         s = self.preprocess(s)
-        s = " ".join(word for word in self.process_words(s.split()) if word is not None)
+        s = " ".join(self.process_words(s.split()))
         s = self.postprocess(s)
         return s
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 451 - 455,
In __call__, the comprehension filters out None unnecessarily because
process_words returns strings; remove the "if word is not None" check and simply
join the iterator returned by self.process_words(s.split()) (e.g., s = "
".join(self.process_words(s.split()))), keeping preprocess and postprocess calls
and the same function name __call__ to locate the change.

5-13: Move module docstring before imports.

The module docstring should be at the very top of the file (before imports) to be recognized as the module's __doc__ attribute.

♻️ Suggested fix
+"""
+Dutch number normalizer: spelled-out numbers to digits.
+
+- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
+- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
+- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
+- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
+- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
+"""
+
 import re
 from fractions import Fraction
 from typing import Iterator, Match
-
-"""
-Dutch number normalizer: spelled-out numbers to digits.
-...
-"""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 5 - 13, The
module docstring is placed after imports; move the existing triple-quoted Dutch
number normalizer docstring to the very top of the file so it becomes the
module's __doc__ (ensure it sits before any import statements), preserving its
exact content and formatting; no other changes to functions like any
normalization functions or class names in number_normalizer.py are required.
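
The effect of docstring placement can be checked with the standard-library `ast` module. The sketch below parses two tiny source strings (both invented for illustration, not taken from the PR) and shows that only a string literal in first position is treated as the module docstring.

```python
import ast
from typing import Optional

# Two hypothetical module sources that differ only in statement order.
DOCSTRING_FIRST = '"""Dutch number normalizer."""\nimport re\n'
DOCSTRING_AFTER_IMPORTS = 'import re\n"""Dutch number normalizer."""\n'


def module_doc(source: str) -> Optional[str]:
    """Return the module docstring of the given source, or None if it has none."""
    return ast.get_docstring(ast.parse(source))


print(module_doc(DOCSTRING_FIRST))          # the docstring text
print(module_doc(DOCSTRING_AFTER_IMPORTS))  # None: the string is a bare expression
```

A string placed after the imports is still valid Python, but it is evaluated and discarded, so docstring-coverage tools and `help()` will not see it.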
tests/unit/languages/dutch_operators_test.py (1)

54-62: Consider adding edge case tests for number normalization.

The current tests cover basic euro and dollar cases. Consider adding tests for:

  • Negative amounts (min tien euro)
  • Decimal amounts with komma (tien komma vijf euro)
  • Combined amounts with cents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/dutch_operators_test.py` around lines 54 - 62, Add
edge-case unit tests for negative and decimal number normalization by extending
the existing test functions (or adding new ones) that call
operators.expand_written_numbers: include cases like "min tien euro" -> "-10
euros", "tien komma vijf euro" -> "10.5 euros", and combined cents such as "tien
euro en vijf cent" -> "10.05 euros" (and their symbol forms like "-€10",
"€10,50"). Ensure you assert the expected normalized English-style outputs for
each input.
normalization/languages/dutch/sentence_replacements.py (1)

4-5: Verify intent: English phrases in Dutch sentence replacements.

"fifty fifty" and "fiftyfifty" are English phrases. If these are intentional for handling code-switching or borrowed expressions in Dutch/Flemish contexts, consider adding a brief comment explaining this. Otherwise, these may belong in the English module or be removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/sentence_replacements.py` around lines 4 - 5,
The two mapping entries "fifty fifty" and "fiftyfifty" in
sentence_replacements.py appear to be English phrases; either move these entries
into the English sentence replacements module or, if they are intentionally kept
here to handle Dutch/Flemish code-switching, add a brief inline comment above
the mappings explaining that intent (e.g., "handle common English code-switching
in Dutch/Flemish input"); alternatively remove them if they are not needed.
Ensure the change references the exact keys "fifty fifty" and "fiftyfifty" so
reviewers can verify the adjustment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa4367d7-b17a-4ba2-9534-62f38e38cc82

📥 Commits

Reviewing files that changed from the base of the PR and between 8974532 and a066eb6.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (7)
  • normalization/languages/__init__.py
  • normalization/languages/dutch/__init__.py
  • normalization/languages/dutch/number_normalizer.py
  • normalization/languages/dutch/operators.py
  • normalization/languages/dutch/replacements.py
  • normalization/languages/dutch/sentence_replacements.py
  • tests/unit/languages/dutch_operators_test.py

Comment on lines +214 to +215
if f is None:
raise ValueError("Converting the fraction failed")

⚠️ Potential issue | 🟡 Minor

Consider graceful handling instead of raising ValueError.

If to_fraction() returns None, raising ValueError will crash the normalization pipeline. Consider yielding the original token unchanged to maintain robustness against unexpected input.

🛡️ Proposed fix
             if re.match(r"^\d+(\.\d+)?$", current_without_prefix):
                 f = to_fraction(current_without_prefix)
                 if f is None:
-                    raise ValueError("Converting the fraction failed")
+                    # Could not parse as fraction; emit unchanged
+                    if value is not None:
+                        yield output(value)
+                    if pending_ones is not None:
+                        yield output(pending_ones)
+                    yield output(current)
+                    continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 214 - 215,
The current check that raises ValueError when to_fraction() returns None (if f
is None) will crash the pipeline; instead, in the function that calls
to_fraction() (look for the to_fraction() call and the variable f in
number_normalizer.py) return/yield the original token unchanged and optionally
emit a debug/warn log about the failed conversion to avoid interrupting
normalization; replace the raise ValueError branch with a non-fatal fallback
that preserves the input token and logs the failure so processing continues.
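
A non-fatal fallback of this kind might look like the following sketch. It is a simplified stand-in, not the PR's code: `to_fraction` here is a hypothetical helper built on `fractions.Fraction`, and `normalize_numeric_tokens` is an invented name that omits the `value` and `pending_ones` state from the real generator.

```python
import logging
from fractions import Fraction
from typing import Iterable, Iterator, Optional

logger = logging.getLogger(__name__)


def to_fraction(s: str) -> Optional[Fraction]:
    """Hypothetical helper: parse a numeric string, returning None on failure."""
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def normalize_numeric_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield normalized numbers; pass unparseable tokens through unchanged."""
    for token in tokens:
        f = to_fraction(token)
        if f is None:
            # Log and emit the original token instead of raising.
            logger.debug("Could not convert %r to a fraction; leaving it as-is", token)
            yield token
            continue
        yield str(f.numerator) if f.denominator == 1 else str(float(f))


print(list(normalize_numeric_tokens(["10.0", "1/2", "tien"])))
```

The trade-off is that malformed input no longer surfaces as an exception, so the debug log becomes the only signal that a token was skipped.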

from normalization.languages.dutch.sentence_replacements import (
DUTCH_SENTENCE_REPLACEMENTS,
)
from normalization.languages.registery import register_language

⚠️ Potential issue | 🔴 Critical

Fix typo: registery → registry.

This is the root cause of the pipeline failure for this module.

🐛 Proposed fix
-from normalization.languages.registery import register_language
+from normalization.languages.registry import register_language
🧰 Tools
🪛 GitHub Actions: CI

[error] 11-11: ImportError during test collection: ModuleNotFoundError: No module named 'normalization.languages.registery'. (from normalization.languages.dutch.operators import register_language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/operators.py` at line 11, The import in
operators.py uses the misspelled module name `registery`; change the import to
the correct module `registry` (i.e., import register_language from
normalization.languages.registry) and update any other occurrences in this file
that reference `registery` to `registry` so the module resolves correctly at
runtime.

import pytest

from normalization.languages.dutch.operators import DutchOperators
from normalization.languages.registery import get_language_registry

⚠️ Potential issue | 🔴 Critical

Fix typo: registery → registry.

Same typo as in other files - this needs to be registry to match the actual module name.

🐛 Proposed fix
-from normalization.languages.registery import get_language_registry
+from normalization.languages.registry import get_language_registry
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/dutch_operators_test.py` at line 4, The test imports the
wrong module name: change the import from normalization.languages.registery to
normalization.languages.registry so get_language_registry resolves correctly;
update the import statement that currently references "registery" (and any other
occurrences in the file) to the correctly spelled "registry" so the
get_language_registry symbol is imported from the real module.

@egenthon-cmd egenthon-cmd changed the title from "feat: add dutch language for normalizer" to "feat: add Dutch text normalization with number conversion, clitic expansion, and colloquial replacements" Apr 10, 2026