feat: improve french text normalization with number conversion and contraction expansion #13

Open
egenthon-cmd wants to merge 9 commits into main from feat/french-normalizer

Conversation


@egenthon-cmd egenthon-cmd commented Apr 9, 2026

What does this PR do?

Enhances the French text normalization pipeline with richer, more consistent output:

Converts written numbers to digits
Expands apostrophe contractions
Adds word/phrase replacements for cleaner normalization

Also adds unit tests with a French operators fixture covering sentence-level replacements and contraction behavior, plus a runtime dependency for number-to-digit conversion.

Type of change

  • New language (languages/{lang}/)
  • New step (steps/text/ or steps/word/)
  • New preset version (presets/)
  • Bug fix
  • Refactor / internal cleanup
  • Docs / CI

Checklist

New language

  • Created languages/{lang}/ with operators.py, replacements.py, __init__.py
  • All word-level substitutions are in replacements.py, not inline in operators.py
  • Decorated operators class with @register_language
  • Added one import line to languages/__init__.py
  • Added unit tests in tests/unit/languages/
  • Added e2e test rows in tests/e2e/files/

New step

  • name class attribute is unique and matches the YAML key
  • Decorated with @register_step
  • Added one import line to steps/text/__init__.py or steps/word/__init__.py
  • Algorithm reads data from operators.config.*, no hardcoded language-specific values
  • Optional config fields are guarded with if operators.config.field is None: return text
  • Placeholder protect/restore pairs are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
  • Added unit tests in tests/unit/steps/
  • Added step name to relevant preset YAMLs (new preset file if existing presets are affected)
  • If the class docstring was added or changed, ran uv run scripts/generate_step_docs.py to regenerate docs/steps.md
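
The "New step" items above describe a pattern that can be sketched roughly as follows. This is a self-contained illustration, not the repository's code: `STEP_REGISTRY`, the local `register_step`, `Config`, `Operators`, and `ReplacePlusSignStep` are all hypothetical stand-ins, and only the `plus_word` field name comes from this PR.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical registry; the project's real @register_step may differ.
STEP_REGISTRY: dict[str, type] = {}


def register_step(cls: type) -> type:
    """Register a step class under its unique `name` attribute."""
    name = cls.name
    if name in STEP_REGISTRY:
        raise ValueError(f"duplicate step name: {name}")
    STEP_REGISTRY[name] = cls
    return cls


@dataclass
class Config:
    # Optional field: languages that do not define it leave it as None.
    plus_word: Optional[str] = None


@dataclass
class Operators:
    config: Config


@register_step
class ReplacePlusSignStep:
    # Must be unique and match the key used in the preset YAML.
    name = "replace_plus_sign"

    def __call__(self, text: str, operators: Operators) -> str:
        # Guard the optional config field instead of hardcoding a value.
        if operators.config.plus_word is None:
            return text
        word = operators.config.plus_word
        return re.sub(r"\s*\+\s*", f" {word} ", text)
```

With `Config(plus_word="plus")` the step rewrites `2 + 2` as `2 plus 2`; with the field left as `None` the text passes through unchanged.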

Preset change

  • Existing preset files are not modified — new behavior uses a new preset version file

Tests

Summary by CodeRabbit

  • New Features

    • Improved French text normalization: converts written numbers to digits, expands certain apostrophe contractions, and adds many word/phrase replacements for more consistent output.
  • Tests

    • Added unit tests and a French operators fixture covering sentence-level replacements and contraction behavior.
  • Dependencies

    • Added a runtime dependency for number-to-digit conversion and included preset data in package configuration.


coderabbitai bot commented Apr 9, 2026

Walkthrough

Adds French number normalization (new FrenchNumberNormalizer using text2num.alpha2digit), expands French replacement maps and operators (contraction and written-number handling), adds tests and a pyproject dependency and package-data entry.

Changes

| Cohort / File(s) | Summary |
|---|---|
| French normalization core: normalization/languages/french/number_normalizer.py, normalization/languages/french/replacements.py, normalization/languages/french/sentence_replacements.py | Adds FrenchNumberNormalizer (conditional import of alpha2digit; pre-pass rewrites single-digit + large-unit patterns like 3 milliards → trois milliards, then calls alpha2digit(text, "fr")). Populates FRENCH_REPLACEMENTS and adds FRENCH_SENTENCE_REPLACEMENTS. |
| French operators: normalization/languages/french/operators.py | FrenchOperators now constructs a FrenchNumberNormalizer. Adds expand_contractions(text), expand_written_numbers(text), and get_word_replacements(). Updates FRENCH_CONFIG with digit_words, sentence_replacements, number_words, and plus_word="plus". |
| Packaging / deps: pyproject.toml | Adds runtime dependency text2num>=3.0.0 and package-data declaration for normalization/presets/*.yaml; reflows dependency formatting. |
| Tests & fixtures: tests/unit/steps/text/conftest.py, tests/unit/steps/text/apply_sentence_level_replacements_test.py, tests/unit/steps/text/expand_contractions_test.py | Adds french_operators fixture and tests: sentence-level replacement ("pour 100 de réduction" → "pourcent de réduction"), contraction expansion tests for English and a French case that remains unchanged. |
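
As a rough illustration of the two replacement layers summarized in the table, the sketch below applies phrase-level and word-level lookups. The table contents are illustrative only: `"mme": "madame"` appears in the suggested diff later in this review, while the `"pour 100": "pourcent"` entry is an assumption inferred from the test description. The function names are hypothetical and are not the project's step implementations.

```python
import re

# Illustrative entries only; the real FRENCH_SENTENCE_REPLACEMENTS and
# FRENCH_REPLACEMENTS contents are not shown in this summary.
SENTENCE_REPLACEMENTS: dict[str, str] = {"pour 100": "pourcent"}
WORD_REPLACEMENTS: dict[str, str] = {"mme": "madame"}


def apply_sentence_replacements(text: str, table: dict[str, str]) -> str:
    """Replace whole phrases, longest key first, without touching substrings."""
    for phrase in sorted(table, key=len, reverse=True):
        # Lookarounds keep "pour 100" from matching inside "pour 1000".
        pattern = rf"(?<!\w){re.escape(phrase)}(?!\w)"
        text = re.sub(
            pattern, lambda _m, r=table[phrase]: r, text, flags=re.IGNORECASE
        )
    return text


def apply_word_replacements(text: str, table: dict[str, str]) -> str:
    """Replace individual whitespace-separated tokens via dictionary lookup."""
    return " ".join(table.get(tok.lower(), tok) for tok in text.split())
```

Keeping the data in plain dictionaries mirrors the checklist requirement that substitutions live in replacements.py rather than inline in operators.py.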

Sequence Diagram(s)

sequenceDiagram
    actor Caller
    rect rgba(100,150,240,0.5)
    participant FrenchOperators
    end
    rect rgba(100,200,120,0.5)
    participant FrenchNumberNormalizer
    end
    rect rgba(220,120,120,0.5)
    participant text2num
    end
    Caller->>FrenchOperators: process text
    FrenchOperators->>FrenchNumberNormalizer: expand_written_numbers(text)
    FrenchNumberNormalizer->>FrenchNumberNormalizer: pre-pass rewrite (e.g., "3 milliards" → "trois milliards")
    FrenchNumberNormalizer->>text2num: alpha2digit(text, "fr")
    text2num-->>FrenchNumberNormalizer: numericified text
    FrenchNumberNormalizer-->>FrenchOperators: normalized text
    FrenchOperators-->>Caller: final normalized text
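
The flow in the diagram can be sketched as below. This is a minimal approximation under stated assumptions, not the PR's actual file: the names `_DIGIT_TO_FRENCH` and `_RE_MIXED_NUMBER` are taken from the review text, but their exact contents here are guesses, and the function names are illustrative. `alpha2digit` is imported from text2num's `text_to_num` module and is skipped when the package is not installed.

```python
import re

try:
    # text2num exposes alpha2digit in the text_to_num module.
    from text_to_num import alpha2digit
except ImportError:  # dependency not installed
    alpha2digit = None

# Assumed mapping; the PR's real table is not shown in this conversation.
_DIGIT_TO_FRENCH = {
    "1": "un", "2": "deux", "3": "trois", "4": "quatre", "5": "cinq",
    "6": "six", "7": "sept", "8": "huit", "9": "neuf",
}
# Scale words as quoted in the review comment on number_normalizer.py.
_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+(millions?|milliards?|billions?|trillions?)\b",
    flags=re.IGNORECASE,
)


def normalize_mixed_numbers(text: str) -> str:
    """Spell out a single digit that directly precedes a large scale word."""

    def _replace(match: re.Match) -> str:
        digits, scale = match.group(1), match.group(2)
        word = _DIGIT_TO_FRENCH.get(digits)  # None for multi-digit numbers
        return f"{word} {scale}" if word else match.group(0)

    return _RE_MIXED_NUMBER.sub(_replace, text)


def expand_written_numbers(text: str) -> str:
    """Run the pre-pass, then delegate to text2num when it is available."""
    text = normalize_mixed_numbers(text)
    if alpha2digit is None:
        return text
    return alpha2digit(text, "fr")
```

Here `normalize_mixed_numbers("3 milliards")` yields `trois milliards`, which is then handed to `alpha2digit`. Decimal inputs such as "3,5 milliards" are not treated specially by this sketch.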

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped, I said "trois" for you,
A digit that becomes a word and then is king again,
The apostrophes unfold, the words draw close,
The tests applaud, the normalizer lights up,
I nibble a carrot, all is in order, good evening 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately captures the main enhancement: improving French text normalization with number conversion and contraction expansion features. |
| Description check | ✅ Passed | The PR description is mostly complete, with a clear summary of changes, correct type selection, and comprehensive checklist items marked. However, the Tests section lacks detail about what was tested. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
normalization/languages/french/replacements.py (1)

1-28: Add type annotation for consistency.

The FRENCH_REPLACEMENTS dictionary is missing an explicit type annotation, unlike FRENCH_SENTENCE_REPLACEMENTS which has : dict[str, str]. Adding this improves consistency and helps static type checkers.

♻️ Suggested fix
-FRENCH_REPLACEMENTS = {
+FRENCH_REPLACEMENTS: dict[str, str] = {
     # contractions in titles/prefixes
     "mme": "madame",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/french/replacements.py` around lines 1 - 28, The
FRENCH_REPLACEMENTS dict lacks an explicit type annotation; add the same
annotation used for FRENCH_SENTENCE_REPLACEMENTS (dict[str, str]) to
FRENCH_REPLACEMENTS to make types consistent and satisfy static type checkers,
updating the declaration of FRENCH_REPLACEMENTS accordingly.
normalization/languages/french/number_normalizer.py (1)

30-52: Consider adding "mille" (thousand) and "cent" (hundred) to the pattern.

Currently, the pattern only handles millions?|milliards?|billions?|trillions?, but French also uses mille and cent as scale words. The same concatenation issue that affects "3 milliards" → "31e9" could theoretically apply to "3 mille" and "3 cent". Since both are already in FRENCH_CONFIG.number_words and the fix (single-digit-to-word conversion) is identical to the existing logic, extending the pattern would be consistent:

r"\b(\d+)\s+(millions?|milliards?|billions?|trillions?|mille|cent)\b"

If alpha2digit reliably handles these cases without pre-normalization, the current implementation is fine; otherwise, this extension ensures uniform handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/french/number_normalizer.py` around lines 30 - 52,
The _RE_MIXED_NUMBER regex and _normalize_mixed_numbers function need to cover
French "mille" and "cent" scales too; update the pattern defined in
_RE_MIXED_NUMBER to include "mille" and "cent" (and optional plural form for
"cent" if desired) so single-digit numbers like "3 mille" or "3 cent" are
converted to words via _DIGIT_TO_FRENCH before alpha2digit runs, preserving the
existing replace logic in _normalize_mixed_numbers unchanged.
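
To make the proposed extension concrete, the sketch below compares the pattern quoted above with a version that also accepts "mille" and "cent". The `cents?` plural is the optional variant the comment mentions, and `scale_matches` is a hypothetical helper that exists only for this comparison. Whether `alpha2digit` actually needs the pre-pass for these two scale words is not established in this thread.

```python
import re

# Pattern as quoted in the review, and the proposed extension.
ORIGINAL = re.compile(
    r"\b(\d+)\s+(millions?|milliards?|billions?|trillions?)\b"
)
EXTENDED = re.compile(
    r"\b(\d+)\s+(millions?|milliards?|billions?|trillions?|mille|cents?)\b"
)


def scale_matches(pattern: re.Pattern, text: str) -> list[tuple[str, str]]:
    """Return the (digits, scale word) pairs a pattern would hand to the pre-pass."""
    return pattern.findall(text)
```

For the input `3 mille et 2 cents`, the original pattern finds nothing, while the extended one returns `("3", "mille")` and `("2", "cents")`.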
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/french/operators.py`:
- Around line 115-117: The negative lookahead used to prevent expansions before
vowels+`h` is wrong because the `vowels` string (variable name `vowels`) omits
`'h'`, so the regex `_V = rf"(?![{vowels}{vowels.upper()}])"` will not block
elisions like "j'homme" or "l'heure"; fix by adding `'h'` (and optionally `'H'`
via the existing uppercasing) to the `vowels` string so that `_V` correctly
includes `h` in its character class, leaving the comment "Vowels + h" consistent
with the code.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 351ddfe9-43d4-424e-b843-18029e918670

📥 Commits

Reviewing files that changed from the base of the PR and between 8974532 and 6735b41.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (8)
  • normalization/languages/french/number_normalizer.py
  • normalization/languages/french/operators.py
  • normalization/languages/french/replacements.py
  • normalization/languages/french/sentence_replacements.py
  • pyproject.toml
  • tests/unit/steps/text/apply_sentence_level_replacements_test.py
  • tests/unit/steps/text/conftest.py
  • tests/unit/steps/text/expand_contractions_test.py

@Karamouche Karamouche force-pushed the feat/french-normalizer branch from 0f9970b to 6ae7563 on April 9, 2026 17:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
normalization/languages/french/operators.py (1)

115-117: ⚠️ Potential issue | 🟠 Major

The apostrophe guard still expands before h.

The comment says “vowels + h”, but h is still missing from vowels, so inputs like j'habite and l'heure can be rewritten to je habite and le heure. This is the same bug that was raised on the previous revision and it is still present.

🐛 Minimal fix
-        vowels = "aàâeéèêiîïoôuùûy"
+        vowels = "aàâeéèêiîïoôuùûyh"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/french/operators.py` around lines 115 - 117, The
guard for elision is missing 'h' in the vowels set: update the vowels variable
used to build _V in normalization/languages/french/operators.py so it includes
'h' (and 'H' via the existing upper() usage) to match the comment "Vowels + h";
ensure the _V regex construction (the variable _V) continues to use that updated
vowels string so elision before h (e.g., "j'habite", "l'heure") will not be
expanded.
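
The effect of the one-character fix can be checked with a small sketch. It rebuilds the negative lookahead from the vowel string in the suggested diff and applies only the `j'` rule; `expand_j` is a hypothetical helper, not a function from the PR, and only the straight apostrophe `'` is handled.

```python
import re

# Vowel set from the suggested fix, with "h" appended.
VOWELS = "aàâeéèêiîïoôuùûyh"
# Negative lookahead: succeeds only when the next character is NOT a vowel or h.
_V = rf"(?![{VOWELS}{VOWELS.upper()}])"


def expand_j(text: str) -> str:
    """Expand colloquial "j'" before a consonant; leave standard elision alone."""
    return re.sub(rf"\bj'{_V}", "je ", text, flags=re.IGNORECASE)
```

With `h` in the class, `j'habite ici` is left untouched, while the colloquial `j'suis là` still becomes `je suis là`.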
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/french/operators.py`:
- Around line 55-92: The number_words list is missing common hyphenated French
numerals; update the number_words variable to include standard hyphenated forms
such as "dix-sept", "dix-huit", "dix-neuf", "soixante-dix", "quatre-vingt",
"quatre-vingts", and "quatre-vingt-dix" so the detection covers ordinary French
spellings (modify the number_words array in the French operators module).
- Around line 119-126: The generic expansions for c' and l' (the two re.sub
calls matching rf"\bc'{_V}" and rf"\bl'{_V}") are unsafe and produce
ungrammatical output (e.g., "c'pas" -> "ce pas", "l'voiture" -> "le voiture");
remove or restrict these two rules: either delete/comment out the re.sub lines
for c' and l' in operators.py, or replace them with narrowly scoped expansions
that only match known safe contractions (for example only expand "c'est"/"c'et"
to "ce " and avoid expanding arbitrary c' + _V, and do not expand l' at all).
Keep the other contractions (d', qu', n', s', m', t') unchanged and continue
using the same text variable and _V pattern.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7fa84fd4-88f8-4b61-a368-d649ce4796c0

📥 Commits

Reviewing files that changed from the base of the PR and between 0f9970b and 6ae7563.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (8)
  • normalization/languages/french/number_normalizer.py
  • normalization/languages/french/operators.py
  • normalization/languages/french/replacements.py
  • normalization/languages/french/sentence_replacements.py
  • pyproject.toml
  • tests/unit/steps/text/apply_sentence_level_replacements_test.py
  • tests/unit/steps/text/conftest.py
  • tests/unit/steps/text/expand_contractions_test.py
✅ Files skipped from review due to trivial changes (2)
  • normalization/languages/french/sentence_replacements.py
  • normalization/languages/french/replacements.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/unit/steps/text/conftest.py
  • pyproject.toml
  • tests/unit/steps/text/expand_contractions_test.py
  • normalization/languages/french/number_normalizer.py
  • tests/unit/steps/text/apply_sentence_level_replacements_test.py

Comment on lines +55 to +92
number_words=[
"zéro",
"un",
"deux",
"trois",
"quatre",
"cinq",
"six",
"sept",
"huit",
"neuf",
"dix",
"onze",
"douze",
"treize",
"quatorze",
"quinze",
"seize",
"vingt",
"trente",
"quarante",
"cinquante",
"soixante",
"septante",
"octante",
"huitante",
"nonante",
"cent",
"mille",
"million",
"millions",
"milliard",
"milliards",
"billion",
"billions",
"trillion",
"trillions",
],

⚠️ Potential issue | 🟠 Major

number_words is missing common standard French forms.

This list is used for number-word detection, but ordinary spellings like dix-sept, dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, and quatre-vingt-dix are absent while rarer regional variants are present. Those phrases will be skipped by config-driven steps.

💡 At minimum, add the standard hyphenated forms
     number_words=[
         "quinze",
         "seize",
+        "dix-sept",
+        "dix-huit",
+        "dix-neuf",
         "vingt",
         "trente",
         "quarante",
         "cinquante",
         "soixante",
+        "soixante-dix",
         "septante",
         "octante",
         "huitante",
         "nonante",
+        "quatre-vingt",
+        "quatre-vingts",
+        "quatre-vingt-dix",
         "cent",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/french/operators.py` around lines 55 - 92, The
number_words list is missing common hyphenated French numerals; update the
number_words variable to include standard hyphenated forms such as "dix-sept",
"dix-huit", "dix-neuf", "soixante-dix", "quatre-vingt", "quatre-vingts", and
"quatre-vingt-dix" so the detection covers ordinary French spellings (modify the
number_words array in the French operators module).
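
Why the missing entries matter depends on how the list is consumed, which this thread does not show. The sketch below assumes the simplest case, an exact case-insensitive lookup of whitespace-separated tokens. `find_number_words` and the abbreviated lists are hypothetical and serve only to illustrate the reviewer's point.

```python
# Abbreviated lists for illustration; the PR's number_words list is longer.
BASE_NUMBER_WORDS = ["dix", "seize", "vingt", "soixante", "quatre", "cent"]
HYPHENATED_FORMS = [
    "dix-sept", "dix-huit", "dix-neuf",
    "soixante-dix", "quatre-vingt", "quatre-vingts", "quatre-vingt-dix",
]


def find_number_words(text: str, number_words: list[str]) -> list[str]:
    """Return whitespace-separated tokens that appear in number_words.

    Assumption: detection is an exact, case-insensitive token lookup.
    """
    known = set(number_words)
    return [tok for tok in text.lower().split() if tok in known]
```

Under that assumption, `il a dix-sept ans` yields no hits with the base list alone and `["dix-sept"]` once the hyphenated forms are added. A step that splits tokens on hyphens first would behave differently.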


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
normalization/languages/french/operators.py (2)

55-92: ⚠️ Potential issue | 🟠 Major

Add standard hyphenated French numerals to number_words.

Lines 55–92 still miss common standard forms (dix-sept, dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, quatre-vingt-dix), so config-driven number-word detection can skip ordinary spellings.

Suggested patch
     number_words=[
         "zéro",
         "un",
         "deux",
         "trois",
         "quatre",
         "cinq",
         "six",
         "sept",
         "huit",
         "neuf",
         "dix",
         "onze",
         "douze",
         "treize",
         "quatorze",
         "quinze",
         "seize",
+        "dix-sept",
+        "dix-huit",
+        "dix-neuf",
         "vingt",
         "trente",
         "quarante",
         "cinquante",
         "soixante",
+        "soixante-dix",
         "septante",
         "octante",
         "huitante",
         "nonante",
+        "quatre-vingt",
+        "quatre-vingts",
+        "quatre-vingt-dix",
         "cent",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/french/operators.py` around lines 55 - 92, The
number_words list is missing standard hyphenated French numerals which will
cause ordinary spellings to be missed; update the number_words list in
normalization/languages/french/operators.py (the number_words variable) to
include the common hyphenated forms: dix-sept, dix-huit, dix-neuf, soixante-dix,
quatre-vingt, quatre-vingts, and quatre-vingt-dix (and any plural or variant
forms you need) so the parser recognizes these standard spellings; keep entries
as lowercase strings consistent with the existing list.

119-126: ⚠️ Potential issue | 🟠 Major

Avoid generic expansion for c' and l'.

Line 119 and Line 126 can produce ungrammatical output (c'pas → ce pas, l'voiture → le voiture). These two rules are too ambiguous for unconditional expansion.

Safer minimal patch
         text = re.sub(rf"\bj'{_V}", "je ", text, flags=re.IGNORECASE)
-        text = re.sub(rf"\bc'{_V}", "ce ", text, flags=re.IGNORECASE)
         text = re.sub(rf"\bd'{_V}", "de ", text, flags=re.IGNORECASE)
         text = re.sub(rf"\bqu'{_V}", "que ", text, flags=re.IGNORECASE)
         text = re.sub(rf"\bn'{_V}", "ne ", text, flags=re.IGNORECASE)
         text = re.sub(rf"\bs'{_V}", "se ", text, flags=re.IGNORECASE)
         text = re.sub(rf"\bm'{_V}", "me ", text, flags=re.IGNORECASE)
         text = re.sub(rf"\bt'{_V}", "te ", text, flags=re.IGNORECASE)
-        text = re.sub(rf"\bl'{_V}", "le ", text, flags=re.IGNORECASE)
         return text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/french/operators.py` around lines 119 - 126, The
unconditional replacements for rf"\bc'{_V}" and rf"\bl'{_V}" produce
ungrammatical expansions (e.g., "c'pas" → "ce pas", "l'voiture" → "le voiture");
remove these two generic lines or replace them with targeted rules that only
expand well-known contractions from a small whitelist, rather than any c' or l'
that is not followed by a vowel (which is what the negative lookahead _V
matches). Locate the two regex substitutions using rf"\bc'{_V}" and rf"\bl'{_V}"
in the normalization/languages/french/operators.py code and either delete them
or change them to explicit, whitelist-based patterns to avoid ambiguous
expansions.
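
One way to express the "safer minimal patch" is a table-driven rule set that simply omits the two ambiguous prefixes. The sketch below is an illustration under assumptions rather than the PR's code: `SAFE_EXPANSIONS` is a hypothetical name, and the vowel string includes `h` as proposed in the earlier review comment.

```python
import re

VOWELS = "aàâeéèêiîïoôuùûyh"
# Negative lookahead: the next character must NOT be a vowel or h.
_V = rf"(?![{VOWELS}{VOWELS.upper()}])"

# Prefixes kept by the suggested patch; "c'" and "l'" are deliberately absent.
SAFE_EXPANSIONS: dict[str, str] = {
    "j": "je", "d": "de", "qu": "que", "n": "ne",
    "s": "se", "m": "me", "t": "te",
}


def expand_contractions(text: str) -> str:
    """Expand only the unambiguous prefixes, and only when a consonant follows."""
    for prefix, full in SAFE_EXPANSIONS.items():
        text = re.sub(rf"\b{prefix}'{_V}", f"{full} ", text, flags=re.IGNORECASE)
    return text
```

With this table, `j'sais pas` becomes `je sais pas`, while `c'pas` and `l'voiture` pass through unchanged instead of turning into `ce pas` and `le voiture`.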

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4c70e403-8181-4220-bb97-0a1cd681510b

📥 Commits

Reviewing files that changed from the base of the PR and between 6ae7563 and 8453b56.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (1)
  • normalization/languages/french/operators.py

@egenthon-cmd egenthon-cmd requested a review from Karamouche April 10, 2026 08:25
@egenthon-cmd egenthon-cmd changed the title from "feat: improve french normalizer" to "feat: improve french text normalization with number conversion and contraction expansion" Apr 10, 2026