feat: improve french text normalization with number conversion and contraction expansion #13
egenthon-cmd wants to merge 9 commits into main from
Conversation
📝 Walkthrough

Adds French number normalization (a new FrenchNumberNormalizer using text2num.alpha2digit), expands the French replacement maps and operators (contraction and written-number handling), and adds tests, a pyproject dependency, and a package-data entry.
Sequence Diagram(s)

sequenceDiagram
actor Caller
rect rgba(100,150,240,0.5)
participant FrenchOperators
end
rect rgba(100,200,120,0.5)
participant FrenchNumberNormalizer
end
rect rgba(220,120,120,0.5)
participant text2num
end
Caller->>FrenchOperators: process text
FrenchOperators->>FrenchNumberNormalizer: expand_written_numbers(text)
FrenchNumberNormalizer->>FrenchNumberNormalizer: pre-pass rewrite (e.g., "3 milliards" → "trois milliards")
FrenchNumberNormalizer->>text2num: alpha2digit(text, "fr")
text2num-->>FrenchNumberNormalizer: numericified text
FrenchNumberNormalizer-->>FrenchOperators: normalized text
FrenchOperators-->>Caller: final normalized text
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
normalization/languages/french/replacements.py (1)
1-28: Add type annotation for consistency.

The FRENCH_REPLACEMENTS dictionary is missing an explicit type annotation, unlike FRENCH_SENTENCE_REPLACEMENTS, which has dict[str, str]. Adding it improves consistency and helps static type checkers.

♻️ Suggested fix

-FRENCH_REPLACEMENTS = {
+FRENCH_REPLACEMENTS: dict[str, str] = {
     # contractions in titles/prefixes
     "mme": "madame",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/replacements.py` around lines 1 - 28, the FRENCH_REPLACEMENTS dict lacks an explicit type annotation; add the same annotation used for FRENCH_SENTENCE_REPLACEMENTS (dict[str, str]) to FRENCH_REPLACEMENTS to make types consistent and satisfy static type checkers, updating the declaration of FRENCH_REPLACEMENTS accordingly.

normalization/languages/french/number_normalizer.py (1)
30-52: Consider adding "mille" (thousand) and "cent" (hundred) to the pattern.

Currently, the pattern only handles millions?|milliards?|billions?|trillions?, but French also uses mille and cent as scale words. The same concatenation issue that affects "3 milliards" → "31e9" could theoretically apply to "3 mille" and "3 cent". Since both are already in FRENCH_CONFIG.number_words and the fix (single-digit-to-word conversion) is identical to the existing logic, extending the pattern would be consistent:

r"\b(\d+)\s+(millions?|milliards?|billions?|trillions?|mille|cent)\b"

If alpha2digit reliably handles these cases without pre-normalization, the current implementation is fine; otherwise, this extension ensures uniform handling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/number_normalizer.py` around lines 30 - 52, The _RE_MIXED_NUMBER regex and _normalize_mixed_numbers function need to cover French "mille" and "cent" scales too; update the pattern defined in _RE_MIXED_NUMBER to include "mille" and "cent" (and optional plural form for "cent" if desired) so single-digit numbers like "3 mille" or "3 cent" are converted to words via _DIGIT_TO_FRENCH before alpha2digit runs, preserving the existing replace logic in _normalize_mixed_numbers unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/french/operators.py`:
- Around line 115-117: The negative lookahead used to prevent expansions before
vowels+`h` is wrong because the `vowels` string (variable name `vowels`) omits
`'h'`, so the regex `_V = rf"(?![{vowels}{vowels.upper()}])"` will not block
elisions like "j'homme" or "l'heure"; fix by adding `'h'` (and optionally `'H'`
via the existing uppercasing) to the `vowels` string so that `_V` correctly
includes `h` in its character class, leaving the comment "Vowels + h" consistent
with the code.
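The corrected guard can be sketched like this. The vowels string and the _V lookahead mirror the fix described above; expand_je is a hypothetical helper added just to show the guard in action:

```python
import re

# Vowels + h: French elision also happens before (mute) h, so 'h' must be in
# the guard or "j'habite" would wrongly become "je habite".
vowels = "aàâeéèêiîïoôuùûyh"
_V = rf"(?![{vowels}{vowels.upper()}])"

def expand_je(text: str) -> str:
    # Expand j' only before a consonant, i.e. where the apostrophe is a
    # spoken-style contraction rather than grammatical elision.
    return re.sub(rf"\bj'{_V}", "je ", text, flags=re.IGNORECASE)
```

The negative lookahead blocks expansion when the next letter is a vowel or h, so "j'pense" expands while "j'habite" is preserved.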
---
Nitpick comments:
In `@normalization/languages/french/number_normalizer.py`:
- Around line 30-52: The _RE_MIXED_NUMBER regex and _normalize_mixed_numbers
function need to cover French "mille" and "cent" scales too; update the pattern
defined in _RE_MIXED_NUMBER to include "mille" and "cent" (and optional plural
form for "cent" if desired) so single-digit numbers like "3 mille" or "3 cent"
are converted to words via _DIGIT_TO_FRENCH before alpha2digit runs, preserving
the existing replace logic in _normalize_mixed_numbers unchanged.
In `@normalization/languages/french/replacements.py`:
- Around line 1-28: The FRENCH_REPLACEMENTS dict lacks an explicit type
annotation; add the same annotation used for FRENCH_SENTENCE_REPLACEMENTS
(dict[str, str]) to FRENCH_REPLACEMENTS to make types consistent and satisfy
static type checkers, updating the declaration of FRENCH_REPLACEMENTS
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 351ddfe9-43d4-424e-b843-18029e918670
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (8)
- normalization/languages/french/number_normalizer.py
- normalization/languages/french/operators.py
- normalization/languages/french/replacements.py
- normalization/languages/french/sentence_replacements.py
- pyproject.toml
- tests/unit/steps/text/apply_sentence_level_replacements_test.py
- tests/unit/steps/text/conftest.py
- tests/unit/steps/text/expand_contractions_test.py
0f9970b to 6ae7563 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (1)
normalization/languages/french/operators.py (1)
115-117: ⚠️ Potential issue | 🟠 Major

The apostrophe guard still expands before h.

The comment says "vowels + h", but h is still missing from vowels, so inputs like j'habite and l'heure can be rewritten to je habite and le heure. This is the same bug that was raised on the previous revision and it is still present.

🐛 Minimal fix

-    vowels = "aàâeéèêiîïoôuùûy"
+    vowels = "aàâeéèêiîïoôuùûyh"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/operators.py` around lines 115 - 117, The guard for elision is missing 'h' in the vowels set: update the vowels variable used to build _V in normalization/languages/french/operators.py so it includes 'h' (and 'H' via the existing upper() usage) to match the comment "Vowels + h"; ensure the _V regex construction (the variable _V) continues to use that updated vowels string so elision before h (e.g., "j'habite", "l'heure") will not be expanded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/french/operators.py`:
- Around line 55-92: The number_words list is missing common hyphenated French
numerals; update the number_words variable to include standard hyphenated forms
such as "dix-sept", "dix-huit", "dix-neuf", "soixante-dix", "quatre-vingt",
"quatre-vingts", and "quatre-vingt-dix" so the detection covers ordinary French
spellings (modify the number_words array in the French operators module).
- Around line 119-126: The generic expansions for c' and l' (the two re.sub
calls matching rf"\bc'{_V}" and rf"\bl'{_V}") are unsafe and produce
ungrammatical output (e.g., "c'pas" -> "ce pas", "l'voiture" -> "le voiture");
remove or restrict these two rules: either delete/comment out the re.sub lines
for c' and l' in operators.py, or replace them with narrowly scoped expansions
that only match known safe contractions (for example only expand "c'est"/"c'et"
to "ce " and avoid expanding arbitrary c' + _V, and do not expand l' at all).
Keep the other contractions (d', qu', n', s', m', t') unchanged and continue
using the same text variable and _V pattern.
---
Duplicate comments:
In `@normalization/languages/french/operators.py`:
- Around line 115-117: The guard for elision is missing 'h' in the vowels set:
update the vowels variable used to build _V in
normalization/languages/french/operators.py so it includes 'h' (and 'H' via the
existing upper() usage) to match the comment "Vowels + h"; ensure the _V regex
construction (the variable _V) continues to use that updated vowels string so
elision before h (e.g., "j'habite", "l'heure") will not be expanded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7fa84fd4-88f8-4b61-a368-d649ce4796c0
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (8)
- normalization/languages/french/number_normalizer.py
- normalization/languages/french/operators.py
- normalization/languages/french/replacements.py
- normalization/languages/french/sentence_replacements.py
- pyproject.toml
- tests/unit/steps/text/apply_sentence_level_replacements_test.py
- tests/unit/steps/text/conftest.py
- tests/unit/steps/text/expand_contractions_test.py
✅ Files skipped from review due to trivial changes (2)
- normalization/languages/french/sentence_replacements.py
- normalization/languages/french/replacements.py
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/unit/steps/text/conftest.py
- pyproject.toml
- tests/unit/steps/text/expand_contractions_test.py
- normalization/languages/french/number_normalizer.py
- tests/unit/steps/text/apply_sentence_level_replacements_test.py
number_words=[
    "zéro",
    "un",
    "deux",
    "trois",
    "quatre",
    "cinq",
    "six",
    "sept",
    "huit",
    "neuf",
    "dix",
    "onze",
    "douze",
    "treize",
    "quatorze",
    "quinze",
    "seize",
    "vingt",
    "trente",
    "quarante",
    "cinquante",
    "soixante",
    "septante",
    "octante",
    "huitante",
    "nonante",
    "cent",
    "mille",
    "million",
    "millions",
    "milliard",
    "milliards",
    "billion",
    "billions",
    "trillion",
    "trillions",
],
number_words is missing common standard French forms.
This list is used for number-word detection, but ordinary spellings like dix-sept, dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, and quatre-vingt-dix are absent while rarer regional variants are present. Those phrases will be skipped by config-driven steps.
💡 At minimum, add the standard hyphenated forms
number_words=[
"quinze",
"seize",
+ "dix-sept",
+ "dix-huit",
+ "dix-neuf",
"vingt",
"trente",
"quarante",
"cinquante",
"soixante",
+ "soixante-dix",
"septante",
"octante",
"huitante",
"nonante",
+ "quatre-vingt",
+ "quatre-vingts",
+ "quatre-vingt-dix",
     "cent",

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/french/operators.py` around lines 55 - 92, The
number_words list is missing common hyphenated French numerals; update the
number_words variable to include standard hyphenated forms such as "dix-sept",
"dix-huit", "dix-neuf", "soixante-dix", "quatre-vingt", "quatre-vingts", and
"quatre-vingt-dix" so the detection covers ordinary French spellings (modify the
number_words array in the French operators module).
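One way the hyphenated forms interact with detection can be sketched as follows (an illustration, not the PR's code; find_number_words and the trimmed word list are hypothetical). With regex alternation, longer forms must be tried first, or quatre-vingt-dix would only match as its quatre-vingt (or dix) prefix:

```python
import re

# Trimmed word list including the hyphenated forms under discussion.
number_words = [
    "dix-sept", "dix-huit", "dix-neuf",
    "soixante-dix", "quatre-vingt", "quatre-vingts", "quatre-vingt-dix",
    "quatre", "vingt", "dix", "cent", "mille",
]

# Sort alternatives longest-first so multi-part numerals win over their
# prefixes ('-' is a non-word character, so \b would accept a prefix match).
_pattern = re.compile(
    r"\b(?:"
    + "|".join(sorted(map(re.escape, number_words), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def find_number_words(text: str) -> list[str]:
    return _pattern.findall(text)
```

The ordering matters precisely because the hyphen is a word boundary: without longest-first alternation, "quatre-vingt-dix" would be reported as two or three smaller number words.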
♻️ Duplicate comments (2)
normalization/languages/french/operators.py (2)
55-92: ⚠️ Potential issue | 🟠 Major

Add standard hyphenated French numerals to number_words.

Lines 55-92 still miss common standard forms (dix-sept, dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, quatre-vingt-dix), so config-driven number-word detection can skip ordinary spellings.

Suggested patch

 number_words=[
     "zéro",
     "un",
     "deux",
     "trois",
     "quatre",
     "cinq",
     "six",
     "sept",
     "huit",
     "neuf",
     "dix",
     "onze",
     "douze",
     "treize",
     "quatorze",
     "quinze",
     "seize",
+    "dix-sept",
+    "dix-huit",
+    "dix-neuf",
     "vingt",
     "trente",
     "quarante",
     "cinquante",
     "soixante",
+    "soixante-dix",
     "septante",
     "octante",
     "huitante",
     "nonante",
+    "quatre-vingt",
+    "quatre-vingts",
+    "quatre-vingt-dix",
     "cent",

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/operators.py` around lines 55 - 92, The number_words list is missing standard hyphenated French numerals which will cause ordinary spellings to be missed; update the number_words list in normalization/languages/french/operators.py (the number_words variable) to include the common hyphenated forms: dix-sept, dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, and quatre-vingt-dix (and any plural or variant forms you need) so the parser recognizes these standard spellings; keep entries as lowercase strings consistent with the existing list.
119-126: ⚠️ Potential issue | 🟠 Major

Avoid generic expansion for c' and l'.

Lines 119 and 126 can produce ungrammatical output (c'pas → ce pas, l'voiture → le voiture). These two rules are too ambiguous for unconditional expansion.

Safer minimal patch

     text = re.sub(rf"\bj'{_V}", "je ", text, flags=re.IGNORECASE)
-    text = re.sub(rf"\bc'{_V}", "ce ", text, flags=re.IGNORECASE)
     text = re.sub(rf"\bd'{_V}", "de ", text, flags=re.IGNORECASE)
     text = re.sub(rf"\bqu'{_V}", "que ", text, flags=re.IGNORECASE)
     text = re.sub(rf"\bn'{_V}", "ne ", text, flags=re.IGNORECASE)
     text = re.sub(rf"\bs'{_V}", "se ", text, flags=re.IGNORECASE)
     text = re.sub(rf"\bm'{_V}", "me ", text, flags=re.IGNORECASE)
     text = re.sub(rf"\bt'{_V}", "te ", text, flags=re.IGNORECASE)
-    text = re.sub(rf"\bl'{_V}", "le ", text, flags=re.IGNORECASE)
     return text

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/operators.py` around lines 119 - 126, The unconditional replacements for rf"\bc'{_V}" and rf"\bl'{_V}" produce ungrammatical expansions (e.g., "c'pas" → "ce pas", "l'voiture" → "le voiture"); remove these two generic lines or replace them with targeted rules that only expand well-known contractions (e.g., match "c'est", "c'était", "c'll?" or a small whitelist) rather than any c' or l' followed by a vowel. Locate the two regex substitutions using rf"\bc'{_V}" and rf"\bl'{_V}" in the normalization/languages/french/operators.py code and either delete them or change them to explicit, whitelist-based patterns to avoid ambiguous expansions.
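A whitelist-based variant of the kind suggested above might look like this (expand_c_safely is a hypothetical helper, and the safe-form list is an assumption for illustration, not taken from the PR):

```python
import re

# Expand c' only before a short whitelist of verb forms where the expansion
# is unambiguous; "c'pas", "c'te", etc. are left alone.
_SAFE_AFTER_C = ("est", "était")

def expand_c_safely(text: str) -> str:
    # Lookahead keeps the following word in place; only "c'" is rewritten.
    guard = "|".join(rf"{w}\b" for w in _SAFE_AFTER_C)
    return re.sub(rf"\bc'(?={guard})", "ce ", text, flags=re.IGNORECASE)
```

Restricting the match this way trades coverage for safety: unknown c' forms pass through unchanged instead of being turned into ungrammatical text.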
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@normalization/languages/french/operators.py`:
- Around line 55-92: The number_words list is missing standard hyphenated French
numerals which will cause ordinary spellings to be missed; update the
number_words list in normalization/languages/french/operators.py (the
number_words variable) to include the common hyphenated forms: dix-sept,
dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, and
quatre-vingt-dix (and any plural or variant forms you need) so the parser
recognizes these standard spellings; keep entries as lowercase strings
consistent with the existing list.
- Around line 119-126: The unconditional replacements for rf"\bc'{_V}" and
rf"\bl'{_V}" produce ungrammatical expansions (e.g., "c'pas" → "ce pas",
"l'voiture" → "le voiture"); remove these two generic lines or replace them with
targeted rules that only expand well-known contractions (e.g., match "c'est",
"c'était", "c'll?" or a small whitelist) rather than any c' or l' followed by a
vowel. Locate the two regex substitutions using rf"\bc'{_V}" and rf"\bl'{_V}" in
the normalization/languages/french/operators.py code and either delete them or
change them to explicit, whitelist-based patterns to avoid ambiguous expansions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4c70e403-8181-4220-bb97-0a1cd681510b
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csvis excluded by!**/*.csv
📒 Files selected for processing (1)
normalization/languages/french/operators.py
What does this PR do?
Enhances the French text normalization pipeline with richer, more consistent output:
Converts written numbers to digits
Expands apostrophe contractions
Adds word/phrase replacements for cleaner normalization
Also adds unit tests with a French operators fixture covering sentence-level replacements and contraction behavior, plus a runtime dependency for number-to-digit conversion.
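The two behaviors described above can be illustrated end to end with a minimal sketch. normalize_fr and the tiny word-to-digit table are stand-ins for the real pipeline, which delegates written-number conversion to the text2num dependency:

```python
import re

# Toy stand-in for text2num.alpha2digit, limited to a few words.
_WORD_TO_DIGIT = {"deux": "2", "trois": "3", "vingt": "20"}

def normalize_fr(text: str) -> str:
    # 1) Expand the j' contraction, but not before a vowel or h (elision).
    text = re.sub(r"\bj'(?![aàâeéèêiîïoôuùûyhAÀÂEÉÈÊIÎÏOÔUÙÛYH])", "je ", text)
    # 2) Convert the written numbers this toy table knows about to digits.
    return re.sub(
        r"\b(" + "|".join(_WORD_TO_DIGIT) + r")\b",
        lambda m: _WORD_TO_DIGIT[m.group(1).lower()],
        text,
        flags=re.IGNORECASE,
    )
```

For example, "j'veux trois cafés" becomes "je veux 3 cafés", while the elided "j'habite" is left intact.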
Type of change
- New language (languages/{lang}/)
- New/changed step (steps/text/ or steps/word/)
- Preset change (presets/)

Checklist
New language
- languages/{lang}/ with operators.py, replacements.py, __init__.py
- Replacements in replacements.py, not inline in operators.py
- @register_language in languages/__init__.py
- Tests in tests/unit/languages/ and tests/e2e/files/

New step
- name class attribute is unique and matches the YAML key
- @register_step in steps/text/__init__.py or steps/word/__init__.py
- Reads operators.config.*, no hardcoded language-specific values
- if operators.config.field is None: return text
- steps/text/placeholders.py and pipeline/base.py's validate() is updated
- Tests in tests/unit/steps/
- uv run scripts/generate_step_docs.py to regenerate docs/steps.md

Preset change
Tests
Summary by CodeRabbit
New Features
Tests
Dependencies