
fix(eval): Support non-English languages in response_match_score #3923

Open

AhrendsW wants to merge 2 commits into google:main from AhrendsW:fix/eval-non-english-languages

Conversation

@AhrendsW commented Dec 16, 2025

Summary

  • Fixes the response_match_score evaluation metric for non-English languages (non-Latin scripts such as Chinese, Japanese, Korean, and Arabic)
  • Adds a Unicode-aware tokenizer that handles scripts without whitespace word boundaries
  • Falls back to character-level tokenization for non-Latin scripts when the nltk word tokenizer doesn't split the text properly (a sketch follows this list)
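
For illustration, here is a minimal sketch of the kind of Unicode-aware tokenizer with character-level fallback described above. The class name, regex, and fallback condition are assumptions for this sketch, not necessarily the PR's exact code:

```python
import re
import unicodedata


class _UnicodeTokenizer:
  """Illustrative sketch of a Unicode-aware ROUGE tokenizer.

  The default rouge_score tokenizer only keeps ASCII alphanumeric runs,
  so non-Latin text tokenizes to an empty list and always scores 0.0.
  """

  def tokenize(self, text):
    tokens = []
    for run in re.findall(r"\w+", text.lower()):
      has_latin = any(
          "LATIN" in unicodedata.name(ch, "") for ch in run if ch.isalpha()
      )
      if has_latin or run.isdigit():
        tokens.append(run)  # whitespace-delimited Latin words stay whole
      else:
        tokens.extend(run)  # character-level fallback for non-Latin runs
    return tokens
```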

Test plan

  • All eval metric tests pass
  • Full unittest suite passes
  • pyink and isort formatting verified via autoformat.sh
  • Lint checks pass (pyink --check reports no changes needed)
  • No merge conflicts with main

@gemini-code-assist (Contributor)

Summary of Changes

Hello @AhrendsW, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the ROUGE-1 evaluation system that previously prevented accurate scoring for non-English languages. By implementing a script detection mechanism, the system can now intelligently apply or disable language-specific stemming, ensuring that evaluation scores are reliable and meaningful across a diverse range of global languages. This enhancement significantly improves the utility of the evaluation metric for internationalized content.

Highlights

  • Internationalization Fix: The ROUGE-1 evaluation metric now correctly handles non-English languages by addressing an issue where the English-specific Porter stemmer caused incorrect 'Match score: 0' results for non-Latin scripts.
  • Script Detection Logic: A new helper function, _is_latin_script(), has been introduced. It uses Python's unicodedata module to determine if a given text is primarily composed of Latin script characters, allowing for intelligent conditional application of stemming.
  • Conditional Stemming: The _calculate_rouge_1_scores() function has been updated to only enable stemming when both the candidate and reference texts are detected as primarily Latin script. This preserves correct behavior for English and other Latin-based languages while disabling stemming for non-Latin scripts (e.g., Thai, Chinese, Arabic, Japanese, Korean) to ensure accurate scoring (see the sketch after this list).
  • Comprehensive Testing: Extensive unit tests have been added, including 15 tests for the _is_latin_script() function covering various script types and edge cases, and 22 tests for non-English ROUGE scoring across multiple languages (Thai, Chinese, Arabic, Japanese, Korean, Portuguese, French, German, Spanish), as well as English stemming verification and mixed-script scenarios.
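
As a rough illustration of the detection and conditional-stemming logic summarized above, here is a hedged sketch; the 0.5 threshold, signatures, and scorer wiring are assumptions, not the PR's verbatim code:

```python
import unicodedata

from rouge_score import rouge_scorer


def _is_latin_script(text, threshold=0.5):
  """Sketch: True when most alphabetic characters are Latin-script."""
  letters = [ch for ch in text if ch.isalpha()]
  if not letters:
    return True  # nothing to stem either way
  latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
  return latin / len(letters) >= threshold  # threshold is an assumption


def _calculate_rouge_1_scores(candidate, reference):
  # Porter stemming is English-specific; per this PR, leaving it on for
  # non-Latin scripts produced "Match score: 0" results.
  use_stemmer = _is_latin_script(candidate) and _is_latin_script(reference)
  scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=use_stemmer)
  return scorer.score(target=reference, prediction=candidate)["rouge1"]
```
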
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@adk-bot added the eval [Component] This issue is related to evaluation label Dec 16, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the issue of incorrect ROUGE-1 scores for non-English languages by introducing a mechanism to conditionally disable stemming. The new _is_latin_script helper function is a reasonable approach to detect script type, and its application in _calculate_rouge_1_scores correctly disables the Porter stemmer for non-Latin text. The accompanying unit tests are very comprehensive and cover a wide range of languages and scenarios. I have a few suggestions to make the test assertions more precise, which will strengthen the test suite against future regressions. Overall, this is a solid fix.

Comment on lines +249 to +251
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The assertions > 0 and < 1.0 are correct but not very precise. We can calculate the exact expected F-measure to make this test stronger. Given the candidate and reference texts, the F-measure should be exactly 0.4.

Calculation:

  • Candidate tokens: 5
  • Reference tokens: 5
  • Common tokens: 2
  • Precision = 2/5 = 0.4
  • Recall = 2/5 = 0.4
  • F-measure = 2 * (0.4 * 0.4) / (0.4 + 0.4) = 0.4
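
A quick plain-Python check of the arithmetic above, using the comment's own token counts:

```python
import math

precision = 2 / 5  # common tokens / candidate tokens
recall = 2 / 5     # common tokens / reference tokens
fmeasure = 2 * precision * recall / (precision + recall)
assert math.isclose(fmeasure, 0.4)
```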
Suggested change
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0
assert rouge_1_score.fmeasure == pytest.approx(0.4)

reference = "สวัสดี ค่ะ"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "สวัสดี" (1 out of 2 words)
assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)

medium

The assertion uses a relative tolerance rel=0.1, which is unnecessarily loose as the expected F-measure is exactly 0.5. Using pytest.approx(0.5) without a tolerance is more precise and makes the test stricter.

Suggested change
assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)
assert rouge_1_score.fmeasure == pytest.approx(0.5)

Comment on lines +273 to +279
candidate = "今天 天气 很好" # "Today's weather is good"
reference = "今天 我 很 开心" # "Today I am happy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今天" and "很"
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The comment on line 276, Should match "今天" and "很", is inconsistent with the candidate string "今天 天气 很好". The default tokenizer will treat "很好" as a single token, so only "今天" will be matched.

To align with the comment's intent and create a stronger test, I suggest splitting "很好" into "很 好" in the candidate string. This will result in an F-measure of 0.5.

Suggested change
candidate = "今天 天气 很好" # "Today's weather is good"
reference = "今天 我 很 开心" # "Today I am happy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今天" and "很"
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0
candidate = "今天 天气 很 好" # "Today's weather is very good"
reference = "今天 我 很 开心" # "Today I am happy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今天" and "很"
assert rouge_1_score.fmeasure == pytest.approx(0.5)

reference = "今日 は 仕事 が 忙しい です" # "Today work is busy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今日", "は", "が", "です"
assert rouge_1_score.fmeasure > 0.5

medium

The assertion > 0.5 is correct but could be more precise. The expected F-measure for this case is exactly 2/3. Using pytest.approx(2 / 3) will make the test more robust against future changes.

Suggested change
assert rouge_1_score.fmeasure > 0.5
assert rouge_1_score.fmeasure == pytest.approx(2 / 3)

Comment on lines +337 to +339
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The assertions > 0 and < 1.0 are correct but are not very specific. The expected F-measure can be calculated precisely as 2/3 for this test case. Using a more precise assertion makes the test stronger.

Calculation:

  • Candidate tokens: 3 (오늘, 날씨가, 좋습니다)
  • Reference tokens: 3 (오늘, 기분이, 좋습니다)
  • Common tokens: 2 (오늘, 좋습니다)
  • Precision = 2/3, Recall = 2/3
  • F-measure = 2/3
Suggested change
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0
assert rouge_1_score.fmeasure == pytest.approx(2 / 3)

@ryanaiagent self-assigned this Dec 16, 2025
@ryanaiagent added the request clarification [Status] The maintainer needs clarification or more information from the author label Dec 17, 2025
@ryanaiagent (Collaborator)

Hi @AhrendsW, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Can you fix the lint errors by running autoformat.sh?

@ryanaiagent added needs review [Status] The PR/issue is awaiting review from the maintainer and removed request clarification [Status] The maintainer needs clarification or more information from the author labels Dec 17, 2025
@ryanaiagent (Collaborator)

Hi @seanzhou1023, can you please review this?

@ryanaiagent (Collaborator)

Hi @AhrendsW, thank you for your patience here. I apologize for the delay in getting to this review; I know this has been sitting for a while. This PR has merge conflicts that require changes from your end. Could you please rebase your branch on the latest main branch to address these? Once this is complete, please let us know so we can proceed with the review.

@ryanaiagent added request clarification [Status] The maintainer needs clarification or more information from the author and removed needs review [Status] The PR/issue is awaiting review from the maintainer labels Jan 20, 2026
@AhrendsW force-pushed the fix/eval-non-english-languages branch from 144ec44 to de3c53a on January 22, 2026 14:02
@ryanaiagent (Collaborator)

Hi @AhrendsW, your PR has been received by the team and is currently under review. We will provide feedback as soon as we have an update to share.

@ryanaiagent (Collaborator)

Hi @wukath, can you please review this?

@ryanaiagent added needs review [Status] The PR/issue is awaiting review from the maintainer and removed request clarification [Status] The maintainer needs clarification or more information from the author labels Jan 22, 2026
@AhrendsW (Author) commented Feb 7, 2026

Hi @ryanaiagent @wukath, following up on this PR. All CI checks are passing — let me know if there's anything blocking the review.

@AhrendsW changed the title from "fix(eval): Support non-English languages in response_match_score" to "fix: Support pipe operator (X | Y) union syntax in function parameter parser" on Feb 7, 2026
The ROUGE-1 evaluation was returning score 0 for non-English languages
(Thai, Chinese, Arabic, etc.) because the Porter stemmer only works
for English text.

This fix:
- Adds _is_latin_script() function to detect text script using unicodedata
- Disables stemmer for non-Latin scripts while preserving it for English
- Adds comprehensive tests for Thai, Chinese, Arabic, Japanese, Korean,
  Portuguese, French, German, and Spanish

Fixes google#3111
The default rouge_scorer tokenizer only handles ASCII characters,
returning empty token lists for non-Latin scripts (Thai, Chinese,
Arabic, Japanese, Korean). This caused ROUGE scores of 0.0 even for
identical strings.

Changes:
- Add _UnicodeTokenizer class using Unicode-aware regex
- Use custom tokenizer for non-Latin scripts
- Fix import order per isort requirements
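
Putting the two commits together, here is a hedged sketch of how the pieces might be wired, assuming a rouge_score version whose RougeScorer accepts a tokenizer argument and reusing the illustrative _is_latin_script and _UnicodeTokenizer sketches from earlier in this thread:

```python
from rouge_score import rouge_scorer


def _make_rouge_1_scorer(candidate, reference):
  # Hypothetical helper name; shows the intended wiring only.
  latin = _is_latin_script(candidate) and _is_latin_script(reference)
  return rouge_scorer.RougeScorer(
      ["rouge1"],
      use_stemmer=latin,  # Porter stemming only helps Latin-script text
      tokenizer=None if latin else _UnicodeTokenizer(),  # None = default
  )
```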
@AhrendsW force-pushed the fix/eval-non-english-languages branch from 3e00b16 to 9acb4e5 on February 7, 2026 01:52
@AhrendsW changed the title from "fix: Support pipe operator (X | Y) union syntax in function parameter parser" back to "fix(eval): Support non-English languages in response_match_score" on Feb 7, 2026

Labels

eval [Component] This issue is related to evaluation
needs review [Status] The PR/issue is awaiting review from the maintainer
