fix(eval): Support non-English languages in response_match_score #3923

AhrendsW wants to merge 2 commits into google:main
Conversation
Summary of Changes

Hello @AhrendsW, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request fixes a bug in the ROUGE-1 evaluation that prevented accurate scoring for non-English languages. By adding script detection, the system now applies or disables language-specific stemming as appropriate, making evaluation scores reliable across a wide range of languages.
Code Review
This pull request effectively addresses the issue of incorrect ROUGE-1 scores for non-English languages by introducing a mechanism to conditionally disable stemming. The new _is_latin_script helper function is a reasonable approach to detect script type, and its application in _calculate_rouge_1_scores correctly disables the Porter stemmer for non-Latin text. The accompanying unit tests are very comprehensive and cover a wide range of languages and scenarios. I have a few suggestions to make the test assertions more precise, which will strengthen the test suite against future regressions. Overall, this is a solid fix.
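To make the failure mode concrete for other reviewers: as the second commit message notes, the stock `rouge_score` tokenizer keeps only ASCII alphanumerics, so identical non-Latin strings can score 0.0. A minimal reproduction, independent of this PR's code:

```python
from rouge_score import rouge_scorer

# The default tokenizer keeps only ASCII alphanumerics, so Thai text
# tokenizes to an empty list and even identical strings score 0.0.
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
print(scorer.score("สวัสดี ค่ะ", "สวัสดี ค่ะ")["rouge1"].fmeasure)  # 0.0
```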
```python
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0
```
The assertions `> 0` and `< 1.0` are correct but not very precise. We can calculate the exact expected F-measure to make this test stronger. Given the candidate and reference texts, the F-measure should be exactly 0.4.
Calculation:
- Candidate tokens: 5
- Reference tokens: 5
- Common tokens: 2
- Precision = 2/5 = 0.4
- Recall = 2/5 = 0.4
- F-measure = 2 * (0.4 * 0.4) / (0.4 + 0.4) = 0.4
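The arithmetic is easy to sanity-check in plain Python; the counts below are taken from this comment, since the full test strings are not quoted in the thread:

```python
import math

# ROUGE-1 F-measure from the unigram counts listed above.
candidate_tokens = 5
reference_tokens = 5
common_tokens = 2

precision = common_tokens / candidate_tokens  # 2/5 = 0.4
recall = common_tokens / reference_tokens     # 2/5 = 0.4
fmeasure = 2 * (precision * recall) / (precision + recall)
assert math.isclose(fmeasure, 0.4)
```

The tiny floating-point residue left by this formula is also why the suggested assertion below uses `pytest.approx` rather than exact equality.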
Suggested change:

```diff
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ assert rouge_1_score.fmeasure == pytest.approx(0.4)
```
```python
reference = "สวัสดี ค่ะ"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "สวัสดี" (1 out of 2 words)
assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)
```
The assertion uses a relative tolerance rel=0.1, which is unnecessarily loose as the expected F-measure is exactly 0.5. Using pytest.approx(0.5) without a tolerance is more precise and makes the test stricter.
Suggested change:

```diff
- assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)
+ assert rouge_1_score.fmeasure == pytest.approx(0.5)
```
```python
candidate = "今天 天气 很好"  # "Today's weather is good"
reference = "今天 我 很 开心"  # "Today I am happy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今天" and "很"
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0
```
The comment on line 276, `# Should match "今天" and "很"`, is inconsistent with the candidate string `"今天 天气 很好"`. The default tokenizer will treat `"很好"` as a single token, so only `"今天"` will be matched.

To align with the comment's intent and create a stronger test, I suggest splitting `"很好"` into `"很 好"` in the candidate string. This will result in an F-measure of 0.5.
Suggested change:

```diff
- candidate = "今天 天气 很好"  # "Today's weather is good"
+ candidate = "今天 天气 很 好"  # "Today's weather is very good"
  reference = "今天 我 很 开心"  # "Today I am happy"
  rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
  # Should match "今天" and "很"
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ assert rouge_1_score.fmeasure == pytest.approx(0.5)
```
```python
reference = "今日 は 仕事 が 忙しい です"  # "Today work is busy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今日", "は", "が", "です"
assert rouge_1_score.fmeasure > 0.5
```
```python
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0
```
The assertions `> 0` and `< 1.0` are correct but not very specific. The expected F-measure can be calculated precisely as 2/3 for this test case. Using a more precise assertion makes the test stronger.

Calculation:
- Candidate tokens: 3 (`오늘`, `날씨가`, `좋습니다`)
- Reference tokens: 3 (`오늘`, `기분이`, `좋습니다`)
- Common tokens: 2 (`오늘`, `좋습니다`)
- Precision = 2/3, Recall = 2/3
- F-measure = 2/3
Suggested change:

```diff
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ assert rouge_1_score.fmeasure == pytest.approx(2 / 3)
```
Hi @AhrendsW, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Hi @seanzhou1023, can you please review this?
Hi @AhrendsW, thank you for your patience here. I apologize for the delay in getting to this review; I know this has been sitting for a while. This PR has merge conflicts that require changes from your end. Could you please rebase your branch with the latest main branch to address these? Once this is complete, please let us know so we can proceed with the review.
Force-pushed from 144ec44 to de3c53a
Hi @AhrendsW, your PR has been received by the team and is currently under review. We will provide feedback as soon as we have an update to share.
Hi @wukath, can you please review this?
Hi @ryanaiagent @wukath, following up on this PR. All CI checks are passing — let me know if there's anything blocking the review.
The ROUGE-1 evaluation was returning a score of 0 for non-English languages (Thai, Chinese, Arabic, etc.) because the Porter stemmer only works for English text.

This fix:
- Adds an `_is_latin_script()` function to detect text script using `unicodedata`
- Disables the stemmer for non-Latin scripts while preserving it for English
- Adds comprehensive tests for Thai, Chinese, Arabic, Japanese, Korean, Portuguese, French, German, and Spanish

Fixes google#3111
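A sketch of how such a helper can gate stemming with the `rouge_score` package. The function names match the commit message, but the bodies here are an illustrative guess, since the diff itself is not reproduced in this thread:

```python
import unicodedata

from rouge_score import rouge_scorer


def _is_latin_script(text: str) -> bool:
  # Unicode names encode the script of each character, e.g.
  # "LATIN SMALL LETTER A" vs. "THAI CHARACTER SO SUA".
  return all("LATIN" in unicodedata.name(c, "") for c in text if c.isalpha())


def _calculate_rouge_1_scores(candidate: str, reference: str):
  # The Porter stemmer is English-only, so enable stemming only when
  # both strings are written in a Latin script.
  use_stemmer = _is_latin_script(candidate) and _is_latin_script(reference)
  scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=use_stemmer)
  # rouge_score's convention is score(target, prediction).
  return scorer.score(reference, candidate)["rouge1"]
```

A character-level check like this avoids pulling in a language-detection dependency and stays deterministic even for short or mixed-script strings.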
The default rouge_scorer tokenizer only handles ASCII characters, returning empty token lists for non-Latin scripts (Thai, Chinese, Arabic, Japanese, Korean). This caused ROUGE scores of 0.0 even for identical strings.

Changes:
- Add a `_UnicodeTokenizer` class using a Unicode-aware regex
- Use the custom tokenizer for non-Latin scripts
- Fix import order per isort requirements
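Since the diff body is not shown here, the following is only a guess at the tokenizer's shape. `rouge_score`'s `RougeScorer` accepts a custom tokenizer object exposing a `tokenize()` method, so a minimal Unicode-aware version could look like this (the exact regex in the PR may differ):

```python
import re

from rouge_score import rouge_scorer


class _UnicodeTokenizer:
  """Duck-typed tokenizer: rouge_score only calls tokenize(text)."""

  # In Python 3, \w matches Unicode word characters (CJK, Hangul,
  # Arabic, ...), unlike the ASCII-only default tokenizer.
  _WORD_RE = re.compile(r"\w+")

  def tokenize(self, text):
    return self._WORD_RE.findall(text.lower())


scorer = rouge_scorer.RougeScorer(
    ["rouge1"], use_stemmer=False, tokenizer=_UnicodeTokenizer()
)
# "今天" is the only shared token (two tokens per side) -> F-measure 0.5.
print(scorer.score("今天 开心", "今天 天气")["rouge1"].fmeasure)
```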
Force-pushed from 3e00b16 to 9acb4e5
Summary
- Fixes the `response_match_score` metric for non-English languages (non-Latin scripts like Chinese, Japanese, Korean, Arabic, etc.), where the `nltk` word tokenizer doesn't split properly

Test plan
- `pyink` and `isort` formatting verified via `autoformat.sh` (`pyink --check` reports no changes needed)
- Rebased on the latest `main`