Cheap Character Noise for OCR-Robust Multilingual Embeddings - Datasets, Resources and Adapted Models
This repository accompanies our ACL 2025 Findings paper, providing models, noisy datasets, and tools for robust multilingual embeddings under OCR noise. You’ll find fine-tuned models, evaluation and training data, and utilities for simulating character-level OCR noise.
- Overview
- Repository Structure
- Models
- Datasets
- Reproducing the Experiments
- Citation
- Further Support & Contributing
- About Impresso
- License
The repository is organized as follows:
├── noisy_evaluation_datasets
│ └── The noised evaluation datasets (CLSD - WMT19/21) produced for our experiments.
├── noisy_finetuning_data
│ └── The 10K (per language) noised training samples (TED - X-News) used for fine-tuning the models. Includes both random and realistic OCR noise variants.
├── ocr_simulator
│ └── The ocr_simulator library used to induce realistic OCR noise into texts.
├── generate_random_character_noise_latin_alphabet
│ └── The script used to stochastically generate the character-level noise used to fine-tune our models.
A version of our OCR-robust models (fine-tuned on TED-X with random noise) is available on Hugging Face:
impresso-project/OCR-robust-gte-multilingual-base
Noisy variants of the CLSD WMT datasets are available in noisy_evaluation_datasets.
Noisy versions (random and realistic) of TED and X-News parallel texts are available in noisy_finetuning_data.
Additional datasets used for evaluation and fine-tuning are also provided (link).
Instructions for reproducing the experiments will be available soon!
If you use these resources, please cite our paper:
@inproceedings{michail-etal-2025-cheap,
title = "Cheap Character Noise for {OCR}-Robust Multilingual Embeddings",
author = "Michail, Andrianos and
Opitz, Juri and
Wang, Yining and
Meister, Robin and
Sennrich, Rico and
Clematide, Simon",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.609/",
pages = "11705--11716",
ISBN = "979-8-89176-256-5"
}

In the future, we will work towards creating multilingual embedding models that are diversely robust. If you are interested in contributing or need access to any (not yet) released material, please reach out to andrianos.michail@cl.uzh.ch.
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2025 The Impresso team.
This program is provided as open source under the GNU Affero General Public License v3 or later.
