Cheap Character Noise for OCR-Robust Multilingual Embeddings - Datasets, Resources and Adapted Models
This repository accompanies our ACL 2025 Findings paper, providing models, noisy datasets, and tools for robust multilingual embeddings under OCR noise. You’ll find fine-tuned models, evaluation and training data, and utilities for simulating character-level OCR noise.
- Overview
- Repository Structure
- Models
- Datasets
- Reproducing the Experiments
- Citation
- Further Support & Contributing
- About Impresso
- License
The repository is organized as follows:
├── noisy_evaluation_datasets
│ └── The noised evaluation datasets (CLSD - WMT19/21) produced for our experiments.
├── noisy_finetuning_data
│ └── The 10K (per language) noised training samples (TED - X-News) used for fine-tuning the models. Includes both random and realistic OCR noise variants.
├── ocr_simulator
│ └── The ocr_simulator library used to induce realistic OCR noise into texts.
├── generate_random_character_noise_latin_alphabet
│ └── The script used to stochastically generate the character-level noise used to fine-tune our models.
A version of our OCR-robust models (fine-tuned on TED-X with random noise) is available on Hugging Face:
impresso-project/OCR-robust-gte-multilingual-base
Noisy variants of the CLSD WMT datasets are available in noisy_evaluation_datasets.
Noisy versions (random and realistic) of TED and X-News parallel texts are available in noisy_finetuning_data.
Additional datasets used for evaluation and fine-tuning are also provided (link).
Instructions for reproducing the experiments will be available soon!
If you use these resources, please cite our paper:
@inproceedings{michail-etal-2025-cheap,
title = "Cheap Character Noise for {OCR}-Robust Multilingual Embeddings",
author = "Michail, Andrianos and
Opitz, Juri and
Wang, Yining and
Meister, Robin and
Sennrich, Rico and
Clematide, Simon",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.609/",
pages = "11705--11716",
ISBN = "979-8-89176-256-5"
}

In the future, we will work towards creating multilingual embedding models that are diversely robust. If you are interested in contributing or need access to any (not yet) released material, please reach out to andrianos.michail@cl.uzh.ch.
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2025 The Impresso team.
This program is provided as open source under the GNU Affero General Public License v3 or later.
