impresso/ocr-robust-multilingual-embeddings

Cheap Character Noise for OCR-Robust Multilingual Embeddings - Datasets, Resources and Adapted Models

ACL 2025, Vienna | License: AGPLv3+


Overview

This repository accompanies our ACL 2025 Findings paper and provides models, noisy datasets, and tools for making multilingual embeddings robust to OCR noise. It contains fine-tuned models, evaluation and training data, and utilities for simulating character-level OCR noise.


Repository Structure

The repository is organized as follows:

├── noisy_evaluation_datasets
│   └── The noised evaluation datasets (CLSD - WMT19/21) we produced.
├── noisy_finetuning_data
│   └── The 10K (per language) noised training samples (TED - X-News) used for fine-tuning the models. Includes both random and realistic OCR noise variants.
├── ocr_simulator
│   └── The ocr_simulator library used to introduce realistic OCR noise into texts.
├── generate_random_character_noise_latin_alphabet
│   └── The script that stochastically generates the character-level noise used to fine-tune our models.
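To make the idea of stochastic character-level noise concrete, here is a minimal sketch in Python. It is not the script shipped in generate_random_character_noise_latin_alphabet. The function name `add_random_char_noise`, the `noise_rate` parameter, the three edit operations, and the choice of lowercase ASCII as the replacement alphabet are all assumptions made for illustration; consult the script itself for the settings used in the paper.

```python
import random
import string


def add_random_char_noise(text, noise_rate=0.1,
                          alphabet=string.ascii_lowercase, seed=None):
    """Corrupt alphabetic characters of `text` with probability `noise_rate`.

    Hypothetical helper for illustration only: each selected character is
    substituted, deleted, or followed by an inserted random character.
    """
    rng = random.Random(seed)  # local RNG so a seed gives reproducible noise
    out = []
    for ch in text:
        # Leave digits, spaces, and punctuation untouched, and skip
        # characters that are not selected for corruption.
        if not ch.isalpha() or rng.random() >= noise_rate:
            out.append(ch)
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))
        # "delete": append nothing
    return "".join(out)
```

Calling `add_random_char_noise("Die Zeitung erschien in Zürich.", noise_rate=0.15, seed=0)` returns a corrupted copy of the sentence; passing the same seed again reproduces the same corruption.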

Models

A version of our OCR-robust models (fine-tuned on TED-X with random noise) is available on Hugging Face:
impresso-project/OCR-robust-gte-multilingual-base
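As a rough usage sketch, the model could be compared on a clean sentence and a noisy copy of it as shown below. This assumes that the checkpoint loads through the sentence-transformers library and that, like other gte-multilingual-base checkpoints, it needs `trust_remote_code=True`; neither is confirmed by this repository, so check the model card first. The helper names `cosine`, `embed`, and `demo`, and the example sentences, are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed(texts, model_name="impresso-project/OCR-robust-gte-multilingual-base"):
    """Embed a list of strings; returns one list of floats per input text."""
    # Imported lazily so `cosine` is usable without the dependency installed.
    # Assumption: the checkpoint is loadable via sentence-transformers.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, trust_remote_code=True)
    return model.encode(texts, normalize_embeddings=True).tolist()


def demo():
    """Print the similarity between a clean sentence and an OCR-like noisy copy."""
    clean = "The newspaper was printed in Zurich."
    noisy = "Tlie newspapcr was prinled in Zurich."  # hand-made noise, illustrative
    clean_vec, noisy_vec = embed([clean, noisy])
    print(cosine(clean_vec, noisy_vec))
```

Running `demo()` downloads the model on first use and prints a single similarity score.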


Datasets

Evaluation Datasets

Noisy variants of the CLSD WMT datasets are available in noisy_evaluation_datasets.

Finetuning Datasets

Noisy versions (random and realistic) of TED and X-News parallel texts are available in noisy_finetuning_data.

Other Datasets

Additional datasets used for evaluation and finetuning are also provided (link).


Reproducing the Experiments

Instructions for reproducing the experiments will be available soon!


Citation

If you use these resources, please cite our paper:

@inproceedings{michail-etal-2025-cheap,
    title = "Cheap Character Noise for {OCR}-Robust Multilingual Embeddings",
    author = "Michail, Andrianos  and
      Opitz, Juri  and
      Wang, Yining  and
      Meister, Robin  and
      Sennrich, Rico  and
      Clematide, Simon",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.609/",
    pages = "11705--11716",
    ISBN = "979-8-89176-256-5"
}

Further Support & Contributing

We plan to keep working towards multilingual embedding models that are robust to a wider range of noise types. If you are interested in contributing or need access to material that has not yet been released, please reach out to andrianos.michail@cl.uzh.ch.

About Impresso

Impresso project

Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.

Copyright

Copyright (C) 2025 The Impresso team.

License

This program is provided as open source under the GNU Affero General Public License v3 or later.


About

This repository provides datasets, adapted models, and starter code for the ACL 2025 paper "Cheap Character Noise for OCR-Robust Multilingual Embeddings." It supports research on multilingual embeddings that are robust to OCR noise. All resources are publicly available and open-source.
