UMNLibraries/ocr-it

OCR test for Digitization folks

Tesseract/ocrmypdf on RHEL9

  • The default repos (including EPEL) only include the English data package; other languages would need to be installed manually from the tessdata_fast repo
  • Tesseract does not support reading from a PDF, so we would need to either convert the PDF to TIFF or run it through ocrmypdf
  • ocrmypdf also includes watched-folder support
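
As a rough sketch of the two manual steps above, the following Python helpers build the download URL for one tessdata_fast language file and the matching ocrmypdf command line. The raw URL pattern and the `main` branch name are assumptions about the upstream tessdata_fast repository, and the function names are hypothetical; nothing here is taken from this repo.

```python
import shlex

# Assumption: upstream tessdata_fast repository layout and branch name.
TESSDATA_FAST_RAW = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"


def traineddata_url(lang: str) -> str:
    """Return the download URL for one language model, e.g. 'deu'."""
    return f"{TESSDATA_FAST_RAW}/{lang}.traineddata"


def ocrmypdf_command(lang: str, src: str, dst: str) -> list[str]:
    """Build an ocrmypdf invocation: ocrmypdf -l LANG input.pdf output.pdf."""
    return ["ocrmypdf", "-l", lang, src, dst]


if __name__ == "__main__":
    # Print what would be downloaded and run; nothing is executed here.
    print(traineddata_url("deu"))
    print(shlex.join(ocrmypdf_command("deu", "scan.pdf", "scan-ocr.pdf")))
```

The downloaded `.traineddata` file would then go into Tesseract's tessdata directory, whose location depends on the install; `tesseract --list-langs` shows which languages Tesseract can currently see.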

Tesseract+ocrmypdf Examples

A folder could be designated as the input folder for a given language, e.g., new-eng, new-deu, new-yid, with corresponding output folders (done-eng, etc.).
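
One way this folder convention might be wired up is sketched below: the language code is read from the `new-` folder name and each PDF is paired with a path in the matching `done-` folder. The `plan_jobs` helper and the prefix constants are hypothetical names for illustration, and the sketch only plans the work; it does not call ocrmypdf itself.

```python
from pathlib import Path

IN_PREFIX = "new-"    # input folders: new-eng, new-deu, new-yid
OUT_PREFIX = "done-"  # output folders: done-eng, done-deu, done-yid


def plan_jobs(root: Path) -> list[tuple[str, Path, Path]]:
    """Return (language, input_pdf, output_pdf) for every PDF in a new-* folder."""
    jobs = []
    for folder in sorted(root.glob(f"{IN_PREFIX}*")):
        if not folder.is_dir():
            continue
        # The language code is whatever follows the "new-" prefix.
        lang = folder.name[len(IN_PREFIX):]
        out_dir = root / f"{OUT_PREFIX}{lang}"
        for pdf in sorted(folder.glob("*.pdf")):
            jobs.append((lang, pdf, out_dir / pdf.name))
    return jobs


if __name__ == "__main__":
    # Print the ocrmypdf command each waiting PDF would need.
    for lang, src, dst in plan_jobs(Path(".")):
        print("ocrmypdf", "-l", lang, src, dst)
```

Keeping the language in the folder name means no per-file configuration is needed; the trade-off is that each document has to be sorted into a folder by hand before processing.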

One of the challenges for this project is that we have multi-language documents, and Abbyy requires us to preselect which languages to recognize, i.e., it's an all-or-nothing proposition.

Example workflow with a custom language

Textract

Azure
