- The default repos (including EPEL) only include the English data package; other languages would need to be installed manually from the tessdata_fast repo
- Tesseract does not support reading from a PDF, hence we'd need to either convert to TIFF or run it through ocrmypdf
- ocrmypdf also includes a watched-folder feature
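
As a rough illustration of the ocrmypdf route, the sketch below builds the command line for one PDF and runs it. It assumes `ocrmypdf` is on the PATH and that the requested traineddata files are already installed; the function names `build_ocrmypdf_command` and `ocr_pdf` are hypothetical helpers, not part of ocrmypdf itself.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_ocrmypdf_command(src: Path, dst: Path, languages: Sequence[str]) -> List[str]:
    """Return the ocrmypdf argument list for a single PDF."""
    if not languages:
        raise ValueError("at least one Tesseract language code is required")
    # Tesseract language codes are joined with '+', e.g. eng+deu.
    return ["ocrmypdf", "-l", "+".join(languages), str(src), str(dst)]


def ocr_pdf(src: Path, dst: Path, languages: Sequence[str]) -> None:
    """Run ocrmypdf on one file; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_ocrmypdf_command(src, dst, languages), check=True)
```

For example, `ocr_pdf(Path("scan.pdf"), Path("scan-ocr.pdf"), ["eng", "deu"])` would ask Tesseract to recognize both English and German in the same document.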
A given folder could be designated as an input folder for a given language, e.g., new-eng, new-deu, new-yid, with corresponding output folders (done-eng, etc.).
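
The folder naming idea could be sketched as a small routing helper that derives the language code and the output path from the input folder name. This is only an assumption about how the layout might work: the `new-`/`done-` prefixes come from the note above, while the function name `route` and the sibling-directory layout are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple

INPUT_PREFIX = "new-"    # e.g. new-eng, new-deu, new-yid
OUTPUT_PREFIX = "done-"  # e.g. done-eng, done-deu, done-yid


def route(pdf: Path) -> Optional[Tuple[str, Path]]:
    """Map a file in new-<lang>/ to (lang, path in the sibling done-<lang>/).

    Returns None when the file is not inside a recognized input folder.
    """
    folder = pdf.parent.name
    if not folder.startswith(INPUT_PREFIX):
        return None
    lang = folder[len(INPUT_PREFIX):]
    if not lang:
        return None
    # Assumption: output folders sit next to the input folders.
    out_dir = pdf.parent.parent / (OUTPUT_PREFIX + lang)
    return lang, out_dir / pdf.name
```

A watcher process could call `route()` on each new file and pass the returned language code to the OCR step.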
One of the challenges for this project is that we have multi-language documents, and Abbyy requires us to preselect which languages to recognize, i.e., it's an all-or-nothing proposition.
Example workflow with a custom language
- Homepage: https://aws.amazon.com/textract/
- code samples: https://github.com/aws-samples/amazon-textract-code-samples
- how would normal end-users drop stuff into S3 Bucket for OCR'ing: https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html
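
To give a feel for the DetectDocumentText call linked above, here is a minimal sketch using boto3. It assumes boto3 is installed, AWS credentials and a region are configured, and the file has already been uploaded to S3; the helper names `detect_text_in_s3` and `extract_lines` are hypothetical. Multi-page documents would likely need Textract's asynchronous operations instead, which are not shown here.

```python
from typing import List


def detect_text_in_s3(bucket: str, key: str) -> dict:
    """Call Textract DetectDocumentText on an object already stored in S3."""
    import boto3  # imported lazily so extract_lines() works without the SDK

    client = boto3.client("textract")
    return client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )


def extract_lines(response: dict) -> List[str]:
    """Collect the text of every LINE block from a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
```

End users would still need some way to get files into the bucket (an upload page, a synced folder, or similar), after which something like `extract_lines(detect_text_in_s3("my-bucket", "scan.png"))` could return the recognized lines; the bucket and key here are placeholders.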