- Name: OCRmyPDF
- Homepage: https://github.com/jbarlow83/OCRmyPDF
- Why should this be included in the repository? We already have tesseract, but it doesn't handle PDF files directly. You must first run ImageMagick to convert the pages to images, OCR each one, and then reassemble the result. OCRmyPDF automates this and adds some features of its own. Please read the project README for details.
- Is it Open Source: yes
- Who and how many users do you anticipate will use this software? I don't know how many users, but students and collectors of old books (for which no digital copies exist) will certainly benefit from it.
- Link to source tarball/zip file: https://github.com/jbarlow83/OCRmyPDF/archive/v7.0.5.tar.gz
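
To illustrate the "Why" item above, here is a rough sketch of the manual route next to the OCRmyPDF one. It only builds the command lines and does not run anything. The helper names, the `-ocr.pdf` output naming, the 300 DPI default, and the use of `pdfunite` (from poppler) to merge the pages are my own assumptions for illustration; the README does not prescribe any of them.

```python
from pathlib import Path


def manual_ocr_commands(pdf, pages, dpi=300):
    """Build the commands for the manual route: rasterize, OCR per page, merge.

    Hypothetical helper: the file naming and the pdfunite merge step are
    assumptions, not something taken from the OCRmyPDF documentation.
    """
    stem = Path(pdf).stem
    # ImageMagick renders each PDF page to a numbered PNG (numbering starts at 0).
    cmds = [["convert", "-density", str(dpi), pdf, f"{stem}-%03d.png"]]
    page_pdfs = []
    for i in range(pages):
        base = f"{stem}-{i:03d}"
        # tesseract <image> <output base> pdf  ->  writes <output base>.pdf
        cmds.append(["tesseract", f"{base}.png", base, "pdf"])
        page_pdfs.append(f"{base}.pdf")
    # Merge the per-page PDFs back into one searchable document.
    cmds.append(["pdfunite", *page_pdfs, f"{stem}-ocr.pdf"])
    return cmds


def ocrmypdf_command(pdf):
    """Build the single OCRmyPDF command that replaces the steps above."""
    stem = Path(pdf).stem
    return ["ocrmypdf", pdf, f"{stem}-ocr.pdf"]
```

For a three-page `book.pdf`, the manual route comes to five commands (one conversion, three OCR runs, one merge), while `ocrmypdf_command("book.pdf")` is a single call. The commands could be passed to `subprocess.run` if the tools are installed.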
Apologies for bringing this up again, but I would like to point out two things:
It requires ruffus as rundep, which looks dead.
As of v9.0.0, this dependency has been removed (see the release notes).
Why should this be included in the repository? We already have tesseract, but it doesn't handle PDF files directly. You must first run ImageMagick to convert the pages, and so on. OCRmyPDF automates this and adds some features of its own.
All of OCRmyPDF's required dependencies (and 1 out of 2 *optional* dependencies) are already available in the repositories. It's an actively developed program, small in size, that produces great results quickly. The only somewhat practical open-source alternative I've come across is OCRFeeder, a GUI application that comes as a flatpak. But it delivers far worse results, it's slow, and the flatpak hasn't seen any updates in almost 5 years. In addition, it requires a ~2.5 GB download and 5 GB of disk space once installed.
So I'd argue that OCRmyPDF is the better solution and much more in line with what I understand to be Solus' goals!
If ruffus was the reason for not including it, that obstacle is gone now.
If general maintenance of the package is the problem, I completely understand. But if that's the reason for not including it, please clarify.
Thank you for your time!