Merlijn Wajer c4372c4c7a version: 1.1.28 4 days ago

archive-hocr-tools
==================

This repository contains a Python package that eases parsing of hOCR files in
a streaming manner. The library is called ``hocr``.
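
As a rough illustration of what streaming parsing means for hOCR, the sketch
below walks a file one page at a time instead of loading the whole document
into memory. It uses only Python's standard library and is not the ``hocr``
package's own API; the helper names ``iter_hocr_pages`` and ``page_words`` are
hypothetical, and the input is assumed to be well-formed XHTML using the
standard ``ocr_page`` and ``ocrx_word`` classes.

```python
import io
import xml.etree.ElementTree as ET


def iter_hocr_pages(source):
    """Yield each ocr_page element as soon as it has been fully parsed.

    Hypothetical helper, not part of the ``hocr`` package.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # hOCR marks page containers with class="ocr_page".
        if elem.get("class") == "ocr_page":
            yield elem
            # Once the caller asks for the next page, drop this page's
            # children and attributes so memory use stays roughly flat.
            elem.clear()


def page_words(page):
    """Return the text of every ocrx_word element on a page."""
    return [
        "".join(word.itertext()).strip()
        for word in page.iter()
        if word.get("class") == "ocrx_word"
    ]


# A tiny two-page hOCR document, made up for this example.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" id="page_1" title="bbox 0 0 100 100">
   <span class="ocrx_word" title="bbox 1 1 10 10">Hello</span>
   <span class="ocrx_word" title="bbox 12 1 20 10">world</span>
  </div>
  <div class="ocr_page" id="page_2" title="bbox 0 0 100 100">
   <span class="ocrx_word" title="bbox 1 1 10 10">Bye</span>
  </div>
 </body>
</html>"""

# Each page is consumed before the generator moves on and clears it.
pages = [(p.get("id"), page_words(p)) for p in iter_hocr_pages(io.BytesIO(SAMPLE))]
print(pages)
```

Each page has to be processed before the next one is requested, because the
generator clears the element as soon as it resumes. That constraint is what
keeps memory use bounded for book-sized files.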

It also contains various tools:

* ``hocr-combine-stream``: Combines many hOCR files into a single large hOCR
  file. Used internally to combine per-page Tesseract results into one hOCR
  file for an entire book.
* ``hocr-fold-chars``: Transforms a per-character hOCR file into a per-word
  hOCR file.