uniwuezpd/larex
A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
LAREX is a semi-automatic open-source tool for layout analysis on early printed books. It uses a rule-based connected-components approach that is very fast, easily comprehensible for the user, and allows intuitive manual correction if necessary. The PAGE XML format is used to support integration into existing OCR workflows. Evaluations showed that LAREX provides an efficient and flexible way to segment pages of early printed books.
Please feel free to visit the tool homepage. A short user manual is available here.
For additional information about developing for LAREX, see here.
This guide uses Docker and allows a platform-agnostic installation of LAREX.
Production
Development
development/build.sh: run cd development and sh build.sh
This guide uses Tomcat 8, Java 8 and Ubuntu (please adjust accordingly for your setup)
apt-get install tomcat8 maven openjdk-8-jdk
git clone https://github.com/OCR4all/LAREX.git
mvn clean install -f LAREX/pom.xml
Copy the created war file to Tomcat:
cp LAREX/target/Larex.war /var/lib/tomcat8/webapps/Larex.war
or link it instead:
sudo ln -s $PWD/LAREX/target/Larex.war /var/lib/tomcat8/webapps/Larex.war
systemctl start tomcat8
(if Tomcat is already running: systemctl restart tomcat8)
(to start Tomcat automatically at system boot: systemctl enable tomcat8)
This guide uses Eclipse to simplify the setup on Windows.
Window -> Show View -> Other... -> Server -> Servers
Apache -> Tomcat <version> Server -> Next -> set your Tomcat installation directory -> Finish
File -> Import -> Git -> Projects from Git -> Clone URI -> Set URI: https://github.com/OCR4all/LAREX.git -> [✓] master -> Next > -> Next > -> Import as general project -> Finish
Larex -> Configure -> Convert to Maven Project -> Finish
Larex -> Maven -> Update Project... -> OK
Larex -> Run As -> Run on Server
Note: LAREX is mainly developed on Linux, so the macOS build instructions may be outdated from time to time. If this is the case, feel free to contact us.
This guide uses Homebrew (please adjust accordingly for your setup).
brew update
brew cask install adoptopenjdk8
brew install tomcat git maven
brew services list
(tomcat should be listed in the output of this command)
git clone https://github.com/OCR4all/LAREX.git
mvn clean install -f LAREX/pom.xml
Copy the created war file to Tomcat:
cp LAREX/target/Larex.war /usr/local/Cellar/tomcat/[version]/libexec/webapps/Larex.war
or link it instead:
ln -s $PWD/LAREX/target/Larex.war /usr/local/Cellar/tomcat/[version]/libexec/webapps/Larex.war
brew services start tomcat
(if Tomcat is already running: brew services restart tomcat)
Go to localhost:8080/Larex.
You can add your own books by copying them to src/webapp/resources/books (or to an alternative directory set in the config file; see the section Configuration for more information).
Book directories must have the following structure:
bookDir/
├── <book_name>/
│   ├── <page_name>.png
│   └── <page_name>.xml
└── <book2_name>/
    └── …
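The layout above can be checked before starting LAREX. The following is a minimal sketch, not part of LAREX: the function name `list_books` is hypothetical, and it assumes that pages are PNG images and that the XML file of a page shares the file name of its image.

```python
from pathlib import Path

# Assumption: only the PNG pages shown in the tree above are considered
IMAGE_SUFFIXES = {".png"}


def list_books(book_dir):
    """Map each book folder to its pages and whether a matching XML file exists."""
    books = {}
    for book in sorted(Path(book_dir).iterdir()):
        if not book.is_dir():
            continue  # loose files next to the book folders are skipped
        pages = {}
        for image in sorted(book.iterdir()):
            if image.suffix.lower() in IMAGE_SUFFIXES:
                # The page name is the image stem; its XML shares that stem
                pages[image.stem] = image.with_suffix(".xml").is_file()
        books[book.name] = pages
    return books
```

Calling `list_books("/home/user/books")` returns a dictionary such as `{"book1": {"0001": True, "0002": False}}`, where `False` marks a page image without an XML file next to it.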
Detailed information about the usage of LAREX can be found in the OCR4all getting started guides.
See sections and chapters about Segmentation, Ground Truth Correction and Post Correction.
LAREX contains a configuration file (src/webapp/WEB-INF/larex.properties) with a few settings that can be set before running the application.
The setting bookpath sets the file path of the books folder.
e.g. bookpath=/home/user/books (Linux)
e.g. bookpath=C:\Users\user\Documents\books (Windows)
LAREX will load the books from this folder.
[default /src/main/webapp/resources/books]
The setting localsave tells the application how to handle results locally when saved.
Please note: To work properly in local mode, the Page@imageFilename attribute must match the actual filename (apart from the extension). This label will be used for local storage.
<mode>=[bookpath|savedir|none]
bookpath: save the result in the bookpath
savedir: save the result in a defined savedir
none: do not save the result locally [default]
e.g. localsave:bookpath
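The Page@imageFilename requirement for local saving can be verified up front. The sketch below is an illustration only, not LAREX code: the helper name `image_filename_matches` is hypothetical, and it assumes that the XML file carries the same page name as its image, as in the book directory structure. The XML namespace is ignored so that different PAGE schema versions are accepted.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def image_filename_matches(xml_path):
    """Return True if Page@imageFilename equals the XML file name, ignoring extensions."""
    root = ET.parse(xml_path).getroot()
    for element in root.iter():
        # Strip the XML namespace so that any PAGE schema version is accepted
        if element.tag.rsplit("}", 1)[-1] == "Page":
            image_name = element.get("imageFilename", "")
            return Path(image_name).stem == Path(xml_path).stem
    return False  # no Page element found
```

For a file 0001.xml, the check passes with imageFilename="0001.png" and fails with imageFilename="scan_17.png".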
The setting savedir is needed if localsave mode is set to "savedir".
e.g. savedir=/home/user/save (Linux)
e.g. savedir=C:\Users\user\Documents\save (Windows)
The setting websave tells the application how to handle results on the browser side when saved.
<value>=[true|false]
true: download the result after saving [default]
false: no action after saving
e.g. websave=true
The setting modes sets the accessible modes in the LAREX GUI.
<value>=[[segment][edit][lines][text]]
A combination of the modes "segment", "edit", "lines" and "text" can be set as a space-separated string.
e.g. modes=segment lines
The order of those modes in the string also determines which mode is opened on startup, with the first in the list being opened as main mode. The mode "segment" can be replaced with "edit" in order to hide all auto segmentation features. ("edit" will be ignored if both are present)
[Default] modes=segment lines text
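The rules for the modes string can be written down as a small function. This is a sketch of the behavior described above, not the LAREX implementation: `resolve_modes` is a hypothetical name, and dropping unknown or repeated entries as well as falling back to the default for an empty string are assumptions.

```python
KNOWN_MODES = ("segment", "edit", "lines", "text")
DEFAULT_MODES = "segment lines text"


def resolve_modes(value=DEFAULT_MODES):
    """Return (main_mode, modes) for a space-separated modes string."""
    modes = []
    for mode in value.split():
        # Assumption: unknown or repeated entries are dropped
        if mode in KNOWN_MODES and mode not in modes:
            modes.append(mode)
    if "segment" in modes and "edit" in modes:
        modes.remove("edit")  # "edit" is ignored if both are present
    if not modes:
        modes = DEFAULT_MODES.split()  # assumption: fall back to the default
    # The first mode in the list is the one opened on startup
    return modes[0], modes
```

With `modes=lines edit segment`, the function returns "lines" as the main mode and keeps "lines" and "segment" as the accessible modes.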
The setting directrequest enables or disables the direct open feature.
<value>=[enable|disable]
This feature allows users to load a book from anywhere on the server's drive as well as to alter the options websave, localsave and savedir.
enable: enable direct request
disable: disable direct request [default]
e.g. directrequest=enable
This feature should be used with caution but is very useful when using LAREX in a workflow with other web applications (e.g. in Docker).
The easiest direct request would be via an HTML form with the values bookpath, bookname, websave (optional), localsave (optional) and savedir (optional).
<form action="http://localhost:8080/Larex/direct" method="POST">
bookpath: <input type="text" name="bookpath"/><br>
bookname: <input type="text" name="bookname"/><br>
websave: <input type="text" name="websave"/><br>
localsave: <input type="text" name="localsave"/><br>
savedir: <input type="text" name="savedir"/><br>
modes: <input type="text" name="modes"/><br>
<input type="submit"/>
</form>
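When another application triggers the direct request, the same POST can be built without a browser. The following sketch mirrors the form fields above using only the Python standard library. The helper name `build_direct_request` is hypothetical, and leaving out optional fields that are not given (instead of sending them empty, as a browser form would) is an assumption about what the endpoint accepts.

```python
from urllib.parse import urlencode
from urllib.request import Request

DIRECT_URL = "http://localhost:8080/Larex/direct"  # URL taken from the form above


def build_direct_request(bookpath, bookname, url=DIRECT_URL, **optional):
    """Build the POST request that the HTML form would send."""
    fields = {"bookpath": bookpath, "bookname": bookname}
    # Optional fields (websave, localsave, savedir, modes) are sent only if given
    fields.update({k: v for k, v in optional.items() if v is not None})
    data = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

The request can then be sent with `urllib.request.urlopen(build_direct_request("/home/user/books", "book1", localsave="bookpath"))`, provided LAREX is running and directrequest=enable is set.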
The setting ocr4all enables or disables the OCR4all UI mode.
<value>=[enable|disable]
This setting allows displaying and/or hiding certain UI elements when LAREX is used in combination with OCR4all.
enable: enable OCR4all UI mode
disable: disable OCR4all UI mode [default]
e.g. ocr4all=enable
If you are using LAREX please cite:
Reul, C., Springmann, U., Puppe, F.: Larex: A semi-automatic open-source tool for layout analysis and region extraction on early printed books. In: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, pp. 137-142 (2017)
@inproceedings{reul2017larex,
title={Larex: A semi-automatic open-source tool for layout analysis and region extraction on early printed books},
author={Reul, Christian and Springmann, Uwe and Puppe, Frank},
booktitle={Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage},
pages={137--142},
year={2017}
}
docker pull uniwuezpd/larex