Public | Automated Build

Last pushed: never
Short Description
Snapshot version of the cloudy docapp version
Full Description

DocApp - A Dockerized Document Parser

What is all this?

  • A dockerized document parser app, turning your scanned (image based) PDF files into searchable documents
    • Special Thanks to jbarlow83 for sharing his tesseract implementation
  • A simple search interface for searching through your documents
  • Platform-independent through the use of Docker
    • Yes, it even works on Windows
  • Designed as an application to run on your own machine
    • Because you wouldn't want to entrust personal or confidential documents to an external service, would you?
  • Designed to be easily portable
    • You only have to worry about a single directory. No complicated backup strategy required!

Great, can I have some?

Sure, just follow these steps:

  • Install Docker
    • Under Windows / OSX make sure to set your DOCKER_HOST environment variable
  • Pull the docker image from the Docker Hub: docker pull salgmachine/docparser
  • Run the application: docker run -d -p 8080:8080 -v /path/to/pdfs:/path/to/pdfs salgmachine/docparser --docapp.language=deu

A short explainer:

  • docker run -d: This will run the image "in the background", kinda like a daemon
  • -p 8080:8080: This will provide a port mapping between the running docker container and your system
  • -v /path/to/pdfs:/path/to/pdfs: This will provide the docker container with what is called "a writable volume"
    • This argument tells the application which directory to use as its working directory
  • --docapp.language=deu: This argument tells the application which language to expect when parsing the documents
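
The pieces explained above can be put together in a small shell sketch. The function name docapp_run_cmd and its parameters are illustrative assumptions, not part of DocApp. The function only prints the command, so you can review it before running it yourself.

```shell
#!/bin/sh
# Hypothetical helper (not part of DocApp): prints the docker run command
# described above.
#   $1 = host directory that holds your PDFs (mounted at the same path)
#   $2 = language code, defaults to deu
docapp_run_cmd() {
    workdir="$1"
    lang="${2:-deu}"
    printf 'docker run -d -p 8080:8080 -v %s:%s salgmachine/docparser --docapp.language=%s\n' \
        "$workdir" "$workdir" "$lang"
}

# Print the command for review
docapp_run_cmd /path/to/pdfs deu
```

Keeping the host path and the container path identical mirrors the command shown earlier; only the directory and the language change between setups.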

Further information:

The application creates a directory called "incoming" inside this working directory. That's where you'll have to drop your scanned PDF files.
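
Dropping files into "incoming" can be scripted. The following is a minimal sketch, assuming your scans sit in a separate folder on the host; the function name docapp_drop and both paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of DocApp): copies every PDF from a scan
# folder into the "incoming" directory inside the working directory.
#   $1 = folder with scanned PDFs
#   $2 = working directory that was mounted with -v
docapp_drop() {
    src="$1"
    incoming="$2/incoming"
    if [ ! -d "$incoming" ]; then
        echo "no incoming directory at $incoming (has the application been started?)" >&2
        return 1
    fi
    for pdf in "$src"/*.pdf; do
        [ -e "$pdf" ] || continue   # the glob matched nothing
        cp "$pdf" "$incoming/"
    done
}

# Example: docapp_drop "$HOME/scans" /path/to/pdfs
```

The helper refuses to create "incoming" itself, since the text above says the application creates that directory.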

It is mandatory that you use the correct language, as this ensures the best possible results when transforming your images! For a list of available languages have a look here. By default, the application is installed with the following languages: English, Spanish, and German.

Additional languages must be installed manually into your docker container like so: apt-get install tesseract-ocr-{lang} (don't forget to update the command-line argument accordingly when you do)
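
As a rough sketch, the installation could be run from the host with docker exec. This assumes the container runs as root and has network access. The container name "docapp" and the function name are hypothetical, and "fra" (French) is just an example language code.

```shell
#!/bin/sh
# Sketch: install an additional tesseract language inside a running container.
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the commands
# instead of running them.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper (not part of DocApp).
#   $1 = container name or ID as shown by "docker ps"
#   $2 = language code, e.g. fra
docapp_add_language() {
    container="$1"
    lang="$2"
    "$DOCKER" exec "$container" apt-get update &&
    "$DOCKER" exec "$container" apt-get install -y "tesseract-ocr-$lang"
}

# Example: docapp_add_language docapp fra
```

The language argument is fixed when a container is created, so changing it means starting a new container. One option is to save the modified container as an image with docker commit and run that image with the updated --docapp.language value.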
