overview/overview-convert-framework
Base image for Overview converters
A converter's job is to turn files of one type into files of another type. It does this in a loop. It receives jobs from an internal Overview HTTP server.
This base image provides portable executables that communicate with Overview. They make up a framework: they'll call your converter program, which you can write in any language.
Your converter will have a Dockerfile that looks like this:
FROM overview/overview-convert-framework AS framework
# multi-stage build
FROM alpine:3.7 AS build
... (build your executables, including `do-convert-single-file`)
FROM alpine:3.7 AS production
# Add ca-certificates to let container download from S3 https:// URLs
RUN apk add --update --no-cache ca-certificates
WORKDIR /app
# The framework provides the main executable
COPY --from=framework /app/run /app/run
# Your `do-convert` code can choose from a few different input and output
# formats. The framework provides many `/app/convert` implementations: pick
# the one that matches your `do-convert`.
COPY --from=framework /app/convert-single-file /app/convert
COPY --from=build /app/do-convert-single-file /app/do-convert-single-file
/app/run
This framework runs on a loop:
1. Poll Overview for a task.
2. Stream the task's input to /app/convert MIME-BOUNDARY JSON and pipe the results to Overview.

/app/run handles all communication with Overview. In particular:
- /app/run polls for tasks at POLL_URL. Overview's administrator must set POLL_URL for your container.
- /app/run will retry if there is a connection error.
- /app/run will never crash.
- /app/run will poll Overview to check if the task is canceled. It will notify /app/convert with SIGINT if the task is canceled.

/app/convert -- a.k.a., /app/convert-*
/app/convert is a program we provide, under a few different names. That is, when you create your program you'll choose one of the following implementations to copy into /app/convert in your image.
From /app/run's point of view, /app/convert will read the input stream and JSON command-line argument and produce a multipart/form-data output stream with MIME boundary MIME-BOUNDARY (in C lingo, argv[1]).
/app/convert will never crash, and it will always output a data stream that Overview can handle.
Your code is invoked by /app/convert, following one of these strategies:
/app/convert-single-file
This version of /app/convert will:
1. Write input.blob to a temporary directory and verify it's the correct size
2. Run /app/do-convert-single-file JSON (your code) in the temporary directory
3. Translate stdout from your code into progress events or an error event
4. If your code exits with status code 0 and no error message, pipe output.json, output.blob -- and if they exist, output-thumbnail.jpg, output-thumbnail.png and output.txt -- and a done event

Special cases:
- If /app/run sends a SIGINT signal, /app/convert sends your program SIGINT. Your program should kill and wait for any child processes, then exit. Its standard output and standard error will be ignored.
- If /app/do-convert-single-file exits with a non-zero return value, /app/convert pipes an error event.

You must provide /app/do-convert-single-file. The framework will invoke /app/do-convert-single-file JSON. Your program can read input.blob in the current working directory. Your program must:
- Write progress messages to stdout, newline-delimited, that look like:
  - p1/2 -- "finished processing page 1 of 2"
  - b102/412 -- "finished processing byte 102 of 412"
  - 0.324 -- "finished processing 32.4% of input"
  - anything else at all -- "ERROR: [the line of text]"
- Write output.json, output.blob, and optionally output-thumbnail.jpg, output-thumbnail.png and/or output.txt.
- Exit with status code 0. Any other exit code is an error in your code.

/app/test-convert-single-file
You can test /app/do-convert-single-file by creating a Docker image with the special framework program, /app/test-convert-single-file. This is designed to integrate with automated build environments like Docker Hub.
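As a point of reference, a program small enough to be tested this way might look like the sketch below. It is an illustration in Python, not part of the framework: it copies input.blob to the output blob unchanged, echoes the task JSON as the output JSON (a placeholder, since the JSON schema is not described here), and reports byte progress in the bNNN/TOTAL form. OUTPUT_PREFIX is an assumption: the test directories in this section expect 0.json and 0.blob, while the description of do-convert-single-file names output.json and output.blob, so set it to whatever your framework version expects.

```python
import json
import sys
from pathlib import Path

# Assumption: "0" matches the expected files in the test directories
# (0.json, 0.blob). Change to "output" if your framework version wants
# output.json / output.blob instead.
OUTPUT_PREFIX = "0"


def convert(workdir: Path, task: dict) -> list:
    """Copy input.blob to the output blob; return the progress lines to print."""
    data = (workdir / "input.blob").read_bytes()
    (workdir / f"{OUTPUT_PREFIX}.blob").write_bytes(data)
    # Placeholder metadata: echoing the task JSON is not a documented schema.
    (workdir / f"{OUTPUT_PREFIX}.json").write_text(json.dumps(task))
    # One byte-progress line: "finished processing byte N of N".
    return [f"b{len(data)}/{len(data)}"]


if __name__ == "__main__" and len(sys.argv) == 2:
    # The framework passes the task JSON as the only argument and runs us
    # in the directory that holds input.blob.
    for line in convert(Path.cwd(), json.loads(sys.argv[1])):
        print(line, flush=True)
    sys.exit(0)  # any other exit code is reported as an error
```

Anything the program prints that is not a progress line is treated as an error message, so debugging output belongs on standard error.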
Your Docker build stage doesn't need a CMD. It should include:
- /app/test-convert-single-file -- and you should RUN [ "/app/test-convert-single-file" ]
- /app/do-convert-single-file and everything it depends on -- /app/test-convert-single-file will invoke it once per test
- /app/test/test-*: one directory per test, e.g. /app/test/test-with-ocr.
Each test directory should contain:
- input.blob
- input.json -- the JSON passed to do-convert-single-file
- stdout -- expected standard output from do-convert-single-file
- 0.blob -- expected 0.blob output
- 0.json -- expected 0.json output
- 0.txt (optional) -- expected 0.txt output
- 0-thumbnail.{png,jpg} (optional) -- expected output

test-convert-single-file will run do-convert-single-file in a separate directory per test. It will output in TAP format and exit with status code 1 if any test fails.
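A test directory with the required files listed above could be scaffolded with a small helper like this sketch. The function name, the test name and the JSON fields are all hypothetical, and whether the framework compares JSON byte for byte is not stated here, so treat the generated files as a starting point to edit by hand.

```python
import json
import tempfile
from pathlib import Path


def scaffold_test_dir(root: Path, name: str, input_bytes: bytes,
                      input_json: dict, expected_stdout: str,
                      expected_blob: bytes, expected_json: dict) -> Path:
    """Create root/test-NAME holding the five required files."""
    test_dir = root / f"test-{name}"
    test_dir.mkdir(parents=True, exist_ok=True)
    (test_dir / "input.blob").write_bytes(input_bytes)
    (test_dir / "input.json").write_text(json.dumps(input_json))
    (test_dir / "stdout").write_text(expected_stdout)
    (test_dir / "0.blob").write_bytes(expected_blob)
    (test_dir / "0.json").write_text(json.dumps(expected_json))
    return test_dir


if __name__ == "__main__":
    # In an image the root would be /app/test; a temporary directory keeps
    # this sketch runnable anywhere.
    root = Path(tempfile.mkdtemp())
    created = scaffold_test_dir(
        root, "plain-text",
        input_bytes=b"hello",
        input_json={"filename": "hello.txt"},     # hypothetical fields
        expected_stdout="p1/1\n",                 # one progress line
        expected_blob=b"hello",
        expected_json={"filename": "hello.txt"},  # hypothetical fields
    )
    print(created)
```

The optional files (0.txt and 0-thumbnail.png or 0-thumbnail.jpg) are left out on purpose: add them only when your converter is expected to write them.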
Copying failed-test files from the test suite
The test output is designed to help you correct your tests. For instance, here is example output from a test that fails because do-convert-single-file wrote a 0-thumbnail.jpg that the test directory does not list as expected output:
Step 12/13 : RUN [ "/app/test-convert-single-file" ]
---> Running in f65521f3a30c
1..3
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
not ok 1 - test-jpg-ocr
do-convert-single-file wrote /tmp/test-do-convert-single-file912093989/0-thumbnail.jpg, but we expected it not to exist
...
Upon seeing this error, you can
docker cp f65521f3a30c:/tmp/test-do-convert-single-file912093989/0-thumbnail.jpg .
to inspect the file in question (and perhaps make it the expected one).
Testing PDF conversion
PDF output is a common case. We use QPDF for file comparison, to ease debugging.
Your Dockerfile must install QPDF -- e.g., apk --no-cache add qpdf -- before the RUN [ "/app/test-convert-single-file" ] step if you are testing PDF output.
/app/convert-stream-to-mime-multipart
This version of /app/convert will:
- Run /app/do-convert-stream-to-mime-multipart MIME-BOUNDARY JSON (your code) within the temporary directory
- [...] stdin and pipe your program's stdout to Overview

Special cases:
- If /app/run sends a SIGINT signal, /app/convert sends your program SIGINT. Your program should kill and wait for any child processes, then exit. Its standard output and standard error will be ignored.
- If your program exits with a non-zero return value, /app/convert pipes an error event.
- /app/convert pipes an error event if your program does not produce an error or done event or end with --MIME-BOUNDARY--.

You must provide /app/do-convert-stream-to-mime-multipart. The framework will invoke it with MIME-BOUNDARY and JSON as arguments. MIME-BOUNDARY will match the regex [a-fA-F0-9]{1,60}. Your program can read input.blob in the current directory.
Your program must write valid multipart/form-data output to stdout. For instance:
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="0.json"\r\n
\r\n
{JSON for first output file}\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="0.blob"\r\n
\r\n
Blob for first output file\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="progress"\r\n
\r\n
{"pages":{"nProcessed":1,"nTotal":3}}\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="done"\r\n
\r\n
--MIME-BOUNDARY--
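A program that emits a stream shaped like the example above could be sketched as follows. This is an illustration rather than code from the framework: the helper names are made up, the single progress element is arbitrary, and echoing the task JSON as 0.json is only a placeholder because the JSON schema is not described here.

```python
import json
import sys

CRLF = b"\r\n"


def element(boundary: str, name: str, body: bytes = b"") -> bytes:
    """One element: boundary line, Content-Disposition header, blank line, body.

    As in the example above, a non-empty body is followed by CRLF and an
    empty body (the done element) is followed by nothing.
    """
    head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
    return head.encode("ascii") + (body + CRLF if body else b"")


def build_output(boundary: str, blob: bytes, metadata: dict) -> bytes:
    """Assemble 0.json, 0.blob, one progress element, done, and the closing delimiter."""
    progress = {"pages": {"nProcessed": 1, "nTotal": 1}}
    return b"".join([
        element(boundary, "0.json", json.dumps(metadata).encode("utf-8")),
        element(boundary, "0.blob", blob),
        element(boundary, "progress",
                json.dumps(progress, separators=(",", ":")).encode("utf-8")),
        element(boundary, "done"),
        f"--{boundary}--".encode("ascii"),
    ])


if __name__ == "__main__" and len(sys.argv) == 3:
    # argv[1] is MIME-BOUNDARY, argv[2] is the task JSON.
    boundary, task = sys.argv[1], json.loads(sys.argv[2])
    with open("input.blob", "rb") as f:
        blob = f.read()
    # Placeholder: the task JSON is echoed back as the output metadata.
    sys.stdout.buffer.write(build_output(boundary, blob, task))
    sys.stdout.buffer.flush()
```

Writing through sys.stdout.buffer keeps the output as raw bytes, so blobs and the \r\n separators are not altered by text-mode newline handling.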
Rules:
- Your output must end with a done or error element. A done element should be empty; an error element must include an error message.
- Elements must appear in this order: 0.json, 0.blob, (optionally 0.png, 0.jpg and/or 0.txt), 1.json, 1.blob, ..., done.
- Emit progress elements alongside each N.json to help Overview's progress bar behave well.

Even more lightweight than /app/convert-stream-to-mime-multipart is to roll your own version of /app/convert. Beware, though:
- /app/convert must always output messages to Overview: especially a done or error event. Without those events, Overview will never finish processing the file: it will retry indefinitely.
- /app/convert must always exit successfully. The trickiest case, in our experience, is handling "out of memory." If your /app/convert does not exit successfully, Overview will retry indefinitely and the file will never be processed.
- /app/convert should output helpful error messages, so you can debug it easily.
- /app/convert should end quickly after receiving SIGINT, because Overview will ignore all further output.
- /app/convert must ensure temporary files created during one invocation aren't read by the next invocation: that would leak users' documents to other users.

/app/convert-stream-to-mime-multipart is small and fast, and it solves these problems for you. You probably want it.
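If you do roll your own, the first two points in the list above (always end with a done or error event, always exit successfully) might be approached along these lines. This is a rough sketch under simplifying assumptions, not how the framework's own /app/convert works: /app/my-converter is a hypothetical program, its whole output is buffered in memory instead of streamed, and cancellation and temporary-file cleanup are not handled.

```python
import subprocess
import sys


def error_stream(boundary: str, message: str) -> bytes:
    """A complete multipart/form-data stream holding a single error element."""
    return (
        f'--{boundary}\r\nContent-Disposition: form-data; name="error"\r\n\r\n'
        f"{message}\r\n--{boundary}--"
    ).encode("utf-8")


def guarded_output(cmd: list, boundary: str) -> bytes:
    """Run cmd and return its output, or an error stream if anything went wrong."""
    closing = f"--{boundary}--".encode("ascii")
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as err:
        return error_stream(boundary, f"could not start converter: {err}")
    if result.returncode != 0:
        return error_stream(boundary, f"converter exited with status {result.returncode}")
    if not result.stdout.rstrip().endswith(closing):
        return error_stream(boundary, "converter output did not end with the closing boundary")
    return result.stdout


if __name__ == "__main__" and len(sys.argv) == 3:
    # argv[1] is MIME-BOUNDARY, argv[2] is the task JSON.
    out = guarded_output(["/app/my-converter", sys.argv[1], sys.argv[2]], sys.argv[1])
    sys.stdout.buffer.write(out)
    sys.stdout.buffer.flush()
    sys.exit(0)  # exit successfully even when the converter failed
```

The check on the closing boundary only catches truncated output; it does not prove that a done or error element is present, which a real implementation would also need to verify.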
./dev will start a development loop that runs tests. Restart it if you edit Dockerfile.
docker build . will run all tests.
Tests are in ./test/*/suite.bats. They're run in bats, an ideal framework for testing programs that pipe data around.
./release MAJOR.MINOR.PATCH will push to GitHub. Docker Hub will build the images for mass consumption.
This software is Copyright 2011-2018 Jonathan Stray and Copyright 2019-2020 Overview Computing Inc., and distributed under the terms of the GNU Affero General Public License. See the LICENSE file for details.