Docker orchestration solutions all have a feature to keep a certain number of processes running for a given task. That means that if one of the processes crashes, a new one is spun up to maintain the required number of running tasks. Even though this feature is now offered by basically every orchestration provider (think Kubernetes, Swarm, ECS, etc.), I found that the monitoring tools out there still do a poor job of detecting such problems.
Specifically, we had a case where a bad rewrite of a NodeJS function made one app crash very frequently. The orchestration layer was hard at work keeping the number of running processes at the configured level - yet, the "container monitoring" solution we had in place did not report any problem. Quick side note: Kudos to LogEntries alerts - all our apps have a "Restart" tag that marks the lines where a Java or NodeJS process restarts. This tag, like many others, fires an alert if it occurs too frequently.
So, while we fixed the problematic NodeJS code, I decided to create a Docker image that would crash after a specific time, to test out different monitoring solutions and see whether or not they catch containers that crash (non-zero exit code).
How to use:
The PORT env. variable can be used to change the port this app listens on.
The ERROR_TIMEOUT env. variable can be used to change the number of seconds to wait before crashing (defaults to 5 minutes).
While up, the server will answer "Hello World".
Logs are written upon app startup and just before crashing.
After the ERROR_TIMEOUT, the app tries to access a file that doesn't exist, and since there's no error handler, the process crashes completely.
Pushing it a bit further: you could mount the <tbd> file on some but not all of your Docker hosts. The app would then always crash on some hosts and never on others. See if your monitoring solution picks up on that! (e.g.: Nodes A, B and E are unhealthy)