Public | Automated Build

Last pushed: a year ago
Short Description
A docker image of Armon's hlld server.
Full Description

Docker Image

A Docker image for Armon's hlld server.

Getting Started

docker run -d -p 4553:4553 speyside/hlld:v0.5.4

hlld is now listening on port 4553 on localhost.

Persistent Volumes

The hlld docker image mounts the volumes /etc/hlld and /data. The default configuration can be overwritten by mounting a /etc/hlld volume containing a new configuration file. The default configuration file is:

# hlld defaults
[hlld]
cold_interval = 0
data_dir = /data/hlld
log_level = INFO
flush_interval = 300
port = 4553
workers =  2

The data volume can also be mounted. For example:

docker run -d -p 4553:4553 -v <config_dir>:/etc/hlld

hlld

hlld is a high-performance C server which is used
to expose HyperLogLog sets and operations over them to
networked clients. It uses a simple ASCI protocol
which is human readable, and similar to memcached.

HyperLogLog's are a relatively new sketching data structure.
They are used to estimate cardinality, i.e. the unique number
of items in a set. They are based on the observation that any
bit in a "good" hash function is independent of any other
bit and that the probability of getting a string of N
bits all set to the same value is 1/(2^N). There is a lot more in
the math, but that is the basic intuition. What is even more
incredible is that the storage required to do the counting
is log(log(N)). So with a 6 bit register, we can count well into
the trillions. For more information, please see the original github repo.

TL;DR: HyperLogLogs enable you to have a set with about 1.6% variance,
using 3280 bytes, and estimate sizes in the trillions.

Features

  • Scalable non-blocking core allows for many connected
    clients and concurrent operations
  • Implements 6bit wide HyperLogLogs, allowing almost unbounded counts
  • Supports asynchronous flushes to disk for persistence
  • Supports non-disk backed sets for high I/O
  • Automatically faults cold sets out of memory to save resources
  • Dead simple to start and administer
  • FAST, FAST, FAST

Clients

Here is a list of known client implementations:

Here is a list of "best-practices" for client implementations:

  • Maintain a set of open connections to the server to minimize connection time
  • Make use of the bulk operations when possible, as they are more efficient.
  • For long keys, it is better to do a client-side hash (SHA1 at least), and send
    the hash as the key to minimize network traffic.

Configuration Options

Each configuration option is documented below:

  • tcp_port : Integer, sets the tcp port to listen on. Default 4553.

  • port: Same as above. For compatibility.

  • udp_port : Integer, sets the udp port. Currently listened on

             but otherwise unused. Default 4554.
    
  • bind_address: The IP address to bind on. Defaults to 0.0.0.0.

  • data_dir : The data directory that is used. Defaults to /tmp/hlld

  • log_level : The logging level that hlld should use. One of:
    DEBUG, INFO, WARN, ERROR, or CRITICAL. All logs go to syslog,
    and stderr if that is a TTY. Default is INFO.

  • workers : This controls the number of worker threads that are used.
    Defaults to 1. If many different sets are used, it can be advantageous
    to increase this to the number of CPU cores. If only a few sets are used,
    the increased lock contention may reduce throughput, and a single worker
    may be better.

  • flush_interval : This is the time interval in seconds in which
    sets are flushed to disk. Defaults to 60 seconds. Set to 0 to
    disable.

  • cold_interval : If a set is not accessed (set or bulk), for
    this amount of time, it is eligible to be removed from memory
    and left only on disk. If a set is accessed, it will automatically
    be faulted back into memory. Set to 3600 seconds by default (1 hour).
    Set to 0 to disable cold faulting.

  • in_memory : If set to 1, then all sets are in-memory ONLY by
    default. This means they are not persisted to disk, and are not
    eligible for cold fault out. Defaults to 0.

  • use_mmap : If set to 1, the hlld internal buffer management
    is disabled, and instead buffers use a plain mmap() and rely on
    the kernel for all management. This increases data safety in the
    case that hlld crashes, but has adverse affects on performance
    if the total memory utilization of the system is high. In general,
    this should be left to 0, which is the default.

  • default_eps: If not provided to create, this is the default
    error of the HyperLogLog. This is an upper bound and is used to
    compute the precision that should be used. This option overrides
    a given default precision. Defaults to 1.625%, which is a precision
    of 12. Only one of default_eps or default_precision should be provided.

  • default_precision : If not provided to create, this is the default
    "precision" of the HyperLogLog. This controls the error in the size
    estimate. This option overrides a given default eps. Defaults to 12,
    which is results in a variance of about 1.625%. Only one of default_eps
    or default_precision should be provided.

It is important to note that reducing the error bound increases the
required precision. The size utilization of a HyperLogLog increases
exponentially with the precision, so it should be increased carefully.

Protocol

By default, hlld will listen for TCP connections on port 4553.
It uses a simple ASCII protocol that is very similar to memcached.

A command has the following syntax::

cmd [args][\r]\n

We start each line by specifying a command, providing optional arguments,
and ending the line in a newline (carriage return is optional).

There are a total of 9 commands:

  • create - Create a new set (a set is a named HyperLogLog)
  • list - List all sets or those matching a prefix
  • drop - Drop a set (Deletes from disk)
  • close - Closes a set (Unmaps from memory, but still accessible)
  • clear - Clears a set from the lists (Removes memory, left on disk)
  • set|s - Set an item in a set
  • bulk|b - Set many items in a set at once
  • info - Gets info about a set
  • flush - Flushes all sets or just a specified one

For the create command, the format is::

create set_name [precision=prec] [eps=max_eps] [in_memory=0|1]

Where set_name is the name of the set,
and can contain the characters a-z, A-Z, 0-9, ., _.
If a precision is provided the set
will be created with the given bits of precision, otherwise the configured default value will be used.
If a maximum epsilon is provided, that will be used to compute a precision, otherwise the configured default is used.
You can optionally specify in_memory to force the set to not be persisted to disk. If both precision and
eps are specified, it is not specified which one will be used. Generally, only one should be provided,
as the other will be computed.

As an example::

create foobar eps=0.01

This will create a set foobar that has a maximum variance of 1%.
Valid responses are either "Done", "Exists", or "Delete in progress". The last response
occurs if a set of the same name was recently deleted, and hlld
has not yet completed the delete operation. If so, a client should
retry the create in a few seconds.

The list command takes either no arguments or a set prefix, and returns information
about the matching sets.

For example, doing:

list foo

Will return a list of all sets with the foo prefix. Here is an example response:

START
foobar 0.010000 14 13108 0
END

This indicates a single set named foobar, with a variance
of 0.01, precision 14, a 13108 byte size, a current size estimate of 0
items.

The drop, close and clear commands are like create, but only takes a set name.
It can either return "Done" or "Set does not exist". clear can also return "Set is not proxied. Close it first.".
This means that the set is still in-memory and not qualified for being cleared.
This can be resolved by first closing the set.

set is a very simple command:

set set_name key

The command must specify a set and a key to use.
It will either return "Done", or "Set does not exist".

The bulk command is similar to set but allows for many keys
to be set at once. Keys must be separated by a space:

bulk set_name key1 [key_2 [key_3 [key_N]]]

The bulk and set commands can also be called by their aliases
b and s respectively.

The info command takes a set name, and returns
information about the set. Here is an example output:

START
in_memory 1
page_ins 0
page_outs 0
eps 0.02
precision 12
sets 0
size 1540
storage 3280
END

The command may also return "Set does not exist" if the set does
not exist.

The flush command may be called without any arguments, which
causes all sets to be flushed. If a set name is provided
then that set will be flushed. This will either return "Done" or
"Set does not exist".

Example

Here is an example of a client flow, assuming hlld is
running on the default port using just telnet::

$ telnet localhost 4553
> list
START
END

> create foobar
Done

> set foobar zipzab
Done

> bulk foobar zipzab blah boo
Done

> list
START
foobar 0.016250 12 3280 3
END

> drop foobar
Done

> list
START
END
Docker Pull Command
Owner
speyside
Source Repository

Comments (0)