okdataset is a proof of concept of a lightweight, on-demand map-reduce cluster, intended for prototyping and analysis on small- to medium-sized datasets. A few of the motivations for creating okdataset:
- An API covering the majority of the PySpark API.
- No JVMs - tuning JVMs is tedious and unnecessary, and this is all Python anyway.
- Supportability - simple JSON-structured logging, and basic fine-grained profiling.
- Containers containers containers!
- Schema-less - less time worrying about strict typing and unwinding.
Client example - sum the elements of a list:

```python
from okdataset import ChainableList, Context, Logger

logger = Logger("sum example")
context = Context()

logger.info("Building list")
l = ChainableList([x for x in xrange(1, 30)])

logger.info("Building dataset")
ds = context.dataSet(l, label="sum", bufferSize=1)

logger.info("Calling reduce")
print ds.reduce(lambda x, y: x + y)

logger.info("All done!")
```
```shell
docker-compose up -d             # starts master, worker, and redis
docker-compose scale worker=8    # scale workers
```
okdataset uses ZeroMQ for middleware and Redis as a distributed cache. A single master process implements a REQ/REP pattern for client-server connectivity, and a ventilator/sink pattern for delegating calculation tasks from the master to the workers.
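The ventilator/sink flow can be illustrated with a minimal sketch. This simulates the pattern with stdlib queues and threads rather than ZeroMQ sockets, and the function names are illustrative, not okdataset's actual API: a ventilator pushes one task per buffer, workers apply the computation, and a sink aggregates the partial results.

```python
import threading
import queue

def ventilator(tasks, task_q, n_workers):
    # Push one task per buffer, then a stop signal per worker.
    for t in tasks:
        task_q.put(t)
    for _ in range(n_workers):
        task_q.put(None)

def worker(task_q, result_q):
    # Pull tasks until the stop signal, apply the computation, push results.
    while True:
        t = task_q.get()
        if t is None:
            break
        result_q.put(sum(t))  # stand-in for the real method chain

def run(buffers, n_workers=4):
    task_q, result_q = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(task_q, result_q))
               for _ in range(n_workers)]
    for th in threads:
        th.start()
    ventilator(buffers, task_q, n_workers)
    for th in threads:
        th.join()
    # The sink aggregates per-buffer results into the final answer.
    return sum(result_q.get() for _ in buffers)

print(run([[1, 2, 3], [4, 5], [6]]))  # 21
```

In the real system the three roles run in separate processes connected by PUSH/PULL sockets, so workers can be scaled independently of the master.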
The `DataSet` class is a subclass of the `ChainableList` class, which is where the PySpark API subset is implemented, providing the usual methods such as `reduceByKey`. Clients create `DataSet` objects from lists, push the data to the master (and thus into the cache), and push serialized (cloudpickle) methods to the server; `collect` calls then trigger the method chains to be applied to buffers of the dataset on workers. Results are stored in the cache and aggregated by the master for return to the client.
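The flow above can be sketched end to end. This is a conceptual illustration, not okdataset's implementation: plain `pickle` with a module-level function stands in for cloudpickle, and `split_into_buffers` and `worker_apply` are hypothetical names.

```python
import pickle

def double(x):
    # Stands in for the client's mapped function.
    return x * 2

def split_into_buffers(data, buffer_size):
    # The master splits the dataset into cache-sized buffers.
    return [data[i:i + buffer_size] for i in range(0, len(data), buffer_size)]

def worker_apply(payload, buffer):
    # A worker deserializes the shipped function and maps it over its buffer.
    fn = pickle.loads(payload)
    return [fn(x) for x in buffer]

data = list(range(1, 7))
payload = pickle.dumps(double)            # "push serialized methods"
buffers = split_into_buffers(data, 2)     # data lands in the cache as buffers
partials = [worker_apply(payload, b) for b in buffers]
result = [x for part in partials for x in part]  # master aggregates for the client
print(result)  # [2, 4, 6, 8, 10, 12]
```

cloudpickle matters in the real system because, unlike stdlib pickle, it can serialize lambdas and closures defined in the client.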