Scalable REST API wrapper for the Caffe deep learning framework.
Caffe is an awesome deep learning framework, but running it on a single laptop or desktop computer isn't nearly as productive as running it in the cloud at scale.
ElasticThought gives you the ability to:
- Run multiple Caffe training jobs in parallel
- Queue up training jobs
- Tune the number of workers that process jobs on the queue
- Interact with it via a REST API (and later build Web/Mobile apps on top of it)
- Support multi-tenancy, so that multiple users can interact with it, each having access to only their own data
- Caffe - core deep learning framework
- Couchbase Server - Distributed document database used as an object store (source code)
- Sync Gateway - REST adapter layer for Couchbase Server + Mobile Sync gateway
- CBFS - Couchbase Distributed File System used as blob store
- NSQ - Distributed message queue
- ElasticThought REST Service - REST API server written in Go
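To make the queue-and-workers idea concrete, here is a single-machine analogy. It is not how ElasticThought is implemented (the real system uses NSQ and Caffe workers spread across a cluster); it only illustrates jobs being queued and drained by a tunable number of workers. The function name run_jobs and the echo that stands in for a training run are hypothetical, and the sketch assumes an xargs that supports -P (GNU and BSD xargs both do).

```shell
# Single-machine analogy only: read one job per line from stdin and
# process them with at most $1 workers running concurrently.
run_jobs() {
  # "_" fills $0 of the inner shell, so the job name arrives as $1.
  xargs -P "$1" -n 1 sh -c 'echo "finished $1"' _
}
```

For example, printf 'job-a\njob-b\njob-c\n' | run_jobs 2 processes three queued jobs with two workers. The order of the output lines may vary between runs.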
Here is what a typical cluster might look like:
If running on AWS, each CoreOS instance would be running on its own EC2 instance.
Although not shown, all components would be running inside of Docker containers.
CoreOS Fleet would be leveraged to auto-restart any failed components, including Caffe workers.
Current Status: everything under heavy construction, not ready for public consumption yet
- [done] Working end-to-end with the IMAGE_DATA Caffe layer using a single test set with a single training set, and ability to query the trained set.
- [in progress] ---> Support LEVELDB / LMDB data formats, to run the MNIST example.
- Support the majority of Caffe use cases
- Package everything up to make it easy to deploy <-- initial release
- Ability to auto-scale worker instances up and down based on how many jobs are in the message queue.
- Attempt to add support for other deep learning frameworks: pylearn2, cuda-convnet, etc.
- Build a Web App on top of the REST API that leverages PouchDB
- Build Android and iOS mobile apps on top of the REST API that leverage Couchbase Mobile
- 100% Open Source (Apache 2 / BSD), including all components used.
- Architected to enable warehouse scale computing
- No IaaS lock-in -- easily migrate between AWS, GCE, or your own private data center
- Ability to scale down as well as up
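Auto-scaling based on queue depth is still a roadmap item, so the following is only an illustration of the kind of rule it implies: derive a target worker count from the number of queued jobs and clamp it to a range, which covers scaling down as well as up. The function name, the jobs-per-worker ratio, and the bounds are all hypothetical and not part of ElasticThought.

```shell
# Hypothetical scaling rule: one worker per $2 queued jobs (rounded up),
# clamped to the range [$3, $4].
desired_workers() {
  depth=$1 per_worker=$2 min=$3 max=$4
  want=$(( (depth + per_worker - 1) / per_worker ))   # ceiling division
  [ "$want" -lt "$min" ] && want=$min
  [ "$want" -gt "$max" ] && want=$max
  echo "$want"
}

desired_workers 25 10 1 8   # 25 queued jobs, 10 per worker: prints 3
```

An empty queue yields the minimum rather than zero, so at least one worker is always ready to pick up the next job.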
ElasticThought is not trying to be a grid computing (aka distributed computation) solution.
For that, check out:
Kick things off: AWS
Launch EC2 instances via CloudFormation script
Note: the instance will launch in us-east-1. If you want to launch in another region, please file an issue.
- Launch CPU Stack or Launch GPU Stack
- Choose a 3-node cluster with the m3.medium or g2.2xlarge (GPU case) instance type
- All other values should be default
Kick off ElasticThought
SSH into one of the machines (doesn't matter which):
ssh -A email@example.com
$ wget https://raw.githubusercontent.com/tleyden/elastic-thought/master/docker/scripts/elasticthought-cluster-init.sh
$ chmod +x elasticthought-cluster-init.sh
$ ./elasticthought-cluster-init.sh -v 3.0.1 -n 3 -u "user:passw0rd" -p gpu
Once it launches, verify your cluster by running
It should look like this:
UNIT                                       MACHINE                     ACTIVE  SUB
firstname.lastname@example.org             2340c553.../10.225.17.229   active  running
email@example.com                          fbd4562e.../10.182.197.145  active  running
firstname.lastname@example.org             0f5e2e11.../10.168.212.210  active  running
email@example.com                          2340c553.../10.225.17.229   active  running
firstname.lastname@example.org             fbd4562e.../10.182.197.145  active  running
email@example.com                          0f5e2e11.../10.168.212.210  active  running
couchbase_bootstrap_node.service           0f5e2e11.../10.168.212.210  active  running
couchbase_bootstrap_node_announce.service  0f5e2e11.../10.168.212.210  active  running
couchbase_node.1.service                   2340c553.../10.225.17.229   active  running
couchbase_node.2.service                   fbd4562e.../10.182.197.145  active  running
firstname.lastname@example.org             2340c553.../10.225.17.229   active  running
email@example.com                          fbd4562e.../10.182.197.145  active  running
firstname.lastname@example.org             0f5e2e11.../10.168.212.210  active  running
email@example.com                          2340c553.../10.225.17.229   active  running
firstname.lastname@example.org             fbd4562e.../10.182.197.145  active  running
email@example.com                          0f5e2e11.../10.168.212.210  active  running
firstname.lastname@example.org             2340c553.../10.225.17.229   active  running
email@example.com                          fbd4562e.../10.182.197.145  active  running
firstname.lastname@example.org             0f5e2e11.../10.168.212.210  active  running
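With this many units, checking the listing by eye is error-prone. As a minimal sketch, the filter below reads a listing in the four-column UNIT / MACHINE / ACTIVE / SUB layout shown above and prints every unit that is not both active and running. The function name unhealthy_units is hypothetical and not part of fleet or ElasticThought.

```shell
# Hypothetical helper: print units whose ACTIVE/SUB columns are not
# "active running". Expects UNIT MACHINE ACTIVE SUB rows on stdin and
# skips the header row.
unhealthy_units() {
  awk '$1 != "UNIT" && NF >= 4 && ($3 != "active" || $4 != "running") { print $1 }'
}
```

Assuming the listing comes from fleetctl list-units, a typical invocation would be fleetctl list-units | unhealthy_units. Empty output means every unit is healthy.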
At this point you should be able to access the REST API on the public IP of any of the three Sync Gateway machines.
Kick things off: Vagrant
Make sure you're running a current version of Vagrant, otherwise the plugin install below may fail.
$ vagrant -v
1.7.1
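The exact minimum version is not stated here, so the check below is only a sketch that treats 1.7.1 (the version shown above) as the baseline; adjust it to whatever the plugin actually requires. The helper name version_ge is hypothetical, and it relies on sort -V, a GNU coreutils extension.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Uses version sort, so 1.10.0 is correctly treated as newer than 1.7.1.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

It could be used as version_ge "$(vagrant -v | awk '{print $NF}')" 1.7.1 || echo "Vagrant is older than 1.7.1".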
Open the user-data file, and add:
write_files:
  - path: /etc/systemd/system/docker.service.d/increase-ulimit.conf
    owner: core:core
    permissions: 0644
    content: |
      [Service]
      LimitMEMLOCK=infinity
  - path: /var/lib/couchbase/data/.README
    owner: core:core
    permissions: 0644
    content: |
      Couchbase Data files are stored here
  - path: /var/lib/couchbase/index/.README
    owner: core:core
    permissions: 0644
    content: |
      Couchbase Index files are stored here
  - path: /var/lib/cbfs/data/.README
    owner: core:core
    permissions: 0644
    content: |
      CBFS files are stored here
Increase RAM size of VMs
Couchbase Server wants a lot of RAM. Bump up the VM memory size to 2GB.
Edit your Vagrantfile:
$vb_memory = 2048
Setup port forwarding for Couchbase UI (optional)
This is only needed if you want to be able to connect to the Couchbase web UI from a browser on your host OS (e.g., OS X).
Add the following snippet to your Vagrantfile:
if i == 1
  # create a port forward mapping to view couchbase web ui
  config.vm.network "forwarded_port", guest: 8091, host: 5091
end
Disable Transparent Huge Pages (optional)
Not sure how crucial this is, but I'll mention it just in case. After the CoreOS machines start up, SSH into each one and run:
$ sudo bash
# echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag
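To confirm the change took effect, the active value can be read back. The kernel lists every option in these files and wraps the current one in square brackets, for example always madvise [never]. A minimal sketch, with thp_active as a hypothetical helper name:

```shell
# Hypothetical helper: print the bracketed (currently active) value from a
# transparent_hugepage sysfs file such as .../enabled or .../defrag.
thp_active() {
  sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}
```

After the commands above, thp_active /sys/kernel/mm/transparent_hugepage/enabled should print never on each machine. Values written under /sys this way do not persist across reboots, so the step has to be repeated after a restart.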