A command-line interface for running Docker containers in parallel.
Developed at Radiant Genomics for bioinformatics needs.
The framework is built on Docker and Kubernetes.
### A typical use case
1) Create a docker image from a docker folder and a config file (described in later sections):

```sh
run-pipeline build -df <path to docker folder> -c <path to config file> --mode <aws or local>
```
2) Start a run:

```sh
run-pipeline start -c <path to config file> -n <name of run> --mode <aws or local>
```
3) Monitor the runs.

Check general stats:

```sh
run-pipeline stats -n <name of run> --mode <aws or local>
```

Check logs (stdout and stderr):

```sh
run-pipeline log -n <name of run> --mode <aws or local>
```

Look for keywords in the logs every 15 min:

```sh
run-pipeline log -n <name of run> --grep "Exception|error" --watch 15m --mode <aws or local>
```
### Installation

1) Install using npm:

```sh
npm install -g run-pipeline
```
The installation process will ask whether you want to install all the prerequisites, such as MongoDB, the AWS CLI, and Kubernetes. Answer 'yes' only if you are on a Debian-based Linux operating system, because this process uses apt-get. If you are not on a Debian-based Linux operating system, take a look at the prerequisites list to see what needs to be installed.
2) Add the following variables to your environment (modify .bashrc). DEFAULT_MODE is the default value for `--mode` (allowed values: local, aws). The AWS variables are not required if you are using local mode.

```sh
AWS_ACCESS_KEY_ID=**paste your aws access key id here**
AWS_SECRET_ACCESS_KEY=**paste your aws access key secret here**
AWS_DEFAULT_REGION=us-west-2
AWS_DEFAULT_AVAILABILITYZONE=us-west-2c
AWS_KEYPAIR=**your key pair**
AWS_ARN=**create an aws arn**
DEFAULT_MODE=local
```
### Configuration file

The command-line interface relies on a configuration file, which specifies the amount of RAM and/or CPU, the inputs, and the outputs for a single execution of a pipeline. Below is a simple hello-world example. Other example configuration files can be found in the /examples folder. The files for creating the docker containers are located in a separate repository.
For local mode:

```yaml
name: Hello World Run 3
author: Deepak
description: testing
pipeline:
  image: helloworld:latest
  command: /run.sh
  memory_requirement: 120M
volumes:
  shared:
    hostPath: /home/deepak/mydata
local_config:
  cpu_cores: 0-2
output:
  location: /home/deepak/mydata/output
inputs:
  location: /home/deepak/mydata/inputs
  file_names:
    - inputA: f1
      inputB: r1
    - inputA: f2
      inputB: r2
    - inputA: f3
      inputB: r3
```
For AWS mode:

```yaml
name: Hello World Run 3
author: Deepak
description: testing
pipeline:
  repository: 570340283117.dkr.ecr.us-west-2.amazonaws.com
  image: helloworld:latest
  command: /run.sh
  public: no
  memory_requirement: 120M
cloud_config:
  workerCount: 2
  workerInstanceType: t2.nano
  workerRootVolumeSize: 100
local_config:
  cpu_cores: 0-2
output:
  location: s3://some/location
inputs:
  location: s3://some/location
  file_names:
    - inputA: f1
      inputB: r1
    - inputA: f2
      inputB: r2
    - inputA: f3
      inputB: r3
```
### Command-line options

```
usage: runPipeline.js [-h] [-c path] [-n name] [-df path] [--grep regex]
                      [--cmd command] [-t name] [-q key-value]
                      [--watch interval] [-v] [--mode MODE] [--overwrite]
                      [--limit number]
                      option

Run dockerized pipelines locally or in Amazon Cloud

Positional arguments:
  option                Options: "init" to initialize a cluster, "build" to
                        build the docker container, "run" or "start" to start
                        tasks, "stop" to stop tasks, "kill" to destroy a
                        cluster, "log" to get current or previous logs,
                        "status" to get status of running tasks, "search" to
                        search for previous run configs

Optional arguments:
  -h, --help            Show this help message and exit.
  -c path, --config path
                        Cluster configuration file (required for init)
  -n name, --name name  Name of run (if config file is also specified, then
                        uses this name instead of the one in config file)
  -df path, --dockerfile path
                        Specify the dockerfile that needs to be built
  --grep regex          Perform grep operation on the logs. Use only with the
                        log option
  --cmd command, --exec command
                        Execute a command inside all the docker containers.
                        Use only with the log option
  -t name, --task name  Specify one task, e.g. task-1, instead of all tasks.
                        Use only with log, status, run, or stop.
  -q key-value, --query key-value
                        For use with the search program, e.g. name: "Hello"
  --watch interval      Check log every X seconds, minutes, or hours. Input
                        examples: 5s, 1m, 1h. Only use with run or log.
                        Useful for monitoring unattended runs.
  -v, --verbose         Print intermediate outputs to console
  --mode MODE, -m MODE  Where to create the cluster. Options: local, aws
  --overwrite           Overwrite existing outputs. Default behavior: does
                        not run tasks where output files exist.
  --limit number        Use with search or log to limit the number of outputs.
```
Build a docker image for local usage:
```sh
run-pipeline build --config helloworld.yaml --dockerfile ./inputs/helloworld/Dockerfile
```
Build and upload a docker image for Cloud usage:
```sh
run-pipeline build --config helloworld.yaml --dockerfile ./inputs/helloworld/Dockerfile --mode aws
```
For local use, the `start` command will automatically initialize the cluster and run the pipeline:
```sh
run-pipeline start --config helloworld.yaml
```
Turn on automatic logging while running. In this case, watch the logs every 1 min:
```sh
run-pipeline start --config helloworld.yaml --watch 1m
```
Use --verbose to see the commands that are being executed by runPipeline.
For cloud usage, you need to initialize the pipeline before running:

```sh
run-pipeline init --config helloworld.yaml --mode aws
run-pipeline start --config helloworld.yaml -m aws
```

Or you can use the name of the run:

```sh
run-pipeline start --name "Hello World" --mode aws
```

Shorthand:

```sh
run-pipeline start -n "Hello World" -m aws
```
A task is defined as a pipeline and a specific input (or set of input files). For example, in the above configuration example, there are three tasks: the first using f1 and r1, the second using f2 and r2, and the last using f3 and r3 input files.
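To make the mapping concrete, here is an illustrative shell sketch (not part of the tool) of how the three `file_names` entries expand into tasks:

```shell
# Illustrative only: each entry under file_names becomes one task, and
# its keys (inputA, inputB) are exposed to that task's container.
i=1
for pair in "f1 r1" "f2 r2" "f3 r3"; do
  set -- $pair            # split the pair into $1 and $2
  echo "task-$i: inputA=$1 inputB=$2"
  i=$((i + 1))
done
```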
Check the status of all tasks or a specific task:

```sh
run-pipeline status -c helloworld.yaml -m aws
run-pipeline status -c helloworld.yaml -m aws -t task-1
```
Colored and indented console outputs
Obtain the logs from all tasks or a specific task:

```sh
run-pipeline logs helloworld.yaml -m aws
run-pipeline logs helloworld.yaml --task "task-1" -m aws
```
grep (find) a specific phrase (regex) inside the logs:

```sh
run-pipeline logs helloworld.yaml --grep "Downloads -A 2"
```
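The pattern given to `--grep` is a regular expression. As a rough illustration with plain `grep -E` on a sample file (the log contents and path here are made up for the demo):

```shell
# Write a fake log file and search it the way --grep searches task logs.
printf 'INFO starting\nERROR disk full\nException in thread main\n' > /tmp/sample.log
grep -E 'Exception|ERROR' /tmp/sample.log   # matches the last two lines
```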
Execute a command inside each task or a specific task:

```sh
run-pipeline logs helloworld.yaml --cmd "cat /tmp/file1.txt"
run-pipeline logs helloworld.yaml --cmd "cat /tmp/file1.txt" -t task-1
```
Search old logs
```sh
run-pipeline logs --query "run: Hello World, createdAt: 2016-04-10"
```
Search old run configurations
```sh
run-pipeline search --query "name: Hello World, createdAt: 2016-06-10"
```
Restart a task
```sh
run-pipeline restart -c inputs/HS4000_plate3_bfcspadesmeta.yaml -t task-1
```
Restart tasks automatically when specific keywords are found in the logs or status. Note that regular expressions are allowed.
```sh
./auto-restart.js "HiSeq Plate 3 SPAdes" --logs '(not connect to the endpoint)|(different)' --status 'reason: Error|OOMKilled'
```
Check for specific key-value pairs in the status:

```sh
run-pipeline status -c inputs/HS4000_plate3_bfcspadesmeta.yaml --query "status: exit"
run-pipeline status -c inputs/HS4000_plate3_bfcspadesmeta.yaml --query "status: running"
```
Keep checking logs every 60 seconds (and store them in the database); useful for unattended runs:

```sh
# monitor free memory every 60 seconds
run-pipeline logs helloworld.yaml --cmd "free -h" --watch 60s
```
Look for keywords in the logs and print the 5 lines after each match:

```sh
run-pipeline log -c inputs/HS4000_plate3_bfcspadesmeta.yaml --grep "'different number' -A 5"
```
Run a configuration using its name (uses the latest version in the database):

```sh
run-pipeline start -n "Hello World"
```
Useful commands for monitoring memory and disk usage inside each task
```sh
# monitor free memory
run-pipeline logs helloworld.yaml --cmd "free -h"
# monitor CPU and memory usage
run-pipeline logs helloworld.yaml --cmd "ps aux"
# monitor free disk space
run-pipeline logs helloworld.yaml --cmd "df -h"
```
Divide executions into categories, i.e. namespaces. The executions will use the same cluster (when run in AWS):

```sh
run-pipeline start helloworld.yaml --namespace A
run-pipeline start helloworld.yaml --namespace B
# get logs from different pipelines in the same cluster
run-pipeline status helloworld.yaml --namespace A
run-pipeline logs helloworld.yaml --namespace B
```
When using `--mode aws`, you may use Kubernetes directly for features not provided by runPipeline:

```sh
cd Hello-World
kubectl --kubeconfig=kubeconfig get nodes
kubectl --kubeconfig=kubeconfig describe node ip-10-0-0-111.us-west-2.compute.internal
```
A pipeline is a Docker image that uses environment variables to define input and output files. A pipeline can be executed multiple times, in parallel, using different inputs and/or outputs.
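For illustration, a minimal entrypoint script for such an image might look like the sketch below; the variable names (`inputA`, `OUTPUT_DIR`) and paths are assumptions for the demo, not names prescribed by the tool:

```shell
#!/bin/sh
# Hypothetical run.sh: read the per-task input and output locations from
# environment variables, falling back to demo paths so the sketch runs.
input=${inputA:-/tmp/demo/f1}
outdir=${OUTPUT_DIR:-/tmp/demo/output}
mkdir -p "$outdir"
echo "processed $input" > "$outdir/result.txt"
cat "$outdir/result.txt"
```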
Building and executing docker containers is described in another repository.
In order to create a pipeline that is amenable to cloud computing, you need to upload the image to a docker repository, such as the AWS repository or Docker Hub. Below are the steps for uploading to the AWS docker repository.
```sh
docker build -t 570340283117.dkr.ecr.us-west-2.amazonaws.com/helloworld:latest ./helloworld
docker push 570340283117.dkr.ecr.us-west-2.amazonaws.com/helloworld:latest
```
## Issues to keep in mind
### When running locally
`./runPipeline kill` will destroy all containers; if you do not use `kill`, you should also clean up stopped docker containers. Useful commands to put in your .bashrc:

```sh
# Kill all running containers.
alias dockerkillall='docker kill $(docker ps -q)'
# Delete all stopped containers.
alias dockercleanc='printf "\n>>> Deleting stopped containers\n\n" && docker rm $(docker ps -a -q)'
# Delete all untagged images.
alias dockercleani='printf "\n>>> Deleting untagged images\n\n" && docker rmi $(docker images -q -f dangling=true)'
```
AWS Container Service will place multiple containers on the same EC2 instance. This decision is based on the amount of memory and the number of cores required by each container, so be sure to specify `memory_requirement` correctly in the input file. Otherwise, random processes may start failing because the EC2 instances are oversubscribed.
The kill command may not correctly delete a cluster and its CloudFormation components due to dependencies. It is difficult to identify all the dependencies programmatically; in such cases, you have to go to the CloudFormation section of the AWS Console and manually delete those stacks. In my observation, the EC2 instances are terminated properly by the kill command. AWS will not allow you to create more clusters if the maximum number of VPCs has been reached. VPCs should get deleted along with their clusters when kill is executed, but this is not always the case; you may need to delete them manually. From the AWS discussion groups, this appears to be an issue that affects many people.
AWS only allows a limited number of Virtual Private Clouds (VPCs), which are created by the ECS service for each cluster. Deleting VPCs through the AWS command-line interface is difficult because they have other dependencies. Deleting them from the AWS Console is relatively easy, however.
## Prerequisites

1) Docker; see instructions here: https://docs.docker.com/engine/installation/. Add yourself to the docker group.
2) Python 2.7 or greater.
3) Use pip to install the following Python packages: pyaml, python-dateutil, awscli
4) NPM and NodeJS version 6+
5) Kubernetes 1.2.4. Download from https://github.com/kubernetes/kubernetes/releases/download/v1.2.4/
6) MongoDB. Google will tell you how to install it.
## License

Copyright (c) 2016 Deepak Chandran
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.