Ubuntu Manpage: work_queue_factory - maintain a pool of Work Queue workers on a batch system.

Provided by: coop-computing-tools_7.0.22-1ubuntu1_amd64

NAME

       work_queue_factory - maintain a pool of Work Queue workers on a batch system.

SYNOPSIS

       work_queue_factory -M <project-name> -T <batch-type> [options]

DESCRIPTION

work_queue_factory submits and maintains a number of work_queue_worker(1) processes on
various batch systems, such as Condor and SGE. All the workers managed by a
work_queue_factory process will be directed to work for a specific master, or any set of
masters matching a given project name. work_queue_factory will automatically determine
the correct number of workers to have running, based on criteria set on the command line.
The decision on how many workers to run is reconsidered once per minute.

By default, work_queue_factory will run as many workers as the indicated masters have
tasks ready to run. If there are multiple masters, then enough workers will be started to
satisfy their collective needs. For example, if there are two masters with the same
project name, each with 10 tasks to run, then work_queue_factory will start a total of 20
workers.

If the number of needed workers increases, work_queue_factory will submit more workers to
meet the desired need. However, it will not run more than a fixed maximum number of
workers, given by the -W option.

If the need for workers drops, work_queue_factory does not remove them immediately, but
waits to them to exit on their own. (This happens when the worker has been idle for a
certain time.) A minimum number of workers will be maintained, given by the -w option.

If given the -c option, then work_queue_factory will consider the capacity reported by
each master. The capacity is the estimated number of workers that the master thinks it
can handle, based on the task execution and data transfer times currently observed at the
master. With the -c option on, work_queue_factory will consider the master's capacity to
be the maximum number of workers to run.

If work_queue_factory receives a terminating signal, it will attempt to remove all running
workers before exiting.

OPTIONS

   Batch Options
       -M, --master-name=<project>
              Project name of masters to serve, can be a regular expression.

       -F, --foremen-name=<project>
              Foremen to serve, can be a regular expression.

        --catalog <catalog>
              Catalog   server   to   query   for  masters  (default:  catalog.cse.nd.edu,backup-
              catalog.cse.nd.edu:9097).

       -T, --batch-type=<type>
              Batch system type (required). One of: local, wq, condor, sge, torque,  mesos,  k8s,
              moab, slurm, chirp, amazon, lambda, dryrun, amazon-batch

       -B, --batch-options=<options>
              Add these options to all batch submit files.

       -P, --password=<file>
              Password file for workers to authenticate to master.

       -C, --config-file=<file>
              Use the configuration file <file>.

       -w, --min-workers=<workers>
              Minimum workers running.  (default=5)

       -W, --max-workers=<workers>
              Maximum workers running.  (default=100)

        --workers-per-cycle <workers>
              Maximum  number  of  new  workers  per  30  seconds.  ( less than 1 disables limit,
              default=5)

        --tasks-per-worker <workers>
              Average tasks per worker (default=one task per core).

       -t, --timeout=<time>
              Workers abort after this amount of idle time (default=300).

        --env <variable=value>
              Environment variable that should be added to the worker (May be specified  multiple
              times).

       -E, --extra-options=<options>
              Extra options that should be added to the worker.

        --cores <n>
              Set the number of cores requested per worker.

        --gpus <n>
              Set the number of GPUs requested per worker.

        --memory <mb>
              Set the amount of memory (in MB) requested per worker.

        --disk <mb>
              Set the amount of disk (in MB) requested per worker.

        --autosize
              Automatically size a worker to an available slot (Condor, Mesos, and Kubernetes).

        --condor-requirements
              Set  requirements  for  the  workers as Condor jobs. May be specified several times
              with expresions and-ed together (Condor only).

        --factory-timeout <n>
              Exit after no master has been seen in <n> seconds.

       -S, --scratch-dir=<file>
              Use this scratch dir for temporary files (default is /tmp/wq-pool-$uid).

        -c, --capacity
              Use worker capacity reported by masters.

       -d, --debug=<subsystem>
              Enable debugging for this subsystem.

        --amazon-config
              Specify Amazon config file (for use with -T amazon).

        --wrapper
              Wrap factory with this command prefix.

        --wrapper-input
              Add this input file needed by the wrapper.

        --mesos-master <hostname>
              Specify the host name to mesos master node (for use with -T mesos).

        --mesos-path <filepath>
              Specify path to mesos python library (for use with -T mesos).

        --mesos-preload <library>
              Specify the linking libraries for running mesos (for use with -T mesos).

        --k8s-image
              Specify the container image for using Kubernetes (for use with -T k8s).

        --k8s-worker-image
              Specify the container image that  contains  work_queue_worker  availabe  for  using
              Kubernetes (for use with -T k8s).

       -o, --debug-file=<file>
              Send debugging to this file (can also be :stderr, :stdout, :syslog, or :journal).

       -O, --debug-file-size=<mb>
              Specify the size of the debug file (must use with -o option).

        --worker-binary <file>
              Specify  the binary to use for the worker (relative or hard path). It should accept
              the same arguments as the default work_queue_worker.

        --runos <img>
              Will make a best attempt to ensure the worker will  execute  in  the  specified  OS
              environment, regardless of the underlying OS.

        -v, --version
              Show the version string.

        -h, --help
              Show this screen.

EXIT STATUS

       On success, returns zero. On failure, returns non-zero.

EXAMPLES

       Suppose you have a Work Queue master with a project name of "barney".  To maintain workers
       for barney, do this:

               work_queue_factory -T condor -M barney

       To maintain a maximum of 100 workers on an SGE batch system, do this:

               work_queue_factory -T sge -M barney -W 100

       To start workers such that the workers exit after 5 minutes (300s) of idleness:

               work_queue_factory -T condor -M barney -t 300

       If you want to start workers that match any project that begins with barney, use a regular
       expression:

               work_queue_factory -T condor -M barney.-t 300

       If running on condor, you may manually specify condor requirements:

               work_queue_factory -T condor -M barney --condor-requirements 'MachineGroup == "disc"' --condor-requirements 'has_matlab == true'

       Repeated  uses  of  condor-requirements  are  and-ed  together.  The previous example will
       produce a statement equivalent to:

       requirements = ((MachineGroup == "disc") && (has_matlab == true))

       Use the configuration file my_conf:

               work_queue_factory -Cmy_conf

       my_conf should be a proper JSON document, as:

               {
                       "master-name": "my_master.*",
                       "max-workers": 100,
                       "min-workers": 0
               }

       Valid configuration fields are:

               master-name
               foremen-name
               min-workers
               max-workers
               workers-per-cycle
               task-per-worker
               timeout
               worker-extra-options
               condor-requirements
               cores
               memory
               disk

KNOWN BUGS

       The capacity measurement  currently  assumes  single-core  tasks  running  on  single-core
       workers, and behaves unexpectedly with multi-core tasks or multi-core workers.

COPYRIGHT

       The  Cooperative  Computing  Tools are Copyright (C) 2003-2004 Douglas Thain and Copyright
       (C) 2005-2015 The University of Notre Dame.  This software is distributed  under  the  GNU
       General Public License.  See the file COPYING for details.