The ANTS Load Balancing System - Info
When and Why ?
Development was started in may/july 2000 by me, the author, Jakob
Østergaard. I needed a good queuing system for large scale
compilations, and found that the ones that were freely available were
either not fast enough, or didn't work well enough. So I decided to
roll my own, and here it is.
On What ?
The software is built and tested on RedHat Linux 7.0, but I
think the platform list is something along the lines of:
- Any Linux-2.0 based system: Will most likely not
work. antsd uses some socket tricks which I think aren't available on
Linux-2.0 or earlier. This can be worked around though, using the
identd support already in the code.
- Any Linux-2.2 and 2.4 based system: Will work. antsd most likely
depends heavily on glibc, but it should work with any distribution
shipping glibc. I guess.
- Any BSD or SVR4 UNIX: It could be made to work fairly
easily I guess. There used to be identd support in the code, to remove the
need for the Linux specific socket tricks. However, this code has not been
integrated as ANTS changed over time... If you need ident support there
is some code in there that can be hacked back into working condition - but I
won't do it because I don't need it.
There are some requirements though:
- You cannot mix CPU architectures, like IA-32 and Alpha. However,
for the jobs this system was intended for, that doesn't make sense
anyway.
- I have not tested this on anything but IA-32. There may well be
problems with 64-bit architectures, but I wouldn't know.
- The user's home directories must be shared on all nodes. Currently
the absolute path name to the home directories must be the same on all
nodes, but this will be fixed later.
How ?
The ANTS Load Balancing System can be used to execute jobs on a
cluster, automatically selecting the best suited node for the job type
given. A small utility is used to spawn jobs, it will contact the
antsd daemons and have the job executed on a suitable node.
The ANTS system consists of two parts:
- The antsd daemon. Each node in the cluster run a copy of the
antsd daemon. Each node has it's own configuration with resource limits
and job type configurations.
- The rant ``job shell''. Instead of executing a compiler with
gcc -c prog.c, you can now call it as rant -t gcc gcc -c
prog.c, and the compiler will be executed on the node which is
best suited for the job.
The antsd daemons will communicate their local status (metrics) to the
other daemons using a simple UDP based protocol. A job execution
involves the following steps:
- A rant instance connects to it's local daemon using a UNIX socket,
and can request a job-slot for some job type this way.
- The antsd daemon will then consider all known metrics from all
known hosts.
- The best host for the given job type is then requested (via UDP)
to acknowledge a job-slot allocation for the job type.
- If the best host responds with a ``no-can-do'', or doesn't
respond at all, the rant client is told to retry later
- If the best host acknowledges the job-slot allocation request, it
will return an authentication handle along with the
acknowledgement. The address of the host as well as the authentication
handle is then passed on to the rant client via. the UNIX socket
- The client will now close the connection to the local antsd, and
open a TCP connection to the best host.
- The client sends the authentication handle along with information
about job-type, executable name, arguments, etc. to the best host
using the TCP connection.
- If the remote antsd daemon decides that everything seems to add
up, it will execute the command given, and pass on output and exit
status from the command to the remote rant.