The ANTS Load Balancing System - Info

When and Why ?

Development was started in may/july 2000 by me, the author, Jakob Østergaard. I needed a good queuing system for large scale compilations, and found that the ones that were freely available were either not fast enough, or didn't work well enough. So I decided to roll my own, and here it is.

On What ?

The software is built and tested on RedHat Linux 7.0, but I think the platform list is something along the lines of:

Any Linux-2.0 based system: Will most likely not work. antsd uses some socket tricks which I think aren't available on Linux-2.0 or earlier. This can be worked around though, using the identd support already in the code.
Any Linux-2.2 and 2.4 based system: Will work. antsd most likely depends heavily on glibc, but it should work with any distribution shipping glibc. I guess.
Any BSD or SVR4 UNIX: It could be made to work fairly easily I guess. There used to be identd support in the code, to remove the need for the Linux specific socket tricks. However, this code has not been integrated as ANTS changed over time... If you need ident support there is some code in there that can be hacked back into working condition - but I won't do it because I don't need it.

There are some requirements though:

You cannot mix CPU architectures, like IA-32 and Alpha. However, for the jobs this system was intended for, that doesn't make sense anyway.
I have not tested this on anything but IA-32. There may well be problems with 64-bit architectures, but I wouldn't know.
The user's home directories must be shared on all nodes. Currently the absolute path name to the home directories must be the same on all nodes, but this will be fixed later.

How ?

The ANTS Load Balancing System can be used to execute jobs on a cluster, automatically selecting the best suited node for the job type given. A small utility is used to spawn jobs, it will contact the antsd daemons and have the job executed on a suitable node.
The ANTS system consists of two parts:

The antsd daemon. Each node in the cluster run a copy of the antsd daemon. Each node has it's own configuration with resource limits and job type configurations.
The rant ``job shell''. Instead of executing a compiler with gcc -c prog.c, you can now call it as rant -t gcc gcc -c prog.c, and the compiler will be executed on the node which is best suited for the job.

The antsd daemons will communicate their local status (metrics) to the other daemons using a simple UDP based protocol. A job execution involves the following steps:

A rant instance connects to it's local daemon using a UNIX socket, and can request a job-slot for some job type this way.
The antsd daemon will then consider all known metrics from all known hosts.
The best host for the given job type is then requested (via UDP) to acknowledge a job-slot allocation for the job type.
If the best host responds with a ``no-can-do'', or doesn't respond at all, the rant client is told to retry later
If the best host acknowledges the job-slot allocation request, it will return an authentication handle along with the acknowledgement. The address of the host as well as the authentication handle is then passed on to the rant client via. the UNIX socket
The client will now close the connection to the local antsd, and open a TCP connection to the best host.
The client sends the authentication handle along with information about job-type, executable name, arguments, etc. to the best host using the TCP connection.
If the remote antsd daemon decides that everything seems to add up, it will execute the command given, and pass on output and exit status from the command to the remote rant.