The ANTS Load Balancing System

Current version: 0.5.3 - October 20th, 2004

What ?

The ANTS Load Balancing System is a piece of software that allows jobs to be executed on computers connected in a network (e.g. a Beowulf cluster). The node best suited for the given job at the time of execution is chosen to execute it.
This approach differs from that of traditional queue systems. A job is not queued; it is executed immediately if any suitable host (for the given job type) can be found. This makes the system suitable for executing a large number of small jobs, such as compiler runs. A traditional queue system often spends too much time managing its queues for tasks such as large-scale compilations to gain much speedup from it.
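
The selection step described above can be sketched roughly as follows. This is an illustration only, not the ANTS implementation: the Node class, the "load" metric, and the host names are assumptions made for the example, and what ANTS does when no suitable host exists is not stated here, so the sketch simply returns None.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    load: float  # hypothetical metric; lower means "better suited"
    job_types: set = field(default_factory=set)


def pick_node(nodes, job_type):
    """Return the least-loaded node that accepts job_type, or None."""
    suitable = [n for n in nodes if job_type in n.job_types]
    if not suitable:
        return None  # nothing is queued; the caller finds out right away
    return min(suitable, key=lambda n: n.load)


# Made-up cluster state for illustration.
nodes = [
    Node("eagle", 0.9, {"gcc"}),
    Node("hawk", 0.2, {"gcc", "render"}),
    Node("owl", 0.1, {"render"}),
]
best = pick_node(nodes, "gcc")
```

Here "owl" has the lowest load but does not accept gcc jobs, so "hawk" is picked. The decision is made once, at submission time, with whatever metrics are current.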

How and Where ?

While I originally built this software in my spare time, it was built for and is further maintained by my employer, Evalesco Systems ApS, the maker of the SysOrb Monitoring System.

The software is distributed under the terms of the GNU General Public License. Get the software here:

antsd-0.5.3.tar.gz (source package)
antsd_0.5.3-1_i386.deb (for Debian Woody)
ants-0.5.3.ebuild (ebuild for Gentoo)
There's also the information page and the sample setup page.
If you run into problems with this software, or have improvements or suggestions, send me an e-mail (jakob@unthought.net). I would be especially interested in hearing from anyone who would like to add ANTS awareness to the GNU Make utility. GNU Make already has hooks for this, and I believe it would be a fairly simple matter to integrate the two. I don't know the make code well enough to undertake this, and there are still things in ANTS I want to work on first.

You may also notice that in the source there are the vague beginnings of a GNOME applet for job monitoring. I haven't finished this one yet, but it will work to some extent in a vertical panel for now. If you're a GNOME panel-applet wizard and want to contribute to the applet, please do so.

Whom ?

The ANTS System is designed to be run in environments where you never know how many of the hosts in the cluster are available. It can be used on a cluster of workstations, where some workstations may be turned off (or booted into a non-networked OS).
The system will attempt to get metrics from and send job requests to other nodes in the cluster, but if a node doesn't respond, it is simply left out of consideration. No harm done.
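
A minimal sketch of this "skip whoever doesn't answer" behaviour, assuming a query function that raises an OSError (which includes timeouts) when a host is unreachable. The function names and the fake metric table are hypothetical stand-ins and do not reflect the actual antsd protocol.

```python
def collect_metrics(hosts, query):
    """Ask every host for its metrics; drop hosts that do not answer."""
    metrics = {}
    for host in hosts:
        try:
            metrics[host] = query(host)
        except OSError:
            # Host is off, unreachable, or timed out: leave it out.
            continue
    return metrics


def fake_query(host):
    # Stand-in for a real network call; "owl" plays a powered-off workstation.
    table = {"eagle": 0.9, "hawk": 0.2}
    if host not in table:
        raise TimeoutError(f"{host} did not respond")
    return table[host]


available = collect_metrics(["eagle", "owl", "hawk"], fake_query)
```

Only the hosts that answered end up in the result, so a later selection step never sees the missing workstation.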

News:

20 10 2004: Small error-code bugfixes and a signal bugfix. Fixed startup if NIS support was compiled in but no NIS was configured on the host system.
23 06 2004: Various networking bugfixes (stupid things). Added the ability for rant to send stdin to the remote process.
22 06 2004: Finally... A good number of bugfixes, environment passing, NIS netgroup support, and other niceties.
03 10 2001: Another small update - improved the error messages and error reporting capabilities.
01 10 2001: A small update - added the gANTS applet again, fixed a race in antsd that could cause output (stdout or stderr) to not appear in certain cases at job termination. Also fixed rant so that it now actually understands the options it claims to support.
11 03 2001: First release in a long time, with many changes since the last one. This is the first release under the ANTS name. ANTS no longer uses the shell for executing jobs, which makes it possible to compile the Linux kernel using ANTS.
10 06 2000: The jobd system has served me well already, and I'll be fixing some of the TODO items this weekend. However, the system will change its name to "ANTS", the Autonomous Networked Task Scheduler, due to a name clash with a quite similar project that existed long before this one.
06 06 2000: Initial release. The jobd system works for me, and although it has some shortcomings which will be addressed later, it is good enough for me right now.

Common problems:

I cannot run jobs: If rant exits with an error like:
[root@eagle /root]# rant -t gcc hostname
Remote antsd at 10.0.0.14 closed connection (Success).
then it's because you're running as the root user. ANTS will refuse to run jobs with root privileges. This is a security measure - you must run as a user with UID>=500.
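
If you wrap rant in your own scripts, a pre-check along these lines can give a clearer message than the one above. This is a hypothetical helper, not part of ANTS; the only thing taken from the text is the UID threshold of 500.

```python
import os

MIN_UID = 500  # threshold quoted in the text above


def may_run_jobs(uid, min_uid=MIN_UID):
    """True if a user with this UID would be allowed to submit jobs."""
    return uid >= min_uid


if __name__ == "__main__":
    uid = os.getuid()  # Unix only
    if may_run_jobs(uid):
        print(f"UID {uid}: OK to submit jobs")
    else:
        print(f"UID {uid}: below {MIN_UID}, ANTS is expected to refuse the job")
```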

I get NFS errors: Many older Linux kernels had problems with NFS. I got reasonably good results with Linux-2.4.2 and NFSv3. You should definitely run 2.2.18 or later in the 2.2 series, or 2.4.2 or later in the 2.4 series. You can experiment with NFS options such as rsize/wsize/sync. If nothing helps, contact the linux-kernel mailing-list and help the developers iron out your NFS problem.

I can't compile ANTS: Make sure you have a decent C++ compiler, preferably from the GCC 3.x series or later. ANTS is written in C++ for many reasons, and you'll probably need the compiler for something else anyway.

TODO:

My current TODO file:
* rash - remote shell (like rant but with default jobtype and tty handling)

* Distribute jobs from each user evenly among the nodes

* Much faster discovery of host failure should be done (should be trivial)
 
* Implement pxargs, a parallel (ants-enabled) xargs routine
 
* Total mem + BogoMIPS + MHz stats and a cute utility to show them
 
* jobd should unsubscribe when killed
 
* Strip home-dir part from cwd in rant, and prepend it again
  on final execution node.  This allows for different absolute
  paths as cwd on different hosts, if cwd is inside the user
  home directory (which is usually the case)
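
The last item, rewriting cwd relative to the home directory, could look roughly like this. It is a sketch of the idea only, since the feature is still on the TODO list; the function names and example paths are made up for illustration.

```python
import posixpath


def strip_home(cwd, home):
    """Return cwd relative to home, or None if cwd lies outside home."""
    cwd = posixpath.normpath(cwd)
    home = posixpath.normpath(home)
    if cwd == home:
        return "."
    if cwd.startswith(home + "/"):
        return cwd[len(home) + 1:]
    return None


def resolve_cwd(cwd, relative, remote_home):
    """Rebuild the working directory on the execution node."""
    if relative is None:
        return cwd  # outside the home directory: use the path unchanged
    return posixpath.normpath(posixpath.join(remote_home, relative))


# Submitting side: home is /home/jakob.
rel = strip_home("/home/jakob/src/ants", "/home/jakob")
# Execution side: the same user's home lives under /export/home/jakob.
remote = resolve_cwd("/home/jakob/src/ants", rel, "/export/home/jakob")
```

With these example paths, rel is "src/ants" and remote is "/export/home/jakob/src/ants". A cwd outside the home directory is passed through unchanged.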

Credits:

I would like to thank Erik Mouw for contributing some much-needed automake/autoconf clean-ups.
Others have contributed comments, suggestions and real-world ANTS usage stories. I like those, please keep 'em coming :)

Cry 'Havoc', and let slip the Dogs of War
- William Shakespeare, "Julius Caesar"