The ANTS Load Balancing System
Current version: 0.5.3 - October 20th, 2004
What?
The ANTS Load Balancing System is a piece of software that allows
jobs to be executed on computers connected in a network (e.g. a
Beowulf cluster). The node best suited for a given job at the time of
execution is chosen to run it.
This approach differs from that of traditional queue
systems. A job is not queued; it is executed immediately if any
suitable host (for the given job type) can be found. This makes the
system suitable for running a large number of small jobs, such as
compiler invocations. A traditional queue system will often spend too
much time managing its queues to allow tasks such as large-scale
compilations to gain much speedup from it.
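The selection idea described above can be sketched roughly as follows. This is only an illustration, not the actual antsd logic: the HostMetrics fields, the single "load" figure and the pick_host name are all assumptions made for the example, and the real metrics ANTS uses are not described here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class HostMetrics:
    """Hypothetical snapshot of one node (fields are assumptions)."""
    name: str
    load: float            # assumed metric: lower means less busy
    job_types: frozenset   # job types this node is willing to run


def pick_host(hosts: Iterable[HostMetrics], job_type: str) -> Optional[HostMetrics]:
    """Return the least loaded host that accepts job_type, or None.

    Nothing is queued: the decision is made once, at submission time.
    """
    candidates = [h for h in hosts if job_type in h.job_types]
    if not candidates:
        return None  # no suitable host right now
    return min(candidates, key=lambda h: h.load)


if __name__ == "__main__":
    cluster = [
        HostMetrics("eagle", 1.7, frozenset({"gcc"})),
        HostMetrics("hawk", 0.2, frozenset({"gcc", "render"})),
        HostMetrics("owl", 0.1, frozenset({"render"})),
    ]
    best = pick_host(cluster, "gcc")
    print(best.name if best else "no suitable host")
```

Because the choice is a single pass over the current metrics, the per-job overhead stays small, which is the property that matters for many short jobs such as compiler runs.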
How and Where?
While I originally built this software in my spare time, it was built for and is
further maintained by my employer, Evalesco
Systems ApS, the maker of the SysOrb
Monitoring System.
The software is distributed under the terms of the GNU General Public License. Get
the software here:
antsd-0.5.3.tar.gz (source package)
antsd_0.5.3-1_i386.deb (for Debian Woody)
ants-0.5.3.ebuild (ebuild for Gentoo)
There's also the information page and the sample setup page.
If you run into problems with this software, or have improvements or
suggestions, send me an e-mail
(jakob@unthought.net). I would be
especially interested in hearing from anyone who would like to add
ANTS awareness to the GNU Make utility. GNU Make already has hooks
for this, and I believe it would be a fairly simple matter to
integrate the two. I don't know the make code well enough to
undertake this, and there are still things in ANTS I want to work on
first.
You may also notice that in the source there are the vague beginnings
of a GNOME applet for job monitoring. I haven't finished this one yet,
but it will work to some extent in a vertical panel for now. If you're
a GNOME panel-applet wizard and want to contribute to the applet, please
do so.
Whom?
The ANTS System is designed to be run in environments where you never
know how many of the hosts in the cluster are available. It can be
used on a cluster of workstations, where some workstations may be
turned off (or booted into a non-networked OS).
The system will attempt to get metrics from and send job requests to
other nodes in the cluster, but if a node doesn't respond, it is simply
left out of consideration. No harm done.
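As a rough sketch of that behaviour, the loop below asks every known node for its metrics and silently drops the ones that fail to answer. The collect_metrics function and the query callback are hypothetical names used for illustration; they are not part of ANTS, and the real daemon's networking is not shown here.

```python
from typing import Callable, Dict, Iterable


def collect_metrics(hosts: Iterable[str],
                    query: Callable[[str], float]) -> Dict[str, float]:
    """Ask each host for a metric; skip hosts that do not respond.

    query is assumed to raise OSError (which includes TimeoutError and
    ConnectionRefusedError) when a node is switched off or unreachable.
    """
    metrics = {}
    for host in hosts:
        try:
            metrics[host] = query(host)
        except OSError:
            # Node is down or unreachable: leave it out, no harm done.
            continue
    return metrics


if __name__ == "__main__":
    def fake_query(host: str) -> float:
        # Stand-in for a network call; "hawk" pretends to be powered off.
        if host == "hawk":
            raise TimeoutError("no answer from " + host)
        return 0.5

    print(collect_metrics(["eagle", "hawk", "owl"], fake_query))
```

The set of usable nodes is rebuilt from whoever answers, so a workstation that is turned off simply never shows up as a candidate.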
News:
20 10 2004: Small error-code bugfixes and a signal bugfix.
Fixed startup if NIS support was compiled in but no NIS was configured
on the host system.
23 06 2004: Various networking bugfixes (stupid things). Added
the ability for rant to send stdin to the remote process.
22 06 2004: Finally... A good number of bugfixes, environment
passing, NIS netgroup support, and other niceties.
03 10 2001: Another small update - improved the error messages
and error reporting capabilities.
01 10 2001: A small update - added the gANTS applet again, fixed
a race in antsd that could cause output (stdout or stderr) to not appear
in certain cases at job termination. Also fixed rant so that it now
actually understands the options it claims to support.
11 03 2001: First release in a long time - many changes since last...
This is the first release under the ANTS name. ANTS no longer uses the
shell for executing jobs, which makes it possible to compile the Linux kernel
using ANTS.
10 06 2000: The jobd system has served me well already, and I'll
be fixing some of the TODO items this weekend. However, the system will
change its name to "ANTS", the Autonomous Networked Task Scheduler,
due to a name clash with a quite similar project that existed long before
this one.
06 06 2000: Initial release. The jobd system works for me, and
although it has some shortcomings which will be addressed later, it is
good enough for me right now.
Common problems:
I cannot run jobs: If rant exits with an error like:
[root@eagle /root]# rant -t gcc hostname
Remote antsd at 10.0.0.14 closed connection (Success).
then it's because you're running as the root user. ANTS will refuse
to run jobs with root privileges. This is a security measure - you
must run as a user with UID >= 500.
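If you wrap rant in your own scripts, a small pre-flight check along these lines can turn the confusing "closed connection" message into a clearer one. This is only a sketch: the 500 threshold is taken from the text above, while the may_run_jobs helper is a made-up name and is not part of ANTS.

```python
import os

MIN_JOB_UID = 500  # threshold stated above; ANTS refuses lower UIDs


def may_run_jobs(uid: int) -> bool:
    """Return True if a user with this UID should be allowed to submit jobs."""
    return uid >= MIN_JOB_UID


if __name__ == "__main__":
    uid = os.getuid()
    if may_run_jobs(uid):
        print("UID %d looks fine for submitting jobs" % uid)
    else:
        print("UID %d is below %d; switch to a regular user" % (uid, MIN_JOB_UID))
```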
I get NFS errors: Many older Linux kernels had problems with NFS. I got
reasonably good results with Linux-2.4.2 and NFSv3. You should definitely run
2.2.18 or later in the 2.2 series, or 2.4.2 or later in the 2.4 series. You
can experiment with NFS options such as rsize/wsize/sync. If nothing helps,
contact the linux-kernel mailing-list and help the developers iron out your NFS
problem.
I can't compile ANTS: Make sure you have a decent C++ compiler,
preferably the GCC 3.X series or later. ANTS is written in C++ for many
reasons, and you'll probably need the compiler for something else anyway.
TODO:
My current TODO file:
* rash - remote shell (like rant but with default jobtype and tty handling)
* Distribute jobs from each user evenly among the nodes
* Much faster discovery of host failure should be done (should be trivial)
* Implement pxargs, a parallel (ants-enabled) xargs routine
* Total mem + BogoMIPS + MHz stats and a cute utility to show them
* jobd should unsubscribe when killed
* Strip home-dir part from cwd in rant, and prepend it again
on final execution node. This allows for different absolute
paths as cwd on different hosts, if cwd is inside the user
home directory (which is usually the case)
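The last TODO item (home-relative working directories) could look something like the sketch below. It is not implemented in ANTS yet; the function names strip_home and prepend_home and the example paths are assumptions made purely for illustration.

```python
import posixpath
from typing import Optional


def strip_home(cwd: str, home: str) -> Optional[str]:
    """On the submitting host: return cwd relative to home.

    Returns None when cwd lies outside the home directory, in which
    case the absolute path would have to be used unchanged.
    """
    rel = posixpath.relpath(cwd, home)
    if rel == ".." or rel.startswith("../"):
        return None
    return rel


def prepend_home(rel: str, remote_home: str) -> str:
    """On the execution node: rebuild an absolute cwd under the local home."""
    return posixpath.normpath(posixpath.join(remote_home, rel))


if __name__ == "__main__":
    rel = strip_home("/home/jakob/src/ants", "/home/jakob")
    if rel is not None:
        # The execution node may keep home directories somewhere else.
        print(prepend_home(rel, "/export/home/jakob"))
```

Only the relative part travels with the job request, so the two hosts can mount home directories at different absolute locations.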
Credits:
I would like to thank Erik Mouw for contributing some much-needed
automake/autoconf clean-ups.
Others have contributed with comments, suggestions and real-world
ANTS usage stories - I like those, please keep 'em coming :)
Cry 'Havoc', and let slip the Dogs of War
- William Shakespeare, "Julius Caesar"