We ought to distribute information among the nodes and the scheduler on a regular basis, that would allow the scheduler to make smarter decisions about which groups of nodes should be assigned which jobs.
This information would be a set of data that describes which nodes are working, for how long they may be working, which ones are idle, and so on. This information is inaccurate (it is out-dated the second it is generated, because of the ever changing state of the cluster), so there will probably be a need to look into heuristics and the theory of planning under uncertainty. A node will not know what other nodes are idle or busy, but it may have a rough idea of the probabilities of the different nodes being idle or busy.
It is important that we do not ask the nodes about their status before submitting jobs from the scheduler. If this was done, we would take a performance hit from the network latency, especially this would be bad for large clusters if we asked 100 or more nodes whether they were idle or busy, before we decided what to do with a job.
A possible solution would be to submit the job to the nodes that have the highest probability of being idle. Somehow the number of jobs each node holds must be propagated, over time, to the entire cluster. If a node becomes idle, it could request the job from a node that accidentally got two jobs (because of the uncertainty when the scheduler distributed the jobs).
Another approach could be that idle nodes register with the scheduler immediately when they become idle. Then the scheduler would know exactly what nodes where idle, and it could make up a sub-cluster with the currently idle nodes, and delegate the new job to this sub-cluster.
If there are tasks enough, we could also just make sure that all nodes at all times have at least one, but preferably two runnable tasks. This would ensure, that while a node is waiting for input, or returning it's reply, it still has computation work to do. This is a simple way of making the node idle-time approach 0, and to make the network latency and bandwidth much less important. It requires that the data sets in the tasks held by the nodes fit in the memory of the nodes of course.
It will require a lot of testing and benchmarking, to come up with a scheme that will fit our needs both with respect to speed, and to information detail and accuracy. Maybe even different approaches will be needed for smaller and larger clusters.
Currently the scheduler knows what nodes exist in the cluster. It will delegate a job to an (arbitrarily selected) master node, and tell the master node to utilize all remaining nodes in the cluster, if possible, for the job. This is not the way things should be done, but it's a reasonable way to do things for now, since we can't schedule more than one job at a time anyway (see Section 2.3).