
Matrix multiplication

The second test is a simple repeated matrix multiplication. The code can be found in Appendix A.2.

The program simply multiplies each of two different $100\times 100$ matrices with a third matrix. This series of multiplications is repeated 100 times.

This time, I only tested MatLab and TONS.

Language              Time (seconds)
MatLab                9.2
TONS                  20.5
TONS$^2$ (two CPUs)   10.9

That looks more like a reasonable result from MatLab. The matrix multiplication I implemented is a very quick hack, and it is far from fast. Obviously, MatLab has efficient multiplication routines implemented, so even though their virtual machine, or interpreter, is slow as molasses, MatLab is twice as fast as TONS on one CPU.

Performance scales almost linearly as the extra CPU is taken into use. This is because the loop is transformed into two independent loops when the second CPU is added, and the two loops do the same amount of work: no node server has to wait for the other.

If we added a third CPU, it would never be taken into use. This code simply does not parallelize onto three or more CPUs with the current state of the TONS parallelizer. I do not see an easy way of parallelizing this program any further at the virtual machine opcode level without changing the order in which things happen, which we refrain from doing.

We could, however, at some point in the future implement parallel versions of the instructions, so that the matrix multiplication could run in parallel on several nodes whenever nodes were available. Two things should be noted about this. First, it is not ``automatic parallelization'' in the sense that the virtual machine code is parallelized; it is simply a matter of exchanging the actual implementations of the instructions with parallel ones. Secondly, implementing parallel matrix operations in C++ is well beyond the scope of this work. It is an area in which there has been a lot of research, and it should be fairly simple to plug in well-known efficient parallel matrix routines once the actual free-node/busy-node communication is in place.
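As a rough illustration of what such an exchanged instruction implementation could look like, here is a minimal sketch of a row-partitioned parallel multiply, using threads where the real work would use free node servers (the function name and partitioning scheme are assumptions for illustration, not anything from TONS):

```cpp
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Sketch of a parallel drop-in replacement for the multiply instruction.
// Result rows are partitioned across workers; each worker writes a
// disjoint set of rows, so no synchronization is needed on c.
Matrix parallel_multiply(const Matrix& a, const Matrix& b, unsigned workers) {
    const std::size_t n = a.size();
    Matrix c(n, std::vector<double>(n, 0.0));
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            // Worker w computes rows w, w + workers, w + 2*workers, ...
            for (std::size_t i = w; i < n; i += workers)
                for (std::size_t k = 0; k < n; ++k)
                    for (std::size_t j = 0; j < n; ++j)
                        c[i][j] += a[i][k] * b[k][j];
        });
    }
    for (auto& t : pool) t.join();
    return c;
}
```

With this in place, the virtual machine code itself stays sequential; only the instruction's implementation changes, which is exactly the distinction drawn above.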

