A short description of the hardware and software used to produce the benchmarks is in order:
Hardware | : | IBM Power2 RS/6000 |
C Compiler | : | IBM xlc |
Options | : | -qtune=pwr2 -O3 |
Fortran compiler | : | |
Options | : | -qhot -qtune=pwr2 -O3 |
The SUPERLU library was compiled using the supplied BLAS library. The IBM machines come with an optimized BLAS library (essl2), but that library gave worse performance than the BLAS supplied with SUPERLU. This is probably due to the very small size of the system.
On a 56 x 56 system with 11.7% non-zeros, the following results were achieved:
Results | OPTIMQR | SUPERLU |
Run time | 6.66 sec. | 26.6 sec. |
FLOPS rate | 93.16 MFLOPS | 17.13 MFLOPS |
Number of FLOPS | | |
Total instructions | | |
The numbers come from the rs2hpm hardware monitor supplied with the IBM computers. The runs were repeated several times to ensure correctness, and both benchmarks were run through the queue system to guarantee that the programs ran alone on the machine.
As expected, the number of FLOPS used by SUPERLU is lower than the number used by OPTIMQR. Yet OPTIMQR runs 4 times faster than SUPERLU. OPTIMQR could be expected to be faster, because it saves the work of pivoting rows every time the system is solved, but a factor of 4 is a large difference.
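The factor of 4 and the relative operation counts can be checked from the run times and FLOPS rates in the results table. The sketch below is only an illustration: it assumes that each reported rate is an average over the whole reported run time, which the text does not state, so the products are rough estimates rather than the counts measured by rs2hpm. The names `speedup` and `implied_mflop` are hypothetical helpers introduced for this example.

```python
# Figures copied from the results table in the text.
RUN_TIME_S = {"OPTIMQR": 6.66, "SUPERLU": 26.6}
RATE_MFLOPS = {"OPTIMQR": 93.16, "SUPERLU": 17.13}


def speedup(slow, fast):
    """Ratio of the run times: how many times faster `fast` is than `slow`."""
    return RUN_TIME_S[slow] / RUN_TIME_S[fast]


def implied_mflop(name):
    """Estimated floating-point operations, in millions, as rate x run time.

    ASSUMPTION: the MFLOPS rate is averaged over the full run time.
    This is an estimate, not a value reported by the hardware monitor.
    """
    return RATE_MFLOPS[name] * RUN_TIME_S[name]


if __name__ == "__main__":
    print(f"speedup of OPTIMQR over SUPERLU: {speedup('SUPERLU', 'OPTIMQR'):.2f}")
    for name in RUN_TIME_S:
        print(f"{name}: about {implied_mflop(name):.0f} million floating-point operations")
```

Under that assumption SUPERLU performs fewer floating-point operations than OPTIMQR and still takes about four times as long, which matches the observation above.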
Looking at the total number of instructions executed by the two programs, SUPERLU executes almost 10 times as many instructions as OPTIMQR. This can be explained by the fact that SUPERLU uses fairly sophisticated logic to decide when and how to pivot which rows. Because the system benchmarked here is so small, this "sophistication" kills SUPERLU's performance.
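To give a feeling for how small the benchmarked system is, the following sketch converts the stated dimensions and fill percentage into an approximate number of non-zero entries. The helper name `nonzero_count` is hypothetical, and because the 11.7% figure is itself rounded, the result is approximate.

```python
N = 56           # the system is N x N, as stated in the text
DENSITY = 0.117  # 11.7% non-zeros, as stated in the text


def nonzero_count(n, density):
    """Approximate number of non-zero entries in an n x n matrix."""
    return round(n * n * density)


if __name__ == "__main__":
    nnz = nonzero_count(N, DENSITY)
    print(f"about {nnz} non-zeros, roughly {nnz / N:.1f} per row")
```

With only a few hundred non-zero entries there is little arithmetic over which to amortize any per-solve bookkeeping, which is consistent with the explanation given above.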