If we look at the extremes for a moment, we can examine how the vmult() function from the previous example looks when it is translated for the CPU to execute. I passed the C code through an optimizing C compiler and fetched the assembly output from the compiler. Translated into real CPU instructions for my PC at home, the code becomes what is seen in Table 3.2.
Note that the tight loop performing the multiplication takes up eight instructions (the code between the two labels .L5 and .L4 above). Each of these eight instructions is executed 1000 times when the vmult(A,B,C,1000) call is made.
This is efficient because it runs directly on the CPU. But if we used a language such as Intel 80x86 assembly (like the code above) as the representation language in our virtual machine, we would end up with an extremely slow system, because every instruction carries a dispatch overhead on top of the work the instruction itself performs (also see section 2.4).
Remember, the virtual machine must first recognize the opcode, then jump to the routine that handles it, and only then does the CPU actually execute the machine-code instructions that perform the action the opcode dictates.
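As a minimal sketch of this dispatch overhead (the opcode names and handler bodies below are hypothetical, not the virtual machine's actual internals), an interpreter's inner loop might look like this:

\begin{verbatim}
#include <cstdint>

enum Opcode : uint8_t { OP_ADD, OP_MULT, OP_HALT };

void run(const uint8_t *code)
{
    for (;;) {
        uint8_t op = *code++;   /* overhead: fetch and recognize the opcode */
        switch (op) {           /* overhead: branch to the handling routine */
        case OP_ADD:
            /* ... machine code that actually performs the add ... */
            break;
        case OP_MULT:
            /* ... machine code that actually performs the mult ... */
            break;
        case OP_HALT:
            return;             /* stop interpreting */
        }
    }
}
\end{verbatim}

The fetch and the branch are pure bookkeeping: for a trivial virtual instruction they can easily cost more than the useful work in the handler itself.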
Performance is not the only reason it would be infeasible to feed the virtual machine code in the form of the assembly language from the previous example. Consider parallelizing such code: the C example given in the previous section may look bad, but the lower the level at which we see the code, the worse it gets.
The solution is to define a fairly large set of instructions that take care of these trivial tasks efficiently. For example, the mult instruction used for multiplying vectors will also perform matrix multiplication. The actual code that does the real work is written in C++ and does not look all that different from the C example presented earlier.
Again the virtual machine differs from a ``real'' (hardware) machine here. In hardware, it is extremely costly to implement a large instruction set; that is why today's high-performance hardware is moving away from CISC (complex/complete instruction set computing) towards RISC (reduced instruction set computing) architectures. In the virtual machine, however, a new instruction has no impact on efficiency (except for possible cache pollution due to the increased code size). Instructions are decoded using a technique similar to hash-table lookups, whose cost is independent of the number of instructions. Another megabyte of code in the virtual machine costs us essentially nothing in efficiency, but another million transistors is a whole other story for the hardware producers.
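To illustrate why decoding cost stays constant, here is a hedged sketch (the handler names and table layout are assumptions, not the implementation's own): with a table of handler pointers indexed by opcode, dispatch is one indexed load and one indirect call, however many entries the table holds.

\begin{verbatim}
#include <cstdint>

typedef void (*Handler)(void);

static void do_add(void)  { /* ... */ }
static void do_mult(void) { /* ... */ }

/* one entry per opcode; adding an instruction adds an entry here,
   without slowing down the dispatch of any other instruction */
static Handler dispatch_table[] = { do_add, do_mult /* , ... */ };

static void execute(uint8_t opcode)
{
    dispatch_table[opcode]();   /* cost independent of table size */
}
\end{verbatim}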
The current set of opcodes implemented in the virtual machine is listed in Table 3.2.4.
The arguments passed to the instructions are named dst (short for ``destination'') when the instruction will modify them, and src (short for ``source'') when they are only read.
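For instance, a handler for a three-argument instruction might be declared along these lines (a hypothetical signature, shown only to illustrate the naming convention):

\begin{verbatim}
struct Value;  /* placeholder for the virtual machine's value type */

/* dst is modified by the instruction; src1 and src2 are only read */
void op_add(Value &dst, const Value &src1, const Value &src2);
\end{verbatim}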
We will need a much larger set of opcodes, but the current set has been sufficient to implement a number of small test programs that were used for developing the automatic parallelization.
Depending on the opcode, the arguments can be scalars, vectors, matrices or tuples. For example, the mult opcode implements scalar integer and floating-point multiplication, element-wise vector multiplication, element-wise tuple multiplication, and matrix multiplication.
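As a hedged sketch of such type-directed dispatch (the type tags and handler structure below are assumptions, not the virtual machine's actual code), a single mult opcode can select the operation from its operands' runtime types:

\begin{verbatim}
enum class Type { Scalar, Vector, Matrix, Tuple };

struct Value {
    Type type;
    /* ... payload: scalar, vector, matrix or tuple data ... */
};

void op_mult(Value &dst, const Value &a, const Value &b)
{
    if (a.type == Type::Matrix && b.type == Type::Matrix) {
        /* matrix multiplication */
    } else if (a.type == Type::Vector && b.type == Type::Vector) {
        /* element-wise vector multiplication */
    } else if (a.type == Type::Tuple && b.type == Type::Tuple) {
        /* element-wise tuple multiplication */
    } else {
        /* scalar integer or floating-point multiplication */
    }
    (void)dst;  /* the result would be written to dst */
}
\end{verbatim}

The type check is paid once per virtual instruction, so a single mult on a long vector amortizes the dispatch overhead over the whole operation.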