Then why didn't the CPU designers simply make the LDW instruction take 5 clock cycles to begin with, rather than making the programmer insert 4 NOPs? The answer is that you can insert instructions other than NOPs, as long as those instructions do not use the result of the LDW instruction above. By doing this, the CPU can execute additional instructions while waiting for the result of the LDW instruction to become valid, which can greatly reduce the total execution time of the entire program.
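The saving can be illustrated with a toy cycle counter. The sketch below is a simplified model rather than TI's actual pipeline: it assumes one instruction issues per cycle (no parallel execute packets), and the register names and the three MV filler instructions are hypothetical. Unlike real hardware, which would silently read the stale register value, the model raises an error when a result is used too early.

```python
# Delay slots per opcode; LDW = 4 as described in the text, the rest assumed 0.
DELAY_SLOTS = {"LDW": 4, "ADD": 0, "MV": 0, "NOP": 0}

def cycles_needed(program):
    """Count issue cycles for a list of (opcode, dest, sources) tuples.

    Raises ValueError if an instruction reads a register whose new value
    is not valid yet (the real CPU would just read the old value).
    """
    ready_at = {}  # register -> first cycle in which its new value can be read
    for cycle, (op, dest, sources) in enumerate(program, start=1):
        for reg in sources:
            if ready_at.get(reg, 0) > cycle:
                raise ValueError(f"cycle {cycle}: {op} reads {reg} too early")
        if dest is not None:
            ready_at[dest] = cycle + DELAY_SLOTS[op] + 1
    return len(program)

NOP = ("NOP", None, [])

# Version 1: wait out the load with 4 NOPs, then do the unrelated work.
with_nops = [
    ("LDW", "A4", ["A10"]), NOP, NOP, NOP, NOP,
    ("ADD", "A6", ["A4", "A5"]),
    ("MV", "A7", ["A1"]), ("MV", "A8", ["A2"]), ("MV", "A9", ["A3"]),
]

# Version 2: move the unrelated MVs into the delay slots; one NOP remains.
filled = [
    ("LDW", "A4", ["A10"]),
    ("MV", "A7", ["A1"]), ("MV", "A8", ["A2"]), ("MV", "A9", ["A3"]),
    NOP,
    ("ADD", "A6", ["A4", "A5"]),
]

saved = cycles_needed(with_nops) - cycles_needed(filled)
```

In this model the same work takes 9 issue cycles with NOPs and 6 with the delay slots filled, so `saved` is 3.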
More on instructions with delay slots
Table 3-5 in TI's instruction set description shows the execution of the instructions with delay slots in more detail. The instructions with delay slots are the multiply (MPY, 1 delay slot), the load (LDB, LDW, 4 delay slots), and the branch (B, 5 delay slots) instructions.
The functional unit latency indicates how many clock cycles each instruction actually occupies a functional unit. All C62x instructions have a functional unit latency of 1, meaning that each functional unit is ready to execute the next instruction after 1 clock cycle, regardless of the delay slots of the instructions. Therefore, the following instructions are valid:
LDW .D1 *A10, A4
ADD .D1 A1, A2, A3
Although the LDW instruction has not yet loaded the A4 register with the correct value when the ADD is executed, the .D1 functional unit becomes available in the clock cycle right after the one in which the LDW is issued.
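The distinction between delay slots and functional unit latency can be sketched as a small conflict check. This is an illustrative model only: the schedule format (issue cycle, unit, opcode) is an assumption made for the example, and the check looks at unit occupancy alone, ignoring whether results are valid yet.

```python
FUNCTIONAL_UNIT_LATENCY = 1  # cycles an instruction occupies its unit (per the text)

def unit_conflicts(schedule):
    """Return (earlier_op, later_op, cycle) for every clash on a functional unit.

    schedule: list of (issue_cycle, unit, opcode) tuples (hypothetical format).
    Delay slots play no role here: only unit occupancy is checked.
    """
    busy = {}       # unit -> (last cycle in which the unit is occupied, opcode)
    conflicts = []
    for cycle, unit, op in sorted(schedule):
        if unit in busy and busy[unit][0] >= cycle:
            conflicts.append((busy[unit][1], op, cycle))
        busy[unit] = (cycle + FUNCTIONAL_UNIT_LATENCY - 1, op)
    return conflicts

# LDW in cycle 1 and ADD in cycle 2 on the same unit: no conflict,
# even though the load result is still 4 delay slots away.
back_to_back = unit_conflicts([(1, ".D1", "LDW"), (2, ".D1", "ADD")])

# Both instructions in the same cycle on .D1 would clash.
same_cycle = unit_conflicts([(1, ".D1", "LDW"), (1, ".D1", "ADD")])
```

Here `back_to_back` is empty while `same_cycle` reports one clash, matching the statement that .D1 is free again one cycle after the LDW is issued.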
To clarify the execution of instructions with delay slots, consider the following example of an LDW instruction. Assume A10 = 0x0100 and A2 = 1, and that the intent is to load A9 with a 32-bit word from memory while post-incrementing the address pointer A10 by one word (4 bytes) to 0x0104. The 3 MV instructions are not related to the LDW instruction. They do something else.
1 LDW .D1 *A10++[A2], A9
2 MV .L1 A10, A8
3 MV .L1 A1, A10
4 MV .L1 A1, A2
5 ...
We can ask several interesting questions at this point:
- What is the value loaded to A8? That is, in which clock cycle is the address pointer updated?
- Can we load the address offset register A2 before the LDW instruction finishes the actual loading?
- Is it legal to load to A10 before the first LDW finishes loading the memory content to A9? That is, can we change the address pointer before the 4 delay slots elapse?
The answers, in the same order, are:
- Although it takes an extra 4 clock cycles for the LDW instruction to load the memory content to A9, the address pointer and offset registers (A10 and A2) are read and updated in the clock cycle in which the LDW instruction is issued. Therefore, in line 2, A8 is loaded with the updated A10, that is, A10 = A8 = 0x104.
- Because the LDW reads the A10 and A2 registers in the first clock cycle, you are free to change these registers afterwards without affecting the operation of the first LDW.
- This was already answered above.
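The three answers can be checked against a small register-level model of the listing. This is a sketch under stated assumptions, not a cycle-accurate simulator: one instruction issues per cycle, the post-increment form is assumed to read memory at the pointer value before the update, the memory word is sampled at issue time for simplicity, and the value of A1 and the memory contents are made up for illustration. Two NOPs stand in for the "..." in line 5 so that the load result has time to arrive.

```python
LOAD_DELAY_SLOTS = 4  # LDW delay slots, as stated in the text

def run(program, regs, mem):
    """Execute one instruction per cycle; return a register snapshot per cycle."""
    pending = []   # (cycle in which the loaded value becomes readable, reg, value)
    history = []
    for cycle, instr in enumerate(program, start=1):
        # Loads whose delay slots have elapsed become visible in this cycle.
        for due, reg, value in pending:
            if due <= cycle:
                regs[reg] = value
        pending = [p for p in pending if p[0] > cycle]

        op = instr[0]
        if op == "LDW_POSTINC":          # models LDW *base++[offset], dest
            _, base, offset, dest = instr
            address = regs[base]         # pointer and offset are read at issue
            regs[base] = address + 4 * regs[offset]  # pointer updated at issue
            # Simplification: the memory word is sampled at issue time.
            pending.append((cycle + LOAD_DELAY_SLOTS + 1, dest, mem[address]))
        elif op == "MV":                 # models MV src, dest
            _, src, dest = instr
            regs[dest] = regs[src]
        # "NOP" changes nothing.
        history.append(dict(regs))
    return history

# Hypothetical initial state; A10 and A2 follow the example in the text.
regs = {"A1": 0x0200, "A2": 1, "A8": 0, "A9": 0, "A10": 0x0100}
mem = {0x0100: 0x1111AAAA, 0x0104: 0x2222BBBB}
program = [
    ("LDW_POSTINC", "A10", "A2", "A9"),  # line 1
    ("MV", "A10", "A8"),                 # line 2
    ("MV", "A1", "A10"),                 # line 3
    ("MV", "A1", "A2"),                  # line 4
    ("NOP",),
    ("NOP",),
]
history = run(program, regs, mem)
```

In this model `history[1]["A8"]` is 0x0104 after line 2, A9 still holds its old value through cycle 5, and the loaded word appears in A9 in cycle 6 even though A10 and A2 were overwritten in lines 3 and 4.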
The same reasoning holds for the MPY and B (when using a register as the branch address) instructions. MPY reads its source values in the first clock cycle and writes the multiplication result after the 2nd clock cycle. For B, the address pointer is read in the first clock cycle, and the actual branch takes effect only after its 5 delay slots have elapsed. Thus, after the first clock cycle, you are free to modify the source or the address pointer registers. For more details, refer to Table 3-5 in the instruction set description or read the description of the individual instruction.
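The branch case can be sketched in the same style. The model below is an assumption-laden illustration: instruction indices stand in for branch addresses, one instruction issues per cycle, only one branch is in flight at a time, and MVK is used as a generic "load constant into register" operation. It shows that the target is captured when B is issued, that the 5 delay-slot instructions still execute, and that overwriting the branch register in a delay slot does not change the destination.

```python
BRANCH_DELAY_SLOTS = 5  # B delay slots, as stated in the text

def run_branch(program, regs):
    """Return the list of instruction indices in the order they execute."""
    pc = 0
    in_flight = None   # (remaining delay slots, target index) of a pending branch
    trace = []
    while pc < len(program):
        op, *args = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if in_flight is not None:
            remaining, target = in_flight
            remaining -= 1
            if remaining == 0:           # last delay slot: jump afterwards
                next_pc, in_flight = target, None
            else:
                in_flight = (remaining, target)
        if op == "B":                    # target register is read at issue
            in_flight = (BRANCH_DELAY_SLOTS, regs[args[0]])
        elif op == "MVK":                # models MVK constant, dest
            regs[args[1]] = args[0]
        pc = next_pc
    return trace

branch_regs = {"B3": 7, "A0": 0}         # B3 holds the (index) branch target
branch_program = [
    ("B", "B3"),          # 0: branch issued, target 7 captured here
    ("MVK", 99, "B3"),    # 1: delay slot 1 overwrites B3; branch is unaffected
    ("NOP",),             # 2: delay slot 2
    ("NOP",),             # 3: delay slot 3
    ("NOP",),             # 4: delay slot 4
    ("NOP",),             # 5: delay slot 5
    ("MVK", 1, "A0"),     # 6: skipped by the branch
    ("MVK", 2, "A0"),     # 7: branch target
]
trace = run_branch(branch_program, branch_regs)
```

The resulting trace runs indices 0 through 5, skips 6, and lands on 7, so A0 ends up as 2 while B3 holds the overwritten value 99.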