# Comparing the 6600-derived architecture to the traditional register-renaming/OoO architecture

One critical difference between the 6600-derived architecture and
traditional register-renaming OoO speculative processors is that, in the
6600-derived architecture, writes to any one particular ISA-level register
max out at 1 per clock cycle (without special measures to improve that).
The register-renamed version can easily handle multiple such register
writes per clock cycle, since the register writes are spread out across
multiple physical registers.

(Note from lkcl: 6600 Reservation Stations *are* "register-renaming"
stations. Unlike in the Tomasulo Algorithm, they're just not given
"names", because Cray and Thornton solved a problem they didn't realise
everyone else would have. See [[tomasulo_transformation]] and
<http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-October/001050.html>.
However, further investigation shows that this may be WaW hazard related.)

The following diagrams assume that fetch, decode, branch
prediction, and register renaming can handle 4 instructions per clock
cycle (usual on Intel's processors for many generations). They also assume
that `ldu` can write the address register after 1 clock cycle of execution
and the destination register after 4 clock cycles of execution (this can be
achieved by splitting it into 2 separate micro-ops).

The following C program is used:

```C
#include <stdint.h>

void f(uint64_t *r3, uint64_t r4) {
    uint64_t ctr, r9;
    ctr = r4;                /* mtctr */
    do {
        r9 = *++r3;          /* ldu: update the address, then load */
        r9 += 100;           /* addi */
        *r3 = r9;            /* std */
    } while(--ctr != 0);     /* bdnz */
}
```

[See on Compiler Explorer](https://gcc.godbolt.org/z/hzf7d7)

It produces the following Power instructions (edited for style):

```
f:
    mtctr r4
.L2:
    ldu r9, 8(r3)
    addi r9, r9, 100
    std r9, 0(r3)
    bdnz .L2
    blr
```
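
To make the memory effect of the loop concrete, the following sketch restates `f` with explicit indexing. `f_indexed` is a hypothetical name introduced here and is not part of the original page. Because `*++r3` (and `ldu`) updates the pointer before loading, element 0 is never touched, and elements 1 through `r4` each have 100 added. The two forms agree only for `r4 >= 1`: with `r4 == 0`, the `do`/`while` in `f` wraps `ctr` around instead of exiting straight away.

```c
#include <stdint.h>

/* The function from the page, unchanged in behaviour. */
void f(uint64_t *r3, uint64_t r4) {
    uint64_t ctr, r9;
    ctr = r4;
    do {
        r9 = *++r3;
        r9 += 100;
        *r3 = r9;
    } while (--ctr != 0);
}

/* Hypothetical restatement with explicit indexing.
 * Equivalent to f() only when count >= 1. */
void f_indexed(uint64_t *base, uint64_t count) {
    for (uint64_t i = 1; i <= count; i++) {
        base[i] += 100;   /* base[0] is skipped on purpose */
    }
}
```

Calling either function on a 5-element array with a count of 4 leaves element 0 alone and adds 100 to the other four elements.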

## Register Renaming

Renamed hardware registers are named `h0`, `h1`, `h2`, ...

The syntax `ldu h7, 8(h5 -> h8)` means that the address is read from `h5` and the updated address is written to `h8`.

The register rename table starts out as follows:

| `r3` | `r4` |
|------|------|
| `h0` | `h1` |

| ISA-level instruction | Num | Renamed Instruction | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `mtctr r4` | #0 | `mtctr h1` | Fetch | Decode | Ex: Rd `h1` | Ex: Wr `ctr` | Retire | | | | | | | | | | |
| `ldu r9, 8(r3)` | #1 | `ldu h2, 8(h0 -> h3)` | Fetch | Decode | Ex: Rd `h0` | Ex: Wr `h3` | Ex | Ex: Wr `h2` | Retire | | | | | | | | |
| `addi r9, r9, 100` | #2 | `addi h4, h2, 100` | Fetch | Decode | Wait: `h2` | Wait: `h2` | Wait: `h2` | Ex: Rd `h2` | Ex: Wr `h4` | Retire | | | | | | | |
| `std r9, 0(r3)` | #3 | `std h4, 0(h3)` | Fetch | Decode | Wait: `h3` and `h4` | Wait: `h4` | Wait: `h4` | Wait: `h4` | Ex: Rd `h3` and `h4` | Ex | Ex | Retire | | | | | |
| `bdnz .L2` | #4 | `bdnz .L2` | | Fetch | Decode | Ex: Rd `ctr` | Ex: Wr `ctr` | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | | | | | |
| `ldu r9, 8(r3)` | #5 | `ldu h5, 8(h3 -> h6)` | | | Fetch | Decode | Ex: Rd `h3` | Ex: Wr `h6` | Ex | Ex: Wr `h5` | Wait: Retire | Retire | | | | | |
| `addi r9, r9, 100` | #6 | `addi h7, h5, 100` | | | Fetch | Decode | Wait: `h5` | Wait: `h5` | Wait: `h5` | Ex: Rd `h5` | Ex: Wr `h7` | Retire | | | | | |
| `std r9, 0(r3)` | #7 | `std h7, 0(h6)` | | | Fetch | Decode | Wait: `h6` and `h7` | Wait: `h7` | Wait: `h7` | Wait: `h7` | Ex: Rd `h6` and `h7` | Ex | Ex | Retire | | | |
| `bdnz .L2` | #8 | `bdnz .L2` | | | Fetch | Decode | Ex: Rd `ctr` | Ex: Wr `ctr` | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | | | |
| `ldu r9, 8(r3)` | #9 | `ldu h8, 8(h6 -> h9)` | | | | Fetch | Decode | Ex: Rd `h6` | Ex: Wr `h9` | Ex | Ex: Wr `h8` | Wait: Retire | Wait: Retire | Retire | | | |
| `addi r9, r9, 100` | #10 | `addi h10, h8, 100` | | | | Fetch | Decode | Wait: `h8` | Wait: `h8` | Wait: `h8` | Ex: Rd `h8` | Ex: Wr `h10` | Wait: Retire | Retire | | | |
| `std r9, 0(r3)` | #11 | `std h10, 0(h9)` | | | | Fetch | Decode | Wait: `h9` and `h10` | Wait: `h10` | Wait: `h10` | Wait: `h10` | Ex: Rd `h9` and `h10` | Ex | Ex | Retire | | |
| `bdnz .L2` | #12 | `bdnz .L2` | | | | Fetch | Decode | Ex: Rd `ctr` | Ex: Wr `ctr` | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | | |
| `ldu r9, 8(r3)` | #13 | `ldu h11, 8(h9 -> h12)` | | | | | Fetch | Decode | Ex: Rd `h9` | Ex: Wr `h12` | Ex | Ex: Wr `h11` | Wait: Retire | Wait: Retire | Retire | | |
| `addi r9, r9, 100` | #14 | `addi h13, h11, 100` | | | | | Fetch | Decode | Wait: `h11` | Wait: `h11` | Wait: `h11` | Ex: Rd `h11` | Ex: Wr `h13` | Wait: Retire | Retire | | |
| `std r9, 0(r3)` | #15 | `std h13, 0(h12)` | | | | | Fetch | Decode | Wait: `h12` and `h13` | Wait: `h13` | Wait: `h13` | Wait: `h13` | Ex: Rd `h12` and `h13` | Ex | Ex | Retire | |
| `bdnz .L2` | #16 | `bdnz .L2` | | | | | Fetch | Decode | Ex: Rd `ctr` | Ex: Wr `ctr` | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire | |
| `ldu r9, 8(r3)` | #17 | `ldu h14, 8(h12 -> h15)` | | | | | | Fetch | Decode | Ex: Rd `h12` | Ex: Wr `h15` | Ex | Ex: Wr `h14` | Wait: Retire | Wait: Retire | Retire | |
| `addi r9, r9, 100` | #18 | `addi h16, h14, 100` | | | | | | Fetch | Decode | Wait: `h14` | Wait: `h14` | Wait: `h14` | Ex: Rd `h14` | Ex: Wr `h16` | Wait: Retire | Retire | |
| `std r9, 0(r3)` | #19 | `std h16, 0(h15)` | | | | | | Fetch | Decode | Wait: `h15` and `h16` | Wait: `h16` | Wait: `h16` | Wait: `h16` | Ex: Rd `h15` and `h16` | Ex | Ex | Retire |
| `bdnz .L2` | #20 | `bdnz .L2` | | | | | | Fetch | Decode | Ex: Rd `ctr` | Ex: Wr `ctr` | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Wait: Retire | Retire |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
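
The hardware register numbers in the "Renamed Instruction" column follow a simple allocation rule, which can be sketched as below. This is a minimal illustration, not the project's implementation: `RenameTable`, `rename_table_init`, `rename_dest` and `rename_iteration` are hypothetical names, hardware registers are never freed (a real renamer would recycle them after retirement), and `ctr` is left unrenamed as in the table. The order in which `ldu` allocates its two destinations (loaded value first, updated address second) is chosen only to match the table's numbering.

```c
#include <stdio.h>

/* Only the ISA registers the loop body uses; ctr is not renamed here. */
enum { R3, R4, R9, NUM_ISA_REGS };

typedef struct {
    int map[NUM_ISA_REGS]; /* ISA register -> current hardware register */
    int next_free;         /* next never-used hardware register number */
} RenameTable;

/* Initial state from the text: r3 -> h0, r4 -> h1, r9 not yet mapped. */
RenameTable rename_table_init(void) {
    RenameTable t;
    t.map[R3] = 0;
    t.map[R4] = 1;
    t.map[R9] = -1;
    t.next_free = 2;
    return t;
}

/* Give an ISA destination register a fresh hardware register. */
int rename_dest(RenameTable *t, int isa_reg) {
    t->map[isa_reg] = t->next_free++;
    return t->map[isa_reg];
}

/* Rename one loop iteration (ldu, addi, std) and format it into out.
 * Sources are looked up before the destinations are re-mapped. */
int rename_iteration(RenameTable *t, char *out, size_t size) {
    int ldu_addr_src = t->map[R3];
    int ldu_dst = rename_dest(t, R9);      /* loaded value */
    int ldu_addr_dst = rename_dest(t, R3); /* updated address */
    int addi_src = t->map[R9];
    int addi_dst = rename_dest(t, R9);
    int std_val = t->map[R9];
    int std_addr = t->map[R3];
    return snprintf(out, size,
                    "ldu h%d, 8(h%d -> h%d)\n"
                    "addi h%d, h%d, 100\n"
                    "std h%d, 0(h%d)\n",
                    ldu_dst, ldu_addr_src, ldu_addr_dst,
                    addi_dst, addi_src, std_val, std_addr);
}
```

Calling `rename_iteration` twice on a freshly initialised table gives `ldu h2, 8(h0 -> h3)`, `addi h4, h2, 100`, `std h4, 0(h3)` and then `ldu h5, 8(h3 -> h6)`, `addi h7, h5, 100`, `std h7, 0(h6)`, matching instructions #1 to #3 and #5 to #7 in the table. Every write to `r9` lands in a different hardware register, which is why several of them can complete in the same clock cycle.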

## 6600-derived

Notice how, in the table below, the WaR Waits on `r9` cause 2 instructions to finish per cycle (5 micro-ops per 2 cycles) instead of the 4 per cycle for the Register Renaming version. This means the processor's resources will eventually fill up, limiting total throughput to 2 instructions/clock.

For the following table:

- It is assumed that `ldu` instructions are split into two micro-ops in the decode stage. For instruction #5, for example, the address computation is denoted "#5.a" and the memory read is denoted "#5.m".
- It is assumed that a mechanism is in place for forwarding from a FU's result latch to a waiting operation, without having to wait until the result can be written to the register file.
- "Av `r3`" denotes that the value to be written to `r3` is computed and is available for forwarding but can't yet be written to the register file.
- "SW: #4" denotes that the instruction is waiting on the shadow produced by instruction #4.
- "Rf #5:`r5`" denotes that the instruction reads instruction #5's new value for `r5` from its result latch through the forwarding mechanism.

| ISA-level instruction | Num | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `mtctr r4` | #0 | Fetch | Decode | Ex: Rd `r4` | Ex: Wr `ctr` | Finish | | | | | | | | | | | | | |
| `ldu r9, 8(r3)` | #1.a | Fetch | Decode | Ex: Rd `r3` | Ex: Av `r3` | SW: #1.m | Ex: Wr `r3` | Finish | | | | | | | | | | | |
| `ldu r9, 8(r3)` | #1.m | | Decode | Wait: #1.a | Ex | Ex | Ex: Wr `r9` | Finish | | | | | | | | | | | |
| `addi r9, r9, 100` | #2 | Fetch | Decode | Wait: #1.m | Wait: #1.m | Wait: #1.m | Ex: Rd `r9` | Ex: Wr `r9` | Finish | | | | | | | | | | |
| `std r9, 0(r3)` | #3 | Fetch | Decode | Wait: #1.a #2 | Wait: #2 | Wait: #2 | Wait: #2 | Ex: Rd `r3` `r9` | Ex | Ex | Finish | | | | | | | | |
| `bdnz .L2` | #4 | | Fetch | Decode | Ex: Rd `ctr` | Ex: Av `ctr` | SW: #3 | SW: #3 | SW: #3 | SW: #3 | Ex: Wr `ctr` | Finish | | | | | | | |
| `ldu r9, 8(r3)` | #5.a | | | Fetch | Decode | Ex: Rf #1.a:`r3` | Ex: Av `r3` | SW: #5.m | SW: #3 | SW: #3 | Ex: Wr `r3` | Finish | | | | | | | |
| `ldu r9, 8(r3)` | #5.m | | | | Decode | Wait: #5.a | Ex | Ex | Ex: Av `r9` | SW: #3 | Ex: Wr `r9` | Finish | | | | | | | |
| `addi r9, r9, 100` | #6 | | | Fetch | Decode | Wait: #5.m | Wait: #5.m | Wait: #5.m | Ex: Rf #5.m:`r9` | Ex: Av `r9` | WaR Wait: `r9` | Ex: Wr `r9` | Finish | | | | | | |
| `std r9, 0(r3)` | #7 | | | Fetch | Decode | Wait: #5.a #6 | Wait: #6 | Wait: #6 | Wait: #6 | Ex: Rf #5.a:`r3` #6:`r9` | Ex | Ex | Finish | | | | | | |
| `bdnz .L2` | #8 | | | Fetch | Decode | Ex: Rf #4:`ctr` | Ex: Av `ctr` | SW: #7 | SW: #7 | SW: #7 | SW: #7 | SW: #7 | Ex: Wr `ctr` | Finish | | | | | |
| `ldu r9, 8(r3)` | #9.a | | | | Fetch | Decode | Ex: Rf #5.a:`r3` | Ex: Av `r3` | SW: #9.m | SW: #7 | SW: #7 | SW: #7 | Ex: Wr `r3` | Finish | | | | | |
| `ldu r9, 8(r3)` | #9.m | | | | | Decode | Wait: #9.a | Ex | Ex | Ex: Av `r9` | SW: #7 | SW: #7 | Ex: Wr `r9` | Finish | | | | | |
| `addi r9, r9, 100` | #10 | | | | Fetch | Decode | Wait: #9.m | Wait: #9.m | Wait: #9.m | Ex: Rf #9.m:`r9` | Ex: Av `r9` | SW: #7 | WaR Wait: `r9` | Ex: Wr `r9` | Finish | | | | |
| `std r9, 0(r3)` | #11 | | | | Fetch | Decode | Wait: #9.a #10 | Wait: #10 | Wait: #10 | Wait: #10 | Ex: Rf #9.a:`r3` #10:`r9` | Ex | Ex | Finish | | | | | |
| `bdnz .L2` | #12 | | | | Fetch | Decode | Ex: Rf #8:`ctr` | Ex: Av `ctr` | SW: #11 | SW: #11 | SW: #11 | SW: #11 | SW: #11 | Ex: Wr `ctr` | Finish | | | | |
| `ldu r9, 8(r3)` | #13.a | | | | | Fetch | Decode | Ex: Rf #9.a:`r3` | Ex: Av `r3` | SW: #13.m | SW: #11 | SW: #11 | SW: #11 | Ex: Wr `r3` | Finish | | | | |
| `ldu r9, 8(r3)` | #13.m | | | | | | Decode | Wait: #13.a | Ex | Ex | Ex: Av `r9` | SW: #11 | SW: #11 | WaR Wait: `r9` | Ex: Wr `r9` | Finish | | | |
| `addi r9, r9, 100` | #14 | | | | | Fetch | Decode | Wait: #13.m | Wait: #13.m | Wait: #13.m | Ex: Rf #13.m:`r9` | Ex: Av `r9` | SW: #11 | WaR Wait: `r9` | WaR Wait: `r9` | Ex: Wr `r9` | Finish | | |
| `std r9, 0(r3)` | #15 | | | | | Fetch | Decode | Wait: #13.a #14 | Wait: #14 | Wait: #14 | Wait: #14 | Ex: Rf #13.a:`r3` #14:`r9` | Ex | Ex | Finish | | | | |
| `bdnz .L2` | #16 | | | | | Fetch | Decode | Ex: Rf #12:`ctr` | Ex: Av `ctr` | SW: #15 | SW: #15 | SW: #15 | SW: #15 | SW: #15 | Ex: Wr `ctr` | Finish | | | |
| `ldu r9, 8(r3)` | #17.a | | | | | | Fetch | Decode | Ex: Rf #13.a:`r3` | Ex: Av `r3` | SW: #17.m | SW: #15 | SW: #15 | SW: #15 | Ex: Wr `r3` | Finish | | | |
| `ldu r9, 8(r3)` | #17.m | | | | | | | Decode | Wait: #17.a | Ex | Ex | Ex: Av `r9` | SW: #15 | SW: #15 | WaR Wait: `r9` | WaR Wait: `r9` | Ex: Wr `r9` | Finish | |
| `addi r9, r9, 100` | #18 | | | | | | Fetch | Decode | Wait: #17.m | Wait: #17.m | Wait: #17.m | Ex: Rf #17.m:`r9` | Ex: Av `r9` | SW: #15 | WaR Wait: `r9` | WaR Wait: `r9` | WaR Wait: `r9` | Ex: Wr `r9` | Finish |
| `std r9, 0(r3)` | #19 | | | | | | Fetch | Decode | Wait: #17.a #18 | Wait: #18 | Wait: #18 | Wait: #18 | Ex: Rf #17.a:`r3` #18:`r9` | Ex | Ex | Finish | | | |
| `bdnz .L2` | #20 | | | | | | Fetch | Decode | Ex: Rf #16:`ctr` | Ex: Av `ctr` | SW: #19 | SW: #19 | SW: #19 | SW: #19 | SW: #19 | Ex: Wr `ctr` | Finish | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
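
In the table above, the "Ex: Wr `r9`" cells from instruction #5.m onwards fall in cycles 9, 10, 11, 12, 13, 14, 15 and 16: exactly one write to `r9` per clock. Since each loop iteration writes `r9` twice, that caps the loop at one iteration (4 instructions) per 2 cycles. The sketch below models only this effect and is a deliberate simplification: it assumes one iteration's results become ready per cycle and ignores shadows and forwarding, so its cycle numbers are not the table's. `last_r9_write_cycle` and `final_backlog` are hypothetical names.

```c
/* Cycle of the final r9 write after `iterations` loop iterations (>= 1),
 * when r9 accepts at most one write per cycle, in program order. */
long last_r9_write_cycle(long iterations) {
    long prev_write = -1; /* no r9 write has happened yet */
    for (long k = 0; k < iterations; k++) {
        /* Assumed: iteration k's ldu result is ready at cycle k,
         * and its addi result one cycle later. */
        long ldu_ready = k;
        long addi_ready = k + 1;
        /* Each write waits for its value and for the previous r9 write. */
        long ldu_write = ldu_ready > prev_write + 1 ? ldu_ready : prev_write + 1;
        long addi_write = addi_ready > ldu_write + 1 ? addi_ready : ldu_write + 1;
        prev_write = addi_write;
    }
    return prev_write;
}

/* Cycles the final addi result waited in its result latch for a write slot. */
long final_backlog(long iterations) {
    long addi_ready = iterations; /* (iterations - 1) + 1 */
    return last_r9_write_cycle(iterations) - addi_ready;
}
```

Under these assumptions the backlog grows by one cycle per iteration (`final_backlog(4)` is 3, `final_backlog(100)` is 99), which is the sense in which the processor's resources eventually fill up and throughput settles at 2 instructions/clock.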