# Register Files

Discussion:

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-June/008368.html>

These register files are required for POWER:

* Floating-point
* Integer
* Control and Condition Code Registers (CR0-7)
* SPRs (Special Purpose Registers)
* Fast Registers (CTR, LR, SRR0, SRR1 etc.)
* "State" Registers (CIA, MSR, SimpleV VL)

Source code:

* <https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;hb=HEAD>

For a GPU, the FP and Integer registers need to be a massive 128 x 64-bit.

# Regfile groups, Port Allocations and bit-widths

* INT regfile: 32x 64-bit with 4R1W
* SPR regfile: 1024x 64-bit (!) needs a "map" on that 1R1W
* CR regfile: 8x 4-bit with full 8R8W (for full 32-bit read/write)
    - CR0-7: 4-bit
* XER regfile: 2x 2-bit, 1x 1-bit with full 3R3W
    - CA(32) - 2-bit
    - OV(32) - 2-bit
    - SO - 1 bit
* FAST regfile: 5x 64-bit, full 3R2W (possibly greater)
    - LR: 64-bit
    - CTR: 64-bit
    - TAR: 64-bit
    - SRR0: 64-bit
    - SRR1: 64-bit
* STATE regfile: 3x 64-bit, 2R1W (possibly greater)
    - MSR: 64-bit
    - PC: 64-bit
    - SVSTATE: 64-bit

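The groups listed above can be captured as a small table in Python, which makes it easy to compare storage and port counts. This is only an illustrative sketch: the `RegfileGroup` class and its field names are hypothetical and are not taken from `regfiles.py`. XER is left out because its entries do not share a single width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegfileGroup:
    """Hypothetical summary record for one regfile group."""
    name: str
    entries: int
    width: int      # bits per entry
    rd_ports: int
    wr_ports: int

    @property
    def total_bits(self) -> int:
        return self.entries * self.width


# Figures copied from the list above (XER omitted: mixed entry widths).
GROUPS = [
    RegfileGroup("INT", 32, 64, 4, 1),
    RegfileGroup("SPR", 1024, 64, 1, 1),
    RegfileGroup("CR", 8, 4, 8, 8),
    RegfileGroup("FAST", 5, 64, 3, 2),
    RegfileGroup("STATE", 3, 64, 2, 1),
]

for g in GROUPS:
    print(f"{g.name:5} {g.total_bits:6} bits  {g.rd_ports}R{g.wr_ports}W")
```

Note that the 8 x 4-bit CR entries add up to exactly the 32 bits needed for a full CR read or write, which is why that file is given full 8R8W porting.
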
# Connectivity between regfiles and Function Units

The target for the first ASICs is a minimum of 4 32-bit FMACs per clock cycle.
If it is acceptable that this be achieved on sequentially-adjacent-numbered
registers, a significant reduction in the amount of regfile porting may be
achieved (down from 12R4W).

It does however require that the register file be broken into four
completely separate and independent quadrants, each with its own
separate and independent 3R1W (or 4R1W) ports.

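To see why sequentially-numbered registers allow the porting to drop, here is a minimal sketch. It assumes a quadrant layout that the text does not spell out: the split is taken to be by odd/even register number and by high/low 32-bit half, as the diagram's filename (`regfile_hilo_32_odd_even.png`) suggests. The function names and the bit assignment are hypothetical. Under that assumption, four consecutive 32-bit elements each land in a different quadrant, so each quadrant serves one FMAC (3 reads, 1 write) instead of one monolithic file serving all four (12R4W).

```python
from typing import List


def quadrant(regnum: int, hi: bool) -> int:
    """Quadrant (0..3) holding one 32-bit half of a 64-bit register.

    Assumed layout: bit 1 = odd/even register number, bit 0 = hi/lo half.
    """
    return ((regnum & 1) << 1) | int(hi)


def element_quadrants(first_reg: int, count: int = 4) -> List[int]:
    """Quadrants touched by `count` consecutive 32-bit elements."""
    quads = []
    for i in range(count):
        regnum = first_reg + i // 2   # two 32-bit elements per 64-bit register
        hi = bool(i % 2)              # even element = low half, odd = high half
        quads.append(quadrant(regnum, hi))
    return quads


print(element_quadrants(8))   # elements starting at an even register
print(element_quadrants(9))   # elements starting at an odd register
```
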
This then requires some Bus Architecture to connect and keep the pipelines
busy. The connectivity diagram further below shows the following:

* A single Dynamic PartitionedSignal capable 64-bit-wide pipeline is at the
  top left and top right.
* Multiple **pairs** of 32-bit Function Units (making up a 64-bit data
  path) connect, as "Concurrent Units", to each pipeline.
* The number of **pairs** of Function Units **must** match (or preferably
  exceed) the number of pipeline stages.
* Connected to each of the Operand and Result Ports on each Function Unit
  is a cyclic buffer.
* Read-operands may "cycle" to reach their destination.
* Write-operands may be "cycled" so as to pick an appropriate destination.
* **Independent** Common Data Buses, one for each Quadrant of the Regfile,
  connect between the Function Unit's cyclic buffers and the **global**
  cyclic buffers dedicated to that Quadrant.
* Within each Quadrant's global cyclic buffers, inter-buffer transfer ports
  allow for copies of regfile data to be transferred from write-side to
  read-side. This constitutes the entirety of what is known as an
  **Operand Forwarding Bus**.
* **Between** each Quadrant's global cyclic buffers, there exists a 4x4
  Crossbar that allows data to move (slowly, and if necessary) across
  Quadrants.

Notes:

* There is only **one** 4x4 crossbar (or, one for reads, one for writes?)
  and thus only **one** inter-Quadrant 32-bit-wide data path (total
  bandwidth 4x32 bits). These are to be shared by **five** groups of
  operand ports at each of the Quadrant Global Cyclic Buffers.
* The **only** way for register results and operands to cross over between
  quadrants of the regfile is that 4x4 crossbar. Because data transfer
  bandwidth is limited, the placement of an operation affects its
  completion time. Thus, given that the number of read operands exceeds
  the number of write operands, allocation of operations to Function Units
  should prioritise placing the operation where the "reads" may go straight
  through.
* As outlined in this comment <https://bugs.libre-soc.org/show_bug.cgi?id=296#10>,
  the infrastructure above can, by way of the cyclic buffers, cope with
  and automatically adapt between a *serial* delivery of operands and
  a *parallel* delivery of operands. Performance is not adversely affected
  by the serial delivery, although the latency of an FMAC is extended by
  3 cycles, because only one CDB is available to deliver operands.

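The placement rule in the second note can be illustrated with a simple cost function. This is a hypothetical sketch, not the project's allocator: `crossings` and `place` are invented names, and the cost model (count operands that must cross the crossbar, break ties in favour of fewer crossing reads) is an assumption made purely to illustrate "put the operation where the reads go straight through".

```python
from typing import Sequence, Tuple


def crossings(fu_quadrant: int,
              read_quadrants: Sequence[int],
              write_quadrants: Sequence[int]) -> Tuple[int, int]:
    """Return (total operands crossing quadrants, reads crossing quadrants)."""
    rd = sum(1 for q in read_quadrants if q != fu_quadrant)
    wr = sum(1 for q in write_quadrants if q != fu_quadrant)
    return (rd + wr, rd)


def place(read_quadrants: Sequence[int],
          write_quadrants: Sequence[int],
          n_quadrants: int = 4) -> int:
    """Pick the Function Unit quadrant with the fewest crossbar transfers.

    Ties are broken in favour of the quadrant with fewer crossing reads.
    """
    return min(range(n_quadrants),
               key=lambda q: crossings(q, read_quadrants, write_quadrants))


# An FMAC with two sources in quadrant 2, one in quadrant 1, result to quadrant 1.
print(place(read_quadrants=[2, 2, 1], write_quadrants=[1]))
```
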
Click on the image to expand it full-screen:

[[!img regfile_hilo_32_odd_even.png size="500px"]]

# Regspecs

* Source: <https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/regspec.py;hb=HEAD>

"Regspecs" is a term used for describing the relationship between register files,
register file ports, register widths, and the Computation Units that they connect
to.

Regspecs are defined, in Python, as tuples of three fields:

| Regfile name | CompUnit Record name | bit range register mapping |
| ---- | ---------- | ------------ |
| INT  | ra         | 0:3,5        |

Description of each heading:

* Regfile name: INT corresponds to the INTEGER file, CR to Condition Register etc.
* CompUnit Record name: in the Input or Output Record there will be a signal by
  this name. This field refers to that record signal, thus providing a sequential
  ordering for the fields.
* Bit range: this is specified as an *inclusive* range of the form "start:end"
  or just a single bit, "N". Multiple ranges may be specified, and are
  comma-separated.

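As an illustration of the bit-range format described above, the following sketch expands a range string into the list of bits it covers. The helper names `parse_bit_ranges` and `bit_width` are hypothetical and are not claimed to match anything in `regspec.py`; only the format (inclusive "start:end", single bit "N", comma-separated) comes from the description.

```python
from typing import List


def parse_bit_ranges(spec: str) -> List[int]:
    """Expand e.g. '0:3,5' into [0, 1, 2, 3, 5] (ranges are inclusive)."""
    bits: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if ":" in part:
            start, end = (int(x) for x in part.split(":"))
            bits.extend(range(start, end + 1))   # +1: end is inclusive
        else:
            bits.append(int(part))               # single bit "N"
    return bits


def bit_width(spec: str) -> int:
    """Number of bits selected by a bit-range string."""
    return len(parse_bit_ranges(spec))


print(parse_bit_ranges("0:3,5"), bit_width("0:63"))
```
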
Here is how they are used:

```
class CRInputData(IntegerData):
    regspec = [('INT', 'a', '0:63'),       # 64 bit range
               ('INT', 'b', '0:63'),       # 64 bit range
               ('CR', 'full_cr', '0:31'),  # 32 bit range
               ('CR', 'cr_a', '0:3'),      # 4 bit range
               ('CR', 'cr_b', '0:3'),      # 4 bit range
               ('CR', 'cr_c', '0:3')]      # 4 bit range
```

This tells us, when used by MultiCompUnit, that:

* CompUnit src reg 0 is from the INT regfile, is linked to CRInputData.a, 64-bit
* CompUnit src reg 1 is from the INT regfile, is linked to CRInputData.b, 64-bit
* CompUnit src reg 2 is from the CR regfile, is CRInputData.full\_cr, and 32-bit
* CompUnit src reg 3 is from the CR regfile, is CRInputData.cr\_a, and 4-bit
* CompUnit src reg 4 is from the CR regfile, is CRInputData.cr\_b, and 4-bit
* CompUnit src reg 5 is from the CR regfile, is CRInputData.cr\_c, and 4-bit

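The source register numbering above follows from the position of each tuple in the regspec list. A minimal sketch of that derivation, using a plain list copied from the CRInputData regspec (the `width` helper is hypothetical and handles only a single inclusive "start:end" range):

```python
# Input regspec copied from the CRInputData example.
regspec = [('INT', 'a', '0:63'),
           ('INT', 'b', '0:63'),
           ('CR', 'full_cr', '0:31'),
           ('CR', 'cr_a', '0:3'),
           ('CR', 'cr_b', '0:3'),
           ('CR', 'cr_c', '0:3')]


def width(bits: str) -> int:
    """Width of a single inclusive 'start:end' range."""
    start, end = (int(x) for x in bits.split(':'))
    return end - start + 1


# The list index is the CompUnit source register number.
for idx, (regfile, name, bits) in enumerate(regspec):
    print(f"src reg {idx}: {regfile} regfile, "
          f"CRInputData.{name}, {width(bits)}-bit")
```
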
Likewise there is a corresponding regspec for CROutputData. The two are combined
and associated with the Pipeline:

```
class CRPipeSpec(CommonPipeSpec):
    regspec = (CRInputData.regspec, CROutputData.regspec)
    opsubsetkls = CompCROpSubset
```

In this way the pipeline can be connected up to a generic, general-purpose class
(MultiCompUnit), which would otherwise know nothing about the details of the ALU
(Pipeline) that it is being connected to.

In addition, on the other side of the MultiCompUnit, the regspecs contain enough
information to be able to wire up batches of MultiCompUnits (now known, because
of their association with an ALU, as FunctionUnits), associating the MultiCompUnits
correctly with their corresponding Register File.

Note: there are two exceptions to the "generic-ness and abstraction"
where MultiCompUnit "knows nothing":

1. When the Operand Subset has a member "zero_a". This tells MultiCompUnit
   to create a multiplexer that, if operand.zero_a is set, will put **ZERO**
   into its first src operand (src_i[0]) and it will **NOT** put out a
   read request (**NOT** raise rd.req[0]) for that first register.
2. When the Operand Subset has a member "imm_data". This tells
   MultiCompUnit to create a multiplexer that, if operand.imm_data.ok is
   set, will copy operand.imm_data into its *second* src operand (src_i[1]).
   Further: that it will **NOT** put out a read request (**NOT** raise
   rd.req[1]) for that second register.

These should only be activated for INTEGER and Logical pipelines, and
the regspecs for them must note and respect the requirements: input
regspec[0] may *only* be associated with operand.zero_a, and input
regspec[1] may *only* be associated with operand.imm_data. The POWER9
Decoder and the actual INTEGER and Logical pipelines have these
expectations **specifically** hard-coded into them.
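
The two exceptions can be modelled in plain Python to make the behaviour concrete. This is a behavioural sketch only, not the MultiCompUnit hardware description: the `ImmData` and `Operand` classes, the `value` field and the `select_sources` function are hypothetical stand-ins for the Operand Subset, and the returned `rd_req` list stands in for the rd.req signals.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ImmData:
    """Hypothetical stand-in for operand.imm_data (a value plus an 'ok' flag)."""
    value: int = 0
    ok: bool = False


@dataclass
class Operand:
    """Hypothetical stand-in for an Operand Subset with both special members."""
    zero_a: bool = False
    imm_data: ImmData = field(default_factory=ImmData)


def select_sources(op: Operand,
                   regfile_values: List[int]) -> Tuple[List[int], List[bool]]:
    """Return (src operands, read requests) after applying both muxes."""
    src = list(regfile_values)
    rd_req = [True] * len(src)
    if op.zero_a:
        src[0] = 0                  # first operand forced to zero
        rd_req[0] = False           # no read request for register 0
    if op.imm_data.ok:
        src[1] = op.imm_data.value  # second operand replaced by the immediate
        rd_req[1] = False           # no read request for register 1
    return src, rd_req


# Example: "0 + immediate", so neither source register is read.
op = Operand(zero_a=True, imm_data=ImmData(value=42, ok=True))
print(select_sources(op, [111, 222]))
```

Only positions 0 and 1 are ever overridden, which mirrors the requirement that regspec[0] and regspec[1] be the entries associated with zero_a and imm_data respectively.
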