# Register Files

Discussion:

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-June/008368.html>

These register files are required for POWER:

* Floating-point
* Integer
* Control and Condition Code Registers (CR0-7)
* SPRs (Special Purpose Registers)
* Fast Registers (CTR, LR, SRR0, SRR1 etc.)
* "State" Registers (CIA, MSR, SimpleV VL)

Source code:

* register files: <https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;hb=HEAD>
* core.py: <https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/simple/core.py;hb=HEAD>
* priority picker: <https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD>
* all function units: <https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/compunits/compunits.py;hb=HEAD>
* ReservationStations2: <https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/concurrentunit.py;hb=HEAD>

For a GPU, the FP and Integer registers need to be a massive 128 x 64-bit.

Video walkthrough of regfile relationship to Function Units in core:
<https://youtu.be/7Th1b-jq40k>

[[!img core_regfiles_fus_pickers.jpg size="700x"]]

# Regfile groups, Port Allocations and bit-widths

* INT regfile: 32x 64-bit with 4R1W
* SPR regfile: 1024x 64-bit (!), which needs a "map"; 1R1W
* CR regfile: 8x 4-bit with full 8R8W (for full 32-bit read/write)
    - CR0-7: 4-bit
* XER regfile: 2x 2-bit, 1x 1-bit with full 3R3W
    - CA(32): 2-bit
    - OV(32): 2-bit
    - SO: 1-bit
* FAST regfile: 5x 64-bit, full 3R2W (possibly greater)
    - LR: 64-bit
    - CTR: 64-bit
    - TAR: 64-bit
    - SRR0: 64-bit
    - SRR1: 64-bit
* STATE regfile: 3x 64-bit, 2R1W (possibly greater)
    - MSR: 64-bit
    - PC: 64-bit
    - SVSTATE: 64-bit
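The allocations above can be tabulated to get a rough feel for the storage each regfile implies. The following is a minimal sketch in plain Python: the `Regfile` dataclass, its field names and the `summary` helper are illustrative only and are not taken from `regfiles.py`. XER is left out because its entries are of mixed width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regfile:
    name: str
    depth: int   # number of entries
    width: int   # bits per entry
    ports: str   # read/write port allocation, as listed above

    @property
    def bits(self) -> int:
        """Total storage in bits (depth x width)."""
        return self.depth * self.width


# Values copied from the list above; the structure itself is hypothetical.
REGFILES = [
    Regfile("INT", 32, 64, "4R1W"),
    Regfile("SPR", 1024, 64, "1R1W"),
    Regfile("CR", 8, 4, "8R8W"),
    Regfile("FAST", 5, 64, "3R2W"),
    Regfile("STATE", 3, 64, "2R1W"),
]


def summary():
    """Map each regfile name to its total storage in bits."""
    return {rf.name: rf.bits for rf in REGFILES}


if __name__ == "__main__":
    for name, bits in summary().items():
        print(f"{name:5s} {bits:6d} bits")
```

The SPR figure makes the "(!)" concrete: a fully populated 1024-entry file would dwarf every other regfile, which is presumably why a "map" is wanted.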

# Connectivity between regfiles and Function Units

The target for the first ASICs is a minimum of four 32-bit FMACs per clock cycle.
If it is acceptable that this be achieved on sequentially-adjacent-numbered
registers, a significant reduction in the amount of regfile porting may be
achieved (down from 12R4W).

It does however require that the register file be broken into four
completely separate and independent quadrants, each with its own
separate and independent 3R1W (or 4R1W) ports.
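To see why adjacent registers help: four FMACs, each with three reads and one write, need 12R4W on a monolithic regfile, but only 3R1W per quadrant if each FMAC's operands live in a different quadrant. The sketch below assumes one possible split, suggested only by the filename of the diagram further down (`regfile_hilo_32_odd_even.png`): odd/even register number combined with the HI/LO 32-bit half. The helpers `quadrant`, `elements` and `quadrants_used` are hypothetical and do not appear in the source code.

```python
def quadrant(regnum, hi):
    """Quadrant index 0..3 (assumed scheme).

    bit 0: odd (1) or even (0) register number
    bit 1: HI (1) or LO (0) 32-bit half of the 64-bit register
    """
    return (regnum & 1) | (int(hi) << 1)


def elements(start_reg, count):
    """Yield (regnum, hi) for `count` consecutive 32-bit elements,
    packed two per 64-bit register with the LO half first."""
    for i in range(count):
        yield start_reg + i // 2, bool(i % 2)


def quadrants_used(start_reg, count=4):
    """Quadrants touched by `count` sequential 32-bit elements."""
    return [quadrant(r, hi) for r, hi in elements(start_reg, count)]


if __name__ == "__main__":
    # Four sequential 32-bit elements, starting at an even or odd register.
    print(quadrants_used(8))
    print(quadrants_used(9))
```

Under this assumed scheme any four sequential 32-bit elements land in four distinct quadrants, whether the starting register is odd or even.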

This then requires some Bus Architecture to connect the quadrants to the
pipelines and keep the pipelines busy. The connectivity diagram (below) shows:

* A single Dynamic PartitionedSignal-capable 64-bit-wide pipeline sits at
  each of the top left and top right.
* Multiple **pairs** of 32-bit Function Units (making up a 64-bit data
  path) connect, as "Concurrent Units", to each pipeline.
* The number of **pairs** of Function Units **must** match (or preferably
  exceed) the number of pipeline stages.
* Connected to each of the Operand and Result Ports on each Function Unit
  is a cyclic buffer.
* Read-operands may "cycle" to reach their destination.
* Write-operands may be "cycled" so as to pick an appropriate destination.
* **Independent** Common Data Buses, one for each Quadrant of the Regfile,
  connect between the Function Units' cyclic buffers and the **global**
  cyclic buffers dedicated to that Quadrant.
* Within each Quadrant's global cyclic buffers, inter-buffer transfer ports
  allow for copies of regfile data to be transferred from write-side to
  read-side. This constitutes the entirety of what is known as an
  **Operand Forwarding Bus**.
* **Between** each Quadrant's global cyclic buffers, there exists a 4x4
  Crossbar that allows data to move (slowly, and if necessary) across
  Quadrants.

Notes:

* There is only **one** 4x4 crossbar (or, one for reads, one for writes?)
  and thus only **one** inter-Quadrant 32-bit-wide data path (total
  bandwidth 4x32 bits). These are to be shared by **five** groups of
  operand ports at each of the Quadrant Global Cyclic Buffers.
* The **only** way for register results and operands to cross over between
  quadrants of the regfile is that 4x4 crossbar. Because data transfer
  bandwidth is limited, the placement of an operation can adversely affect
  its completion time. Thus, given that read operands outnumber write
  operands, allocation of operations to Function Units should
  prioritise placing the operation where the "reads" may go straight
  through.
* As outlined in this comment <https://bugs.libre-soc.org/show_bug.cgi?id=296#10>,
  the infrastructure above can, by way of the cyclic buffers, cope with
  and automatically adapt between a *serial* delivery of operands and
  a *parallel* delivery of operands. Performance is not adversely
  affected by the serial delivery, although the latency of an FMAC is
  extended by 3 cycles, because only one CDB is available to deliver
  operands.
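The placement rule in the second note (put the operation where most of its reads are local) can be expressed as a small cost function. This is only a sketch of the idea, not the allocation logic used in the core: `pick_quadrant` and `crossbar_transfers` are hypothetical names, and quadrants are simply numbered 0 to 3.

```python
from collections import Counter


def pick_quadrant(read_quadrants, write_quadrant):
    """Choose the quadrant to execute in, favouring local reads.

    read_quadrants: quadrant index of each source operand
    write_quadrant: quadrant index of the destination register
    """
    counts = Counter(read_quadrants)
    best = max(counts.values())
    candidates = sorted(q for q, n in counts.items() if n == best)
    # Tie-break: prefer the destination's quadrant if it is equally good.
    if write_quadrant in candidates:
        return write_quadrant
    return candidates[0]


def crossbar_transfers(chosen, read_quadrants, write_quadrant):
    """Count operands and results that must cross the 4x4 crossbar."""
    remote_reads = sum(1 for q in read_quadrants if q != chosen)
    remote_write = int(write_quadrant != chosen)
    return remote_reads + remote_write


if __name__ == "__main__":
    reads, write = [2, 2, 0], 1   # e.g. an FMAC: three reads, one write
    q = pick_quadrant(reads, write)
    print(q, crossbar_transfers(q, reads, write))
```

A real allocator would also have to weigh Function Unit availability; the sketch only captures the "reads go straight through" preference.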

Click on the image to expand it full-screen:

[[!img regfile_hilo_32_odd_even.png size="900x"]]

# Regspecs

* Source: <https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/regspec.py;hb=HEAD>

"Regspecs" is a term used for describing the relationship between register files,
register file ports, register widths, and the Computation Units that they connect
to.

Each regspec entry is defined, in Python, as a tuple of three fields:

| Regfile name | CompUnit Record name | bit range register mapping |
| ---- | ---------- | ------------ |
| INT | ra | 0:3,5 |

Description of each heading:

* Regfile name: INT corresponds to the INTEGER file, CR to Condition Register etc.
* CompUnit Record name: the Input or Output Record contains a signal of this
  name. This field refers to that record signal, and the order of the entries
  gives the fields a sequential ordering.
* Bit range: this is specified as an *inclusive* range of the form "start:end"
  or just a single bit, "N". Multiple ranges may be specified, and are
  comma-separated.
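The bit-range syntax is simple enough to decode in a few lines. The sketch below is an illustration of the format as described above, not the code in `regspec.py`; `parse_bitrange` and `bitwidth` are hypothetical helper names.

```python
def parse_bitrange(spec):
    """Expand a bit-range string such as "0:3,5" into a list of bit numbers.

    Ranges are inclusive ("start:end"); a bare number is a single bit;
    multiple ranges are comma-separated.
    """
    bits = []
    for part in spec.split(","):
        part = part.strip()
        if ":" in part:
            start, end = (int(x) for x in part.split(":"))
            bits.extend(range(start, end + 1))  # +1: the end is inclusive
        else:
            bits.append(int(part))
    return bits


def bitwidth(spec):
    """Number of bits selected by a bit-range string."""
    return len(parse_bitrange(spec))


if __name__ == "__main__":
    print(parse_bitrange("0:3,5"))
    print(bitwidth("0:63"))
```

Because the range is inclusive, "0:63" selects 64 bits and "0:3,5" selects five bits (0 to 3, plus 5).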

Here is how they are used:

    class CRInputData(IntegerData):
        regspec = [('INT', 'a', '0:63'),      # 64 bit range
                   ('INT', 'b', '0:63'),      # 64 bit range
                   ('CR', 'full_cr', '0:31'), # 32 bit range
                   ('CR', 'cr_a', '0:3'),     # 4 bit range
                   ('CR', 'cr_b', '0:3'),     # 4 bit range
                   ('CR', 'cr_c', '0:3')]     # 4 bit range

This tells us, when used by MultiCompUnit, that:

* CompUnit src reg 0 is from the INT regfile, is CRInputData.a, and 64-bit
* CompUnit src reg 1 is from the INT regfile, is CRInputData.b, and 64-bit
* CompUnit src reg 2 is from the CR regfile, is CRInputData.full\_cr, and 32-bit
* CompUnit src reg 3 is from the CR regfile, is CRInputData.cr\_a, and 4-bit
* CompUnit src reg 4 is from the CR regfile, is CRInputData.cr\_b, and 4-bit
* CompUnit src reg 5 is from the CR regfile, is CRInputData.cr\_c, and 4-bit
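The list above follows mechanically from the order and contents of the regspec. As a hedged illustration, the snippet below copies the regspec into a standalone list and derives the same information; `CR_INPUT_REGSPEC`, `width` and `describe` are names invented for this example, and the real MultiCompUnit does considerably more than this.

```python
# Copy of the CRInputData regspec shown above, as plain data.
CR_INPUT_REGSPEC = [
    ('INT', 'a', '0:63'),
    ('INT', 'b', '0:63'),
    ('CR', 'full_cr', '0:31'),
    ('CR', 'cr_a', '0:3'),
    ('CR', 'cr_b', '0:3'),
    ('CR', 'cr_c', '0:3'),
]


def width(bitrange):
    """Bit width of an inclusive, comma-separated bit-range string."""
    total = 0
    for part in bitrange.split(","):
        if ":" in part:
            start, end = map(int, part.split(":"))
            total += end - start + 1
        else:
            total += 1
    return total


def describe(regspec):
    """Return (src index, regfile, record field, width) for each entry."""
    return [(idx, regfile, name, width(bits))
            for idx, (regfile, name, bits) in enumerate(regspec)]


if __name__ == "__main__":
    for idx, regfile, name, nbits in describe(CR_INPUT_REGSPEC):
        print(f"src reg {idx}: {regfile} regfile, field {name}, {nbits}-bit")
```

The source register index is simply the position of the entry in the list, which is why the ordering of a regspec matters.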

Likewise there is a corresponding regspec for CROutputData. The two are combined
and associated with the Pipeline:

    class CRPipeSpec(CommonPipeSpec):
        regspec = (CRInputData.regspec, CROutputData.regspec)
        opsubsetkls = CompCROpSubset

In this way the pipeline can be connected up to a generic, general-purpose class
(MultiCompUnit), which would otherwise know nothing about the details of the ALU
(Pipeline) that it is being connected to.

In addition, on the other side of the MultiCompUnit, the regspecs contain enough
information to be able to wire up batches of MultiCompUnits (now known, because
of their association with an ALU, as FunctionUnits), associating the MultiCompUnits
correctly with their corresponding Register File.

Note: there are two exceptions to the "generic-ness and abstraction"
where MultiCompUnit "knows nothing":

1. When the Operand Subset has a member "zero_a". This tells MultiCompUnit
   to create a multiplexer that, if operand.zero_a is set, will put **ZERO**
   into its first src operand (src_i[0]) and it will **NOT** put out a
   read request (**NOT** raise rd.req[0]) for that first register.
2. When the Operand Subset has a member "imm_data". This tells
   MultiCompUnit to create a multiplexer that, if operand.imm_data.ok is
   set, will copy operand.imm_data into its *second* src operand (src_i[1]).
   Further: that it will **NOT** put out a read request (**NOT** raise
   rd.req[1]) for that second register.

These should only be activated for INTEGER and Logical pipelines, and
the regspecs for them must note and respect the requirements: input
regspec[0] may *only* be associated with operand.zero_a, and input
regspec[1] may *only* be associated with operand.imm_data. The POWER9
Decoder and the actual INTEGER and Logical pipelines have these
expectations **specifically** hard-coded into them.
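The two exceptions amount to a pair of multiplexers in front of the first two source operands. The following is a behavioural model in plain Python, not the hardware description used by MultiCompUnit: the `Operand` dataclass, `select_sources` and the `read_reg` callback are assumptions made for illustration, and `imm_data = None` stands in for `imm_data.ok` being clear.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    zero_a: bool = False
    imm_data: Optional[int] = None   # None models imm_data.ok == 0


def select_sources(op, read_reg):
    """Model of the zero_a / imm_data muxes on src_i[0] and src_i[1].

    read_reg(i) is called only when a read request for source i would
    be raised, mirroring rd.req[i].
    Returns (src0, src1, rd_req).
    """
    rd_req = [not op.zero_a, op.imm_data is None]
    src0 = read_reg(0) if rd_req[0] else 0            # zero_a forces ZERO
    src1 = read_reg(1) if rd_req[1] else op.imm_data  # immediate replaces RB
    return src0, src1, rd_req


if __name__ == "__main__":
    regs = {0: 0x1234, 1: 0x5678}
    # Immediate form with zero_a set: neither register is read.
    print(select_sources(Operand(zero_a=True, imm_data=42), regs.__getitem__))
    # Plain register-register form: both registers are read.
    print(select_sources(Operand(), regs.__getitem__))
```

The point the model makes is that suppressing the read request matters as much as the substituted value: a regfile port that is never requested stays free for another Function Unit.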