1 [[!tag standards]]
2
3 # REMAP <a name="remap" />
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=143>
6 * see [[sv/propagation]] for a future way to apply
7 REMAP.
8
9 REMAP is an advanced form of Vector "Structure Packing" that
10 provides hardware-level support for commonly-used *nested* loop patterns.
13 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
14 from a linear form to a 2D or 3D transposed form, or "offset" to permit
15 arbitrary access to elements, independently on each Vector src or dest
16 register.
17
The primary use is Matrix Multiplication and the in-place reordering of
sequential data. Four SPRs are provided so that a single FMAC may be
used in a single loop to perform a 4x4 by 4x4 Matrix multiplication,
generating 64 FMACs. Additional uses include regular "Structure Packing"
such as RGB pixel data extraction and reforming.
23
24 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
25 Vector ISAs which would typically only have a limited set of instructions
26 that can be structure-packed (LD/ST typically), REMAP may be applied to
27 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
28
29 Note that REMAP does not apply to sub-vector elements: that is what
30 swizzle is for. Swizzle *can* however be applied to the same instruction
31 as REMAP.
32
In its general form, REMAP is quite expensive to set up, and on some
implementations introduces latency, so it should realistically be used
only where it is worthwhile.
36 Commonly-used patterns such as Matrix Multiply, DCT and FFT have
37 helper instruction options which make REMAP easier to use.
38
39 # Principle
40
41 * normal vector element read/write as operands would be sequential
42 (0 1 2 3 ....)
43 * this is not appropriate for (e.g.) Matrix multiply which requires
44 accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this. Both are expensive (copying large vectors, spilling through memory)
47 * REMAP **redefines** the order of access according to set "Schedules".
* The Schedules are not necessarily restricted to power-of-two boundaries,
  making it unnecessary to have, for example, specialised 3x4 transpose
  instructions.
51
52 Only the most commonly-used algorithms in computer science have REMAP
53 support, due to the high cost in both the ISA and in hardware.
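
The Principle above can be sketched in a few lines of Python (a
hypothetical helper, not part of the specification), generating a
"transposed" Schedule, including for non-power-of-two shapes:

```python
# Sketch: generate a "transposed" access Schedule for a matrix
# stored row-major with xdim columns and ydim rows.  The element
# indices come out column-by-column instead of row-by-row, with
# no Indexed-MV copy or LD/ST spill through memory required.
def transpose_schedule(xdim, ydim):
    return [y * xdim + x for x in range(xdim) for y in range(ydim)]

# 3x3 reproduces the sequence quoted above: 0 3 6 1 4 7 2 5 8
print(transpose_schedule(3, 3))
# a non-power-of-two 3x4 shape needs no special instruction
print(transpose_schedule(4, 3))
```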
54
55 # REMAP SPR
56
| 0..1 | 2..3 | 4..5 | 6..7 | 8..9 | 10..14 | 15..23 |
| ---- | ---- | ---- | ---- | ---- | ------ | ------ |
| mi0  | mi1  | mi2  | mo0  | mo1  | SVme   | rsvd   |
60
mi0-2 and mo0-1 each select one of SVSHAPE0-3 to apply to a given register.
mi0-2 apply to the input registers RA, RB and RC respectively, and
likewise mo0-1 apply to the output registers (FRT and FRS respectively).
SVme is 5 bits and indicates whether each
SVSHAPE is actively applied or not.
66
67 * bit 0 of SVme indicates if mi0 is applied to RA / FRA
68 * bit 1 of SVme indicates if mi1 is applied to RB / FRB
69 * bit 2 of SVme indicates if mi2 is applied to RC / FRC
70 * bit 3 of SVme indicates if mo0 is applied to RT / FRT
71 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS
72 (LD/ST-with-update has an implicit 2nd write register, RA)
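
The five enable bits can be decoded with a short sketch (Python; the
helper name is hypothetical):

```python
# Sketch: decode the SVme enable mask from the REMAP SPR.
# Bit 0 enables remapping of RA/FRA, bit 1 RB/FRB, bit 2 RC/FRC,
# bit 3 RT/FRT, bit 4 EA/FRS (the implicit LD/ST-update write).
OPERANDS = ["RA/FRA", "RB/FRB", "RC/FRC", "RT/FRT", "EA/FRS"]

def decode_svme(svme):
    return [name for bit, name in enumerate(OPERANDS)
            if (svme >> bit) & 1]

# enable remap on the RA source and the RT destination: bits 0 and 3
print(decode_svme(0b01001))
```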
73
74 There is also a corresponding SVRM-Form for the svremap
75 instruction which matches the above SPR:
76
| 0  | 6    | 11  | 13  | 15  | 17  | 19  | 21  | 22   | 26 | 31 |
| -- | ---- | --- | --- | --- | --- | --- | --- | ---- | -- | -- |
| PO | SVme | mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO | /  |
79
80 # SHAPE 1D/2D/3D vector-matrix remapping SPRs
81
82 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
83 which have the same format.
84
85 [[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]
86
87 The algorithm below shows how REMAP works more clearly, and may be
88 executed as a python program:
89
90 ```
91 [[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
92 ```
93
94 An easier-to-read version (using python iterators) shows the loop nesting:
95
96 ```
97 [[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
98 ```
99
100 Each element index from the for-loop `0..VL-1`
101 is run through the above algorithm to work out the **actual** element
102 index, instead. Given that there are four possible SHAPE entries, up to
103 four separate registers in any given operation may be simultaneously
104 remapped:
105
```
function op_add(rd, rs1, rs2)  # add not VADD!
  int id=0, irs1=0, irs2=0;
  for (i = 0; i < VL; i++)
    xSTATE.srcoffs = i             # save context
    if (predval & 1<<i)            # predication uses intregs
       ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                              ireg[rs2+remap3(irs2)];
       if (!int_vec[rd ].isvector) break;
    if (int_vec[rd ].isvector)  { id   += 1; }
    if (int_vec[rs1].isvector)  { irs1 += 1; }
    if (int_vec[rs2].isvector)  { irs2 += 1; }
```
118
119 By changing remappings, 2D matrices may be transposed "in-place" for one
120 operation, followed by setting a different permutation order without
121 having to move the values in the registers to or from memory.
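
This can be illustrated with a simplified yield-based iterator in the
spirit of the inlined remapyield.py (a sketch only: the actual file
differs), reading a 2D set of registers transposed, in-place:

```python
# Sketch: a 2D remap iterator.  permute_yx=True swaps the loop
# nesting so a row-major matrix is read column-by-column.
def iterate_indices(xdim, ydim, permute_yx=True):
    if permute_yx:
        for x in range(xdim):
            for y in range(ydim):
                yield y * xdim + x
    else:
        for y in range(ydim):
            for x in range(xdim):
                yield y * xdim + x

# a 2x4 matrix held row-major in eight "registers"
regs = [10, 11, 12, 13, 20, 21, 22, 23]
transposed = [regs[i] for i in iterate_indices(4, 2)]
print(transposed)  # columns read first: 10, 20, 11, 21, ...
```

No data moves: only the order in which the register indices are
generated changes.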
122
123 Note that:
124
125 * Over-running the register file clearly has to be detected and
126 an illegal instruction exception thrown
127 * When non-default elwidths are set, the exact same algorithm still
128 applies (i.e. it offsets *polymorphic* elements *within* registers rather
129 than entire registers).
130 * If permute option 000 is utilised, the actual order of the
131 reindexing does not change. However, modulo MVL still occurs
132 which will result in repeated operations (use with caution).
133 * If two or more dimensions are set to zero, the actual order does not change!
134 * The above algorithm is pseudo-code **only**. Actual implementations
135 will need to take into account the fact that the element for-looping
136 must be **re-entrant**, due to the possibility of exceptions occurring.
137 See SVSTATE SPR, which records the current element index.
138 Continuing after return from an interrupt may introduce latency
139 due to re-computation of the remapped offsets.
* Twin-predicated operations require **two** separate and distinct
  element offsets. The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped. *This even includes unit-strided LD/ST*
  and other operations in that category, in which case it is the
  **offset** that is remapped.
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register. Without offsets, it is necessary
  either to use a predicated MV, skipping the first elements, or
  to perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
152 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
153 less than MVL is **perfectly legal**, albeit very obscure. It permits
154 entries to be regularly presented to operands **more than once**, thus
155 allowing the same underlying registers to act as an accumulator of
156 multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements, *including polymorphic ones*.
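
The modulo-wrap behaviour noted above (total remapped elements smaller
than VL) can be sketched as follows (hypothetical helper):

```python
# Sketch: when the total shape size (xdim * ydim) is less than VL,
# the remapped index wraps modulo that total, so the same elements
# are presented repeatedly -- usable, deliberately, to accumulate
# several vector operations into the same underlying registers.
def remap_mod(i, xdim, ydim):
    return i % (xdim * ydim)   # wrap: elements repeat past the end

VL = 8
indices = [remap_mod(i, 2, 2) for i in range(VL)]
print(indices)  # each of the four elements is presented twice
```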
160
161 Clearly here some considerable care needs to be taken as the remapping
162 could hypothetically create arithmetic operations that target the
163 exact same underlying registers, resulting in data corruption due to
164 pipeline overlaps. Out-of-order / Superscalar micro-architectures with
165 register-renaming will have an easier time dealing with this than
166 DSP-style SIMD micro-architectures.
167
168 ## svstate instruction
169
170 Please note: this is **not** intended for production. It sets up
171 (overwrites) all required SVSHAPE SPRs and indicates that the
172 *next instruction* shall have those REMAP shapes applied to it,
173 assuming that instruction is of the form FRT,FRA,FRC,FRB.
174
175 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
176
| 0..5 | 6..10 | 11..15 | 16..20 | 21..24 | 25 | 26..30 | 31 | name    |
| ---- | ----- | ------ | ------ | ------ | -- | ------ | -- | ------- |
| OPCD | SVxd  | SVyd   | SVzd   | SVRM   | vf | XO     | /  | svstate |
180
181
182 Fields:
183
* **SVxd** - SV REMAP "xdim"
* **SVyd** - SV REMAP "ydim"
* **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT)
* **vf** - sets "Vertical-First" mode
* **XO** - standard 5-bit XO field
189
190 # 4x4 Matrix to vec4 Multiply Example
191
192 The following settings will allow a 4x4 matrix (starting at f8), expressed
193 as a sequence of 16 numbers first by row then by column, to be multiplied
194 by a vector of length 4 (starting at f0), using a single FMAC instruction.
195
196 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
197 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
198 * VL=16, f4=vec, f0=vec, f8=vec
199 * FMAC f4, f0, f8, f4
200
201 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
202 four iterations through the hardware loop, the REMAPed index will not
203 increment. On the second four, the index will increase by one. Likewise
204 on each subsequent group of four.
205
206 The permutation on SHAPE1 will increment f4 continuously cycling through
207 f4-f7 every four iterations of the hardware loop.
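
The two index streams just described can be written out directly
(plain Python sketch):

```python
# Sketch: the per-iteration register offsets in the 4x4 example.
# SHAPE0 (permute=yx) holds the f0 offset for four iterations at a
# time; SHAPE1 (permute=xy, ydim=1) cycles the f4 offset 0..3.
VL = 16
shape0 = [i // 4 for i in range(VL)]  # f0 offset: 0 x4, 1 x4, ...
shape1 = [i % 4 for i in range(VL)]   # f4..f7, repeating
print(shape0)
print(shape1)
```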
208
At the same time, because there is no SHAPE on f8, the f8 operand will
step straight sequentially through the 16 values f8-f23 of the Matrix.
The equivalent sequence thus issued is:
212
```
fmac f4, f0, f8, f4
fmac f5, f0, f9, f5
fmac f6, f0, f10, f6
fmac f7, f0, f11, f7
fmac f4, f1, f12, f4
fmac f5, f1, f13, f5
fmac f6, f1, f14, f6
fmac f7, f1, f15, f7
fmac f4, f2, f16, f4
fmac f5, f2, f17, f5
fmac f6, f2, f18, f6
fmac f7, f2, f19, f7
fmac f4, f3, f20, f4
fmac f5, f3, f21, f5
fmac f6, f3, f22, f6
fmac f7, f3, f23, f7
```
229
230 The only other instruction required is to ensure that f4-f7 are
231 initialised (usually to zero).
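
The expansion can be cross-checked with a small simulation (a sketch:
the FP register file is modelled as a dict, and the column-major
placement of the matrix values in f8-f23 is an assumption made to match
the expanded sequence above):

```python
# Sketch: execute the 16-FMAC expansion on a modelled register file
# and check that f4-f7 end up holding matrix times vec4.
f = {i: 0.0 for i in range(32)}
vec = [1.0, 2.0, 3.0, 4.0]
mat = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
for i, v in enumerate(vec):
    f[0 + i] = v                    # vec4 in f0-f3
for c in range(4):                  # assumed layout: f[8+4c+r]
    for r in range(4):              # holds mat[r][c]
        f[8 + 4 * c + r] = float(mat[r][c])
for i in range(16):                 # the hardware loop, VL=16
    rt = 4 + (i % 4)                # SHAPE1: cycles f4-f7
    ra = 0 + (i // 4)               # SHAPE0: holds for four iterations
    rb = 8 + i                      # no SHAPE: sequential f8-f23
    f[rt] = f[ra] * f[rb] + f[rt]   # fmac rt, ra, rb, rt
result = [f[4 + r] for r in range(4)]
expected = [sum(mat[r][c] * vec[c] for c in range(4)) for r in range(4)]
print(result, expected)
```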
232
233 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
234 the same technique applied to four independent vectors, can be done by
235 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
236 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
237 it to apply four times to compute the four columns worth of vectors.
238
239 # Warshall transitive closure algorithm
240
241 TODO move to [[sv/remap/discussion]] page, copied from here
242 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
243
244 with thanks to Hendrik.
245
246 > Just a note: interpreting + as 'or', and * as 'and',
247 > operating on Boolean matrices,
248 > and having result, X, and Y be the exact same matrix,
249 > updated while being used,
250 > gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.
252
253 this can be done with the ternary instruction which has
254 an in-place triple boolean input:
255
256 RT = RT | (RA & RB)
257
258 and also has a CR Field variant of the same
259
260 notes from conversations:
261
> > for y in y_r:
> >   for x in x_r:
> >     for z in z_r:
> >       result[y][x] +=
> >         a[y][z] *
> >         b[z][x]
268
269 > This nesting of loops works for matrix multiply, but not for transitive
270 > closure.
271
> > it can be done:
> >
> > for z in z_r:
> >   for y in y_r:
> >     for x in x_r:
> >       result[y][x] +=
> >         a[y][z] *
> >         b[z][x]
280 >
281 > And this ordering of loops *does* work for transitive closure, when a,
282 > b, and result are the very same matrix, updated while being used.
283 >
284 > By the way, I believe there is a graph algorithm that does the
285 > transitive closure thing, but instead of using boolean, "and", and "or",
286 > they use real numbers, addition, and minimum.  I think that one computes
287 > shortest paths between vertices.
288 >
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).
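
The loop ordering described above can be demonstrated directly (Python
sketch, using booleans with `or` as + and `and` as *):

```python
# Sketch: Warshall transitive closure with the bridging index z as
# the outermost loop; result, a and b are the very same matrix,
# updated in place while being used.
def transitive_closure(m):
    n = len(m)
    for z in range(n):          # outer loop on the bridging subscript
        for y in range(n):
            for x in range(n):
                m[y][x] = m[y][x] or (m[y][z] and m[z][x])
    return m

# edges 0->1 and 1->2; closure must add 0->2
g = [[False, True,  False],
     [False, False, True ],
     [False, False, False]]
print(transitive_closure(g))
```

The inner update is exactly the in-place ternary form `RT = RT | (RA & RB)`
quoted above.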
295
296 # SUBVL Remap
297
Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
sub-vector elements were simply part of the main VL loop. This is the
*complete opposite* of predication, which **only** applies to the whole
vec2/3/4. In pseudocode this would be:
302
```
for (i = 0; i < VL; i++)
    if (predval & 1<<i)           # apply to VL not SUBVL
        for (j = 0; j < SUBVL; j++)
            id = i*SUBVL + j      # note: not "id=i"
            ireg[RT+remap1(id)] ...
```
308
The reason for allowing SUBVL Remaps is that some regular patterns using
Swizzle, which would otherwise require multiple explicit instructions
with 12-bit swizzles encoded in them, may be efficiently encoded with Remap
instead. Note however that Swizzle is *still permitted to be applied*.
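
The flattening of the sub-vector loop into individual element ids can be
sketched as follows (hypothetical helper):

```python
# Sketch: SUBVL Remap flattens the sub-vector loop into the main
# element loop, so every sub-element gets its own remappable id
# (id = i*SUBVL + j), unlike predication which sees only i.
def subvl_ids(VL, SUBVL):
    return [i * SUBVL + j for i in range(VL) for j in range(SUBVL)]

# two vec4 elements: eight individually-remappable ids
print(subvl_ids(2, 4))
```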
313
314 An example where SUBVL Remap is appropriate is the Rijndael MixColumns
315 stage:
316
317 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
318
319 Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
320 a 2D REMAP allows:
321
322 * the column bytes (as a vec4) to be iterated over as an inner loop,
323 progressing vertically (`a00 a10 a20 a30`)
324 * the columns themselves to be iterated as an outer loop
325 * a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
326
This is achieved entirely in-place, without special 128-bit opcodes.
Below is the pseudocode for [[!wikipedia Rijndael MixColumns]]:
329
330 ```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here,
    // byte-level elwidth overrides and vec4.
    // These may then each be 4x 8-bit Swizzled:
    //   r0.vec4 = b.vec4
    //   r0.vec4 ^= a.vec4.WXYZ
    //   r0.vec4 ^= a.vec4.ZWXY
    //   r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
357 ```
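
The C routine above can be cross-checked with a direct Python
transcription against the widely-published MixColumns test vector
`db 13 53 45 -> 8e 4d a1 bc` (a sketch, independent of any SV features):

```python
# Sketch: Python transcription of gmix_column() above.
def gmix_column(r):
    a = list(r)
    # multiply by 2 in GF(256), Rijndael polynomial 0x1b
    b = [((v << 1) ^ (0x1b if v & 0x80 else 0)) & 0xff for v in r]
    return [
        b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1],
        b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2],
        b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3],
        b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0],
    ]

print([hex(v) for v in gmix_column([0xdb, 0x13, 0x53, 0x45])])
```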
358
With the assumption made by the above code that the column bytes have
already been turned around (vertical rather than horizontal), SUBVL.REMAP
may transparently fill that role, in-place, without a complex byte-level
mv operation.

The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32-bit XOR operations where a
scalar processor would be required to perform 16 individual byte-level
operations. Given wide enough SIMD backends in hardware, these XORs
may be done as single-cycle operations across the entire 128-bit
Rijndael Matrix.
370
371 The other alternative is to simply perform the actual 4x4 GF(256) Matrix
372 Multiply using the MDS Matrix.
373
374 # TODO
375
376 investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
377 in https://bugs.libre-soc.org/show_bug.cgi?id=653