1 [[!tag standards]]
2
3 # REMAP <a name="remap" />
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=143>
6 * see [[sv/propagation]] for a future way to apply
7 REMAP.
8
9 REMAP is an advanced form of Vector "Structure Packing" that
10 provides hardware-level support for commonly-used *nested* loop patterns.
11 For more general reordering an Indexed REMAP mode is available.
12
13 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
14 from a linear form to a 2D or 3D transposed form, or "offset" to permit
15 arbitrary access to elements, independently on each Vector src or dest
16 register.
17
18 The initial primary motivation of REMAP was for Matrix Multiplication, reordering of sequential
19 data in-place. Four SPRs are provided so that a single FMAC may be
20 used in a single loop to perform 4x4 times 4x4 Matrix multiplication,
21 generating 64 FMACs. Additional uses include regular "Structure Packing"
22 such as RGB pixel data extraction and reforming.
23
24 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
25 Vector ISAs which would typically only have a limited set of instructions
26 that can be structure-packed (LD/ST typically), REMAP may be applied to
27 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
28
29 Note that REMAP does not apply to sub-vector elements: that is what
30 swizzle is for. Swizzle *can* however be applied to the same instruction
31 as REMAP.
32
In its general form, REMAP is quite expensive to set up, and on some
implementations may introduce latency, so it should realistically be
used only where it is worthwhile.
Commonly-used patterns such as Matrix Multiply, DCT and FFT have
helper instruction options which make REMAP easier to use.
38
39 There are three types of REMAP:
40
41 * **Matrix**, also known as 2D and 3D reshaping
42 * **FFT/DCT**, with full triple-loop in-place support: limited to
43 Power-2 RADIX
44 * **Indexing**, for any general-purpose reordering. Currently
45 under development.
46
47 # Principle
48
49 * normal vector element read/write of operands would be sequential
50 (0 1 2 3 ....)
51 * this is not appropriate for (e.g.) Matrix multiply which requires
52 accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this. Both are expensive (copy large vectors, spill through memory)
  and very few Packed SIMD ISAs cope with non-Power-2.
56 * REMAP **redefines** the order of access according to set "Schedules".
57 * The Schedules are not necessarily restricted to power-of-two boundaries
58 making it unnecessary to have for example specialised 3x4 transpose
59 instructions.
60
61 Only the most commonly-used algorithms in computer science have REMAP
62 support, due to the high cost in both the ISA and in hardware. For
63 arbitrary remapping the `Indexed` REMAP may be used.
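
As a minimal sketch of the principle above (not the Schedule algorithm
itself), the alternative access sequence `0 3 6 1 4 7 ...` can be
reproduced in software by walking a row-major grid column by column.
The function name and its `xdim`/`ydim` arguments are illustrative
assumptions, and the dimensions need not be powers of two:

```python
def transposed_schedule(xdim, ydim):
    """Yield indices of a row-major ydim-by-xdim grid, column by column."""
    for x in range(xdim):          # outer loop: one column at a time
        for y in range(ydim):      # inner loop: walk down the column
            yield y * xdim + x     # row-major offset of element (y, x)


# 3x3 grid: elements are visited as 0 3 6 1 4 7 2 5 8
print(list(transposed_schedule(3, 3)))
```

REMAP performs the equivalent reordering in hardware, so no Indexed-MV
or trip through memory is needed to obtain the sequence.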
64
65 # Executive Summary Usage
66
67 * `svshape` to set the type of reordering to be applied to the
68 usual `0..VL-1` hardware for-loop
69 * `svremap` to set which registers a given reordering is to apply to
70 (RA, RT etc)
* `sv.instruction` where any Vectorised register marked by `svremap`
  will have its ordering REMAPPED according to the schedule set
  by `svshape`.
74
75 The following illustrative example multiplies a 3x4 and a 5x3
76 matrix to create
77 a 5x4 result:
78
    svshape 5, 4, 3, 0, 0
    svremap 31, 1, 2, 3, 0, 0, 0, 0
    sv.fmadds 0.v, 8.v, 16.v, 0.v
82
83 The example may be executed as a unit test and demo,
84 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
85
86 # REMAP types
87
88 This section summarises the motivation for each REMAP Schedule
89 and briefly goes over their characteristics and limitations.
90
91 ## Matrix (1D/2D/3D shaping)
92
93 TODO
94
95 ## FFT/DCT Triple Loop
96
97 TODO
98
99 ## Indexed
100
101 The purpose of Indexing is to provide a generalised version of
102 Vector ISA "Permute" instructions, such as VSX `vperm`. The
103 Indexing is abstracted out and may be applied to much more
104 than an element move/copy, and is not limited for example
105 to the number of bytes that can fit into a VSX register.
106 Indexing may be applied to LD/ST (even on Indexed LD/ST
107 instructions such as `sv.lbzx`), arithmetic operations,
108 extsw: there is no artificial limit.
109
110 The only major caveat is that the registers to be used as
111 Indices must not be modified by any instruction after Indexed Mode
112 is established, and neither must MAXVL be altered. Failure to observe
113 these conditions results in `UNDEFINED` behaviour.
114 These conditions allow a Read-After-Write (RAW) Hazard to be created on
115 the entire range of Indices to be subsequently used, but a corresponding
116 Write-After-Read Hazard by any instruction that modifies the Indices
117 **does not have to be created**. Given the large number of registers
118 involved in Indexing this is a huge resource saving and reduction
119 in micro-architectural complexity. MAXVL is likewise
120 included in the RAW Hazards because it is involved in calculating
121 how many registers are to be considered Indices.
122
123
124 # REMAP SPR
125
| 0  | 2  | 4  | 6  | 8  | 10..14 | 15..23 |
| -- | -- | -- | -- | -- | ------ | ------ |
|mi0 |mi1 |mi2 |mo0 |mo1 | SVme   | rsv    |
129
mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers (FRT, FRS respectively).
SVme is 5 bits, and indicates whether the
SVSHAPE is actively applied or not.
135
136 * bit 0 of SVme indicates if mi0 is applied to RA / FRA
137 * bit 1 of SVme indicates if mi1 is applied to RB / FRB
138 * bit 2 of SVme indicates if mi2 is applied to RC / FRC
139 * bit 3 of SVme indicates if mo0 is applied to RT / FRT
140 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS
141 (LD/ST-with-update has an implicit 2nd write register, RA)
142
143 # svremap instruction
144
There is also a corresponding SVRM-Form for the svremap
instruction which matches the above SPR:

    svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
149
150 |0 |6 |11 |13 |15 |17 |19 |21 | 22 |26 |31 |
151 | -- | -- | -- | -- | -- | -- | -- | -- | ---- | ----- |-- |
152 | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO | / |
153
154 # SHAPE Remapping SPRs
155
There are four "shape" SPRs, SHAPE0-3, of 32 bits each,
which have the same format.
158
159 [[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]
160
161 # svshape instruction
162
163 `svshape` is a convenience instruction that reduces instruction
164 count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
165 (overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
166 including VL and MAXVL. Using `svshape` therefore does not also
167 require `setvl`.
168
169 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
170
    svshape SVxd,SVyd,SVzd,SVRM,vf

| 0..5 | 6..10 | 11..15 | 16..20 | 21..24 | 25 | 26..31 | name    |
| --   | --    | ---    | -----  | ------ | -- | ------ | ------- |
| OPCD | SVxd  | SVyd   | SVzd   | SVRM   | vf | XO     | svshape |

Fields:

* **SVxd** - SV REMAP "xdim"
* **SVyd** - SV REMAP "ydim"
* **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT etc.)
* **vf** - sets "Vertical-First" mode
* **XO** - standard 6-bit XO field
185
186 | SVRM | Remap Mode description |
187 | -- | -- |
188 | 0b0000 | Matrix 1/2/3D |
189 | 0b0001 | FFT Butterfly |
190 | 0b0010 | DCT Inner butterfly, pre-calculated coefficients |
191 | 0b0011 | DCT Outer butterfly |
192 | 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
193 | 0b0101 | DCT COS table index generation |
194 | 0b0110 | DCT half-swap |
195 | 0b0111 | reserved |
196 | 0b1000 | reserved |
197 | 0b1001 | reserved |
198 | 0b1010 | iDCT Inner butterfly, pre-calculated coefficients |
199 | 0b1011 | iDCT Outer butterfly |
200 | 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
201 | 0b1101 | iDCT COS table index generation |
202 | 0b1110 | iDCT half-swap |
203 | 0b1111 | FFT half-swap |
204
Examples showing how all of these Modes operate exist in the online
[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD).
207
208 In Indexed Mode, there are only 5 bits available to specify the GPR
209 to use, out of 128 GPRs (7 bit numbering). Therefore, only the top
210 5 bits are given in the `SVxd` field: the bottom two implicit bits
211 will be zero (`SVxd || 0b00`).
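
The expansion `SVxd || 0b00` amounts to a left shift by two, so only
every fourth GPR can be named directly. A minimal sketch follows; the
function name is an illustrative assumption, not part of the
specification:

```python
def indexed_gpr(svxd):
    """Expand the 5-bit SVxd field to a 7-bit GPR number (SVxd || 0b00)."""
    if not 0 <= svxd < 32:
        raise ValueError("SVxd is a 5-bit field")
    return svxd << 2               # the two implicit low bits are zero


# SVxd=1 selects GPR 4; the largest encodable value, 31, selects GPR 124
print(indexed_gpr(1), indexed_gpr(31))
```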
212
213 `svshape` has *limited applicability* due to being a 32-bit instruction.
214 The full capability of SVSHAPE SPRs may be accessed by directly writing
215 to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
216 larger than 32, and in-place Transpose. Potentially a future v3.1 Prefixed
217 instruction, `psvshape`, may extend the capability here.
218
219 # svindex instruction
220
221 `svindex` is a convenience instruction that reduces instruction
222 count for Indexed REMAP Mode. It sets up
223 (overwrites) all required SVSHAPE SPRs and can modify the REMAP
224 SPR as well.
225
226 Form: SVI-Form SV "Indexed" Form (see [[isatables/fields.text]])
227
    svindex RS,mask,SVd,ew,yx,mm,sk

| 0..5 | 6..10 | 11..15 | 16..20 | 21..25      | 26..31 | name     |
| --   | --    | ---    | ----   | ----------- | ------ | -------- |
| OPCD | RS    | mask   | SVd    | ew/yx/mm/sk | XO     | svindex  |
233
234 Fields:
235
236 * **SVd** - SV REMAP x/y dim
237 * **mask** - sets remap mi0-2/mo0-1 and SVSHAPEs, controlled by mm
238 * **ew** - sets element width override
239 * **RS** - GPR RS<<2 to be used for Indexing
240 * **yx** - 2D reordering to be used if yx=1
241 * **mm** - mask mode. determines how mask is interpreted.
242 * **sk** - Dimension skipping enabled
243 * **XO** - standard 6-bit XO field
244
245 When `mm=0`:
246
* bit 0 corresponds to mi0, bit 1 to mi1, bit 2 to mi2,
  bit 3 to mo0 and bit 4 to mo1
* all SVSHAPEs and the REMAP SPR are first reset (initialised to zero)
* for each bit set in the 5-bit mask, in order, the first
  as-yet-unset SVSHAPE will be updated
  with the other operands in the instruction, and the REMAP
  SPR set.
* If all 5 bits of mask are set then both mi0 and mo1 use SVSHAPE0.

Example 1: if mask=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
uses SVSHAPE1.
259
260 Example 2: if mask=0b10001 then again SVSHAPE0 and SVSHAPE1
261 are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
262 and mo1 uses SVSHAPE1.
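
The `mm=0` allocation rule can be sketched as follows. This is a
hedged software model only: the names `REMAP_FIELDS` and
`allocate_svshapes` are invented for illustration, mask bit 0 is taken
to be the least significant bit (which is what Example 1 implies), and
the wrap-around to SVSHAPE0 is inferred solely from the rule for all
five bits being set.

```python
REMAP_FIELDS = ("mi0", "mi1", "mi2", "mo0", "mo1")   # mask bits 0..4


def allocate_svshapes(mask):
    """Map each REMAP field selected by mask to an SVSHAPE number."""
    assignment = {}
    next_shape = 0
    for bit, field in enumerate(REMAP_FIELDS):
        if mask & (1 << bit):
            # only four SVSHAPEs exist: a fifth selection reuses SVSHAPE0
            assignment[field] = next_shape % 4
            next_shape += 1
    return assignment


print(allocate_svshapes(0b00110))   # Example 1: mi1 and mi2
print(allocate_svshapes(0b10001))   # Example 2: mi0 and mo1
```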
263
264
265 # REMAP Matrix pseudocode
266
267 The algorithm below shows how REMAP works more clearly, and may be
268 executed as a python program:
269
```
[[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
```
273
274 An easier-to-read version (using python iterators) shows the loop nesting:
275
```
[[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
```
279
280 Each element index from the for-loop `0..VL-1`
281 is run through the above algorithm to work out the **actual** element
282 index, instead. Given that there are four possible SHAPE entries, up to
283 four separate registers in any given operation may be simultaneously
284 remapped:
285
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                                  ireg[rs2+remap3(irs2)];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
298
299 By changing remappings, 2D matrices may be transposed "in-place" for one
300 operation, followed by setting a different permutation order without
301 having to move the values in the registers to or from memory.
302
303 Note that:
304
305 * Over-running the register file clearly has to be detected and
306 an illegal instruction exception thrown
307 * When non-default elwidths are set, the exact same algorithm still
308 applies (i.e. it offsets *polymorphic* elements *within* registers rather
309 than entire registers).
310 * If permute option 000 is utilised, the actual order of the
311 reindexing does not change. However, modulo MVL still occurs
312 which will result in repeated operations (use with caution).
313 * If two or more dimensions are set to zero, the actual order does not change!
314 * The above algorithm is pseudo-code **only**. Actual implementations
315 will need to take into account the fact that the element for-looping
316 must be **re-entrant**, due to the possibility of exceptions occurring.
317 See SVSTATE SPR, which records the current element index.
318 Continuing after return from an interrupt may introduce latency
319 due to re-computation of the remapped offsets.
320 * Twin-predicated operations require **two** separate and distinct
321 element offsets. The above pseudo-code algorithm will be applied
322 separately and independently to each, should each of the two
323 operands be remapped. *This even includes unit-strided LD/ST*
324 and other operations
325 in that category, where in that case it will be the **offset** that is
326 remapped.
327 * Offset is especially useful, on its own, for accessing elements
328 within the middle of a register. Without offsets, it is necessary
329 to either use a predicated MV, skipping the first elements, or
330 performing a LOAD/STORE cycle to memory.
331 With offsets, the data does not have to be moved.
332 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
333 less than MVL is **perfectly legal**, albeit very obscure. It permits
334 entries to be regularly presented to operands **more than once**, thus
335 allowing the same underlying registers to act as an accumulator of
336 multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements *including polymorphic ones*.
340
341 Clearly here some considerable care needs to be taken as the remapping
342 could hypothetically create arithmetic operations that target the
343 exact same underlying registers, resulting in data corruption due to
344 pipeline overlaps. Out-of-order / Superscalar micro-architectures with
345 register-renaming will have an easier time dealing with this than
346 DSP-style SIMD micro-architectures.
347
348 # 4x4 Matrix to vec4 Multiply Example
349
350 The following settings will allow a 4x4 matrix (starting at f8), expressed
351 as a sequence of 16 numbers first by row then by column, to be multiplied
352 by a vector of length 4 (starting at f0), using a single FMAC instruction.
353
354 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
355 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
356 * VL=16, f4=vec, f0=vec, f8=vec
357 * FMAC f4, f0, f8, f4
358
359 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
360 four iterations through the hardware loop, the REMAPed index will not
361 increment. On the second four, the index will increase by one. Likewise
362 on each subsequent group of four.
363
364 The permutation on SHAPE1 will increment f4 continuously cycling through
365 f4-f7 every four iterations of the hardware loop.
366
367 At the same time, VL will, because there is no SHAPE on f8, increment
368 straight sequentially through the 16 values f8-f23 in the Matrix. The
369 equivalent sequence thus is issued:
370
    fmac f4, f0, f8, f4
    fmac f5, f0, f9, f5
    fmac f6, f0, f10, f6
    fmac f7, f0, f11, f7
    fmac f4, f1, f12, f4
    fmac f5, f1, f13, f5
    fmac f6, f1, f14, f6
    fmac f7, f1, f15, f7
    fmac f4, f2, f16, f4
    fmac f5, f2, f17, f5
    fmac f6, f2, f18, f6
    fmac f7, f2, f19, f7
    fmac f4, f3, f20, f4
    fmac f5, f3, f21, f5
    fmac f6, f3, f22, f6
    fmac f7, f3, f23, f7
387
388 The only other instruction required is to ensure that f4-f7 are
389 initialised (usually to zero).
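
The issued sequence can be reproduced with a short sketch. The index
expressions below are read directly off the listing above rather than
derived from the SVSHAPE algorithm, and the function name and its
parameters are illustrative assumptions:

```python
def vec4_matrix_schedule(vl=16, dim=4, f_vec=0, f_acc=4, f_mat=8):
    """List the scalar fmac operations equivalent to the remapped loop."""
    ops = []
    for i in range(vl):
        vec = f_vec + i // dim     # SHAPE0: advances once every four steps
        acc = f_acc + i % dim      # SHAPE1: cycles through f4-f7
        mat = f_mat + i            # no SHAPE: plain sequential f8-f23
        ops.append(f"fmac f{acc}, f{vec}, f{mat}, f{acc}")
    return ops


for op in vec4_matrix_schedule():
    print(op)
```

Each of f4-f7 accumulates four products, one per vector element, which
is why the accumulators must be initialised beforehand.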
390
391 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
392 the same technique applied to four independent vectors, can be done by
393 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
394 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
395 it to apply four times to compute the four columns worth of vectors.
396
397 # Warshall transitive closure algorithm
398
399 TODO move to [[sv/remap/discussion]] page, copied from here
400 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
401
402 with thanks to Hendrik.
403
404 <https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
405
406 > Just a note: interpreting + as 'or', and * as 'and',
407 > operating on Boolean matrices,
408 > and having result, X, and Y be the exact same matrix,
409 > updated while being used,
410 > gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.
412
This can be done with the ternary instruction, which has
an in-place triple boolean input:

    RT = RT | (RA & RB)

The instruction also has a CR Field variant of the same.
419
420 notes from conversations:
421
422 > > for y in y_r:
423 > > for x in x_r:
424 > > for z in z_r:
425 > > result[y][x] +=
426 > > a[y][z] *
427 > > b[z][x]
428
429 > This nesting of loops works for matrix multiply, but not for transitive
430 > closure.
431
432 > > it can be done:
433 > >
434 > >   for z in z_r:
435 > >    for y in y_r:
436 > >     for x in x_r:
437 > >       result[y][x] +=
438 > >          a[y][z] *
439 > >          b[z][x]
440 >
441 > And this ordering of loops *does* work for transitive closure, when a,
442 > b, and result are the very same matrix, updated while being used.
443 >
444 > By the way, I believe there is a graph algorithm that does the
445 > transitive closure thing, but instead of using boolean, "and", and "or",
446 > they use real numbers, addition, and minimum.  I think that one computes
447 > shortest paths between vertices.
448 >
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).
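
The loop ordering discussed above can be checked with a plain software
sketch of Warshall's algorithm. It is a scalar reference only, not
REMAP code; the function name is an illustrative assumption. The `z`
loop is outermost, and `or` and `and` stand in for `+` and `*`:

```python
def transitive_closure(adj):
    """Warshall transitive closure of a square boolean adjacency matrix."""
    n = len(adj)
    reach = [row[:] for row in adj]    # updated while being used
    for z in range(n):                 # bridging vertex: must be outermost
        for y in range(n):
            for x in range(n):
                # same shape as RT = RT | (RA & RB)
                reach[y][x] = reach[y][x] or (reach[y][z] and reach[z][x])
    return reach


# chain 0 -> 1 -> 2: the closure adds the path 0 -> 2
chain = [[False, True, False],
         [False, False, True],
         [False, False, False]]
print(transitive_closure(chain))
```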
455
456 # SUBVL Remap
457
Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
sub-vector elements were simply part of the main VL loop. This is the
*complete opposite* of predication which **only** applies to the whole
vec2/3/4. In pseudocode this would be:
462
    for (i = 0; i < VL; i++)
      if (predval & 1<<i) # apply to VL not SUBVL
        for (j = 0; j < SUBVL; j++)
          id = i*SUBVL + j # not, "id=i".
          ireg[RT+remap1(id)] ...
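
A minimal sketch of the flattening in the pseudocode above, listing the
element numbers that would be handed to the remap function. The
function name is an illustrative assumption, and the remap step itself
is deliberately left out:

```python
def subvl_element_ids(vl, subvl, predval):
    """List flattened element ids: predicate per VL step, remap per sub-element."""
    ids = []
    for i in range(vl):
        if predval & (1 << i):           # predicate bit covers the whole vecN
            for j in range(subvl):
                ids.append(i * subvl + j)    # not "id = i"
    return ids


# VL=3, vec2, elements 0 and 2 enabled: ids 0 1 4 5
print(subvl_element_ids(3, 2, 0b101))
```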
468
The reason for allowing SUBVL Remaps is that some regular patterns using
Swizzle which would otherwise require multiple explicit instructions
with 12 bit swizzles encoded in them may be efficiently encoded with Remap
instead. Note however that Swizzle is *still permitted to be applied*.
473
474 An example where SUBVL Remap is appropriate is the Rijndael MixColumns
475 stage:
476
477 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
478
479 Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
480 a 2D REMAP allows:
481
482 * the column bytes (as a vec4) to be iterated over as an inner loop,
483 progressing vertically (`a00 a10 a20 a30`)
484 * the columns themselves to be iterated as an outer loop
485 * a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
486
This is done entirely in-place without special 128-bit opcodes. Below is
the pseudocode for [[!wikipedia Rijndael MixColumns]]:
489
```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here
    // bytelevel elwidth overrides and vec4
    // These may then each be 4x 8bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```
518
519 With the assumption made by the above code that the column bytes have
520 already been turned around (vertical rather than horizontal) SUBVL.REMAP
521 may transparently fill that role, in-place, without a complex byte-level
522 mv operation.
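
The role played by the 2D REMAP can be modelled in software as a
gather of each column followed by a scatter of the result. The sketch
below is an illustration under stated assumptions, not the hardware
behaviour: all function names are invented here, `gmix_column` is a
direct transcription of the C routine above, and the explicit
gather/scatter stands in for what REMAP achieves without moving data.

```python
def xtime(v):
    """Multiply a byte by 2 in Rijndael's Galois field."""
    v <<= 1
    if v & 0x100:
        v ^= 0x11B                 # clear bit 8 and reduce by 0x1B
    return v & 0xFF


def gmix_column(col):
    """Python transcription of the C gmix_column above."""
    a = list(col)
    b = [xtime(v) for v in a]
    return [b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1],
            b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2],
            b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3],
            b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0]]


def column_indices(col, dim=4):
    """Offsets of one column in a row-major dim-by-dim byte array."""
    return [row * dim + col for row in range(dim)]


def mix_columns(state):
    """Apply gmix_column to every column of a 16-byte row-major state."""
    out = list(state)
    for col in range(4):               # outer loop: the columns
        idx = column_indices(col)      # inner loop order: a0c a1c a2c a3c
        mixed = gmix_column([state[i] for i in idx])
        for i, v in zip(idx, mixed):
            out[i] = v
    return out


print([hex(v) for v in gmix_column([0xDB, 0x13, 0x53, 0x45])])
```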
523
The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations. Given wide enough SIMD backends in hardware these 32 bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.
530
531 The other alternative is to simply perform the actual 4x4 GF(256) Matrix
532 Multiply using the MDS Matrix.
533
534 # TODO
535
536 * investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
537 in https://bugs.libre-soc.org/show_bug.cgi?id=653
538 * Triangular REMAP
539 * Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)
540