[[!tag standards]]

# REMAP <a name="remap" />

See [[sv/propagation]], because it is currently the only way to apply
REMAP.

REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
from a linear form to a 2D or 3D transposed form, or "offset" to permit
arbitrary access to elements, independently on each Vector src or dest
register.

Its primary uses are Matrix Multiplication and the in-place reordering of
sequential data. Four SPRs are provided so that a single FMAC may be
used in a single loop to perform 4x4 times 4x4 Matrix multiplication,
generating 64 FMACs. Additional uses include regular "Structure Packing"
such as RGB pixel data extraction and reforming.

REMAP, like all of SV, is abstracted out, meaning that unlike traditional
Vector ISAs which would typically only have a limited set of instructions
that can be structure-packed (LD/ST typically), REMAP may be applied to
literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.

Note that REMAP does not apply to sub-vector elements: that is what
swizzle is for. Swizzle *can* however be applied to the same instruction
as REMAP.

REMAP is quite expensive to set up, and on some implementations introduces
latency, so it should realistically be used only where it is worthwhile.

# Principle

* normal vector element read/write as operands would be sequential
  (0 1 2 3 ....)
* this is not appropriate for (e.g.) Matrix multiply which requires
  accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this. Both are expensive (copy large vectors, spill through memory)
* REMAP **redefines** the order of access according to set "Schedules"

Only the most commonly-used algorithms in computer science have REMAP
support, due to the high cost in both the ISA and in hardware.

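As a minimal sketch of what a "Schedule" means, the alternative sequence
quoted above (0 3 6 1 4 7 ...) can be reproduced by walking a 3x3 row-major
matrix column-first. The function name `column_first_schedule` and its
parameters are purely illustrative and are not part of the specification.

```python
def column_first_schedule(xdim, ydim):
    """Yield element indices of an xdim-wide, ydim-high row-major matrix,
    visiting it one column at a time instead of one row at a time."""
    for x in range(xdim):          # outer loop: column number
        for y in range(ydim):      # inner loop: row number
            yield y * xdim + x     # row-major position of (row y, column x)


# A plain vector loop would produce 0 1 2 3 ...; the remapped one does not.
print(list(column_first_schedule(3, 3)))
```

The point is only that the element order is computed on the fly, so no data
has to be copied or spilled to memory to obtain the transposed view.
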
# REMAP SPR

| 0   | 2   | 4   | 6   | 8   | 10..14 | 15..23 |
| --  | --  | --  | --  | --  | -----  | ------ |
| mi0 | mi1 | mi2 | mo0 | mo1 | SVme   | rsv    |

mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers (FRT, FRS respectively).
SVme is 5 bits, one per operand, each indicating whether the selected
SVSHAPE is actively applied or not.

* bit 0 of SVme indicates if mi0 is applied to RA / FRA
* bit 1 of SVme indicates if mi1 is applied to RB / FRB
* bit 2 of SVme indicates if mi2 is applied to RC / FRC
* bit 3 of SVme indicates if mo0 is applied to RT / FRT
* bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS
  (LD/ST-with-update has an implicit 2nd write register, RA)

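The relationship between the selectors and the enable bits can be sketched
as below. This is an illustration only: the helper `active_remaps` is
hypothetical, and it takes already-decoded fields (SVme as five booleans
indexed exactly as in the list above) so that it makes no assumption about
how the bits are numbered within the SPR itself.

```python
# Operand names in the same order as SVme bits 0..4 in the list above.
OPERANDS = ("RA/FRA", "RB/FRB", "RC/FRC", "RT/FRT", "EA/FRS")


def active_remaps(mi, mo, svme):
    """Return {operand: SVSHAPE name} for every operand whose SVme bit is set.

    mi   -- three SVSHAPE selectors (mi0, mi1, mi2), each 0..3
    mo   -- two SVSHAPE selectors (mo0, mo1), each 0..3
    svme -- five booleans, svme[n] standing for "bit n of SVme"
    """
    selectors = tuple(mi) + tuple(mo)
    if len(selectors) != 5 or len(svme) != 5:
        raise ValueError("expected 3 mi selectors, 2 mo selectors, 5 SVme bits")
    if any(not 0 <= sel <= 3 for sel in selectors):
        raise ValueError("each selector must name one of SVSHAPE0-3")
    return {op: f"SVSHAPE{sel}"
            for op, sel, enabled in zip(OPERANDS, selectors, svme)
            if enabled}


print(active_remaps(mi=(0, 1, 2), mo=(0, 3),
                    svme=(True, True, False, True, False)))
```

Note that a selector whose SVme bit is clear is simply ignored, which is why
mi2 and mo1 do not appear in the result for this particular call.
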
There is also a corresponding SVRM-Form for the svremap
instruction which matches the above SPR:

| 0  | 6    | 11  | 13  | 15  | 17  | 19  | 21  | 22   | 26 | 31 |
| -- | --   | --  | --  | --  | --  | --  | --  | --   | -- | -- |
| PO | SVme | mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO | /  |

# SHAPE 1D/2D/3D vector-matrix remapping SPRs

There are four "shape" SPRs, SHAPE0-3, each 32 bits wide,
all of which have the same format.

[[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]

The algorithm below shows how REMAP works more clearly, and may be
executed as a python program:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
```

An easier-to-read version (using python iterators) shows the loop nesting:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
```

Each element index from the for-loop `0..VL-1`
is run through the above algorithm to work out the **actual** element
index instead. Given that there are four possible SHAPE entries, up to
four separate registers in any given operation may be simultaneously
remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                                  ireg[rs2+remap3(irs2)];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

By changing remappings, 2D matrices may be transposed "in-place" for one
operation, followed by setting a different permutation order without
having to move the values in the registers to or from memory.

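For readers who prefer something executable, the `op_add` pseudo-code above
can be approximated in Python as follows. This is a sketch under stated
assumptions, not the reference model: the register file is a plain list,
each remap is an ordinary Python function from element index to element
index, and the saving of context (`xSTATE.srcoffs`) and all exception
handling are left out. The names `identity`, `remaps` and `isvector` are
illustrative.

```python
def identity(index):
    """Default remap: leave the element index unchanged."""
    return index


def op_add(ireg, rd, rs1, rs2, vl, predval,
           remaps=(identity, identity, identity),
           isvector=(True, True, True)):
    """Element loop of a vectorised scalar add with per-operand remapping.

    remaps   -- (dest, src1, src2) index-remapping functions
    isvector -- (dest, src1, src2) flags; a scalar operand never advances
    """
    remap_d, remap_s1, remap_s2 = remaps
    rd_vec, rs1_vec, rs2_vec = isvector
    i_d = i_s1 = i_s2 = 0
    for i in range(vl):
        if predval & (1 << i):          # predicate bit i enables element i
            ireg[rd + remap_d(i_d)] = (ireg[rs1 + remap_s1(i_s1)] +
                                       ireg[rs2 + remap_s2(i_s2)])
            if not rd_vec:
                break                   # scalar destination: one result only
        if rd_vec:
            i_d += 1
        if rs1_vec:
            i_s1 += 1
        if rs2_vec:
            i_s2 += 1
    return ireg


# Reverse the first source vector while adding: r16..r19 = r3..r0 + r8..r11
regs = list(range(32))
op_add(regs, rd=16, rs1=0, rs2=8, vl=4, predval=0b1111,
       remaps=(identity, lambda i: 3 - i, identity))
print(regs[16:20])
```

Because each operand carries its own remap function, the reversal of the
first source costs nothing beyond the index computation itself.
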
Note that:

* Over-running the register file clearly has to be detected and
  an illegal instruction exception thrown.
* When non-default elwidths are set, the exact same algorithm still
  applies (i.e. it offsets *polymorphic* elements *within* registers rather
  than entire registers).
* If permute option 000 is utilised, the actual order of the
  reindexing does not change. However, modulo MVL still occurs
  which will result in repeated operations (use with caution).
* If two or more dimensions are set to zero, the actual order does not change!
* The above algorithm is pseudo-code **only**. Actual implementations
  will need to take into account the fact that the element for-looping
  must be **re-entrant**, due to the possibility of exceptions occurring.
  See SVSTATE SPR, which records the current element index.
  Continuing after return from an interrupt may introduce latency
  due to re-computation of the remapped offsets.
* Twin-predicated operations require **two** separate and distinct
  element offsets. The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped. *This even includes unit-strided LD/ST*
  and other operations in that category, where in that case it will
  be the **offset** that is remapped.
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register. Without offsets, it is necessary
  either to use a predicated MV, skipping the first elements, or
  to perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
  less than MVL is **perfectly legal**, albeit very obscure. It permits
  entries to be regularly presented to operands **more than once**, thus
  allowing the same underlying registers to act as an accumulator of
  multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements, *including polymorphic ones*.

Clearly some considerable care needs to be taken here, as the remapping
could hypothetically create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps. Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.

## svstate instruction

Please note: this is **not** intended for production. It sets up
(overwrites) all required SVSHAPE SPRs and indicates that the
*next instruction* shall have those REMAP shapes applied to it,
assuming that instruction is of the form FRT,FRA,FRC,FRB.

Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])

| 0..5 | 6..10 | 11..15 | 16..20 | 21..25 | 25 | 26..30 | 31 | name    |
| --   | --    | ---    | -----  | ------ | -- | ------ | -- | ------- |
| OPCD | SVxd  | SVyd   | SVzd   | SVRM   | vf | XO     | /  | svstate |


Fields:

* **SVxd** - SV REMAP "xdim"
* **SVyd** - SV REMAP "ydim"
* **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT)
* **vf** - sets "Vertical-First" mode
* **XO** - standard 5-bit XO field

# 4x4 Matrix to vec4 Multiply Example

The following settings will allow a 4x4 matrix (starting at f8), expressed
as a sequence of 16 numbers first by row then by column, to be multiplied
by a vector of length 4 (starting at f0), using a single FMAC instruction.

* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
* VL=16, f4=vec, f0=vec, f8=vec
* FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first
four iterations through the hardware loop, the REMAPed index will not
increment. On the second four, the index will increase by one. Likewise
on each subsequent group of four.

The permutation on SHAPE1 will increment f4 continuously, cycling through
f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, VL will increment
straight sequentially through the 16 values f8-f23 in the Matrix. The
equivalent sequence thus issued is:

    fmac f4, f0, f8, f4
    fmac f5, f0, f9, f5
    fmac f6, f0, f10, f6
    fmac f7, f0, f11, f7
    fmac f4, f1, f12, f4
    fmac f5, f1, f13, f5
    fmac f6, f1, f14, f6
    fmac f7, f1, f15, f7
    fmac f4, f2, f16, f4
    fmac f5, f2, f17, f5
    fmac f6, f2, f18, f6
    fmac f7, f2, f19, f7
    fmac f4, f3, f20, f4
    fmac f5, f3, f21, f5
    fmac f6, f3, f22, f6
    fmac f7, f3, f23, f7

The only other instruction required is one to ensure that f4-f7 are
initialised (usually to zero).

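The sixteen-instruction sequence above follows a simple pattern, which the
sketch below reproduces and then executes on a toy register file. The index
arithmetic (`i % 4`, `i // 4`, `i`) is read directly off the listing rather
than derived from the SHAPE field encodings, and the function names and the
list-based register file are assumptions made for illustration.

```python
def fmac_schedule(vl=16, n=4):
    """Yield (accumulator, vector, matrix) element offsets per iteration."""
    for i in range(vl):
        yield i % n, i // n, i   # cycles 0..3, steps every 4, runs 0..15


def fmac_listing(acc=4, vec=0, mat=8):
    """Render the schedule as the equivalent scalar instruction sequence."""
    return [f"fmac f{acc + d}, f{vec + v}, f{mat + m}, f{acc + d}"
            for d, v, m in fmac_schedule()]


def run_fmacs(fregs, acc=4, vec=0, mat=8):
    """Execute the sequence in program order: f[acc] += f[vec] * f[mat]."""
    for d, v, m in fmac_schedule():
        fregs[acc + d] += fregs[vec + v] * fregs[mat + m]
    return fregs


print("\n".join(fmac_listing()))

fregs = [0.0] * 32
fregs[0:4] = [1.0, 2.0, 3.0, 4.0]                   # the vec4 in f0-f3
fregs[8:24] = [float(k) for k in range(1, 17)]      # the 16 matrix values
run_fmacs(fregs)                                    # f4-f7 start at zero
print(fregs[4:8])
```

Each of f4-f7 is written four times, which is why the accumulators must be
initialised first and why the sequence has to be executed in Program Order.
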
It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
the same technique applied to four independent vectors, can be done by
setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
it to apply four times to compute the four columns worth of vectors.

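One plausible reading of that VL=64 arrangement is sketched below. It is an
assumption-laden illustration, not the specified SHAPE settings: the four
vectors and the four results are taken to be stored back to back, the extra
dimension is modelled as a `block` counter, and the rotating 1D shape on the
matrix is modelled as an index that wraps every 16 elements. All names are
hypothetical.

```python
def matmul_schedule(n=4):
    """Yield (dest, vec, mat) element offsets for all n*n*n FMACs."""
    for i in range(n * n * n):              # VL = 64 when n = 4
        block, inner = divmod(i, n * n)     # which vec4, step within its pass
        dest = block * n + inner % n        # cycles over one result vec4
        vec = block * n + inner // n        # steps once every n iterations
        mat = inner                         # wraps every n*n: rotating shape
        yield dest, vec, mat


def matmul(vecs, mat, n=4):
    """Run the schedule in order, accumulating into a zeroed result."""
    out = [0] * (n * n)
    for d, v, m in matmul_schedule(n):
        out[d] += vecs[v] * mat[m]
    return out


identity_matrix = [1 if k // 4 == k % 4 else 0 for k in range(16)]
print(matmul(list(range(16)), identity_matrix))
```

Multiplying by the identity matrix returns the input vectors unchanged, which
is a quick sanity check that the three index streams line up.
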
# Warshall transitive closure algorithm

TODO move to [[sv/remap/discussion]] page, copied from here
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html

with thanks to Hendrik.

> Just a note: interpreting + as 'or', and * as 'and',
> operating on Boolean matrices,
> and having result, X, and Y be the exact same matrix,
> updated while being used,
> gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.

this can be done with the ternary instruction which has
an in-place triple boolean input:

    RT = RT | (RA & RB)

and also has a CR Field variant of the same

notes from conversations:

> >   for y in y_r:
> >    for x in x_r:
> >     for z in z_r:
> >       result[y][x] +=
> >          a[y][z] *
> >          b[z][x]

> This nesting of loops works for matrix multiply, but not for transitive
> closure.

> > it can be done:
> >
> >   for z in z_r:
> >    for y in y_r:
> >     for x in x_r:
> >       result[y][x] +=
> >          a[y][z] *
> >          b[z][x]
>
> And this ordering of loops *does* work for transitive closure, when a,
> b, and result are the very same matrix, updated while being used.
>
> By the way, I believe there is a graph algorithm that does the
> transitive closure thing, but instead of using boolean, "and", and "or",
> they use real numbers, addition, and minimum.  I think that one computes
> shortest paths between vertices.
>
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).

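The loop ordering discussed in the quoted thread can be tried out with a
short software sketch. It uses the `RT = RT | (RA & RB)` update on a single
Boolean matrix, with the bridging subscript `z` as the outermost loop. The
function name and the list-of-lists representation are illustrative choices,
and nothing here models how the loops would be mapped onto REMAP.

```python
def transitive_closure(adjacency):
    """Warshall's algorithm on a square 0/1 matrix (list of lists).

    The result, a and b of the matrix-multiply loop nest are all the same
    matrix, updated while being used, and z must be the outer loop.
    """
    n = len(adjacency)
    reach = [row[:] for row in adjacency]   # work on a copy
    for z in range(n):                      # bridging vertex: outermost
        for y in range(n):
            for x in range(n):
                # RT = RT | (RA & RB)
                reach[y][x] = reach[y][x] | (reach[y][z] & reach[z][x])
    return reach


# A simple chain 0 -> 1 -> 2 -> 3
chain = [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]
for row in transitive_closure(chain):
    print(row)
```

After the z'th pass every path whose intermediate vertices are numbered z or
lower has been recorded, which is the invariant described in the thread.
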
# SUBVL Remap

Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
sub-vector elements were simply part of the main VL loop. This is the
*complete opposite* of predication which **only** applies to the whole
vec2/3/4. In pseudocode this would be:

    for (i = 0; i < VL; i++)
      if (predval & 1<<i) # apply to VL not SUBVL
        for (j = 0; j < SUBVL; j++)
          id = i*SUBVL + j # not, "id=i".
          ireg[RT+remap1(id)] ...

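The difference between the two index spaces can be made concrete with a
small sketch: predication selects whole sub-vectors by `i`, whereas the
index handed to the remap function is the flattened `i*SUBVL + j`. The
generator name is illustrative only.

```python
def subvl_element_ids(vl, subvl, predval):
    """Yield the flattened element index passed to remap for each
    sub-vector element, skipping whole sub-vectors that are masked out."""
    for i in range(vl):
        if predval & (1 << i):          # predicate bit covers a whole vecN
            for j in range(subvl):
                yield i * subvl + j     # not simply i


# VL=3 vec4s with the middle one predicated out
print(list(subvl_element_ids(vl=3, subvl=4, predval=0b101)))
```
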
The reason for allowing SUBVL Remaps is that some regular patterns using
Swizzle which would otherwise require multiple explicit instructions
with 12 bit swizzles encoded in them may be efficiently encoded with Remap
instead. Note however that Swizzle is *still permitted to be applied*.

An example where SUBVL Remap is appropriate is the Rijndael MixColumns
stage:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />

Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
a 2D REMAP allows:

* the column bytes (as a vec4) to be iterated over as an inner loop,
  progressing vertically (`a00 a10 a20 a30`)
* the columns themselves to be iterated as an outer loop
* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.

This is done entirely in-place, without special 128-bit opcodes. Below is
the pseudocode for [[!wikipedia Rijndael MixColumns]]:

```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here
    // bytelevel elwidth overrides and vec4
    // These may then each be 4x 8-bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```

The above code assumes that the column bytes have already been turned
around (vertical rather than horizontal). SUBVL.REMAP
may transparently fill that role, in-place, without a complex byte-level
mv operation.

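In software terms, the role that the 2D remap plays here is simply to supply
the column indices `c, c+4, c+8, c+12` of the row-major state. The sketch
below emulates that with an explicit index list and a Python port of
`gmix_column`. It is only an illustration of the access pattern: the
gather/scatter that Python performs explicitly is exactly what the remap is
intended to make unnecessary, and `mix_columns` is a hypothetical helper
name.

```python
def gmix_column(col):
    """Python port of the C routine above, for one 4-byte column."""
    a = list(col)
    # multiply each byte by 2 in Rijndael's Galois field
    b = [((x << 1) & 0xFF) ^ (0x1B if x & 0x80 else 0) for x in a]
    return [
        b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1],
        b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2],
        b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3],
        b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0],
    ]


def mix_columns(state):
    """Apply gmix_column to each column of a 16-byte state stored
    a00 a01 a02 a03 a10 .. a33, updating the list in place."""
    for c in range(4):                          # outer loop: columns
        idx = [r * 4 + c for r in range(4)]     # a0c a1c a2c a3c
        mixed = gmix_column([state[i] for i in idx])
        for i, value in zip(idx, mixed):
            state[i] = value
    return state


# Column 0 holds db 13 53 45; the other three columns hold 01 01 01 01
state = [0xDB, 1, 1, 1,
         0x13, 1, 1, 1,
         0x53, 1, 1, 1,
         0x45, 1, 1, 1]
mix_columns(state)
print([hex(state[r * 4]) for r in range(4)])
```
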
The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations. Given wide enough SIMD backends in hardware these 32 bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.

The other alternative is to simply perform the actual 4x4 GF(256) Matrix
Multiply using the MDS Matrix.

# TODO

investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
in https://bugs.libre-soc.org/show_bug.cgi?id=653