1 # REMAP <a name="remap" />
2
3 <!-- hide -->
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=143> matrix multiply
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=867> add svindex
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=885> svindex in simulator
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=911> offset svshape option
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=864> parallel reduction
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=930> DCT/FFT "strides"
10 * see [[sv/remap/appendix]] for examples and usage
11 * see [[sv/propagation]] for a future way to apply REMAP
12 * [[remap/discussion]]
13 <!-- show -->
14
15 REMAP is an advanced form of Vector "Structure Packing" that provides
16 hardware-level support for commonly-used *nested* loop patterns that would
17 otherwise require full inline loop unrolling. For more general reordering
18 an Indexed REMAP mode is available (a RISC-paradigm
19 abstracted analog to `xxperm`).
20
21 REMAP allows the usual sequential vector loop `0..VL-1` to be "reshaped"
22 (re-mapped) from a linear form to a 2D or 3D transposed form, or "offset"
23 to permit arbitrary access to elements, independently on each
24 Vector src or dest register. Up to four separate independent REMAPs may be applied
25 to the registers of any instruction.
26
27 A normal Vector Add:
28
```
for i in range(VL):
    GPR[RT+i] <= GPR[RA+i] + GPR[RB+i];
```
33
34 A Hardware-assisted REMAP Vector Add:
35
```
for i in range(VL):
    GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)];
```
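
The difference between the two loops can be sketched in Python. This is only an illustrative model: the register file is a plain list, and the three remap functions (`identity`, `reverse`, `broadcast`) are hypothetical stand-ins, not Schedules defined by this specification.

```python
def remap_vector_add(gpr, rt, ra, rb, vl, remap1, remap2, remap3):
    """Model a REMAPped Vector Add over a list-based register file."""
    for i in range(vl):
        gpr[rt + remap1(i)] = gpr[ra + remap2(i)] + gpr[rb + remap3(i)]
    return gpr

VL = 4
identity = lambda i: i            # no reordering (the "normal" Vector Add)
reverse = lambda i: VL - 1 - i    # hypothetical: walk the source backwards
broadcast = lambda i: 0           # hypothetical: always read element 0

gpr = [0] * 16
gpr[4:8] = [1, 2, 3, 4]           # source vector at RA=4
gpr[8:12] = [10, 20, 30, 40]      # source vector at RB=8
remap_vector_add(gpr, 0, 4, 8, VL, identity, reverse, broadcast)
print(gpr[0:4])
```

Each operand has its own independent remap function, which is the essential property described above.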
40
41 Aside from
42 Indexed REMAP this is entirely Hardware-accelerated reordering and
43 consequently not costly in terms of register access for the Indices. It will however
44 place a burden on Multi-Issue systems but no more than if the equivalent
45 Scalar instructions were explicitly loop-unrolled without SVP64, and
46 some advanced implementations may even find the Deterministic nature of
47 the Scheduling to be easier on resources.
48
*Hardware note: in its general form, REMAP is quite expensive to set up, and on some
implementations may introduce latency, so it should realistically be used
only where it is worthwhile. However, given that up to 127 operations can be
Deterministically issued from a single instruction even with that latency,
REMAP should not be dismissed for *possible* latency alone. Commonly-used
patterns such as Matrix Multiply, DCT and FFT have helper instruction options
which make REMAP easier to use.*
57
58 There are five types of REMAP:
59
60 * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
61 Matrix transpose and rotate. The Shapes are set up for an "Outer Product"
62 Matrix Multiply (a future variant may introduce Inner Product).
63 * **FFT/DCT**, with full triple-loop in-place support: limited to
64 Power-2 RADIX
65 * **Indexing**, for any general-purpose reordering, also includes
66 limited 2D reshaping as well as Element "offsetting".
67 * **Parallel Reduction**, for scheduling a sequence of operations
68 in a Deterministic fashion, in a way that may be parallelised,
69 to reduce a Vector down to a single value.
70 * **Parallel Prefix Sum**, implemented as a work-efficient Schedule,
71 has several key Computer Science uses. Again Prefix Sum is 100%
72 Deterministic.
73
74 Best implemented on top of a Multi-Issue Out-of-Order Micro-architecture,
75 REMAP Schedules are 100% Deterministic **including Indexing** and are
76 designed to be incorporated in between the Decode and Issue phases,
77 directly into Register Hazard Management.
78
79 As long as the SVSHAPE SPRs
80 are not written to directly, Hardware may treat REMAP as 100%
81 Deterministic: all REMAP Management instructions take static
82 operands (no dynamic register operands)
83 with the exception of Indexed Mode, and even then
84 Architectural State is permitted to assume that the Indices
85 are cacheable from the point at which the `svindex` instruction
86 is executed.
87
Further details on the Deterministic Precise-Interruptible algorithms
used in these Schedules are found in the [[sv/remap/appendix]].
90
91 *Future specification note: future versions of the REMAP Management instructions
92 will extend to EXT1xx Prefixed variants. This will overcome some of the limitations
93 present in the 32-bit variants of the REMAP Management instructions that at
94 present require direct writing to SVSHAPE0-3 SPRs. Additional
95 REMAP Modes may also be introduced at that time.*
96
97 ## Determining Register Hazards (hphint)
98
99 For high-performance (Multi-Issue, Out-of-Order) systems it is critical
100 to be able to statically determine the extent of Vectors in order to
101 allocate pre-emptive Hazard protection. The next task is to eliminate
102 masked-out elements using predicate bits, freeing up the associated
103 Hazards.
104
105 For non-REMAP situations `VL` is sufficient to ascertain early
106 Hazard coverage, and with SVSTATE being a high priority cached
107 quantity at the same level of MSR and PC this is not a problem.
108
109 The problems come when REMAP is enabled. Indexed REMAP must instead
110 use `MAXVL` as the earliest (simplest)
111 batch-level Hazard Reservation indicator (after taking element-width
112 overriding on the Index source into consideration),
113 but Matrix, FFT and Parallel Reduction must all use completely different
114 schemes. The reason is that VL is used to step through the total
115 number of *operations*, not the number of registers.
116 The "Saving Grace" is that all of the REMAP Schedules are 100% Deterministic.
117
118 Advance-notice Parallel computation and subsequent cacheing
119 of all of these complex Deterministic REMAP Schedules is
120 *strongly recommended*, thus allowing clear and precise multi-issue
121 batched Hazard coverage to be deployed, *even for Indexed Mode*.
122 This is only possible for Indexed due to the strict guidelines
123 given to Programmers.
124
In short, there exist solutions to the problem of Hazard Management,
with varying degrees of refinement possible at correspondingly
increasing levels of complexity in hardware.
128
A reminder: when Rc=1 each result register (element) has an associated
co-result CR Field (one per result element). Thus, when determining
the Write-Hazards for result registers, the Write-Hazards for the
associated co-result CR Fields must not be forgotten, *including* when
Predication is used.
134
135 **Horizontal-Parallelism Hint**
136
137 To help further in reducing Hazards,
138 `SVSTATE.hphint` is an indicator to hardware of how many elements are 100%
139 fully independent. Hardware is permitted to assume that groups of elements
140 up to `hphint` in size need not have Register (or Memory) Hazards created
141 between them, including when `hphint > VL`, which greatly aids simplification of
142 Multi-Issue implementations.
143
144 If care is not taken in setting `hphint` correctly it may wreak havoc.
145 For example Matrix Outer Product relies on the innermost loop computations
146 being independent. If `hphint` is set to greater than the Outer Product
147 depth then data corruption is guaranteed to occur.
148
Likewise on FFTs it is assumed that each layer of the RADIX2 triple-loop
is independent, but that there are strict *inter-layer* Register Hazards.
Therefore if `hphint` is set to greater than the RADIX2 width of the FFT,
data corruption is guaranteed.
153
154 Thus the key message is that setting `hphint` requires in-depth knowledge
155 of the REMAP Algorithm Schedules, given in the Appendix.
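
One way to reason about a candidate `hphint` value is to check a Schedule for dependencies inside each group before relying on it. The sketch below is a hypothetical software checker, not part of the specification: it assumes that groups are consecutive runs of `hphint` steps, and it models each operation as a `(dest, sources)` tuple of register numbers.

```python
def hazards_within_groups(ops, hphint):
    """Return the steps that depend on an earlier step in the same group.

    ops: list of (dest, sources) register-number tuples in issue order.
    Grouping into consecutive runs of `hphint` steps is an assumption.
    """
    bad = []
    for start in range(0, len(ops), hphint):
        written = set()
        for step in range(start, min(start + hphint, len(ops))):
            dest, sources = ops[step]
            # a read or write of a register already written in this group
            if written & set(sources) or dest in written:
                bad.append(step)
            written.add(dest)
    return bad

# Two accumulators (registers 0 and 1), each updated twice:
ops = [(0, (0, 10, 20)), (1, (1, 11, 20)),
       (0, (0, 10, 21)), (1, (1, 11, 21))]
print(hazards_within_groups(ops, 2))   # groups of two independent steps
print(hazards_within_groups(ops, 4))   # group spans dependent accumulations
```

With a group size of 2 no step depends on another in its group; widening the group to 4 places dependent accumulations together, which is the kind of mis-setting that the text above warns leads to data corruption.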
156
157 ## REMAP area of SVSTATE SPR
158
159 The following bits of the SVSTATE SPR are used for REMAP:
160
161 ```
162 |32:33|34:35|36:37|38:39|40:41| 42:46 | 62 |
163 | -- | -- | -- | -- | -- | ----- | ------ |
164 |mi0 |mi1 |mi2 |mo0 |mo1 | SVme | RMpst |
165 ```
166
167 mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
168 mi0-2 apply to RA, RB, RC respectively, as input registers, and
169 likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
170 SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
171 SVSHAPE is actively applied or not, and if so, to which registers.
172
173 * bit 4 of SVme indicates if mi0 is applied to source RA / FRA / BA / BFA / RT / FRT
174 * bit 3 of SVme indicates if mi1 is applied to source RB / FRB / BB
175 * bit 2 of SVme indicates if mi2 is applied to source RC / FRC / BC
176 * bit 1 of SVme indicates if mo0 is applied to result RT / FRT / BT / BF
177 * bit 0 of SVme indicates if mo1 is applied to result Effective Address / FRS / RS
178 (LD/ST-with-update has an implicit 2nd write register, RA)
179
180 The "persistence" bit if set will result in all Active REMAPs being applied
181 indefinitely.
182
183 -----------
184
185 \newpage{}
186
187 # svremap instruction <a name="svremap"> </a>
188
189 SVRM-Form:
190
191 |0 |6 |11 |13 |15 |17 |19 |21 | 22:25 |26:31 |
192 | -- | -- | -- | -- | -- | -- | -- | -- | ---- | ----- |
193 | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO |
194
195 * svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
196
197 Pseudo-code:
198
199 ```
200 # registers RA RB RC RT EA/FRS SVSHAPE0-3 indices
201 SVSTATE[32:33] <- mi0
202 SVSTATE[34:35] <- mi1
203 SVSTATE[36:37] <- mi2
204 SVSTATE[38:39] <- mo0
205 SVSTATE[40:41] <- mo1
206 # enable bit for RA RB RC RT EA/FRS
207 SVSTATE[42:46] <- SVme
208 # persistence bit (applies to more than one instruction)
209 SVSTATE[62] <- pst
210 ```
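
The field updates above can be modelled on a plain integer. This is a minimal sketch, assuming that SVSTATE is a 64-bit value with Power ISA MSB0 bit numbering (bit 0 is the most significant); `set_field` and `get_field` are hypothetical helper names.

```python
def set_field(value, start, end, field, width=64):
    """Write `field` into MSB0-numbered bits start..end (inclusive)."""
    nbits = end - start + 1
    shift = width - 1 - end
    mask = ((1 << nbits) - 1) << shift
    return (value & ~mask) | ((field << shift) & mask)

def get_field(value, start, end, width=64):
    """Read MSB0-numbered bits start..end (inclusive)."""
    nbits = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << nbits) - 1)

def svremap(svstate, svme, mi0, mi1, mi2, mo0, mo1, pst):
    """Apply the svremap field updates to an SVSTATE integer."""
    svstate = set_field(svstate, 32, 33, mi0)
    svstate = set_field(svstate, 34, 35, mi1)
    svstate = set_field(svstate, 36, 37, mi2)
    svstate = set_field(svstate, 38, 39, mo0)
    svstate = set_field(svstate, 40, 41, mo1)
    svstate = set_field(svstate, 42, 46, svme)
    svstate = set_field(svstate, 62, 62, pst)
    return svstate

state = svremap(0, svme=0b11111, mi0=0, mi1=1, mi2=2, mo0=3, mo1=0, pst=1)
print(hex(state))
```

All other SVSTATE bits are left untouched by the model, matching the pseudo-code, which only assigns the listed fields.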
211
212 Special Registers Altered:
213
214 ```
215 SVSTATE
216 ```
217
`svremap` determines the relationship between registers and SVSHAPE SPRs.
The bitmask `SVme` determines which registers have a REMAP applied, and mi0-mo1
determine which shape is applied to an activated register. The `pst` bit, if
cleared, indicates that the REMAP operation shall only apply to the immediately-following
instruction. If set, REMAP remains enabled until such time as it is
explicitly disabled, either by `setvl` setting a new MAXVL, or by another
`svremap` instruction. `svindex` and `svshape2` are also capable of setting or
clearing persistence, and of covering a subset of the capability of
`svremap` to set register-to-SVSHAPE relationships.
227
228 Programmer's Note: applying non-persistent `svremap` to an instruction that has
229 no REMAP enabled or is a Scalar operation will obviously have no effect but
230 the bits 32 to 46 will at least have been set in SVSTATE. This may prove useful
231 when using `svindex` or `svshape2`.
232
233 Hardware Architectural Note: when persistence is not set it is critically important
234 to treat the `svremap` and the following SVP64 instruction as an indivisible fused operation.
235 *No state* is stored in the SVSTATE SPR in order to allow continuation should an
236 Interrupt occur between the two instructions. Thus, Interrupts must be prohibited
237 from occurring or other workaround deployed. When persistence is set this issue
238 is moot.
239
240 It is critical to note that if persistence is clear then `svremap` is the *only* way
241 to activate REMAP on any given (following) instruction. If persistence is set however then
242 **all** SVP64 instructions go through REMAP as long as `SVme` is non-zero.
243
244 -------------
245
246 \newpage{}
247
248 # SHAPE Remapping SPRs
249
250 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
251 which have the same format.
252
253 Shape is 32-bits. When SHAPE is set entirely to zeros, remapping is
254 disabled: the register's elements are a linear (1D) vector.
255
256 |0:5 |6:11 | 12:17 | 18:20 | 21:23 |24:27 |28:29 |30:31| Mode |
257 |----- |----- | ------- | ------- | ------ |------|------ |---- | ----- |
258 |xdimsz|ydimsz| zdimsz | permute | invxyz |offset|skip |mode |Matrix |
259 |xdimsz|ydimsz|SVGPR | 11/ |sk1/invxy|offset|elwidth|0b00 |Indexed|
260 |xdimsz|mode | zdimsz | submode2| invxyz |offset|submode|0b01 |DCT/FFT|
261 | rsvd |rsvd |xdimsz | rsvd | invxyz |offset|submode|0b10 |Red/Sum|
262 | | | | | | | |0b11 |rsvd |
263
264 `mode` sets different behaviours (straight matrix multiply, FFT, DCT).
265
266 * **mode=0b00** sets straight Matrix Mode
267 * **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
268 * **mode=0b01** sets "FFT/DCT" mode and activates submodes
269 * **mode=0b10** sets "Parallel Reduction or Prefix-Sum" Schedules.
270
271 *Architectural Resource Allocation note: the four SVSHAPE SPRs are best
272 allocated sequentially and contiguously in order that `sv.mtspr` may
273 be used. This is safe to do as long as `SVSTATE.SVme=0`*
274
275 ## Parallel Reduction / Prefix-Sum Mode
276
277 Creates the Schedules for Parallel Tree Reduction and Prefix-Sum
278
279 * **submode=0b00** selects the left operand index for Reduction
280 * **submode=0b01** selects the right operand index for Reduction
281 * **submode=0b10** selects the left operand index for Prefix-Sum
282 * **submode=0b11** selects the right operand index for Prefix-Sum
283
* When bit 0 of `invxyz` is set, the order of the indices
  in the inner for-loop is reversed. This has the side-effect
  of placing the final reduced result in the last-predicated element.
  It also has the indirect side-effect of swapping the source
  registers: Left-operand index numbers will always exceed
  Right-operand indices.
  When clear, the reduced result will be in the first-predicated
  element, and Left-operand indices will always be *less* than
  Right-operand ones.
* When bit 1 of `invxyz` is set, the order of the outer loop
  step is inverted: stepping begins at the nearest power-of-two
  to half of the vector length and reduces by half each time.
  When clear the step will begin at 2 and double on each
  inner loop.
298
299 **Parallel Prefix Sum**
300
This is a work-efficient Parallel Schedule that for example produces Triangular
or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
to Parallel Reduction. Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
end-result, implementations may only implement Mapreduce in serial form (or give
the appearance to Programmers of the same). The Parallel Prefix Schedule is
*required* to be implemented in such a way that its Deterministic Schedule may be
parallelised. Like the Reduction Schedule it is 100% Deterministic and consequently
may be used with non-commutative operations.
The Schedule Algorithm may be found in the [[sv/remap/appendix]].
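
To illustrate what a work-efficient prefix schedule looks like, here is a textbook two-phase (up-sweep then down-sweep) in-place inclusive scan. It is a generic sketch and is not claimed to be the exact Schedule of this specification, which is given in the appendix; `work_efficient_scan` is a hypothetical name.

```python
import operator

def work_efficient_scan(values, op=operator.add):
    """In-place inclusive prefix scan using an up-sweep and a down-sweep.

    Operations within one pass of an inner for-loop touch disjoint
    elements, so each pass could be issued in parallel.
    """
    x = list(values)
    n = len(x)
    d = 1
    while d < n:                                # up-sweep (reduction half)
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])           # lower index is the left operand
        d *= 2
    d //= 2
    while d >= 1:                               # down-sweep
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
        d //= 2
    return x

print(work_efficient_scan([1, 2, 3, 4, 5, 6, 7]))           # Triangular numbers
print(work_efficient_scan([1, 2, 3, 4, 5], operator.mul))   # Factorials
```

Because the lower-indexed element is always the left operand, the same order of application is kept for non-commutative operations such as string concatenation.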
310
311 **Parallel Reduction**
312
313 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
314 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
315 *appearance* and *effect* of Reduction. Parallel Reduction is not limited
316 to Power-of-two but is limited as usual by the total number of
317 element operations (127) as well as available register file size.
318
319 In Horizontal-First Mode, Vector-result reduction **requires**
320 the destination to be a Vector, which will be used to store
321 intermediary results, in order to achieve a correct final
322 result.
323
324 Given that the tree-reduction schedule is deterministic,
325 Interrupts and exceptions
326 can therefore also be precise. The final result will be in the first
327 non-predicate-masked-out destination element, but due again to
328 the deterministic schedule programmers may find uses for the intermediate
329 results, even for non-commutative Defined Word operations.
330 Additionally, because the intermediate results are always written out
331 it is possible to service Precise Interrupts without affecting latency
332 (a common limitation of Vector ISAs implementing explicit
333 Parallel Reduction instructions, because their Architectural State cannot
334 hold the partial results).
335
336 When Rc=1 a corresponding Vector of co-resultant CRs is also
337 created. No special action is taken: the result *and its CR Field*
338 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
339
340 Note that the Schedule only makes sense on top of certain instructions:
341 X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
342 and the destination are all the same type. Like Scalar
343 Reduction, nothing is prohibited:
344 the results of execution on an unsuitable instruction may simply
345 not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
346 may be used, and whilst it is down to the Programmer to walk through the
347 process the Programmer can be confident that the Parallel-Reduction is
348 guaranteed 100% Deterministic.
349
350 Critical to note regarding use of Parallel-Reduction REMAP is that,
351 exactly as with all REMAP Modes, the `svshape` instruction *requests*
352 a certain Vector Length (number of elements to reduce) and then
353 sets VL and MAXVL at the number of **operations** needed to be
354 carried out. Thus, equally as importantly, like Matrix REMAP
355 the total number of operations
356 is restricted to 127. Any Parallel-Reduction requiring more operations
357 will need to be done manually in batches (hierarchical
358 recursive Reduction).
359
Also important to note is that the Deterministic Schedule is arranged
so that some implementations *may* parallelise it (as long as doing so
respects Program Order and Register Hazards). Performance (speed)
of any given
implementation is neither strictly defined nor guaranteed. As with
the Vulkan(tm) Specification, strict compliance is paramount whilst
performance is at the discretion of Implementors.
367
368 **Parallel-Reduction with Predication**
369
370 To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
371 completely separate from the actual element-level (scalar) operations,
372 Move operations are **not** included in the Schedule. This means that
373 the Schedule leaves the final (scalar) result in the first-non-masked
374 element of the Vector used. With the predicate mask being dynamic
375 (but deterministic) at a superficial glance it seems this result
376 could be anywhere.
377
378 If that result is needed to be moved to a (single) scalar register
379 then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
380 needed to get it, where the predicate is the exact same predicate used
381 in the prior Parallel-Reduction instruction.
382
383 * If there was only a single
384 bit in the predicate then the result will not have moved or been altered
385 from the source vector prior to the Reduction
386 * If there was more than one bit the result will be in the
387 first element with a predicate bit set.
388
389 In either case the result is in the element with the first bit set in
390 the predicate mask. Thus, no move/copy *within the Reduction itself* was needed.
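
The behaviour described above can be explored with a small model. The following is a sketch of a predicated tree-reduction schedule that redirects indices instead of issuing moves; it is illustrative only (the normative algorithm is in the appendix), and the function names are hypothetical.

```python
import operator

def predicated_reduce_schedule(vl, pred):
    """Yield (left, right) element pairs; masked-out elements are skipped
    by redirecting indices rather than by moving data."""
    ix = list(range(vl))
    step = 1
    while step < vl:
        step *= 2
        for i in range(0, vl, step):
            other = i + step // 2
            ci = ix[i]
            oi = ix[other] if other < vl else None
            other_pred = other < vl and pred[oi]
            if pred[ci] and other_pred:
                yield ci, oi
            elif other_pred:
                ix[i] = oi          # remember where the live value sits

def run_reduction(vec, pred, op=operator.add):
    vec = list(vec)
    for left, right in predicated_reduce_schedule(len(vec), pred):
        vec[left] = op(vec[left], vec[right])
    return vec

pred = [0, 1, 1, 0, 1, 1]
result = run_reduction([10, 20, 30, 40, 50, 60], pred)
first_set = pred.index(1)
print(first_set, result[first_set])
```

In this model the reduced value ends up in the element of the first set predicate bit, masked-out elements are never written, and a single-bit predicate produces no operations at all.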
391
392 Programmer's Note: For *some* hardware implementations
393 the vector-to-scalar copy may be a slow operation, as may the Predicated
394 Parallel Reduction itself.
395 It may be better to perform a pre-copy
396 of the values, compressing them (VREDUCE-style) into a contiguous block,
397 which will guarantee that the result goes into the very first element
398 of the destination vector, in which case clearly no follow-up
399 predicated vector-to-scalar MV operation is needed. A VREDUCE effect
400 is achieved by setting just a source predicate mask on Twin-Predicated
401 operations.
402
403 **Usage conditions**
404
405 The simplest usage is to perform an overwrite, specifying all three
406 register operands the same.
407
408 ```
409 svshape parallelreduce, 6
410 sv.add *8, *8, *8
411 ```
412
413 The Reduction Schedule will issue the Parallel Tree Reduction spanning
414 registers 8 through 13, by adjusting the offsets to RT, RA and RB as
415 necessary (see "Parallel Reduction algorithm" in a later section).
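
As a rough illustration of the overwrite case, the sketch below models an unpredicated tree-reduction schedule for 6 elements applied to registers 8 through 13. It assumes the default ordering (step begins at 2 and doubles, left index lower than right); the register file is a plain list and `tree_reduce_schedule` is a hypothetical name.

```python
def tree_reduce_schedule(n):
    """Yield (left, right) element offsets for an unpredicated tree reduction."""
    step = 2
    while step < 2 * n:
        for left in range(0, n, step):
            right = left + step // 2
            if right < n:
                yield left, right
        step *= 2

gpr = [0] * 32
gpr[8:14] = [1, 2, 3, 4, 5, 6]

# Model of: svshape parallelreduce, 6 ; sv.add *8, *8, *8
schedule = list(tree_reduce_schedule(6))
for left, right in schedule:
    gpr[8 + left] = gpr[8 + left] + gpr[8 + right]

print(schedule)
print(gpr[8:14])
```

In this model six elements need five operations, only the elements holding intermediate results are overwritten, and the final sum lands in register 8.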
416
417 A non-overwrite is possible as well but just as with the overwrite
418 version, only those destination elements necessary for storing
419 intermediary computations will be written to: the remaining elements
420 will **not** be overwritten and will **not** be zero'd.
421
422 ```
423 svshape parallelreduce, 6
424 sv.add *0, *8, *8
425 ```
426
427 However it is critical to note that if the source and destination are
428 not the same then the trick of using a follow-up vector-scalar MV will
429 not work.
430
431 **Sub-Vector Horizontal Reduction**
432
433 To achieve Sub-Vector Horizontal Reduction, Pack/Unpack should be enabled,
434 which will turn the Schedule around such that issuing of the Scalar
435 Defined Words is done with SUBVL looping as the inner loop not the
436 outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
437
438 *Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
439 will clearly result in data corruption. It may be best to perform
440 a Pack/Unpack Transposing copy of the data first*
441
442 ## FFT/DCT mode
443
submode2=0 is for FFT. For FFT submode the following schedules may be
selected:

* **submode=0b00** selects the ``j`` offset of the innermost for-loop
  of Cooley-Tukey
* **submode=0b10** selects the ``j+halfsize`` offset of the innermost for-loop
  of Cooley-Tukey
* **submode=0b11** selects the ``k`` of exptable (which coefficient)
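
The three FFT schedules correspond to the three index streams of the usual iterative radix-2 triple loop. A minimal sketch follows, assuming a power-of-two length and omitting the initial bit-reversal permutation; `fft_butterfly_schedule` is a hypothetical name, not an identifier from the reference code.

```python
def fft_butterfly_schedule(n):
    """Yield (j, j + halfsize, k) for each butterfly of an n-point
    in-place radix-2 FFT, in issue order."""
    size = 2
    while size <= n:                     # outermost loop: one layer per size
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):      # middle loop: one block per i
            k = 0
            for j in range(i, i + halfsize):   # innermost loop
                yield j, j + halfsize, k
                k += tablestep
        size *= 2

for j, jh, k in fft_butterfly_schedule(4):
    print(j, jh, k)
```

Each tuple supplies the element offsets of the two butterfly operands and the coefficient table offset, so one REMAP per operand is enough to drive a single butterfly instruction across the whole transform.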
452
When submode2 is 1 or 2, for DCT inner butterfly submode the following
schedules may be selected. When submode2 is 1, additional bit-reversing
is also performed.

* **submode=0b00** selects the ``j`` offset of the innermost for-loop,
  in-place
* **submode=0b01** selects the ``j+halfsize`` offset of the innermost for-loop,
  in reverse-order, in-place
* **submode=0b10** selects the ``ci`` count of the innermost for-loop,
  useful for calculating the cosine coefficient
* **submode=0b11** selects the ``size`` offset of the outermost for-loop,
  useful for the cosine coefficient ``cos((ci + 0.5) * pi / size)``

When submode2 is 3 or 4, for DCT outer butterfly submode the following
schedules may be selected. When submode2 is 3, additional bit-reversing
is also performed.

* **submode=0b00** selects the ``j`` offset of the innermost for-loop
* **submode=0b01** selects the ``j+1`` offset of the innermost for-loop
472
473 `zdimsz` is used as an in-place "Stride", particularly useful for
474 column-based in-place DCT/FFT.
475
476 ## Matrix Mode
477
478 In Matrix Mode, skip allows dimensions to be skipped from being included
479 in the resultant output index. This allows sequences to be repeated:
480 ```0 0 0 1 1 1 2 2 2 ...``` or in the case of skip=0b11 this results in
481 modulo ```0 1 2 0 1 2 ...```
482
483 * **skip=0b00** indicates no dimensions to be skipped
484 * **skip=0b01** sets "skip 1st dimension"
485 * **skip=0b10** sets "skip 2nd dimension"
486 * **skip=0b11** sets "skip 3rd dimension"
487
488 invxyz will invert the start index of each of x, y or z. If invxyz[0] is
489 zero then x-dimensional counting begins from 0 and increments, otherwise
490 it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.
491
492 offset will have the effect of offsetting the result by ```offset``` elements:
493
```
for i in 0..VL-1:
    GPR(RT + remap(i) + SVSHAPE.offset) = ....
```
498
499 This appears redundant because the register RT could simply be changed by a compiler, until element width overrides are introduced. Also
500 bear in mind that unlike a static compiler SVSHAPE.offset may
501 be set dynamically at runtime.
502
xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
that the array dimensionality for that dimension is 1. Any dimension
not intended to be used must have its value set to 0 (dimensionality
of 1). A value of xdimsz=2 would indicate that in the first dimension
there are 3 elements in the array. For example, to create a 2D array
X,Y of dimensionality X=3 and Y=2, set xdimsz=2, ydimsz=1 and zdimsz=0.
509
510 The format of the array is therefore as follows:
511
512 ```
513 array[xdimsz+1][ydimsz+1][zdimsz+1]
514 ```
515
516 However whilst illustrative of the dimensionality, that does not take the
517 "permute" setting into account. "permute" may be any one of six values
518 (0-5, with values of 6 and 7 indicating "Indexed" Mode). The table
519 below shows how the permutation dimensionality order works:
520
521 | permute | order | array format |
522 | ------- | ----- | ------------------------ |
523 | 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
524 | 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
525 | 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
526 | 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
527 | 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
528 | 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
529 | 110 | 0,1 | Indexed (xdim+1)(ydim+1) |
530 | 111 | 1,0 | Indexed (ydim+1)(xdim+1) |
531
532 In other words, the "permute" option changes the order in which
533 nested for-loops over the array would be done. See executable
534 python reference code for further details.
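
The idea of changing loop nesting order can be shown with a simplified model. This sketch is not the reference code: it assumes elements are laid out with x varying fastest, treats `order[0]` as the innermost loop, and ignores `skip`, `invxyz` and `offset`. The exact index arithmetic of the specification may differ.

```python
from itertools import product

def matrix_indices(xdimsz, ydimsz, zdimsz=0, order=(0, 1, 2)):
    """Yield linear element indices for a simplified permuted nested loop.

    The *dimsz arguments use the SVSHAPE encoding (actual size minus one).
    """
    dims = (xdimsz + 1, ydimsz + 1, zdimsz + 1)
    strides = (1, dims[0], dims[0] * dims[1])       # x fastest (assumption)
    inner, middle, outer = order[0], order[1], order[2]
    for o, m, i in product(range(dims[outer]),
                           range(dims[middle]),
                           range(dims[inner])):
        coord = [0, 0, 0]
        coord[outer], coord[middle], coord[inner] = o, m, i
        yield sum(c * s for c, s in zip(coord, strides))

# X=3, Y=2 is encoded as xdimsz=2, ydimsz=1, zdimsz=0
print(list(matrix_indices(2, 1, 0, order=(0, 1, 2))))   # linear walk
print(list(matrix_indices(2, 1, 0, order=(1, 0, 2))))   # transposed walk
```

Swapping the first two entries of the order turns a linear walk into a transposed walk over the same storage, without moving any data.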
535
536 *Note: permute=0b110 and permute=0b111 enable Indexed REMAP Mode,
537 described below*
538
539 With all these options it is possible to support in-place transpose,
540 in-place rotate, Matrix Multiply and Convolutions, without being
541 limited to Power-of-Two dimension sizes.
542
543 **Limitations and caveats**
544
545 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
546 restricted to 127: up to 127 FMAs (or other operation)
547 may be performed in total.
548 Also given that it is in-registers only at present some care has to be
549 taken on regfile resource utilisation. However it is perfectly possible
550 to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
551 the usual 6-level "Tiled" large Matrix Multiply, without the usual
552 difficulties associated with SIMD.
553
554 Also the `svshape` instruction only provides access to *part* of the
555 Matrix REMAP capability. Rotation and mirroring need to be done by
556 programming the SVSHAPE SPRs directly, which can take a lot more
557 instructions. Future versions of SVP64 will include EXT1xx prefixed
558 variants (`psvshape`) which provide more comprehensive capacity and
559 mitigate the need to write direct to the SVSHAPE SPRs.
560
561 Additionally there is not yet a way to set Matrix sizes from registers
562 with `svshape`: this was an intentional decision to simplify Hardware, that
563 may be corrected in a future version of SVP64. The limitation may presently
564 be overcome by direct programming of the SVSHAPE SPRs.
565
566 *Hardware Architectural note: with the Scheduling applying as a Phase between
567 Decode and Issue in a Deterministic fashion the Register Hazards may be
568 easily computed and a standard Out-of-Order Micro-Architecture exploited to good
569 effect. Even an In-Order system may observe that for large Outer Product
570 Schedules there will be no stalls, but if the Matrices are particularly
571 small size an In-Order system would have to stall, just as it would if
572 the operations were loop-unrolled without Simple-V. Thus: regardless
573 of the Micro-Architecture the Hardware Engineer should first consider
574 how best to process the exact same equivalent loop-unrolled instruction
575 stream. Once solved Matrix REMAP will fit naturally.*
576
577 ## Indexed Mode
578
579 Indexed Mode activates reading of the element indices from the GPR
580 and includes optional limited 2D reordering.
581 In its simplest form (without elwidth overrides or other modes):
582
```
def index_remap(i):
    return GPR((SVSHAPE.SVGPR<<1)+i) + SVSHAPE.offset

for i in 0..VL-1:
    element_result = ....
    GPR(RT + index_remap(i)) = element_result
```
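
A runnable rendering of the simplest form may help. It is a sketch only: the register file is a Python list, the operation is a plain element copy, and `indexed_copy` is a hypothetical helper. As in the pseudocode, the Index Vector starts at GPR number `SVGPR<<1`.

```python
def index_remap(gpr, svgpr, offset, i):
    """Look up the i-th Index from the GPRs starting at (svgpr << 1)."""
    return gpr[(svgpr << 1) + i] + offset

def indexed_copy(gpr, rt, ra, svgpr, offset, vl):
    """Copy vl elements from RA to RT, with Indexed REMAP on the destination."""
    for i in range(vl):
        element_result = gpr[ra + i]
        gpr[rt + index_remap(gpr, svgpr, offset, i)] = element_result

gpr = [0] * 32
gpr[8:12] = [2, 0, 3, 1]               # Indices, held in GPRs 8-11 (SVGPR=4)
gpr[16:20] = [100, 101, 102, 103]      # source vector at RA=16
indexed_copy(gpr, rt=24, ra=16, svgpr=4, offset=0, vl=4)
print(gpr[24:28])
```

Here the destination is scattered according to the Indices, which is the generalised "Permute" behaviour that Indexed Mode is intended to provide.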
591
592 With element-width overrides included, and using the pseudocode
593 from the SVP64 [[sv/svp64/appendix#elwidth]] elwidth section
594 this becomes:
595
```
def index_remap(i):
    svreg = SVSHAPE.SVGPR << 1
    srcwid = elwid_to_bitwidth(SVSHAPE.elwid)
    offs = SVSHAPE.offset
    return get_polymorphed_reg(svreg, srcwid, i) + offs

for i in 0..VL-1:
    element_result = ....
    rt_idx = index_remap(i)
    set_polymorphed_reg(RT, destwid, rt_idx, element_result)
```
608
609 Matrix-style reordering still applies to the indices, except limited
610 to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or
611 (Y,X) for in-place Transposition.
612 Only one dimension may optionally be skipped. Inversion of either
613 X or Y or both is possible (2D mirroring). Pseudocode for Indexed Mode (including elwidth
614 overrides) may be written in terms of Matrix Mode, specifically
615 purposed to ensure that the 3rd dimension (Z) has no effect:
616
```
def index_remap(ISHAPE, i):
    MSHAPE.skip   = 0b0 || ISHAPE.sk1
    MSHAPE.invxyz = 0b0 || ISHAPE.invxy
    MSHAPE.xdimsz = ISHAPE.xdimsz
    MSHAPE.ydimsz = ISHAPE.ydimsz
    MSHAPE.zdimsz = 0 # disabled
    if ISHAPE.permute = 0b110 # 0,1
        MSHAPE.permute = 0b000 # 0,1,2
    if ISHAPE.permute = 0b111 # 1,0
        MSHAPE.permute = 0b010 # 1,0,2
    el_idx = remap_matrix(MSHAPE, i)
    svreg = ISHAPE.SVGPR << 1
    srcwid = elwid_to_bitwidth(ISHAPE.elwid)
    offs = ISHAPE.offset
    return get_polymorphed_reg(svreg, srcwid, el_idx) + offs
```
634
635 The most important observation above is that the Matrix-style
636 remapping occurs first and the Index lookup second. Thus it
637 becomes possible to perform in-place Transpose of Indices which
638 may have been costly to set up or costly to duplicate
639 (waste register file space). In other words: it is fine for two or more
640 SVSHAPEs to simultaneously use the same
641 Indices (use the same GPRs), even if one SVSHAPE has different
642 2D dimensions and ordering from the others.
643
644 **Caveats and Limitations**
645
646 The purpose of Indexing is to provide a generalised version of
647 Vector ISA "Permute" instructions, such as VSX `vperm`. The
648 Indexing is abstracted out and may be applied to much more
649 than an element move/copy, and is not limited for example
650 to the number of bytes that can fit into a VSX register.
651 Indexing may be applied to LD/ST (even on Indexed LD/ST
652 instructions such as `sv.lbzx`), arithmetic operations,
653 extsw: there is no artificial limit.
654
655 The only major caveat is that the registers to be used as
656 Indices must not be modified by any instruction after Indexed Mode
657 is established, and neither must MAXVL be altered. Additionally,
658 no register used as an Index may exceed MAXVL-1.
659
660 Failure to observe
661 these conditions results in `UNDEFINED` behaviour.
662 These conditions allow a Read-After-Write (RAW) Hazard to be created on
663 the entire range of Indices to be subsequently used, but a corresponding
664 Write-After-Read Hazard by any instruction that modifies the Indices
665 **does not have to be created**. Given the large number of registers
666 involved in Indexing this is a huge resource saving and reduction
667 in micro-architectural complexity. MAXVL is likewise
668 included in the RAW Hazards because it is involved in calculating
669 how many registers are to be considered Indices.
670
671 With these Hazard Mitigations in place, high-performance implementations
672 may read-cache the Indices at the point where a given `svindex` instruction
673 is called (or SVSHAPE SPRs - and MAXVL - directly altered) by issuing
674 background GPR register file reads whilst other instructions are being
675 issued and executed.

Indexed REMAP **does not prevent conflicts** (overlapping
destinations). On a superficial analysis this may be perceived to be a
problem, until it is recalled that, firstly, Simple-V is designed specifically
to require Program Order to be respected, and secondly that Matrix, DCT and FFT
all *already* critically depend on overlapping Reads/Writes: Matrix
uses overlapping registers as accumulators. Thus the Register Hazard
Management needed by Indexed REMAP *has* to be in place anyway.

*Programmer's Note: `hphint` may be used to help hardware identify
parallelism opportunities but it is critical to remember that the
groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*

The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
clearly that of the additional reading of the GPRs to be used as Indices,
plus the setup cost associated with creating those same Indices.
If any Deterministic REMAP can cover the required task, clearly it
is advisable to use it instead.

*Programmer's note: some algorithms may require skipping of Indices exceeding
VL-1, not MAXVL-1. This may be achieved programmatically by performing
an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
and RB contains the value of VL returned from `setvl`. The resultant
CR Fields may then be used as Predicate Masks to exclude those operations
with an Index exceeding VL-1.*
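
The masking technique in the note above may be sketched in Python as
follows. The comparison stands in for the `sv.cmp` and the resulting list
of booleans stands in for the CR Field Predicate Mask. Using a "less than
VL" test as the predicate, and the helper names themselves, are assumptions
for illustration only.

```python
def index_predicate(indices, vl):
    """One flag per element: True where the Index does not exceed VL-1."""
    return [idx < vl for idx in indices]


def masked_indexed_copy(src, indices, vl, maxvl):
    """Copy src[i] to dest[indices[i]], skipping masked-out elements."""
    dest = [None] * maxvl
    mask = index_predicate(indices, vl)
    for i in range(vl):
        if mask[i]:
            dest[indices[i]] = src[i]
    return dest


result = masked_indexed_copy([10, 11, 12, 13], [0, 5, 2, 7], vl=4, maxvl=8)
print(result)  # [10, None, 12, None, None, None, None, None]
```

Indices 5 and 7 are legal with respect to MAXVL-1 but exceed VL-1, so
the corresponding operations are excluded.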

-------------

\newpage{}

# svshape instruction <a name="svshape"> </a>

SVM-Form

* svshape SVxd,SVyd,SVzd,SVRM,vf

| 0:5|6:10 |11:15 |16:20 | 21:24 | 25 | 26:31 | name |
| -- | -- | --- | ----- | ------ | -- | ------| -------- |
|PO | SVxd | SVyd | SVzd | SVRM | vf | XO | svshape |

See [[sv/remap/appendix]] for `svshape` pseudocode

Special Registers Altered:

```
SVSTATE, SVSHAPE0-3
```

`svshape` is a convenience instruction that reduces instruction
count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
including VL and MAXVL. Using `svshape` therefore does not also
require `setvl`.

Fields:

* **SVxd** - SV REMAP "xdim" (X-dimension)
* **SVyd** - SV REMAP "ydim" (Y-dimension, sometimes used for sub-mode selection)
* **SVzd** - SV REMAP "zdim" (Z-dimension)
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT etc.)
* **vf** - sets "Vertical-First" mode
* **XO** - standard 6-bit XO field

*Note: SVxd, SVyd and SVzd are all stored "off-by-one". In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
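
As a small illustration of the "off-by-one" storage, the following Python
sketch converts between the assembler value and the 5-bit field. The
function names are hypothetical and are not part of any toolchain.

```python
def encode_dim(value):
    """Assembler dimension 1..32 to 5-bit field 0b00000..0b11111."""
    if not 1 <= value <= 32:
        raise ValueError("dimension must be in the range 1..32")
    return value - 1


def decode_dim(field):
    """5-bit field 0b00000..0b11111 back to dimension 1..32."""
    if not 0 <= field <= 0b11111:
        raise ValueError("field must fit in 5 bits")
    return field + 1


print(encode_dim(1), encode_dim(32))  # 0 31
print(decode_dim(0b00111))            # 8
```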

Of the 16 possible SVRM values, 12 are REMAP Modes, 2 are RESERVED
for `svshape2` and 2 are RESERVED:

| SVRM | Remap Mode description |
| -- | -- |
| 0b0000 | Matrix 1/2/3D |
| 0b0001 | FFT Butterfly |
| 0b0010 | reserved |
| 0b0011 | DCT Outer butterfly |
| 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
| 0b0101 | DCT COS table index generation |
| 0b0110 | DCT half-swap |
| 0b0111 | Parallel Reduction and Prefix Sum |
| 0b1000 | reserved for svshape2 |
| 0b1001 | reserved for svshape2 |
| 0b1010 | reserved |
| 0b1011 | iDCT Outer butterfly |
| 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
| 0b1101 | iDCT COS table index generation |
| 0b1110 | iDCT half-swap |
| 0b1111 | FFT half-swap |

Examples showing how all of these Modes operate exist in the online
[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD). Explaining
these Modes in further detail is beyond the scope of this document.

In Indexed Mode, there are only 5 bits available to specify the GPR
to use, out of 128 GPRs (7 bit numbering). Therefore, only the top
5 bits are given in the `SVxd` field: the bottom two implicit bits
will be zero (`SVxd || 0b00`).
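
The `SVxd || 0b00` concatenation amounts to a left shift by two, which
restricts the starting GPR to multiples of four. A minimal Python sketch,
assuming the raw 5-bit field value is supplied (the function name is
hypothetical):

```python
def index_base_gpr(svxd_field):
    """Return the 7-bit GPR number formed as SVxd || 0b00."""
    if not 0 <= svxd_field <= 0b11111:
        raise ValueError("SVxd field must fit in 5 bits")
    return svxd_field << 2


print(index_base_gpr(0b00011))  # 12
print(index_base_gpr(0b11111))  # 124
```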

`svshape` has *limited applicability* due to being a 32-bit instruction.
The full capability of SVSHAPE SPRs may be accessed by directly writing
to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
larger than 32, and in-place Transpose. Potentially a future v3.1 Prefixed
instruction, `psvshape`, may extend the capability here.

Programmer's Note: Parallel Reduction Mode is selected by setting `SVRM=7,SVyd=1`.
Prefix Sum Mode is selected by setting `SVRM=7,SVyd=3`:

```
# Vector length of 8.
svshape 8, 3, 1, 0x7, 0
# activate SVSHAPE0 (prefix-sum lhs) for RA
# activate SVSHAPE1 (prefix-sum rhs) for RT and RB
svremap 7, 0, 1, 0, 1, 0, 0
sv.add *10, *10, *10
```

*Architectural Resource Allocation note: the SVRM field is carefully
crafted to allocate two Modes, corresponding to bits 21-23 within the
instruction being set to the value `0b100`, to `svshape2` (not
`svshape`). These two Modes are
considered "RESERVED" within the context of `svshape` but it is
absolutely critical to allocate the exact same pattern in XO for
both instructions in bits 26-31.*
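
The allocation described in the note can be checked with a short Python
sketch. It transcribes the SVRM table into a dictionary and treats the top
three bits of the 4-bit SVRM field as instruction bits 21-23. The
dictionary and function names are illustrative only.

```python
SVRM_MODES = {
    0b0000: "Matrix 1/2/3D",
    0b0001: "FFT Butterfly",
    0b0011: "DCT Outer butterfly",
    0b0100: "DCT Inner butterfly, on-the-fly (Vertical-First Mode)",
    0b0101: "DCT COS table index generation",
    0b0110: "DCT half-swap",
    0b0111: "Parallel Reduction and Prefix Sum",
    0b1011: "iDCT Outer butterfly",
    0b1100: "iDCT Inner butterfly, on-the-fly (Vertical-First Mode)",
    0b1101: "iDCT COS table index generation",
    0b1110: "iDCT half-swap",
    0b1111: "FFT half-swap",
}


def belongs_to_svshape2(svrm):
    """True when bits 21-23 (the top three bits of SVRM) are 0b100."""
    return (svrm >> 1) == 0b100


def classify(svrm):
    """Describe a 4-bit SVRM value according to the table."""
    if belongs_to_svshape2(svrm):
        return "reserved for svshape2"
    return SVRM_MODES.get(svrm, "reserved")


for svrm in range(16):
    print(f"0b{svrm:04b}: {classify(svrm)}")
```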

-------------

\newpage{}


# svindex instruction <a name="svindex"> </a>

SVI-Form

| 0:5|6:10 |11:15 |16:20 | 21:25 | 26:31 | Form |
| -- | -- | --- | ---- | ----------- | ------| -------- |
| PO | SVG | rmm | SVd | ew/yx/mm/sk | XO | SVI-Form |

* svindex SVG,rmm,SVd,ew,SVyx,mm,sk

See [[sv/remap/appendix]] for `svindex` pseudocode

Special Registers Altered:

```
SVSTATE, SVSHAPE0-3
```

`svindex` is a convenience instruction that reduces instruction count
for Indexed REMAP Mode. It sets up (overwrites) all required SVSHAPE
SPRs and **unlike** `svshape` can modify the REMAP area of the SVSTATE
SPR as well, including setting persistence. The relevant SPRs *may*
be directly programmed with `mtspr` however it is laborious to do so:
`svindex` saves instructions covering much of Indexed REMAP capability.

Fields:

* **SVd** - SV REMAP x/y dim
* **rmm** - REMAP mask: sets remap mi0-2/mo0-1 and SVSHAPEs,
  controlled by mm
* **ew** - sets element width override on the Indices
* **SVG** - GPR SVG<<2 to be used for Indexing
* **yx** - 2D reordering to be used if yx=1
* **mm** - mask mode. determines how `rmm` is interpreted.
* **sk** - Dimension skipping enabled

*Note: SVd, like SVxd, SVyd and SVzd of `svshape`, is stored
"off-by-one". In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*.

*Note: when `yx=1,sk=0` the second dimension is calculated as
`CEIL(MAXVL/SVd)`*.

When `mm=0`:

* `rmm`, like REMAP.SVme, has bit 0
  correspond to mi0, bit 1 to mi1, bit 2 to mi2,
  bit 3 to mo0 and bit 4 to mo1
* all SVSHAPEs and the REMAP parts of SVSTATE are first reset (initialised to zero)
* for each bit set in the 5-bit `rmm`, in order, the first
  as-yet-unset SVSHAPE will be updated
  with the other operands in the instruction, and the REMAP
  SPR set.
* If all 5 bits of `rmm` are set then both mi0 and mo1 use SVSHAPE0.
* SVSTATE persistence bit is cleared
* No other alterations to SVSTATE are carried out

Example 1: if rmm=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
uses SVSHAPE1. REMAP.SVme is also set to 0b00110, REMAP.mi1=0
(SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)

Example 2: if rmm=0b10001 then again SVSHAPE0 and SVSHAPE1
are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1

Rough algorithmic form:

```
marray = [mi0, mi1, mi2, mo0, mo1]
idx = 0
for bit = 0 to 4:
    if not rmm[bit]: continue
    setup(SVSHAPE[idx])
    SVSTATE{marray[bit]} = idx
    idx = (idx+1) modulo 4
```
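
The rough algorithm above can be turned into a small runnable Python
model and checked against the two worked examples. This is a sketch only:
it records which SVSHAPE each selector is given rather than modelling the
SPRs, and it assumes (as the examples imply) that `rmm` bits are numbered
from the least significant end when `mm=0`.

```python
MARRAY = ["mi0", "mi1", "mi2", "mo0", "mo1"]


def svindex_mm0(rmm):
    """Return ({selector: SVSHAPE number}, SVme) for a 5-bit rmm, mm=0."""
    if not 0 <= rmm <= 0b11111:
        raise ValueError("rmm must fit in 5 bits")
    remap = {}
    idx = 0
    for bit in range(5):
        if not (rmm >> bit) & 1:
            continue
        remap[MARRAY[bit]] = idx   # selector now points at SVSHAPE[idx]
        idx = (idx + 1) % 4        # wrap after SVSHAPE3
    return remap, rmm              # SVme mirrors rmm in the examples


print(svindex_mm0(0b00110))  # ({'mi1': 0, 'mi2': 1}, 6)
print(svindex_mm0(0b10001))  # ({'mi0': 0, 'mo1': 1}, 17)
```

With all five bits set the wrap-around assigns SVSHAPE0 to both mi0 and
mo1, matching the behaviour listed for that case.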

When `mm=1`:

* bits 0-2 (MSB0 numbering) of `rmm` indicate an index selecting mi0-mo1
* bits 3-4 (MSB0 numbering) of `rmm` indicate which SVSHAPE 0-3 shall
  be updated
* only the selected SVSHAPE is overwritten
* only the relevant bits in the REMAP area of SVSTATE are updated
* REMAP persistence bit is set.

Example 1: if `rmm`=0b01110 then bits 0-2 (MSB0) are 0b011 and
bits 3-4 are 0b10. Thus mo0 is selected and SVSHAPE2 is
to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
set to 2 (SVSHAPE2).

Example 2: if `rmm`=0b10011 then bits 0-2 (MSB0) are 0b100 and
bits 3-4 are 0b11. Thus mo1 is selected and SVSHAPE3 is
to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
set to 3 (SVSHAPE3).

Rough algorithmic form:

```
marray = [mi0, mi1, mi2, mo0, mo1]
bit = rmm[0:2]
idx = rmm[3:4]
setup(SVSHAPE[idx])
SVSTATE{marray[bit]} = idx
SVSTATE.pst = 1
```
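
The `mm=1` decode can likewise be sketched in Python and checked against
both examples. The MSB0 bits 0-2 of the 5-bit `rmm` are its top three bits
and bits 3-4 are its bottom two. Rejecting selector values 5 to 7 is an
assumption of this sketch: the text above does not say how they are
treated.

```python
MARRAY = ["mi0", "mi1", "mi2", "mo0", "mo1"]


def svindex_mm1(rmm):
    """Return (selector name, SVme bit, SVSHAPE number) for rmm, mm=1."""
    if not 0 <= rmm <= 0b11111:
        raise ValueError("rmm must fit in 5 bits")
    bit = (rmm >> 2) & 0b111   # MSB0 bits 0-2: which of mi0..mo1
    idx = rmm & 0b11           # MSB0 bits 3-4: which SVSHAPE
    if bit >= len(MARRAY):
        raise ValueError("selector value not covered by this sketch")
    return MARRAY[bit], bit, idx


print(svindex_mm1(0b01110))  # ('mo0', 3, 2)
print(svindex_mm1(0b10011))  # ('mo1', 4, 3)
```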

In essence, `mm=0` is intended for use to set as much of the
REMAP State SPRs as practical with a single instruction,
whilst `mm=1` is intended for finer-grained updates that touch
only one selector and one SVSHAPE at a time.

**Usage guidelines**

* **Disable 2D mapping**: to only perform Indexing without
  reordering use `SVd=1,sk=0,yx=0` (or set SVd to a value greater than
  or equal to VL)
* **Modulo 1D mapping**: to perform Indexing cycling through the
  first N Indices use `SVd=N,sk=0,yx=0` where `VL>N`. There is
  no requirement to set VL equal to a multiple of N.
* **Modulo 2D transposed**: `SVd=M,sk=0,yx=1`, sets
  `xdim=M,ydim=CEIL(MAXVL/M)`.
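
Two of the calculations in these guidelines are simple enough to sketch
in Python. The first computes `CEIL(MAXVL/M)` with integer arithmetic. The
second lists which of the first N Indices each step would select, assuming
that "cycling" means plain wrap-around (`step modulo N`); that reading,
and the function names, are assumptions of this sketch rather than
statements from the specification pseudocode.

```python
def second_dimension(maxvl, svd):
    """CEIL(MAXVL/SVd) without floating point."""
    return (maxvl + svd - 1) // svd


def modulo_1d_positions(vl, n):
    """Position within the first N Indices used at each step."""
    return [step % n for step in range(vl)]


print(second_dimension(maxvl=10, svd=4))  # 3
print(modulo_1d_positions(vl=7, n=3))     # [0, 1, 2, 0, 1, 2, 0]
```

Note that VL=7 is not a multiple of N=3 and the final cycle is simply
cut short.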

Beyond these mappings it becomes necessary to write directly to
the SVSTATE and SVSHAPE SPRs manually.

-------------

\newpage{}


# svshape2 (offset-priority) <a name="svshape2"> </a>

SVM2-Form

| 0:5|6:9 |10|11:15 |16:20 | 21:24 | 25 | 26:31 | Form |
| -- |----|--| --- | ----- | ------ | -- | ------| -------- |
| PO |offs|yx| rmm | SVd | 100/mm | sk | XO | SVM2-Form |

* svshape2 offs,yx,rmm,SVd,sk,mm

See [[sv/remap/appendix]] for `svshape2` pseudocode

Special Registers Altered:

```
SVSTATE, SVSHAPE0-3
```

`svshape2` is an additional convenience instruction that prioritises
setting `SVSHAPE.offset`. Its primary purpose is for use when
element-width overrides are used. It has identical capabilities to `svindex`
in terms of both options (skip, etc.) and ability to activate REMAP
(rmm, mask mode) but unlike `svindex` it does not set GPR REMAP:
only a 1D or 2D `svshape`, and
unlike `svshape` it can set an arbitrary `SVSHAPE.offset` immediate.

One of the limitations of Simple-V is that Vector elements start on the boundary
of the Scalar regfile, which is fine when element-width overrides are not
needed. If the starting point of a Vector with smaller elwidths must begin
in the middle of a register, normally there would be no way to do so except
through costly LD/ST. `SVSHAPE.offset` caters for this scenario and `svshape2`
makes it easier to access.

**Operand Fields**:

* **offs** (4 bits) - unsigned offset
* **yx** (1 bit) - swap XY to YX
* **SVd** - dimension size
* **rmm** - REMAP mask
* **mm** - mask mode
* **sk** (1 bit) - skips 1st dimension if set

Dimensions are calculated exactly as `svindex`. `rmm` and
`mm` are as per `svindex`.
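
The SVM2-Form layout can be illustrated with a Python sketch that packs
raw field values into a 32-bit word using MSB0 bit positions taken from
the form table. The `po` and `xo` arguments are placeholders: their real
values are not given in this section, and the function names are
hypothetical. No "off-by-one" adjustment is applied; the caller supplies
the field values exactly as they are to be stored.

```python
def pack_svm2(po, offs, yx, rmm, svd, mm, sk, xo):
    """Pack SVM2-Form fields into a 32-bit instruction word (MSB0 bits)."""
    fields = [  # (value, first bit, last bit)
        (po, 0, 5), (offs, 6, 9), (yx, 10, 10), (rmm, 11, 15),
        (svd, 16, 20), (0b100, 21, 23), (mm, 24, 24), (sk, 25, 25),
        (xo, 26, 31),
    ]
    word = 0
    for value, first, last in fields:
        width = last - first + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit bits {first}:{last}")
        word |= value << (31 - last)
    return word


def extract(word, first, last):
    """Read back MSB0 bits first..last of a 32-bit word."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


# po=0 and xo=0 are placeholders, not the real opcode values.
word = pack_svm2(po=0, offs=5, yx=1, rmm=0b00110, svd=3, mm=0, sk=1, xo=0)
print(bin(extract(word, 21, 23)))  # 0b100
```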

*Programmer's Note: offsets for `svshape2` may be specified in the range
0-15. Given that the principle of Simple-V is to fit on top of
byte-addressable register files and that GPR and FPR are 64-bit (8 bytes)
it should be clear that the offset may, when `elwidth=8`, begin an
element-level operation starting element zero at any arbitrary byte.
On cursory examination attempting to go beyond the range 0-7 seems
unnecessary given that the **next GPR or FPR** is an
alias for an offset in the range 8-15. Thus by simply increasing
the starting Vector point of the operation to the next register it
can be seen that the offset of 0-7 would be sufficient. Unfortunately,
because some operations are EXTRA2-encoded, it is **not possible**
to increase the GPR/FPR register number by one: EXTRA2-encoding
of GPR/FPR Vector numbers is restricted to even numbering.
For CR Fields the EXTRA2 encoding is even more sparse.
The additional offset range (8-15) helps overcome these limitations.*
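
The aliasing described in this note can be made concrete with a short
Python sketch. It assumes the offset is counted in elements, so that at
`elwidth=8` one offset unit is one byte, and it assumes 8-byte registers
as stated above. The function name is hypothetical.

```python
BYTES_PER_REG = 8  # GPR and FPR are 64-bit


def element_zero_location(base_reg, offset, elwidth_bits=8):
    """Return (register, byte within register) where element zero starts."""
    start_byte = offset * (elwidth_bits // 8)
    reg_delta, byte = divmod(start_byte, BYTES_PER_REG)
    return base_reg + reg_delta, byte


print(element_zero_location(10, 7))   # (10, 7)
print(element_zero_location(10, 8))   # (11, 0)
print(element_zero_location(10, 11))  # (11, 3)
```

An offset of 11 from even-numbered register 10 starts at byte 3 of
register 11, a starting point that an EXTRA2-encoded Vector register
number could not name directly.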

*Hardware Implementor's note: with the offsets only being immediates
and with register numbering being entirely immediate as well it is
possible to correctly compute Register Hazards without requiring
reading the contents of any SPRs. If however there are
instructions that have directly written to the SVSTATE or SVSHAPE
SPRs and those instructions are still in-flight then this position
is clearly **invalid**. This is why Programmers are strongly
discouraged from directly writing to these SPRs.*

*Architectural Resource Allocation note: this instruction shares
the space of `svshape`. Therefore it is critical that the two
instructions, `svshape` and `svshape2`, have the exact same XO
in bits 26 thru 31. It is also critical that for `svshape2`,
bit 21 of the instruction is a 1, bit 22 is a 0, and bit 23 is a 0.*

[[!tag standards]]

-------------

\newpage{}
