(no commit message)
[libreriscv.git] / openpower / sv / rfc / ls009.mdwn
1 # RFC ls009 Simple-V REMAP Subsystem
2
3 **URLs**:
4
5 * <https://libre-soc.org/openpower/sv/rfc/ls009/>
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
7 * <https://git.openpower.foundation/isa/PowerISA/issues/124>
8
9 **Severity**: Major
10
11 **Status**: New
12
13 **Date**: 26 Mar 2023. v2 TODO
14
15 **Target**: v3.2B
16
17 **Source**: v3.0B
18
19 **Books and Section affected**:
20
21 ```
22 Book I, new Zero-Overhead-Loop Chapter.
23 Appendix E Power ISA sorted by opcode
24 Appendix F Power ISA sorted by version
25 Appendix G Power ISA sorted by Compliancy Subset
26 Appendix H Power ISA sorted by mnemonic
27 ```
28
29 **Summary**
30
31 ```
32 svremap - Re-Mapping of Register Element Offsets
33 svindex - General-purpose setting of SHAPEs to be re-mapped
34 svshape - Hardware-level setting of SHAPEs for element re-mapping
35 svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
36 ```
37
38 **Submitter**: Luke Leighton (Libre-SOC)
39
40 **Requester**: Libre-SOC
41
42 **Impact on processor**:
43
44 ```
45 Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
46 Management Instructions which provide advanced features such as Matrix
47 FFT DCT Hardware-Assist Schedules and general-purpose Index reordering.
48 ```
49
50 **Impact on software**:
51
52 ```
53 Requires support for new instructions in assembler, debuggers,
54 and related tools.
55 ```
56
57 **Keywords**:
58
59 ```
60 Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
61 Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
62 Digital Signal Processing (DSP)
63 ```
64
65 **Motivation**
66
67 These REMAP Management instructions provide state-of-the-art advanced capabilities
68 to dramatically decrease instruction count and power reduction whilst retaining
69 unprecedented general-purpose capability and a standard Sequential Execution Model.
70
71 **Notes and Observations**:
72
73 1. Although costly the alternatives in SIMD-paradigm software result in
74 huge algorithmic complexity and associated power consumption increases.
75 Loop-unrolling compilers are prevalent as is thousands to tens of
76 thousands of instructions.
77 2. Core inner kernels of Matrix DCT DFT FFT NTT are dramatically reduced
78 to a handful of instructions.
79 3. No REMAP instructions with the exception of Indexed rely on registers
80 for the establishment of REMAP capability.
81 4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
82 and flexibility of REMAP.
83 5. The Simple-V Compliancy Subsets make REMAP optional except in the
84 Advanced Levels. Like v3.1 MMA it is **not** necessary for **all**
85 hardware to implement REMAP.
86
87 **Changes**
88
89 Add the following entries to:
90
91 * the Appendices of SV Book
92 * Instructions of SV Book as a new Section
93 * SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
94
95 ----------------
96
97 \newpage{}
98
99 # Rationale and background
100
101 In certain common algorithms there are clear patterns of behaviour which
102 typically require inline loop-unrolled instructions comprising interleaved
103 permute and arithmetic operations. REMAP was invented to split the permuting
104 from the arithmetic, and to allow the Indexing to be done as a hardware Schedule.
105
106 A normal Vector Add:
107
108 ```
109  for i in range(VL):
110    GPR[RT+i] <= GPR[RA+i] +
111 GPR[RB+i];
112 ```
113
114 A Hardware-assisted REMAP Vector Add:
115
116 ```
117  for i in range(VL):
118    GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] +
119 GPR[RB+remap3(i)];
120 ```
121
122 The result is a huge saving on register file accesses (no need to calculate Indices
123 then use Permutation instructions), instruction count (Matrix Multiply up to 127 FMACs
124 is 3 instructions), and programmer sanity.
125
126 # REMAP types
127
128 This section summarises the motivation for each REMAP Schedule
129 and briefly goes over their characteristics and limitations.
130 Further details on the Deterministic Precise-Interruptible algorithms
131 used in these Schedules is found in the [[sv/remap/appendix]].
132
133 ## Matrix (1D/2D/3D shaping)
134
135 Matrix Multiplication is a huge part of High-Performance Compute,
136 and 3D.
137 In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
138 Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
139 cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling.
140 Aside from the cost of the load on the L1 I-Cache, the trick only
141 works if one of the dimensions X or Y are power-two. Prime Numbers
142 (5x7, 3x5) become deeply problematic to unroll.
143
144 Even traditional Scalable Vector ISAs have issues with Matrices, often
145 having to perform data Transpose by pushing out through Memory and back
146 (costly),
147 or computing Transposition Indices (costly) then copying to another
148 Vector (costly).
149
150 Matrix REMAP was thus designed to solve these issues by providing Hardware
151 Assisted
152 "Schedules" that can view what would otherwise be limited to a strictly
153 linear Vector as instead being 2D (even 3D) *in-place* reordered.
154 With both Transposition and non-power-two being supported the issues
155 faced by other ISAs are mitigated.
156
157 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
158 restricted to 127: up to 127 FMAs (or other operation)
159 may be performed in total.
160 Also given that it is in-registers only at present some care has to be
161 taken on regfile resource utilisation. However it is perfectly possible
162 to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
163 the usual 6-level "Tiled" large Matrix Multiply, without the usual
164 difficulties associated with SIMD.
165
166 Also the `svshape` instruction only provides access to part of the
167 Matrix REMAP capability. Rotation and mirroring need to be done by
168 programming the SVSHAPE SPRs directly, which can take a lot more
169 instructions. Future versions of SVP64 will include EXT1xx prefixed
170 variants (`psvshape`) which provide more comprehensive capacity and
171 mitigate the need to write direct to the SVSHAPE SPRs.
172
173 ## FFT/DCT Triple Loop
174
175 DCT and FFT are some of the most astonishingly used algorithms in
176 Computer Science. Radar, Audio, Video, R.F. Baseband and dozens more. At least
177 two DSPs, TMS320 and Hexagon, have VLIW instructions specially tailored
178 to FFT.
179
180 An in-depth analysis showed that it is possible to do in-place in-register
181 DCT and FFT as long as twin-result "butterfly" instructions are provided.
182 These can be found in the [[openpower/isa/svfparith]] page if performing
183 IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
184 integer operations would be required)*. These "butterfly" instructions
185 avoid the need for a temporary register because the two array positions
186 being overwritten will be "in-flight" in any In-Order or Out-of-Order
187 micro-architecture.
188
189 DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
190 accept predicate masks. Given that it is common to perform recursive
191 convolutions combining smaller Power-2 DCT/FFT to create larger DCT/FFTs
192 in practice the RADIX2 limit is not a problem. A Bluestein convolution
193 to compute arbitrary length is demonstrated by
194 [Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
195
196 ## Indexed
197
198 The purpose of Indexing is to provide a generalised version of
199 Vector ISA "Permute" instructions, such as VSX `vperm`. The
200 Indexing is abstracted out and may be applied to much more
201 than an element move/copy, and is not limited for example
202 to the number of bytes that can fit into a VSX register.
203 Indexing may be applied to LD/ST (even on Indexed LD/ST
204 instructions such as `sv.lbzx`), arithmetic operations,
205 extsw: there is no artificial limit.
206
207 The original motivation for Indexed REMAP was to mitigate the need to add
208 an expensive `mv.x` to the Scalar ISA, which was highly likely to be rejected as
209 a stand-alone instruction
210 (`GPR(RT) <- GPR(GPR(RA))`). Usually a Vector ISA would add a non-conflicting
211 variant (as in VSX `vperm`) but it is common to need to permute by source,
212 with the risk of conflict, that has to be resolved, for example, in AVX-512
213 with `conflictd`.
214
215 Indexed REMAP is the "Get out of Jail Free" card which (for a price)
216 allows any general-purpose arbitrary Element Permutation not covered by
217 the other Hardware Schedules.
218
219 ## Parallel Reduction
220
221 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
222 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
223 *appearance* and *effect* of Reduction. Parallel Reduction is not limited
224 to Power-of-two but is limited as usual by the total number of
225 element operations (127) as well as available register file size.
226
227 Parallel Reduction is normally done explicitly as a loop-unrolled operation,
228 taking up a significant number of instructions. With REMAP it is just three
229 instructions: two for setup and one Scalar Base.
230
231 ## Parallel Prefix Sum
232
233 This is a work-efficient Parallel Schedule that for example produces Trangular
234 or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
235 to Parallel Reduction. Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
236 end-result, implementations may only implement Mapreduce in serial form (or give
237 the appearance to Programmers of the same). The Parallel Prefix Schedule is
238 *required* to be implemented in such a way that its Deterministic Schedule may be
239 parallelised. Like the Reduction Schedule it is 100% Deterministic and consequently
240 may be used with non-commutative operations.
241
242 Parallel Reduction combined with Parallel Prefix Sum can be combined to help
243 perform AI non-linear interpolation in around twelve to fifteen instructions.
244
245 ----------------
246
247 \newpage{}
248
249 Add the following to SV Book
250
251 [[!inline pages="openpower/sv/remap" raw=yes ]]
252
253 # Forms
254
255 Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
256
257 Add the following to Book I, 1.6.1, SVI-Form
258
259 ```
260 |0 |6 |11 |16 |21 |23 |24|25|26 31|
261 | PO | SVG|rmm | SVd |ew |SVyx|mm|sk| XO |
262 ```
263
264 Add the following to Book I, 1.6.1, SVM-Form
265
266 ```
267 |0 |6 |11 |16 |21 |25 |26 |31 |
268 | PO | SVxd | SVyd | SVzd | SVrm |vf | XO |
269 ```
270
271 Add the following to Book I, 1.6.1, SVM2-Form
272
273 ```
274 |0 |6 |10 |11 |16 |21 |24|25 |26 |31 |
275 | PO | SVo |SVyx| rmm | SVd |XO |mm|sk | XO |
276 ```
277
278 Add the following to Book I, 1.6.1, SVRM-Form
279
280 ```
281 |0 |6 |11 |13 |15 |17 |19 |21 |22 |26 |31 |
282 | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |/// | XO |
283 ```
284
285 Add the following to Book I, 1.6.2
286
287 ```
288 mi0 (11:12)
289 Field used in REMAP to select the SVSHAPE for 1st input register
290 Formats: SVRM
291 mi1 (13:14)
292 Field used in REMAP to select the SVSHAPE for 2nd input register
293 Formats: SVRM
294 mi2 (15:16)
295 Field used in REMAP to select the SVSHAPE for 3rd input register
296 Formats: SVRM
297 mm (24)
298 Field used to specify the meaning of the rmm field for SVI-Form
299 and SVM2-Form
300 Formats: SVI, SVM2
301 mo0 (17:18)
302 Field used in REMAP to select the SVSHAPE for 1st output register
303 Formats: SVRM
304 mo1 (19:20)
305 Field used in REMAP to select the SVSHAPE for 2nd output register
306 Formats: SVRM
307 pst (21)
308 Field used in REMAP to indicate "persistence" mode (REMAP
309 continues to apply to multiple instructions)
310 Formats: SVRM
311 rmm (11:15)
312 REMAP Mode field for SVI-Form and SVM2-Form
313 Formats: SVI, SVM2
314 sk (25)
315 Field used to specify dimensional skipping in svindex
316 Formats: SVI, SVM2
317 SVd (16:20)
318 Immediate field used to specify the size of the REMAP dimension
319 in the svindex and svshape2 instructions
320 Formats: SVI, SVM2
321 SVDS (16:29)
322 Immediate field used to specify a 9-bit signed
323 two's complement integer which is concatenated
324 on the right with 0b00 and sign-extended to 64 bits.
325 Formats: SVDS
326 SVG (6:10)
327 Field used to specify a GPR to be used as a
328 source for indexing.
329 Formats: SVI
330 SVi (16:22)
331 Simple-V immediate field for setting VL or MVL
332 Formats: SVL
333 SVme (6:10)
334 Simple-V "REMAP" map-enable bits (0-4)
335 Formats: SVRM
336 SVo (6:9)
337 Field used by the svshape2 instruction as an offset
338 Formats: SVM2
339 SVrm (21:24)
340 Simple-V "REMAP" Mode
341 Formats: SVM
342 SVxd (6:10)
343 Simple-V "REMAP" x-dimension size
344 Formats: SVM
345 SVyd (11:15)
346 Simple-V "REMAP" y-dimension size
347 Formats: SVM
348 SVzd (16:20)
349 Simple-V "REMAP" z-dimension size
350 Formats: SVM
351 XO (21:23,26:31)
352 Extended opcode field. Note that bit 21 must be 1, 22 and 23
353 must be zero, and bits 26-31 must be exactly the same as
354 used for svshape.
355 Formats: SVM2
356 ```
357
358 # Appendices
359
360 Appendix E Power ISA sorted by opcode
361 Appendix F Power ISA sorted by version
362 Appendix G Power ISA sorted by Compliancy Subset
363 Appendix H Power ISA sorted by mnemonic
364
365 | Form | Book | Page | Version | mnemonic | Description |
366 |------|------|------|---------|----------|-------------|
367 | SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
368 | SVM | I | # | 3.0B | svshape | REMAP shape instruction |
369 | SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
370 | SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |
371
372 [[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
373
374 ## REMAP pseudocode
375
376 Written in python3 the following stand-alone executable source code is the Canonical
377 Specification for each REMAP. Vectors of "loopends" are returned when Rc=1
378 in Vectors of CR Fields on `sv.svstep.`, or in Vertical-First Mode
379 a single CR Field (CR0) on `svstep.`. The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
380 offset is put through each algorithm to determine the actual Element Offset.
381 Alternative implementations producing different ordering
382 is prohibited as software will be critically relying on these Deterministic Schedules.
383
384 ### REMAP 2D/3D Matrix
385
386 The following stand-alone executable source code is the Canonical
387 Specification for Matrix (2D/3D) REMAP.
388 Hardware implementations are achievable with simple cascading counter-and-compares.
389
390 ```
391 # python "yield" can be iterated. use this to make it clear how
392 # the indices are generated by using natural-looking nested loops
393 def iterate_indices(SVSHAPE):
394 # get indices to iterate over, in the required order
395 xd = SVSHAPE.lims[0]
396 yd = SVSHAPE.lims[1]
397 zd = SVSHAPE.lims[2]
398 # create lists of indices to iterate over in each dimension
399 x_r = list(range(xd))
400 y_r = list(range(yd))
401 z_r = list(range(zd))
402 # invert the indices if needed
403 if SVSHAPE.invxyz[0]: x_r.reverse()
404 if SVSHAPE.invxyz[1]: y_r.reverse()
405 if SVSHAPE.invxyz[2]: z_r.reverse()
406 # start an infinite (wrapping) loop
407 step = 0 # track src/dst step
408 while True:
409 for z in z_r: # loop over 1st order dimension
410 z_end = z == z_r[-1]
411 for y in y_r: # loop over 2nd order dimension
412 y_end = y == y_r[-1]
413 for x in x_r: # loop over 3rd order dimension
414 x_end = x == x_r[-1]
415 # ok work out which order to construct things in.
416 # start by creating a list of tuples of the dimension
417 # and its limit
418 vals = [(SVSHAPE.lims[0], x, "x"),
419 (SVSHAPE.lims[1], y, "y"),
420 (SVSHAPE.lims[2], z, "z")
421 ]
422 # now select those by order. this allows us to
423 # create schedules for [z][x], [x][y], or [y][z]
424 # for matrix multiply.
425 vals = [vals[SVSHAPE.order[0]],
426 vals[SVSHAPE.order[1]],
427 vals[SVSHAPE.order[2]]
428 ]
429 # ok now we can construct the result, using bits of
430 # "order" to say which ones get stacked on
431 result = 0
432 mult = 1
433 for i in range(3):
434 lim, idx, dbg = vals[i]
435 # some of the dimensions can be "skipped". the order
436 # was actually selected above on all 3 dimensions,
437 # e.g. [z][x][y] or [y][z][x]. "skip" allows one of
438 # those to be knocked out
439 if SVSHAPE.skip == i+1: continue
440 idx *= mult # shifts up by previous dimension(s)
441 result += idx # adds on this dimension
442 mult *= lim # for the next dimension
443
444 loopends = (x_end |
445 ((y_end and x_end)<<1) |
446 ((y_end and x_end and z_end)<<2))
447
448 yield result + SVSHAPE.offset, loopends
449 step += 1
450
451 def demo():
452 # set the dimension sizes here
453 xdim = 3
454 ydim = 2
455 zdim = 4
456
457 # set total (can repeat, e.g. VL=x*y*z*4)
458 VL = xdim * ydim * zdim
459
460 # set up an SVSHAPE
461 class SVSHAPE:
462 pass
463 SVSHAPE0 = SVSHAPE()
464 SVSHAPE0.lims = [xdim, ydim, zdim]
465 SVSHAPE0.order = [1,0,2] # experiment with different permutations, here
466 SVSHAPE0.mode = 0b00
467 SVSHAPE0.skip = 0b00
468 SVSHAPE0.offset = 0 # experiment with different offset, here
469 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
470
471 # enumerate over the iterator function, getting new indices
472 for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
473 if idx >= VL:
474 break
475 print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
476
477 # run the demo
478 if __name__ == '__main__':
479 demo()
480 ```
481
482 ### REMAP Parallel Reduction pseudocode
483
484 The python3 program below is stand-alone executable and is the Canonical Specification
485 for Parallel Reduction REMAP.
486 The Algorithm below is not limited to RADIX2 sizes, and Predicate
487 sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
488 has been applied, not before. MV operations are not required: the algorithm
489 tracks positions of elements that would normally be moved and when applying
490 an Element Reduction Operation sources the operands from their last-known (tracked)
491 position.
492
493 ```
494 # a "yield" version of the Parallel Reduction REMAP algorithm.
495 # the algorithm is in-place. it does not perform "MV" operations.
496 # instead, where a masked-out value *should* be read from is tracked
497
498 def iterate_indices(SVSHAPE, pred=None):
499 # get indices to iterate over, in the required order
500 xd = SVSHAPE.lims[0]
501 # create lists of indices to iterate over in each dimension
502 ix = list(range(xd))
503 # invert the indices if needed
504 if SVSHAPE.invxyz[0]: ix.reverse()
505 # start a loop from the lowest step
506 step = 1
507 steps = []
508 while step < xd:
509 step *= 2
510 steps.append(step)
511 # invert the indices if needed
512 if SVSHAPE.invxyz[1]: steps.reverse()
513 for step in steps:
514 stepend = (step == steps[-1]) # note end of steps
515 idxs = list(range(0, xd, step))
516 results = []
517 for i in idxs:
518 other = i + step // 2
519 ci = ix[i]
520 oi = ix[other] if other < xd else None
521 other_pred = other < xd and (pred is None or pred[oi])
522 if (pred is None or pred[ci]) and other_pred:
523 if SVSHAPE.skip == 0b00: # submode 00
524 result = ci
525 elif SVSHAPE.skip == 0b01: # submode 01
526 result = oi
527 results.append([result + SVSHAPE.offset, 0])
528 elif other_pred:
529 ix[i] = oi
530 if results:
531 results[-1][1] = (stepend<<1) | 1 # notify end of loops
532 yield from results
533
534 def demo():
535 # set the dimension sizes here
536 xdim = 9
537
538 # set up an SVSHAPE
539 class SVSHAPE:
540 pass
541 SVSHAPE0 = SVSHAPE()
542 SVSHAPE0.lims = [xdim, 0, 0]
543 SVSHAPE0.order = [0,1,2]
544 SVSHAPE0.mode = 0b10
545 SVSHAPE0.skip = 0b00
546 SVSHAPE0.offset = 0 # experiment with different offset, here
547 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
548
549 SVSHAPE1 = SVSHAPE()
550 SVSHAPE1.lims = [xdim, 0, 0]
551 SVSHAPE1.order = [0,1,2]
552 SVSHAPE1.mode = 0b10
553 SVSHAPE1.skip = 0b01
554 SVSHAPE1.offset = 0 # experiment with different offset, here
555 SVSHAPE1.invxyz = [0,0,0] # inversion if desired
556
557 # enumerate over the iterator function, getting new indices
558 shapes = list(iterate_indices(SVSHAPE0)), \
559 list(iterate_indices(SVSHAPE1))
560 for idx in range(len(shapes[0])):
561 l = shapes[0][idx]
562 r = shapes[1][idx]
563 (l_idx, lend) = l
564 (r_idx, rend) = r
565 print ("%d->%d:%d" % (idx, l_idx, r_idx),
566 "end", bin(lend)[2:], bin(rend)[2:])
567
568 # run the demo
569 if __name__ == '__main__':
570 demo()
571 ```
572
573 ### REMAP FFT pseudocode
574
575 The FFT REMAP is RADIX2 only.
576
577 ```
578 # a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
579 # original code for the FFT Tukey-Cooley schedule:
580 # https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
581 """
582 # Radix-2 decimation-in-time FFT (real, not complex)
583 size = 2
584 while size <= n:
585 halfsize = size // 2
586 tablestep = n // size
587 for i in range(0, n, size):
588 k = 0
589 for j in range(i, i + halfsize):
590 jh = j+halfsize
591 jl = j
592 temp1 = vec[jh] * exptable[k]
593 temp2 = vec[jl]
594 vec[jh] = temp2 - temp1
595 vec[jl] = temp2 + temp1
596 k += tablestep
597 size *= 2
598 """
599
600 # python "yield" can be iterated. use this to make it clear how
601 # the indices are generated by using natural-looking nested loops
602 def iterate_butterfly_indices(SVSHAPE):
603 # get indices to iterate over, in the required order
604 n = SVSHAPE.lims[0]
605 stride = SVSHAPE.lims[2] # stride-multiplier on reg access
606 # creating lists of indices to iterate over in each dimension
607 # has to be done dynamically, because it depends on the size
608 # first, the size-based loop (which can be done statically)
609 x_r = []
610 size = 2
611 while size <= n:
612 x_r.append(size)
613 size *= 2
614 # invert order if requested
615 if SVSHAPE.invxyz[0]: x_r.reverse()
616
617 if len(x_r) == 0:
618 return
619
620 # start an infinite (wrapping) loop
621 skip = 0
622 while True:
623 for size in x_r: # loop over 3rd order dimension (size)
624 x_end = size == x_r[-1]
625 # y_r schedule depends on size
626 halfsize = size // 2
627 tablestep = n // size
628 y_r = []
629 for i in range(0, n, size):
630 y_r.append(i)
631 # invert if requested
632 if SVSHAPE.invxyz[1]: y_r.reverse()
633 for i in y_r: # loop over 2nd order dimension
634 y_end = i == y_r[-1]
635 k_r = []
636 j_r = []
637 k = 0
638 for j in range(i, i+halfsize):
639 k_r.append(k)
640 j_r.append(j)
641 k += tablestep
642 # invert if requested
643 if SVSHAPE.invxyz[2]: k_r.reverse()
644 if SVSHAPE.invxyz[2]: j_r.reverse()
645 for j, k in zip(j_r, k_r): # loop over 1st order dimension
646 z_end = j == j_r[-1]
647 # now depending on MODE return the index
648 if SVSHAPE.skip == 0b00:
649 result = j # for vec[j]
650 elif SVSHAPE.skip == 0b01:
651 result = j + halfsize # for vec[j+halfsize]
652 elif SVSHAPE.skip == 0b10:
653 result = k # for exptable[k]
654
655 loopends = (z_end |
656 ((y_end and z_end)<<1) |
657 ((y_end and x_end and z_end)<<2))
658
659 yield (result * stride) + SVSHAPE.offset, loopends
660
661 def demo():
662 # set the dimension sizes here
663 xdim = 8
664 ydim = 0 # not needed
665 zdim = 1 # stride must be set to 1
666
667 # set total. err don't know how to calculate how many there are...
668 # do it manually for now
669 VL = 0
670 size = 2
671 n = xdim
672 while size <= n:
673 halfsize = size // 2
674 tablestep = n // size
675 for i in range(0, n, size):
676 for j in range(i, i + halfsize):
677 VL += 1
678 size *= 2
679
680 # set up an SVSHAPE
681 class SVSHAPE:
682 pass
683 # j schedule
684 SVSHAPE0 = SVSHAPE()
685 SVSHAPE0.lims = [xdim, ydim, zdim]
686 SVSHAPE0.order = [0,1,2] # experiment with different permutations, here
687 SVSHAPE0.mode = 0b01
688 SVSHAPE0.skip = 0b00
689 SVSHAPE0.offset = 0 # experiment with different offset, here
690 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
691 # j+halfstep schedule
692 SVSHAPE1 = SVSHAPE()
693 SVSHAPE1.lims = [xdim, ydim, zdim]
694 SVSHAPE1.order = [0,1,2] # experiment with different permutations, here
695 SVSHAPE0.mode = 0b01
696 SVSHAPE1.skip = 0b01
697 SVSHAPE1.offset = 0 # experiment with different offset, here
698 SVSHAPE1.invxyz = [0,0,0] # inversion if desired
699 # k schedule
700 SVSHAPE2 = SVSHAPE()
701 SVSHAPE2.lims = [xdim, ydim, zdim]
702 SVSHAPE2.order = [0,1,2] # experiment with different permutations, here
703 SVSHAPE0.mode = 0b01
704 SVSHAPE2.skip = 0b10
705 SVSHAPE2.offset = 0 # experiment with different offset, here
706 SVSHAPE2.invxyz = [0,0,0] # inversion if desired
707
708 # enumerate over the iterator function, getting new indices
709 schedule = []
710 for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
711 iterate_butterfly_indices(SVSHAPE1),
712 iterate_butterfly_indices(SVSHAPE2))):
713 if idx >= VL:
714 break
715 schedule.append((jl, jh, k))
716
717 # ok now pretty-print the results, with some debug output
718 size = 2
719 idx = 0
720 while size <= n:
721 halfsize = size // 2
722 tablestep = n // size
723 print ("size %d halfsize %d tablestep %d" % \
724 (size, halfsize, tablestep))
725 for i in range(0, n, size):
726 prefix = "i %d\t" % i
727 k = 0
728 for j in range(i, i + halfsize):
729 (jl, je), (jh, he), (ks, ke) = schedule[idx]
730 print (" %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
731 "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
732 (idx, prefix, j, j+halfsize, k,
733 jl, jh, ks,
734 ),
735 "end", bin(je)[2:], bin(je)[2:], bin(ke)[2:])
736 k += tablestep
737 idx += 1
738 size *= 2
739
740 # run the demo
741 if __name__ == '__main__':
742 demo()
743 ```
744
745 ### DCT REMAP
746
747 DCT REMAP is RADIX2 only. Convolutions may be applied as usual
748 to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
749 instructions the algorithm below (written in python3), becomes part
750 of an in-place in-registers Vectorised DCT. The algorithms work
751 by loading data such that as the nested loops progress the result
752 is sorted into correct sequential order.
753
754 ```
755 # DCT "REMAP" scheduler to create an in-place iterative DCT.
756 #
757
758 # bits of the integer 'val' of width 'width' are reversed
759 def reverse_bits(val, width):
760 result = 0
761 for _ in range(width):
762 result = (result << 1) | (val & 1)
763 val >>= 1
764 return result
765
766
767 # iterative version of [recursively-applied] half-reversing
768 # turns out this is Gray-Encoding.
769 def halfrev2(vec, pre_rev=True):
770 res = []
771 for i in range(len(vec)):
772 if pre_rev:
773 res.append(vec[i ^ (i>>1)])
774 else:
775 ri = i
776 bl = i.bit_length()
777 for ji in range(1, bl):
778 ri ^= (i >> ji)
779 res.append(vec[ri])
780 return res
781
782
783 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
784 # get indices to iterate over, in the required order
785 n = SVSHAPE.lims[0]
786 mode = SVSHAPE.lims[1]
787 stride = SVSHAPE.lims[2]
788
789 # reference list for not needing to do data-swaps, just swap what
790 # *indices* are referenced (two levels of indirection at the moment)
791 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
792 ji = list(range(n))
793
794 levels = n.bit_length() - 1
795 ri = [reverse_bits(i, levels) for i in range(n)]
796
797 if SVSHAPE.mode == 0b01: # FFT, bitrev only
798 ji = [ji[ri[i]] for i in range(n)]
799 elif SVSHAPE.submode2 == 0b001:
800 ji = [ji[ri[i]] for i in range(n)]
801 ji = halfrev2(ji, True)
802 else:
803 ji = halfrev2(ji, False)
804 ji = [ji[ri[i]] for i in range(n)]
805
806 # invert order if requested
807 if SVSHAPE.invxyz[0]:
808 ji.reverse()
809
810 for i, jl in enumerate(ji):
811 y_end = jl == ji[-1]
812 yield jl * stride, (0b111 if y_end else 0b000)
813
814 def iterate_dct_inner_costable_indices(SVSHAPE):
815 # get indices to iterate over, in the required order
816 n = SVSHAPE.lims[0]
817 mode = SVSHAPE.lims[1]
818 stride = SVSHAPE.lims[2]
819 # creating lists of indices to iterate over in each dimension
820 # has to be done dynamically, because it depends on the size
821 # first, the size-based loop (which can be done statically)
822 x_r = []
823 size = 2
824 while size <= n:
825 x_r.append(size)
826 size *= 2
827 # invert order if requested
828 if SVSHAPE.invxyz[0]:
829 x_r.reverse()
830
831 if len(x_r) == 0:
832 return
833
834 # start an infinite (wrapping) loop
835 skip = 0
836 z_end = 1 # doesn't exist in this, only 2 loops
837 k = 0
838 while True:
839 for size in x_r: # loop over 3rd order dimension (size)
840 x_end = size == x_r[-1]
841 # y_r schedule depends on size
842 halfsize = size // 2
843 y_r = []
844 for i in range(0, n, size):
845 y_r.append(i)
846 # invert if requested
847 if SVSHAPE.invxyz[1]: y_r.reverse()
848 # two lists of half-range indices, e.g. j 0123, jr 7654
849 j = list(range(0, halfsize))
850 # invert if requested
851 if SVSHAPE.invxyz[2]: j_r.reverse()
852 # loop over 1st order dimension
853 for ci, jl in enumerate(j):
854 y_end = jl == j[-1]
855 # now depending on MODE return the index. inner butterfly
856 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
857 result = k # offset into COS table
858 elif SVSHAPE.skip == 0b10: #
859 result = ci # coefficient helper
860 elif SVSHAPE.skip == 0b11: #
861 result = size # coefficient helper
862 loopends = (z_end |
863 ((y_end and z_end)<<1) |
864 ((y_end and x_end and z_end)<<2))
865
866 yield (result * stride) + SVSHAPE.offset, loopends
867 k += 1
868
869 def iterate_dct_inner_butterfly_indices(SVSHAPE):
870 # get indices to iterate over, in the required order
871 n = SVSHAPE.lims[0]
872 mode = SVSHAPE.lims[1]
873 stride = SVSHAPE.lims[2]
874 # creating lists of indices to iterate over in each dimension
875 # has to be done dynamically, because it depends on the size
876 # first, the size-based loop (which can be done statically)
877 x_r = []
878 size = 2
879 while size <= n:
880 x_r.append(size)
881 size *= 2
882 # invert order if requested
883 if SVSHAPE.invxyz[0]:
884 x_r.reverse()
885
886 if len(x_r) == 0:
887 return
888
889 # reference (read/write) the in-place data in *reverse-bit-order*
890 ri = list(range(n))
891 if SVSHAPE.submode2 == 0b01:
892 levels = n.bit_length() - 1
893 ri = [ri[reverse_bits(i, levels)] for i in range(n)]
894
895 # reference list for not needing to do data-swaps, just swap what
896 # *indices* are referenced (two levels of indirection at the moment)
897 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
898 ji = list(range(n))
899 inplace_mode = True
900 if inplace_mode and SVSHAPE.submode2 == 0b01:
901 ji = halfrev2(ji, True)
902 if inplace_mode and SVSHAPE.submode2 == 0b11:
903 ji = halfrev2(ji, False)
904
905 # start an infinite (wrapping) loop
906 while True:
907 k = 0
908 k_start = 0
909 for size in x_r: # loop over 3rd order dimension (size)
910 x_end = size == x_r[-1]
911 # y_r schedule depends on size
912 halfsize = size // 2
913 y_r = []
914 for i in range(0, n, size):
915 y_r.append(i)
916 # invert if requested
917 if SVSHAPE.invxyz[1]: y_r.reverse()
918 for i in y_r: # loop over 2nd order dimension
919 y_end = i == y_r[-1]
920 # two lists of half-range indices, e.g. j 0123, jr 7654
921 j = list(range(i, i + halfsize))
922 jr = list(range(i+halfsize, i + size))
923 jr.reverse()
924 # invert if requested
925 if SVSHAPE.invxyz[2]:
926 j.reverse()
927 jr.reverse()
928 hz2 = halfsize // 2 # zero stops reversing 1-item lists
929 # loop over 1st order dimension
930 k = k_start
931 for ci, (jl, jh) in enumerate(zip(j, jr)):
932 z_end = jl == j[-1]
933 # now depending on MODE return the index. inner butterfly
934 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
935 if SVSHAPE.submode2 == 0b11: # iDCT
936 result = ji[ri[jl]] # lower half
937 else:
938 result = ri[ji[jl]] # lower half
939 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
940 if SVSHAPE.submode2 == 0b11: # iDCT
941 result = ji[ri[jl+halfsize]] # upper half
942 else:
943 result = ri[ji[jh]] # upper half
944 elif mode == 4:
945 # COS table pre-generated mode
946 if SVSHAPE.skip == 0b10: #
947 result = k # cos table offset
948 else: # mode 2
949 # COS table generated on-demand ("Vertical-First") mode
950 if SVSHAPE.skip == 0b10: #
951 result = ci # coefficient helper
952 elif SVSHAPE.skip == 0b11: #
953 result = size # coefficient helper
954 loopends = (z_end |
955 ((y_end and z_end)<<1) |
956 ((y_end and x_end and z_end)<<2))
957
958 yield (result * stride) + SVSHAPE.offset, loopends
959 k += 1
960
961 # now in-place swap
962 if inplace_mode:
963 for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
964 jlh = jl+halfsize
965 tmp1, tmp2 = ji[jlh], ji[jh]
966 ji[jlh], ji[jh] = tmp2, tmp1
967
968 # new k_start point for cos tables( runs inside x_r loop NOT i loop)
969 k_start += halfsize
970
971
972 # python "yield" can be iterated. use this to make it clear how
973 # the indices are generated by using natural-looking nested loops
974 def iterate_dct_outer_butterfly_indices(SVSHAPE):
975 # get indices to iterate over, in the required order
976 n = SVSHAPE.lims[0]
977 mode = SVSHAPE.lims[1]
978 stride = SVSHAPE.lims[2]
979 # creating lists of indices to iterate over in each dimension
980 # has to be done dynamically, because it depends on the size
981 # first, the size-based loop (which can be done statically)
982 x_r = []
983 size = n // 2
984 while size >= 2:
985 x_r.append(size)
986 size //= 2
987 # invert order if requested
988 if SVSHAPE.invxyz[0]:
989 x_r.reverse()
990
991 if len(x_r) == 0:
992 return
993
994 # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
995 ri = list(range(n))
996 if SVSHAPE.submode2 in [0b11, 0b01]:
997 levels = n.bit_length() - 1
998 ri = [ri[reverse_bits(i, levels)] for i in range(n)]
999
1000 # reference list for not needing to do data-swaps, just swap what
1001 # *indices* are referenced (two levels of indirection at the moment)
1002 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
1003 ji = list(range(n))
1004 inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
1005 if SVSHAPE.submode2 == 0b11:
1006 ji = halfrev2(ji, False)
1007
1008 # start an infinite (wrapping) loop
1009 while True:
1010 k = 0
1011 k_start = 0
1012 for size in x_r: # loop over 3rd order dimension (size)
1013 halfsize = size//2
1014 x_end = size == x_r[-1]
1015 y_r = list(range(0, halfsize))
1016 # invert if requested
1017 if SVSHAPE.invxyz[1]: y_r.reverse()
1018 for i in y_r: # loop over 2nd order dimension
1019 y_end = i == y_r[-1]
1020 # one list to create iterative-sum schedule
1021 jr = list(range(i+halfsize, i+n-halfsize, size))
1022 # invert if requested
1023 if SVSHAPE.invxyz[2]: jr.reverse()
1024 hz2 = halfsize // 2 # zero stops reversing 1-item lists
1025 k = k_start
1026 for ci, jh in enumerate(jr): # loop over 1st order dimension
1027 z_end = jh == jr[-1]
1028 if mode == 4:
1029 # COS table pre-generated mode
1030 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
1031 if SVSHAPE.submode2 == 0b11: # iDCT
1032 result = ji[ri[jh]] # upper half
1033 else:
1034 result = ri[ji[jh]] # lower half
1035 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
1036 if SVSHAPE.submode2 == 0b11: # iDCT
1037 result = ji[ri[jh+size]] # upper half
1038 else:
1039 result = ri[ji[jh+size]] # upper half
1040 elif SVSHAPE.skip == 0b10: #
1041 result = k # cos table offset
1042 else:
1043 # COS table generated on-demand ("Vertical-First") mode
1044 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
1045 if SVSHAPE.submode2 == 0b11: # iDCT
1046 result = ji[ri[jh]] # lower half
1047 else:
1048 result = ri[ji[jh]] # lower half
1049 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
1050 if SVSHAPE.submode2 == 0b11: # iDCT
1051 result = ji[ri[jh+size]] # upper half
1052 else:
1053 result = ri[ji[jh+size]] # upper half
1054 elif SVSHAPE.skip == 0b10: #
1055 result = ci # coefficient helper
1056 elif SVSHAPE.skip == 0b11: #
1057 result = size # coefficient helper
1058 loopends = (z_end |
1059 ((y_end and z_end)<<1) |
1060 ((y_end and x_end and z_end)<<2))
1061
1062 yield (result * stride) + SVSHAPE.offset, loopends
1063 k += 1
1064
1065 # new k_start point for cos tables( runs inside x_r loop NOT i loop)
1066 k_start += halfsize
1067
1068 ```
1069
1070 ## REMAP selector
1071
1072 Selecting which REMAP Schedule to use is shown by the pseudocode below.
1073 Each SVSHAPE 0-3 goes through this selection process.
1074
1075 ```
1076 if SVSHAPEn.mode == 0b00: iterate_fn = iterate_indices
1077 if SVSHAPEn.mode == 0b10: iterate_fn = iterate_preduce_indices
1078 if SVSHAPEn.mode in [0b01, 0b11]:
1079 # further sub-selection
1080 if SVSHAPEn.ydimsz == 1: iterate_fn = iterate_butterfly_indices
1081 if SVSHAPEn.ydimsz == 2: iterate_fn = iterate_dct_inner_butterfly_indices
1082 if SVSHAPEn.ydimsz == 3: iterate_fn = iterate_dct_outer_butterfly_indices
1083 if SVSHAPEn.ydimsz in [5, 13]: iterate_fn = iterate_dct_inner_costable_indices
1084 if SVSHAPEn.ydimsz in [6, 14, 15]: iterate_fn = iterate_dct_inner_halfswap_loadstore
1085 ```
1086
1087
1088 [[!tag opf_rfc]]