1 # RFC ls009 Simple-V REMAP Subsystem
5 * <https://libre-soc.org/openpower/sv/rfc/ls009/>
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
7 * <https://git.openpower.foundation/isa/PowerISA/issues/124>
13 **Date**: 26 Mar 2023. v2 TODO
19 **Books and Section affected**:
22 Book I, new Zero-Overhead-Loop Chapter.
23 Appendix E Power ISA sorted by opcode
24 Appendix F Power ISA sorted by version
25 Appendix G Power ISA sorted by Compliancy Subset
26 Appendix H Power ISA sorted by mnemonic
32 svremap - Re-Mapping of Register Element Offsets
33 svindex - General-purpose setting of SHAPEs to be re-mapped
34 svshape - Hardware-level setting of SHAPEs for element re-mapping
35 svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
38 **Submitter**: Luke Leighton (Libre-SOC)
40 **Requester**: Libre-SOC
42 **Impact on processor**:
45 Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
46 Management Instructions which provide advanced features such as Matrix
47 FFT DCT Hardware-Assist Schedules and general-purpose Index reordering.
50 **Impact on software**:
53 Requires support for new instructions in assembler, debuggers,
60 Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
61 Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
62 Digital Signal Processing (DSP)
67 These REMAP Management instructions provide state-of-the-art advanced capabilities
68 to dramatically decrease instruction count and power reduction whilst retaining
69 unprecedented general-purpose capability and a standard Sequential Execution Model.
71 **Notes and Observations**:
73 1. Although costly the alternatives in SIMD-paradigm software result in
74 huge algorithmic complexity and associated power consumption increases.
75 Loop-unrolling compilers are prevalent as is thousands to tens of
76 thousands of instructions.
77 2. Core inner kernels of Matrix DCT DFT FFT NTT are dramatically reduced
78 to a handful of instructions.
79 3. No REMAP instructions with the exception of Indexed rely on registers
80 for the establishment of REMAP capability.
81 4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
82 and flexibility of REMAP.
83 5. The Simple-V Compliancy Subsets make REMAP optional except in the
84 Advanced Levels. Like v3.1 MMA it is **not** necessary for **all**
85 hardware to implement REMAP.
89 Add the following entries to:
91 * the Appendices of SV Book
92 * Instructions of SV Book as a new Section
93 * SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
99 # Rationale and background
101 In certain common algorithms there are clear patterns of behaviour which
102 typically require inline loop-unrolled instructions comprising interleaved
103 permute and arithmetic operations. REMAP was invented to split the permuting
104 from the arithmetic, and to allow the Indexing to be done as a hardware Schedule.
110 GPR[RT+i] <= GPR[RA+i] +
114 A Hardware-assisted REMAP Vector Add:
118 GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] +
122 The result is a huge saving on register file accesses (no need to calculate Indices
123 then use Permutation instructions), instruction count (Matrix Multiply up to 127 FMACs
124 is 3 instructions), and programmer sanity.
128 This section summarises the motivation for each REMAP Schedule
129 and briefly goes over their characteristics and limitations.
130 Further details on the Deterministic Precise-Interruptible algorithms
131 used in these Schedules is found in the [[sv/remap/appendix]].
133 ## Matrix (1D/2D/3D shaping)
135 Matrix Multiplication is a huge part of High-Performance Compute,
137 In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
138 Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
139 cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling.
140 Aside from the cost of the load on the L1 I-Cache, the trick only
141 works if one of the dimensions X or Y are power-two. Prime Numbers
142 (5x7, 3x5) become deeply problematic to unroll.
144 Even traditional Scalable Vector ISAs have issues with Matrices, often
145 having to perform data Transpose by pushing out through Memory and back
147 or computing Transposition Indices (costly) then copying to another
150 Matrix REMAP was thus designed to solve these issues by providing Hardware
152 "Schedules" that can view what would otherwise be limited to a strictly
153 linear Vector as instead being 2D (even 3D) *in-place* reordered.
154 With both Transposition and non-power-two being supported the issues
155 faced by other ISAs are mitigated.
157 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
158 restricted to 127: up to 127 FMAs (or other operation)
159 may be performed in total.
160 Also given that it is in-registers only at present some care has to be
161 taken on regfile resource utilisation. However it is perfectly possible
162 to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
163 the usual 6-level "Tiled" large Matrix Multiply, without the usual
164 difficulties associated with SIMD.
166 Also the `svshape` instruction only provides access to part of the
167 Matrix REMAP capability. Rotation and mirroring need to be done by
168 programming the SVSHAPE SPRs directly, which can take a lot more
169 instructions. Future versions of SVP64 will include EXT1xx prefixed
170 variants (`psvshape`) which provide more comprehensive capacity and
171 mitigate the need to write direct to the SVSHAPE SPRs.
173 ## FFT/DCT Triple Loop
175 DCT and FFT are some of the most astonishingly used algorithms in
176 Computer Science. Radar, Audio, Video, R.F. Baseband and dozens more. At least
177 two DSPs, TMS320 and Hexagon, have VLIW instructions specially tailored
180 An in-depth analysis showed that it is possible to do in-place in-register
181 DCT and FFT as long as twin-result "butterfly" instructions are provided.
182 These can be found in the [[openpower/isa/svfparith]] page if performing
183 IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
184 integer operations would be required)*. These "butterfly" instructions
185 avoid the need for a temporary register because the two array positions
186 being overwritten will be "in-flight" in any In-Order or Out-of-Order
189 DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
190 accept predicate masks. Given that it is common to perform recursive
191 convolutions combining smaller Power-2 DCT/FFT to create larger DCT/FFTs
192 in practice the RADIX2 limit is not a problem. A Bluestein convolution
193 to compute arbitrary length is demonstrated by
194 [Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
198 The purpose of Indexing is to provide a generalised version of
199 Vector ISA "Permute" instructions, such as VSX `vperm`. The
200 Indexing is abstracted out and may be applied to much more
201 than an element move/copy, and is not limited for example
202 to the number of bytes that can fit into a VSX register.
203 Indexing may be applied to LD/ST (even on Indexed LD/ST
204 instructions such as `sv.lbzx`), arithmetic operations,
205 extsw: there is no artificial limit.
207 The original motivation for Indexed REMAP was to mitigate the need to add
208 an expensive `mv.x` to the Scalar ISA, which was highly likely to be rejected as
209 a stand-alone instruction
210 (`GPR(RT) <- GPR(GPR(RA))`). Usually a Vector ISA would add a non-conflicting
211 variant (as in VSX `vperm`) but it is common to need to permute by source,
212 with the risk of conflict, that has to be resolved, for example, in AVX-512
215 Indexed REMAP is the "Get out of Jail Free" card which (for a price)
216 allows any general-purpose arbitrary Element Permutation not covered by
217 the other Hardware Schedules.
219 ## Parallel Reduction
221 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
222 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
223 *appearance* and *effect* of Reduction. Parallel Reduction is not limited
224 to Power-of-two but is limited as usual by the total number of
225 element operations (127) as well as available register file size.
227 Parallel Reduction is normally done explicitly as a loop-unrolled operation,
228 taking up a significant number of instructions. With REMAP it is just three
229 instructions: two for setup and one Scalar Base.
231 ## Parallel Prefix Sum
233 This is a work-efficient Parallel Schedule that for example produces Trangular
234 or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
235 to Parallel Reduction. Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
236 end-result, implementations may only implement Mapreduce in serial form (or give
237 the appearance to Programmers of the same). The Parallel Prefix Schedule is
238 *required* to be implemented in such a way that its Deterministic Schedule may be
239 parallelised. Like the Reduction Schedule it is 100% Deterministic and consequently
240 may be used with non-commutative operations.
242 Parallel Reduction combined with Parallel Prefix Sum can be combined to help
243 perform AI non-linear interpolation in around twelve to fifteen instructions.
249 Add the following to SV Book
251 [[!inline pages="openpower/sv/remap" raw=yes ]]
255 Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
257 Add the following to Book I, 1.6.1, SVI-Form
260 |0 |6 |11 |16 |21 |23 |24|25|26 31|
261 | PO | SVG|rmm | SVd |ew |SVyx|mm|sk| XO |
264 Add the following to Book I, 1.6.1, SVM-Form
267 |0 |6 |11 |16 |21 |25 |26 |31 |
268 | PO | SVxd | SVyd | SVzd | SVrm |vf | XO |
271 Add the following to Book I, 1.6.1, SVM2-Form
274 |0 |6 |10 |11 |16 |21 |24|25 |26 |31 |
275 | PO | SVo |SVyx| rmm | SVd |XO |mm|sk | XO |
278 Add the following to Book I, 1.6.1, SVRM-Form
281 |0 |6 |11 |13 |15 |17 |19 |21 |22 |26 |31 |
282 | PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |/// | XO |
285 Add the following to Book I, 1.6.2
289 Field used in REMAP to select the SVSHAPE for 1st input register
292 Field used in REMAP to select the SVSHAPE for 2nd input register
295 Field used in REMAP to select the SVSHAPE for 3rd input register
298 Field used to specify the meaning of the rmm field for SVI-Form
302 Field used in REMAP to select the SVSHAPE for 1st output register
305 Field used in REMAP to select the SVSHAPE for 2nd output register
308 Field used in REMAP to indicate "persistence" mode (REMAP
309 continues to apply to multiple instructions)
312 REMAP Mode field for SVI-Form and SVM2-Form
315 Field used to specify dimensional skipping in svindex
318 Immediate field used to specify the size of the REMAP dimension
319 in the svindex and svshape2 instructions
322 Immediate field used to specify a 9-bit signed
323 two's complement integer which is concatenated
324 on the right with 0b00 and sign-extended to 64 bits.
327 Field used to specify a GPR to be used as a
331 Simple-V immediate field for setting VL or MVL
334 Simple-V "REMAP" map-enable bits (0-4)
337 Field used by the svshape2 instruction as an offset
340 Simple-V "REMAP" Mode
343 Simple-V "REMAP" x-dimension size
346 Simple-V "REMAP" y-dimension size
349 Simple-V "REMAP" z-dimension size
352 Extended opcode field. Note that bit 21 must be 1, 22 and 23
353 must be zero, and bits 26-31 must be exactly the same as
360 Appendix E Power ISA sorted by opcode
361 Appendix F Power ISA sorted by version
362 Appendix G Power ISA sorted by Compliancy Subset
363 Appendix H Power ISA sorted by mnemonic
365 | Form | Book | Page | Version | mnemonic | Description |
366 |------|------|------|---------|----------|-------------|
367 | SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
368 | SVM | I | # | 3.0B | svshape | REMAP shape instruction |
369 | SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
370 | SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |
372 [[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
376 Written in python3 the following stand-alone executable source code is the Canonical
377 Specification for each REMAP. Vectors of "loopends" are returned when Rc=1
378 in Vectors of CR Fields on `sv.svstep.`, or in Vertical-First Mode
379 a single CR Field (CR0) on `svstep.`. The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
380 offset is put through each algorithm to determine the actual Element Offset.
381 Alternative implementations producing different ordering
382 is prohibited as software will be critically relying on these Deterministic Schedules.
384 ### REMAP 2D/3D Matrix
386 The following stand-alone executable source code is the Canonical
387 Specification for Matrix (2D/3D) REMAP.
388 Hardware implementations are achievable with simple cascading counter-and-compares.
391 # python "yield" can be iterated. use this to make it clear how
392 # the indices are generated by using natural-looking nested loops
393 def iterate_indices(SVSHAPE):
394 # get indices to iterate over, in the required order
398 # create lists of indices to iterate over in each dimension
399 x_r = list(range(xd))
400 y_r = list(range(yd))
401 z_r = list(range(zd))
402 # invert the indices if needed
403 if SVSHAPE.invxyz[0]: x_r.reverse()
404 if SVSHAPE.invxyz[1]: y_r.reverse()
405 if SVSHAPE.invxyz[2]: z_r.reverse()
406 # start an infinite (wrapping) loop
407 step = 0 # track src/dst step
409 for z in z_r: # loop over 1st order dimension
411 for y in y_r: # loop over 2nd order dimension
413 for x in x_r: # loop over 3rd order dimension
415 # ok work out which order to construct things in.
416 # start by creating a list of tuples of the dimension
418 vals = [(SVSHAPE.lims[0], x, "x"),
419 (SVSHAPE.lims[1], y, "y"),
420 (SVSHAPE.lims[2], z, "z")
422 # now select those by order. this allows us to
423 # create schedules for [z][x], [x][y], or [y][z]
424 # for matrix multiply.
425 vals = [vals[SVSHAPE.order[0]],
426 vals[SVSHAPE.order[1]],
427 vals[SVSHAPE.order[2]]
429 # ok now we can construct the result, using bits of
430 # "order" to say which ones get stacked on
434 lim, idx, dbg = vals[i]
435 # some of the dimensions can be "skipped". the order
436 # was actually selected above on all 3 dimensions,
437 # e.g. [z][x][y] or [y][z][x]. "skip" allows one of
438 # those to be knocked out
439 if SVSHAPE.skip == i+1: continue
440 idx *= mult # shifts up by previous dimension(s)
441 result += idx # adds on this dimension
442 mult *= lim # for the next dimension
445 ((y_end and x_end)<<1) |
446 ((y_end and x_end and z_end)<<2))
448 yield result + SVSHAPE.offset, loopends
452 # set the dimension sizes here
457 # set total (can repeat, e.g. VL=x*y*z*4)
458 VL = xdim * ydim * zdim
464 SVSHAPE0.lims = [xdim, ydim, zdim]
465 SVSHAPE0.order = [1,0,2] # experiment with different permutations, here
468 SVSHAPE0.offset = 0 # experiment with different offset, here
469 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
471 # enumerate over the iterator function, getting new indices
472 for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
475 print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
478 if __name__ == '__main__':
482 ### REMAP Parallel Reduction pseudocode
484 The python3 program below is stand-alone executable and is the Canonical Specification
485 for Parallel Reduction REMAP.
486 The Algorithm below is not limited to RADIX2 sizes, and Predicate
487 sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
488 has been applied, not before. MV operations are not required: the algorithm
489 tracks positions of elements that would normally be moved and when applying
490 an Element Reduction Operation sources the operands from their last-known (tracked)
494 # a "yield" version of the Parallel Reduction REMAP algorithm.
495 # the algorithm is in-place. it does not perform "MV" operations.
496 # instead, where a masked-out value *should* be read from is tracked
498 def iterate_indices(SVSHAPE, pred=None):
499 # get indices to iterate over, in the required order
501 # create lists of indices to iterate over in each dimension
503 # invert the indices if needed
504 if SVSHAPE.invxyz[0]: ix.reverse()
505 # start a loop from the lowest step
511 # invert the indices if needed
512 if SVSHAPE.invxyz[1]: steps.reverse()
514 stepend = (step == steps[-1]) # note end of steps
515 idxs = list(range(0, xd, step))
518 other = i + step // 2
520 oi = ix[other] if other < xd else None
521 other_pred = other < xd and (pred is None or pred[oi])
522 if (pred is None or pred[ci]) and other_pred:
523 if SVSHAPE.skip == 0b00: # submode 00
525 elif SVSHAPE.skip == 0b01: # submode 01
527 results.append([result + SVSHAPE.offset, 0])
531 results[-1][1] = (stepend<<1) | 1 # notify end of loops
535 # set the dimension sizes here
542 SVSHAPE0.lims = [xdim, 0, 0]
543 SVSHAPE0.order = [0,1,2]
546 SVSHAPE0.offset = 0 # experiment with different offset, here
547 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
550 SVSHAPE1.lims = [xdim, 0, 0]
551 SVSHAPE1.order = [0,1,2]
554 SVSHAPE1.offset = 0 # experiment with different offset, here
555 SVSHAPE1.invxyz = [0,0,0] # inversion if desired
557 # enumerate over the iterator function, getting new indices
558 shapes = list(iterate_indices(SVSHAPE0)), \
559 list(iterate_indices(SVSHAPE1))
560 for idx in range(len(shapes[0])):
565 print ("%d->%d:%d" % (idx, l_idx, r_idx),
566 "end", bin(lend)[2:], bin(rend)[2:])
569 if __name__ == '__main__':
573 ### REMAP FFT pseudocode
575 The FFT REMAP is RADIX2 only.
578 # a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
579 # original code for the FFT Tukey-Cooley schedule:
580 # https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
582 # Radix-2 decimation-in-time FFT (real, not complex)
586 tablestep = n // size
587 for i in range(0, n, size):
589 for j in range(i, i + halfsize):
592 temp1 = vec[jh] * exptable[k]
594 vec[jh] = temp2 - temp1
595 vec[jl] = temp2 + temp1
600 # python "yield" can be iterated. use this to make it clear how
601 # the indices are generated by using natural-looking nested loops
602 def iterate_butterfly_indices(SVSHAPE):
603 # get indices to iterate over, in the required order
605 stride = SVSHAPE.lims[2] # stride-multiplier on reg access
606 # creating lists of indices to iterate over in each dimension
607 # has to be done dynamically, because it depends on the size
608 # first, the size-based loop (which can be done statically)
614 # invert order if requested
615 if SVSHAPE.invxyz[0]: x_r.reverse()
620 # start an infinite (wrapping) loop
623 for size in x_r: # loop over 3rd order dimension (size)
624 x_end = size == x_r[-1]
625 # y_r schedule depends on size
627 tablestep = n // size
629 for i in range(0, n, size):
631 # invert if requested
632 if SVSHAPE.invxyz[1]: y_r.reverse()
633 for i in y_r: # loop over 2nd order dimension
638 for j in range(i, i+halfsize):
642 # invert if requested
643 if SVSHAPE.invxyz[2]: k_r.reverse()
644 if SVSHAPE.invxyz[2]: j_r.reverse()
645 for j, k in zip(j_r, k_r): # loop over 1st order dimension
647 # now depending on MODE return the index
648 if SVSHAPE.skip == 0b00:
649 result = j # for vec[j]
650 elif SVSHAPE.skip == 0b01:
651 result = j + halfsize # for vec[j+halfsize]
652 elif SVSHAPE.skip == 0b10:
653 result = k # for exptable[k]
656 ((y_end and z_end)<<1) |
657 ((y_end and x_end and z_end)<<2))
659 yield (result * stride) + SVSHAPE.offset, loopends
662 # set the dimension sizes here
664 ydim = 0 # not needed
665 zdim = 1 # stride must be set to 1
667 # set total. err don't know how to calculate how many there are...
668 # do it manually for now
674 tablestep = n // size
675 for i in range(0, n, size):
676 for j in range(i, i + halfsize):
685 SVSHAPE0.lims = [xdim, ydim, zdim]
686 SVSHAPE0.order = [0,1,2] # experiment with different permutations, here
689 SVSHAPE0.offset = 0 # experiment with different offset, here
690 SVSHAPE0.invxyz = [0,0,0] # inversion if desired
691 # j+halfstep schedule
693 SVSHAPE1.lims = [xdim, ydim, zdim]
694 SVSHAPE1.order = [0,1,2] # experiment with different permutations, here
697 SVSHAPE1.offset = 0 # experiment with different offset, here
698 SVSHAPE1.invxyz = [0,0,0] # inversion if desired
701 SVSHAPE2.lims = [xdim, ydim, zdim]
702 SVSHAPE2.order = [0,1,2] # experiment with different permutations, here
705 SVSHAPE2.offset = 0 # experiment with different offset, here
706 SVSHAPE2.invxyz = [0,0,0] # inversion if desired
708 # enumerate over the iterator function, getting new indices
710 for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
711 iterate_butterfly_indices(SVSHAPE1),
712 iterate_butterfly_indices(SVSHAPE2))):
715 schedule.append((jl, jh, k))
717 # ok now pretty-print the results, with some debug output
722 tablestep = n // size
723 print ("size %d halfsize %d tablestep %d" % \
724 (size, halfsize, tablestep))
725 for i in range(0, n, size):
726 prefix = "i %d\t" % i
728 for j in range(i, i + halfsize):
729 (jl, je), (jh, he), (ks, ke) = schedule[idx]
730 print (" %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
731 "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
732 (idx, prefix, j, j+halfsize, k,
735 "end", bin(je)[2:], bin(je)[2:], bin(ke)[2:])
741 if __name__ == '__main__':
747 DCT REMAP is RADIX2 only. Convolutions may be applied as usual
748 to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
749 instructions the algorithm below (written in python3), becomes part
750 of an in-place in-registers Vectorised DCT. The algorithms work
751 by loading data such that as the nested loops progress the result
752 is sorted into correct sequential order.
755 # DCT "REMAP" scheduler to create an in-place iterative DCT.
758 # bits of the integer 'val' of width 'width' are reversed
759 def reverse_bits(val, width):
761 for _ in range(width):
762 result = (result << 1) | (val & 1)
767 # iterative version of [recursively-applied] half-reversing
768 # turns out this is Gray-Encoding.
769 def halfrev2(vec, pre_rev=True):
771 for i in range(len(vec)):
773 res.append(vec[i ^ (i>>1)])
777 for ji in range(1, bl):
783 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
784 # get indices to iterate over, in the required order
786 mode = SVSHAPE.lims[1]
787 stride = SVSHAPE.lims[2]
789 # reference list for not needing to do data-swaps, just swap what
790 # *indices* are referenced (two levels of indirection at the moment)
791 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
794 levels = n.bit_length() - 1
795 ri = [reverse_bits(i, levels) for i in range(n)]
797 if SVSHAPE.mode == 0b01: # FFT, bitrev only
798 ji = [ji[ri[i]] for i in range(n)]
799 elif SVSHAPE.submode2 == 0b001:
800 ji = [ji[ri[i]] for i in range(n)]
801 ji = halfrev2(ji, True)
803 ji = halfrev2(ji, False)
804 ji = [ji[ri[i]] for i in range(n)]
806 # invert order if requested
807 if SVSHAPE.invxyz[0]:
810 for i, jl in enumerate(ji):
812 yield jl * stride, (0b111 if y_end else 0b000)
814 def iterate_dct_inner_costable_indices(SVSHAPE):
815 # get indices to iterate over, in the required order
817 mode = SVSHAPE.lims[1]
818 stride = SVSHAPE.lims[2]
819 # creating lists of indices to iterate over in each dimension
820 # has to be done dynamically, because it depends on the size
821 # first, the size-based loop (which can be done statically)
827 # invert order if requested
828 if SVSHAPE.invxyz[0]:
834 # start an infinite (wrapping) loop
836 z_end = 1 # doesn't exist in this, only 2 loops
839 for size in x_r: # loop over 3rd order dimension (size)
840 x_end = size == x_r[-1]
841 # y_r schedule depends on size
844 for i in range(0, n, size):
846 # invert if requested
847 if SVSHAPE.invxyz[1]: y_r.reverse()
848 # two lists of half-range indices, e.g. j 0123, jr 7654
849 j = list(range(0, halfsize))
850 # invert if requested
851 if SVSHAPE.invxyz[2]: j_r.reverse()
852 # loop over 1st order dimension
853 for ci, jl in enumerate(j):
855 # now depending on MODE return the index. inner butterfly
856 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
857 result = k # offset into COS table
858 elif SVSHAPE.skip == 0b10: #
859 result = ci # coefficient helper
860 elif SVSHAPE.skip == 0b11: #
861 result = size # coefficient helper
863 ((y_end and z_end)<<1) |
864 ((y_end and x_end and z_end)<<2))
866 yield (result * stride) + SVSHAPE.offset, loopends
869 def iterate_dct_inner_butterfly_indices(SVSHAPE):
870 # get indices to iterate over, in the required order
872 mode = SVSHAPE.lims[1]
873 stride = SVSHAPE.lims[2]
874 # creating lists of indices to iterate over in each dimension
875 # has to be done dynamically, because it depends on the size
876 # first, the size-based loop (which can be done statically)
882 # invert order if requested
883 if SVSHAPE.invxyz[0]:
889 # reference (read/write) the in-place data in *reverse-bit-order*
891 if SVSHAPE.submode2 == 0b01:
892 levels = n.bit_length() - 1
893 ri = [ri[reverse_bits(i, levels)] for i in range(n)]
895 # reference list for not needing to do data-swaps, just swap what
896 # *indices* are referenced (two levels of indirection at the moment)
897 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
900 if inplace_mode and SVSHAPE.submode2 == 0b01:
901 ji = halfrev2(ji, True)
902 if inplace_mode and SVSHAPE.submode2 == 0b11:
903 ji = halfrev2(ji, False)
905 # start an infinite (wrapping) loop
909 for size in x_r: # loop over 3rd order dimension (size)
910 x_end = size == x_r[-1]
911 # y_r schedule depends on size
914 for i in range(0, n, size):
916 # invert if requested
917 if SVSHAPE.invxyz[1]: y_r.reverse()
918 for i in y_r: # loop over 2nd order dimension
920 # two lists of half-range indices, e.g. j 0123, jr 7654
921 j = list(range(i, i + halfsize))
922 jr = list(range(i+halfsize, i + size))
924 # invert if requested
925 if SVSHAPE.invxyz[2]:
928 hz2 = halfsize // 2 # zero stops reversing 1-item lists
929 # loop over 1st order dimension
931 for ci, (jl, jh) in enumerate(zip(j, jr)):
933 # now depending on MODE return the index. inner butterfly
934 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
935 if SVSHAPE.submode2 == 0b11: # iDCT
936 result = ji[ri[jl]] # lower half
938 result = ri[ji[jl]] # lower half
939 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
940 if SVSHAPE.submode2 == 0b11: # iDCT
941 result = ji[ri[jl+halfsize]] # upper half
943 result = ri[ji[jh]] # upper half
945 # COS table pre-generated mode
946 if SVSHAPE.skip == 0b10: #
947 result = k # cos table offset
949 # COS table generated on-demand ("Vertical-First") mode
950 if SVSHAPE.skip == 0b10: #
951 result = ci # coefficient helper
952 elif SVSHAPE.skip == 0b11: #
953 result = size # coefficient helper
955 ((y_end and z_end)<<1) |
956 ((y_end and x_end and z_end)<<2))
958 yield (result * stride) + SVSHAPE.offset, loopends
963 for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
965 tmp1, tmp2 = ji[jlh], ji[jh]
966 ji[jlh], ji[jh] = tmp2, tmp1
968 # new k_start point for cos tables( runs inside x_r loop NOT i loop)
972 # python "yield" can be iterated. use this to make it clear how
973 # the indices are generated by using natural-looking nested loops
974 def iterate_dct_outer_butterfly_indices(SVSHAPE):
975 # get indices to iterate over, in the required order
977 mode = SVSHAPE.lims[1]
978 stride = SVSHAPE.lims[2]
979 # creating lists of indices to iterate over in each dimension
980 # has to be done dynamically, because it depends on the size
981 # first, the size-based loop (which can be done statically)
987 # invert order if requested
988 if SVSHAPE.invxyz[0]:
994 # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
996 if SVSHAPE.submode2 in [0b11, 0b01]:
997 levels = n.bit_length() - 1
998 ri = [ri[reverse_bits(i, levels)] for i in range(n)]
1000 # reference list for not needing to do data-swaps, just swap what
1001 # *indices* are referenced (two levels of indirection at the moment)
1002 # pre-reverse the data-swap list so that it *ends up* in the order 0123..
1004 inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
1005 if SVSHAPE.submode2 == 0b11:
1006 ji = halfrev2(ji, False)
1008 # start an infinite (wrapping) loop
1012 for size in x_r: # loop over 3rd order dimension (size)
1014 x_end = size == x_r[-1]
1015 y_r = list(range(0, halfsize))
1016 # invert if requested
1017 if SVSHAPE.invxyz[1]: y_r.reverse()
1018 for i in y_r: # loop over 2nd order dimension
1019 y_end = i == y_r[-1]
1020 # one list to create iterative-sum schedule
1021 jr = list(range(i+halfsize, i+n-halfsize, size))
1022 # invert if requested
1023 if SVSHAPE.invxyz[2]: jr.reverse()
1024 hz2 = halfsize // 2 # zero stops reversing 1-item lists
1026 for ci, jh in enumerate(jr): # loop over 1st order dimension
1027 z_end = jh == jr[-1]
1029 # COS table pre-generated mode
1030 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
1031 if SVSHAPE.submode2 == 0b11: # iDCT
1032 result = ji[ri[jh]] # upper half
1034 result = ri[ji[jh]] # lower half
1035 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
1036 if SVSHAPE.submode2 == 0b11: # iDCT
1037 result = ji[ri[jh+size]] # upper half
1039 result = ri[ji[jh+size]] # upper half
1040 elif SVSHAPE.skip == 0b10: #
1041 result = k # cos table offset
1043 # COS table generated on-demand ("Vertical-First") mode
1044 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
1045 if SVSHAPE.submode2 == 0b11: # iDCT
1046 result = ji[ri[jh]] # lower half
1048 result = ri[ji[jh]] # lower half
1049 elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
1050 if SVSHAPE.submode2 == 0b11: # iDCT
1051 result = ji[ri[jh+size]] # upper half
1053 result = ri[ji[jh+size]] # upper half
1054 elif SVSHAPE.skip == 0b10: #
1055 result = ci # coefficient helper
1056 elif SVSHAPE.skip == 0b11: #
1057 result = size # coefficient helper
1059 ((y_end and z_end)<<1) |
1060 ((y_end and x_end and z_end)<<2))
1062 yield (result * stride) + SVSHAPE.offset, loopends
1065 # new k_start point for cos tables( runs inside x_r loop NOT i loop)
1072 Selecting which REMAP Schedule to use is shown by the pseudocode below.
1073 Each SVSHAPE 0-3 goes through this selection process.
1076 if SVSHAPEn.mode == 0b00: iterate_fn = iterate_indices
1077 if SVSHAPEn.mode == 0b10: iterate_fn = iterate_preduce_indices
1078 if SVSHAPEn.mode in [0b01, 0b11]:
1079 # further sub-selection
1080 if SVSHAPEn.ydimsz == 1: iterate_fn = iterate_butterfly_indices
1081 if SVSHAPEn.ydimsz == 2: iterate_fn = iterate_dct_inner_butterfly_indices
1082 if SVSHAPEn.ydimsz == 3: iterate_fn = iterate_dct_outer_butterfly_indices
1083 if SVSHAPEn.ydimsz in [5, 13]: iterate_fn = iterate_dct_inner_costable_indices
1084 if SVSHAPEn.ydimsz in [6, 14, 15]: iterate_fn = iterate_dct_inner_halfswap_loadstore