openpower/sv/rfc/ls009.mdwn

   1 # RFC ls009 Simple-V REMAP Subsystem
   2
   3 **URLs**:
   4
   5 * <https://libre-soc.org/openpower/sv/rfc/ls009/>
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
   7 * <https://git.openpower.foundation/isa/PowerISA/issues/124>
   8
   9 **Severity**: Major
  10
  11 **Status**: New
  12
  13 **Date**: 26 Mar 2023. v2 TODO
  14
  15 **Target**: v3.2B
  16
  17 **Source**: v3.0B
  18
  19 **Books and Section affected**:
  20
  21 ```
  22     Book I, new Zero-Overhead-Loop Chapter.
  23     Appendix E Power ISA sorted by opcode
  24     Appendix F Power ISA sorted by version
  25     Appendix G Power ISA sorted by Compliancy Subset
  26     Appendix H Power ISA sorted by mnemonic
  27 ```
  28
  29 **Summary**
  30
  31 ```
  32     svremap  - Re-Mapping of Register Element Offsets
  33     svindex  - General-purpose setting of SHAPEs to be re-mapped
  34     svshape  - Hardware-level setting of SHAPEs for element re-mapping
  35     svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
  36 ```
  37
  38 **Submitter**: Luke Leighton (Libre-SOC)
  39
  40 **Requester**: Libre-SOC
  41
  42 **Impact on processor**:
  43
  44 ```
  45     Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
  46     Management Instructions which provide advanced features such as Matrix
  47     FFT DCT Hardware-Assist Schedules and general-purpose Index reordering.
  48 ```
  49
  50 **Impact on software**:
  51
  52 ```
  53     Requires support for new instructions in assembler, debuggers,
  54     and related tools.
  55 ```
  56
  57 **Keywords**:
  58
  59 ```
  60     Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
  61     Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
  62     Digital Signal Processing (DSP)
  63 ```
  64
  65 **Motivation**
  66
  67 These REMAP Management instructions provide state-of-the-art advanced capabilities
  68 to dramatically decrease instruction count and power reduction whilst retaining
  69 unprecedented general-purpose capability and a standard Sequential Execution Model.
  70
  71 **Notes and Observations**:
  72
  73 1. Although costly the alternatives in SIMD-paradigm software result in
  74    huge algorithmic complexity and associated power consumption increases.
  75    Loop-unrolling compilers are prevalent as is thousands to tens of
  76    thousands of instructions.
  77 2. Core inner kernels of Matrix DCT DFT FFT NTT are dramatically reduced
  78    to a handful of instructions.
  79 3. No REMAP instructions with the exception of Indexed rely on registers
  80    for the establishment of REMAP capability.
  81 4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
  82    and flexibility of REMAP.
  83 5. The Simple-V Compliancy Subsets make REMAP optional except in the
  84    Advanced Levels.  Like v3.1 MMA it is **not** necessary for **all**
  85    hardware to implement REMAP.
  86
  87 **Changes**
  88
  89 Add the following entries to:
  90
  91 * the Appendices of SV Book
  92 * Instructions of SV Book as a new Section
  93 * SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
  94
  95 ----------------
  96
  97 \newpage{}
  98
  99 # Rationale and background
 100
 101 In certain common algorithms there are clear patterns of behaviour which
 102 typically require inline loop-unrolled instructions comprising interleaved
 103 permute and arithmetic operations.  REMAP was invented to split the permuting
 104 from the arithmetic, and to allow the Indexing to be done as a hardware Schedule.
 105
 106 A normal Vector Add:
 107
 108 ```
 109       for i in range(VL):
 110         GPR[RT+i] <= GPR[RA+i] +
 111                      GPR[RB+i];
 112 ```
 113
 114 A Hardware-assisted REMAP Vector Add:
 115
 116 ```
 117       for i in range(VL):
 118         GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] +
 119                              GPR[RB+remap3(i)];
 120 ```
 121
 122 The result is a huge saving on register file accesses (no need to calculate Indices
 123 then use Permutation instructions), instruction count (Matrix Multiply up to 127 FMACs
 124 is 3 instructions), and programmer sanity.
 125
 126 # REMAP types
 127
 128 This section summarises the motivation for each REMAP Schedule
 129 and briefly goes over their characteristics and limitations.
 130 Further details on the Deterministic Precise-Interruptible algorithms
 131 used in these Schedules is found in the [[sv/remap/appendix]].
 132
 133 ## Matrix (1D/2D/3D shaping)
 134
 135 Matrix Multiplication is a huge part of High-Performance Compute,
 136 and 3D.
 137 In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
 138 Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
 139 cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling.
 140 Aside from the cost of the load on the L1 I-Cache, the trick only
 141 works if one of the dimensions X or Y are power-two. Prime Numbers
 142 (5x7, 3x5) become deeply problematic to unroll.
 143
 144 Even traditional Scalable Vector ISAs have issues with Matrices, often
 145 having to perform data Transpose by pushing out through Memory and back
 146 (costly),
 147 or computing Transposition Indices (costly) then copying to another
 148 Vector (costly).
 149
 150 Matrix REMAP was thus designed to solve these issues by providing Hardware
 151 Assisted
 152 "Schedules" that can view what would otherwise be limited to a strictly
 153 linear Vector as instead being 2D (even 3D) *in-place* reordered.
 154 With both Transposition and non-power-two being supported the issues
 155 faced by other ISAs are mitigated.
 156
 157 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
 158 restricted to 127: up to 127 FMAs (or other operation)
 159 may be performed in total.
 160 Also given that it is in-registers only at present some care has to be
 161 taken on regfile resource utilisation. However it is perfectly possible
 162 to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
 163 the usual 6-level "Tiled" large Matrix Multiply, without the usual
 164 difficulties associated with SIMD.
 165
 166 Also the `svshape` instruction only provides access to part of the
 167 Matrix REMAP capability. Rotation and mirroring need to be done by
 168 programming the SVSHAPE SPRs directly, which can take a lot more
 169 instructions. Future versions of SVP64 will include EXT1xx prefixed
 170 variants (`psvshape`) which provide more comprehensive capacity and
 171 mitigate the need to write direct to the SVSHAPE SPRs.
 172
 173 ## FFT/DCT Triple Loop
 174
 175 DCT and FFT are some of the most astonishingly used algorithms in
 176 Computer Science.  Radar, Audio, Video, R.F. Baseband and dozens more.  At least
 177 two DSPs, TMS320 and Hexagon, have VLIW instructions specially tailored
 178 to FFT.
 179
 180 An in-depth analysis showed that it is possible to do in-place in-register
 181 DCT and FFT as long as twin-result "butterfly" instructions are provided.
 182 These can be found in the [[openpower/isa/svfparith]] page if performing
 183 IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
 184 integer operations would be required)*. These "butterfly" instructions
 185 avoid the need for a temporary register because the two array positions
 186 being overwritten will be "in-flight" in any In-Order or Out-of-Order
 187 micro-architecture.
 188
 189 DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
 190 accept predicate masks.  Given that it is common to perform recursive
 191 convolutions combining smaller Power-2 DCT/FFT to create larger DCT/FFTs
 192 in practice the RADIX2 limit is not a problem.  A Bluestein convolution
 193 to compute arbitrary length is demonstrated by
 194 [Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
 195
 196 ## Indexed
 197
 198 The purpose of Indexing is to provide a generalised version of
 199 Vector ISA "Permute" instructions, such as VSX `vperm`.  The
 200 Indexing is abstracted out and may be applied to much more
 201 than an element move/copy, and is not limited for example
 202 to the number of bytes that can fit into a VSX register.
 203 Indexing may be applied to LD/ST (even on Indexed LD/ST
 204 instructions such as `sv.lbzx`), arithmetic operations,
 205 extsw: there is no artificial limit.
 206
 207 The original motivation for Indexed REMAP was to mitigate the need to add
 208 an expensive `mv.x` to the Scalar ISA, which was highly likely to be rejected as
 209 a stand-alone instruction
 210 (`GPR(RT) <- GPR(GPR(RA))`).  Usually a Vector ISA would add a non-conflicting
 211 variant (as in VSX `vperm`) but it is common to need to permute by source,
 212 with the risk of conflict, that has to be resolved, for example, in AVX-512
 213 with `conflictd`.
 214
 215 Indexed REMAP is the "Get out of Jail Free" card which (for a price)
 216 allows any general-purpose arbitrary Element Permutation not covered by
 217 the other Hardware Schedules.
 218
 219 ## Parallel Reduction
 220
 221 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture.  Like Scalar reduction, the "Scalar Base"
 222 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
 223 *appearance* and *effect* of Reduction. Parallel Reduction is not limited
 224 to Power-of-two but is limited as usual by the total number of
 225 element operations (127) as well as available register file size.
 226
 227 Parallel Reduction is normally done explicitly as a loop-unrolled operation,
 228 taking up a significant number of instructions. With REMAP it is just three
 229 instructions: two for setup and one Scalar Base.
 230
 231 ## Parallel Prefix Sum
 232
 233 This is a work-efficient Parallel Schedule that for example produces Trangular
 234 or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
 235 to Parallel Reduction.  Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
 236 end-result, implementations may only implement Mapreduce in serial form (or give
 237 the appearance to Programmers of the same). The Parallel Prefix Schedule is
 238 *required* to be implemented in such a way that its Deterministic Schedule may be
 239 parallelised. Like the Reduction Schedule it is 100% Deterministic and consequently
 240 may be used with non-commutative operations.
 241
 242 Parallel Reduction combined with Parallel Prefix Sum can be combined to help
 243 perform AI non-linear interpolation in around twelve to fifteen instructions.
 244
 245 ----------------
 246
 247 \newpage{}
 248
 249 Add the following to SV Book
 250
 251 [[!inline pages="openpower/sv/remap" raw=yes ]]
 252
 253 # Forms
 254
 255 Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
 256
 257 Add the following to Book I, 1.6.1, SVI-Form
 258
 259 ```
 260     |0     |6    |11    |16   |21 |23  |24|25|26    31|
 261     | PO   |  SVG|rmm   | SVd |ew |SVyx|mm|sk|   XO   |
 262 ```
 263
 264 Add the following to Book I, 1.6.1, SVM-Form
 265
 266 ```
 267     |0     |6        |11      |16    |21    |25 |26    |31  |
 268     | PO   |  SVxd   |   SVyd | SVzd | SVrm |vf |   XO      |
 269 ```
 270
 271 Add the following to Book I, 1.6.1, SVM2-Form
 272
 273 ```
 274     |0     |6     |10  |11      |16    |21 |24|25 |26    |31  |
 275     | PO   | SVo  |SVyx|   rmm  | SVd  |XO |mm|sk |   XO      |
 276 ```
 277
 278 Add the following to Book I, 1.6.1, SVRM-Form
 279
 280 ```
 281     |0     |6     |11  |13   |15   |17   |19   |21  |22   |26     |31 |
 282     | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |///  | XO        |
 283 ```
 284
 285 Add the following to Book I, 1.6.2
 286
 287 ```
 288     mi0 (11:12)
 289         Field used in REMAP to select the SVSHAPE for 1st input register
 290         Formats: SVRM
 291     mi1 (13:14)
 292         Field used in REMAP to select the SVSHAPE for 2nd input register
 293         Formats: SVRM
 294     mi2 (15:16)
 295         Field used in REMAP to select the SVSHAPE for 3rd input register
 296         Formats: SVRM
 297     mm (24)
 298         Field used to specify the meaning of the rmm field for SVI-Form
 299         and SVM2-Form
 300         Formats: SVI, SVM2
 301     mo0 (17:18)
 302         Field used in REMAP to select the SVSHAPE for 1st output register
 303         Formats: SVRM
 304     mo1 (19:20)
 305         Field used in REMAP to select the SVSHAPE for 2nd output register
 306         Formats: SVRM
 307     pst (21)
 308         Field used in REMAP to indicate "persistence" mode (REMAP
 309         continues to apply to multiple instructions)
 310         Formats: SVRM
 311     rmm (11:15)
 312         REMAP Mode field for SVI-Form and SVM2-Form
 313         Formats: SVI, SVM2
 314     sk (25)
 315         Field used to specify dimensional skipping in svindex
 316         Formats: SVI, SVM2
 317     SVd (16:20)
 318         Immediate field used to specify the size of the REMAP dimension
 319         in the svindex and svshape2 instructions
 320         Formats: SVI, SVM2
 321     SVDS (16:29)
 322         Immediate field used to specify a 9-bit signed
 323         two's complement integer which is concatenated
 324         on the right with 0b00 and sign-extended to 64 bits.
 325         Formats: SVDS
 326     SVG (6:10)
 327         Field used to specify a GPR to be used as a
 328         source for indexing.
 329         Formats: SVI
 330     SVi (16:22)
 331          Simple-V immediate field for setting VL or MVL
 332          Formats: SVL
 333     SVme (6:10)
 334          Simple-V "REMAP" map-enable bits (0-4)
 335          Formats: SVRM
 336     SVo (6:9)
 337         Field used by the svshape2 instruction as an offset
 338         Formats: SVM2
 339     SVrm (21:24)
 340          Simple-V "REMAP" Mode
 341          Formats: SVM
 342     SVxd (6:10)
 343          Simple-V "REMAP" x-dimension size
 344          Formats: SVM
 345     SVyd (11:15)
 346          Simple-V "REMAP" y-dimension size
 347          Formats: SVM
 348     SVzd (16:20)
 349          Simple-V "REMAP" z-dimension size
 350          Formats: SVM
 351     XO (21:23,26:31)
 352         Extended opcode field.  Note that bit 21 must be 1, 22 and 23
 353         must be zero, and bits 26-31 must be exactly the same as
 354         used for svshape.
 355         Formats: SVM2
 356 ```
 357
 358 # Appendices
 359
 360     Appendix E Power ISA sorted by opcode
 361     Appendix F Power ISA sorted by version
 362     Appendix G Power ISA sorted by Compliancy Subset
 363     Appendix H Power ISA sorted by mnemonic
 364
 365 | Form | Book | Page | Version | mnemonic | Description |
 366 |------|------|------|---------|----------|-------------|
 367 | SVRM | I    | #    | 3.0B    | svremap    | REMAP enabling instruction |
 368 | SVM  | I    | #    | 3.0B    | svshape  | REMAP shape instruction |
 369 | SVM2 | I    | #    | 3.0B    | svshape2 | REMAP shape instruction (2) |
 370 | SVI  | I    | #    | 3.0B    | svindex | REMAP General-purpose Indexing |
 371
 372 [[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
 373
 374 ## REMAP pseudocode
 375
 376 Written in python3 the following stand-alone executable source code is the Canonical
 377 Specification for each REMAP. Vectors of "loopends" are returned when Rc=1
 378 in Vectors of CR Fields on `sv.svstep.`, or in Vertical-First Mode
 379 a single CR Field (CR0) on `svstep.`.  The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
 380 offset is put through each algorithm to determine the actual Element Offset.
 381 Alternative implementations producing different ordering
 382 is prohibited as software will be critically relying on these Deterministic Schedules.
 383
 384 ### REMAP 2D/3D Matrix
 385
 386 The following stand-alone executable source code is the Canonical
 387 Specification for Matrix (2D/3D) REMAP.
 388 Hardware implementations are achievable with simple cascading counter-and-compares.
 389
 390 ```
 391 # python "yield" can be iterated. use this to make it clear how
 392 # the indices are generated by using natural-looking nested loops
 393 def iterate_indices(SVSHAPE):
 394     # get indices to iterate over, in the required order
 395     xd = SVSHAPE.lims[0]
 396     yd = SVSHAPE.lims[1]
 397     zd = SVSHAPE.lims[2]
 398     # create lists of indices to iterate over in each dimension
 399     x_r = list(range(xd))
 400     y_r = list(range(yd))
 401     z_r = list(range(zd))
 402     # invert the indices if needed
 403     if SVSHAPE.invxyz[0]: x_r.reverse()
 404     if SVSHAPE.invxyz[1]: y_r.reverse()
 405     if SVSHAPE.invxyz[2]: z_r.reverse()
 406     # start an infinite (wrapping) loop
 407     step = 0 # track src/dst step
 408     while True:
 409         for z in z_r:   # loop over 1st order dimension
 410             z_end = z == z_r[-1]
 411             for y in y_r:       # loop over 2nd order dimension
 412                 y_end = y == y_r[-1]
 413                 for x in x_r:           # loop over 3rd order dimension
 414                     x_end = x == x_r[-1]
 415                     # ok work out which order to construct things in.
 416                     # start by creating a list of tuples of the dimension
 417                     # and its limit
 418                     vals = [(SVSHAPE.lims[0], x, "x"),
 419                             (SVSHAPE.lims[1], y, "y"),
 420                             (SVSHAPE.lims[2], z, "z")
 421                            ]
 422                     # now select those by order.  this allows us to
 423                     # create schedules for [z][x], [x][y], or [y][z]
 424                     # for matrix multiply.
 425                     vals = [vals[SVSHAPE.order[0]],
 426                             vals[SVSHAPE.order[1]],
 427                             vals[SVSHAPE.order[2]]
 428                            ]
 429                     # ok now we can construct the result, using bits of
 430                     # "order" to say which ones get stacked on
 431                     result = 0
 432                     mult = 1
 433                     for i in range(3):
 434                         lim, idx, dbg = vals[i]
 435                         # some of the dimensions can be "skipped".  the order
 436                         # was actually selected above on all 3 dimensions,
 437                         # e.g. [z][x][y] or [y][z][x].  "skip" allows one of
 438                         # those to be knocked out
 439                         if SVSHAPE.skip == i+1: continue
 440                         idx *= mult   # shifts up by previous dimension(s)
 441                         result += idx # adds on this dimension
 442                         mult *= lim   # for the next dimension
 443
 444                     loopends = (x_end |
 445                                ((y_end and x_end)<<1) |
 446                                 ((y_end and x_end and z_end)<<2))
 447
 448                     yield result + SVSHAPE.offset, loopends
 449                     step += 1
 450
 451 def demo():
 452     # set the dimension sizes here
 453     xdim = 3
 454     ydim = 2
 455     zdim = 4
 456
 457     # set total (can repeat, e.g. VL=x*y*z*4)
 458     VL = xdim * ydim * zdim
 459
 460     # set up an SVSHAPE
 461     class SVSHAPE:
 462         pass
 463     SVSHAPE0 = SVSHAPE()
 464     SVSHAPE0.lims = [xdim, ydim, zdim]
 465     SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
 466     SVSHAPE0.mode = 0b00
 467     SVSHAPE0.skip = 0b00
 468     SVSHAPE0.offset = 0       # experiment with different offset, here
 469     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 470
 471     # enumerate over the iterator function, getting new indices
 472     for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
 473         if idx >= VL:
 474             break
 475         print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
 476
 477 # run the demo
 478 if __name__ == '__main__':
 479     demo()
 480 ```
 481
 482 ### REMAP Parallel Reduction pseudocode
 483
 484 The python3 program below is stand-alone executable and is the Canonical Specification
 485 for Parallel Reduction REMAP.
 486 The Algorithm below is not limited to RADIX2 sizes, and Predicate
 487 sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
 488 has been applied, not before.  MV operations are not required: the algorithm
 489 tracks positions of elements that would normally be moved and when applying
 490 an Element Reduction Operation sources the operands from their last-known (tracked)
 491 position.
 492
 493 ```
 494 # a "yield" version of the Parallel Reduction REMAP algorithm.
 495 # the algorithm is in-place. it does not perform "MV" operations.
 496 # instead, where a masked-out value *should* be read from is tracked
 497
 498 def iterate_indices(SVSHAPE, pred=None):
 499     # get indices to iterate over, in the required order
 500     xd = SVSHAPE.lims[0]
 501     # create lists of indices to iterate over in each dimension
 502     ix = list(range(xd))
 503     # invert the indices if needed
 504     if SVSHAPE.invxyz[0]: ix.reverse()
 505     # start a loop from the lowest step
 506     step = 1
 507     steps = []
 508     while step < xd:
 509         step *= 2
 510         steps.append(step)
 511     # invert the indices if needed
 512     if SVSHAPE.invxyz[1]: steps.reverse()
 513     for step in steps:
 514         stepend = (step == steps[-1]) # note end of steps
 515         idxs = list(range(0, xd, step))
 516         results = []
 517         for i in idxs:
 518             other = i + step // 2
 519             ci = ix[i]
 520             oi = ix[other] if other < xd else None
 521             other_pred = other < xd and (pred is None or pred[oi])
 522             if (pred is None or pred[ci]) and other_pred:
 523                 if SVSHAPE.skip == 0b00: # submode 00
 524                     result = ci
 525                 elif SVSHAPE.skip == 0b01: # submode 01
 526                     result = oi
 527                 results.append([result + SVSHAPE.offset, 0])
 528             elif other_pred:
 529                 ix[i] = oi
 530         if results:
 531             results[-1][1] = (stepend<<1) | 1  # notify end of loops
 532         yield from results
 533
 534 def demo():
 535     # set the dimension sizes here
 536     xdim = 9
 537
 538     # set up an SVSHAPE
 539     class SVSHAPE:
 540         pass
 541     SVSHAPE0 = SVSHAPE()
 542     SVSHAPE0.lims = [xdim, 0, 0]
 543     SVSHAPE0.order = [0,1,2]
 544     SVSHAPE0.mode = 0b10
 545     SVSHAPE0.skip = 0b00
 546     SVSHAPE0.offset = 0       # experiment with different offset, here
 547     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 548
 549     SVSHAPE1 = SVSHAPE()
 550     SVSHAPE1.lims = [xdim, 0, 0]
 551     SVSHAPE1.order = [0,1,2]
 552     SVSHAPE1.mode = 0b10
 553     SVSHAPE1.skip = 0b01
 554     SVSHAPE1.offset = 0       # experiment with different offset, here
 555     SVSHAPE1.invxyz = [0,0,0] # inversion if desired
 556
 557     # enumerate over the iterator function, getting new indices
 558     shapes = list(iterate_indices(SVSHAPE0)), \
 559               list(iterate_indices(SVSHAPE1))
 560     for idx in range(len(shapes[0])):
 561         l = shapes[0][idx]
 562         r = shapes[1][idx]
 563         (l_idx, lend) = l
 564         (r_idx, rend) = r
 565         print ("%d->%d:%d" % (idx, l_idx, r_idx),
 566                "end", bin(lend)[2:], bin(rend)[2:])
 567
 568 # run the demo
 569 if __name__ == '__main__':
 570     demo()
 571 ```
 572
 573 ### REMAP FFT pseudocode
 574
 575 The FFT REMAP is RADIX2 only.
 576
 577 ```
 578 # a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
 579 # original code for the FFT Tukey-Cooley schedule:
 580 # https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
 581 """
 582     # Radix-2 decimation-in-time FFT (real, not complex)
 583     size = 2
 584     while size <= n:
 585         halfsize = size // 2
 586         tablestep = n // size
 587         for i in range(0, n, size):
 588             k = 0
 589             for j in range(i, i + halfsize):
 590                 jh = j+halfsize
 591                 jl = j
 592                 temp1 = vec[jh] * exptable[k]
 593                 temp2 = vec[jl]
 594                 vec[jh] = temp2 - temp1
 595                 vec[jl] = temp2 + temp1
 596                 k += tablestep
 597         size *= 2
 598 """
 599
 600 # python "yield" can be iterated. use this to make it clear how
 601 # the indices are generated by using natural-looking nested loops
 602 def iterate_butterfly_indices(SVSHAPE):
 603     # get indices to iterate over, in the required order
 604     n = SVSHAPE.lims[0]
 605     stride = SVSHAPE.lims[2] # stride-multiplier on reg access
 606     # creating lists of indices to iterate over in each dimension
 607     # has to be done dynamically, because it depends on the size
 608     # first, the size-based loop (which can be done statically)
 609     x_r = []
 610     size = 2
 611     while size <= n:
 612         x_r.append(size)
 613         size *= 2
 614     # invert order if requested
 615     if SVSHAPE.invxyz[0]: x_r.reverse()
 616
 617     if len(x_r) == 0:
 618         return
 619
 620     # start an infinite (wrapping) loop
 621     skip = 0
 622     while True:
 623         for size in x_r:           # loop over 3rd order dimension (size)
 624             x_end = size == x_r[-1]
 625             # y_r schedule depends on size
 626             halfsize = size // 2
 627             tablestep = n // size
 628             y_r = []
 629             for i in range(0, n, size):
 630                 y_r.append(i)
 631             # invert if requested
 632             if SVSHAPE.invxyz[1]: y_r.reverse()
 633             for i in y_r:       # loop over 2nd order dimension
 634                 y_end = i == y_r[-1]
 635                 k_r = []
 636                 j_r = []
 637                 k = 0
 638                 for j in range(i, i+halfsize):
 639                     k_r.append(k)
 640                     j_r.append(j)
 641                     k += tablestep
 642                 # invert if requested
 643                 if SVSHAPE.invxyz[2]: k_r.reverse()
 644                 if SVSHAPE.invxyz[2]: j_r.reverse()
 645                 for j, k in zip(j_r, k_r):   # loop over 1st order dimension
 646                     z_end = j == j_r[-1]
 647                     # now depending on MODE return the index
 648                     if SVSHAPE.skip == 0b00:
 649                         result = j              # for vec[j]
 650                     elif SVSHAPE.skip == 0b01:
 651                         result = j + halfsize   # for vec[j+halfsize]
 652                     elif SVSHAPE.skip == 0b10:
 653                         result = k              # for exptable[k]
 654
 655                     loopends = (z_end |
 656                                ((y_end and z_end)<<1) |
 657                                 ((y_end and x_end and z_end)<<2))
 658
 659                     yield (result * stride) + SVSHAPE.offset, loopends
 660
 661 def demo():
 662     # set the dimension sizes here
 663     xdim = 8
 664     ydim = 0 # not needed
 665     zdim = 1 # stride must be set to 1
 666
 667     # set total. err don't know how to calculate how many there are...
 668     # do it manually for now
 669     VL = 0
 670     size = 2
 671     n = xdim
 672     while size <= n:
 673         halfsize = size // 2
 674         tablestep = n // size
 675         for i in range(0, n, size):
 676             for j in range(i, i + halfsize):
 677                 VL += 1
 678         size *= 2
 679
 680     # set up an SVSHAPE
 681     class SVSHAPE:
 682         pass
 683     # j schedule
 684     SVSHAPE0 = SVSHAPE()
 685     SVSHAPE0.lims = [xdim, ydim, zdim]
 686     SVSHAPE0.order = [0,1,2]  # experiment with different permutations, here
 687     SVSHAPE0.mode = 0b01
 688     SVSHAPE0.skip = 0b00
 689     SVSHAPE0.offset = 0       # experiment with different offset, here
 690     SVSHAPE0.invxyz = [0,0,0] # inversion if desired
 691     # j+halfstep schedule
 692     SVSHAPE1 = SVSHAPE()
 693     SVSHAPE1.lims = [xdim, ydim, zdim]
 694     SVSHAPE1.order = [0,1,2]  # experiment with different permutations, here
 695     SVSHAPE0.mode = 0b01
 696     SVSHAPE1.skip = 0b01
 697     SVSHAPE1.offset = 0       # experiment with different offset, here
 698     SVSHAPE1.invxyz = [0,0,0] # inversion if desired
 699     # k schedule
 700     SVSHAPE2 = SVSHAPE()
 701     SVSHAPE2.lims = [xdim, ydim, zdim]
 702     SVSHAPE2.order = [0,1,2]  # experiment with different permutations, here
 703     SVSHAPE0.mode = 0b01
 704     SVSHAPE2.skip = 0b10
 705     SVSHAPE2.offset = 0       # experiment with different offset, here
 706     SVSHAPE2.invxyz = [0,0,0] # inversion if desired
 707
 708     # enumerate over the iterator function, getting new indices
 709     schedule = []
 710     for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
 711                                           iterate_butterfly_indices(SVSHAPE1),
 712                                           iterate_butterfly_indices(SVSHAPE2))):
 713         if idx >= VL:
 714             break
 715         schedule.append((jl, jh, k))
 716
 717     # ok now pretty-print the results, with some debug output
 718     size = 2
 719     idx = 0
 720     while size <= n:
 721         halfsize = size // 2
 722         tablestep = n // size
 723         print ("size %d halfsize %d tablestep %d" % \
 724                 (size, halfsize, tablestep))
 725         for i in range(0, n, size):
 726             prefix = "i %d\t" % i
 727             k = 0
 728             for j in range(i, i + halfsize):
 729                 (jl, je), (jh, he), (ks, ke) = schedule[idx]
 730                 print ("  %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
 731                         "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
 732                                 (idx, prefix, j, j+halfsize, k,
 733                                       jl, jh, ks,
 734                                 ),
 735                                 "end", bin(je)[2:], bin(je)[2:], bin(ke)[2:])
 736                 k += tablestep
 737                 idx += 1
 738         size *= 2
 739
 740 # run the demo
 741 if __name__ == '__main__':
 742     demo()
 743 ```
 744
 745 ### DCT REMAP
 746
 747 DCT REMAP is RADIX2 only.  Convolutions may be applied as usual
 748 to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
 749 instructions the algorithm below (written in python3), becomes part
 750 of an in-place in-registers Vectorised DCT.  The algorithms work
 751 by loading data such that as the nested loops progress the result
 752 is sorted into correct sequential order.
 753
 754 ```
 755 # DCT "REMAP" scheduler to create an in-place iterative DCT.
 756 #
 757
 758 # bits of the integer 'val' of width 'width' are reversed
 759 def reverse_bits(val, width):
 760     result = 0
 761     for _ in range(width):
 762         result = (result << 1) | (val & 1)
 763         val >>= 1
 764     return result
 765
 766
 767 # iterative version of [recursively-applied] half-reversing
 768 # turns out this is Gray-Encoding.
 769 def halfrev2(vec, pre_rev=True):
 770     res = []
 771     for i in range(len(vec)):
 772         if pre_rev:
 773             res.append(vec[i ^ (i>>1)])
 774         else:
 775             ri = i
 776             bl = i.bit_length()
 777             for ji in range(1, bl):
 778                 ri ^= (i >> ji)
 779             res.append(vec[ri])
 780     return res
 781
 782
 783 def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
 784     # get indices to iterate over, in the required order
 785     n = SVSHAPE.lims[0]
 786     mode = SVSHAPE.lims[1]
 787     stride = SVSHAPE.lims[2]
 788
 789     # reference list for not needing to do data-swaps, just swap what
 790     # *indices* are referenced (two levels of indirection at the moment)
 791     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
 792     ji = list(range(n))
 793
 794     levels = n.bit_length() - 1
 795     ri = [reverse_bits(i, levels) for i in range(n)]
 796
 797     if SVSHAPE.mode == 0b01: # FFT, bitrev only
 798         ji = [ji[ri[i]] for i in range(n)]
 799     elif SVSHAPE.submode2 == 0b001:
 800         ji = [ji[ri[i]] for i in range(n)]
 801         ji = halfrev2(ji, True)
 802     else:
 803         ji = halfrev2(ji, False)
 804         ji = [ji[ri[i]] for i in range(n)]
 805
 806     # invert order if requested
 807     if SVSHAPE.invxyz[0]:
 808         ji.reverse()
 809
 810     for i, jl in enumerate(ji):
 811         y_end = jl == ji[-1]
 812         yield jl * stride, (0b111 if y_end else 0b000)
 813
 814 def iterate_dct_inner_costable_indices(SVSHAPE):
 815     # get indices to iterate over, in the required order
 816     n = SVSHAPE.lims[0]
 817     mode = SVSHAPE.lims[1]
 818     stride = SVSHAPE.lims[2]
 819     # creating lists of indices to iterate over in each dimension
 820     # has to be done dynamically, because it depends on the size
 821     # first, the size-based loop (which can be done statically)
 822     x_r = []
 823     size = 2
 824     while size <= n:
 825         x_r.append(size)
 826         size *= 2
 827     # invert order if requested
 828     if SVSHAPE.invxyz[0]:
 829         x_r.reverse()
 830
 831     if len(x_r) == 0:
 832         return
 833
 834     # start an infinite (wrapping) loop
 835     skip = 0
 836     z_end = 1 # doesn't exist in this, only 2 loops
 837     k = 0
 838     while True:
 839         for size in x_r:           # loop over 3rd order dimension (size)
 840             x_end = size == x_r[-1]
 841             # y_r schedule depends on size
 842             halfsize = size // 2
 843             y_r = []
 844             for i in range(0, n, size):
 845                 y_r.append(i)
 846             # invert if requested
 847             if SVSHAPE.invxyz[1]: y_r.reverse()
 848             # two lists of half-range indices, e.g. j 0123, jr 7654
 849             j = list(range(0, halfsize))
 850             # invert if requested
 851             if SVSHAPE.invxyz[2]: j_r.reverse()
 852             # loop over 1st order dimension
 853             for ci, jl in enumerate(j):
 854                 y_end = jl == j[-1]
 855                 # now depending on MODE return the index.  inner butterfly
 856                 if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 857                     result = k  # offset into COS table
 858                 elif SVSHAPE.skip == 0b10: #
 859                     result = ci # coefficient helper
 860                 elif SVSHAPE.skip == 0b11: #
 861                     result = size # coefficient helper
 862                 loopends = (z_end |
 863                            ((y_end and z_end)<<1) |
 864                             ((y_end and x_end and z_end)<<2))
 865
 866                 yield (result * stride) + SVSHAPE.offset, loopends
 867                 k += 1
 868
 869 def iterate_dct_inner_butterfly_indices(SVSHAPE):
 870     # get indices to iterate over, in the required order
 871     n = SVSHAPE.lims[0]
 872     mode = SVSHAPE.lims[1]
 873     stride = SVSHAPE.lims[2]
 874     # creating lists of indices to iterate over in each dimension
 875     # has to be done dynamically, because it depends on the size
 876     # first, the size-based loop (which can be done statically)
 877     x_r = []
 878     size = 2
 879     while size <= n:
 880         x_r.append(size)
 881         size *= 2
 882     # invert order if requested
 883     if SVSHAPE.invxyz[0]:
 884         x_r.reverse()
 885
 886     if len(x_r) == 0:
 887         return
 888
 889     # reference (read/write) the in-place data in *reverse-bit-order*
 890     ri = list(range(n))
 891     if SVSHAPE.submode2 == 0b01:
 892         levels = n.bit_length() - 1
 893         ri = [ri[reverse_bits(i, levels)] for i in range(n)]
 894
 895     # reference list for not needing to do data-swaps, just swap what
 896     # *indices* are referenced (two levels of indirection at the moment)
 897     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
 898     ji = list(range(n))
 899     inplace_mode = True
 900     if inplace_mode and SVSHAPE.submode2 == 0b01:
 901         ji = halfrev2(ji, True)
 902     if inplace_mode and SVSHAPE.submode2 == 0b11:
 903         ji = halfrev2(ji, False)
 904
 905     # start an infinite (wrapping) loop
 906     while True:
 907         k = 0
 908         k_start = 0
 909         for size in x_r:           # loop over 3rd order dimension (size)
 910             x_end = size == x_r[-1]
 911             # y_r schedule depends on size
 912             halfsize = size // 2
 913             y_r = []
 914             for i in range(0, n, size):
 915                 y_r.append(i)
 916             # invert if requested
 917             if SVSHAPE.invxyz[1]: y_r.reverse()
 918             for i in y_r:       # loop over 2nd order dimension
 919                 y_end = i == y_r[-1]
 920                 # two lists of half-range indices, e.g. j 0123, jr 7654
 921                 j = list(range(i, i + halfsize))
 922                 jr = list(range(i+halfsize, i + size))
 923                 jr.reverse()
 924                 # invert if requested
 925                 if SVSHAPE.invxyz[2]:
 926                     j.reverse()
 927                     jr.reverse()
 928                 hz2 = halfsize // 2 # zero stops reversing 1-item lists
 929                 # loop over 1st order dimension
 930                 k = k_start
 931                 for ci, (jl, jh) in enumerate(zip(j, jr)):
 932                     z_end = jl == j[-1]
 933                     # now depending on MODE return the index.  inner butterfly
 934                     if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
 935                         if SVSHAPE.submode2 == 0b11: # iDCT
 936                             result = ji[ri[jl]]        # lower half
 937                         else:
 938                             result = ri[ji[jl]]        # lower half
 939                     elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
 940                         if SVSHAPE.submode2 == 0b11: # iDCT
 941                             result = ji[ri[jl+halfsize]] # upper half
 942                         else:
 943                             result = ri[ji[jh]] # upper half
 944                     elif mode == 4:
 945                         # COS table pre-generated mode
 946                         if SVSHAPE.skip == 0b10: #
 947                             result = k # cos table offset
 948                     else: # mode 2
 949                         # COS table generated on-demand ("Vertical-First") mode
 950                         if SVSHAPE.skip == 0b10: #
 951                             result = ci # coefficient helper
 952                         elif SVSHAPE.skip == 0b11: #
 953                             result = size # coefficient helper
 954                     loopends = (z_end |
 955                                ((y_end and z_end)<<1) |
 956                                 ((y_end and x_end and z_end)<<2))
 957
 958                     yield (result * stride) + SVSHAPE.offset, loopends
 959                     k += 1
 960
 961                 # now in-place swap
 962                 if inplace_mode:
 963                     for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
 964                         jlh = jl+halfsize
 965                         tmp1, tmp2 = ji[jlh], ji[jh]
 966                         ji[jlh], ji[jh] = tmp2, tmp1
 967
 968             # new k_start point for cos tables( runs inside x_r loop NOT i loop)
 969             k_start += halfsize
 970
 971
 972 # python "yield" can be iterated. use this to make it clear how
 973 # the indices are generated by using natural-looking nested loops
 974 def iterate_dct_outer_butterfly_indices(SVSHAPE):
 975     # get indices to iterate over, in the required order
 976     n = SVSHAPE.lims[0]
 977     mode = SVSHAPE.lims[1]
 978     stride = SVSHAPE.lims[2]
 979     # creating lists of indices to iterate over in each dimension
 980     # has to be done dynamically, because it depends on the size
 981     # first, the size-based loop (which can be done statically)
 982     x_r = []
 983     size = n // 2
 984     while size >= 2:
 985         x_r.append(size)
 986         size //= 2
 987     # invert order if requested
 988     if SVSHAPE.invxyz[0]:
 989         x_r.reverse()
 990
 991     if len(x_r) == 0:
 992         return
 993
 994     # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
 995     ri = list(range(n))
 996     if SVSHAPE.submode2 in [0b11, 0b01]:
 997         levels = n.bit_length() - 1
 998         ri = [ri[reverse_bits(i, levels)] for i in range(n)]
 999
1000     # reference list for not needing to do data-swaps, just swap what
1001     # *indices* are referenced (two levels of indirection at the moment)
1002     # pre-reverse the data-swap list so that it *ends up* in the order 0123..
1003     ji = list(range(n))
1004     inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
1005     if SVSHAPE.submode2 == 0b11:
1006         ji = halfrev2(ji, False)
1007
1008     # start an infinite (wrapping) loop
1009     while True:
1010         k = 0
1011         k_start = 0
1012         for size in x_r:           # loop over 3rd order dimension (size)
1013             halfsize = size//2
1014             x_end = size == x_r[-1]
1015             y_r = list(range(0, halfsize))
1016             # invert if requested
1017             if SVSHAPE.invxyz[1]: y_r.reverse()
1018             for i in y_r:       # loop over 2nd order dimension
1019                 y_end = i == y_r[-1]
1020                 # one list to create iterative-sum schedule
1021                 jr = list(range(i+halfsize, i+n-halfsize, size))
1022                 # invert if requested
1023                 if SVSHAPE.invxyz[2]: jr.reverse()
1024                 hz2 = halfsize // 2 # zero stops reversing 1-item lists
1025                 k = k_start
1026                 for ci, jh in enumerate(jr):   # loop over 1st order dimension
1027                     z_end = jh == jr[-1]
1028                     if mode == 4:
1029                         # COS table pre-generated mode
1030                         if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
1031                             if SVSHAPE.submode2 == 0b11: # iDCT
1032                                 result = ji[ri[jh]] # upper half
1033                             else:
1034                                 result = ri[ji[jh]]        # lower half
1035                         elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
1036                             if SVSHAPE.submode2 == 0b11: # iDCT
1037                                 result = ji[ri[jh+size]] # upper half
1038                             else:
1039                                 result = ri[ji[jh+size]] # upper half
1040                         elif SVSHAPE.skip == 0b10: #
1041                             result = k # cos table offset
1042                     else:
1043                         # COS table generated on-demand ("Vertical-First") mode
1044                         if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
1045                             if SVSHAPE.submode2 == 0b11: # iDCT
1046                                 result = ji[ri[jh]]        # lower half
1047                             else:
1048                                 result = ri[ji[jh]]        # lower half
1049                         elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
1050                             if SVSHAPE.submode2 == 0b11: # iDCT
1051                                 result = ji[ri[jh+size]] # upper half
1052                             else:
1053                                 result = ri[ji[jh+size]] # upper half
1054                         elif SVSHAPE.skip == 0b10: #
1055                             result = ci # coefficient helper
1056                         elif SVSHAPE.skip == 0b11: #
1057                             result = size # coefficient helper
1058                     loopends = (z_end |
1059                                ((y_end and z_end)<<1) |
1060                                 ((y_end and x_end and z_end)<<2))
1061
1062                     yield (result * stride) + SVSHAPE.offset, loopends
1063                     k += 1
1064
1065             # new k_start point for cos tables( runs inside x_r loop NOT i loop)
1066             k_start += halfsize
1067
1068 ```
1069
1070 ## REMAP selector
1071
1072 Selecting which REMAP Schedule to use is shown by the pseudocode below.
1073 Each SVSHAPE 0-3 goes through this selection process.
1074
1075 ```
1076   if SVSHAPEn.mode == 0b00:             iterate_fn = iterate_indices
1077   if SVSHAPEn.mode == 0b10:             iterate_fn = iterate_preduce_indices
1078   if SVSHAPEn.mode in [0b01, 0b11]:
1079     # further sub-selection
1080     if SVSHAPEn.ydimsz == 1:            iterate_fn = iterate_butterfly_indices
1081     if SVSHAPEn.ydimsz == 2:            iterate_fn = iterate_dct_inner_butterfly_indices
1082     if SVSHAPEn.ydimsz == 3:            iterate_fn = iterate_dct_outer_butterfly_indices
1083     if SVSHAPEn.ydimsz in [5, 13]:      iterate_fn = iterate_dct_inner_costable_indices
1084     if SVSHAPEn.ydimsz in [6, 14, 15]:  iterate_fn = iterate_dct_inner_halfswap_loadstore
1085 ```
1086
1087
1088 [[!tag opf_rfc]]