# RFC ls009 Simple-V REMAP Subsystem

**URLs**:

* <https://libre-soc.org/openpower/sv/rfc/ls009/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1042>
* <https://git.openpower.foundation/isa/PowerISA/issues/124>

**Severity**: Major

**Status**: New

**Date**: 26 Mar 2023. v2 TODO

**Target**: v3.2B

**Source**: v3.0B

**Books and Section affected**:

```
    Book I, new Zero-Overhead-Loop Chapter.
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic
```

**Summary**

```
    svremap  - Re-Mapping of Register Element Offsets
    svindex  - General-purpose setting of SHAPEs to be re-mapped
    svshape  - Hardware-level setting of SHAPEs for element re-mapping
    svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
    Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
    Management Instructions which provide advanced features such as Matrix
    FFT DCT Hardware-Assist Schedules and general-purpose Index reordering.
```

**Impact on software**:

```
    Requires support for new instructions in assembler, debuggers,
    and related tools.
```

**Keywords**:

```
    Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
    Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
    Digital Signal Processing (DSP)
```

**Motivation**

These REMAP Management instructions provide state-of-the-art advanced capabilities
to dramatically decrease instruction count and power reduction whilst retaining
unprecedented general-purpose capability and a standard Sequential Execution Model.

**Notes and Observations**:

1. Although costly the alternatives in SIMD-paradigm software result in
   huge algorithmic complexity and associated power consumption increases.
   Loop-unrolling compilers are prevalent as is thousands to tens of
   thousands of instructions.
2. Core inner kernels of Matrix DCT DFT FFT NTT are dramatically reduced
   to a handful of instructions.
3. No REMAP instructions with the exception of Indexed rely on registers
   for the establishment of REMAP capability.
4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
   and flexibility of REMAP.
5. The Simple-V Compliancy Subsets make REMAP optional except in the
   Advanced Levels.  Like v3.1 MMA it is **not** necessary for **all**
   hardware to implement REMAP.

**Changes**

Add the following entries to:

* the Appendices of SV Book
* Instructions of SV Book as a new Section
* SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2

----------------

\newpage{}

[[!inline pages="openpower/sv/remap" raw=yes ]]

# Forms

Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2

Add the following to Book I, 1.6.1, SVI-Form

```
    |0     |6    |11    |16   |21 |23  |24|25|26    31|
    | PO   |  SVG|rmm   | SVd |ew |SVyx|mm|sk|   XO   |
```

Add the following to Book I, 1.6.1, SVM-Form

```
    |0     |6        |11      |16    |21    |25 |26    |31  |
    | PO   |  SVxd   |   SVyd | SVzd | SVrm |vf |   XO      |
```

Add the following to Book I, 1.6.1, SVM2-Form

```
    |0     |6     |10  |11      |16    |21 |24|25 |26    |31  |
    | PO   | SVo  |SVyx|   rmm  | SVd  |XO |mm|sk |   XO      |
```

Add the following to Book I, 1.6.1, SVRM-Form

```
    |0     |6     |11  |13   |15   |17   |19   |21  |22   |26     |31 |
    | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |///  | XO        |
```

Add the following to Book I, 1.6.2

```
    mi0 (11:12)
        Field used in REMAP to select the SVSHAPE for 1st input register
        Formats: SVRM
    mi1 (13:14)
        Field used in REMAP to select the SVSHAPE for 2nd input register
        Formats: SVRM
    mi2 (15:16)
        Field used in REMAP to select the SVSHAPE for 3rd input register
        Formats: SVRM
    mm (24)
        Field used to specify the meaning of the rmm field for SVI-Form
        and SVM2-Form
        Formats: SVI, SVM2
    mo0 (17:18)
        Field used in REMAP to select the SVSHAPE for 1st output register
        Formats: SVRM
    mo1 (19:20)
        Field used in REMAP to select the SVSHAPE for 2nd output register
        Formats: SVRM
    pst (21)
        Field used in REMAP to indicate "persistence" mode (REMAP
        continues to apply to multiple instructions)
        Formats: SVRM
    rmm (11:15)
        REMAP Mode field for SVI-Form and SVM2-Form
        Formats: SVI, SVM2
    sk (25)
        Field used to specify dimensional skipping in svindex
        Formats: SVI, SVM2
    SVd (16:20)
        Immediate field used to specify the size of the REMAP dimension
        in the svindex and svshape2 instructions
        Formats: SVI, SVM2
    SVDS (16:29)
        Immediate field used to specify a 9-bit signed
        two's complement integer which is concatenated
        on the right with 0b00 and sign-extended to 64 bits.
        Formats: SVDS
    SVG (6:10)
        Field used to specify a GPR to be used as a
        source for indexing.
        Formats: SVI
    SVi (16:22)
         Simple-V immediate field for setting VL or MVL
         Formats: SVL
    SVme (6:10)
         Simple-V "REMAP" map-enable bits (0-4)
         Formats: SVRM
    SVo (6:9)
        Field used by the svshape2 instruction as an offset
        Formats: SVM2
    SVrm (21:24)
         Simple-V "REMAP" Mode
         Formats: SVM
    SVxd (6:10)
         Simple-V "REMAP" x-dimension size
         Formats: SVM
    SVyd (11:15)
         Simple-V "REMAP" y-dimension size
         Formats: SVM
    SVzd (16:20)
         Simple-V "REMAP" z-dimension size
         Formats: SVM
    XO (21:23,26:31)
        Extended opcode field.  Note that bit 21 must be 1, 22 and 23
        must be zero, and bits 26-31 must be exactly the same as
        used for svshape.
        Formats: SVM2
```

# Appendices

    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
| SVRM | I    | #    | 3.0B    | svremap    | REMAP enabling instruction |
| SVM  | I    | #    | 3.0B    | svshape  | REMAP shape instruction |
| SVM2 | I    | #    | 3.0B    | svshape2 | REMAP shape instruction (2) |
| SVI  | I    | #    | 3.0B    | svindex | REMAP General-purpose Indexing |

[[!inline pages="openpower/sv/remap/appendix" raw=yes ]]

## REMAP pseudocode

Written in python3 the following stand-alone executable source code is the Canonical
Specification for each REMAP. Vectors of "loopends" are returned when Rc=1
in Vectors of CR Fields on `sv.svstep.`, or in Vertical-First Mode
a single CR Field (CR0) on `svstep.`.  The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
offset is put through each algorithm to determine the actual Element Offset.
Alternative implementations producing different ordering
is prohibited as software will be critically relying on these Deterministic Schedules.

### REMAP 2D/3D Matrix

The following stand-alone executable source code is the Canonical
Specification for Matrix (2D/3D) REMAP. 
Hardware implementations are achievable with simple cascading counter-and-compares.

```
# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    yd = SVSHAPE.lims[1]
    zd = SVSHAPE.lims[2]
    # create lists of indices to iterate over in each dimension
    x_r = list(range(xd))
    y_r = list(range(yd))
    z_r = list(range(zd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: x_r.reverse()
    if SVSHAPE.invxyz[1]: y_r.reverse()
    if SVSHAPE.invxyz[2]: z_r.reverse()
    # start an infinite (wrapping) loop
    step = 0 # track src/dst step
    while True:
        for z in z_r:   # loop over 1st order dimension
            z_end = z == z_r[-1]
            for y in y_r:       # loop over 2nd order dimension
                y_end = y == y_r[-1]
                for x in x_r:           # loop over 3rd order dimension
                    x_end = x == x_r[-1]
                    # ok work out which order to construct things in.
                    # start by creating a list of tuples of the dimension
                    # and its limit
                    vals = [(SVSHAPE.lims[0], x, "x"),
                            (SVSHAPE.lims[1], y, "y"),
                            (SVSHAPE.lims[2], z, "z")
                           ]
                    # now select those by order.  this allows us to
                    # create schedules for [z][x], [x][y], or [y][z]
                    # for matrix multiply.
                    vals = [vals[SVSHAPE.order[0]],
                            vals[SVSHAPE.order[1]],
                            vals[SVSHAPE.order[2]]
                           ]
                    # ok now we can construct the result, using bits of
                    # "order" to say which ones get stacked on
                    result = 0
                    mult = 1
                    for i in range(3):
                        lim, idx, dbg = vals[i]
                        # some of the dimensions can be "skipped".  the order
                        # was actually selected above on all 3 dimensions,
                        # e.g. [z][x][y] or [y][z][x].  "skip" allows one of
                        # those to be knocked out
                        if SVSHAPE.skip == i+1: continue
                        idx *= mult   # shifts up by previous dimension(s)
                        result += idx # adds on this dimension
                        mult *= lim   # for the next dimension

                    loopends = (x_end |
                               ((y_end and x_end)<<1) |
                                ((y_end and x_end and z_end)<<2))

                    yield result + SVSHAPE.offset, loopends
                    step += 1

def demo():
    # set the dimension sizes here
    xdim = 3
    ydim = 2
    zdim = 4

    # set total (can repeat, e.g. VL=x*y*z*4)
    VL = xdim * ydim * zdim

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [1,0,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b00
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
        if idx >= VL:
            break
        print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])

# run the demo
if __name__ == '__main__':
    demo()
```

### REMAP Parallel Reduction pseudocode

The python3 program below is stand-alone executable and is the Canonical Specification
for Parallel Reduction REMAP.
The Algorithm below is not limited to RADIX2 sizes, and Predicate
sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
has been applied, not before.  MV operations are not required: the algorithm
tracks positions of elements that would normally be moved and when applying
an Element Reduction Operation sources the operands from their last-known (tracked)
position.

```
# a "yield" version of the Parallel Reduction REMAP algorithm. 
# the algorithm is in-place. it does not perform "MV" operations.
# instead, where a masked-out value *should* be read from is tracked

def iterate_indices(SVSHAPE, pred=None):
    # get indices to iterate over, in the required order
    xd = SVSHAPE.lims[0]
    # create lists of indices to iterate over in each dimension
    ix = list(range(xd))
    # invert the indices if needed
    if SVSHAPE.invxyz[0]: ix.reverse()
    # start a loop from the lowest step
    step = 1 
    steps = []
    while step < xd:
        step *= 2
        steps.append(step)
    # invert the indices if needed
    if SVSHAPE.invxyz[1]: steps.reverse()
    for step in steps:
        stepend = (step == steps[-1]) # note end of steps
        idxs = list(range(0, xd, step))
        results = []
        for i in idxs:
            other = i + step // 2
            ci = ix[i]
            oi = ix[other] if other < xd else None
            other_pred = other < xd and (pred is None or pred[oi])
            if (pred is None or pred[ci]) and other_pred:
                if SVSHAPE.skip == 0b00: # submode 00
                    result = ci
                elif SVSHAPE.skip == 0b01: # submode 01
                    result = oi
                results.append([result + SVSHAPE.offset, 0])
            elif other_pred:
                ix[i] = oi
        if results:
            results[-1][1] = (stepend<<1) | 1  # notify end of loops
        yield from results

def demo():
    # set the dimension sizes here
    xdim = 9

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, 0, 0]
    SVSHAPE0.order = [0,1,2]
    SVSHAPE0.mode = 0b10
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired

    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, 0, 0]
    SVSHAPE1.order = [0,1,2]
    SVSHAPE1.mode = 0b10
    SVSHAPE1.skip = 0b01
    SVSHAPE1.offset = 0       # experiment with different offset, here
    SVSHAPE1.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    shapes = list(iterate_indices(SVSHAPE0)), \
              list(iterate_indices(SVSHAPE1))
    for idx in range(len(shapes[0])):
        l = shapes[0][idx]
        r = shapes[1][idx]
        (l_idx, lend) = l
        (r_idx, rend) = r
        print ("%d->%d:%d" % (idx, l_idx, r_idx),
               "end", bin(lend)[2:], bin(rend)[2:])

# run the demo
if __name__ == '__main__':
    demo()
```

### REMAP FFT pseudocode

The FFT REMAP is RADIX2 only.

```
# a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
# original code for the FFT Tukey-Cooley schedule:
# https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
"""
    # Radix-2 decimation-in-time FFT (real, not complex)
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                jh = j+halfsize
                jl = j
                temp1 = vec[jh] * exptable[k]
                temp2 = vec[jl]
                vec[jh] = temp2 - temp1
                vec[jl] = temp2 + temp1
                k += tablestep
        size *= 2
"""

# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    stride = SVSHAPE.lims[2] # stride-multiplier on reg access
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]: x_r.reverse()

    if len(x_r) == 0:
        return

    # start an infinite (wrapping) loop
    skip = 0
    while True:
        for size in x_r:           # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            tablestep = n // size
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:       # loop over 2nd order dimension
                y_end = i == y_r[-1]
                k_r = []
                j_r = []
                k = 0
                for j in range(i, i+halfsize):
                    k_r.append(k)
                    j_r.append(j)
                    k += tablestep
                # invert if requested
                if SVSHAPE.invxyz[2]: k_r.reverse()
                if SVSHAPE.invxyz[2]: j_r.reverse()
                for j, k in zip(j_r, k_r):   # loop over 1st order dimension
                    z_end = j == j_r[-1]
                    # now depending on MODE return the index
                    if SVSHAPE.skip == 0b00:
                        result = j              # for vec[j]
                    elif SVSHAPE.skip == 0b01:
                        result = j + halfsize   # for vec[j+halfsize]
                    elif SVSHAPE.skip == 0b10:
                        result = k              # for exptable[k]

                    loopends = (z_end |
                               ((y_end and z_end)<<1) |
                                ((y_end and x_end and z_end)<<2))

                    yield (result * stride) + SVSHAPE.offset, loopends

def demo():
    # set the dimension sizes here
    xdim = 8
    ydim = 0 # not needed
    zdim = 1 # stride must be set to 1

    # set total. err don't know how to calculate how many there are...
    # do it manually for now
    VL = 0
    size = 2
    n = xdim
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            for j in range(i, i + halfsize):
                VL += 1
        size *= 2

    # set up an SVSHAPE
    class SVSHAPE:
        pass
    # j schedule
    SVSHAPE0 = SVSHAPE()
    SVSHAPE0.lims = [xdim, ydim, zdim]
    SVSHAPE0.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b01
    SVSHAPE0.skip = 0b00
    SVSHAPE0.offset = 0       # experiment with different offset, here
    SVSHAPE0.invxyz = [0,0,0] # inversion if desired
    # j+halfstep schedule
    SVSHAPE1 = SVSHAPE()
    SVSHAPE1.lims = [xdim, ydim, zdim]
    SVSHAPE1.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b01
    SVSHAPE1.skip = 0b01
    SVSHAPE1.offset = 0       # experiment with different offset, here
    SVSHAPE1.invxyz = [0,0,0] # inversion if desired
    # k schedule
    SVSHAPE2 = SVSHAPE()
    SVSHAPE2.lims = [xdim, ydim, zdim]
    SVSHAPE2.order = [0,1,2]  # experiment with different permutations, here
    SVSHAPE0.mode = 0b01
    SVSHAPE2.skip = 0b10
    SVSHAPE2.offset = 0       # experiment with different offset, here
    SVSHAPE2.invxyz = [0,0,0] # inversion if desired

    # enumerate over the iterator function, getting new indices
    schedule = []
    for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
                                          iterate_butterfly_indices(SVSHAPE1),
                                          iterate_butterfly_indices(SVSHAPE2))):
        if idx >= VL:
            break
        schedule.append((jl, jh, k))

    # ok now pretty-print the results, with some debug output
    size = 2
    idx = 0
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        print ("size %d halfsize %d tablestep %d" % \
                (size, halfsize, tablestep))
        for i in range(0, n, size):
            prefix = "i %d\t" % i
            k = 0
            for j in range(i, i + halfsize):
                (jl, je), (jh, he), (ks, ke) = schedule[idx]
                print ("  %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
                        "j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
                                (idx, prefix, j, j+halfsize, k,
                                      jl, jh, ks,
                                ),
                                "end", bin(je)[2:], bin(je)[2:], bin(ke)[2:])
                k += tablestep
                idx += 1
        size *= 2

# run the demo
if __name__ == '__main__':
    demo()
```

### DCT REMAP

DCT REMAP is RADIX2 only.  Convolutions may be applied as usual
to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
instructions the algorithm below (written in python3), becomes part
of an in-place in-registers Vectorised DCT.  The algorithms work
by loading data such that as the nested loops progress the result
is sorted into correct sequential order.

```
# DCT "REMAP" scheduler to create an in-place iterative DCT.
#

# bits of the integer 'val' of width 'width' are reversed
def reverse_bits(val, width):
    result = 0
    for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
    return result


# iterative version of [recursively-applied] half-reversing
# turns out this is Gray-Encoding.
def halfrev2(vec, pre_rev=True):
    res = []
    for i in range(len(vec)):
        if pre_rev:
            res.append(vec[i ^ (i>>1)])
        else:
            ri = i
            bl = i.bit_length()
            for ji in range(1, bl):
                ri ^= (i >> ji)
            res.append(vec[ri])
    return res


def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))

    levels = n.bit_length() - 1
    ri = [reverse_bits(i, levels) for i in range(n)]

    if SVSHAPE.mode == 0b01: # FFT, bitrev only
        ji = [ji[ri[i]] for i in range(n)]
    elif SVSHAPE.submode2 == 0b001:
        ji = [ji[ri[i]] for i in range(n)]
        ji = halfrev2(ji, True)
    else:
        ji = halfrev2(ji, False)
        ji = [ji[ri[i]] for i in range(n)]

    # invert order if requested
    if SVSHAPE.invxyz[0]:
        ji.reverse()

    for i, jl in enumerate(ji):
        y_end = jl == ji[-1]
        yield jl * stride, (0b111 if y_end else 0b000)

def iterate_dct_inner_costable_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    if len(x_r) == 0:
        return

    # start an infinite (wrapping) loop
    skip = 0
    z_end = 1 # doesn't exist in this, only 2 loops
    k = 0
    while True:
        for size in x_r:           # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            # two lists of half-range indices, e.g. j 0123, jr 7654
            j = list(range(0, halfsize))
            # invert if requested
            if SVSHAPE.invxyz[2]: j_r.reverse()
            # loop over 1st order dimension
            for ci, jl in enumerate(j):
                y_end = jl == j[-1]
                # now depending on MODE return the index.  inner butterfly
                if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                    result = k  # offset into COS table
                elif SVSHAPE.skip == 0b10: #
                    result = ci # coefficient helper
                elif SVSHAPE.skip == 0b11: #
                    result = size # coefficient helper
                loopends = (z_end |
                           ((y_end and z_end)<<1) |
                            ((y_end and x_end and z_end)<<2))

                yield (result * stride) + SVSHAPE.offset, loopends
                k += 1

def iterate_dct_inner_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = 2
    while size <= n:
        x_r.append(size)
        size *= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    if len(x_r) == 0:
        return

    # reference (read/write) the in-place data in *reverse-bit-order*
    ri = list(range(n))
    if SVSHAPE.submode2 == 0b01:
        levels = n.bit_length() - 1
        ri = [ri[reverse_bits(i, levels)] for i in range(n)]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))
    inplace_mode = True
    if inplace_mode and SVSHAPE.submode2 == 0b01:
        ji = halfrev2(ji, True)
    if inplace_mode and SVSHAPE.submode2 == 0b11:
        ji = halfrev2(ji, False)

    # start an infinite (wrapping) loop
    while True:
        k = 0
        k_start = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            x_end = size == x_r[-1]
            # y_r schedule depends on size
            halfsize = size // 2
            y_r = []
            for i in range(0, n, size):
                y_r.append(i)
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:       # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # two lists of half-range indices, e.g. j 0123, jr 7654
                j = list(range(i, i + halfsize))
                jr = list(range(i+halfsize, i + size))
                jr.reverse()
                # invert if requested
                if SVSHAPE.invxyz[2]:
                    j.reverse()
                    jr.reverse()
                hz2 = halfsize // 2 # zero stops reversing 1-item lists
                # loop over 1st order dimension
                k = k_start
                for ci, (jl, jh) in enumerate(zip(j, jr)):
                    z_end = jl == j[-1]
                    # now depending on MODE return the index.  inner butterfly
                    if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                        if SVSHAPE.submode2 == 0b11: # iDCT
                            result = ji[ri[jl]]        # lower half
                        else:
                            result = ri[ji[jl]]        # lower half
                    elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                        if SVSHAPE.submode2 == 0b11: # iDCT
                            result = ji[ri[jl+halfsize]] # upper half
                        else:
                            result = ri[ji[jh]] # upper half
                    elif mode == 4:
                        # COS table pre-generated mode
                        if SVSHAPE.skip == 0b10: #
                            result = k # cos table offset
                    else: # mode 2
                        # COS table generated on-demand ("Vertical-First") mode
                        if SVSHAPE.skip == 0b10: #
                            result = ci # coefficient helper
                        elif SVSHAPE.skip == 0b11: #
                            result = size # coefficient helper
                    loopends = (z_end |
                               ((y_end and z_end)<<1) |
                                ((y_end and x_end and z_end)<<2))

                    yield (result * stride) + SVSHAPE.offset, loopends
                    k += 1

                # now in-place swap
                if inplace_mode:
                    for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
                        jlh = jl+halfsize
                        tmp1, tmp2 = ji[jlh], ji[jh]
                        ji[jlh], ji[jh] = tmp2, tmp1

            # new k_start point for cos tables( runs inside x_r loop NOT i loop)
            k_start += halfsize


# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_dct_outer_butterfly_indices(SVSHAPE):
    # get indices to iterate over, in the required order
    n = SVSHAPE.lims[0]
    mode = SVSHAPE.lims[1]
    stride = SVSHAPE.lims[2]
    # creating lists of indices to iterate over in each dimension
    # has to be done dynamically, because it depends on the size
    # first, the size-based loop (which can be done statically)
    x_r = []
    size = n // 2
    while size >= 2:
        x_r.append(size)
        size //= 2
    # invert order if requested
    if SVSHAPE.invxyz[0]:
        x_r.reverse()

    if len(x_r) == 0:
        return

    # I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
    ri = list(range(n))
    if SVSHAPE.submode2 in [0b11, 0b01]:
        levels = n.bit_length() - 1
        ri = [ri[reverse_bits(i, levels)] for i in range(n)]

    # reference list for not needing to do data-swaps, just swap what
    # *indices* are referenced (two levels of indirection at the moment)
    # pre-reverse the data-swap list so that it *ends up* in the order 0123..
    ji = list(range(n))
    inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
    if SVSHAPE.submode2 == 0b11:
        ji = halfrev2(ji, False)

    # start an infinite (wrapping) loop
    while True:
        k = 0
        k_start = 0
        for size in x_r:           # loop over 3rd order dimension (size)
            halfsize = size//2
            x_end = size == x_r[-1]
            y_r = list(range(0, halfsize))
            # invert if requested
            if SVSHAPE.invxyz[1]: y_r.reverse()
            for i in y_r:       # loop over 2nd order dimension
                y_end = i == y_r[-1]
                # one list to create iterative-sum schedule
                jr = list(range(i+halfsize, i+n-halfsize, size))
                # invert if requested
                if SVSHAPE.invxyz[2]: jr.reverse()
                hz2 = halfsize // 2 # zero stops reversing 1-item lists
                k = k_start
                for ci, jh in enumerate(jr):   # loop over 1st order dimension
                    z_end = jh == jr[-1]
                    if mode == 4:
                        # COS table pre-generated mode
                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh]] # upper half
                            else:
                                result = ri[ji[jh]]        # lower half
                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh+size]] # upper half
                            else:
                                result = ri[ji[jh+size]] # upper half
                        elif SVSHAPE.skip == 0b10: #
                            result = k # cos table offset
                    else:
                        # COS table generated on-demand ("Vertical-First") mode
                        if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh]]        # lower half
                            else:
                                result = ri[ji[jh]]        # lower half
                        elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
                            if SVSHAPE.submode2 == 0b11: # iDCT
                                result = ji[ri[jh+size]] # upper half
                            else:
                                result = ri[ji[jh+size]] # upper half
                        elif SVSHAPE.skip == 0b10: #
                            result = ci # coefficient helper
                        elif SVSHAPE.skip == 0b11: #
                            result = size # coefficient helper
                    loopends = (z_end |
                               ((y_end and z_end)<<1) |
                                ((y_end and x_end and z_end)<<2))

                    yield (result * stride) + SVSHAPE.offset, loopends
                    k += 1

            # new k_start point for cos tables( runs inside x_r loop NOT i loop)
            k_start += halfsize

```

## REMAP selector

Selecting which REMAP Schedule to use is shown by the pseudocode below.
Each SVSHAPE 0-3 goes through this selection process.

```
  if SVSHAPEn.mode == 0b00:             iterate_fn = iterate_indices
  if SVSHAPEn.mode == 0b10:             iterate_fn = iterate_preduce_indices
  if SVSHAPEn.mode in [0b01, 0b11]:
    # further sub-selection
    if SVSHAPEn.ydimsz == 1:            iterate_fn = iterate_butterfly_indices
    if SVSHAPEn.ydimsz == 2:            iterate_fn = iterate_dct_inner_butterfly_indices
    if SVSHAPEn.ydimsz == 3:            iterate_fn = iterate_dct_outer_butterfly_indices
    if SVSHAPEn.ydimsz in [5, 13]:      iterate_fn = iterate_dct_inner_costable_indices
    if SVSHAPEn.ydimsz in [6, 14, 15]:  iterate_fn = iterate_dct_inner_halfswap_loadstore
```


[[!tag opf_rfc]]