# RFC ls009 Simple-V REMAP Subsystem
**URLs**:
*
*
*
**Severity**: Major
**Status**: New
**Date**: 26 Mar 2023. v2 TODO
**Target**: v3.2B
**Source**: v3.0B
**Books and Section affected**:
```
Book I, new Zero-Overhead-Loop Chapter.
Appendix E Power ISA sorted by opcode
Appendix F Power ISA sorted by version
Appendix G Power ISA sorted by Compliancy Subset
Appendix H Power ISA sorted by mnemonic
```
**Summary**
```
svremap - Re-Mapping of Register Element Offsets
svindex - General-purpose setting of SHAPEs to be re-mapped
svshape - Hardware-level setting of SHAPEs for element re-mapping
svshape2 - Hardware-level setting of SHAPEs for element re-mapping (v2)
```
**Submitter**: Luke Leighton (Libre-SOC)
**Requester**: Libre-SOC
**Impact on processor**:
```
Addition of four new "Zero-Overhead-Loop-Control" DSP-style Vector-style
Management Instructions which provide advanced features such as Matrix
FFT DCT Hardware-Assist Schedules and general-purpose Index reordering.
```
**Impact on software**:
```
Requires support for new instructions in assembler, debuggers,
and related tools.
```
**Keywords**:
```
Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
Digital Signal Processing (DSP)
```
**Motivation**
These REMAP Management instructions provide state-of-the-art advanced capabilities
to dramatically decrease instruction count and power reduction whilst retaining
unprecedented general-purpose capability and a standard Sequential Execution Model.
**Notes and Observations**:
1. Although costly the alternatives in SIMD-paradigm software result in
huge algorithmic complexity and associated power consumption increases.
Loop-unrolling compilers are prevalent as is thousands to tens of
thousands of instructions.
2. Core inner kernels of Matrix DCT DFT FFT NTT are dramatically reduced
to a handful of instructions.
3. No REMAP instructions with the exception of Indexed rely on registers
for the establishment of REMAP capability.
4. Future EXT1xx variants and SVP64/VSX will dramatically expand the power
and flexibility of REMAP.
5. The Simple-V Compliancy Subsets make REMAP optional except in the
Advanced Levels. Like v3.1 MMA it is **not** necessary for **all**
hardware to implement REMAP.
**Changes**
Add the following entries to:
* the Appendices of SV Book
* Instructions of SV Book as a new Section
* SVI, SVM, SVM2, SVRM Form of Book I Section 1.6.1.6 and 1.6.2
----------------
\newpage{}
[[!inline pages="openpower/sv/remap" raw=yes ]]
# Forms
Add `SVI, SVM, SVM2, SVRM` to `XO (26:31)` Field in Book I, 1.6.2
Add the following to Book I, 1.6.1, SVI-Form
```
|0 |6 |11 |16 |21 |23 |24|25|26 31|
| PO | SVG|rmm | SVd |ew |SVyx|mm|sk| XO |
```
Add the following to Book I, 1.6.1, SVM-Form
```
|0 |6 |11 |16 |21 |25 |26 |31 |
| PO | SVxd | SVyd | SVzd | SVrm |vf | XO |
```
Add the following to Book I, 1.6.1, SVM2-Form
```
|0 |6 |10 |11 |16 |21 |24|25 |26 |31 |
| PO | SVo |SVyx| rmm | SVd |XO |mm|sk | XO |
```
Add the following to Book I, 1.6.1, SVRM-Form
```
|0 |6 |11 |13 |15 |17 |19 |21 |22 |26 |31 |
| PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 |pst |/// | XO |
```
Add the following to Book I, 1.6.2
```
mi0 (11:12)
Field used in REMAP to select the SVSHAPE for 1st input register
Formats: SVRM
mi1 (13:14)
Field used in REMAP to select the SVSHAPE for 2nd input register
Formats: SVRM
mi2 (15:16)
Field used in REMAP to select the SVSHAPE for 3rd input register
Formats: SVRM
mm (24)
Field used to specify the meaning of the rmm field for SVI-Form
and SVM2-Form
Formats: SVI, SVM2
mo0 (17:18)
Field used in REMAP to select the SVSHAPE for 1st output register
Formats: SVRM
mo1 (19:20)
Field used in REMAP to select the SVSHAPE for 2nd output register
Formats: SVRM
pst (21)
Field used in REMAP to indicate "persistence" mode (REMAP
continues to apply to multiple instructions)
Formats: SVRM
rmm (11:15)
REMAP Mode field for SVI-Form and SVM2-Form
Formats: SVI, SVM2
sk (25)
Field used to specify dimensional skipping in svindex
Formats: SVI, SVM2
SVd (16:20)
Immediate field used to specify the size of the REMAP dimension
in the svindex and svshape2 instructions
Formats: SVI, SVM2
SVDS (16:29)
Immediate field used to specify a 9-bit signed
two's complement integer which is concatenated
on the right with 0b00 and sign-extended to 64 bits.
Formats: SVDS
SVG (6:10)
Field used to specify a GPR to be used as a
source for indexing.
Formats: SVI
SVi (16:22)
Simple-V immediate field for setting VL or MVL
Formats: SVL
SVme (6:10)
Simple-V "REMAP" map-enable bits (0-4)
Formats: SVRM
SVo (6:9)
Field used by the svshape2 instruction as an offset
Formats: SVM2
SVrm (21:24)
Simple-V "REMAP" Mode
Formats: SVM
SVxd (6:10)
Simple-V "REMAP" x-dimension size
Formats: SVM
SVyd (11:15)
Simple-V "REMAP" y-dimension size
Formats: SVM
SVzd (16:20)
Simple-V "REMAP" z-dimension size
Formats: SVM
XO (21:23,26:31)
Extended opcode field. Note that bit 21 must be 1, 22 and 23
must be zero, and bits 26-31 must be exactly the same as
used for svshape.
Formats: SVM2
```
# Appendices
Appendix E Power ISA sorted by opcode
Appendix F Power ISA sorted by version
Appendix G Power ISA sorted by Compliancy Subset
Appendix H Power ISA sorted by mnemonic
| Form | Book | Page | Version | mnemonic | Description |
|------|------|------|---------|----------|-------------|
| SVRM | I | # | 3.0B | svremap | REMAP enabling instruction |
| SVM | I | # | 3.0B | svshape | REMAP shape instruction |
| SVM2 | I | # | 3.0B | svshape2 | REMAP shape instruction (2) |
| SVI | I | # | 3.0B | svindex | REMAP General-purpose Indexing |
[[!inline pages="openpower/sv/remap/appendix" raw=yes ]]
## REMAP pseudocode
Written in python3 the following stand-alone executable source code is the Canonical
Specification for each REMAP. Vectors of "loopends" are returned when Rc=1
in Vectors of CR Fields on `sv.svstep.`, or in Vertical-First Mode
a single CR Field (CR0) on `svstep.`. The `SVSTATE.srcstep` or `SVSTATE.dststep` sequential
offset is put through each algorithm to determine the actual Element Offset.
Alternative implementations producing different ordering
is prohibited as software will be critically relying on these Deterministic Schedules.
### REMAP 2D/3D Matrix
The following stand-alone executable source code is the Canonical
Specification for Matrix (2D/3D) REMAP.
Hardware implementations are achievable with simple cascading counter-and-compares.
```
# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_indices(SVSHAPE):
# get indices to iterate over, in the required order
xd = SVSHAPE.lims[0]
yd = SVSHAPE.lims[1]
zd = SVSHAPE.lims[2]
# create lists of indices to iterate over in each dimension
x_r = list(range(xd))
y_r = list(range(yd))
z_r = list(range(zd))
# invert the indices if needed
if SVSHAPE.invxyz[0]: x_r.reverse()
if SVSHAPE.invxyz[1]: y_r.reverse()
if SVSHAPE.invxyz[2]: z_r.reverse()
# start an infinite (wrapping) loop
step = 0 # track src/dst step
while True:
for z in z_r: # loop over 1st order dimension
z_end = z == z_r[-1]
for y in y_r: # loop over 2nd order dimension
y_end = y == y_r[-1]
for x in x_r: # loop over 3rd order dimension
x_end = x == x_r[-1]
# ok work out which order to construct things in.
# start by creating a list of tuples of the dimension
# and its limit
vals = [(SVSHAPE.lims[0], x, "x"),
(SVSHAPE.lims[1], y, "y"),
(SVSHAPE.lims[2], z, "z")
]
# now select those by order. this allows us to
# create schedules for [z][x], [x][y], or [y][z]
# for matrix multiply.
vals = [vals[SVSHAPE.order[0]],
vals[SVSHAPE.order[1]],
vals[SVSHAPE.order[2]]
]
# ok now we can construct the result, using bits of
# "order" to say which ones get stacked on
result = 0
mult = 1
for i in range(3):
lim, idx, dbg = vals[i]
# some of the dimensions can be "skipped". the order
# was actually selected above on all 3 dimensions,
# e.g. [z][x][y] or [y][z][x]. "skip" allows one of
# those to be knocked out
if SVSHAPE.skip == i+1: continue
idx *= mult # shifts up by previous dimension(s)
result += idx # adds on this dimension
mult *= lim # for the next dimension
loopends = (x_end |
((y_end and x_end)<<1) |
((y_end and x_end and z_end)<<2))
yield result + SVSHAPE.offset, loopends
step += 1
def demo():
# set the dimension sizes here
xdim = 3
ydim = 2
zdim = 4
# set total (can repeat, e.g. VL=x*y*z*4)
VL = xdim * ydim * zdim
# set up an SVSHAPE
class SVSHAPE:
pass
SVSHAPE0 = SVSHAPE()
SVSHAPE0.lims = [xdim, ydim, zdim]
SVSHAPE0.order = [1,0,2] # experiment with different permutations, here
SVSHAPE0.mode = 0b00
SVSHAPE0.skip = 0b00
SVSHAPE0.offset = 0 # experiment with different offset, here
SVSHAPE0.invxyz = [0,0,0] # inversion if desired
# enumerate over the iterator function, getting new indices
for idx, (new_idx, end) in enumerate(iterate_indices(SVSHAPE0)):
if idx >= VL:
break
print ("%d->%d" % (idx, new_idx), "end", bin(end)[2:])
# run the demo
if __name__ == '__main__':
demo()
```
### REMAP Parallel Reduction pseudocode
The python3 program below is stand-alone executable and is the Canonical Specification
for Parallel Reduction REMAP.
The Algorithm below is not limited to RADIX2 sizes, and Predicate
sources, unlike in Matrix REMAP, apply to the Element Indices **after** REMAP
has been applied, not before. MV operations are not required: the algorithm
tracks positions of elements that would normally be moved and when applying
an Element Reduction Operation sources the operands from their last-known (tracked)
position.
```
# a "yield" version of the Parallel Reduction REMAP algorithm.
# the algorithm is in-place. it does not perform "MV" operations.
# instead, where a masked-out value *should* be read from is tracked
def iterate_indices(SVSHAPE, pred=None):
# get indices to iterate over, in the required order
xd = SVSHAPE.lims[0]
# create lists of indices to iterate over in each dimension
ix = list(range(xd))
# invert the indices if needed
if SVSHAPE.invxyz[0]: ix.reverse()
# start a loop from the lowest step
step = 1
steps = []
while step < xd:
step *= 2
steps.append(step)
# invert the indices if needed
if SVSHAPE.invxyz[1]: steps.reverse()
for step in steps:
stepend = (step == steps[-1]) # note end of steps
idxs = list(range(0, xd, step))
results = []
for i in idxs:
other = i + step // 2
ci = ix[i]
oi = ix[other] if other < xd else None
other_pred = other < xd and (pred is None or pred[oi])
if (pred is None or pred[ci]) and other_pred:
if SVSHAPE.skip == 0b00: # submode 00
result = ci
elif SVSHAPE.skip == 0b01: # submode 01
result = oi
results.append([result + SVSHAPE.offset, 0])
elif other_pred:
ix[i] = oi
if results:
results[-1][1] = (stepend<<1) | 1 # notify end of loops
yield from results
def demo():
# set the dimension sizes here
xdim = 9
# set up an SVSHAPE
class SVSHAPE:
pass
SVSHAPE0 = SVSHAPE()
SVSHAPE0.lims = [xdim, 0, 0]
SVSHAPE0.order = [0,1,2]
SVSHAPE0.mode = 0b10
SVSHAPE0.skip = 0b00
SVSHAPE0.offset = 0 # experiment with different offset, here
SVSHAPE0.invxyz = [0,0,0] # inversion if desired
SVSHAPE1 = SVSHAPE()
SVSHAPE1.lims = [xdim, 0, 0]
SVSHAPE1.order = [0,1,2]
SVSHAPE1.mode = 0b10
SVSHAPE1.skip = 0b01
SVSHAPE1.offset = 0 # experiment with different offset, here
SVSHAPE1.invxyz = [0,0,0] # inversion if desired
# enumerate over the iterator function, getting new indices
shapes = list(iterate_indices(SVSHAPE0)), \
list(iterate_indices(SVSHAPE1))
for idx in range(len(shapes[0])):
l = shapes[0][idx]
r = shapes[1][idx]
(l_idx, lend) = l
(r_idx, rend) = r
print ("%d->%d:%d" % (idx, l_idx, r_idx),
"end", bin(lend)[2:], bin(rend)[2:])
# run the demo
if __name__ == '__main__':
demo()
```
### REMAP FFT pseudocode
The FFT REMAP is RADIX2 only.
```
# a "yield" version of the REMAP algorithm, for FFT Tukey-Cooley schedules
# original code for the FFT Tukey-Cooley schedule:
# https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py
"""
# Radix-2 decimation-in-time FFT (real, not complex)
size = 2
while size <= n:
halfsize = size // 2
tablestep = n // size
for i in range(0, n, size):
k = 0
for j in range(i, i + halfsize):
jh = j+halfsize
jl = j
temp1 = vec[jh] * exptable[k]
temp2 = vec[jl]
vec[jh] = temp2 - temp1
vec[jl] = temp2 + temp1
k += tablestep
size *= 2
"""
# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_butterfly_indices(SVSHAPE):
# get indices to iterate over, in the required order
n = SVSHAPE.lims[0]
stride = SVSHAPE.lims[2] # stride-multiplier on reg access
# creating lists of indices to iterate over in each dimension
# has to be done dynamically, because it depends on the size
# first, the size-based loop (which can be done statically)
x_r = []
size = 2
while size <= n:
x_r.append(size)
size *= 2
# invert order if requested
if SVSHAPE.invxyz[0]: x_r.reverse()
if len(x_r) == 0:
return
# start an infinite (wrapping) loop
skip = 0
while True:
for size in x_r: # loop over 3rd order dimension (size)
x_end = size == x_r[-1]
# y_r schedule depends on size
halfsize = size // 2
tablestep = n // size
y_r = []
for i in range(0, n, size):
y_r.append(i)
# invert if requested
if SVSHAPE.invxyz[1]: y_r.reverse()
for i in y_r: # loop over 2nd order dimension
y_end = i == y_r[-1]
k_r = []
j_r = []
k = 0
for j in range(i, i+halfsize):
k_r.append(k)
j_r.append(j)
k += tablestep
# invert if requested
if SVSHAPE.invxyz[2]: k_r.reverse()
if SVSHAPE.invxyz[2]: j_r.reverse()
for j, k in zip(j_r, k_r): # loop over 1st order dimension
z_end = j == j_r[-1]
# now depending on MODE return the index
if SVSHAPE.skip == 0b00:
result = j # for vec[j]
elif SVSHAPE.skip == 0b01:
result = j + halfsize # for vec[j+halfsize]
elif SVSHAPE.skip == 0b10:
result = k # for exptable[k]
loopends = (z_end |
((y_end and z_end)<<1) |
((y_end and x_end and z_end)<<2))
yield (result * stride) + SVSHAPE.offset, loopends
def demo():
# set the dimension sizes here
xdim = 8
ydim = 0 # not needed
zdim = 1 # stride must be set to 1
# set total. err don't know how to calculate how many there are...
# do it manually for now
VL = 0
size = 2
n = xdim
while size <= n:
halfsize = size // 2
tablestep = n // size
for i in range(0, n, size):
for j in range(i, i + halfsize):
VL += 1
size *= 2
# set up an SVSHAPE
class SVSHAPE:
pass
# j schedule
SVSHAPE0 = SVSHAPE()
SVSHAPE0.lims = [xdim, ydim, zdim]
SVSHAPE0.order = [0,1,2] # experiment with different permutations, here
SVSHAPE0.mode = 0b01
SVSHAPE0.skip = 0b00
SVSHAPE0.offset = 0 # experiment with different offset, here
SVSHAPE0.invxyz = [0,0,0] # inversion if desired
# j+halfstep schedule
SVSHAPE1 = SVSHAPE()
SVSHAPE1.lims = [xdim, ydim, zdim]
SVSHAPE1.order = [0,1,2] # experiment with different permutations, here
SVSHAPE0.mode = 0b01
SVSHAPE1.skip = 0b01
SVSHAPE1.offset = 0 # experiment with different offset, here
SVSHAPE1.invxyz = [0,0,0] # inversion if desired
# k schedule
SVSHAPE2 = SVSHAPE()
SVSHAPE2.lims = [xdim, ydim, zdim]
SVSHAPE2.order = [0,1,2] # experiment with different permutations, here
SVSHAPE0.mode = 0b01
SVSHAPE2.skip = 0b10
SVSHAPE2.offset = 0 # experiment with different offset, here
SVSHAPE2.invxyz = [0,0,0] # inversion if desired
# enumerate over the iterator function, getting new indices
schedule = []
for idx, (jl, jh, k) in enumerate(zip(iterate_butterfly_indices(SVSHAPE0),
iterate_butterfly_indices(SVSHAPE1),
iterate_butterfly_indices(SVSHAPE2))):
if idx >= VL:
break
schedule.append((jl, jh, k))
# ok now pretty-print the results, with some debug output
size = 2
idx = 0
while size <= n:
halfsize = size // 2
tablestep = n // size
print ("size %d halfsize %d tablestep %d" % \
(size, halfsize, tablestep))
for i in range(0, n, size):
prefix = "i %d\t" % i
k = 0
for j in range(i, i + halfsize):
(jl, je), (jh, he), (ks, ke) = schedule[idx]
print (" %-3d\t%s j=%-2d jh=%-2d k=%-2d -> "
"j[jl=%-2d] j[jh=%-2d] ex[k=%d]" % \
(idx, prefix, j, j+halfsize, k,
jl, jh, ks,
),
"end", bin(je)[2:], bin(je)[2:], bin(ke)[2:])
k += tablestep
idx += 1
size *= 2
# run the demo
if __name__ == '__main__':
demo()
```
### DCT REMAP
DCT REMAP is RADIX2 only. Convolutions may be applied as usual
to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
instructions the algorithm below (written in python3), becomes part
of an in-place in-registers Vectorised DCT. The algorithms work
by loading data such that as the nested loops progress the result
is sorted into correct sequential order.
```
# DCT "REMAP" scheduler to create an in-place iterative DCT.
#
# bits of the integer 'val' of width 'width' are reversed
def reverse_bits(val, width):
result = 0
for _ in range(width):
result = (result << 1) | (val & 1)
val >>= 1
return result
# iterative version of [recursively-applied] half-reversing
# turns out this is Gray-Encoding.
def halfrev2(vec, pre_rev=True):
res = []
for i in range(len(vec)):
if pre_rev:
res.append(vec[i ^ (i>>1)])
else:
ri = i
bl = i.bit_length()
for ji in range(1, bl):
ri ^= (i >> ji)
res.append(vec[ri])
return res
def iterate_dct_inner_halfswap_loadstore(SVSHAPE):
# get indices to iterate over, in the required order
n = SVSHAPE.lims[0]
mode = SVSHAPE.lims[1]
stride = SVSHAPE.lims[2]
# reference list for not needing to do data-swaps, just swap what
# *indices* are referenced (two levels of indirection at the moment)
# pre-reverse the data-swap list so that it *ends up* in the order 0123..
ji = list(range(n))
levels = n.bit_length() - 1
ri = [reverse_bits(i, levels) for i in range(n)]
if SVSHAPE.mode == 0b01: # FFT, bitrev only
ji = [ji[ri[i]] for i in range(n)]
elif SVSHAPE.submode2 == 0b001:
ji = [ji[ri[i]] for i in range(n)]
ji = halfrev2(ji, True)
else:
ji = halfrev2(ji, False)
ji = [ji[ri[i]] for i in range(n)]
# invert order if requested
if SVSHAPE.invxyz[0]:
ji.reverse()
for i, jl in enumerate(ji):
y_end = jl == ji[-1]
yield jl * stride, (0b111 if y_end else 0b000)
def iterate_dct_inner_costable_indices(SVSHAPE):
# get indices to iterate over, in the required order
n = SVSHAPE.lims[0]
mode = SVSHAPE.lims[1]
stride = SVSHAPE.lims[2]
# creating lists of indices to iterate over in each dimension
# has to be done dynamically, because it depends on the size
# first, the size-based loop (which can be done statically)
x_r = []
size = 2
while size <= n:
x_r.append(size)
size *= 2
# invert order if requested
if SVSHAPE.invxyz[0]:
x_r.reverse()
if len(x_r) == 0:
return
# start an infinite (wrapping) loop
skip = 0
z_end = 1 # doesn't exist in this, only 2 loops
k = 0
while True:
for size in x_r: # loop over 3rd order dimension (size)
x_end = size == x_r[-1]
# y_r schedule depends on size
halfsize = size // 2
y_r = []
for i in range(0, n, size):
y_r.append(i)
# invert if requested
if SVSHAPE.invxyz[1]: y_r.reverse()
# two lists of half-range indices, e.g. j 0123, jr 7654
j = list(range(0, halfsize))
# invert if requested
if SVSHAPE.invxyz[2]: j_r.reverse()
# loop over 1st order dimension
for ci, jl in enumerate(j):
y_end = jl == j[-1]
# now depending on MODE return the index. inner butterfly
if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
result = k # offset into COS table
elif SVSHAPE.skip == 0b10: #
result = ci # coefficient helper
elif SVSHAPE.skip == 0b11: #
result = size # coefficient helper
loopends = (z_end |
((y_end and z_end)<<1) |
((y_end and x_end and z_end)<<2))
yield (result * stride) + SVSHAPE.offset, loopends
k += 1
def iterate_dct_inner_butterfly_indices(SVSHAPE):
# get indices to iterate over, in the required order
n = SVSHAPE.lims[0]
mode = SVSHAPE.lims[1]
stride = SVSHAPE.lims[2]
# creating lists of indices to iterate over in each dimension
# has to be done dynamically, because it depends on the size
# first, the size-based loop (which can be done statically)
x_r = []
size = 2
while size <= n:
x_r.append(size)
size *= 2
# invert order if requested
if SVSHAPE.invxyz[0]:
x_r.reverse()
if len(x_r) == 0:
return
# reference (read/write) the in-place data in *reverse-bit-order*
ri = list(range(n))
if SVSHAPE.submode2 == 0b01:
levels = n.bit_length() - 1
ri = [ri[reverse_bits(i, levels)] for i in range(n)]
# reference list for not needing to do data-swaps, just swap what
# *indices* are referenced (two levels of indirection at the moment)
# pre-reverse the data-swap list so that it *ends up* in the order 0123..
ji = list(range(n))
inplace_mode = True
if inplace_mode and SVSHAPE.submode2 == 0b01:
ji = halfrev2(ji, True)
if inplace_mode and SVSHAPE.submode2 == 0b11:
ji = halfrev2(ji, False)
# start an infinite (wrapping) loop
while True:
k = 0
k_start = 0
for size in x_r: # loop over 3rd order dimension (size)
x_end = size == x_r[-1]
# y_r schedule depends on size
halfsize = size // 2
y_r = []
for i in range(0, n, size):
y_r.append(i)
# invert if requested
if SVSHAPE.invxyz[1]: y_r.reverse()
for i in y_r: # loop over 2nd order dimension
y_end = i == y_r[-1]
# two lists of half-range indices, e.g. j 0123, jr 7654
j = list(range(i, i + halfsize))
jr = list(range(i+halfsize, i + size))
jr.reverse()
# invert if requested
if SVSHAPE.invxyz[2]:
j.reverse()
jr.reverse()
hz2 = halfsize // 2 # zero stops reversing 1-item lists
# loop over 1st order dimension
k = k_start
for ci, (jl, jh) in enumerate(zip(j, jr)):
z_end = jl == j[-1]
# now depending on MODE return the index. inner butterfly
if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
if SVSHAPE.submode2 == 0b11: # iDCT
result = ji[ri[jl]] # lower half
else:
result = ri[ji[jl]] # lower half
elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
if SVSHAPE.submode2 == 0b11: # iDCT
result = ji[ri[jl+halfsize]] # upper half
else:
result = ri[ji[jh]] # upper half
elif mode == 4:
# COS table pre-generated mode
if SVSHAPE.skip == 0b10: #
result = k # cos table offset
else: # mode 2
# COS table generated on-demand ("Vertical-First") mode
if SVSHAPE.skip == 0b10: #
result = ci # coefficient helper
elif SVSHAPE.skip == 0b11: #
result = size # coefficient helper
loopends = (z_end |
((y_end and z_end)<<1) |
((y_end and x_end and z_end)<<2))
yield (result * stride) + SVSHAPE.offset, loopends
k += 1
# now in-place swap
if inplace_mode:
for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
jlh = jl+halfsize
tmp1, tmp2 = ji[jlh], ji[jh]
ji[jlh], ji[jh] = tmp2, tmp1
# new k_start point for cos tables( runs inside x_r loop NOT i loop)
k_start += halfsize
# python "yield" can be iterated. use this to make it clear how
# the indices are generated by using natural-looking nested loops
def iterate_dct_outer_butterfly_indices(SVSHAPE):
# get indices to iterate over, in the required order
n = SVSHAPE.lims[0]
mode = SVSHAPE.lims[1]
stride = SVSHAPE.lims[2]
# creating lists of indices to iterate over in each dimension
# has to be done dynamically, because it depends on the size
# first, the size-based loop (which can be done statically)
x_r = []
size = n // 2
while size >= 2:
x_r.append(size)
size //= 2
# invert order if requested
if SVSHAPE.invxyz[0]:
x_r.reverse()
if len(x_r) == 0:
return
# I-DCT, reference (read/write) the in-place data in *reverse-bit-order*
ri = list(range(n))
if SVSHAPE.submode2 in [0b11, 0b01]:
levels = n.bit_length() - 1
ri = [ri[reverse_bits(i, levels)] for i in range(n)]
# reference list for not needing to do data-swaps, just swap what
# *indices* are referenced (two levels of indirection at the moment)
# pre-reverse the data-swap list so that it *ends up* in the order 0123..
ji = list(range(n))
inplace_mode = False # need the space... SVSHAPE.skip in [0b10, 0b11]
if SVSHAPE.submode2 == 0b11:
ji = halfrev2(ji, False)
# start an infinite (wrapping) loop
while True:
k = 0
k_start = 0
for size in x_r: # loop over 3rd order dimension (size)
halfsize = size//2
x_end = size == x_r[-1]
y_r = list(range(0, halfsize))
# invert if requested
if SVSHAPE.invxyz[1]: y_r.reverse()
for i in y_r: # loop over 2nd order dimension
y_end = i == y_r[-1]
# one list to create iterative-sum schedule
jr = list(range(i+halfsize, i+n-halfsize, size))
# invert if requested
if SVSHAPE.invxyz[2]: jr.reverse()
hz2 = halfsize // 2 # zero stops reversing 1-item lists
k = k_start
for ci, jh in enumerate(jr): # loop over 1st order dimension
z_end = jh == jr[-1]
if mode == 4:
# COS table pre-generated mode
if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
if SVSHAPE.submode2 == 0b11: # iDCT
result = ji[ri[jh]] # upper half
else:
result = ri[ji[jh]] # lower half
elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
if SVSHAPE.submode2 == 0b11: # iDCT
result = ji[ri[jh+size]] # upper half
else:
result = ri[ji[jh+size]] # upper half
elif SVSHAPE.skip == 0b10: #
result = k # cos table offset
else:
# COS table generated on-demand ("Vertical-First") mode
if SVSHAPE.skip == 0b00: # in [0b00, 0b10]:
if SVSHAPE.submode2 == 0b11: # iDCT
result = ji[ri[jh]] # lower half
else:
result = ri[ji[jh]] # lower half
elif SVSHAPE.skip == 0b01: # in [0b01, 0b11]:
if SVSHAPE.submode2 == 0b11: # iDCT
result = ji[ri[jh+size]] # upper half
else:
result = ri[ji[jh+size]] # upper half
elif SVSHAPE.skip == 0b10: #
result = ci # coefficient helper
elif SVSHAPE.skip == 0b11: #
result = size # coefficient helper
loopends = (z_end |
((y_end and z_end)<<1) |
((y_end and x_end and z_end)<<2))
yield (result * stride) + SVSHAPE.offset, loopends
k += 1
# new k_start point for cos tables( runs inside x_r loop NOT i loop)
k_start += halfsize
```
## REMAP selector
Selecting which REMAP Schedule to use is shown by the pseudocode below.
Each SVSHAPE 0-3 goes through this selection process.
```
if SVSHAPEn.mode == 0b00: iterate_fn = iterate_indices
if SVSHAPEn.mode == 0b10: iterate_fn = iterate_preduce_indices
if SVSHAPEn.mode in [0b01, 0b11]:
# further sub-selection
if SVSHAPEn.ydimsz == 1: iterate_fn = iterate_butterfly_indices
if SVSHAPEn.ydimsz == 2: iterate_fn = iterate_dct_inner_butterfly_indices
if SVSHAPEn.ydimsz == 3: iterate_fn = iterate_dct_outer_butterfly_indices
if SVSHAPEn.ydimsz in [5, 13]: iterate_fn = iterate_dct_inner_costable_indices
if SVSHAPEn.ydimsz in [6, 14, 15]: iterate_fn = iterate_dct_inner_halfswap_loadstore
```
[[!tag opf_rfc]]