* <https://bugs.libre-soc.org/show_bug.cgi?id=143>
* <https://bugs.libre-soc.org/show_bug.cgi?id=867> add svindex
* <https://bugs.libre-soc.org/show_bug.cgi?id=885> svindex in simulator
-* see [[sv/propagation]] for a future way to apply
-REMAP.
+* see [[sv/propagation]] for a future way to apply REMAP.
+* see [[sv/remap/appendix]] for examples and usage
REMAP is an advanced form of Vector "Structure Packing" that
provides hardware-level support for commonly-used *nested* loop patterns.
Beyond these mappings it becomes necessary to write directly to
the SVSTATE SPRs manually.
-# REMAP Matrix pseudocode
-
-The algorithm below shows how REMAP works more clearly, and may be
-executed as a python program:
-
-```
-[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
-```
-
-An easier-to-read version (using python iterators) shows the loop nesting:
-
-```
-[[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]]
-```
-
-Each element index from the for-loop `0..VL-1`
-is run through the above algorithm to work out the **actual** element
-index, instead. Given that there are four possible SHAPE entries, up to
-four separate registers in any given operation may be simultaneously
-remapped:
-
- function op_add(rd, rs1, rs2) # add not VADD!
- ...
- ...
- for (i = 0; i < VL; i++)
- xSTATE.srcoffs = i # save context
- if (predval & 1<<i) # predication uses intregs
- ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
- ireg[rs2+remap3(irs2)];
- if (!int_vec[rd ].isvector) break;
- if (int_vec[rd ].isvector) { id += 1; }
- if (int_vec[rs1].isvector) { irs1 += 1; }
- if (int_vec[rs2].isvector) { irs2 += 1; }
-
-By changing remappings, 2D matrices may be transposed "in-place" for one
-operation, followed by setting a different permutation order without
-having to move the values in the registers to or from memory.
-
-Note that:
-
-* Over-running the register file clearly has to be detected and
- an illegal instruction exception thrown
-* When non-default elwidths are set, the exact same algorithm still
- applies (i.e. it offsets *polymorphic* elements *within* registers rather
- than entire registers).
-* If permute option 000 is utilised, the actual order of the
- reindexing does not change. However, modulo MVL still occurs
- which will result in repeated operations (use with caution).
-* If two or more dimensions are set to zero, the actual order does not change!
-* The above algorithm is pseudo-code **only**. Actual implementations
- will need to take into account the fact that the element for-looping
- must be **re-entrant**, due to the possibility of exceptions occurring.
- See SVSTATE SPR, which records the current element index.
- Continuing after return from an interrupt may introduce latency
- due to re-computation of the remapped offsets.
-* Twin-predicated operations require **two** separate and distinct
- element offsets. The above pseudo-code algorithm will be applied
- separately and independently to each, should each of the two
- operands be remapped. *This even includes unit-strided LD/ST*
- and other operations
- in that category, where in that case it will be the **offset** that is
- remapped.
-* Offset is especially useful, on its own, for accessing elements
- within the middle of a register. Without offsets, it is necessary
- to either use a predicated MV, skipping the first elements, or
- performing a LOAD/STORE cycle to memory.
- With offsets, the data does not have to be moved.
-* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
- less than MVL is **perfectly legal**, albeit very obscure. It permits
- entries to be regularly presented to operands **more than once**, thus
- allowing the same underlying registers to act as an accumulator of
- multiple vector or matrix operations, for example.
-* Note especially that Program Order **must** still be respected
- even when overlaps occur that read or write the same register
- elements *including polymorphic ones*
-
-Clearly here some considerable care needs to be taken as the remapping
-could hypothetically create arithmetic operations that target the
-exact same underlying registers, resulting in data corruption due to
-pipeline overlaps. Out-of-order / Superscalar micro-architectures with
-register-renaming will have an easier time dealing with this than
-DSP-style SIMD micro-architectures.
-
-# REMAP FFT pseudocode
-
-The algorithm below shows how REMAP works more clearly, and may be
-executed as a python program:
-
-```
-[[!inline pages="openpower/sv/remap_fft_yield.py" quick="yes" raw="yes" ]]
-```
-
-The executable code above is designed to show how a hardware
-implementation may generate Indices which are completely
-independent of the Execution of element-level operations,
-even for something as complex as a Triple-loop Tukey-Cooley
-Schedule. A comprehensive demo and test suite may be found
-[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD)
-
-# 4x4 Matrix to vec4 Multiply Example
-
-The following settings will allow a 4x4 matrix (starting at f8), expressed
-as a sequence of 16 numbers first by row then by column, to be multiplied
-by a vector of length 4 (starting at f0), using a single FMAC instruction.
-
-* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
-* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
-* VL=16, f4=vec, f0=vec, f8=vec
-* FMAC f4, f0, f8, f4
-
-The permutation on SHAPE0 will use f0 as a vec4 source. On the first
-four iterations through the hardware loop, the REMAPed index will not
-increment. On the second four, the index will increase by one. Likewise
-on each subsequent group of four.
-
-The permutation on SHAPE1 will increment f4 continuously cycling through
-f4-f7 every four iterations of the hardware loop.
-
-At the same time, VL will, because there is no SHAPE on f8, increment
-straight sequentially through the 16 values f8-f23 in the Matrix. The
-equivalent sequence thus is issued:
-
- fmac f4, f0, f8, f4
- fmac f5, f0, f9, f5
- fmac f6, f0, f10, f6
- fmac f7, f0, f11, f7
- fmac f4, f1, f12, f4
- fmac f5, f1, f13, f5
- fmac f6, f1, f14, f6
- fmac f7, f1, f15, f7
- fmac f4, f2, f16, f4
- fmac f5, f2, f17, f5
- fmac f6, f2, f18, f6
- fmac f7, f2, f19, f7
- fmac f4, f3, f20, f4
- fmac f5, f3, f21, f5
- fmac f6, f3, f22, f6
- fmac f7, f3, f23, f7
-
-The only other instruction required is to ensure that f4-f7 are
-initialised (usually to zero).
-
-It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
-the same technique applied to four independent vectors, can be done by
-setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
-and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
-it to apply four times to compute the four columns worth of vectors.
-
-# Warshall transitive closure algorithm
-
-TODO move to [[sv/remap/discussion]] page, copied from here
-http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
-
-with thanks to Hendrik.
-
-<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
-
-> Just a note: interpreting + as 'or', and * as 'and',
-> operating on Boolean matrices,
-> and having result, X, and Y be the exact same matrix,
-> updated while being used,
-> gives the traditional Warshall transitive-closure
-> algorithm, if the loops are nested exactly in thie order.
-
-this can be done with the ternary instruction which has
-an in-place triple boolean input:
-
- RT = RT | (RA & RB)
-
-and also has a CR Field variant of the same
-
-notes from conversations:
-
-> > for y in y_r:
-> > for x in x_r:
-> > for z in z_r:
-> > result[y][x] +=
-> > a[y][z] *
-> > b[z][x]
-
-> This nesting of loops works for matrix multiply, but not for transitive
-> closure.
-
-> > it can be done:
-> >
-> > for z in z_r:
-> > for y in y_r:
-> > for x in x_r:
-> > result[y][x] +=
-> > a[y][z] *
-> > b[z][x]
->
-> And this ordering of loops *does* work for transitive closure, when a,
-> b, and result are the very same matrix, updated while being used.
->
-> By the way, I believe there is a graph algorithm that does the
-> transitive closure thing, but instead of using boolean, "and", and "or",
-> they use real numbers, addition, and minimum. I think that one computes
-> shortest paths between vertices.
->
-> By the time the z'th iteration of the z loop begins, the algorithm has
-> already peocessed paths that go through vertices numbered < z, and it
-> adds paths that go through vertices numbered z.
->
-> For this to work, the outer loop has to be the one on the subscript that
-> bridges a and b (which in this case are teh same matrix, of course).
-
-# SUBVL Remap
-
-Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
-itself must be considered to be the "element". To perform REMAP
-on the elements of a vec2/3/4, either use Swizzle, or,
-due to the sub-elements themselves being contiguous, treat them as
-such and use Indexing, or add one
-extra dimension to Matrix REMAP, the inner dimension being the size
-of the Subvector (2, 3, or 4).
-
-Note that Swizzle on Sub-vectors may be applied on top of REMAP.
-Where this is appropriate is the Rijndael MixColumns
-stage:
-
-<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
-
-Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
-a 2D REMAP allows:
-
-* the column bytes (as a vec4) to be iterated over as an inner loop,
- progressing vertically (`a00 a10 a20 a30`)
-* the columns themselves to be iterated as an outer loop
-* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
-
-This entirely in-place without special 128-bit opcodes. Below is
-the pseudocode for [[!wikipedia Rijndael MixColumns]]
-
-```
-void gmix_column(unsigned char *r) {
- unsigned char a[4];
- unsigned char b[4];
- unsigned char c;
- unsigned char h;
- // no swizzle here but vec4 byte-level
- // elwidth overrides can be done though.
- for (c = 0; c < 4; c++) {
- a[c] = r[c];
- h = (unsigned char)((signed char)r[c] >> 7);
- b[c] = r[c] << 1;
- b[c] ^= 0x1B & h; /* Rijndael's Galois field */
- }
- // These may then each be 4x 8bit Swizzled
- // r0.vec4 = b.vec4
- // r0.vec4 ^= a.vec4.WXYZ
- // r0.vec4 ^= a.vec4.ZWXY
- // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
- r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
- r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
- r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
- r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
-}
-```
-
-The application of the swizzles allows the remapped vec4 a, b and r
-variables to perform four straight linear 32 bit XOR operations where a
-scalar processor would be required to perform 16 byte-level individual
-operations. Given wide enough SIMD backends in hardware these 3 bit
-XORs may be done as single-cycle operations across the entire 128 bit
-Rijndael Matrix.
-
-The other alternative is to simply perform the actual 4x4 GF(256) Matrix
-Multiply using the MDS Matrix.
-
# TODO
* investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
--- /dev/null
+[[!tag standards]]
+
+# REMAP Matrix pseudocode
+
+The algorithm below shows how REMAP works more clearly, and may be
+executed as a python program:
+
+```
+[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
+```
+
+An easier-to-read version (using python iterators) shows the loop nesting:
+
+```
+[[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]]
+```
+
+Each element index `i` in the for-loop `0..VL-1`
+is run through the above algorithm to compute the **actual**
+element index to be used. Given that there are four possible SHAPE
+entries, up to four separate registers in any given operation may
+be simultaneously remapped:
+
+ function op_add(rd, rs1, rs2) # add not VADD!
+ ...
+ ...
+ for (i = 0; i < VL; i++)
+ xSTATE.srcoffs = i # save context
+ if (predval & 1<<i) # predication uses intregs
+ ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
+ ireg[rs2+remap3(irs2)];
+ if (!int_vec[rd ].isvector) break;
+ if (int_vec[rd ].isvector) { id += 1; }
+ if (int_vec[rs1].isvector) { irs1 += 1; }
+ if (int_vec[rs2].isvector) { irs2 += 1; }
+
+By changing remappings, 2D matrices may be transposed "in-place" for one
+operation, followed by setting a different permutation order without
+having to move the values in the registers to or from memory.
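As a non-normative sketch (plain Python, not the reference remapyield.py), the transpose effect of a 2D y-x permute can be illustrated by generating the remapped indices directly:

```python
def remap_2d(xdim, ydim, transpose=False):
    # Simplified sketch of 2D SHAPE reindexing: the real REMAP
    # additionally supports 3 dimensions, inversion and modulo MVL.
    for y in range(ydim):
        for x in range(xdim):
            if transpose:
                yield x * ydim + y  # y-x permute: walk columns first
            else:
                yield y * xdim + x  # default x-y order: plain identity

# a 3x2 matrix stored row-major: the y-x permute reads it transposed,
# without moving any register contents
print(list(remap_2d(3, 2)))                  # [0, 1, 2, 3, 4, 5]
print(list(remap_2d(3, 2, transpose=True)))  # [0, 2, 4, 1, 3, 5]
```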
+
+Note that:
+
+* Over-running the register file clearly has to be detected and
+ an illegal instruction exception thrown
+* When non-default elwidths are set, the exact same algorithm still
+ applies (i.e. it offsets *polymorphic* elements *within* registers rather
+ than entire registers).
+* If permute option 000 is utilised, the actual order of the
+ reindexing does not change. However, modulo MVL still occurs
+ which will result in repeated operations (use with caution).
+* If two or more dimensions are set to zero, the actual order does not change!
+* The above algorithm is pseudo-code **only**. Actual implementations
+ will need to take into account the fact that the element for-looping
+ must be **re-entrant**, due to the possibility of exceptions occurring.
+ See SVSTATE SPR, which records the current element index.
+ Continuing after return from an interrupt may introduce latency
+ due to re-computation of the remapped offsets.
+* Twin-predicated operations require **two** separate and distinct
+ element offsets. The above pseudo-code algorithm will be applied
+ separately and independently to each, should each of the two
+ operands be remapped. *This even includes unit-strided LD/ST*
+ and other operations
+ in that category, where in that case it will be the **offset** that is
+ remapped.
+* Offset is especially useful, on its own, for accessing elements
+ within the middle of a register. Without offsets, it is necessary
+ to either use a predicated MV, skipping the first elements, or
+ performing a LOAD/STORE cycle to memory.
+ With offsets, the data does not have to be moved.
+* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
+ less than MVL is **perfectly legal**, albeit very obscure. It permits
+ entries to be regularly presented to operands **more than once**, thus
+ allowing the same underlying registers to act as an accumulator of
+ multiple vector or matrix operations, for example.
+* Note especially that Program Order **must** still be respected
+ even when overlaps occur that read or write the same register
+ elements *including polymorphic ones*
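The accumulator point above can be demonstrated with a short sketch (plain Python, illustrative only): when the SHAPE dimensions multiply to less than VL, the remapped index wraps and the same elements are presented again:

```python
def remap_wrap(xdim, ydim, vl):
    # Sketch: the index wraps modulo the shape size (xdim*ydim),
    # so with VL larger than the shape, elements repeat, letting
    # the same underlying registers accumulate multiple passes.
    size = xdim * ydim
    for i in range(vl):
        yield i % size

print(list(remap_wrap(2, 2, 8)))  # [0, 1, 2, 3, 0, 1, 2, 3]
```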
+
+Clearly here some considerable care needs to be taken as the remapping
+could hypothetically create arithmetic operations that target the
+exact same underlying registers, resulting in data corruption due to
+pipeline overlaps. Out-of-order / Superscalar micro-architectures with
+register-renaming will have an easier time dealing with this than
+DSP-style SIMD micro-architectures.
+
+# REMAP FFT pseudocode
+
+The algorithm below shows how FFT REMAP works, and may be
+executed as a python program:
+
+```
+[[!inline pages="openpower/sv/remap_fft_yield.py" quick="yes" raw="yes" ]]
+```
+
+The executable code above is designed to show how a hardware
+implementation may generate Indices which are completely
+independent of the Execution of element-level operations,
+even for something as complex as a Triple-loop Cooley-Tukey
+Schedule. A comprehensive demo and test suite may be found
+[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD).
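To illustrate the schedule-generation principle (a hypothetical simplified radix-2 sketch, not the actual remap_fft_yield.py code), the butterfly index triples can be produced entirely independently of any data:

```python
def fft_schedule(n):
    # Sketch: yield (lower, upper, twiddle-index) triples for an
    # in-place radix-2 Cooley-Tukey FFT of size n (a power of two).
    # The indices depend only on n, never on element values, which
    # is what allows hardware to generate them ahead of execution.
    half = 1
    while half < n:
        step = half * 2
        for start in range(0, n, step):
            for k in range(half):
                yield (start + k, start + k + half, k * (n // step))
        half = step

print(list(fft_schedule(4)))  # [(0, 1, 0), (2, 3, 0), (0, 2, 0), (1, 3, 1)]
```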
+
+# 4x4 Matrix to vec4 Multiply Example
+
+The following settings will allow a 4x4 matrix (starting at f8), expressed
+as a sequence of 16 numbers first by row then by column, to be multiplied
+by a vector of length 4 (starting at f0), using a single FMAC instruction.
+
+* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
+* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
+* VL=16, f4=vec, f0=vec, f8=vec
+* FMAC f4, f0, f8, f4
+
+The permutation on SHAPE0 will use f0 as a vec4 source. On the first
+four iterations through the hardware loop, the REMAPed index will not
+increment. On the second four, the index will increase by one. Likewise
+on each subsequent group of four.
+
+The permutation on SHAPE1 will increment the f4 index continuously,
+cycling through f4-f7 every four iterations of the hardware loop.
+
+At the same time, because there is no SHAPE applied to f8, the f8
+element index increments sequentially through the 16 values f8-f23
+in the Matrix. The equivalent sequence thus issued is:
+
+ fmac f4, f0, f8, f4
+ fmac f5, f0, f9, f5
+ fmac f6, f0, f10, f6
+ fmac f7, f0, f11, f7
+ fmac f4, f1, f12, f4
+ fmac f5, f1, f13, f5
+ fmac f6, f1, f14, f6
+ fmac f7, f1, f15, f7
+ fmac f4, f2, f16, f4
+ fmac f5, f2, f17, f5
+ fmac f6, f2, f18, f6
+ fmac f7, f2, f19, f7
+ fmac f4, f3, f20, f4
+ fmac f5, f3, f21, f5
+ fmac f6, f3, f22, f6
+ fmac f7, f3, f23, f7
+
+The only other instruction required is to ensure that f4-f7 are
+initialised (usually to zero).
+
+It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
+the same technique applied to four independent vectors, can be done by
+setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
+and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
+it to apply four times to compute the four columns worth of vectors.
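The expanded fmac sequence above can be checked with a short sketch (plain Python, modelling the register file as a simple list rather than real SVP64 semantics):

```python
# Sketch: model f0-f31 as a list and replay the sixteen fmac
# operations above, then compare against a plain matrix-vector
# product. Result element r accumulates f[j] * f[8 + 4*j + r].
f = [0.0] * 32
for i in range(4):
    f[i] = float(i + 1)          # vec4 operand in f0-f3
for i in range(16):
    f[8 + i] = float(i)          # 4x4 matrix in f8-f23
# f4-f7 are the accumulators, pre-initialised to zero

for i in range(16):              # the single hardware loop, VL=16
    j, r = i // 4, i % 4         # SHAPE0 holds the f0 index for 4 steps,
    f[4 + r] += f[j] * f[8 + i]  # SHAPE1 cycles f4-f7 on every step

expected = [sum(f[j] * f[8 + 4 * j + r] for j in range(4)) for r in range(4)]
print(f[4:8] == expected)  # True
```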
+
+# Warshall transitive closure algorithm
+
+TODO move to [[sv/remap/discussion]] page, copied from here
+http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
+
+with thanks to Hendrik.
+
+<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
+
+> Just a note: interpreting + as 'or', and * as 'and',
+> operating on Boolean matrices,
+> and having result, X, and Y be the exact same matrix,
+> updated while being used,
+> gives the traditional Warshall transitive-closure
+> algorithm, if the loops are nested exactly in this order.
+
+This can be done with the ternary instruction, which has
+an in-place triple Boolean input:
+
+    RT = RT | (RA & RB)
+
+and which also has a CR Field variant of the same operation.
+
+Notes from conversations:
+
+> > for y in y_r:
+> > for x in x_r:
+> > for z in z_r:
+> > result[y][x] +=
+> > a[y][z] *
+> > b[z][x]
+
+> This nesting of loops works for matrix multiply, but not for transitive
+> closure.
+
+> > it can be done:
+> >
+> > for z in z_r:
+> > for y in y_r:
+> > for x in x_r:
+> > result[y][x] +=
+> > a[y][z] *
+> > b[z][x]
+>
+> And this ordering of loops *does* work for transitive closure, when a,
+> b, and result are the very same matrix, updated while being used.
+>
+> By the way, I believe there is a graph algorithm that does the
+> transitive closure thing, but instead of using boolean, "and", and "or",
+> they use real numbers, addition, and minimum. I think that one computes
+> shortest paths between vertices.
+>
+> By the time the z'th iteration of the z loop begins, the algorithm has
+> already processed paths that go through vertices numbered < z, and it
+> adds paths that go through vertices numbered z.
+>
+> For this to work, the outer loop has to be the one on the subscript that
+> bridges a and b (which in this case are the same matrix, of course).
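The z-outermost nesting described in the notes can be written out as a sketch (plain Python; the in-place update corresponds to the ternary instruction above):

```python
def transitive_closure(adj):
    # In-place Boolean Floyd-Warshall: + is "or", * is "and", and
    # source and result are the very same matrix, updated in use.
    n = len(adj)
    for z in range(n):          # the outer loop MUST be the bridging index
        for y in range(n):
            for x in range(n):
                # corresponds to the ternary: RT = RT | (RA & RB)
                adj[y][x] = adj[y][x] | (adj[y][z] & adj[z][x])
    return adj

m = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
print(transitive_closure(m))  # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```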
+
+# SUBVL Remap
+
+Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
+itself must be considered to be the "element". To perform REMAP
+on the individual elements of a vec2/3/4 there are three options:
+use Swizzle; exploit the fact that the sub-elements are contiguous
+and use Indexing; or add one extra dimension to Matrix REMAP, the
+inner dimension being the size of the Subvector (2, 3, or 4).
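The extra-inner-dimension option can be sketched as follows (plain Python, illustrative only): the sub-elements of each vec2/3/4 stay contiguous while the outer dimensions are permuted:

```python
def remap_subvl(xdim, ydim, subvl, transpose=False):
    # Sketch: REMAP moves whole vec2/3/4 groups; an inner dimension
    # of size SUBVL keeps each group's sub-elements contiguous.
    for y in range(ydim):
        for x in range(xdim):
            elem = (x * ydim + y) if transpose else (y * xdim + x)
            for s in range(subvl):
                yield elem * subvl + s

# 2x2 matrix of vec2s: each pair stays together while the
# matrix positions are transposed
print(list(remap_subvl(2, 2, 2, transpose=True)))
# [0, 1, 4, 5, 2, 3, 6, 7]
```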
+
+Note that Swizzle on Sub-vectors may be applied on top of REMAP.
+One place where this is appropriate is the Rijndael MixColumns
+stage:
+
+<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
+
+Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`,
+a 2D REMAP allows:
+
+* the column bytes (as a vec4) to be iterated over as an inner loop,
+ progressing vertically (`a00 a10 a20 a30`)
+* the columns themselves to be iterated as an outer loop
+* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
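The column-walk order can be sketched as index generation (plain Python, illustrative; bytes stored row-major as `a00 a01 a02 a03 a10 ..`):

```python
# Sketch: with bytes stored row-major, a 2D remap that walks rows in
# the inner loop visits each column as a contiguous vec4:
# a00 a10 a20 a30, then a01 a11 a21 a31, and so on.
column_walk = [row * 4 + col for col in range(4) for row in range(4)]
print(column_walk)
# [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```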
+
+This is done entirely in-place, without special 128-bit opcodes. Below is
+the pseudocode for [[!wikipedia Rijndael MixColumns]]:
+
+```
+void gmix_column(unsigned char *r) {
+ unsigned char a[4];
+ unsigned char b[4];
+ unsigned char c;
+ unsigned char h;
+ // no swizzle here but vec4 byte-level
+ // elwidth overrides can be done though.
+ for (c = 0; c < 4; c++) {
+ a[c] = r[c];
+ h = (unsigned char)((signed char)r[c] >> 7);
+ b[c] = r[c] << 1;
+ b[c] ^= 0x1B & h; /* Rijndael's Galois field */
+ }
+ // These may then each be 4x 8bit Swizzled
+ // r0.vec4 = b.vec4
+ // r0.vec4 ^= a.vec4.WXYZ
+ // r0.vec4 ^= a.vec4.ZWXY
+ // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
+ r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
+ r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
+ r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
+ r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
+}
+```
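The same routine, transcribed to Python so it can be checked against the well-known MixColumns test vector (a hypothetical test harness, not part of any specification):

```python
def gmix_column(r):
    # Direct Python transcription of the C pseudocode above,
    # operating in-place on a list of four bytes in GF(2^8).
    a = list(r)
    b = []
    for c in range(4):
        h = 0xFF if (r[c] & 0x80) else 0x00          # replicate the top bit
        b.append(((r[c] << 1) & 0xFF) ^ (0x1B & h))  # Rijndael "xtime"
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1]
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2]
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0]
    return r

# standard MixColumns test vector
print([hex(v) for v in gmix_column([0xDB, 0x13, 0x53, 0x45])])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```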
+
+The application of the swizzles allows the remapped vec4 a, b and r
+variables to perform four straight linear 32-bit XOR operations where a
+scalar processor would be required to perform 16 individual byte-level
+operations. Given wide enough SIMD backends in hardware, these 32-bit
+XORs may be done as single-cycle operations across the entire 128-bit
+Rijndael Matrix.
+
+The other alternative is to simply perform the actual 4x4 GF(256) Matrix
+Multiply using the MDS Matrix.
+