From 20405cb26e7c2db880b2bb82e64177b066f06454 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sun, 26 Mar 2023 22:08:33 +0100
Subject: [PATCH]

---
 openpower/sv/rfc/ls009.mdwn | 156 ++++++++++++++++++++++++++++++------
 1 file changed, 133 insertions(+), 23 deletions(-)

diff --git a/openpower/sv/rfc/ls009.mdwn b/openpower/sv/rfc/ls009.mdwn
index e490f3ba1..490df0642 100644
--- a/openpower/sv/rfc/ls009.mdwn
+++ b/openpower/sv/rfc/ls009.mdwn
@@ -159,7 +159,7 @@ it is possible to service Precise Interrupts without affecting latency
 (a common limitation of Vector ISAs implementing explicit
 Parallel Reduction instructions).
 
-# Basic principle
+## Basic principle
 
 * normal vector element read/write of operands would be sequential
   (0 1 2 3 ....)
@@ -178,7 +178,7 @@ Only the most commonly-used algorithms in computer science have REMAP
 support, due to the high cost in both the ISA and in hardware.  For
 arbitrary remapping the `Indexed` REMAP may be used.
 
-# Example Usage
+## Example Usage
 
 * `svshape` to set the type of reordering to be applied to an
   otherwise usual `0..VL-1` hardware for-loop
@@ -217,14 +217,14 @@ need to perform additional Transpose or register copy instructions.
 The example above may be executed as a unit test and demo,
 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
      
-# REMAP types
+## REMAP types
 
 This section summarises the motivation for each REMAP Schedule
 and briefly goes over their characteristics and limitations.
 Further details on the Deterministic Precise-Interruptible algorithms
 used in these Schedules is found in the [[sv/remap/appendix]].
 
-## Matrix (1D/2D/3D shaping)
+### Matrix (1D/2D/3D shaping)
 
 Matrix Multiplication is a huge part of High-Performance Compute,
 and 3D.
@@ -261,7 +261,7 @@ Matrix REMAP capability. Rotation and mirroring need to be done by
 programming the SVSHAPE SPRs directly, which can take a lot more
 instructions.
 
-## FFT/DCT Triple Loop
+### FFT/DCT Triple Loop
 
 DCT and FFT are some of the most astonishingly used algorithms in
 Computer Science.  Radar, Audio, Video, R.F. Baseband and dozens more.  At least
@@ -284,7 +284,7 @@ in practice the RADIX2 limit is not a problem.  A Bluestein convolution
 to compute arbitrary length is demonstrated by
 [Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
 
-## Indexed
+### Indexed
 
 The purpose of Indexing is to provide a generalised version of
 Vector ISA "Permute" instructions, such as VSX `vperm`.  The
@@ -343,7 +343,7 @@ and RB contains the value of VL returned from `setvl`. The resultant
 CR Fields may then be used as Predicate Masks to exclude those operations
 with an Index exceeding VL-1.*
 
-## Parallel Reduction
+### Parallel Reduction
 
 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture.  Like Scalar reduction, the "Scalar Base"
 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
@@ -413,7 +413,7 @@ in the prior Parallel-Reduction instruction.
 In either case the result is in the element with the first bit set in
 the predicate mask.
 
-For *some* implementations
+Programmer's Note: For *some* hardware implementations
 the vector-to-scalar copy may be a slow operation, as may the Predicated
 Parallel Reduction itself.
 It may be better to perform a pre-copy
@@ -427,8 +427,10 @@ vector-to-scalar MV operation is needed.
 The simplest usage is to perform an overwrite, specifying all three
 register operands the same.
 
+```
     svshape parallelreduce, 6
     sv.add *8, *8, *8
+```
 
 The Reduction Schedule will issue the Parallel Tree Reduction spanning
 registers 8 through 13, by adjusting the offsets to RT, RA and RB as
@@ -439,14 +441,16 @@ version, only those destination elements necessary for storing
 intermediary computations will be written to: the remaining elements
 will **not** be overwritten and will **not** be zero'd.
 
+```
     svshape parallelreduce, 6
     sv.add *0, *8, *8
+```
 
 However it is critical to note that if the source and destination are
 not the same then the trick of using a follow-up vector-scalar MV will
 not work.
 
-## Sub-Vector Horizontal Reduction
+### Sub-Vector Horizontal Reduction
 
 Note that when SVM is clear and SUBVL!=1 a Parallel Reduction is performed
 on all first Subvector elements, followed by another separate independent
@@ -482,7 +486,7 @@ and Parallel Reduction is applied per row, then if `SVM` is **clear**
 the Matrix is transposed (like Pack/Unpack)
 before still applying the Parallel Reduction to the **row**.
 
-# Determining Register Hazards
+## Determining Register Hazards
 
 For high-performance (Multi-Issue, Out-of-Order) systems it is critical
 to be able to statically determine the extent of Vectors in order to
@@ -513,7 +517,7 @@ In short, there exists solutions to the problem of Hazard Management,
 with varying degrees of refinement possible at correspondingly
 increasing levels of complexity in hardware.
 
-# REMAP area of SVSTATE
+## REMAP area of SVSTATE
 
 The following bits of the SVSTATE SPR are used for REMAP:
 
@@ -565,6 +569,14 @@ Special Registers Altered:
 
     None
 
+`svremap` determines the relationship between registers and SVSHAPE SPRs.
+The bitmask `SVme` determines which registers have a REMAP applied, and mi0-mo1
+determine which shape is applied to an activated register.  The `pst` bit, if
+cleared, indicates that the REMAP operation shall only apply to the immediately-following
+instruction.  If set then REMAP remains permanently enabled until such time as it is
+explicitly disabled, either by `setvl` setting a new MAXVL, or with another
+`svremap` instruction.
+
 # SHAPE Remapping SPRs
 
 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
@@ -741,8 +753,9 @@ for i in 0..VL-1:
 
 Matrix-style reordering still applies to the indices, except limited
 to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or
-(Y,X). Only one dimension may optionally be skipped. Inversion of either
-X or Y or both is possible. Pseudocode for Indexed Mode (including elwidth
+(Y,X) for in-place Transposition.
+Only one dimension may optionally be skipped. Inversion of either
+X or Y or both is possible (2D mirroring). Pseudocode for Indexed Mode (including elwidth
 overrides) may be written in terms of Matrix Mode, specifically
 purposed to ensure that the 3rd dimension (Z) has no effect:
 
@@ -772,12 +785,6 @@ may have been costly to set up or costly to duplicate
 
 # svshape instruction  <a name="svshape"> </a>
 
-`svshape` is a convenience instruction that reduces instruction
-count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
-(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
-including VL and MAXVL. Using `svshape` therefore does not also
-require `setvl`.
-
 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
 
     svshape SVxd,SVyd,SVzd,SVRM,vf
@@ -786,6 +793,109 @@ Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
 | -- | --   | ---   | ----- | ------ | -- | ------| -------- |
 |OPCD| SVxd | SVyd  | SVzd  | SVRM   | vf | XO    | svshape  |
 
+        # set up FRB and FRS
+        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
+        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
+        mscale <- (0b0 || SVzd) + 1
+        if (SVrm = 0b1011) then
+            SVSHAPE0[30:31] <- 0b11      # iDCT mode
+            SVSHAPE0[18:20] <- 0b011     # iDCT Outer Butterfly sub-mode
+            SVSHAPE0[21:23] <- 0b101     # "inverse" on outer and inner loop
+        else
+            SVSHAPE0[30:31] <- 0b01      # DCT mode
+            SVSHAPE0[18:20] <- 0b100     # DCT Outer Butterfly sub-mode
+        SVSHAPE0[6:11] <- 0b000010       # DCT Butterfly mode
+        # copy
+        SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
+        SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
+        # for FRA and FRT
+        SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
+        # reset costable "striding" to 1
+        SVSHAPE2[12:17] <- 0b000000
+    # set schedule up for DCT COS table generation
+    if (SVrm = 0b0101) | (SVrm = 0b1101) then
+        # calculate O(N log2 N)
+        vlen[0:6] <- [0] * 7
+        itercount[0:6] <- (0b00 || SVxd) + 0b0000001
+        itercount[0:6] <- (0b0 || itercount[0:5])
+        n <- [0] * 3
+        do while n < 5
+           if SVxd[4-n] = 0 then
+               leave
+           n <- n + 1
+           vlen[0:6] <- vlen + itercount
+           itercount[0:6] <- (0b0 || itercount[0:5])
+        # set up template in SVSHAPE0, then copy to 1-3
+        # set up FRB and FRS
+        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
+        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
+        mscale <- (0b0 || SVzd) + 1
+        SVSHAPE0[30:31] <- 0b01          # DCT/FFT mode
+        SVSHAPE0[6:11] <- 0b000100       # DCT Inner Butterfly COS-gen mode
+        if (SVrm = 0b0101) then
+            SVSHAPE0[21:23] <- 0b001     # "inverse" on outer loop for DCT
+        # copy
+        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
+        SVSHAPE2[0:31] <- SVSHAPE0[0:31]
+        # for cos coefficient
+        SVSHAPE1[28:29] <- 0b10           # ci schedule
+        SVSHAPE2[28:29] <- 0b11           # size schedule
+    # set schedule up for iDCT / DCT inverse of half-swapped ordering
+    if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
+        vlen[0:6] <- (0b00 || SVxd) + 0b0000001
+        # set up template in SVSHAPE0
+        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
+        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
+        mscale <- (0b0 || SVzd) + 1
+        if (SVrm = 0b1110) then
+            SVSHAPE0[18:20] <- 0b001     # DCT opposite half-swap
+        if (SVrm = 0b1111) then
+            SVSHAPE0[30:31] <- 0b01          # FFT mode
+        else
+            SVSHAPE0[30:31] <- 0b11          # DCT mode
+        SVSHAPE0[6:11] <- 0b000101       # DCT "half-swap" mode
+    # set schedule up for parallel reduction
+    if (SVrm = 0b0111) then
+        # calculate the total number of operations (brute-force)
+        vlen[0:6] <- [0] * 7
+        itercount[0:6] <- (0b00 || SVxd) + 0b0000001
+        step[0:6] <- 0b0000001
+        i[0:6] <- 0b0000000
+        do while step <u itercount
+            newstep <- step[1:6] || 0b0
+            j[0:6] <- 0b0000000
+            do while (j+step <u itercount)
+                j <- j + newstep
+                i <- i + 1
+            step <- newstep
+        # VL in Parallel-Reduce is the number of operations
+        vlen[0:6] <- i
+        # set up template in SVSHAPE0, then copy to 1. only 2 needed
+        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
+        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
+        mscale <- (0b0 || SVzd) + 1
+        SVSHAPE0[30:31] <- 0b10          # parallel reduce submode
+        # copy
+        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
+        # set up right operand (left operand 28:29 is zero)
+        SVSHAPE1[28:29] <- 0b01           # right operand
+    # set VL, MVL and Vertical-First
+    m[0:12] <- vlen * mscale
+    maxvl[0:6] <- m[6:12]
+    SVSTATE[0:6] <- maxvl  # MAXVL
+    SVSTATE[7:13] <- vlen  # VL
+    SVSTATE[63] <- vf
+
+Special Registers Altered:
+
+    None
+
+`svshape` is a convenience instruction that reduces instruction
+count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
+(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
+including VL and MAXVL. Using `svshape` therefore does not also
+require `setvl`.
+
 Fields:
 
 * **SVxd** - SV REMAP "xdim"
@@ -793,11 +903,12 @@ Fields:
 * **SVzd** - SV REMAP "zdim"
 * **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)
 * **vf** - sets "Vertical-First" mode
-* **XO** - standard 6-bit XO field
 
 *Note: SVxd, SVyz and SVzd are all stored "off-by-one".  In the assembler
 mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
 
+There are 14 REMAP Modes (2 Modes are RESERVED for `svshape2`)
+
 | SVRM   | Remap Mode description |
 | --     | --              |
 | 0b0000 | Matrix 1/2/3D    |
@@ -818,9 +929,8 @@ mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
 | 0b1111 | FFT half-swap   |
 
 Examples showing how all of these Modes operate exists in the online
-[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD)
-and the full pseudocode setting up all SPRs
-is in the [[openpower/isa/simplev]] page.
+[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD).  Explaining
+these Modes in further detail is beyond the scope of this document.
 
 In Indexed Mode, there are only 5 bits available to specify the GPR
 to use, out of 128 GPRs (7 bit numbering).  Therefore, only the top
-- 
2.30.2