(no commit message)

author lkcl <lkcl@web>

Sun, 26 Mar 2023 21:08:33 +0000 (22:08 +0100)

committer IkiWiki <ikiwiki.info>

Sun, 26 Mar 2023 21:08:33 +0000 (22:08 +0100)
diff --git a/openpower/sv/rfc/ls009.mdwn b/openpower/sv/rfc/ls009.mdwn

index e490f3ba1b98a607ad95d8892151fb6da899c008..490df06425e0ceeb04034ea59e79d39b040c1cb2 100644 (file)
--- a/openpower/sv/rfc/ls009.mdwn
+++ b/openpower/sv/rfc/ls009.mdwn
@@ -159,7 +159,7 @@ it is possible to service Precise Interrupts without affecting latency
  (a common limitation of Vector ISAs implementing explicit
  Parallel Reduction instructions).
  
-# Basic principle
+## Basic principle
  
  * normal vector element read/write of operands would be sequential
    (0 1 2 3 ....)
@@ -178,7 +178,7 @@ Only the most commonly-used algorithms in computer science have REMAP
  support, due to the high cost in both the ISA and in hardware.  For
  arbitrary remapping the `Indexed` REMAP may be used.
  
-# Example Usage
+## Example Usage
  
  * `svshape` to set the type of reordering to be applied to an
    otherwise usual `0..VL-1` hardware for-loop
@@ -217,14 +217,14 @@ need to perform additional Transpose or register copy instructions.
  The example above may be executed as a unit test and demo,
  [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
       
-# REMAP types
+## REMAP types
  
  This section summarises the motivation for each REMAP Schedule
  and briefly goes over their characteristics and limitations.
  Further details on the Deterministic Precise-Interruptible algorithms
  used in these Schedules is found in the [[sv/remap/appendix]].
  
-## Matrix (1D/2D/3D shaping)
+### Matrix (1D/2D/3D shaping)
  
  Matrix Multiplication is a huge part of High-Performance Compute,
  and 3D.
@@ -261,7 +261,7 @@ Matrix REMAP capability. Rotation and mirroring need to be done by
  programming the SVSHAPE SPRs directly, which can take a lot more
  instructions.
  
-## FFT/DCT Triple Loop
+### FFT/DCT Triple Loop
  
  DCT and FFT are some of the most astonishingly used algorithms in
  Computer Science.  Radar, Audio, Video, R.F. Baseband and dozens more.  At least
@@ -284,7 +284,7 @@ in practice the RADIX2 limit is not a problem.  A Bluestein convolution
  to compute arbitrary length is demonstrated by
  [Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
  
-## Indexed
+### Indexed
  
  The purpose of Indexing is to provide a generalised version of
  Vector ISA "Permute" instructions, such as VSX `vperm`.  The
@@ -343,7 +343,7 @@ and RB contains the value of VL returned from `setvl`. The resultant
  CR Fields may then be used as Predicate Masks to exclude those operations
  with an Index exceeding VL-1.*
  
-## Parallel Reduction
+### Parallel Reduction
  
  Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture.  Like Scalar reduction, the "Scalar Base"
  (Power ISA v3.0B) operation is leveraged, unmodified, to give the
@@ -413,7 +413,7 @@ in the prior Parallel-Reduction instruction.
  In either case the result is in the element with the first bit set in
  the predicate mask.
  
-For *some* implementations
+Programmer's Note: For *some* hardware implementations
  the vector-to-scalar copy may be a slow operation, as may the Predicated
  Parallel Reduction itself.
  It may be better to perform a pre-copy
@@ -427,8 +427,10 @@ vector-to-scalar MV operation is needed.
  The simplest usage is to perform an overwrite, specifying all three
  register operands the same.
  
+```
      svshape parallelreduce, 6
      sv.add *8, *8, *8
+```
  
  The Reduction Schedule will issue the Parallel Tree Reduction spanning
  registers 8 through 13, by adjusting the offsets to RT, RA and RB as
@@ -439,14 +441,16 @@ version, only those destination elements necessary for storing
  intermediary computations will be written to: the remaining elements
  will **not** be overwritten and will **not** be zero'd.
  
+```
      svshape parallelreduce, 6
      sv.add *0, *8, *8
+```
  
  However it is critical to note that if the source and destination are
  not the same then the trick of using a follow-up vector-scalar MV will
  not work.
  
-## Sub-Vector Horizontal Reduction
+### Sub-Vector Horizontal Reduction
  
  Note that when SVM is clear and SUBVL!=1 a Parallel Reduction is performed
  on all first Subvector elements, followed by another separate independent
@@ -482,7 +486,7 @@ and Parallel Reduction is applied per row, then if `SVM` is **clear**
  the Matrix is transposed (like Pack/Unpack)
  before still applying the Parallel Reduction to the **row**.
  
-# Determining Register Hazards
+## Determining Register Hazards
  
  For high-performance (Multi-Issue, Out-of-Order) systems it is critical
  to be able to statically determine the extent of Vectors in order to
@@ -513,7 +517,7 @@ In short, there exists solutions to the problem of Hazard Management,
  with varying degrees of refinement possible at correspondingly
  increasing levels of complexity in hardware.
  
-# REMAP area of SVSTATE
+## REMAP area of SVSTATE
  
  The following bits of the SVSTATE SPR are used for REMAP:
  
@@ -565,6 +569,14 @@ Special Registers Altered:
  
      None
  
+`svremap` determines the relationship between registers and SVSHAPE SPRs.
+The bitmask `SVme` determines which registers have a REMAP applied, and mi0-mo1
+determine which shape is applied to an activated register.  The `pst` bit, if
+cleared, indicates that the REMAP operation shall only apply to the immediately-following
+instruction.  If set then REMAP remains permanently enabled until such time as it is
+explicitly disabled, either by `setvl` setting a new MAXVL, or with another
+`svremap` instruction.
+
  # SHAPE Remapping SPRs
  
  There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
@@ -741,8 +753,9 @@ for i in 0..VL-1:
  
  Matrix-style reordering still applies to the indices, except limited
  to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or
-(Y,X). Only one dimension may optionally be skipped. Inversion of either
-X or Y or both is possible. Pseudocode for Indexed Mode (including elwidth
+(Y,X) for in-place Transposition.
+Only one dimension may optionally be skipped. Inversion of either
+X or Y or both is possible (2D mirroring). Pseudocode for Indexed Mode (including elwidth
  overrides) may be written in terms of Matrix Mode, specifically
  purposed to ensure that the 3rd dimension (Z) has no effect:
  
@@ -772,12 +785,6 @@ may have been costly to set up or costly to duplicate
  
  # svshape instruction  <a name="svshape"> </a>
  
-`svshape` is a convenience instruction that reduces instruction
-count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
-(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
-including VL and MAXVL. Using `svshape` therefore does not also
-require `setvl`.
-
  Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
  
      svshape SVxd,SVyd,SVzd,SVRM,vf
@@ -786,6 +793,109 @@ Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
  | -- | --   | ---   | ----- | ------ | -- | ------| -------- |
  |OPCD| SVxd | SVyd  | SVzd  | SVRM   | vf | XO    | svshape  |
  
+        # set up FRB and FRS
+        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
+        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
+        mscale <- (0b0 || SVzd) + 1
+        if (SVrm = 0b1011) then
+            SVSHAPE0[30:31] <- 0b11      # iDCT mode
+            SVSHAPE0[18:20] <- 0b011     # iDCT Outer Butterfly sub-mode
+            SVSHAPE0[21:23] <- 0b101     # "inverse" on outer and inner loop
+        else
+            SVSHAPE0[30:31] <- 0b01      # DCT mode
+            SVSHAPE0[18:20] <- 0b100     # DCT Outer Butterfly sub-mode
+        SVSHAPE0[6:11] <- 0b000010       # DCT Butterfly mode
+        # copy
+        SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
+        SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
+        # for FRA and FRT
+        SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
+        # reset costable "striding" to 1
+        SVSHAPE2[12:17] <- 0b000000
+    # set schedule up for DCT COS table generation
+    if (SVrm = 0b0101) | (SVrm = 0b1101) then
+        # calculate O(N log2 N)
+        vlen[0:6] <- [0] * 7
+        itercount[0:6] <- (0b00 || SVxd) + 0b0000001
+        itercount[0:6] <- (0b0 || itercount[0:5])
+        n <- [0] * 3
+        do while n < 5
+           if SVxd[4-n] = 0 then
+               leave
+           n <- n + 1
+           vlen[0:6] <- vlen + itercount
+           itercount[0:6] <- (0b0 || itercount[0:5])
+        # set up template in SVSHAPE0, then copy to 1-3
+        # set up FRB and FRS
+        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
+        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
+        mscale <- (0b0 || SVzd) + 1
+        SVSHAPE0[30:31] <- 0b01          # DCT/FFT mode
+        SVSHAPE0[6:11] <- 0b000100       # DCT Inner Butterfly COS-gen mode
+        if (SVrm = 0b0101) then
+            SVSHAPE0[21:23] <- 0b001     # "inverse" on outer loop for DCT
+        # copy
+        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
+        SVSHAPE2[0:31] <- SVSHAPE0[0:31]
+        # for cos coefficient
+        SVSHAPE1[28:29] <- 0b10           # ci schedule
+        SVSHAPE2[28:29] <- 0b11           # size schedule
+    # set schedule up for iDCT / DCT inverse of half-swapped ordering
+    if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
+        vlen[0:6] <- (0b00 || SVxd) + 0b0000001
+        # set up template in SVSHAPE0
+        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
+        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
+        mscale <- (0b0 || SVzd) + 1
+        if (SVrm = 0b1110) then
+            SVSHAPE0[18:20] <- 0b001     # DCT opposite half-swap
+        if (SVrm = 0b1111) then
+            SVSHAPE0[30:31] <- 0b01          # FFT mode
+        else
+            SVSHAPE0[30:31] <- 0b11          # DCT mode
+        SVSHAPE0[6:11] <- 0b000101       # DCT "half-swap" mode
+    # set schedule up for parallel reduction
+    if (SVrm = 0b0111) then
+        # calculate the total number of operations (brute-force)
+        vlen[0:6] <- [0] * 7
+        itercount[0:6] <- (0b00 || SVxd) + 0b0000001
+        step[0:6] <- 0b0000001
+        i[0:6] <- 0b0000000
+        do while step <u itercount
+            newstep <- step[1:6] || 0b0
+            j[0:6] <- 0b0000000
+            do while (j+step <u itercount)
+                j <- j + newstep
+                i <- i + 1
+            step <- newstep
+        # VL in Parallel-Reduce is the number of operations
+        vlen[0:6] <- i
+        # set up template in SVSHAPE0, then copy to 1. only 2 needed
+        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
+        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
+        mscale <- (0b0 || SVzd) + 1
+        SVSHAPE0[30:31] <- 0b10          # parallel reduce submode
+        # copy
+        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
+        # set up right operand (left operand 28:29 is zero)
+        SVSHAPE1[28:29] <- 0b01           # right operand
+    # set VL, MVL and Vertical-First
+    m[0:12] <- vlen * mscale
+    maxvl[0:6] <- m[6:12]
+    SVSTATE[0:6] <- maxvl  # MAXVL
+    SVSTATE[7:13] <- vlen  # VL
+    SVSTATE[63] <- vf
+
+Special Registers Altered:
+
+    None
+
+`svshape` is a convenience instruction that reduces instruction
+count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
+(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
+including VL and MAXVL. Using `svshape` therefore does not also
+require `setvl`.
+
  Fields:
  
  * **SVxd** - SV REMAP "xdim"
@@ -793,11 +903,12 @@ Fields:
  * **SVzd** - SV REMAP "zdim"
  * **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)
  * **vf** - sets "Vertical-First" mode
-* **XO** - standard 6-bit XO field
  
  *Note: SVxd, SVyd and SVzd are all stored "off-by-one".  In the assembler
  mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
  
+There are 14 REMAP Modes (2 bits are RESERVED for `svshape2`)
+
  | SVRM   | Remap Mode description |
  | --     | --              |
  | 0b0000 | Matrix 1/2/3D    |
@@ -818,9 +929,8 @@ mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
  | 0b1111 | FFT half-swap   |
  
  Examples showing how all of these Modes operate exist in the online
-[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD)
-and the full pseudocode setting up all SPRs
-is in the [[openpower/isa/simplev]] page.
+[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD).  Explaining
+these Modes in further detail is beyond the scope of this document.
  
  In Indexed Mode, there are only 5 bits available to specify the GPR
  to use, out of 128 GPRs (7 bit numbering).  Therefore, only the top
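The parallel-reduction branch of the `svshape` pseudocode added in this commit (`SVrm = 0b0111`) sets VL by brute-force counting the operations in the tree schedule. The following is a minimal Python sketch of that counting loop only, not the reference implementation. The function name `parallel_reduce_vl` is hypothetical, and the sketch assumes `SVxd` is the 5-bit "off-by-one" field described in the Fields note, so that `itercount = SVxd + 1`.

```python
def parallel_reduce_vl(svxd: int) -> int:
    """Count the operations in the parallel tree-reduction schedule.

    Hypothetical helper mirroring the counting loop in the svshape
    pseudocode for SVrm = 0b0111.  `svxd` is assumed to be the raw
    5-bit field value, i.e. the element count minus one.
    """
    if not 0 <= svxd <= 31:
        raise ValueError("SVxd is a 5-bit field (0..31)")
    itercount = svxd + 1      # number of elements being reduced
    step = 1                  # distance between the two operands
    ops = 0                   # 'i' in the pseudocode
    while step < itercount:
        newstep = step << 1   # 'step[1:6] || 0b0' in the pseudocode
        j = 0
        # one operation per (j, j+step) pair that lies inside the vector
        while j + step < itercount:
            j += newstep
            ops += 1
        step = newstep
    return ops                # the pseudocode assigns this to vlen (VL)


if __name__ == "__main__":
    # 'svshape parallelreduce, 6' is stored off-by-one as SVxd = 5
    print(parallel_reduce_vl(5))  # prints 5
```

For six elements the loop counts five operations, which agrees with the general property that a tree reduction of N elements needs N-1 pairwise operations. The pseudocode then multiplies this count by `mscale` to obtain MAXVL; that step is omitted here.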