From 20405cb26e7c2db880b2bb82e64177b066f06454 Mon Sep 17 00:00:00 2001 From: lkcl Date: Sun, 26 Mar 2023 22:08:33 +0100 Subject: [PATCH] --- openpower/sv/rfc/ls009.mdwn | 156 ++++++++++++++++++++++++++++++------ 1 file changed, 133 insertions(+), 23 deletions(-) diff --git a/openpower/sv/rfc/ls009.mdwn b/openpower/sv/rfc/ls009.mdwn index e490f3ba1..490df0642 100644 --- a/openpower/sv/rfc/ls009.mdwn +++ b/openpower/sv/rfc/ls009.mdwn @@ -159,7 +159,7 @@ it is possible to service Precise Interrupts without affecting latency (a common limitation of Vector ISAs implementing explicit Parallel Reduction instructions). -# Basic principle +## Basic principle * normal vector element read/write of operands would be sequential (0 1 2 3 ....) @@ -178,7 +178,7 @@ Only the most commonly-used algorithms in computer science have REMAP support, due to the high cost in both the ISA and in hardware. For arbitrary remapping the `Indexed` REMAP may be used. -# Example Usage +## Example Usage * `svshape` to set the type of reordering to be applied to an otherwise usual `0..VL-1` hardware for-loop @@ -217,14 +217,14 @@ need to perform additional Transpose or register copy instructions. The example above may be executed as a unit test and demo, [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94) -# REMAP types +## REMAP types This section summarises the motivation for each REMAP Schedule and briefly goes over their characteristics and limitations. Further details on the Deterministic Precise-Interruptible algorithms used in these Schedules is found in the [[sv/remap/appendix]]. -## Matrix (1D/2D/3D shaping) +### Matrix (1D/2D/3D shaping) Matrix Multiplication is a huge part of High-Performance Compute, and 3D. @@ -261,7 +261,7 @@ Matrix REMAP capability. 
Rotation and mirroring need to be done by programming the SVSHAPE SPRs directly, which can take a lot more instructions. -## FFT/DCT Triple Loop +### FFT/DCT Triple Loop DCT and FFT are some of the most astonishingly used algorithms in Computer Science. Radar, Audio, Video, R.F. Baseband and dozens more. At least @@ -284,7 +284,7 @@ in practice the RADIX2 limit is not a problem. A Bluestein convolution to compute arbitrary length is demonstrated by [Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py) -## Indexed +### Indexed The purpose of Indexing is to provide a generalised version of Vector ISA "Permute" instructions, such as VSX `vperm`. The @@ -343,7 +343,7 @@ and RB contains the value of VL returned from `setvl`. The resultant CR Fields may then be used as Predicate Masks to exclude those operations with an Index exceeding VL-1.* -## Parallel Reduction +### Parallel Reduction Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base" (Power ISA v3.0B) operation is leveraged, unmodified, to give the @@ -413,7 +413,7 @@ in the prior Parallel-Reduction instruction. In either case the result is in the element with the first bit set in the predicate mask. -For *some* implementations +Programmer's Note: For *some* hardware implementations the vector-to-scalar copy may be a slow operation, as may the Predicated Parallel Reduction itself. It may be better to perform a pre-copy @@ -427,8 +427,10 @@ vector-to-scalar MV operation is needed. The simplest usage is to perform an overwrite, specifying all three register operands the same. 
+``` svshape parallelreduce, 6 sv.add *8, *8, *8 +``` The Reduction Schedule will issue the Parallel Tree Reduction spanning registers 8 through 13, by adjusting the offsets to RT, RA and RB as @@ -439,14 +441,16 @@ version, only those destination elements necessary for storing intermediary computations will be written to: the remaining elements will **not** be overwritten and will **not** be zero'd. +``` svshape parallelreduce, 6 sv.add *0, *8, *8 +``` However it is critical to note that if the source and destination are not the same then the trick of using a follow-up vector-scalar MV will not work. -## Sub-Vector Horizontal Reduction +### Sub-Vector Horizontal Reduction Note that when SVM is clear and SUBVL!=1 a Parallel Reduction is performed on all first Subvector elements, followed by another separate independent @@ -482,7 +486,7 @@ and Parallel Reduction is applied per row, then if `SVM` is **clear** the Matrix is transposed (like Pack/Unpack) before still applying the Parallel Reduction to the **row**. -# Determining Register Hazards +## Determining Register Hazards For high-performance (Multi-Issue, Out-of-Order) systems it is critical to be able to statically determine the extent of Vectors in order to @@ -513,7 +517,7 @@ In short, there exists solutions to the problem of Hazard Management, with varying degrees of refinement possible at correspondingly increasing levels of complexity in hardware. -# REMAP area of SVSTATE +## REMAP area of SVSTATE The following bits of the SVSTATE SPR are used for REMAP: @@ -565,6 +569,14 @@ Special Registers Altered: None +`svremap` determines the relationship between registers and SVSHAPE SPRs. +The bitmask `SVme` determines which registers have a REMAP applied, and mi0-mo1 +determine which shape is applied to an activated register. The `pst` bit, if +cleared, indicates that the REMAP operation shall only apply to the immediately-following +instruction. 
If set then REMAP remains permanently enabled until such time as it is +explicitly disabled, either by `setvl` setting a new MAXVL, or with another +`svremap` instruction. + # SHAPE Remapping SPRs There are four "shape" SPRs, SHAPE0-3, 32-bits in each, @@ -741,8 +753,9 @@ for i in 0..VL-1: Matrix-style reordering still applies to the indices, except limited to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or -(Y,X). Only one dimension may optionally be skipped. Inversion of either -X or Y or both is possible. Pseudocode for Indexed Mode (including elwidth +(Y,X) for in-place Transposition. +Only one dimension may optionally be skipped. Inversion of either +X or Y or both is possible (2D mirroring). Pseudocode for Indexed Mode (including elwidth overrides) may be written in terms of Matrix Mode, specifically purposed to ensure that the 3rd dimension (Z) has no effect: @@ -772,12 +785,6 @@ may have been costly to set up or costly to duplicate # svshape instruction -`svshape` is a convenience instruction that reduces instruction -count for common usage patterns, particularly Matrix, DCT and FFT. It sets up -(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE -including VL and MAXVL. Using `svshape` therefore does not also -require `setvl`. 
- Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]]) svshape SVxd,SVyd,SVzd,SVRM,vf @@ -786,6 +793,109 @@ Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]]) | -- | -- | --- | ----- | ------ | -- | ------| -------- | |OPCD| SVxd | SVyd | SVzd | SVRM | vf | XO | svshape | + # set up FRB and FRS + SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim + SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT) + mscale <- (0b0 || SVzd) + 1 + if (SVrm = 0b1011) then + SVSHAPE0[30:31] <- 0b11 # iDCT mode + SVSHAPE0[18:20] <- 0b011 # iDCT Outer Butterfly sub-mode + SVSHAPE0[21:23] <- 0b101 # "inverse" on outer and inner loop + else + SVSHAPE0[30:31] <- 0b01 # DCT mode + SVSHAPE0[18:20] <- 0b100 # DCT Outer Butterfly sub-mode + SVSHAPE0[6:11] <- 0b000010 # DCT Butterfly mode + # copy + SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule + SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients + # for FRA and FRT + SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule + # reset costable "striding" to 1 + SVSHAPE2[12:17] <- 0b000000 + # set schedule up for DCT COS table generation + if (SVrm = 0b0101) | (SVrm = 0b1101) then + # calculate O(N log2 N) + vlen[0:6] <- [0] * 7 + itercount[0:6] <- (0b00 || SVxd) + 0b0000001 + itercount[0:6] <- (0b0 || itercount[0:5]) + n <- [0] * 3 + do while n < 5 + if SVxd[4-n] = 0 then + leave + n <- n + 1 + vlen[0:6] <- vlen + itercount + itercount[0:6] <- (0b0 || itercount[0:5]) + # set up template in SVSHAPE0, then copy to 1-3 + # set up FRB and FRS + SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim + SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT) + mscale <- (0b0 || SVzd) + 1 + SVSHAPE0[30:31] <- 0b01 # DCT/FFT mode + SVSHAPE0[6:11] <- 0b000100 # DCT Inner Butterfly COS-gen mode + if (SVrm = 0b0101) then + SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop for DCT + # copy + SVSHAPE1[0:31] <- SVSHAPE0[0:31] + SVSHAPE2[0:31] <- SVSHAPE0[0:31] + # for cos coefficient + SVSHAPE1[28:29] <- 0b10 # ci schedule + SVSHAPE2[28:29] <- 
0b11 # size schedule + # set schedule up for iDCT / DCT inverse of half-swapped ordering + if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then + vlen[0:6] <- (0b00 || SVxd) + 0b0000001 + # set up template in SVSHAPE0 + SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim + SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT) + mscale <- (0b0 || SVzd) + 1 + if (SVrm = 0b1110) then + SVSHAPE0[18:20] <- 0b001 # DCT opposite half-swap + if (SVrm = 0b1111) then + SVSHAPE0[30:31] <- 0b01 # FFT mode + else + SVSHAPE0[30:31] <- 0b11 # DCT mode + SVSHAPE0[6:11] <- 0b000101 # DCT "half-swap" mode + # set schedule up for parallel reduction + if (SVrm = 0b0111) then + # calculate the total number of operations (brute-force) + vlen[0:6] <- [0] * 7 + itercount[0:6] <- (0b00 || SVxd) + 0b0000001 + step[0:6] <- 0b0000001 + i[0:6] <- 0b0000000 + do while step