From: lkcl
Date: Tue, 6 Sep 2022 07:15:33 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~662
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=5b0d27734236e39eaa553e710353aa5df31e59b7;p=libreriscv.git

---

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index 9dab85398..e4ad1c888 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -278,28 +278,29 @@ created. No special action is taken: the result and its CR Field
 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
 
 Note that the Schedule only makes sense on top of certain instructions:
-X-Form with a Register Profile of `RT,RA,RB` is fine. Like Scalar
+X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
+and the destination are all the same type. Like Scalar
 Reduction, nothing is prohibited:
 the results of execution on an unsuitable
 instruction may simply
-not make sense. Many 3-input instructions (madd, fmadd) unlike Scalar
-Reduction in particular do not make sense, but `ternlogi`, if used
-with care, would.
+not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
+may be used.
 
 Critical to note regarding use of Parallel-Reduction REMAP is that,
 exactly as with Matrix Mode, the `svshape` instruction *requests*
 a certain Vector Length (number of elements to reduce) and then
 sets VL and MAXVL at the number of **operations** needed to be
-carried out. Thus, equally as importantly, the total number of operations
+carried out. Thus, equally as importantly, like Matrix REMAP
+the total number of operations
 is restricted to 127. Any Parallel-Reduction requiring more operations
 will need to be done manually in batches.
 
 Also important to note is that the Deterministic Schedule is arranged
-so that some implementations *may* parallelise it, as long as doing so
-respects Program Order and Register Hazards. Performance (speed)
+so that some implementations *may* parallelise it (as long as doing so
+respects Program Order and Register Hazards). Performance (speed)
 of any given implementation is neither strictly defined or guaranteed.
 As with the Vulkan(tm) Specification, strict compliance is paramount whilst
-performance is left to Implementors.
+performance is at the discretion of Implementors.
 
 **Parallel-Reduction with Predication**
 
@@ -338,7 +339,7 @@ vector-to-scalar MV operation is needed.
 The simplest usage is to perform an overwrite, specifying all three
 register operands the same.
 
-    setvl VL=6
+    svshape parallelreduce, 6
     sv.add *8, *8, *8
 
 The Reduction Schedule will issue the Parallel Tree Reduction spanning
@@ -350,7 +351,7 @@ version, only those destination elements necessary for storing
 intermediary computations will be written to: the remaining elements
 will **not** be overwritten and will **not** be zero'd.
 
-    setvl VL=4
+    svshape parallelreduce, 6
     sv.add *0, *8, *8
 
 However it is critical to note that if the source and destination are
@@ -393,6 +394,37 @@ and Parallel Reduction is applied per row, then if `SVM` is **clear**
 the Matrix is transposed (like Pack/Unpack) before still applying
 the Parallel Reduction to the **row**.
 
+# Determining Register Hazards
+
+For high-performance (Multi-Issue, Out-of-Order) systems it is critical
+to be able to statically determine the extent of Vectors in order to
+allocate pre-emptive Hazard protection. The next task is to eliminate
+masked-out elements using predicate bits, freeing up the associated
+Hazards.
+
+For non-REMAP situations `VL` is sufficient to ascertain early
+Hazard coverage, and with SVSTATE being a high priority cached
+quantity at the same level of MSR and PC this is not a problem.
+
+The problems come when REMAP is enabled. Indexed REMAP must instead
+use `MAXVL` as the earliest (simplest)
+batch-level Hazard Reservation indicator,
+but Matrix, FFT and Parallel Reduction must all use completely different
+schemes. The reason is that VL is used to step through the total
+number of *operations*, not the number of registers. The "Saving Grace"
+is that all of the REMAP Schedules are Deterministic.
+
+Advance-notice Parallel computation and subsequent cacheing
+of all of these complex Deterministic REMAP Schedules is
+*strongly recommended*, thus allowing clear and precise multi-issue
+batched Hazard coverage to be deployed, *even for Indexed Mode*.
+This is only possible for Indexed due to the strict guidelines
+given to Programmers.
+
+In short, there exists solutions to the problem of Hazard Management,
+with varying degrees of refinement possible at correspondingly
+increasing levels of complexity in hardware.
+
 # REMAP area of SVSTATE
 
 The following bits of the SVSTATE SPR are used for REMAP: