From: lkcl
Date: Sat, 8 Apr 2023 11:48:47 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls012_v1~76
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=5e6b329d40ad9da07a2e8d26952dffd039559794;p=libreriscv.git

---
diff --git a/openpower/sv/rfc/ls012.mdwn b/openpower/sv/rfc/ls012.mdwn
index cae948fb4..2f39e8dbb 100644
--- a/openpower/sv/rfc/ls012.mdwn
+++ b/openpower/sv/rfc/ls012.mdwn
@@ -170,7 +170,7 @@ Power ISA mutiply, divide, and shift operations at this late stage of maturity o
 the Power ISA is an entire area of research on its own deemed unlikely to be
 achievable.
 
-## fclass
+## fclass and GPR-FPR moves
 
 [[sv/fclass]] - just one instruction. With SFFS being locked down to exclude VSX,
 and there being no desire within the nascent OpenPOWER ecosystem outside of IBM to
@@ -178,6 +178,13 @@ implement the VSX PackedSIMD paradigm, it becomes necessary to upgrade SFFS
 such that it is stand-alone capable. One omission based on the assumption
 that VSX would always be present is an equivalent to `xvtstdcsp`.
 
+Similar arguments apply to the GPR-FPR move operations, with the opportunity taken
+to add rounding modes present in other ISAs that Power ISA VSX PackedSIMD does not
+have. JavaScript rounding, one of the worst offenders of Computer Science, requires
+a phenomenal 35 instructions with *six branches* to emulate in Power ISA! For
+desktop as well as Server HTML/JS back-end execution of JavaScript this becomes an
+obvious priority, recognised already by ARM as just one example.
+
 ## (f)mv.swizzle
 
 [[sv/mv.swizzle]] is dicey. It is a 2-in 2-out operation whose value as a Scalar
@@ -197,7 +204,38 @@ mv Swizzle operations, which can always be Macro-op fused in exactly the same
 way that ARM SVE predicated-move extends 3-operand "overwrite" opcodes to full
 independent 3-in 1-out.
 
-
+# BMI (bit manipulation) group.
+
+Whilst the [[sv/vector_ops]] instructions are only two in number, in reality the
+`bmask` instruction has a Mode field allowing it to cover **24** instructions,
+more than have been added to any other CPUs by ARM, Intel or AMD. Analysis of
+the BMI sets of these CPUs shows simple patterns that can greatly simplify both
+Decode and implementation. These are sufficiently commonly used, saving instruction
+count regularly, that they justify going into EXT0xx.
+
+The other instruction is `cprop` - Carry-Propagation - which takes the P and Q
+from carry-propagation algorithms and generates carry look-ahead. It greatly
+increases the efficiency of arbitrary-precision integer arithmetic by combining
+what would otherwise be half a dozen instructions into one. However, unlike `bmask`,
+it is still not a huge priority, so is probably best placed in EXT2xx.
+
+* Float-Load-Immediate
+
+Very easily justified. As explained in [[ls002]] these
+always save one LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
+FP value being in the I-Cache side. It is such a high priority that these instructions
+are easily justified for addition to EXT0xx, despite requiring a 16-bit immediate.
+By designing the second-half instruction as a Read-Modify-Write it saves on XO
+bitlength (only 5 bits), and can be macro-op fused with its first-half to store a
+full IEEE754 FP32 immediate into a register.
+
+There is little point in putting these instructions into EXT2xx. Their very benefit
+and inherent value *is* as 32-bit instructions, not 64-bit ones. Likewise there is
+less value in taking up EXT1xx Encoding space because EXT1xx only brings an additional
+16 bits (approx) to the table, and that is provided already by the second-half
+instruction.
+
+Thus they qualify as both high priority and also EXT0xx candidates.
 [[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
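Reviewer note on the `bmask` paragraph: the "simple patterns" it mentions can be illustrated with a few well-known x86 BMI1 and TBM operations, each of which combines `x` with `x+1` or `x-1` through a single AND, OR or XOR. The sketch below is only an illustration under that assumption. The function `bit_pattern` and its parameters are hypothetical names and are not the `bmask` Mode encoding, which this commit does not spell out.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def bit_pattern(x, op, step, invert_x=False, invert_y=False):
    """Hypothetical helper: combine x (optionally inverted) with
    x+step (optionally inverted) using one logical operation."""
    lhs = (~x if invert_x else x) & MASK64
    rhs = (x + step) & MASK64
    if invert_y:
        rhs = ~rhs & MASK64
    if op == "and":
        return lhs & rhs
    if op == "or":
        return lhs | rhs
    if op == "xor":
        return lhs ^ rhs
    raise ValueError(f"unknown op: {op}")


# A few familiar x86 BMI1 / TBM operations expressed through the helper.
def blsr(x):     # x & (x - 1): clear the lowest set bit
    return bit_pattern(x, "and", -1)

def blsi(x):     # x & ~(x - 1), i.e. x & -x: isolate the lowest set bit
    return bit_pattern(x, "and", -1, invert_y=True)

def blsmsk(x):   # x ^ (x - 1): mask up to and including the lowest set bit
    return bit_pattern(x, "xor", -1)

def blcfill(x):  # x & (x + 1): clear the trailing run of ones
    return bit_pattern(x, "and", +1)

def blcs(x):     # x | (x + 1): set the lowest clear bit
    return bit_pattern(x, "or", +1)

def tzmsk(x):    # ~x & (x - 1): mask of the trailing zeros
    return bit_pattern(x, "and", -1, invert_x=True)
```

This particular parameterisation has 2 x 2 x 2 x 3 = 24 combinations. Whether that corresponds to the 24 instructions covered by the actual Mode field is not confirmed by this diff.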
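Reviewer note on the `cprop` paragraph: one well-known way to turn propagate and generate bits into carry look-ahead with a single addition is sketched below, applied to multi-limb addition. It assumes the text's "Q" is the usual generate term (called `g` here), and that `p` and `g` never have the same bit set. It is not taken from the `cprop` specification, and the function names are hypothetical.

```python
def carry_in_mask(p, g):
    """Bit i of the result is the carry entering position i.
    Assumes p & g == 0 (a position either propagates or generates)."""
    return ((p | g) + g) ^ p


def carry_in_mask_reference(p, g, n):
    """Bit-serial reference: c[i+1] = g[i] | (p[i] & c[i]), with c[0] = 0."""
    carries, c = 0, 0
    for i in range(n):
        carries |= c << i
        c = ((g >> i) & 1) | (((p >> i) & 1) & c)
    return carries | (c << n)


def bigint_add(a_limbs, b_limbs, width=64):
    """Add two little-endian limb lists of equal length.
    Returns (sum_limbs, carry_out)."""
    mask = (1 << width) - 1
    sums = [(a + b) & mask for a, b in zip(a_limbs, b_limbs)]
    # g: the limb addition overflowed; p: the limb sum is all ones
    g = sum(1 << i for i, (a, b) in enumerate(zip(a_limbs, b_limbs))
            if a + b > mask)
    p = sum(1 << i for i, s in enumerate(sums) if s == mask)
    c = carry_in_mask(p, g)
    out = [(s + ((c >> i) & 1)) & mask for i, s in enumerate(sums)]
    return out, (c >> len(sums)) & 1
```

With limbs of width 64, adding `[2**64 - 1, 2**64 - 1, 0]` and `[1, 0, 0]` yields `([0, 0, 1], 0)`: the first limb generates a carry, the second propagates it, and the third absorbs it, all without a sequential ripple loop.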
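Reviewer note on the Float-Load-Immediate paragraphs: the two-instruction scheme can be modelled at the bit level as below. The sketch assumes that the first-half instruction supplies the upper 16 bits of the FP32 pattern and the Read-Modify-Write second half fills in the lower 16 bits. That ordering is not stated in this diff. The function names are hypothetical, and the model ignores how the value is actually held in an FPR.

```python
import math
import struct


def fp32_from_bits(bits):
    """Reinterpret a 32-bit pattern as an IEEE754 FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def load_upper_half(imm16):
    """Assumed first-half behaviour: the 16-bit immediate becomes the
    upper half of the FP32 pattern and the lower half is zeroed."""
    return (imm16 & 0xFFFF) << 16


def merge_lower_half(reg_bits, imm16):
    """Assumed second-half behaviour (Read-Modify-Write): keep the upper
    16 bits already in the register, replace only the lower 16 bits."""
    return (reg_bits & 0xFFFF0000) | (imm16 & 0xFFFF)


# First half alone gives a reduced-precision approximation of pi.
approx = load_upper_half(0x4049)
# The second half completes the full FP32 pattern for pi.
full = merge_lower_half(approx, 0x0FDB)
```

Here `fp32_from_bits(approx)` is 3.140625, and `fp32_from_bits(full)` equals `math.pi` rounded to FP32. Neither step touches the data cache, which is the saving the text describes.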