The purpose of this RFC is:
* to give a full list of upcoming Scalar opcodes developed by Libre-SOC
- (being cognisant that *all* of them are Vectorisable)
+ (being cognisant that *all* of them are Vectoriseable)
* to give OPF Members and non-Members alike the opportunity to comment and get
involved early in RFC submission
* to formally agree a priority order on an iterative basis with new versions
to get a clear advance picture of Opcode Allocation
*prior* to submission
-As this is a Formal ISA RFC the evaluation shall ultimatly define
+As this is a Formal ISA RFC the evaluation shall ultimately define
(in advance of the actual submission of the instructions themselves)
which instructions will be submitted over the next 1-18 months.
outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing
Power Scalar up-to-date to modern standards *and on its own merits*
is a reasonable goal, and the advantage of the reduced focus is that
-SFFS remains RISC-paradigm, and that lessons can be learned from other
+SFFS remains RISC-paradigm, and that lessons can be learned from other
ISAs from the intervening years. Good examples here include `bmask`.
SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
**Target areas**
Whilst entirely general-purpose there are some categories that these
-instructions are targetting: Bitmanipulation, Big-integer, cryptography,
+instructions are targeting: Bit-manipulation, Big-integer, cryptography,
Audio/Visual, High-Performance Compute, GPU workloads and DSP.
**Instruction count guide and approximate priority order**
max instruction).
The Management Instructions themselves are all Scalar Operations, so
-PO1-Prefixing is perfecly reasonable. SVP64 Management instructions of
+PO1-Prefixing is perfectly reasonable. SVP64 Management instructions of
which there are only 6 are all 5 or 6 bit XO, meaning that the opcode
space they take up in EXT0xx is not alarmingly high for their intrinsic
strategic value.
ISA up-to-date to IEEE754-2019. Fortunately the number of critical
instructions is quite low, but the caveat is that if those operations
are utilised to synthesise other IEEE754 operations (divide by `pi` for
-example) full bitlevel accuracy (a hard requirement for IEEE754) is lost.
+example) full bit-level accuracy (a hard requirement for IEEE754) is lost.
Also worth noting that the Khronos Group defines minimum acceptable
bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
Found at [[sv/av_opcodes]] these do not require Saturated variants
because Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
[[sv/svp64_single]] Scalar Prefixing. This is important to note for
-Opcode Allocation because placing these operations in the UnVectoriseble
-areas would irrediemably damage their value. Unlike PackedSIMD ISAs
+Opcode Allocation because placing these operations in the UnVectoriseable
+areas would irredeemably damage their value. Unlike PackedSIMD ISAs
the actual number of AV Opcodes is remarkably small once the usual
cascading-option-multipliers (SIMD width, bitwidth, saturation,
HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
It should be pretty clear this is high priority.
-With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
+With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
operations, typically performing for example one multiply but in-place
subtracting that product from one operand and adding it to the other.
An additional need is to do popcount on CR Field bit vectors but adding
such instructions to the *Condition Register* side was deemed to be far
too much. Therefore, priority was given instead to transferring several
-CR Field bits into GPRs, whereupon the full set of tandard Scalar GPR
+CR Field bits into GPRs, whereupon the full set of Standard Scalar GPR
Logical Operations may be used. This strategy has the side-effect of
keeping the CRweird group down to only five instructions.
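The strategy above can be illustrated with a minimal Python sketch. It is not the CRweird instruction semantics: `gather_cr_bit` is a hypothetical helper that models "transfer one bit of several CR Fields into a GPR", and the list of 4-bit values stands in for CR Fields. Only the LT/GT/EQ/SO bit ordering is taken from the Power ISA.

```python
# CR Field bit values, most significant first (Power ISA ordering: LT, GT, EQ, SO).
LT, GT, EQ, SO = 8, 4, 2, 1

def gather_cr_bit(cr_fields, bit):
    """Hypothetical helper: pack one chosen bit of each 4-bit CR Field
    into an integer standing in for a GPR (Field i lands in bit i)."""
    gpr = 0
    for i, field in enumerate(cr_fields):
        if field & bit:
            gpr |= 1 << i
    return gpr

def popcount(gpr):
    """Ordinary Scalar GPR popcount, applied once the bits are in a GPR."""
    return bin(gpr).count("1")

# Four illustrative CR Fields; three of them have EQ set.
fields = [0b0010, 0b1000, 0b0010, 0b0011]
eq_mask = gather_cr_bit(fields, EQ)
eq_count = popcount(eq_mask)
```

Once the bits are in a GPR, any existing Scalar logical operation (and, or, popcount, count-leading-zeros) applies without needing a CR-side equivalent.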
## Big-integer Math
-[[sv/biginteger]] has always been a high priority area for commercial
+[[sv/biginteger]] has always been a high priority area for commercial
applications, privacy, Banking, as well as HPC Numerical Accuracy:
libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
and ec25519 are finding their way into everyday use via OpenSSL.
64-bit shifters.
The reduction in instruction count these operations bring, in critical
-hotloops, is remarkably high, to the extent where a Scalar-to-Vector
+hot loops, is remarkably high, to the extent where a Scalar-to-Vector
operation of *arbitrary length* becomes just the one Vector-Prefixed
instruction.
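As a rough illustration of the kind of hot loop being referred to, the sketch below multiplies a multi-limb integer by a 64-bit scalar, passing a 64-bit carry from limb to limb. This is plain Python written for this note, not the semantics of any Libre-SOC instruction; the function names and the little-endian limb layout are assumptions.

```python
MASK64 = (1 << 64) - 1

def mul_limbs_by_scalar(limbs, scalar):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar.

    Each step forms a 128-bit product plus the incoming 64-bit carry,
    keeps the low 64 bits and passes the high 64 bits to the next limb.
    """
    carry = 0
    out = []
    for limb in limbs:
        wide = limb * scalar + carry   # always fits in 128 bits
        out.append(wide & MASK64)
        carry = wide >> 64
    return out, carry

def limbs_to_int(limbs):
    """Reassemble a little-endian limb list into a Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

a = [MASK64, MASK64]                   # 2**128 - 1
product, carry_out = mul_limbs_by_scalar(a, MASK64)
```

Each iteration of the loop body is one multiply-with-carry step; the claim in the text is that the whole loop, for any number of limbs, can collapse to a single Vector-Prefixed instruction.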
Whilst these are 5-6 bit XO their utility is considered to be of high
strategic value and as such they are strongly advocated to be in EXT04. The alternative
is to bring back a 64-bit Carry SPR but how it is retrospectively
-applicable to pre-existing Scalar Power ISA mutiply, divide, and shift
+applicable to pre-existing Scalar Power ISA multiply, divide, and shift
operations at this late stage of maturity of the Power ISA is an entire
area of research on its own deemed unlikely to be achievable.
[[ls006]], with the opportunity taken to add rounding modes present
in other ISAs that Power ISA VSX PackedSIMD does not have. JavaScript
rounding, one of the worst offenders of Computer Science, requires a
-phenomental 35 instructions with *six branches* to emulate in Power
+phenomenal 35 instructions with *six branches* to emulate in Power
ISA! For desktop as well as Server HTML/JS back-end execution of
JavaScript this becomes an obvious priority, recognised already by ARM
as just one example.
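For reference, the sketch below shows what a JavaScript-style float-to-integer conversion has to do, assuming the conversion meant here is the ECMAScript `ToInt32` operation (the text does not name it). `js_to_int32` is a name chosen for this note. The special cases for NaN and infinity, plus the modulo-2^32 wrap and sign fix-up, are what become branches when emulated with ordinary instructions.

```python
import math

def js_to_int32(x):
    """Sketch of ECMAScript ToInt32 for a Python float."""
    if math.isnan(x) or math.isinf(x):
        return 0                       # NaN and +/-Infinity map to 0
    n = math.trunc(x) % (1 << 32)      # truncate toward zero, wrap modulo 2**32
    if n >= (1 << 31):
        n -= (1 << 32)                 # reinterpret as signed 32-bit
    return n
```

Note that out-of-range values wrap rather than saturate, which is what distinguishes this from the usual IEEE754-style conversion.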
taken to work on *all* bits of a CR Field rather than just one bit as
is done with the existing CR operations crand, cror etc.
-The other high strategic value instruction is `grevlut` (and `grevluti`
+The other high strategic value instruction is `grevlut` (and `grevluti`
which can generate a remarkably large number of regular-patterned magic
constants). The grevlut set require of the order of 20,000 gates but
provide an astonishing plethora of innovative bit-permuting instructions
in exactly the same way that ARM SVE predicated-move extends 3-operand
"overwrite" opcodes to full independent 3-in 1-out.
-## BMI (bitmanipulation) group.
+## BMI (bit-manipulation) group.
Whilst the [[sv/vector_ops]] instructions are only two in number, in
reality the `bmask` instruction has a Mode field allowing it to cover
**24** instructions, more than have been added to any other CPUs by
-ARM, Intel or AMD. Analyis of the BMI sets of these CPUs shows simple
+ARM, Intel or AMD. Analysis of the BMI sets of these CPUs shows simple
patterns that can greatly simplify both Decode and implementation. These
are sufficiently commonly used, saving instruction count regularly,
that they justify going into EXT0xx.
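To show the kind of simple pattern meant, here are three well-known x86 BMI1 operations written out in Python. This does not describe the `bmask` Mode field encoding, which is not reproduced here; it only illustrates that such instructions are all short combinations of `x`, `x-1` or `-x`, and one logical operation. The 64-bit width is an assumption for the sketch.

```python
MASK64 = (1 << 64) - 1

def blsi(x):
    """Isolate the lowest set bit: x & -x."""
    return (x & -x) & MASK64

def blsr(x):
    """Reset (clear) the lowest set bit: x & (x - 1)."""
    return (x & (x - 1)) & MASK64

def blsmsk(x):
    """Mask up to and including the lowest set bit: x ^ (x - 1)."""
    return (x ^ (x - 1)) & MASK64

x = 0b101100
```

Because every variant shares the same small set of building blocks, a single instruction with a mode selector can cover the whole family.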
Very easily justified. As explained in [[ls002]] these always save one
LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
FP value being in the I-Cache side. It is such a high priority that
-these instuctions are easily justifiable adding into EXT0xx, despite
+these instructions are easily justifiable adding into EXT0xx, despite
requiring a 16-bit immediate. By designing the second-half instruction
-as a Read-Modify-Write it saves on XO bitlength (only 5 bits), and can be
+as a Read-Modify-Write it saves on XO bit-length (only 5 bits), and can be
macro-op fused with its first-half to store a full IEEE754 FP32 immediate
into a register.
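The two-half scheme can be sketched as follows. This assumes that the first-half instruction supplies the upper 16 bits of the FP32 pattern and the second-half Read-Modify-Write supplies the lower 16 bits; the function names are hypothetical, and the actual encodings are defined in [[ls002]], not here.

```python
import struct

def first_half(imm16):
    """Assumed first-half: place a 16-bit immediate in the upper half of an FP32 pattern."""
    return (imm16 & 0xFFFF) << 16

def second_half(reg_bits, imm16):
    """Assumed second-half (Read-Modify-Write): keep the upper 16 bits, fill in the lower 16."""
    return (reg_bits & 0xFFFF0000) | (imm16 & 0xFFFF)

def bits_to_f32(bits):
    """Interpret a 32-bit pattern as an IEEE754 FP32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

coarse = first_half(0x4049)            # 3.140625, a 16-bit approximation of pi
full = second_half(coarse, 0x0FDB)     # 0x40490FDB, pi rounded to FP32
```

The first half alone already yields a usable reduced-precision constant; the second half only needs to carry the remaining 16 bits, which is why it can be a short-XO Read-Modify-Write.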
There is little point in putting these instructions into EXT2xx. Their
very benefit and inherent value *is* as 32-bit instructions, not 64-bit
-ones. Likewise there is less value in taking up EXT1xx Enoding space
+ones. Likewise there is less value in taking up EXT1xx Encoding space
because EXT1xx only brings an additional 16 bits (approx) to the table,
-and that is provided already by the second-half instuction.
+and that is provided already by the second-half instruction.
Thus they qualify as both high priority and also EXT0xx candidates.
Counter-examples are FMAC which had to be added to IEEE754 because the
*internal* product requires more accuracy than can fit into a register.
-Another would be a dotproduct instruction, which again requires an accumulator
+Another would be a dot-product instruction, which again requires an accumulator
of at least double the width of the two vector inputs. And in the AMDGPU
ISA, there are Texture-mapping instructions taking up to an astounding
*twelve* input operands!
Less extreme examples include instructions that take only a few cycles to complete,
but if used in tight loops with Conditional Branches, an Out-of-Order system with
Speculative capability may need significantly more Reservation Stations to hold
-in-flight dats for instructions which take longer than those which do not.
+in-flight data for instructions which take longer than those which do not.
**Can one instruction do the job of many?**
* **regs** - a guide to register usage, to how costly Hazard Management
will be, in hardware:
- 1R: reads one GPR/FPR/SPR/CR.
- - 1W: writes one GPR/FPR/SPR/CR.
+ - 1W: writes one GPR/FPR/SPR/CR.
- 1r: reads one CR *Field* (not necessarily the entire CR)
- 1w: writes one CR *Field* (not necessarily the entire CR)