* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://youtu.be/ZQ5hw9AwO1U> walkthrough video (19jun2022)
+* <https://ftp.libre-soc.org/simple_v_spec.pdf>
+ PDF version of this DRAFT specification
**SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.
===
-# DRAFT SV (Simple Scalar Vectorisation) for the Power ISA
+# Scalable Vectors for the Power ISA
SV is designed as a strict RISC-paradigm
Scalable Vector ISA for Hybrid 3D CPU GPU VPU workloads.
* Preserving the underlying scalar execution dependencies as if the
for-loop had been expanded as actual scalar instructions
(termed "preserving Program Order")
+* Specifically designed to be Precise-Interruptible at all times
+ (many Vector ISAs have operations which, due to higher internal
+ accuracy or other complexity, must be effectively atomic only for
+ the full Vector operation's duration, adversely affecting interrupt
+ response latency, or be abandoned and started again)
* Augments ("tags") existing instructions, providing Vectorisation
"context" rather than adding new instructions.
* Strictly does not interfere with or alter the non-Scalable Power ISA
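
The for-loop behaviour described in the list above can be illustrated with a minimal Python sketch. It is a simplified model and not the specification: the function name, the list-based register file, and the rule that each register number advances by the element index are assumptions made purely for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def sv_add(regs, vl, rt, ra, rb):
    """Hypothetical model: one scalar 'add' repeated VL times.

    Elements are executed strictly in order, exactly as if the loop
    had been unrolled into VL separate scalar add instructions.
    """
    for i in range(vl):
        # Each element reads the *current* register contents, so a
        # result written by element i is visible to element i+1.
        regs[rt + i] = (regs[ra + i] + regs[rb + i]) & MASK64
    return regs

# Overlapping source and destination registers form a dependency chain.
# Preserving Program Order turns this into a running sum.
regs = sv_add([1, 2, 3, 4], vl=3, rt=1, ra=0, rb=1)
print(regs)
```

Because element 1 reads the value element 0 has just written, the overlapping registers behave exactly as three consecutive scalar adds would.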
Comparative instruction count:
* ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
-* ARM SVE: around 4,000 instructions, prerequisite: NEON.
-* ARM SVE2: around 1,000 instructions, prerequisite: SVE
+* ARM SVE: around 4,000 instructions, prerequisite: NEON and ARM Scalar
+* ARM SVE2: around 1,000 instructions, prerequisite: SVE, NEON, and
+ ARM Scalar for a grand total of well over 7,000 instructions.
* Intel AVX-512: around 4,000 instructions, prerequisite AVX, AVX2,
AVX-128 and AVX-256 which in turn critically rely on the rest of
x86, for a grand total of well over 10,000 instructions.
* RISC-V RVV: 192 instructions, prerequisite 96 Scalar RV64GC instructions
-* SVP64: **five** instructions, 24-bit prefixing of
+* SVP64: **six** instructions, two of which are in the same space
+ (svshape, svshape2), with 24-bit prefixing of
prerequisite SFS (150) or
- SFFS (214) Compliancy Subsets
+ SFFS (214) Compliancy Subsets.
+ **There are no dedicated Vector instructions, only Scalar-prefixed**.
+
+Comparative Basic Design Principle:
+
+* ARM NEON and VSX: PackedSIMD. No instruction-overloaded meaning
+ (every instruction is unique for a given register bitwidth,
+ guaranteeing binary interoperability)
+* Intel AVX-512 (and below): Hybrid Packed-Predicated SIMD with no
+ instruction-overloading, guaranteeing binary interoperability
+ but at the same time penalising the ISA with runaway
+ opcode proliferation.
+* ARM SVE/SVE2: Hybrid Packed-Predicated SIMD with instruction-overloading
+ that destroys binary interoperability. This is hidden behind the
+ misuse of the word "Scalable" and is **permitted under License**
+ by "Silicon Partners".
+* RISC-V RVV: Cray-style Scalable Vector but with instruction-overloading
+ **permitted by the specification** that destroys binary interoperability.
+* SVP64: Cray-style Scalable Vector with no instruction-overloaded
+ meanings. The regfile numbers and bitwidths shall **not** change
+ in a future revision (for the same instruction encoding):
+ "Silicon Partner" Scaling is prohibited,
+ in order to guarantee binary interoperability. Future revisions
+ of SVP64 may extend VSX instructions to achieve larger regfiles, and
+ non-interoperability on the same will likewise be prohibited.
SV comprises several [[sv/compliancy_levels]] suited to Embedded, Energy
efficient High-Performance Compute, Distributed Computing and Advanced
Pages being developed and examples
+* [[sv/executive_summary]]
* [[sv/overview]] explaining the basics.
* [[sv/compliancy_levels]] for minimum subsets through to Advanced
Supercomputing.
* [[sv/implementation]] implementation planning and coordination
-* [[sv/svp64]] contains the packet-format *only*, the [[sv/svp64/appendix]]
+* [[sv/po9_encoding]] a new DRAFT 64-bit space similar to EXT1xx,
+ introducing new areas EXT232-263 and EXT300-363
+* [[sv/svp64]] contains the packet-format *only*, the [[svp64/appendix]]
contains explanations and further details
+* [[sv/svp64-single]] still under development
* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
+ or are not immediately apparent despite the RISC paradigm
* [[opcode_regs_deduped]] autogenerated table of SVP64 decoder augmentation
* [[sv/sprs]] SPRs
-* SVP64 "Modes":
- - For condition register operations see [[sv/cr_ops]] - SVP64 Condition
+* [[sv/rfc]] RFCs to the [OPF ISA WG](https://openpower.foundation/isarfc/)
+
+SVP64 "Modes":
+
+* For condition register operations see [[sv/cr_ops]] - SVP64 Condition
Register ops: Guidelines
on Vectorisation of any v3.0B base operations which return
or modify a Condition Register bit or field.
- - For LD/ST Modes, see [[sv/ldst]].
- - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
+* For LD/ST Modes, see [[sv/ldst]].
+* For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
behaviour: All/Some Vector CRs
- - For arithmetic and logical, see [[sv/normal]]
- - [[sv/mv.vec]] pack/unpack move to and from vec2/3/4,
+* For arithmetic and logical, see [[sv/normal]]
+* [[sv/mv.vec]] pack/unpack move to and from vec2/3/4,
actually an RM.EXTRA Mode and a [[sv/remap]] mode
Core SVP64 instructions:
* [[sv/setvl]] the Cray-style "Vector Length" instruction
-* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
-* [[sv/svstep]] Key stepping instruction for Vertical-First Mode
-
-*Please note: there are only five instructions in the whole of SV.
+* svremap, svindex and svshape: part of [[sv/remap]] "Remapping" for
+ Matrix Multiply, DCT/FFT and RGB-style "Structure Packing"
+ as well as general-purpose Indexing. Also describes associated SPRs.
+* [[sv/svstep]] Key stepping instruction, primarily for
+ Vertical-First Mode and also providing traditional "Vector Iota"
+ capability.
+
+*Please note: there are only six instructions in the whole of SV.
Beyond this point are additional **Scalar** instructions related to
specific workloads that have nothing to do with the SV Specification*
-# Non-Simple-V Scalar instructiond
+# Stability Guarantees in Simple-V
+
+Providing long-term stability in an ISA is extremely challenging
+but critically important.
+It requires certain guarantees to be provided.
+
+* Firstly, that instructions will never be ambiguously-defined.
+* Secondly, that no instruction shall change meaning to produce
+ different results on different hardware (present or future).
+* Thirdly, that Scalar "defined words" (32 bit instruction
+ encodings) if Vectorised will also always be implemented as
+ identical Scalar instructions (the sole semi-exception being
+ Vectorised Branch-Conditional).
+* Fourthly, that implementors are not permitted either to add
+ arbitrary features or to implement features in an incompatible
+ way. *(Performance may differ, but differing results are
+ not permitted)*.
+* Fifthly, that any part of Simple-V not implemented by
+ a lower Compliancy Level is *required* to raise an illegal
+ instruction trap (allowing soft-emulation), including if
+ Simple-V is not implemented at all.
+* Sixthly, that any `UNDEFINED` behaviour for practical implementation
+ reasons is clearly documented for both programmers and hardware
+ implementors.
+
+In particular, given the strong recent emphasis and interest in
+"Scalable Vector" ISAs, it is most unfortunate that both ARM SVE
+and RISC-V RVV permit the exact same instruction to produce
+different results on different hardware depending on a
+"Silicon Partner" hardware choice. This choice catastrophically
+and irrevocably causes binary non-interoperability *despite being
+a "feature"*. Explained in <https://m.youtube.com/watch?v=HNEm8zmkjBU>
+it is the exact same binary-incompatibility issue faced by Power ISA
+on its 32- to 64-bit transition: 32-bit hardware was **unable** to
+trap-and-emulate 64-bit binaries because the opcodes were (are) the same.
+
+It is therefore *guaranteed* that extensions to the register file
+width and quantity in Simple-V shall only be made in future by
+explicit means, ensuring binary compatibility.
+
+# Optional Scalar instructions
**Additional Instructions for specific purposes (not SVP64)**
They are all entirely designed as Scalar instructions that, as
Scalar instructions, stand on their own merit. Considerable
lengths have been made to provide justifications for each of these
-*Scalar* instructions.
+*Scalar* instructions in a *Scalar* context, completely independently
+of SVP64.
-Some of these Scalar instructions are specifically designed to make
+Some of these Scalar instructions happen also to be designed to make
Scalable Vector binaries more efficient, such
as the crweird group. Others are to bring the Scalar Power ISA
up-to-date within specific workloads,
-such as a Javascript Rounding instruction. None of them are strictly
-necessary but performance and power consumption may be (or, is already)
-compromised
+such as a JavaScript Rounding instruction
+(which saves 32 scalar instructions including seven branch instructions).
+None of them are strictly necessary but performance and power consumption may
+be (or, is already) compromised
in certain workloads and use-cases without them.
-Vector-related:
+Vector-related but still Scalar:
* [[sv/mv.swizzle]] vec2/3/4 Swizzles (RGBA, XYZW) for 3D and CUDA.
designed as a Scalar instruction.
* [[sv/vector_ops]] scalar operations needed for supporting vectors
+* [[sv/cr_int_predication]] scalar instructions needed for
+ effective predication
-Scalar Instructions:
+Stand-alone Scalar Instructions:
-* [[sv/cr_int_predication]] instructions needed for effective predication
* [[sv/bitmanip]]
* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
* [[sv/fclass]] detect class of FP numbers
* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
* [[sv/av_opcodes]] scalar opcodes for Audio/Video
-* Twin targetted instructions (two registers out, one implicit, just like
- Load-with-Update).
- Explanation of the rules for twin register targets
- (implicit RS, FRS) explained in SVP64 [[sv/svp64/appendix]]
- - [[isa/svfixedarith]]
- - [[isa/svfparith]]
- - [[sv/biginteger]] Operations that help with big arithmetic
+* [[prefix_codes]] Decode/encode prefix-codes, used by JPEG, DEFLATE, etc.
* TODO: OpenPOWER adaptation [[openpower/transcendentals]]
+Twin targeted instructions (two registers out, one implicit, just like
+Load-with-Update).
+
+* [[isa/svfixedarith]]
+* [[isa/svfparith]]
+* [[sv/biginteger]] Operations that help with big arithmetic
+
+The rules for twin register targets
+(implicit RS, FRS) are explained in the SVP64 [[svp64/appendix]]
+
+# Architectural Note
+
+This section is primarily for the ISA Working Group and for IBM
+in their capacity and responsibility for allocating "Architectural
+Resources" (opcodes), but it is also useful for general understanding
+of Simple-V.
+
+Simple-V is effectively a type of "Zero-Overhead Loop Control" to which
+an entire 24 bits are exclusively dedicated in a fully RISC-abstracted
+manner. Within those 24-bits there are no Scalar instructions, and
+no Vector instructions: there is *only* "Loop Control".
+
+This is why there are no actual Vector operations in Simple-V: *all* suitable
+Scalar Operations are Vectorised or not at all. This has some extremely
+important implications when considering adding new instructions, and
+especially when allocating the Opcode Space for them.
+To protect SVP64 from damage, a "Hard Rule" has to be set:
+
+ Scalar Instructions must be simultaneously added in the corresponding
+ SVP64 opcode space with the exact same 32-bit "Defined Word" or they
+ must not be added at all. Likewise, instructions planned for addition
+ in what is considered (wrongly) to be the exclusive "Vector" domain
+ must correspondingly be added in the Scalar space with the exact same
+ 32-bit "Defined Word", or they must not be added at all.
+
+Some explanation of the above is needed. Firstly, "Defined Word" is a term
+used in Section 1.6.3 of the Power ISA v3.1 Book I: it means, in short,
+"a 32 bit instruction", which can then be Prefixed by EXT001 to extend it
+to 64-bit (named EXT100-163).
+Prefixed-Prefixed (96-bit Variable-Length) encodings are
+prohibited in v3.1 and they are just as prohibited in Simple-V: it's too
+complex in hardware. This means that **only** 32-bit "Defined Words"
+may be Vectorised, and in particular it means that no 64-bit instruction
+(EXT100-163) may **ever** be Vectorised.
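
As a rough illustration of the paragraph above, the following Python sketch classifies a stream of 32-bit words as either a plain "Defined Word" or an EXT001-Prefixed 64-bit instruction, and rejects a Prefixed-Prefixed sequence. The function names are hypothetical and the check is deliberately minimal: it looks only at the Primary Opcode (the top six bits of each word) and ignores every other encoding rule.

```python
PREFIX_MAJOR_OPCODE = 1  # EXT001

def primary_opcode(word):
    """Top six bits of a 32-bit word (bits 0-5 in MSB0 numbering)."""
    return (word >> 26) & 0x3F

def instruction_length(words):
    """Length in bytes of the instruction starting at words[0].

    Hypothetical helper: 4 for a plain Defined Word, 8 for an
    EXT001 prefix followed by a Defined Word.
    """
    if primary_opcode(words[0]) != PREFIX_MAJOR_OPCODE:
        return 4
    if len(words) < 2:
        raise ValueError("prefix word without a following Defined Word")
    if primary_opcode(words[1]) == PREFIX_MAJOR_OPCODE:
        # A prefix may not itself be prefixed: no 96-bit encodings.
        raise ValueError("Prefixed-Prefixed encodings are prohibited")
    return 8

scalar_word = 0x38600001   # a word whose Primary Opcode is 14
prefix_word = 0x06000000   # a word whose Primary Opcode is 1 (EXT001)
print(instruction_length([scalar_word]))
print(instruction_length([prefix_word, scalar_word]))
```

Only the second word of a Prefixed pair is a "Defined Word" in the sense used here, which is why the sketch never looks further than two words.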
+
+Secondly, the term "Vectoriseable" was used. This refers to "instructions
+which if SVP64-Prefixed are actually meaningful". `sc` is meaningless
+to Vectorise, for example, as are `sync` and `mtmsr` (there is only ever
+going to be one MSR).
+
+The problem comes if the following rationale is applied: "if unused,
+Unvectoriseable opcodes can therefore be allocated to alternative
+instructions mixing inside the SVP64 Opcode space". This
+unfortunately results in hugely inadvisable complexity in HDL at the
+Decode Phase, attempting to discern between the two types. Worse than that,
+if the alternate 64-bit instruction is Vectoriseable but the 32-bit Scalar
+"Defined Word" is already allocated, how can there ever be a Scalar version
+of the alternate instruction? It would have to be added as a **completely
+different** 32-bit "Defined Word", and things go rapidly downhill in the
+Decoder as well as the ISA from there.
+
+Therefore to avoid risk and long-term damage to the Power ISA:
+
+* *even Unvectoriseable* "Defined Words" (`mtmsr`) must have the
+ corresponding SVP64 Prefixed Space `RESERVED`, permanently requiring
+ Illegal Instruction to be raised (the 64-bit encoding corresponding
+ to an illegal `sv.mtmsr` if ever incorrectly attempted must be
+ **defined** to raise an Exception)
+* *Even instructions that may not be Scalar* (although for various
+ practical reasons this is extremely rare if not impossible,
+ if not just generally "strongly discouraged")
+ which have no meaning or use as a 32-bit Scalar "Defined Word", **must**
+ still have the Scalar "Defined Word" `RESERVED` in the scalar
+ opcode space, as an Illegal Instruction.
+
+A good example of the former is `mtmsr` because there is only one
+MSR register (`sv.mtmsr` is meaningless, as is `sv.sc`),
+and a good example of the latter is [[sv/mv.x]]
+which is so deeply problematic to add to any Scalar ISA that it was
+rejected outright and an alternative route taken (Indexed REMAP).
+
+Another good example would be Cross Product which has no meaning
+at all in a Scalar ISA (Cross Product as a concept only applies
+to Mathematical Vectors). If any such Vector operation were ever added,
+it would be **critically** important to reserve the exact same *Scalar*
+opcode with the exact same "Defined Word" in the *Scalar* Power ISA
+opcode space, as an Illegal Instruction. There are
+good reasons why Cross Product has not been proposed, but it serves
+to illustrate the point as far as Architectural Resource Allocation is
+concerned.
+
+The bottom line is that, whilst this seems wasteful, the alternatives are a
+destabilisation of the Power ISA and impractically-complex Hardware
+Decoders. With the Scalar Power ISA (v3.0, v3.1) already being comprehensive
+in the number of instructions, keeping further Decode complexity down is a
+high priority.
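
The "Hard Rule" and the two `RESERVED` cases above can be expressed as a simple consistency check over two allocation tables. The sketch below is an illustrative assumption only: the table layout, the function name and the numeric keys (placeholders, not real encodings) are all hypothetical, and no such tool is described by this specification.

```python
RESERVED = "RESERVED"  # must raise Illegal Instruction

def check_hard_rule(scalar_space, svp64_space):
    """Return a list of (defined_word, reason) violations.

    Both arguments map a 32-bit Defined Word to either an instruction
    name or RESERVED. Every Defined Word must appear in both spaces,
    and where both sides are allocated they must be the same instruction.
    """
    violations = []
    for word in sorted(set(scalar_space) | set(svp64_space)):
        scalar = scalar_space.get(word)
        vector = svp64_space.get(word)
        if scalar is None or vector is None:
            violations.append((word, "not allocated or RESERVED in both spaces"))
        elif scalar != vector and RESERVED not in (scalar, vector):
            violations.append((word, "different instructions share a Defined Word"))
    return violations

# Placeholder keys 1..4 stand in for Defined Words.
scalar_space = {1: "add", 2: "mtmsr", 3: "someop", 4: "newop"}
svp64_space = {1: "add", 2: RESERVED, 3: "otherop"}
for word, reason in check_hard_rule(scalar_space, svp64_space):
    print(word, reason)
```

In this toy data `add` and `mtmsr` pass (the latter because its SVP64 slot is `RESERVED`), whereas placeholder 3 is flagged for mixing two different instructions and placeholder 4 for having no corresponding SVP64 entry at all.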
+
# Other Scalable Vector ISAs
+These Scalable Vector ISAs are listed to aid in understanding and
+context of what is involved.
+
* Original Cray ISA
<http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>
* NEC SX Aurora (still in production, inspired by Cray)
A comprehensive list of 3D GPU, Packed SIMD, Predicated-SIMD and true Scalable
Vector ISAs may be found at the [[sv/vector_isa_comparison]] page.
-Note: AVX-512 and SVE2 are *not strict Vector ISAs*, they are Predicated-SIMD.
+Note: AVX-512 and SVE2 are *not Vector ISAs*, they are Predicated-SIMD.
*Public discussions have taken place at Conferences attended by both Intel
and ARM on adding a `setvl` instruction which would easily make both
-AVX-512 and SVE2 truly "Scalable".*
+AVX-512 and SVE2 truly "Scalable".* See [[sv/comparison_table]] for a
+comparison in tabular form.
-# Major opcodes summary
+# Major opcodes summary <a name="major_op_summary"> </a>
-Simple-V itself only requires four instructions with 6-bit Minor XO
+Simple-V itself only requires six instructions with 6-bit Minor XO
(bits 26-31), and the SVP64 Prefix Encoding requires
25% space of the EXT001 Major Opcode.
There are **no** Vector Instructions and consequently **no further
-opcode space is required**.
+opcode space is required**. Even though they are currently
+placed in the EXT022 Sandbox, the "Management" instructions
+(setvl, svstep, svremap, svshape, svindex) are designed to fit
+cleanly into EXT019 (exactly like `addpcis`) or other 5/6-bit Minor
+XO area (bits 25-31) that has space for Rc=1.
That said: for the target workloads for which Scalable Vectors are typically
-used, the Scalar ISA on which SV critically relies is somewhat anaemic.
+used, the Scalar ISA on which those workloads critically rely
+is somewhat anaemic.
The Libre-SOC Team has therefore been addressing that by developing
a number of Scalar instructions in specialist areas (Big Integer,
Cryptography, 3D, Audio/Video, DSP) and it is these which require
is considerable concern that because there is not yet any two-way
day-to-day communication established with the OPF ISA WG, we have
no idea if any of these are conflicting with future plans by any OPF
-Members. **The External ISA WG RFC Process is yet to be ratified
-and Libre-SOC may not join the OPF as an entity because it does
+Members. **The External ISA WG RFC Process has now been ratified
+but Libre-SOC may not join the OPF as an entity because it does
not exist except in name. Even if it existed it would be a conflict
of interest to join the OPF, due to our funding remit from NLnet**.
We therefore proceed on the basis of making public the intention to
in a wholly unsatisfactory manner have to *hope and trust* that
OPF ISA WG Members are reading this and take it into consideration.
+**Scalar Summary**
+
+As in the above sections, it is emphasised strongly that Simple-V in no
+way critically depends on the 100 or so *Scalar* instructions also
+being developed by Libre-SOC.
+
**None of these Draft opcodes are intended for private custom
secret proprietary usage. They are all intended for entirely
public, upstream, high-profile mass-volume day-to-day usage at the
same level as add, popcnt and fld**
-* SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1)
* bitmanip requires two major opcodes (due to 16+ bit immediates)
those are currently EXT022 and EXT05.
* brownfield encoding in one of those two major opcodes still
requires multiple VA-Form operations (in greater numbers
than EXT04 has spare)
* space in EXT019 next to addpcis and crops is recommended
- (or any 5-6 bit Minor XO areas)
+ (or any other 5-6 bit Minor XO areas)
* many X-Form opcodes currently in EXT022 have no preference
for a location at all, and may be moved to EXT059, EXT019,
EXT031 or other much more suitable location.
+* even if ratified and even if the majority (mostly X-Form)
+ is moved to other locations, the large immediate sizes of
+ the remaining bitmanip instructions means
+ it would be highly likely these remaining instructions would need two
+ major opcodes. Fortuitously the v3.1 Spec states that
+ both EXT005 and EXT009 are
+ available.
+
+**Additional observations**
Note that there is no Sandbox allocation in the published ISA Spec for
v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
on the 32-bit Major Opcode space.
Note also that EXT022, the Official Architectural Sandbox area
+available for "Custom non-approved purposes" according to the Power
+ISA Spec,
is under severe design pressure as it is insufficient to hold
the full extent of the instruction additions required to create
-a Hybrid 3D CPU-VPU-GPU.
-
-**Whilst SVP64 is only 4 instructions
+a Hybrid 3D CPU-VPU-GPU. Although the wording of the Power ISA
+Specification leaves open the *possibility* of not needing to
+propose ISA Extensions to the ISA WG, it is clear that EXT022
+is an inappropriate location for a large high-profile Extension
+intended for mass-volume product deployment. Every in-good-faith effort will
+therefore be made to work with the OPF ISA WG to
+submit SVP64 via the External RFC Process.
+
+**Whilst SVP64 is only 6 instructions
the heavy focus on VSX for the past 12 years has left the SFFS Level
-anaemic and out-of-date compared to ARM and x86. Approximately
-100 additional Scalar Instructions are up for proposal**
+anaemic and out-of-date compared to ARM and x86.**
+This is very much
+a blessing, as the Scalar ISA has remained clean, making it
+highly suited to RISC-paradigm Scalable Vector Prefixing. Approximately
+100 additional (optional) Scalar Instructions are up for proposal to bring SFFS
+up-to-date. None of them require or depend on PackedSIMD VSX (or VMX).
# Other
Examples, experiments, future ideas, discussion:
+* [Scalar register access](https://bugs.libre-soc.org/show_bug.cgi?id=905)
+ above r31 and CR7.
* [[sv/propagation]] Context propagation including svp64, swizzle and remap
* [[sv/masked_vector_chaining]]
* [[sv/discussion]]
* <https://www.sigarch.org/simd-instructions-considered-harmful/>
* [[sv/vector_isa_comparison]] - a list of Packed SIMD, GPU,
and other Scalable Vector ISAs
+* [[sv/comparison_table]] - a one-off (experimental) table comparing ISAs
* [[simple_v_extension]] old (deprecated) version
* [[openpower/sv/llvm]]