X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv.mdwn;h=23b9e811b5641af5f37dc558f0e0a4f80e51dd79;hb=83e17db9000ab78bc559eed77c9f06743551bd18;hp=f91319e55b9844cd3cd5bce1bf3a845dadb008c4;hpb=dc3a9448773ac40994efeb35db63a473b794b122;p=libreriscv.git diff --git a/openpower/sv.mdwn b/openpower/sv.mdwn index f91319e55..23b9e811b 100644 --- a/openpower/sv.mdwn +++ b/openpower/sv.mdwn @@ -4,40 +4,63 @@ Obligatory Dilbert: -=== +Links: -# SV (Simple Vectorisation) for the Power ISA +* +* walkthrough video (19jun2022) +* + PDF version of this DRAFT specification **SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review. - +=== + +# Scalable Vectors for the Power ISA -SV is designed as a Scalable Vector ISA for Hybrid 3D CPU GPU VPU workloads. +SV is designed as a strict RISC-paradigm +Scalable Vector ISA for Hybrid 3D CPU GPU VPU workloads. As such it brings features normally only found in Cray Supercomputers (Cray-1, NEC SX-Aurora) and in GPUs, but keeps strictly to a *Simple* RISC principle of leveraging a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual -explicit Vector opcode exists in SV, at all**. +explicit Vector opcode exists in SV, at all**. It is suitable for +low-power Embedded and DSP Workloads as much as it is for power-efficient +Supercomputing. Fundamental design principles: -* Simplicity of introduction and implementation on the existing Power ISA +* Taking the simplicity of the RISC paradigm and applying it strictly and + uniformly to create a Scalable Vector ISA. * Effectively a hardware for-loop, pausing PC, issuing multiple scalar operations * Preserving the underlying scalar execution dependencies as if the for-loop had been expanded as actual scalar instructions (termed "preserving Program Order") +* Specifically designed to be Precise-Interruptible at all times + (many Vector ISAs have operations which, due to higher internal + accuracy or other complexity, must be effectively atomic only for + the full Vector operation's duration, adversely affecting interrupt + response latency, or be abandoned and started again) * Augments ("tags") existing instructions, providing Vectorisation - "context" rather than adding new ones. -* Does not modify or deviate from the underlying scalar Power ISA + "context" rather than adding new instructions. +* Strictly does not interfere with or alter the non-Scalable Power ISA + in any way +* In the Prefix space, does not modify or deviate from the underlying + scalar Power ISA unless it provides significant performance or other advantage to do so - in the Vector space (dropping XER.SO for example) + in the Vector space (dropping the "sticky" characteristics + of XER.SO and CR0.SO for example) * Designed for Supercomputing: avoids creating significant sequential - dependency hazards, allowing high performance superscalar - microarchitectures to be deployed. + dependency hazards, allowing standard + high performance superscalar multi-issue + micro-architectures to be leveraged. +* Divided into Compliancy Levels to reduce cost of implementation for + specific needs. Advantages of these design principles: +* Simplicity of introduction and implementation on top of + the existing Power ISA without disruption. * It is therefore easy to create a first (and sometimes only) implementation as literally a for-loop in hardware, simulators, and compilers. 
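+
+To make the "hardware for-loop" principle concrete, below is a minimal
+Python sketch of how a simulator might interpret a single Vectorised
+(Prefixed) scalar operation. It is an illustration only: predication,
+element-width overrides and register-file details are omitted, and the
+names (`element_op`, `regs`, `VL`) are placeholders rather than the
+Specification's pseudocode.
+
+    # Conceptual sketch: one SVP64-prefixed scalar instruction behaves
+    # as if the scalar operation had been repeated VL times over
+    # sequentially-numbered registers, in strict Program Order, before
+    # the PC advances to the next instruction.
+    def execute_prefixed(element_op, RT, RA, RB, VL, regs):
+        for i in range(VL):                  # the "hardware for-loop"
+            regs[RT + i] = element_op(regs[RA + i], regs[RB + i])
+        # only now does the PC move on
+
+    # example: a Vectorised integer add over four elements
+    regs = list(range(32))                   # toy scalar register file
+    execute_prefixed(lambda a, b: a + b, RT=8, RA=16, RB=24, VL=4, regs=regs)
+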
@@ -58,13 +81,41 @@ Advantages of these design principles:
 
 Comparative instruction count:
 
 * ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
-* ARM SVE: around 4,000 instructions, prerequisite: NEON.
-* ARM SVE2: around 1,000 instructions, prerequisite: SVE
-* Intel AVX-512: around 4,000 instructions, prerequisite AVX2 etc.
+* ARM SVE: around 4,000 instructions, prerequisite: NEON and ARM Scalar
+* ARM SVE2: around 1,000 instructions, prerequisite: SVE, NEON, and
+  ARM Scalar, for a grand total of well over 7,000 instructions.
+* Intel AVX-512: around 4,000 instructions, prerequisite AVX, AVX2,
+  AVX-128 and AVX-256, which in turn critically rely on the rest of
+  x86, for a grand total of well over 10,000 instructions.
 * RISC-V RVV: 192 instructions, prerequisite 96 Scalar RV64GC instructions
-* SVP64: **four** instructions, 24-bit prefixing of
+* SVP64: **six** instructions, two of which are in the same space
+  (svshape, svshape2), with 24-bit prefixing of
   prerequisite SFS (150) or
-  SFFS (214) Compliancy Subsets
+  SFFS (214) Compliancy Subsets.
+  **There are no dedicated Vector instructions, only Scalar-prefixed**.
+
+Comparative Basic Design Principle:
+
+* ARM NEON and VSX: PackedSIMD. No instruction-overloaded meaning
+  (every instruction is unique for a given register bitwidth,
+  guaranteeing binary interoperability)
+* Intel AVX-512 (and below): Hybrid Packed-Predicated SIMD with no
+  instruction-overloading, guaranteeing binary interoperability
+  but at the same time penalising the ISA with runaway
+  opcode proliferation.
+* ARM SVE/SVE2: Hybrid Packed-Predicated SIMD with instruction-overloading
+  that destroys binary interoperability. This is hidden behind the
+  misuse of the word "Scalable" and is **permitted under License**
+  by "Silicon Partners".
+* RISC-V RVV: Cray-style Scalable Vector but with instruction-overloading
+  **permitted by the specification** that destroys binary interoperability.
+* SVP64: Cray-style Scalable Vector with no instruction-overloaded
+  meanings. The regfile numbers and bitwidths shall **not** change
+  in a future revision (for the same instruction encoding):
+  "Silicon Partner" Scaling is prohibited,
+  in order to guarantee binary interoperability. Future revisions
+  of SVP64 may extend VSX instructions to achieve larger regfiles, and
+  non-interoperability there will likewise be prohibited.
 
 SV comprises several [[sv/compliancy_levels]] suited to Embedded, Energy
 efficient High-Performance Compute, Distributed Computing and Advanced
@@ -72,14 +123,281 @@ Computational Supercomputing. The Compliancy Levels are arranged such
 that even at the bare minimum Level, full Soft-Emulation of all optional
 and future features is possible.
 
-# Major opcodes summary
+# Sub-pages
+
+Pages being developed, and examples:
 
-Please be advised that even though below is entirely DRAFT status, there
+* [[sv/executive_summary]]
+* [[sv/overview]] explaining the basics.
+* [[sv/compliancy_levels]] for minimum subsets through to Advanced
+  Supercomputing.
+* [[sv/implementation]] implementation planning and coordination
+* [[sv/po9_encoding]] a new DRAFT 64-bit space similar to EXT1xx,
+  introducing new areas EXT232-63 and EXT300-363
+* [[sv/svp64]] contains the packet-format *only*, the [[svp64/appendix]]
+  contains explanations and further details
+* [[sv/svp64-single]] still under development
+* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
+  or are not immediately apparent despite the RISC paradigm
+* [[opcode_regs_deduped]] autogenerated table of SVP64 decoder augmentation
+* [[sv/sprs]] SPRs
+* [[sv/rfc]] RFCs to the [OPF ISA WG](https://openpower.foundation/isarfc/)
+
+SVP64 "Modes":
+
+* For condition register operations see [[sv/cr_ops]] - SVP64 Condition
+  Register ops: Guidelines
+  on Vectorisation of any v3.0B base operations which return
+  or modify a Condition Register bit or field.
+* For LD/ST Modes, see [[sv/ldst]].
+* For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
+  behaviour: All/Some Vector CRs
+* For arithmetic and logical, see [[sv/normal]]
+* [[sv/mv.vec]] pack/unpack move to and from vec2/3/4,
+  actually an RM.EXTRA Mode and a [[sv/remap]] mode
+
+Core SVP64 instructions:
+
+* [[sv/setvl]] the Cray-style "Vector Length" instruction
+* svremap, svindex and svshape: part of [[sv/remap]] "Remapping" for
+  Matrix Multiply, DCT/FFT and RGB-style "Structure Packing"
+  as well as general-purpose Indexing. Also describes associated SPRs.
+* [[sv/svstep]] Key stepping instruction, primarily for
+  Vertical-First Mode and also providing traditional "Vector Iota"
+  capability.
+
+*Please note: there are only six instructions in the whole of SV.
+Beyond this point are additional **Scalar** instructions related to
+specific workloads that have nothing to do with the SV Specification*
+
+# Stability Guarantees in Simple-V
+
+Providing long-term stability in an ISA is extremely challenging
+but critically important.
+It requires certain guarantees to be provided.
+
+* Firstly, that instructions will never be ambiguously defined.
+* Secondly, that no instruction shall change meaning to produce
+  different results on different hardware (present or future).
+* Thirdly, that Scalar "defined words" (32-bit instruction
+  encodings), if Vectorised, will also always be implemented as
+  identical Scalar instructions (the sole semi-exception being
+  Vectorised Branch-Conditional)
+* Fourthly, that implementors are not permitted either to add
+  arbitrary features or to implement features in an incompatible
+  way. *(Performance may differ, but differing results are
+  not permitted)*.
+* Fifthly, that any part of Simple-V not implemented by
+  a lower Compliancy Level is *required* to raise an illegal
+  instruction trap (allowing soft-emulation), including if
+  Simple-V is not implemented at all.
+* Sixthly, that any `UNDEFINED` behaviour for practical implementation
+  reasons is clearly documented for both programmers and hardware
+  implementors.
+
+In particular, given the strong recent emphasis and interest in
+"Scalable Vector" ISAs, it is most unfortunate that both ARM SVE
+and RISC-V RVV permit the exact same instruction to produce
+different results on different hardware depending on a
+"Silicon Partner" hardware choice. This choice catastrophically
+and irrevocably causes binary non-interoperability *despite being
+a "feature"*. 
Explained in
+it is the exact same binary-incompatibility issue faced by Power ISA
+on its 32- to 64-bit transition: 32-bit hardware was **unable** to
+trap-and-emulate 64-bit binaries because the opcodes were (are) the same.
+
+It is therefore *guaranteed* that extensions to the register file
+width and quantity in Simple-V shall only be made in future by
+explicit means, ensuring binary compatibility.
+
+# Optional Scalar instructions
+
+**Additional Instructions for specific purposes (not SVP64)**
+
+All of the instructions below have nothing to do with SV.
+They are all entirely designed as Scalar instructions that, as
+Scalar instructions, stand on their own merit. Considerable
+effort has gone into providing justifications for each of these
+*Scalar* instructions in a *Scalar* context, completely independently
+of SVP64.
+
+Some of these Scalar instructions also happen to be designed to make
+Scalable Vector binaries more efficient, such
+as the crweird group. Others are to bring the Scalar Power ISA
+up-to-date for specific workloads,
+such as a JavaScript Rounding instruction
+(which saves 32 scalar instructions including seven branch instructions).
+None of them are strictly necessary, but performance and power consumption
+may be (or, in some cases, already are) compromised
+in certain workloads and use-cases without them.
+
+Vector-related but still Scalar:
+
+* [[sv/mv.swizzle]] vec2/3/4 Swizzles (RGBA, XYZW) for 3D and CUDA,
+  designed as a Scalar instruction.
+* [[sv/vector_ops]] scalar operations needed for supporting vectors
+* [[sv/cr_int_predication]] scalar instructions needed for
+  effective predication
+
+Stand-alone Scalar Instructions:
+
+* [[sv/bitmanip]]
+* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
+* [[sv/fclass]] detect class of FP numbers
+* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
+* [[sv/av_opcodes]] scalar opcodes for Audio/Video
+* [[prefix_codes]] Decode/encode prefix-codes, used by JPEG, DEFLATE, etc.
+* TODO: OpenPOWER adaptation [[openpower/transcendentals]]
+
+Twin-targeted instructions (two registers out, one implicit, just like
+Load-with-Update):
+
+* [[isa/svfixedarith]]
+* [[isa/svfparith]]
+* [[sv/biginteger]] Operations that help with big arithmetic
+
+The rules for twin register targets
+(implicit RS, FRS) are explained in the SVP64 [[svp64/appendix]].
+
+# Architectural Note
+
+This section is primarily for the ISA Working Group and for IBM
+in their capacity and responsibility for allocating "Architectural
+Resources" (opcodes), but it is also useful for general understanding
+of Simple-V.
+
+Simple-V is effectively a type of "Zero-Overhead Loop Control" to which
+an entire 24 bits are exclusively dedicated in a fully RISC-abstracted
+manner. Within those 24 bits there are no Scalar instructions, and
+no Vector instructions: there is *only* "Loop Control".
+
+This is why there are no actual Vector operations in Simple-V: *all* suitable
+Scalar Operations are Vectorised or not at all. This has some extremely
+important implications when considering adding new instructions, and
+especially when allocating the Opcode Space for them.
+To protect SVP64 from damage, a "Hard Rule" has to be set:
+
+    Scalar Instructions must be simultaneously added in the corresponding
+    SVP64 opcode space with the exact same 32-bit "Defined Word" or they
+    must not be added at all. Likewise, instructions planned for addition
+    in what is considered (wrongly) to be the exclusive "Vector" domain
+    must correspondingly be added in the Scalar space with the exact same
+    32-bit "Defined Word", or they must not be added at all.
+
+Some explanation of the above is needed. Firstly, "Defined Word" is a term
+used in Section 1.6.3 of the Power ISA v3.1 Book I: it means, in short,
+"a 32-bit instruction", which can then be Prefixed by EXT001 to extend it
+to 64-bit (named EXT100-163).
+Prefixed-Prefixed (96-bit Variable-Length) encodings are
+prohibited in v3.1 and they are just as prohibited in Simple-V: it is too
+complex in hardware. This means that **only** 32-bit "Defined Words"
+may be Vectorised, and in particular it means that no 64-bit instruction
+(EXT100-163) may **ever** be Vectorised.
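+
+As an illustrative aid only (the normative packet format is in
+[[sv/svp64]]), the overall *shape* of a Vectorised instruction may be
+pictured as a 32-bit prefix wrapping an unmodified 32-bit Scalar
+"Defined Word". A minimal Python sketch follows, in which the function
+name and the flat prefix/suffix split are illustrative assumptions
+rather than Specification pseudocode:
+
+    # Illustration only: a 64-bit SVP64-prefixed instruction is a
+    # 32-bit prefix (EXT001 Major Opcode plus the 24-bit "RM"
+    # Loop-Control context) followed by the unmodified 32-bit
+    # Scalar "Defined Word".
+    def split_svp64(insn64):
+        prefix = (insn64 >> 32) & 0xFFFFFFFF  # Vectorisation "context"
+        suffix = insn64 & 0xFFFFFFFF          # unchanged Scalar instruction
+        return prefix, suffix
+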
+Secondly, the term "Vectoriseable" was used. This refers to "instructions
+which if SVP64-Prefixed are actually meaningful". `sc` is meaningless
+to Vectorise, for example, as are `sync` and `mtmsr` (there is only ever
+going to be one MSR).
+
+The problem comes if the rationale "if unused, Unvectoriseable opcodes
+can therefore be allocated to alternative instructions inside the SVP64
+Opcode space" is applied:
+this unfortunately results in huge, inadvisable complexity in HDL at the
+Decode Phase, which must attempt to discern between the two types. Worse
+than that,
+if the alternate 64-bit instruction is Vectoriseable but the 32-bit Scalar
+"Defined Word" is already allocated, how can there ever be a Scalar version
+of the alternate instruction? It would have to be added as a **completely
+different** 32-bit "Defined Word", and things go rapidly downhill in the
+Decoder as well as the ISA from there.
+
+Therefore, to avoid risk and long-term damage to the Power ISA:
+
+* *even Unvectoriseable* "Defined Words" (`mtmsr`) must have the
+  corresponding SVP64 Prefixed Space `RESERVED`, permanently requiring
+  Illegal Instruction to be raised (the 64-bit encoding corresponding
+  to an illegal `sv.mtmsr`, if ever incorrectly attempted, must be
+  **defined** to raise an Exception)
+* *Even instructions that may not be Scalar* (although for various
+  practical reasons this is extremely rare if not impossible,
+  and in any case generally "strongly discouraged"),
+  which have no meaning or use as a 32-bit Scalar "Defined Word", **must**
+  still have the Scalar "Defined Word" `RESERVED` in the scalar
+  opcode space, as an Illegal Instruction.
+
+A good example of the former is `mtmsr`, because there is only one
+MSR register (`sv.mtmsr` is meaningless, as is `sv.sc`),
+and a good example of the latter is [[sv/mv.x]],
+which is so deeply problematic to add to any Scalar ISA that it was
+rejected outright and an alternative route taken (Indexed REMAP).
+
+Another good example would be Cross Product, which has no meaning
+at all in a Scalar ISA (Cross Product as a concept only applies
+to Mathematical Vectors). If any such Vector operation were ever added,
+it would be **critically** important to reserve the exact same *Scalar*
+opcode with the exact same "Defined Word" in the *Scalar* Power ISA
+opcode space, as an Illegal Instruction. There are
+good reasons why Cross Product has not been proposed, but it serves
+to illustrate the point as far as Architectural Resource Allocation is
+concerned.
+
+The bottom line is that whilst this seems wasteful, the alternatives are a
+destabilisation of the Power ISA and impractically-complex Hardware
+Decoders. 
With the Scalar Power ISA (v3.0, v3.1) already being comprehensive +in the number of instructions, keeping further Decode complexity down is a +high priority. + +# Other Scalable Vector ISAs + +These Scalable Vector ISAs are listed to aid in understanding and +context of what is involved. + +* Original Cray ISA + +* NEC SX Aurora (still in production, inspired by Cray) + +* RISC-V RVV (inspired by Cray) + +* MRISC32 ISA Manual (under active development) + +* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from + Mitch on request. + +A comprehensive list of 3D GPU, Packed SIMD, Predicated-SIMD and true Scalable +Vector ISAs may be found at the [[sv/vector_isa_comparison]] page. +Note: AVX-512 and SVE2 are *not Vector ISAs*, they are Predicated-SIMD. +*Public discussions have taken place at Conferences attended by both Intel +and ARM on adding a `setvl` instruction which would easily make both +AVX-512 and SVE2 truly "Scalable".* [[sv/comparison_table]] in tabular +form. + +# Major opcodes summary + +Simple-V itself only requires six instructions with 6-bit Minor XO +(bits 26-31), and the SVP64 Prefix Encoding requires +25% space of the EXT001 Major Opcode. +There are **no** Vector Instructions and consequently **no further +opcode space is required**. Even though they are currently +placed in the EXT022 Sandbox, the "Management" instructions +(setvl, svstep, svremap, svshape, svindex) are designed to fit +cleanly into EXT019 (exactly like `addpcis`) or other 5/6-bit Minor +XO area (bits 25-31) that has space for Rc=1. + +That said: for the target workloads for which Scalable Vectors are typically +used, the Scalar ISA on which those workloads critically rely +is somewhat anaemic. +The Libre-SOC Team has therefore been addressing that by developing +a number of Scalar instructions in specialist areas (Big Integer, +Cryptography, 3D, Audio/Video, DSP) and it is these which require +considerable Scalar opcode space. + +Please be advised that even though SV is entirely DRAFT status, there is considerable concern that because there is not yet any two-way day-to-day communication established with the OPF ISA WG, we have no idea if any of these are conflicting with future plans by any OPF -Members. **The External ISA WG RFC Process is yet to be ratified -and Libre-SOC may not join the OPF as an entity because it does +Members. **The External ISA WG RFC Process has now been ratified +but Libre-SOC may not join the OPF as an entity because it does not exist except in name. Even if it existed it would be a conflict of interest to join the OPF, due to our funding remit from NLnet**. We therefore proceed on the basis of making public the intention to @@ -87,21 +405,36 @@ submit RFCs once the External ISA WG RFC Process is in place and, in a wholly unsatisfactory manner have to *hope and trust* that OPF ISA WG Members are reading this and take it into consideration. +**Scalar Summary** + +As in above sections, it is emphasised strongly that Simple-V in no +way critically depends on the 100 or so *Scalar* instructions also +being developed by Libre-SOC. + **None of these Draft opcodes are intended for private custom secret proprietary usage. They are all intended for entirely public, upstream, high-profile mass-volume day-to-day usage at the same level as add, popcnt and fld** -* SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1) * bitmanip requires two major opcodes (due to 16+ bit immediates) those are currently EXT022 and EXT05. 
 * brownfield encoding in one of those two major opcodes still requires
   multiple VA-Form operations (in greater numbers
   than EXT04 has spare)
 * space in EXT019 next to addpcis and crops is recommended
+  (or any other 5-6 bit Minor XO areas)
 * many X-Form opcodes currently in EXT022 have no preference for a location
   at all, and may be moved to EXT059, EXT019, EXT031 or other much
   more suitable location.
+* even if ratified, and even if the majority (mostly X-Form)
+  is moved to other locations, the large immediate sizes of
+  the remaining bitmanip instructions mean
+  it is highly likely that these remaining instructions would need two
+  major opcodes. Fortuitously, the v3.1 Spec states that
+  both EXT005 and EXT009 are
+  available.
+
+**Additional observations**
 
 Note that there is no Sandbox allocation in the published ISA Spec
 for v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
@@ -111,70 +444,33 @@ situation is a high priority which in turn by necessity puts
 pressure on the 32-bit Major Opcode space.
 
 Note also that EXT022, the Official Architectural Sandbox area
+available for "Custom non-approved purposes" according to the Power
+ISA Spec,
 is under severe design pressure as it is insufficient to hold
 the full extent of the instruction additions required to create
-a Hybrid 3D CPU-VPU-GPU.
-
-**Whilst SVP64 is only 4 instructions
+a Hybrid 3D CPU-VPU-GPU. Although the wording of the Power ISA
+Specification leaves open the *possibility* of not needing to
+propose ISA Extensions to the ISA WG, it is clear that EXT022
+is an inappropriate location for a large high-profile Extension
+intended for mass-volume product deployment. Every in-good-faith effort will
+therefore be made to work with the OPF ISA WG to
+submit SVP64 via the External RFC Process.
+
+**Whilst SVP64 is only 6 instructions,
 the heavy focus on VSX for the past 12 years has left the SFFS Level
-anaemic and out-of-date compared to ARM and x86. Approximately
-100 additional Scalar Instructions are up for proposal**
-
-# Sub-pages
-
-Pages being developed and examples
+anaemic and out-of-date compared to ARM and x86.**
+This is very much
+a blessing, as the Scalar ISA has remained clean, making it
+highly suited to RISC-paradigm Scalable Vector Prefixing. Approximately
+100 additional (optional) Scalar Instructions are up for proposal to bring SFFS
+up-to-date. None of them require or depend on PackedSIMD VSX (or VMX).
 
-* [[sv/overview]] explaining the basics.
-* [[sv/compliancy_levels]] for minimum subsets through to Advanced
-  Supercomputing.
-* [[sv/implementation]] implementation planning and coordination
-* [[sv/svp64]] contains the packet-format *only*, the [[sv/svp64/appendix]]
-  contains explanations and further details
-* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
-* [[opcode_regs_deduped]] autogenerated table of SVP64 instructions
-* [[sv/sprs]] SPRs
-* SVP64 "Modes":
-  - For condition register operations see [[sv/cr_ops]] - SVP64 Condition
-    Register ops: Guidelines
-    on Vectorisation of any v3.0B base operations which return
-    or modify a Condition Register bit or field.
-  - For LD/ST Modes, see [[sv/ldst]].
- - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch - behaviour: All/Some Vector CRs - - For arithmetic and logical, see [[sv/normal]] - -Core SVP64 instructions: - -* [[sv/setvl]] the Cray-style "Vector Length" instruction -* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing" -* [[sv/svstep]] Key stepping instruction for Vertical-First Mode - -Vector-related: - -* [[sv/vector_swizzle]] -* [[sv/mv.vec]] pack/unpack move to and from vec2/3/4 -* [[sv/mv.swizzle]] -* [[sv/vector_ops]] scalar operations needed for supporting vectors - -Scalar Instructions: - -* [[sv/cr_int_predication]] instructions needed for effective predication -* [[sv/bitmanip]] -* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32) -* [[sv/fclass]] detect class of FP numbers -* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX -* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA -* [[sv/av_opcodes]] scalar opcodes for Audio/Video -* Twin targetted instructions (two registers out, one implicit) - Explanation of the rules for twin register targets - (implicit RS, FRS) explained in SVP64 [[sv/svp64/appendix]] - - [[isa/svfixedarith]] - - [[isa/svfparith]] - - [[sv/biginteger]] Operations that help with big arithmetic -* TODO: OpenPOWER adaptation [[openpower/transcendentals]] +# Other Examples experiments future ideas discussion: +* [Scalar register access](https://bugs.libre-soc.org/show_bug.cgi?id=905) + above r31 and CR7. * [[sv/propagation]] Context propagation including svp64, swizzle and remap * [[sv/masked_vector_chaining]] * [[sv/discussion]] @@ -190,145 +486,9 @@ Examples experiments future ideas discussion: Additional links: * +* [[sv/vector_isa_comparison]] - a list of Packed SIMD, GPU, + and other Scalable Vector ISAs +* [[sv/comparison_table]] - a one-off (experimental) table comparing ISAs * [[simple_v_extension]] old (deprecated) version * [[openpower/sv/llvm]] -* [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]] - -=== - -Required Background Reading: -============================ - -These are all, deep breath, basically... required reading, *as well as -and in addition* to a full and comprehensive deep technical understanding -of the Power ISA, in order to understand the depth and background on -SVP64 as a 3D GPU and VPU Extension. - -I am keenly aware that each of them is 300 to 1,000 pages (just like -the Power ISA itself). - -This is just how it is. - -Given the sheer overwhelming size and scope of SVP64 we have gone to -**considerable lengths** to provide justification and rationalisation for -adding the various sub-extensions to the Base Scalar Power ISA. - -* Scalar bitmanipulation is justifiable for the exact same reasons the - extensions are justifiable for other ISAs. The additional justification - for their inclusion where some instructions are already (sort-of) present - in VSX is that VSX is not mandatory, and the complexity of implementation - of VSX is too high a price to pay at the Embedded SFFS Compliancy Level. -* Scalar FP-to-INT conversions, likewise. ARM has a javascript conversion - instruction, Power ISA does not (and it costs a ridiculous 45 instructions - to implement, including 6 branches!) -* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable - for High-Performance Compute workloads. - -It also has to be pointed out that normally this work would be covered by -multiple separate full-time Workgroups with multiple Members contributing -their time and resources. 
- -Overall the contributions that we are developing take the Power ISA out of -the specialist highly-focussed market it is presently best known for, and -expands it into areas with much wider general adoption and broader uses. - - ---- - -OpenCL specifications are linked here, these are relevant when we get -to a 3D GPU / High Performance Compute ISA WG RFC: -[[openpower/transcendentals]] - -(Failure to add Transcendentals to a 3D GPU is directly equivalent to -*willfully* designing a product that is 100% destined for commercial -rejection, due to the extremely high competitive performance/watt achieved -by today's mass-volume GPUs.) - -I mention these because they will be encountered in every single -commercial GPU ISA, but they're not part of the "Base" (core design) -of a Vector Processor. Transcendentals can be added as a sub-RFC. - ---- - -SIMD ISAs commonly mistaken for Vector: ---------------------------------------- - -There is considerable confusion surrounding Vector ISAs -because of a mis-use of the word "Vector" in most -well-known Packed SIMD ISAs. - -* PackedSIMD VSX. VSX, which has the word "Vector" in its name, - is "inspired" by Vector Processing - but has no "Scaling" capability, and no Predicate masking. - Adding Predicate Masks to the PackedSIMD VSX ISA - would effectively double the number of PackedSIMD - instructions (750 becomes 1,500) -* [AVX / AVX2 / AVX128 / AVX256 / AVX512](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) - again has the word "Vector" in its name but this in no - way makes it a Vector ISA. None of the AVX-\* family - are "Scalable" however there is at least Predicate Masking - in AVX-512. -* ARM NEON - accurately described as a Packed SIMD ISA in - all literature. -* ARM SVE / SVE2 - accurately described as a Scalable Vector - ISA, but the "Scaling" is, rather unfortunately, a parameter - that is chosen by the *Hardware Architect*, rather than - the programmer. This has resulted in programmers writing - multiple variants of hand-coded assembler in order - to target different machines with different hardware widths, - going directly against the advice given on ARM's developer - documentation. - - -Actual 3D GPU Architectures and ISAs: -------------------------------------- - -* Broadcom Videocore - -* Etnaviv - -* Nyuzi - -* MALI - -* AMD - - -* MIAOW which is *NOT* a 3D GPU, it is a processor which happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU" - - - -Actual Scalar Vector Processor Architectures and ISAs: ------------------------------------------------------- - -* NEC SX Aurora - -* Cray ISA - -* RISC-V RVV - -* MRISC32 ISA Manual (under active development) - -* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from - Mitch on direct contact with him. It is a different approach from the - others, which may be termed "Cray-Style Horizontal-First" Vectorisation. - 66000 is a *Vertical-First* Vector ISA. - -The term Horizontal or Vertical alludes to the Matrix "Row-First" or -"Column-First" technique, where: - -* Horizontal-First processes all elements in a Vector before moving on - to the next instruction -* Vertical-First processes *ONE* element per instruction, and requires - loop constructs to explicitly step to the next element. - -Vector-type Support by Architecture -[[!table data=""" -Architecture | Horizontal | Vertical -MyISA 66000 | | X -Cray | X | -SX Aurora | X | -RVV | X | -SVP64 | X | X -"""]]
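+
+To illustrate the Horizontal-First / Vertical-First distinction referred
+to above, a small Python sketch follows. It is an illustration only
+(predication and `svstep` semantics are greatly simplified, and the
+variable names are placeholders, not Specification pseudocode):
+
+    VL = 4
+    b, c = [1, 2, 3, 4], [10, 20, 30, 40]
+    a, d = [0] * VL, [0] * VL
+
+    # Horizontal-First: each instruction processes *all* VL elements
+    # before the next instruction begins.
+    for i in range(VL):
+        a[i] = b[i] + c[i]      # one Vectorised add, all elements
+    for i in range(VL):
+        d[i] = a[i] * 2         # then one Vectorised multiply, all elements
+
+    # Vertical-First: *one* element per instruction; an explicit step
+    # (svstep) advances to the next element.
+    for i in range(VL):         # the loop construct is explicit
+        a[i] = b[i] + c[i]      # element i of the add
+        d[i] = a[i] * 2         # element i of the multiply
+                                # svstep: move on to element i+1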