[[!tag standards]]
# SV (Simple Vectorisation) for the Power ISA
**SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.
SV is designed as a Scalable Vector ISA for Hybrid 3D CPU-GPU-VPU workloads.
As such it brings features normally only found in Cray Supercomputers
(Cray-1, NEC SX-Aurora)
and in GPUs, but keeps strictly to a *Simple* RISC principle of leveraging
a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual
explicit Vector opcode exists in SV, at all**.
Fundamental design principles:
* Simplicity of introduction and implementation on the existing Power ISA
* Effectively a hardware for-loop, pausing the PC and issuing multiple scalar
operations
* Preserving the underlying scalar execution dependencies as if the
for-loop had been expanded as actual scalar instructions
(termed "preserving Program Order")
* Augments ("tags") existing instructions, providing Vectorisation
"context" rather than adding new ones.
* Does not modify or deviate from the underlying scalar Power ISA
unless it provides significant performance or other advantage to do so
in the Vector space (dropping XER.SO for example)
* Designed for Supercomputing: avoids creating significant sequential
dependency hazards, allowing high performance superscalar
microarchitectures to be deployed.
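The "hardware for-loop" principle above can be sketched in a few lines of
Python. This is a hedged illustration only (register file modelled as a plain
list, a toy `add`, names invented here), not any actual implementation:

```python
def sv_add(regfile, rt, ra, rb, VL):
    # One prefixed scalar 'add' issues VL element-level sub-instructions,
    # in Program Order, exactly as if the loop had been expanded into
    # VL actual scalar 'add' instructions.
    for i in range(VL):
        regfile[rt + i] = regfile[ra + i] + regfile[rb + i]

regs = list(range(32))
sv_add(regs, 16, 0, 8, 4)   # regs[16..19] = regs[0..3] + regs[8..11]
```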
Advantages of these design principles:
* It is therefore easy to create a first (and sometimes only)
implementation as literally a for-loop in hardware, simulators, and
compilers.
* Hardware Architects may understand and implement SV as being an
extra pipeline stage, inserted between decode and issue, that is
a simple for-loop issuing element-level sub-instructions.
* More complex HDL implementations can be built by repeating existing scalar
ALUs and pipelines as blocks, leveraging existing Multi-Issue Infrastructure
* As (mostly) a high-level "context" that does not (significantly) deviate
from the scalar Power ISA, being in its purest form "a for-loop around
scalar instructions", it is minimally-disruptive and consequently stands
a reasonable chance of broad community adoption and acceptance
* Completely wipes opcode proliferation off the map: not just for SIMD
(which suffers O(N^6) opcode proliferation)
but for Vectorisation ISAs as well. No more separate Vector
instructions.
Comparative instruction count:
* ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
* ARM SVE: around 4,000 instructions, prerequisite: NEON.
* ARM SVE2: around 1,000 instructions, prerequisite: SVE
* Intel AVX-512: around 4,000 instructions, prerequisite AVX2 etc.
* RISC-V RVV: 192 instructions, prerequisite 96 Scalar RV64GC instructions
* SVP64: **four** instructions, 24-bit prefixing of the
prerequisite SFS (150 instructions) or
SFFS (214 instructions) Compliancy Subsets
SV comprises several [[sv/compliancy_levels]] suited to Embedded, Energy
efficient High-Performance Compute, Distributed Computing and Advanced
Computational Supercomputing. The Compliancy Levels are arranged such
that even at the bare minimum Level, full Soft-Emulation of all
optional and future features is possible.
# Major opcodes summary
Please be advised that even though everything below is entirely DRAFT status,
there is considerable concern: because there is not yet any two-way
day-to-day communication established with the OPF ISA WG, we have
no idea whether any of these conflict with future plans by any OPF
Members. **The External ISA WG RFC Process is yet to be ratified,
and Libre-SOC may not join the OPF as an entity because it does
not exist except in name. Even if it existed, it would be a conflict
of interest to join the OPF, due to our funding remit from NLnet**.
We therefore proceed on the basis of making public the intention to
submit RFCs once the External ISA WG RFC Process is in place and,
in a wholly unsatisfactory manner, have to *hope and trust* that
OPF ISA WG Members are reading this and take it into consideration.
**None of these Draft opcodes are intended for private custom
secret proprietary usage. They are all intended for entirely
public, upstream, high-profile mass-volume day-to-day usage at the
same level as add, popcnt and fld**
* SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1)
* bitmanip requires two major opcodes (due to 16+ bit immediates);
those are currently EXT022 and EXT05.
* brownfield encoding in one of those two major opcodes still
requires multiple VA-Form operations (in greater numbers
than EXT04 has spare)
* space in EXT019 next to addpcis and crops is recommended
* many X-Form opcodes currently in EXT022 have no preference
for a location at all, and may be moved to EXT059, EXT019,
EXT031 or another, much more suitable, location.
Note that there is no Sandbox allocation in the published ISA Spec for
v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
Prefixed-Prefixed-instructions (SVP64 Prefixed v3.1 Prefixed)
would become a whopping 96-bit long instruction. Avoiding this
situation is a high priority which in turn by necessity puts pressure
on the 32-bit Major Opcode space.
Note also that EXT022, the Official Architectural Sandbox area
is under severe design pressure as it is insufficient to hold
the full extent of the instruction additions required to create
a Hybrid 3D CPU-VPU-GPU.
**Whilst SVP64 is only 4 instructions,
the heavy focus on VSX for the past 12 years has left the SFFS Level
anaemic and out-of-date compared to ARM and x86. Approximately
100 additional Scalar Instructions are up for proposal**
# Sub-pages
Pages being developed, and examples:
* [[sv/overview]] explaining the basics.
* [[sv/compliancy_levels]] for minimum subsets through to Advanced
Supercomputing.
* [[sv/implementation]] implementation planning and coordination
* [[sv/svp64]] contains the packet-format *only*, the [[sv/svp64/appendix]]
contains explanations and further details
* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
* [[opcode_regs_deduped]] autogenerated table of SVP64 instructions
* [[sv/sprs]] SPRs
* SVP64 "Modes":
- For condition register operations see [[sv/cr_ops]] - SVP64 Condition
Register ops: Guidelines
on Vectorisation of any v3.0B base operations which return
or modify a Condition Register bit or field.
- For LD/ST Modes, see [[sv/ldst]].
- For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
behaviour: All/Some Vector CRs
- For arithmetic and logical, see [[sv/normal]]
Core SVP64 instructions:
* [[sv/setvl]] the Cray-style "Vector Length" instruction
* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
* [[sv/svstep]] Key stepping instruction for Vertical-First Mode
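As a rough guide to what the Cray-style "Vector Length" instruction enables,
here is the classic strip-mining pattern sketched in Python. The `MAXVL`
value and loop shape are illustrative assumptions, not SVP64 encoding
details (see [[sv/setvl]] for the real semantics):

```python
def setvl(requested_vl, MAXVL):
    # Cray-style: hardware grants at most MAXVL elements per pass.
    return min(requested_vl, MAXVL)

# Classic strip-mined loop over n elements:
n, MAXVL, done = 10, 4, 0
granted = []
while done < n:
    VL = setvl(n - done, MAXVL)   # ask for the remainder, get <= MAXVL
    # ... process elements done .. done+VL-1 with Vectorised scalar ops ...
    granted.append(VL)
    done += VL
```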
Vector-related:
* [[sv/vector_swizzle]]
* [[sv/mv.vec]] pack/unpack move to and from vec2/3/4
* [[sv/mv.swizzle]]
* [[sv/vector_ops]] scalar operations needed for supporting vectors
Scalar Instructions:
* [[sv/cr_int_predication]] instructions needed for effective predication
* [[sv/bitmanip]]
* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
* [[sv/fclass]] detect class of FP numbers
* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
* [[sv/av_opcodes]] scalar opcodes for Audio/Video
* Twin-targetted instructions (two registers out, one implicit).
The rules for twin register targets
(implicit RS, FRS) are explained in the SVP64 [[sv/svp64/appendix]]
- [[isa/svfixedarith]]
- [[isa/svfparith]]
- [[sv/biginteger]] Operations that help with big arithmetic
* TODO: OpenPOWER adaptation [[openpower/transcendentals]]
Examples, experiments, future ideas and discussion:
* [[sv/propagation]] Context propagation including svp64, swizzle and remap
* [[sv/masked_vector_chaining]]
* [[sv/discussion]]
* [[sv/example_dep_matrices]]
* [[sv/major_opcode_allocation]]
* [[sv/byteswap]]
* [[sv/16_bit_compressed]] experimental
* [[sv/toc_data_pointer]] experimental
* [[sv/predication]] discussion on predication concepts
* [[sv/register_type_tags]]
* [[sv/mv.x]] deprecated in favour of Indexed REMAP
Additional links:
* [[simple_v_extension]] old (deprecated) version
* [[openpower/sv/llvm]]
* [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]]
===
Required Background Reading:
============================
These are all, deep breath, basically... required reading, *as well as
and in addition* to a full and comprehensive deep technical understanding
of the Power ISA, in order to understand the depth and background on
SVP64 as a 3D GPU and VPU Extension.
I am keenly aware that each of them is 300 to 1,000 pages (just like
the Power ISA itself).
This is just how it is.
Given the sheer overwhelming size and scope of SVP64 we have gone to
**considerable lengths** to provide justification and rationalisation for
adding the various sub-extensions to the Base Scalar Power ISA.
* Scalar bitmanipulation is justifiable for the exact same reasons the
extensions are justifiable for other ISAs. The additional justification
for their inclusion where some instructions are already (sort-of) present
in VSX is that VSX is not mandatory, and the complexity of implementation
of VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
* Scalar FP-to-INT conversions, likewise. ARM has a javascript conversion
instruction, Power ISA does not (and it costs a ridiculous 45 instructions
to implement, including 6 branches!)
* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable
for High-Performance Compute workloads.
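For reference, the "javascript conversion" mentioned above (ARM's FJCVTZS)
implements ECMAScript's ToInt32 semantics. A hedged Python sketch of those
semantics follows (illustrating the conversion rule itself, not any Power ISA
instruction):

```python
import math

def js_to_int32(x: float) -> int:
    # ECMAScript ToInt32: NaN and infinities map to 0; otherwise
    # truncate toward zero, then wrap modulo 2**32 into the
    # signed 32-bit range.
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x) % (1 << 32)
    return n - (1 << 32) if n >= (1 << 31) else n
```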
It also has to be pointed out that normally this work would be covered by
multiple separate full-time Workgroups with multiple Members contributing
their time and resources.
Overall the contributions that we are developing take the Power ISA out of
the specialist highly-focussed market it is presently best known for, and
expands it into areas with much wider general adoption and broader uses.
---
OpenCL specifications are linked here; these are relevant when we get
to a 3D GPU / High Performance Compute ISA WG RFC:
[[openpower/transcendentals]]
(Failure to add Transcendentals to a 3D GPU is directly equivalent to
*willfully* designing a product that is 100% destined for commercial
rejection, due to the extremely high competitive performance/watt achieved
by today's mass-volume GPUs.)
I mention these because they will be encountered in every single
commercial GPU ISA, but they're not part of the "Base" (core design)
of a Vector Processor. Transcendentals can be added as a sub-RFC.
---
SIMD ISAs commonly mistaken for Vector:
---------------------------------------
There is considerable confusion surrounding Vector ISAs
because of a mis-use of the word "Vector" in most
well-known Packed SIMD ISAs.
* PackedSIMD VSX. VSX has the word "Vector" in its name
and is "inspired" by Vector Processing,
but has no "Scaling" capability and no Predicate Masking.
Adding Predicate Masks to the PackedSIMD VSX ISA
would effectively double the number of PackedSIMD
instructions (750 becomes 1,500)
* [AVX / AVX2 / AVX128 / AVX256 / AVX512](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
again has the word "Vector" in its name, but this in no
way makes it a Vector ISA. None of the AVX-\* family
are "Scalable"; however, there is at least Predicate Masking
in AVX-512.
* ARM NEON - accurately described as a Packed SIMD ISA in
all literature.
* ARM SVE / SVE2 - accurately described as a Scalable Vector
ISA, but the "Scaling" is, rather unfortunately, a parameter
that is chosen by the *Hardware Architect*, rather than
the programmer. This has resulted in programmers writing
multiple variants of hand-coded assembler in order
to target different machines with different hardware widths,
going directly against the advice given on ARM's developer
documentation.
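The Predicate Masking that the list above keeps referring to is simply a
per-element enable bit. A hedged sketch (mask value and element count are
arbitrary, chosen only for illustration):

```python
VL = 4
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [0] * VL
mask = 0b0101          # bit i set => element i is processed

for i in range(VL):
    if (mask >> i) & 1:
        c[i] = a[i] + b[i]   # masked-out elements are left untouched
```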
Actual 3D GPU Architectures and ISAs:
-------------------------------------
* Broadcom Videocore
* Etnaviv
* Nyuzi
* MALI
* AMD
* MIAOW which is *NOT* a 3D GPU, it is a processor which happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
Actual Scalar Vector Processor Architectures and ISAs:
------------------------------------------------------
* NEC SX Aurora
* Cray ISA
* RISC-V RVV
* MRISC32 ISA Manual (under active development)
* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from
Mitch on direct contact with him. It takes a different approach from the
others, which may all be termed "Cray-Style Horizontal-First" Vectorisation:
the 66000 is a *Vertical-First* Vector ISA.
The term Horizontal or Vertical alludes to the Matrix "Row-First" or
"Column-First" technique, where:
* Horizontal-First processes all elements in a Vector before moving on
to the next instruction
* Vertical-First processes *ONE* element per instruction, and requires
loop constructs to explicitly step to the next element.
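The distinction can be sketched in Python; the loop shapes are illustrative
only (SVP64's actual element-stepping behaviour is described in
[[sv/svstep]]):

```python
VL = 3
a = [1, 2, 3]
b = [10, 20, 30]

# Horizontal-First: one Vector instruction processes ALL elements
# before the PC advances to the next instruction.
c = [a[i] + b[i] for i in range(VL)]

# Vertical-First: each pass processes ONE element; an explicit
# step (svstep-style) advances to the next element, and a branch
# closes the loop.
d = [0] * VL
i = 0
while i < VL:
    d[i] = a[i] + b[i]   # one element per instruction
    i += 1               # explicit step to the next element
```

Both schedules compute the same result; they differ only in the order in
which elements and instructions are interleaved.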
Vector-type Support by Architecture
[[!table data="""
Architecture | Horizontal | Vertical
MyISA 66000 | | X
Cray | X |
SX Aurora | X |
RVV | X |
SVP64 | X | X
"""]]