X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv.mdwn;h=23b9e811b5641af5f37dc558f0e0a4f80e51dd79;hb=83e17db9000ab78bc559eed77c9f06743551bd18;hp=f91319e55b9844cd3cd5bce1bf3a845dadb008c4;hpb=dc3a9448773ac40994efeb35db63a473b794b122;p=libreriscv.git diff --git a/openpower/sv.mdwn b/openpower/sv.mdwn index f91319e55..23b9e811b 100644 --- a/openpower/sv.mdwn +++ b/openpower/sv.mdwn @@ -4,40 +4,63 @@ Obligatory Dilbert: -=== +Links: -# SV (Simple Vectorisation) for the Power ISA +* +* walkthrough video (19jun2022) +* + PDF version of this DRAFT specification **SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review. - +=== + +# Scalable Vectors for the Power ISA -SV is designed as a Scalable Vector ISA for Hybrid 3D CPU GPU VPU workloads. +SV is designed as a strict RISC-paradigm +Scalable Vector ISA for Hybrid 3D CPU GPU VPU workloads. As such it brings features normally only found in Cray Supercomputers (Cray-1, NEC SX-Aurora) and in GPUs, but keeps strictly to a *Simple* RISC principle of leveraging a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual -explicit Vector opcode exists in SV, at all**. +explicit Vector opcode exists in SV, at all**. It is suitable for +low-power Embedded and DSP Workloads as much as it is for power-efficient +Supercomputing. Fundamental design principles: -* Simplicity of introduction and implementation on the existing Power ISA +* Taking the simplicity of the RISC paradigm and applying it strictly and + uniformly to create a Scalable Vector ISA. * Effectively a hardware for-loop, pausing PC, issuing multiple scalar operations * Preserving the underlying scalar execution dependencies as if the for-loop had been expanded as actual scalar instructions (termed "preserving Program Order") +* Specifically designed to be Precise-Interruptible at all times + (many Vector ISAs have operations which, due to higher internal + accuracy or other complexity, must be effectively atomic only for + the full Vector operation's duration, adversely affecting interrupt + response latency, or be abandoned and started again) * Augments ("tags") existing instructions, providing Vectorisation - "context" rather than adding new ones. -* Does not modify or deviate from the underlying scalar Power ISA + "context" rather than adding new instructions. +* Strictly does not interfere with or alter the non-Scalable Power ISA + in any way +* In the Prefix space, does not modify or deviate from the underlying + scalar Power ISA unless it provides significant performance or other advantage to do so - in the Vector space (dropping XER.SO for example) + in the Vector space (dropping the "sticky" characteristics + of XER.SO and CR0.SO for example) * Designed for Supercomputing: avoids creating significant sequential - dependency hazards, allowing high performance superscalar - microarchitectures to be deployed. + dependency hazards, allowing standard + high performance superscalar multi-issue + micro-architectures to be leveraged. +* Divided into Compliancy Levels to reduce cost of implementation for + specific needs. Advantages of these design principles: +* Simplicity of introduction and implementation on top of + the existing Power ISA without disruption. * It is therefore easy to create a first (and sometimes only) implementation as literally a for-loop in hardware, simulators, and compilers. 
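+
+To make the "hardware for-loop" principle concrete, below is a minimal
+Python sketch of how a simulator might interpret a single Vectorised
+(Prefixed) scalar operation. It is an illustration only: predication,
+element-width overrides and register-file details are omitted, and the
+names (`element_op`, `regs`, `VL`) are placeholders rather than the
+Specification's pseudocode.
+
+    # Conceptual sketch: one SVP64-prefixed scalar instruction behaves
+    # as if the scalar operation had been repeated VL times over
+    # sequentially-numbered registers, in strict Program Order, before
+    # the PC advances to the next instruction.
+    def execute_prefixed(element_op, RT, RA, RB, VL, regs):
+        for i in range(VL):                  # the "hardware for-loop"
+            regs[RT + i] = element_op(regs[RA + i], regs[RB + i])
+        # only now does the PC move on
+
+    # example: a Vectorised integer add over four elements
+    regs = list(range(32))                   # toy scalar register file
+    execute_prefixed(lambda a, b: a + b, RT=8, RA=16, RB=24, VL=4, regs=regs)
+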
@@ -58,13 +81,41 @@ Advantages of these design principles:
 
 Comparative instruction count:
 
 * ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
-* ARM SVE: around 4,000 instructions, prerequisite: NEON.
-* ARM SVE2: around 1,000 instructions, prerequisite: SVE
-* Intel AVX-512: around 4,000 instructions, prerequisite AVX2 etc.
+* ARM SVE: around 4,000 instructions, prerequisite: NEON and ARM Scalar
+* ARM SVE2: around 1,000 instructions, prerequisite: SVE, NEON, and
+  ARM Scalar, for a grand total of well over 7,000 instructions.
+* Intel AVX-512: around 4,000 instructions, prerequisite AVX, AVX2,
+  AVX-128 and AVX-256, which in turn critically rely on the rest of
+  x86, for a grand total of well over 10,000 instructions.
 * RISC-V RVV: 192 instructions, prerequisite 96 Scalar RV64GC instructions
-* SVP64: **four** instructions, 24-bit prefixing of
+* SVP64: **six** instructions, two of which are in the same space
+  (svshape, svshape2), with 24-bit prefixing of
   prerequisite SFS (150) or
-  SFFS (214) Compliancy Subsets
+  SFFS (214) Compliancy Subsets.
+  **There are no dedicated Vector instructions, only Scalar-prefixed**.
+
+Comparative Basic Design Principle:
+
+* ARM NEON and VSX: PackedSIMD. No instruction-overloaded meaning
+  (every instruction is unique for a given register bitwidth,
+  guaranteeing binary interoperability)
+* Intel AVX-512 (and below): Hybrid Packed-Predicated SIMD with no
+  instruction-overloading, guaranteeing binary interoperability
+  but at the same time penalising the ISA with runaway
+  opcode proliferation.
+* ARM SVE/SVE2: Hybrid Packed-Predicated SIMD with instruction-overloading
+  that destroys binary interoperability. This is hidden behind the
+  misuse of the word "Scalable" and is **permitted under License**
+  by "Silicon Partners".
+* RISC-V RVV: Cray-style Scalable Vector but with instruction-overloading
+  **permitted by the specification** that destroys binary interoperability.
+* SVP64: Cray-style Scalable Vector with no instruction-overloaded
+  meanings. The regfile numbers and bitwidths shall **not** change
+  in a future revision (for the same instruction encoding):
+  "Silicon Partner" Scaling is prohibited,
+  in order to guarantee binary interoperability. Future revisions
+  of SVP64 may extend VSX instructions to achieve larger regfiles, and
+  non-interoperability there will likewise be prohibited.
 
 SV comprises several [[sv/compliancy_levels]] suited to Embedded, Energy
 efficient High-Performance Compute, Distributed Computing and Advanced
@@ -72,14 +123,281 @@ Computational Supercomputing. The Compliancy Levels are arranged such
 that even at the bare minimum Level, full Soft-Emulation of all optional
 and future features is possible.
 
-# Major opcodes summary
+# Sub-pages
+
+Pages being developed, and examples:
 
-Please be advised that even though below is entirely DRAFT status, there
+* [[sv/executive_summary]]
+* [[sv/overview]] explaining the basics.
+* [[sv/compliancy_levels]] for minimum subsets through to Advanced
+  Supercomputing.
+* [[sv/implementation]] implementation planning and coordination
+* [[sv/po9_encoding]] a new DRAFT 64-bit space similar to EXT1xx,
+  introducing new areas EXT232-63 and EXT300-363
+* [[sv/svp64]] contains the packet-format *only*, the [[svp64/appendix]]
+  contains explanations and further details
+* [[sv/svp64-single]] still under development
+* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
+  or are not immediately apparent despite the RISC paradigm
+* [[opcode_regs_deduped]] autogenerated table of SVP64 decoder augmentation
+* [[sv/sprs]] SPRs
+* [[sv/rfc]] RFCs to the [OPF ISA WG](https://openpower.foundation/isarfc/)
+
+SVP64 "Modes":
+
+* For condition register operations see [[sv/cr_ops]] - SVP64 Condition
+  Register ops: Guidelines
+  on Vectorisation of any v3.0B base operations which return
+  or modify a Condition Register bit or field.
+* For LD/ST Modes, see [[sv/ldst]].
+* For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
+  behaviour: All/Some Vector CRs
+* For arithmetic and logical, see [[sv/normal]]
+* [[sv/mv.vec]] pack/unpack move to and from vec2/3/4,
+  actually an RM.EXTRA Mode and a [[sv/remap]] mode
+
+Core SVP64 instructions:
+
+* [[sv/setvl]] the Cray-style "Vector Length" instruction
+* svremap, svindex and svshape: part of [[sv/remap]] "Remapping" for
+  Matrix Multiply, DCT/FFT and RGB-style "Structure Packing"
+  as well as general-purpose Indexing. Also describes associated SPRs.
+* [[sv/svstep]] Key stepping instruction, primarily for
+  Vertical-First Mode and also providing traditional "Vector Iota"
+  capability.
+
+*Please note: there are only six instructions in the whole of SV.
+Beyond this point are additional **Scalar** instructions related to
+specific workloads that have nothing to do with the SV Specification*
+
+# Stability Guarantees in Simple-V
+
+Providing long-term stability in an ISA is extremely challenging
+but critically important.
+It requires certain guarantees to be provided.
+
+* Firstly, that instructions will never be ambiguously defined.
+* Secondly, that no instruction shall change meaning to produce
+  different results on different hardware (present or future).
+* Thirdly, that Scalar "defined words" (32-bit instruction
+  encodings), if Vectorised, will also always be implemented as
+  identical Scalar instructions (the sole semi-exception being
+  Vectorised Branch-Conditional)
+* Fourthly, that implementors are not permitted either to add
+  arbitrary features or to implement features in an incompatible
+  way. *(Performance may differ, but differing results are
+  not permitted)*.
+* Fifthly, that any part of Simple-V not implemented by
+  a lower Compliancy Level is *required* to raise an illegal
+  instruction trap (allowing soft-emulation), including if
+  Simple-V is not implemented at all.
+* Sixthly, that any `UNDEFINED` behaviour for practical implementation
+  reasons is clearly documented for both programmers and hardware
+  implementors.
+
+In particular, given the strong recent emphasis and interest in
+"Scalable Vector" ISAs, it is most unfortunate that both ARM SVE
+and RISC-V RVV permit the exact same instruction to produce
+different results on different hardware depending on a
+"Silicon Partner" hardware choice. This choice catastrophically
+and irrevocably causes binary non-interoperability *despite being
+a "feature"*. 
Explained in
+it is the exact same binary-incompatibility issue faced by Power ISA
+on its 32- to 64-bit transition: 32-bit hardware was **unable** to
+trap-and-emulate 64-bit binaries because the opcodes were (are) the same.
+
+It is therefore *guaranteed* that extensions to the register file
+width and quantity in Simple-V shall only be made in future by
+explicit means, ensuring binary compatibility.
+
+# Optional Scalar instructions
+
+**Additional Instructions for specific purposes (not SVP64)**
+
+All of the instructions below have nothing to do with SV.
+They are all entirely designed as Scalar instructions that, as
+Scalar instructions, stand on their own merit. Considerable
+effort has gone into providing justifications for each of these
+*Scalar* instructions in a *Scalar* context, completely independently
+of SVP64.
+
+Some of these Scalar instructions also happen to be designed to make
+Scalable Vector binaries more efficient, such
+as the crweird group. Others are to bring the Scalar Power ISA
+up-to-date for specific workloads,
+such as a JavaScript Rounding instruction
+(which saves 32 scalar instructions including seven branch instructions).
+None of them are strictly necessary, but performance and power consumption
+may be (or, in some cases, already are) compromised
+in certain workloads and use-cases without them.
+
+Vector-related but still Scalar:
+
+* [[sv/mv.swizzle]] vec2/3/4 Swizzles (RGBA, XYZW) for 3D and CUDA,
+  designed as a Scalar instruction.
+* [[sv/vector_ops]] scalar operations needed for supporting vectors
+* [[sv/cr_int_predication]] scalar instructions needed for
+  effective predication
+
+Stand-alone Scalar Instructions:
+
+* [[sv/bitmanip]]
+* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
+* [[sv/fclass]] detect class of FP numbers
+* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
+* [[sv/av_opcodes]] scalar opcodes for Audio/Video
+* [[prefix_codes]] Decode/encode prefix-codes, used by JPEG, DEFLATE, etc.
+* TODO: OpenPOWER adaptation [[openpower/transcendentals]]
+
+Twin-targeted instructions (two registers out, one implicit, just like
+Load-with-Update):
+
+* [[isa/svfixedarith]]
+* [[isa/svfparith]]
+* [[sv/biginteger]] Operations that help with big arithmetic
+
+The rules for twin register targets
+(implicit RS, FRS) are explained in the SVP64 [[svp64/appendix]].
+
+# Architectural Note
+
+This section is primarily for the ISA Working Group and for IBM
+in their capacity and responsibility for allocating "Architectural
+Resources" (opcodes), but it is also useful for general understanding
+of Simple-V.
+
+Simple-V is effectively a type of "Zero-Overhead Loop Control" to which
+an entire 24 bits are exclusively dedicated in a fully RISC-abstracted
+manner. Within those 24 bits there are no Scalar instructions, and
+no Vector instructions: there is *only* "Loop Control".
+
+This is why there are no actual Vector operations in Simple-V: *all* suitable
+Scalar Operations are Vectorised or not at all. This has some extremely
+important implications when considering adding new instructions, and
+especially when allocating the Opcode Space for them.
+To protect SVP64 from damage, a "Hard Rule" has to be set:
+
+    Scalar Instructions must be simultaneously added in the corresponding
+    SVP64 opcode space with the exact same 32-bit "Defined Word" or they
+    must not be added at all. Likewise, instructions planned for addition
+    in what is considered (wrongly) to be the exclusive "Vector" domain
+    must correspondingly be added in the Scalar space with the exact same
+    32-bit "Defined Word", or they must not be added at all.
+
+Some explanation of the above is needed. Firstly, "Defined Word" is a term
+used in Section 1.6.3 of the Power ISA v3.1 Book I: it means, in short,
+"a 32-bit instruction", which can then be Prefixed by EXT001 to extend it
+to 64-bit (named EXT100-163).
+Prefixed-Prefixed (96-bit Variable-Length) encodings are
+prohibited in v3.1 and they are just as prohibited in Simple-V: it is too
+complex in hardware. This means that **only** 32-bit "Defined Words"
+may be Vectorised, and in particular it means that no 64-bit instruction
+(EXT100-163) may **ever** be Vectorised.
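+
+As an illustrative aid only (the normative packet format is in
+[[sv/svp64]]), the overall *shape* of a Vectorised instruction may be
+pictured as a 32-bit prefix wrapping an unmodified 32-bit Scalar
+"Defined Word". A minimal Python sketch follows, in which the function
+name and the flat prefix/suffix split are illustrative assumptions
+rather than Specification pseudocode:
+
+    # Illustration only: a 64-bit SVP64-prefixed instruction is a
+    # 32-bit prefix (EXT001 Major Opcode plus the 24-bit "RM"
+    # Loop-Control context) followed by the unmodified 32-bit
+    # Scalar "Defined Word".
+    def split_svp64(insn64):
+        prefix = (insn64 >> 32) & 0xFFFFFFFF  # Vectorisation "context"
+        suffix = insn64 & 0xFFFFFFFF          # unchanged Scalar instruction
+        return prefix, suffix
+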
+Secondly, the term "Vectoriseable" was used. This refers to "instructions
+which if SVP64-Prefixed are actually meaningful". `sc` is meaningless
+to Vectorise, for example, as are `sync` and `mtmsr` (there is only ever
+going to be one MSR).
+
+The problem comes if the rationale "if unused, Unvectoriseable opcodes
+can therefore be allocated to alternative instructions inside the SVP64
+Opcode space" is applied:
+this unfortunately results in huge, inadvisable complexity in HDL at the
+Decode Phase, which must attempt to discern between the two types. Worse
+than that,
+if the alternate 64-bit instruction is Vectoriseable but the 32-bit Scalar
+"Defined Word" is already allocated, how can there ever be a Scalar version
+of the alternate instruction? It would have to be added as a **completely
+different** 32-bit "Defined Word", and things go rapidly downhill in the
+Decoder as well as the ISA from there.
+
+Therefore, to avoid risk and long-term damage to the Power ISA:
+
+* *even Unvectoriseable* "Defined Words" (`mtmsr`) must have the
+  corresponding SVP64 Prefixed Space `RESERVED`, permanently requiring
+  Illegal Instruction to be raised (the 64-bit encoding corresponding
+  to an illegal `sv.mtmsr`, if ever incorrectly attempted, must be
+  **defined** to raise an Exception)
+* *Even instructions that may not be Scalar* (although for various
+  practical reasons this is extremely rare if not impossible,
+  and in any case generally "strongly discouraged"),
+  which have no meaning or use as a 32-bit Scalar "Defined Word", **must**
+  still have the Scalar "Defined Word" `RESERVED` in the scalar
+  opcode space, as an Illegal Instruction.
+
+A good example of the former is `mtmsr`, because there is only one
+MSR register (`sv.mtmsr` is meaningless, as is `sv.sc`),
+and a good example of the latter is [[sv/mv.x]],
+which is so deeply problematic to add to any Scalar ISA that it was
+rejected outright and an alternative route taken (Indexed REMAP).
+
+Another good example would be Cross Product, which has no meaning
+at all in a Scalar ISA (Cross Product as a concept only applies
+to Mathematical Vectors). If any such Vector operation were ever added,
+it would be **critically** important to reserve the exact same *Scalar*
+opcode with the exact same "Defined Word" in the *Scalar* Power ISA
+opcode space, as an Illegal Instruction. There are
+good reasons why Cross Product has not been proposed, but it serves
+to illustrate the point as far as Architectural Resource Allocation is
+concerned.
+
+The bottom line is that whilst this seems wasteful, the alternatives are a
+destabilisation of the Power ISA and impractically-complex Hardware
+Decoders. 
With the Scalar Power ISA (v3.0, v3.1) already being comprehensive +in the number of instructions, keeping further Decode complexity down is a +high priority. + +# Other Scalable Vector ISAs + +These Scalable Vector ISAs are listed to aid in understanding and +context of what is involved. + +* Original Cray ISA + +* NEC SX Aurora (still in production, inspired by Cray) + +* RISC-V RVV (inspired by Cray) + +* MRISC32 ISA Manual (under active development) + +* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from + Mitch on request. + +A comprehensive list of 3D GPU, Packed SIMD, Predicated-SIMD and true Scalable +Vector ISAs may be found at the [[sv/vector_isa_comparison]] page. +Note: AVX-512 and SVE2 are *not Vector ISAs*, they are Predicated-SIMD. +*Public discussions have taken place at Conferences attended by both Intel +and ARM on adding a `setvl` instruction which would easily make both +AVX-512 and SVE2 truly "Scalable".* [[sv/comparison_table]] in tabular +form. + +# Major opcodes summary + +Simple-V itself only requires six instructions with 6-bit Minor XO +(bits 26-31), and the SVP64 Prefix Encoding requires +25% space of the EXT001 Major Opcode. +There are **no** Vector Instructions and consequently **no further +opcode space is required**. Even though they are currently +placed in the EXT022 Sandbox, the "Management" instructions +(setvl, svstep, svremap, svshape, svindex) are designed to fit +cleanly into EXT019 (exactly like `addpcis`) or other 5/6-bit Minor +XO area (bits 25-31) that has space for Rc=1. + +That said: for the target workloads for which Scalable Vectors are typically +used, the Scalar ISA on which those workloads critically rely +is somewhat anaemic. +The Libre-SOC Team has therefore been addressing that by developing +a number of Scalar instructions in specialist areas (Big Integer, +Cryptography, 3D, Audio/Video, DSP) and it is these which require +considerable Scalar opcode space. + +Please be advised that even though SV is entirely DRAFT status, there is considerable concern that because there is not yet any two-way day-to-day communication established with the OPF ISA WG, we have no idea if any of these are conflicting with future plans by any OPF -Members. **The External ISA WG RFC Process is yet to be ratified -and Libre-SOC may not join the OPF as an entity because it does +Members. **The External ISA WG RFC Process has now been ratified +but Libre-SOC may not join the OPF as an entity because it does not exist except in name. Even if it existed it would be a conflict of interest to join the OPF, due to our funding remit from NLnet**. We therefore proceed on the basis of making public the intention to @@ -87,21 +405,36 @@ submit RFCs once the External ISA WG RFC Process is in place and, in a wholly unsatisfactory manner have to *hope and trust* that OPF ISA WG Members are reading this and take it into consideration. +**Scalar Summary** + +As in above sections, it is emphasised strongly that Simple-V in no +way critically depends on the 100 or so *Scalar* instructions also +being developed by Libre-SOC. + **None of these Draft opcodes are intended for private custom secret proprietary usage. They are all intended for entirely public, upstream, high-profile mass-volume day-to-day usage at the same level as add, popcnt and fld** -* SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1) * bitmanip requires two major opcodes (due to 16+ bit immediates) those are currently EXT022 and EXT05. 
 * brownfield encoding in one of those two major opcodes still requires
   multiple VA-Form operations (in greater numbers
   than EXT04 has spare)
 * space in EXT019 next to addpcis and crops is recommended
+  (or any other 5-6 bit Minor XO areas)
 * many X-Form opcodes currently in EXT022 have no preference for a location
   at all, and may be moved to EXT059, EXT019, EXT031 or other much
   more suitable location.
+* even if ratified, and even if the majority (mostly X-Form)
+  is moved to other locations, the large immediate sizes of
+  the remaining bitmanip instructions mean
+  it is highly likely that these remaining instructions would need two
+  major opcodes. Fortuitously, the v3.1 Spec states that
+  both EXT005 and EXT009 are
+  available.
+
+**Additional observations**
 
 Note that there is no Sandbox allocation in the published ISA Spec
 for v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
@@ -111,70 +444,33 @@ situation is a high priority which in turn by necessity puts
 pressure on the 32-bit Major Opcode space.
 
 Note also that EXT022, the Official Architectural Sandbox area
+available for "Custom non-approved purposes" according to the Power
+ISA Spec,
 is under severe design pressure as it is insufficient to hold
 the full extent of the instruction additions required to create
-a Hybrid 3D CPU-VPU-GPU.
-
-**Whilst SVP64 is only 4 instructions
+a Hybrid 3D CPU-VPU-GPU. Although the wording of the Power ISA
+Specification leaves open the *possibility* of not needing to
+propose ISA Extensions to the ISA WG, it is clear that EXT022
+is an inappropriate location for a large high-profile Extension
+intended for mass-volume product deployment. Every in-good-faith effort will
+therefore be made to work with the OPF ISA WG to
+submit SVP64 via the External RFC Process.
+
+**Whilst SVP64 is only 6 instructions,
 the heavy focus on VSX for the past 12 years has left the SFFS Level
-anaemic and out-of-date compared to ARM and x86. Approximately
-100 additional Scalar Instructions are up for proposal**
-
-# Sub-pages
-
-Pages being developed and examples
+anaemic and out-of-date compared to ARM and x86.**
+This is very much
+a blessing, as the Scalar ISA has remained clean, making it
+highly suited to RISC-paradigm Scalable Vector Prefixing. Approximately
+100 additional (optional) Scalar Instructions are up for proposal to bring SFFS
+up-to-date. None of them require or depend on PackedSIMD VSX (or VMX).
 
-* [[sv/overview]] explaining the basics.
-* [[sv/compliancy_levels]] for minimum subsets through to Advanced
-  Supercomputing.
-* [[sv/implementation]] implementation planning and coordination
-* [[sv/svp64]] contains the packet-format *only*, the [[sv/svp64/appendix]]
-  contains explanations and further details
-* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
-* [[opcode_regs_deduped]] autogenerated table of SVP64 instructions
-* [[sv/sprs]] SPRs
-* SVP64 "Modes":
-  - For condition register operations see [[sv/cr_ops]] - SVP64 Condition
-    Register ops: Guidelines
-    on Vectorisation of any v3.0B base operations which return
-    or modify a Condition Register bit or field.
-  - For LD/ST Modes, see [[sv/ldst]].
- - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch - behaviour: All/Some Vector CRs - - For arithmetic and logical, see [[sv/normal]] - -Core SVP64 instructions: - -* [[sv/setvl]] the Cray-style "Vector Length" instruction -* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing" -* [[sv/svstep]] Key stepping instruction for Vertical-First Mode - -Vector-related: - -* [[sv/vector_swizzle]] -* [[sv/mv.vec]] pack/unpack move to and from vec2/3/4 -* [[sv/mv.swizzle]] -* [[sv/vector_ops]] scalar operations needed for supporting vectors - -Scalar Instructions: - -* [[sv/cr_int_predication]] instructions needed for effective predication -* [[sv/bitmanip]] -* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32) -* [[sv/fclass]] detect class of FP numbers -* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX -* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA -* [[sv/av_opcodes]] scalar opcodes for Audio/Video -* Twin targetted instructions (two registers out, one implicit) - Explanation of the rules for twin register targets - (implicit RS, FRS) explained in SVP64 [[sv/svp64/appendix]] - - [[isa/svfixedarith]] - - [[isa/svfparith]] - - [[sv/biginteger]] Operations that help with big arithmetic -* TODO: OpenPOWER adaptation [[openpower/transcendentals]] +# Other Examples experiments future ideas discussion: +* [Scalar register access](https://bugs.libre-soc.org/show_bug.cgi?id=905) + above r31 and CR7. * [[sv/propagation]] Context propagation including svp64, swizzle and remap * [[sv/masked_vector_chaining]] * [[sv/discussion]] @@ -190,145 +486,9 @@ Examples experiments future ideas discussion: Additional links: * +* [[sv/vector_isa_comparison]] - a list of Packed SIMD, GPU, + and other Scalable Vector ISAs +* [[sv/comparison_table]] - a one-off (experimental) table comparing ISAs * [[simple_v_extension]] old (deprecated) version * [[openpower/sv/llvm]] -* [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]] - -=== - -Required Background Reading: -============================ - -These are all, deep breath, basically... required reading, *as well as -and in addition* to a full and comprehensive deep technical understanding -of the Power ISA, in order to understand the depth and background on -SVP64 as a 3D GPU and VPU Extension. - -I am keenly aware that each of them is 300 to 1,000 pages (just like -the Power ISA itself). - -This is just how it is. - -Given the sheer overwhelming size and scope of SVP64 we have gone to -**considerable lengths** to provide justification and rationalisation for -adding the various sub-extensions to the Base Scalar Power ISA. - -* Scalar bitmanipulation is justifiable for the exact same reasons the - extensions are justifiable for other ISAs. The additional justification - for their inclusion where some instructions are already (sort-of) present - in VSX is that VSX is not mandatory, and the complexity of implementation - of VSX is too high a price to pay at the Embedded SFFS Compliancy Level. -* Scalar FP-to-INT conversions, likewise. ARM has a javascript conversion - instruction, Power ISA does not (and it costs a ridiculous 45 instructions - to implement, including 6 branches!) -* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable - for High-Performance Compute workloads. - -It also has to be pointed out that normally this work would be covered by -multiple separate full-time Workgroups with multiple Members contributing -their time and resources. 
- -Overall the contributions that we are developing take the Power ISA out of -the specialist highly-focussed market it is presently best known for, and -expands it into areas with much wider general adoption and broader uses. - - ---- - -OpenCL specifications are linked here, these are relevant when we get -to a 3D GPU / High Performance Compute ISA WG RFC: -[[openpower/transcendentals]] - -(Failure to add Transcendentals to a 3D GPU is directly equivalent to -*willfully* designing a product that is 100% destined for commercial -rejection, due to the extremely high competitive performance/watt achieved -by today's mass-volume GPUs.) - -I mention these because they will be encountered in every single -commercial GPU ISA, but they're not part of the "Base" (core design) -of a Vector Processor. Transcendentals can be added as a sub-RFC. - ---- - -SIMD ISAs commonly mistaken for Vector: ---------------------------------------- - -There is considerable confusion surrounding Vector ISAs -because of a mis-use of the word "Vector" in most -well-known Packed SIMD ISAs. - -* PackedSIMD VSX. VSX, which has the word "Vector" in its name, - is "inspired" by Vector Processing - but has no "Scaling" capability, and no Predicate masking. - Adding Predicate Masks to the PackedSIMD VSX ISA - would effectively double the number of PackedSIMD - instructions (750 becomes 1,500) -* [AVX / AVX2 / AVX128 / AVX256 / AVX512](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) - again has the word "Vector" in its name but this in no - way makes it a Vector ISA. None of the AVX-\* family - are "Scalable" however there is at least Predicate Masking - in AVX-512. -* ARM NEON - accurately described as a Packed SIMD ISA in - all literature. -* ARM SVE / SVE2 - accurately described as a Scalable Vector - ISA, but the "Scaling" is, rather unfortunately, a parameter - that is chosen by the *Hardware Architect*, rather than - the programmer. This has resulted in programmers writing - multiple variants of hand-coded assembler in order - to target different machines with different hardware widths, - going directly against the advice given on ARM's developer - documentation. - - -Actual 3D GPU Architectures and ISAs: -------------------------------------- - -* Broadcom Videocore - -* Etnaviv - -* Nyuzi - -* MALI - -* AMD - - -* MIAOW which is *NOT* a 3D GPU, it is a processor which happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU" - - - -Actual Scalar Vector Processor Architectures and ISAs: ------------------------------------------------------- - -* NEC SX Aurora - -* Cray ISA - -* RISC-V RVV - -* MRISC32 ISA Manual (under active development) - -* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from - Mitch on direct contact with him. It is a different approach from the - others, which may be termed "Cray-Style Horizontal-First" Vectorisation. - 66000 is a *Vertical-First* Vector ISA. - -The term Horizontal or Vertical alludes to the Matrix "Row-First" or -"Column-First" technique, where: - -* Horizontal-First processes all elements in a Vector before moving on - to the next instruction -* Vertical-First processes *ONE* element per instruction, and requires - loop constructs to explicitly step to the next element. - -Vector-type Support by Architecture -[[!table data=""" -Architecture | Horizontal | Vertical -MyISA 66000 | | X -Cray | X | -SX Aurora | X | -RVV | X | -SVP64 | X | X -"""]]
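+
+To illustrate the Horizontal-First / Vertical-First distinction referred
+to above, a small Python sketch follows. It is an illustration only
+(predication and `svstep` semantics are greatly simplified, and the
+variable names are placeholders, not Specification pseudocode):
+
+    VL = 4
+    b, c = [1, 2, 3, 4], [10, 20, 30, 40]
+    a, d = [0] * VL, [0] * VL
+
+    # Horizontal-First: each instruction processes *all* VL elements
+    # before the next instruction begins.
+    for i in range(VL):
+        a[i] = b[i] + c[i]      # one Vectorised add, all elements
+    for i in range(VL):
+        d[i] = a[i] * 2         # then one Vectorised multiply, all elements
+
+    # Vertical-First: *one* element per instruction; an explicit step
+    # (svstep) advances to the next element.
+    for i in range(VL):         # the loop construct is explicit
+        a[i] = b[i] + c[i]      # element i of the add
+        d[i] = a[i] * 2         # element i of the multiply
+                                # svstep: move on to element i+1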