From: Luke Kenneth Casson Leighton
Date: Thu, 8 Sep 2022 17:25:45 +0000 (+0100)
Subject: whitespace
X-Git-Tag: opf_rfc_ls005_v1~593
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=9667ea3fb27a41f963641e99338e7a19d7a913da;p=libreriscv.git

whitespace
---

diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn
index a26420e47..9cc47403c 100644
--- a/openpower/sv/rfc/ls001.mdwn
+++ b/openpower/sv/rfc/ls001.mdwn
@@ -17,75 +17,79 @@ This proposal is to extend the Power ISA with an Abstract RISC-Paradigm
 Vectorisation Concept that may be applied to **all and any** suitable Scalar
 instructions, present and future, in the Scalar Power ISA.

-The Vectorisation System is called "Simple-V" and the Prefix Format
-is called "SVP64". **Simple-V is not a Traditional Vector ISA and therefore does not add Vector opcodes**.
+The Vectorisation System is called "Simple-V" and the Prefix Format is
+called "SVP64". **Simple-V is not a Traditional Vector ISA and therefore
+does not add Vector opcodes**.

 An ISA Concept similar to Simple-V was originally invented in 1994 by
-Peter Hsu (Architect of the MIPS R8000) but was dropped as MIPS did
-not have an Out-of-Order Microarchitecture on which to best exploit it.
+Peter Hsu (Architect of the MIPS R8000) but was dropped as MIPS did not
+have an Out-of-Order Microarchitecture on which to best exploit it.

 Simple-V is designed for Embedded Scenarios right the way through
 Audio/Visual DSPs to 3D GPUs and Supercomputing. As it does **not**
 add actual Vector Instructions, relying solely and exclusively on the
-**Scalar** ISA, it is **Scalar** instructions that need to be added
-to the **Scalar** Power ISA before Simple-V may orthogonally Vectorise
-them.
-
-Therefore because the goal of RED Semiconductor Ltd, an OpenPOWER Stakeholder,
-is to bring to market mass-volume general-purpose compute processors that
-are competitive in the 3D GPU Audio Visual DSP EDGE IoT desktop
-chromebook netbook smartphone laptop markets, Simple-V
-has to be accompanied by corresponding **Scalar** instructions that bring
-the **Scalar** Power ISA up-to-date. These include IEEE754 Transcendentals
-AV cryptographic Biginteger and bitmanipulation operations that ARM Intel
-AMD and many other ISAs have been adding over the past 12 years and Power ISA
-has not.
-
-*Thus it becomes necesary to consider the Architectural Resource Allocation of not just
-Simple-V but the 80-100 Scalar instructions all at the same time*.
-
-It is also critical to note that Simple-V **does not modify the Scalar Power ISA
-in any way**. The sole semi-exception to that is Vectorised Branch Conditional,
-in order to provide the usual Advanced Branching
-capability present in every Commercial 3D GPU ISA. Scalar Branch
-is **not** modified by the **Vectorised** variant.
+**Scalar** ISA, it is **Scalar** instructions that need to be added to
+the **Scalar** Power ISA before Simple-V may orthogonally Vectorise them.
+
+Therefore because the goal of RED Semiconductor Ltd, an OpenPOWER
+Stakeholder, is to bring to market mass-volume general-purpose compute
+processors that are competitive in the 3D GPU Audio Visual DSP EDGE IoT
+desktop chromebook netbook smartphone laptop markets, Simple-V has to
+be accompanied by corresponding **Scalar** instructions that bring the
+**Scalar** Power ISA up-to-date. These include IEEE754 Transcendentals
+AV cryptographic Biginteger and bitmanipulation operations that ARM
+Intel AMD and many other ISAs have been adding over the past 12 years
+and Power ISA has not.
+
+*Thus it becomes necesary to consider the Architectural Resource
+Allocation of not just Simple-V but the 80-100 Scalar instructions all
+at the same time*.
+
+It is also critical to note that Simple-V **does not modify the Scalar
+Power ISA in any way**. The sole semi-exception to that is Vectorised
+Branch Conditional, in order to provide the usual Advanced Branching
+capability present in every Commercial 3D GPU ISA. Scalar Branch is
+**not** modified by the **Vectorised** variant.

 # Compliancy Levels

-Simple-V has been subdivided into levels akin to the Power ISA Compliancy Levels.
-For now Let us call them "SV Compliancy Levels" to distinguish the two. The reason for
-the SV Compliancy Levels is the same as for the Power ISA Compliancy Levels (SFFS, SFS):
-to not overburden implementors with features that they do not need.
-*There is no dependence between the two types of Compliancy Levels*
-The resources below therefore are not all required for all SV Compliancy Levels but
-they are all required to be reserved.
+Simple-V has been subdivided into levels akin to the Power ISA Compliancy
+Levels. For now Let us call them "SV Compliancy Levels" to distinguish
+the two. The reason for the SV Compliancy Levels is the same as for the
+Power ISA Compliancy Levels (SFFS, SFS): to not overburden implementors
+with features that they do not need. *There is no dependence between
+the two types of Compliancy Levels* The resources below therefore are
+not all required for all SV Compliancy Levels but they are all required
+to be reserved.

 # Hardware Implementations

-The fundamental principle of Simple-V is that it sits between Issue and Decode,
-pausing the Program-Counter to service a "Sub-Program-Counter" hardware for-loop.
-In practical terms for many first-iteration implementations this is sufficient.
-
-**Considerable** effort has been expended to ensure that Simple-V is practical
-to implement on an extremely wide range of Industry-wide common **Scalar**
-micro-architectures.
-Finite State Machine (for ultra-low-resource and Mission-Critical), In-order
-single-issue, all the way through to Great-Big Out-of-Order Superscalar Multi-Issue,
-and the SV Compliancy Levels specifically recognise these differing scenarios.the
-
-SIMD back-end ALUs particularly those with element-level predicate masks may be
-exploited to good effect with very little additional complexity to achieve high
-throughput, even on a single-issue in-order microarchitecture. As usually becomes
-apparent with in-order, its limitations extend also to when Simple-V is deployed,
-which is why Multi-Issue Out-of-Order is the recommended (but not mandatory)
+The fundamental principle of Simple-V is that it sits between Issue and
+Decode, pausing the Program-Counter to service a "Sub-Program-Counter"
+hardware for-loop. In practical terms for many first-iteration
+implementations this is sufficient.
+
+**Considerable** effort has been expended to ensure that Simple-V is
+practical to implement on an extremely wide range of Industry-wide
+common **Scalar** micro-architectures. Finite State Machine (for
+ultra-low-resource and Mission-Critical), In-order single-issue, all the
+way through to Great-Big Out-of-Order Superscalar Multi-Issue, and the
+SV Compliancy Levels specifically recognise these differing scenarios.the
+
+SIMD back-end ALUs particularly those with element-level predicate
+masks may be exploited to good effect with very little additional
+complexity to achieve high throughput, even on a single-issue in-order
+microarchitecture. As usually becomes apparent with in-order, its
+limitations extend also to when Simple-V is deployed, which is why
+Multi-Issue Out-of-Order is the recommended (but not mandatory)
 Micro-architecture.

-The only major concern is in the upper SV Compliancy Levels:
-the Hazard Management for increased number of Scalar Registers
-to 128 (in current versions) but given that IBM POWER9/10 has VSX register numbering 64,
-and modern GPUs have 128, 256 amd even 512 registers this was deemed acceptable. Strategies
-do exist in hardware for Hazard Management of such large numbers of registers,
-even for Multi-Issue microarchitectures.
+The only major concern is in the upper SV Compliancy Levels: the Hazard
+Management for increased number of Scalar Registers to 128 (in current
+versions) but given that IBM POWER9/10 has VSX register numbering 64,
+and modern GPUs have 128, 256 amd even 512 registers this was deemed
+acceptable. Strategies do exist in hardware for Hazard Management of
+such large numbers of registers, even for Multi-Issue microarchitectures.

 # Simple-V Architectural Resources

@@ -96,13 +100,13 @@ even for Multi-Issue microarchitectures.
   onto VSX: extension of the number of VSX registers will be discussed
   at that time)
 * 24-bits are needed within the main SVP64 Prefix (equivalent to a 2-bit XO)
-* Another 24-bit (a second 2-bit XO) is needed for a planned future encoding, currently
-  named "SVP64-Single" [^likeext001]
+* Another 24-bit (a second 2-bit XO) is needed for a planned future encoding,
+  currently named "SVP64-Single" [^likeext001]
 * A third 24-bits (third 2-bit XO) is strongly recommended to be **reserved**
   such that future unforeseen capability is needed.
-* To hold all Vector Context, five SPRs are needed for userspace (MSR.PR=1 Problem State).
-  If Supervisor and Hypervisor mode are to also support Simple-V they will correspondingly
-  need five SPRs each.
+* To hold all Vector Context, five SPRs are needed for userspace
+  (MSR.PR=1 Problem State). If Supervisor and Hypervisor mode are to
+  also support Simple-V they will correspondingly need five SPRs each.
 * Five 6-bit XO (A-Form) "Management" instructions are needed.

 **Summary of Opcode space**
@@ -110,65 +114,71 @@ even for Multi-Issue microarchitectures.
 * 75% of one Major Opcode (equivalent to the rest of EXT017)
 * Five 6-bit operations.

-No further opcode space *for Simple-V* is envisaged to be required for at least the next decade (including if added on VSX)
+No further opcode space *for Simple-V* is envisaged to be required for
+at least the next decade (including if added on VSX)

 **SPRs**

 * **SVSTATE** - Vectorisation State
 * **SVSRR0** - identical in purpose to SRR0/1, storing SVSTATE on context-switch
-* **SVSHAPE0-3* - these are 32-bit and may be grouped in pairs, they REMAP (shape)
-  the Vectors
-* **SVLR** - again similar to LR for exactly the same purpose, SVSTATE is swapped
-  with SVLR by SV-Branch-Conditional for exactly the same reason that NIA is swapped
-  with LR
+* **SVSHAPE0-3** - these are 32-bit and may be grouped in pairs, they REMAP
+  (shape) the Vectors
+* **SVLR** - again similar to LR for exactly the same purpose, SVSTATE
+  is swapped with SVLR by SV-Branch-Conditional for exactly the same
+  reason that NIA is swapped with LR

 **Vector Management Instructions**

 * **setvl** - Cray-style Scalar Vector Length instruction
-* **svstep** - used for Vertical-First Mode and for enquiring about internal state
+* **svstep** - used for Vertical-First Mode and for enquiring about internal
+  state
 * **svremap** - "tags" registers for activating REMAP
-* **svshape** - convenience instruction for quickly setting up Matrix, DCT, FFT and
-  Parallel Reduction REMAP
+* **svshape** - convenience instruction for quickly setting up Matrix, DCT,
+  FFT and Parallel Reduction REMAP
 * **svshape2** - additional convenience instruction to set up "Offset" REMAP
   (fits within svshape's XO encoding)
 * **svindex** - convenience instruction for setting up "Indexed" REMAP.

 # SVP64 24-bit Prefix

-The SVP64 24-bit Prefix provides several options, too numerous to describe in this
-document but all fitting within the 24-bit space (and no other).
+The SVP64 24-bit Prefix provides several options, too numerous to describe
+in this document but all fitting within the 24-bit space (and no other).

 The primary options are:

-* element-width overrides, which dynamically redefine each SFFS or SFS Scalar prefixed
-  instruction to be 8-bit, 16-bit, 32-bit or 64-bit operands **without requiring new
-  8/16/32 instructions** [^pseudorewrite]
+* element-width overrides, which dynamically redefine each SFFS or SFS
+  Scalar prefixed instruction to be 8-bit, 16-bit, 32-bit or 64-bit
+  operands **without requiring new 8/16/32 instructions** [^pseudorewrite]
 * predication. this is an absolutely essential feature for a 3D GPU VPU ISA.
-  CR Fields are available as Predicate Masks hence the reason for their extension to 128.
-* Saturation. **all** LD/ST and Arithmetic and Logical operations may be saturated
-  (without adding explicit scalar saturated opcodes)
+  CR Fields are available as Predicate Masks hence the reason for their
+  extension to 128.
+* Saturation. **all** LD/ST and Arithmetic and Logical operations may
+  be saturated (without adding explicit scalar saturated opcodes)
 * Reduction and Prefix-Sum (Fibonnacci Series) Modes

 # REMAP subsystem

-REMAP is extremely advanced but brings features already present in other DSPs and
-Supercomputing ISAs.
-
-* DCT/FFT REMAP brings more capability than TI's MSP-Series DSPs and Qualcom Hexagon DSPs
-* Matrix REMAP brings more capability than any other Matrix Extension (AMD GPUs,
-  Intel, ARM), not being restricted to Power-2 sizes. Also not limited to the type
-  of operation, it may perform Warshall Transitive Closure, Integer Matrix,
-  Bitmanipulation Matrix, Galois Field (carryless mul) Matrix, and with care potentially
-  Graph Maximum Flow as well. Also suited to Convolutions, Matrix Transpose and rotate.
-* General-purpose Indexed REMAP, this option is provided to implement an equivalent
-  of VSX `vperm`
-* Parallel Reduction REMAP, performs an automatic map-reduce using *any suitable
-  scalar operation*.
+REMAP is extremely advanced but brings features already present in other
+DSPs and Supercomputing ISAs.
+
+* DCT/FFT REMAP brings more capability than TI's MSP-Series DSPs and
+  Qualcom Hexagon DSPs
+* Matrix REMAP brings more capability than any other Matrix Extension
+  (AMD GPUs, Intel, ARM), not being restricted to Power-2 sizes. Also not
+  limited to the type of operation, it may perform Warshall Transitive
+  Closure, Integer Matrix, Bitmanipulation Matrix, Galois Field (carryless
+  mul) Matrix, and with care potentially Graph Maximum Flow as well. Also
+  suited to Convolutions, Matrix Transpose and rotate.
+* General-purpose Indexed REMAP, this option is provided to implement
+  an equivalent of VSX `vperm`
+* Parallel Reduction REMAP, performs an automatic map-reduce using
+  *any suitable scalar operation*.

 # Scalar Operations.

-The primary reason for mentioning the additional Scalar operations is because
-they are so numerous, with Power ISA not having advanced in the *general purpose*
-compute area in the past 12 years, that some considerable care is needed.
+The primary reason for mentioning the additional Scalar operations
+is because they are so numerous, with Power ISA not having advanced
+in the *general purpose* compute area in the past 12 years, that some
+considerable care is needed.

 Summary: **to fit everything at least 75% of 3 Major Opcodes is required**

@@ -181,18 +191,18 @@ Candidates (for all but the X-Form instructions) include:
 * EXT005 (100% free)
 * brownfield space in EXT019 (25% but NOT recommended)

-SVP64, SVP64-Single and SVP64-Reserved will require on their own each 25% of one Major
-Opcode for a total of 75% of one Major Opcode. The remaining **Scalar** opcodes,
-due to there being two separate sets of operations with 16-bit immediates, will require
-the other space totalling two 75% Majors.
+SVP64, SVP64-Single and SVP64-Reserved will require on their own each 25%
+of one Major Opcode for a total of 75% of one Major Opcode. The remaining
+**Scalar** opcodes, due to there being two separate sets of operations
+with 16-bit immediates, will require the other space totalling two 75%
+Majors.

 Note critically that:

 * Unlike EXT001, SVP64's 24-bits may **not** hold also any Scalar
-  operations.
-  There is no free available space: a 25th bit would be required.
-  The entire 24-bits is **required** for the abstracted Hardware-Looping Concept
-  **even when these 24-bits are zero**
+  operations. There is no free available space: a 25th bit would
+  be required. The entire 24-bits is **required** for the abstracted
+  Hardware-Looping Concept **even when these 24-bits are zero**
 * Any Scalar 64-bit instruction (regardless of how it is encoded) is unsafe to
   then Vectorise because this creates the situation of Prefixed-Prefixed,
   resulting in deep complexity in Hardware Decode at a critical juncture, as
@@ -200,8 +210,8 @@ Note critically that:
 * **All** of these Scalar instructions are candidates for Vectorisation.
   Thus none of them may be 64-bit-Scalar-only.

-*Three 75% allocations are thus genuinely needed*, all other options are unsuitable
-for consideration.
+*Three 75% allocations are thus genuinely needed*, all other options
+are unsuitable for consideration.

 **Minor Opcodes to fit candidates above**

@@ -209,18 +219,20 @@ In order of size, for bitmanip and A/V DSP purposes:

 * QTY 3of 2-bit XO: ternlogi, crternlogi, grevlogi
 * QTY 7of 3-bit XO: xpermi, binlut, grevlog, swizzle-mv/fmv, bitmask, bmrevi
-* QTY 8of 5/6-bit (A-Form): xpermi, bincrflut, bmask, fmvis, fishmv, bmrev, Galois Field
+* QTY 8of 5/6-bit (A-Form): xpermi, bincrflut, bmask, fmvis, fishmv, bmrev,
+  Galois Field
 * QTY 30of 10-bit (X-Form): cldiv/mul, av-min/max/diff, absdac, xperm

-Note: Some of the Galois Field operations will require QTY 1of Polynomial SPR (per userspace supervisor hypervisor).
+Note: Some of the Galois Field operations will require QTY 1of Polynomial
+SPR (per userspace supervisor hypervisor).

 **EXT004**

-For biginteger math, two instructions in the same space as "madd*" are to be proposed.
-They are both 3-in 2-out operations taking or producing a 64-bit "pair" (like RTp),
-and perform 128/64 mul and div/mod operations respectively.
-These are **not** the same as VSX operations
-which are 128/128, and they are **not** the same as existing Scalar mul/div/mod,
+For biginteger math, two instructions in the same space as "madd" are to
+be proposed. They are both 3-in 2-out operations taking or producing a
+64-bit "pair" (like RTp), and perform 128/64 mul and div/mod operations
+respectively. These are **not** the same as VSX operations which are
+128/128, and they are **not** the same as existing Scalar mul/div/mod,
 all of which are 64/64 (or 64/32).

 **EXT059 and EXT063**
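
Reader's note on the EXT004 paragraph in the last hunk: the patch only rewraps the text, but the "3-in 2-out" biginteger operations it describes may be easier to follow with a small model. The Python sketch below is illustrative only and is not taken from the RFC. The function names `mul_add_128` and `divmod_128_by_64` are hypothetical, and treating the third multiply input as an addend is an assumption suggested by the phrase 'in the same space as "madd"'. The overflow check in the divide is also an assumption, since the text does not say how overflow is handled.

```python
MASK64 = (1 << 64) - 1


def mul_add_128(ra, rb, rc):
    """Hypothetical model: 64x64 multiply plus 64-bit addend.

    Three 64-bit inputs, two 64-bit outputs (hi, lo), i.e. a 128-bit "pair".
    The sum cannot exceed 128 bits: (2^64-1)^2 + (2^64-1) = 2^128 - 2^64.
    """
    for value in (ra, rb, rc):
        assert 0 <= value <= MASK64
    wide = ra * rb + rc
    return wide >> 64, wide & MASK64


def divmod_128_by_64(hi, lo, divisor):
    """Hypothetical model: 128-bit (hi:lo) dividend by a 64-bit divisor.

    Three 64-bit inputs, two 64-bit outputs (quotient, remainder).
    Raising on overflow is an assumption made for this sketch only.
    """
    if divisor == 0 or hi >= divisor:
        raise OverflowError("quotient would not fit in 64 bits")
    return divmod((hi << 64) | lo, divisor)


def big_mul_scalar(limbs, scalar):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar."""
    carry, out = 0, []
    for limb in limbs:
        carry, lo = mul_add_128(limb, scalar, carry)
        out.append(lo)
    out.append(carry)
    return out


def big_div_scalar(limbs, divisor):
    """Divide a little-endian list of 64-bit limbs by a 64-bit scalar."""
    rem, out = 0, [0] * len(limbs)
    for i in reversed(range(len(limbs))):
        # rem < divisor always holds here, so the quotient fits in 64 bits
        out[i], rem = divmod_128_by_64(rem, limbs[i], divisor)
    return out, rem


def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    """Rebuild an integer from little-endian 64-bit limbs."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

In this model the high output word of one step feeds directly into the next as the addend (multiply) or as the high half of the dividend (divide). That chaining is what makes a 128/64 operation convenient for arbitrary-length arithmetic, and it is not available from 64/64 operations alone.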