From b51d4b382d733df660ddc875085546c5f0b876c9 Mon Sep 17 00:00:00 2001
From: Jacob Lifshay
Date: Mon, 10 Apr 2023 21:20:06 -0700
Subject: [PATCH] fix spelling and formatting

---
 openpower/sv/po9_encoding.mdwn |  6 ++---
 openpower/sv/rfc/ls012.mdwn    | 48 +++++++++++++++++------------------
 2 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/openpower/sv/po9_encoding.mdwn b/openpower/sv/po9_encoding.mdwn
index e355e1a0d..2a54e37b5 100644
--- a/openpower/sv/po9_encoding.mdwn
+++ b/openpower/sv/po9_encoding.mdwn
@@ -4,8 +4,8 @@
 
 **Definition of Simple-V:**
 
-In its simpest form, the Simple-V Loop/Vector concept is a Prefixing
-system (sililar to the 8086 `REP` instruction) that both augments its
+In its simplest form, the Simple-V Loop/Vector concept is a Prefixing
+system (similar to the 8086 `REP` instruction) that both augments its
 following Defined Word Suffix, and also may repeat that instruction
 with optional sequential register offsets from those given in the Suffix.
 Register numbers may also be extended (larger register files).
@@ -15,7 +15,7 @@ Vertical-First Mode.
 **Definition of SVP64 Prefixing:**
 
 SVP64 is a well-defined implementation of the Simple-V Loop/Vector concept,
-in a 32-bit Prefix format, that exploits the following instruction 
+in a 32-bit Prefix format, that exploits the following instruction
 (the Defined Word) using it as a "template". It requires 24 bits, some
 of which are common to all Suffixes, and some Mode bits are specific to
 the Defined Word class: Load/Store-Immediate, Load/Store-Indexed,
diff --git a/openpower/sv/rfc/ls012.mdwn b/openpower/sv/rfc/ls012.mdwn
index 8c5407b29..d24a68aad 100644
--- a/openpower/sv/rfc/ls012.mdwn
+++ b/openpower/sv/rfc/ls012.mdwn
@@ -10,7 +10,7 @@ The purpose of this RFC is:
 
 
 * to give a full list of upcoming Scalar opcodes developed by Libre-SOC
-  (being cognisant that *all* of them are Vectorisable)
+  (being cognisant that *all* of them are Vectoriseable)
 * to give OPF Members and non-Members alike the opportunity to comment and
   get involved early in RFC submission
 * formally agree a priority order on an iterative basis with new versions
@@ -23,7 +23,7 @@ The purpose of this RFC is:
   to get a clear advance picture of Opcode Allocation
   *prior* to submission
 
-As this is a Formal ISA RFC the evaluation shall ultimatly define
+As this is a Formal ISA RFC the evaluation shall ultimately define
 (in advance of the actual submission of the instructions themselves)
 which instructions will be submitted over the next 1-18 months.
 
@@ -56,7 +56,7 @@ much for any new team to consider (10 years development effort) and far
 outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing
 Power Scalar up-to-date to modern standards *and on its own merits* is
 a reasonable goal, and the advantages of the reduced focus is that
-SFFS remains RISC-paradigm, and that lessons can be learned from other 
+SFFS remains RISC-paradigm, and that lessons can be learned from other
 ISAs from the intervening years. Good examples here include `bmask`.
 
 SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
@@ -68,7 +68,7 @@ their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed.
 **Target areas**
 
 Whilst entirely general-purpose there are some categories that these
-instructions are targetting: Bitmanipulation, Big-integer, cryptography,
+instructions are targetting: Bit-manipulation, Big-integer, cryptography,
 Audio/Visual, High-Performance Compute, GPU workloads and DSP.
 
 **Instruction count guide and approximate priority order**
@@ -138,7 +138,7 @@ required for Warshall Transitive Closure (on top of a cumulatively-applied
 max instruction).
 
 The Management Instructions themselves are all Scalar Operations, so
-PO1-Prefixing is perfecly reasonable. SVP64 Management instructions of
+PO1-Prefixing is perfectly reasonable. SVP64 Management instructions of
 which there are only 6 are all 5 or 6 bit XO, meaning that the opcode
 space they take up in EXT0xx is not alarmingly high for their intrinsic
 strategic value.
@@ -156,7 +156,7 @@ There are a **lot** of operations here, and they also bring Power ISA
 up-to-date to IEEE754-2019. Fortunately the number of critical instructions
 is quite low, but the caveat is that if those operations are utilised to
 synthesise other IEEE754 operations (divide by `pi` for
-example) full bitlevel accuracy (a hard requirement for IEEE754) is lost.
+example) full bit-level accuracy (a hard requirement for IEEE754) is lost.
 
 Also worth noting that the Khronos Group defines minimum acceptable
 bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
@@ -171,8 +171,8 @@ when 3D Graphics simply has no need for full accuracy.
 Found at [[sv/av_opcodes]] these do not require Saturated variants
 because Saturation is added via [[sv/svp64]] (Vector Prefixing) and
 via [[sv/svp64_single]] Scalar Prefixing. This is important to note for
-Opcode Allocation because placing these operations in the UnVectoriseble
-areas would irrediemably damage their value. Unlike PackedSIMD ISAs
+Opcode Allocation because placing these operations in the UnVectoriseable
+areas would irredeemably damage their value. Unlike PackedSIMD ISAs
 the actual number of AV Opcodes is remarkably small once the usual
 cascading-option-multipliers (SIMD width, bitwidth, saturation, HI/LO)
 are abstracted out to RISC-paradigm Prefixing, leaving just
@@ -190,7 +190,7 @@ DSP can do full FFT triple loops in one VLIW group.
 
 It should be pretty clear this is high priority.
 
-With SVP64 [[sv/remap]] providing the Loop Schedules it falls to 
+With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
 the Scalar side of the ISA to add the prerequisite "Twin Butterfly" operations,
 typically performing for example one multiply but in-place subtracting
 that product from one operand and adding it to the other.
@@ -209,13 +209,13 @@ hot-loops is considered high priority.
 An additional need is to do popcount on CR Field bit vectors but adding
 such instructions to the *Condition Register* side was deemed to be far
 too much. Therefore, priority was given instead to transferring several
-CR Field bits into GPRs, whereupon the full set of tandard Scalar GPR
+CR Field bits into GPRs, whereupon the full set of Standard Scalar GPR
 Logical Operations may be used. This strategy has the side-effect of
 keeping the CRweird group down to only five instructions.
 
 ## Big-integer Math
 
-[[sv/biginteger]] has always been a high priority area for commercial 
+[[sv/biginteger]] has always been a high priority area for commercial
 applications, privacy, Banking, as well as HPC Numerical Accuracy:
 libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
 and ec25519 are finding their way into everyday use via OpenSSL.
@@ -236,14 +236,14 @@ require a 128-bit shifter to replace the existing Scalar Power ISA
 64-bit shifters.
 
 The reduction in instruction count these operations bring, in critical
-hotloops, is remarkably high, to the extent where a Scalar-to-Vector
+hot loops, is remarkably high, to the extent where a Scalar-to-Vector
 operation of *arbitrary length* becomes just the one Vector-Prefixed
 instruction.
 
 Whilst these are 5-6 bit XO their utility is considered high strategic
 value and as such are strongly advocated to be in EXT04. The alternative
 is to bring back a 64-bit Carry SPR but how it is retrospectively
-applicable to pre-existing Scalar Power ISA mutiply, divide, and shift
+applicable to pre-existing Scalar Power ISA multiply, divide, and shift
 operations at this late stage of maturity of the Power ISA is an entire
 area of research on its own deemed unlikely to be achievable.
 
@@ -260,7 +260,7 @@ Similar arguments apply to the GPR-INT move operations, proposed in
 [[ls006]], with the opportunity taken to add rounding modes present
 in other ISAs that Power ISA VSX PackedSIMD does not have. Javascript
 rounding, one of the worst offenders of Computer Science, requires a
-phenomental 35 instructions with *six branches* to emulate in Power
+phenomenal 35 instructions with *six branches* to emulate in Power
 ISA! For desktop as well as Server HTML/JS back-end execution of
 javascript this becomes an obvious priority, recognised already by ARM
 as just one example.
@@ -280,7 +280,7 @@ priority, and again just like in the CRweird group the opportunity was
 taken to work on *all* bits of a CR Field rather than just one bit as
 is done with the existing CR operations crand, cror etc.
 
-The other high strategic value instruction is `grevlut` (and `grevluti` 
+The other high strategic value instruction is `grevlut` (and `grevluti`
 which can generate a remarkably large number of regular-patterned magic
 constants). The grevlut set require of the order of 20,000 gates but
 provide an astonishing plethora of innovative bit-permuting instructions
@@ -316,12 +316,12 @@ introduce mv Swizzle operations, which can always be Macro-op fused
 in exactly the same way that ARM SVE predicated-move extends 3-operand
 "overwrite" opcodes to full independent 3-in 1-out.
 
-## BMI (bitmanipulation) group.
+## BMI (bit-manipulation) group.
 
 Whilst the [[sv/vector_ops]] instructions are only two in number, in
 reality the `bmask` instruction has a Mode field allowing it to cover
 **24** instructions, more than have been added to any other CPUs by
-ARM, Intel or AMD. Analyis of the BMI sets of these CPUs shows simple
+ARM, Intel or AMD. Analysis of the BMI sets of these CPUs shows simple
 patterns that can greatly simplify both Decode and implementation. These
 are sufficiently commonly used, saving instruction count regularly, that
 they justify going into EXT0xx.
@@ -338,17 +338,17 @@ instructions into one. However it is still not a huge priority unlike
 Very easily justified. As explained in [[ls002]] these always saves one
 LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
 FP value being in the I-Cache side. It is such a high priority that
-these instuctions are easily justifiable adding into EXT0xx, despite
+these instructions are easily justifiable adding into EXT0xx, despite
 requiring a 16-bit immediate. By designing the second-half instruction
-as a Read-Modify-Write it saves on XO bitlength (only 5 bits), and can be
+as a Read-Modify-Write it saves on XO bit-length (only 5 bits), and can be
 macro-op fused with its first-half to store a full IEEE754 FP32 immediate
 into a register.
 
 There is little point in putting these instructions into EXT2xx. Their
 very benefit and inherent value *is* as 32-bit instructions, not 64-bit
-ones. Likewise there is less value in taking up EXT1xx Enoding space
+ones. Likewise there is less value in taking up EXT1xx Encoding space
 because EXT1xx only brings an additional 16 bits (approx) to the table,
-and that is provided already by the second-half instuction.
+and that is provided already by the second-half instruction.
 
 Thus they qualify as both high priority and also EXT0xx candidates.
 
@@ -411,7 +411,7 @@ also to keep the number of registers used down to a minimum.
 
 Counter-examples are FMAC which had to be added to IEEE754 because the
 *internal* product requires more accuracy than can fit into a register.
-Another would be a dotproduct instruction, which again requires an accumulator
+Another would be a dot-product instruction, which again requires an accumulator
 of at least double the width of the two vector inputs. And in the AMDGPU
 ISA, there are Texture-mapping instructions taking up to an astounding
 *twelve* input operands!
@@ -495,7 +495,7 @@ where in-order systems pretty much just stall straight away.
 Less extreme examples include instructions that take only a few cycles to complete,
 but if used in tight loops with Conditional Branches, an Out-of-Order system with
 Speculative capability may need significantly more Reservation Stations to hold
-in-flight dats for instructions which take longer than those which do not.
+in-flight data for instructions which take longer than those which do not.
 
 **Can one instruction do the job of many?**
 
@@ -558,7 +558,7 @@ The key to headings and sections are as follows:
 * **regs** - a guide to register usage, to how costly Hazard Management
   will be, in hardware:
     - 1R: reads one GPR/FPR/SPR/CR.
- 1W: writes one GPR/FPR/SPR/CR.
+    - 1W: writes one GPR/FPR/SPR/CR.
     - 1r: reads one CR *Field* (not necessarily the entire CR)
     - 1w: writes one CR *Field* (not necessarily the entire CR)
 
-- 
2.30.2