openpower/sv/rfc/ls005.mdwn

   1 # OPF ISA WG External RFC ls005 v1: XLEN
   2
   3 * RFC Author: Luke Kenneth Casson Leighton.
   4 * RFC Contributors/Ideas: Jacob Lifshay, Toshaan Bharvani
   5 * Funded by NLnet under the NGI Zero Entrust EU Horizon Europe Grant 101069594
   6
   7 **URLs**:
   8
   9 * <https://libre-soc.org/openpower/sv/rfc/ls005/>
  10 * <https://bugs.libre-soc.org/show_bug.cgi?id=988>
  11 * <https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=openpower/isa;hb=HEAD>
  12 * <https://git.openpower.foundation/isa/PowerISA/issues/104>
  13
  14 **Severity**: Major
  15
  16 **Status**: New
  17
  18 **Date**: 22 Dec 2022 v2 TODO
  19
  20 **Target**: v3.2B
  21
  22 **Books and Section affected**:
  23
  24 ```
  25     Everything (in a consistent, regular and systematic fashion)
  26 ```
  27
  28 **Summary**
  29
  30 ```
  31     Exactly as is already done in RISC-V, convert the entire use of 64-bit hard-coding to "XLEN".
  32     Exactly as is in RISC-V, options then include PowerISA-32, PowerISA-64 and PowerISA-128.
  33     Unlike in RISC-V, the concept of PowerISA-16 and PowerISA-8 is also floated, for Embedded,
  34     AI, Edge, Processing-in-Memory, Distributed Computing and other purposes.
  35 ```
  36
  37 **Submitter**: Luke Leighton (Libre-SOC)
  38
  39 **Requester**: Libre-SOC
  40
  41 **Impact on processor**:
  42
  43 ```
  44     Entirely new processors, entirely new markets.
  45 ```
  46
  47 **Impact on software**:
  48
  49 ```
  50     Massive but regular, consistent, and systematic.
  51 ```
  52
  53 **Keywords**:
  54
  55 ```
  56     XLEN
  57 ```
  58
  59 **Motivation**
  60
  61 The Power ISA is far too massive, making it wholly unsuited for Embedded
  62 markets and adversely impacting its reach and potential.  The RISC paradigm
  63 it is based on has gone too far into PackedSIMD (128-bit).  Fixing this is
  64 relatively and conceptually straightforward: allow 32-bit and even 16-bit
  65 and 8-bit implementations, and use the opportunity to allow future Scalar
  66 128-bit implementations in the exact same strategic way that RISC-V has RV128.
  67
  68 Register files are redefined to XLEN width but are permitted to "group"
  69 registers together to create 16-bit, 32-bit and 64-bit addresses.
  70 In this way, the limitations that would otherwise restrict the usefulness
  71 of a severely-targeted application-specific processor may be overcome,
  72 making it still possible (at reduced performance) to run
  73 general-purpose applications.
  74 An AI application-specific, Processing-In-Memory or other
  75 specialist design may therefore, for example, focus the balance
  76 of raw computing power heavily onto 8-bit or 16-bit computation, but still
  77 gain the benefit of the Power ISA and everything it brings.  Contrast
  78 this with the more "normal" approach of creating heavily-focussed
  79 specialist "AI" Engines incapable of Turing-completeness and the benefits
  80 are clear.
  81
  82 Note 1: SVP64 **requires** this change as a 100% critical dependency.
  83 SIMD back-end ALUs process Vectors of "Elements" at 8, 16 and 32-bit (and
  84 64-bit), read from, processed, and returned to, the standard **Scalar**
  85 Register Files, with byte-level write-enable lines.  The proposal is
  86 therefore made as an opportunity for others interested in Scalar ISA
  87 8/16/32-bit (and future 128-bit variants of Scalar Power ISA) to take
  88 **and complete** that work in an incremental fashion, without having
  89 to be faced with a massive bulk and body of work as a prerequisite.
  90
  91 For example, whilst an SVP64 Prefixed `lbz` instruction
  92 (`sv.lbz`) is well-defined and has strict well-defined behaviour,
  93 a pure **Scalar-only** (non-SVP64) over-ridden `lbz` instruction
  94 has not been so well-defined, and would require a Stakeholder interested
  95 in 8/16/32-bit (and future 128-bit) to think through the implications
  96 and incrementally submit further OPF ISA RFCs.  With RISC-V **already
  97 having done this type of work** it is not technically difficult: it
  98 just requires another Stakeholder to do it.
  99
 100 Note 2: one alternative to this proposal, as far as SVP64 is concerned,
 101 is to literally duplicate the entirety of Chapters 3 and 4 Book III,
 102 and to create - and then maintain - multiple identical copies of the
 103 instructions including identical copies of the pseudocode except for
 104 substitution of occurrences of "64" with a "32" variant, "16" variant,
 105 "8" variant (and future "128" variant), and so on.  This would add
 106 over 700 additional pages to the Power ISA Specification and it should
 107 be clear that it would become a maintenance nightmare.
 108
 109 Another alternative is to poison and irredeemably damage the Power ISA
 110 (as a powerful and lean RISC ISA) by adding several hundred (close to 1,000)
 111 additional specific 8-bit, 16-bit and 32-bit (and in future 128-bit) Scalar
 112 instructions. Given that the 32-bit Opcode Allocation Space is already
 113 under pressure such a move would be extremely unwise for that reason alone.
 114
 115 **Changes**
 116
 117 For all pseudocode right across the board in all Scalar operations, replace
 118 hard-coded "64" with "XLEN".  **This work is already underway as sponsored
 119 by NLnet in the Libre-SOC Power ISA Pseudocode**.  The default is obviously
 120 recommended to be "XLEN=64" in order to create zero disruption.
 121
 122 Definitions of the Register File(s) for GPR and FPR are then changed to be
 123 "XLEN" wide.  However, for Embedded purposes (XLEN=32/16/8), an SPR controls
 124 whether (and how many) sequentially-grouped registers are taken together to
 125 create 16-bit, 32-bit and 64-bit addresses (depending on application need).
 126 GPR is obvious, FPR is quirky.  SVP64 redefines FP ops (those not ending in "s")
 127 to be "full width" and all ops ending in "s" to be "half of
 128 the full width".
 129
 130 * XLEN=64 keeps FPR "full width" exactly as presently defined, and
 131   "half width" exactly as presently defined.
 132 * XLEN=32 overrides FPR "full width" operations to
 133   full BFP32, and "half width" to be "BFP16 stored in a BFP32".
 134 * XLEN=16 redefines FPR "full width" operations to full [IEEE BFP16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) and leaves
 135   "half width" RESERVED (there is no IEEE version of [FP8](https://web.archive.org/web/20221223085833/https://wccftech.com/nvidia-intel-arm-bet-their-ai-future-on-fp8-whitepaper-for-8-bit-fp-published/)).
 136 * XLEN=8 redefines FPR "full width" operations to [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) and leaves
 137   "half width" RESERVED.
 138
 139 ----------------
 140
 141 # Examples
 142
 143 ## pseudocode examples demonstrating modification.
 144
 145 before for popcntb:
 146
 147 ```
 148 do i = 0 to 7
 149    n <-  0
 150    do j = 0 to 7
 151       if (RS)[(i*8)+j] = 1 then
 152           n <- n+1
 153    RA[(i*8):(i*8)+7] <-  n
 154 ```
 155
 156 after:
 157
 158 ```
 159 do i = 0 to ((XLEN/8)-1)
 160    n <-  0
 161    do j = 0 to 7
 162       if (RS)[(i*8)+j] = 1 then
 163           n <- n+1
 164    RA[(i*8):(i*8)+7] <-  n
 165 ```
 166
 167 Here, as the instruction's intent is to count bits per byte, and RA contains on
 168 a per-byte basis a SIMD-style count of each byte's 1s, it becomes possible
 169 to simply process fewer bytes.
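
The XLEN-parameterised loop above can be modelled in a few lines of Python. This is only an illustrative sketch, not part of the RFC: the function name `popcntb` and the `xlen` keyword are assumptions made for this example, and registers are modelled as plain unsigned integers.

```python
def popcntb(rs: int, xlen: int = 64) -> int:
    """Per-byte population count over an XLEN-bit value.

    Hypothetical model: xlen must be a positive multiple of 8.
    """
    assert xlen > 0 and xlen % 8 == 0
    rs &= (1 << xlen) - 1                 # keep only XLEN bits of the source
    ra = 0
    for i in range(xlen // 8):            # XLEN/8 byte lanes, as in the pseudocode
        byte = (rs >> (8 * i)) & 0xFF     # extract one byte lane
        count = bin(byte).count("1")      # number of 1 bits in that byte (0..8)
        ra |= count << (8 * i)            # the count lands in the same lane of RA
    return ra
```

With `xlen=8` the loop runs once, with `xlen=16` twice, and with the default `xlen=64` it behaves like the unmodified instruction.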
 170
 171 Should it be more useful to redefine popcntb in terms of always returning
 172 eight results? For example `sv.popcntb/w=16` to return 8 2-bit counts of
 173 the number of bits in each 2-bit group in RS?
 174
 175 ## no modification needed, but function changes
 176
 177 For the `addi` instruction there is no apparent change:
 178
 179 ```
 180 RT <- (RA|0) + EXTS(SI)
 181 ```
 182
 183 However behind the scenes, RA is XLEN bits wide, therefore EXTS performs an
 184 increase in bitlength not to exactly 64 but to XLEN.  Obviously for XLEN=16
 185 there is no sign-extension (`SI` is already 16 bits wide), and for XLEN=8 truncation of `SI` will occur.
 186 This illustrates that there are subtle quirks involved, requiring some thought.
 187
 188 The reason for keeping as many bits of the Immediate as possible should be clear.
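
A minimal Python sketch of how `EXTS` could behave when its target width is XLEN rather than 64. The helper names `exts` and `addi` are hypothetical, and the choice to truncate to the low XLEN bits when XLEN is narrower than the 16-bit immediate is an assumption based on the description above, not a defined behaviour.

```python
def exts(value: int, from_bits: int, xlen: int) -> int:
    """Sign-extend (or truncate) a from_bits-wide field to an xlen-bit pattern."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):          # sign bit of the source field is set
        value -= 1 << from_bits           # reinterpret as a negative Python int
    return value & ((1 << xlen) - 1)      # two's-complement wrap to XLEN bits


def addi(ra_val: int, si: int, xlen: int = 64) -> int:
    """RT <- (RA|0) + EXTS(SI); the caller passes 0 as ra_val when RA=0."""
    return (ra_val + exts(si, 16, xlen)) & ((1 << xlen) - 1)
```

For example, `exts(0xFFFF, 16, 64)` fills all 64 bits with ones, `exts(0x8001, 16, 16)` returns the immediate unchanged, and `exts(0x1234, 16, 8)` keeps only the low byte `0x34`.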
 189
 190 ## Compare Ranged Byte (cmprb BF,L,RA,RB)
 191
 192 ```
 193     src1    <- EXTZ((RA)[XLEN-8:XLEN-1])
 194     src21hi <- EXTZ((RB)[XLEN-32:XLEN-25])
 195     src21lo <- EXTZ((RB)[XLEN-24:XLEN-17])
 196     src22hi <- EXTZ((RB)[XLEN-16:XLEN-9])
 197     src22lo <- EXTZ((RB)[XLEN-8:XLEN-1])
 198     if L=0 then
 199        in_range <-  (src22lo  <= src1) & (src1 <=  src22hi)
 200     else
 201        in_range <- (((src21lo  <= src1) & (src1 <=  src21hi)) |
 202                     ((src22lo  <= src1) & (src1 <=  src22hi)))
 203     CR[4*BF+32] <- 0b0
 204     CR[4*BF+33] <- in_range
 205     CR[4*BF+34] <- 0b0
 206     CR[4*BF+35] <- 0b0
 207 ```
 208
 209 Compare Ranged Byte takes either one or two ranges from RB as individual bytes,
 210 thus requiring a minimum 16-bit (32-bit when L=1) operand RB.
 211 src1 on the other hand is only
 212 8 bits long: the least-significant byte of RA.
 213
 214 Therefore a little more thought is required. Should this simply be UNDEFINED
 215 behaviour when XLEN=8/16 and L=1? When XLEN=16, L=0 the instruction is still
 216 valid. Would it be costly at the Decoder?
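
To make the width question concrete, here is a hedged Python model of `cmprb` parameterised on XLEN. The names `field` and `cmprb` are hypothetical, each range bound is taken as an 8-bit field, and raising `ValueError` when RB is too narrow is just one way to model the "UNDEFINED" option raised above; the RFC does not decide this.

```python
def field(value: int, start: int, end: int, xlen: int) -> int:
    """Extract MSB0-numbered bits start..end (inclusive) of an xlen-bit value."""
    width = end - start + 1
    return (value >> (xlen - 1 - end)) & ((1 << width) - 1)


def cmprb(ra: int, rb: int, l: int, xlen: int = 64) -> int:
    """Return the 4-bit CR field as an int; only the second bit can be set."""
    needed = 32 if l else 16              # bits of RB actually consumed
    if xlen < needed:
        # Assumption: model the open question as an error, not defined behaviour.
        raise ValueError(f"cmprb with L={l} needs XLEN >= {needed}")
    src1 = field(ra, xlen - 8, xlen - 1, xlen)
    src22hi = field(rb, xlen - 16, xlen - 9, xlen)
    src22lo = field(rb, xlen - 8, xlen - 1, xlen)
    in_range = src22lo <= src1 <= src22hi
    if l:
        src21hi = field(rb, xlen - 32, xlen - 25, xlen)
        src21lo = field(rb, xlen - 24, xlen - 17, xlen)
        in_range = in_range or (src21lo <= src1 <= src21hi)
    return 0b0100 if in_range else 0b0000
```

The single check on `xlen` in this sketch suggests the decode cost may be small, but that is an observation about the model only, not about a hardware decoder.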
 217
 218 ## Trap Word Immediate
 219
 220 Like FP Single operations there also exist operations at "half of regfile width"
 221 in the Integer realm.  They are discernible with the designation "Word" in their
 222 title, such as "Trap WORD Immediate".
 223
 224 ```
 225     a <- EXTS((RA)[XLEN/2:XLEN-1])
 226     if (a < EXTS(SI)) & TO[0]  then TRAP
 227     if (a > EXTS(SI)) & TO[1]  then TRAP
 228     if (a = EXTS(SI)) & TO[2]  then TRAP
 229     if (a <u EXTS(SI)) & TO[3] then TRAP
 230     if (a >u EXTS(SI)) & TO[4] then TRAP
 231 ```
 232
 233 Here, EXTS receives **half** of the bits of its input register operand, RA.
 234 Note this is **not** "32 bit because a Word is 32-bit". The definition
 235 "Trap Word Immediate" has to be replaced with "Trap Half-register-width Immediate"
 236 but this is very clumsy.
 237
 238 When XLEN=8 "half register width" is clearly 4 bit, thus the LSB nibble is tested,
 239 but still sign-extended for comparison
 240 against the 16-bit signed immediate.
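
The half-register-width comparison can be sketched in Python as follows. The names `to_signed` and `trap_half_imm` are hypothetical. The unsigned comparisons are performed at a width of `max(XLEN, 16)`, which is an assumption; once both operands have been sign-extended to a common width at least as wide as either source, the result of the unsigned comparison does not depend on that width.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def trap_half_imm(ra: int, si: int, to: int, xlen: int = 64) -> bool:
    """Return True if the trap condition selected by the 5-bit TO field is met."""
    half = xlen // 2
    a = to_signed(ra, half)               # (RA)[XLEN/2:XLEN-1] is the low half
    b = to_signed(si, 16)                 # 16-bit signed immediate
    mask = (1 << max(xlen, 16)) - 1       # assumed common comparison width
    au, bu = a & mask, b & mask           # same values viewed as unsigned
    conds = [a < b, a > b, a == b, au < bu, au > bu]
    # TO[0] is the most significant of the five TO bits (MSB0 numbering).
    return any(c and (to >> (4 - i)) & 1 for i, c in enumerate(conds))
```

With `xlen=8` only the low nibble of RA takes part: `0x0F` is read as -1, so it compares equal to an immediate of `0xFFFF`.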
 241
 242 ## Extend Sign byte/half/word
 243
 244 This instruction can be redefined again in terms of:
 245
 246 * "Word" meaning "Half of register width"
 247 * "Half-word" meaning "Quarter of register width"
 248 * "Byte" meaning "One-eighth of register width"
 249
 250 And a table results as follows:
 251
 252 ```
 253     XLEN=8:
 254                 extsb: 1-bit -> 8-bit sign extension
 255                 extsh: 2-bit -> 8-bit sign extension
 256                 extsw: 4-bit -> 8-bit sign extension
 257     XLEN=16:
 258                 extsb: 2-bit -> 16-bit sign extension
 259                 extsh: 4-bit -> 16-bit sign extension
 260                 extsw: 8-bit -> 16-bit sign extension
 261     XLEN=32:
 262                 extsb: 4-bit -> 32-bit sign extension
 263                 extsh: 8-bit -> 32-bit sign extension
 264                 extsw: 16-bit -> 32-bit sign extension
 265     XLEN=64:
 266                 extsb: 8-bit -> 64-bit sign extension
 267                 extsh: 16-bit -> 64-bit sign extension
 268                 extsw: 32-bit -> 64-bit sign extension
 269 ```
 270
 271 If the instructions were kept as presently defined then there
 272 is a loss of functionality and opportunity:
 273
 274 ```
 275     XLEN=8: # completely wasted opportunity
 276                 extsb: 8-bit  -> 8-bit does nothing
 277                 extsh: 16-bit -> 8-bit truncates
 278                 extsw: 32-bit -> 8-bit truncates
 279     XLEN=16: # wasted 2/3 of encoding
 280                 extsb: 8-bit  -> 16-bit sign extension
 281                 extsh: 16-bit -> 16-bit does nothing
 282                 extsw: 32-bit -> 16-bit truncates
 283     XLEN=32: # wasted 1/3 of encoding
 284                 extsb: 8-bit  -> 32-bit sign extension
 285                 extsh: 16-bit -> 32-bit sign extension
 286                 extsw: 32-bit -> 32-bit does nothing
 287     XLEN=64: # unchanged (default) behaviour
 288                 extsb: 8-bit  -> 64-bit sign extension
 289                 extsh: 16-bit -> 64-bit sign extension
 290                 extsw: 32-bit -> 64-bit sign extension
 291 ```
 292
 293 The RTL for `extsb` becomes:
 294
 295 ```
 296     in <- (RA)[XLEN-8:XLEN-1] # extract first byte
 297     if XLEN = 8  then RT <- in[7] * 8             # 1->8
 298     if XLEN = 16 then RT <- in[6] * 15 || in[7]   # 2->16
 299     if XLEN = 32 then RT <- in[4] * 29 || in[5:7] # 4->32
 300     if XLEN = 64 then RT <- in[0] * 57 || in[1:7] # 8->64
 301 ```
 302
 303 And `extsh` and `extsw` follow similar logic. Interestingly there is
 304 no loss of functionality compared to keeping `extsb` always as "byte
 305 sign-extending" and ironically the loss of opportunity *is* to keep
 306 `extsb` the same (extend *byte* regardless of XLEN).
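
The scaled definitions in the first table can be checked with a short Python model. This is a sketch under the assumption that "byte", "half-word" and "word" mean one-eighth, one-quarter and one-half of XLEN respectively; the helper name `exts_scaled` is hypothetical.

```python
def exts_scaled(ra: int, divisor: int, xlen: int = 64) -> int:
    """Sign-extend the low XLEN/divisor bits of ra to XLEN bits."""
    src_bits = xlen // divisor            # width of the field being extended
    value = ra & ((1 << src_bits) - 1)
    if value >> (src_bits - 1):           # top bit of the field is the sign
        value -= 1 << src_bits
    return value & ((1 << xlen) - 1)      # XLEN-bit two's-complement pattern


def extsb(ra: int, xlen: int = 64) -> int:
    return exts_scaled(ra, 8, xlen)       # "byte" = one-eighth of XLEN


def extsh(ra: int, xlen: int = 64) -> int:
    return exts_scaled(ra, 4, xlen)       # "half-word" = one-quarter of XLEN


def extsw(ra: int, xlen: int = 64) -> int:
    return exts_scaled(ra, 2, xlen)       # "word" = half of XLEN
```

At the default `xlen=64` these reduce to the existing 8, 16 and 32-bit sign extensions, while at `xlen=8` `extsb` replicates a single bit across the whole register.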
 307
 308 [[!tag opf_rfc]]
 309
 310 \newpage{}
 311