# OPF ISA WG External RFC ls005.xlen v1: XLEN

* RFC Author: Luke Kenneth Casson Leighton.
* RFC Contributors/Ideas: Jacob Lifshay, Toshaan Bharvani
* Funded by NLnet under the NGI Zero Entrust EU Horizon Europe Grant 101069594
* <https://libre-soc.org/openpower/sv/rfc/ls005.xlen/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1062>
* <https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=openpower/isa;hb=HEAD>
* <https://git.openpower.foundation/isa/PowerISA/issues/104>

**Severity**: Major

**Status**: New

**Date**: 22 Dec 2022 v2 TODO

**Target**: v3.2B

**Books and Section affected**:

```
    Everything (in a consistent, regular and systematic fashion)
```

**Summary**

```
    Exactly as is already done in RISC-V, convert the entire use of 64-bit hard-coding to "XLEN".
    Exactly as is in RISC-V, options then include PowerISA-32, PowerISA-64 and PowerISA-128.
    Unlike in RISC-V, the concept of PowerISA-16 and PowerISA-8 is also floated, for Embedded,
    AI, Edge, Processing-in-Memory, Distributed Computing and other purposes.
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
    Entirely new processors, entirely new markets.
```

**Impact on software**:

```
    Massive but regular, consistent, and systematic.
```

**Keywords**:

```
    XLEN
```

**Motivation**

The Power ISA is far too massive, making it wholly unsuited for Embedded
markets and adversely impacting its reach and potential.  The RISC paradigm
it is based on has gone too far into PackedSIMD (128-bit).  Fixing this is
conceptually straightforward: allow 32-bit and even 16-bit
and 8-bit implementations, and use the opportunity to allow future Scalar
128-bit implementations in the exact same strategic way that RISC-V has RV128.

Register files are redefined to XLEN width but are permitted to "group"
registers together to create 16-bit, 32-bit and 64-bit addresses.
In this way, the limitations that would otherwise restrict the usefulness
of a narrowly-targeted application-specific processor may be overcome,
making it still possible (at reduced performance) to run
general-purpose applications.
An AI application-specific processor, a Processing-In-Memory design or
another specialist design may therefore, for example, focus its
raw computing power heavily on 8-bit or 16-bit computation, but still
gain the benefit of the Power ISA and everything it brings.  Contrast
this with the more "normal" approach of creating heavily-focussed
specialist "AI" Engines incapable of Turing-completeness and the benefits
are clear.

Note 1: SVP64 **requires** this change as a 100% critical dependency.
SIMD back-end ALUs process Vectors of "Elements" at 8, 16 and 32-bit (and
64-bit), read from, processed, and returned to, the standard **Scalar**
Register Files, with byte-level write-enable lines.  The proposal is
therefore made as an opportunity for others interested in Scalar ISA
8/16/32-bit (and future 128-bit variants of Scalar Power ISA) to take
**and complete** that work in an incremental fashion, without facing
a massive body of work as a prerequisite.

For example, whilst an SVP64 Prefixed `lbz` instruction
(`sv.lbz`) has strict, well-defined behaviour,
a pure **Scalar-only** (non-SVP64) over-ridden `lbz` instruction
has not been so well-defined, and would require a Stakeholder interested
in 8/16/32-bit (and future 128-bit) to think through the implications
and incrementally submit further OPF ISA RFCs.  With RISC-V **already
having done this type of work** it is not technically difficult: it
just requires another Stakeholder to do it.

Note 2: one alternative to this proposal, as far as SVP64 is concerned,
is to literally duplicate the entirety of Chapters 3 and 4 Book III,
and to create - and then maintain - multiple copies of the
instructions, with pseudocode that is identical except for
substitution of occurrences of "64" with a "32" variant, "16" variant,
"8" variant (and future "128" variant), and so on.  This would add
over 700 additional pages to the Power ISA Specification and it should
be clear that it would become a maintenance nightmare.

Another alternative is to poison and irredeemably damage the Power ISA
(as a powerful and lean RISC ISA) by adding several hundred (close to 1,000)
additional specific 8-bit, 16-bit and 32-bit (and in future 128-bit) Scalar
instructions. Given that the 32-bit Opcode Allocation Space is already
under pressure such a move would be extremely unwise for that reason alone.

**Changes**

For all pseudocode right across the board in all Scalar operations, replace
hard-coded "64" with "XLEN".  **This work is already underway as sponsored
by NLnet in the Libre-SOC Power ISA Pseudocode**.  The recommended default
is "XLEN=64", which creates zero disruption.

Definitions of the Register File(s) for GPR and FPR are then changed to be
"XLEN" wide.  However, for Embedded purposes (XLEN=32/16/8), an SPR controls
whether (and how many) sequentially-grouped registers are taken together to
create 16-bit, 32-bit and 64-bit addresses (depending on application need).
The GPR case is straightforward; the FPR case is quirkier.  SVP64 redefines
FP ops (those not ending in "s") to be "full width" and all ops ending in
"s" to be "half of the full width".

* XLEN=64 keeps FPR "full width" exactly as presently defined, and
  "half width" exactly as presently defined.
* XLEN=32 overrides FPR "full width" operations to
  full BFP32, and "half width" to be "BFP16 stored in a BFP32"
* XLEN=16 redefines FPR "full width" operations to full [IEEE BFP16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) and leaves
  "half width" RESERVED (there is no IEEE version of [FP8](https://web.archive.org/web/20221223085833/https://wccftech.com/nvidia-intel-arm-bet-their-ai-future-on-fp8-whitepaper-for-8-bit-fp-published/)).
* XLEN=8 redefines FPR "full width" operations to [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) and leaves
  "half width" RESERVED.
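
The FPR width list above can be restated as a small lookup table. The following Python sketch is illustrative only: the names `FP_FORMATS` and `fp_formats` are hypothetical and not part of the RFC, and the XLEN=64 entry assumes that "as presently defined" means double precision for full width and single precision for half width.

```python
# Hypothetical restatement of the FPR width list; not part of the RFC.
# Each entry maps XLEN to (full-width format, half-width format).
# None marks an encoding that the RFC leaves RESERVED.
FP_FORMATS = {
    64: ("BFP64", "BFP32"),                    # assumption: today's double/single
    32: ("BFP32", "BFP16 stored in a BFP32"),
    16: ("IEEE BFP16", None),
    8: ("bfloat16", None),
}


def fp_formats(xlen):
    """Return the (full-width, half-width) FP formats for a given XLEN."""
    try:
        return FP_FORMATS[xlen]
    except KeyError:
        raise ValueError(f"no FP formats listed for XLEN={xlen}") from None
```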

----------------

# Examples

## pseudocode examples demonstrating modification.

Before, for `popcntb`:

```
do i = 0 to 7
   n <-  0
   do j = 0 to 7
      if (RS)[(i*8)+j] = 1 then
          n <- n+1
   RA[(i*8):(i*8)+7] <-  n
```

After:

```
do i = 0 to ((XLEN/8)-1)
   n <-  0
   do j = 0 to 7
      if (RS)[(i*8)+j] = 1 then
          n <- n+1
   RA[(i*8):(i*8)+7] <-  n
```

Here the instruction's intent is to count the set bits in each byte, and RA
receives, on a per-byte basis, a SIMD-style count of each byte's 1s.  It
therefore becomes possible simply to process fewer bytes.

Would it be more useful to redefine `popcntb` to always return
eight results? For example, should `sv.popcntb/w=16` return eight 2-bit
counts, each the number of set bits in the corresponding 2-bit group of RS?
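
As a cross-check of the XLEN-parameterised `popcntb` pseudocode above, here is a minimal Python sketch. It models registers as plain integers and assumes XLEN is a multiple of 8. The function name and signature are illustrative, not taken from the Libre-SOC simulator.

```python
def popcntb(rs, xlen=64):
    """Per-byte population count of an xlen-bit register value.

    Illustrative model only: each byte of the result holds the number
    of 1 bits in the corresponding byte of rs.
    """
    if xlen <= 0 or xlen % 8 != 0:
        raise ValueError("this sketch assumes XLEN is a positive multiple of 8")
    rs &= (1 << xlen) - 1          # keep only the XLEN register bits
    ra = 0
    for i in range(xlen // 8):     # XLEN/8 bytes instead of a fixed 8
        byte = (rs >> (8 * i)) & 0xFF
        ra |= bin(byte).count("1") << (8 * i)
    return ra
```

Because each byte is handled independently, the order in which bytes are visited does not matter. With `xlen=64` the sketch gives the same results as the unmodified instruction, and with `xlen=8` it reduces to a single-byte popcount.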

## no modification needed, but function changes

For the `addi` instruction there is no apparent change:

```
RT <- (RA|0) + EXTS(SI)
```

However behind the scenes, RA is XLEN bits wide, therefore EXTS performs an
increase in bitlength not to exactly 64 but to XLEN.  Obviously for XLEN=16
there is no sign-extension (`SI` is already 16 bits wide), and for XLEN=8
truncation of `SI` will occur.  This illustrates that there are subtle
quirks involved, requiring some thought.

The reason for keeping as many bits of the Immediate as possible should be clear.
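
The effect of XLEN on `EXTS` can be seen in a small Python sketch. It is a model under stated assumptions, not the reference pseudocode: registers are plain integers, `SI` is the 16-bit immediate field, the caller has already resolved `(RA|0)`, and the helper names are hypothetical.

```python
def exts(value, from_bits, xlen):
    """Sign-extend a from_bits-wide field to xlen bits.

    When xlen is narrower than from_bits the result is truncated,
    which is what happens to SI when XLEN=8.
    """
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):   # sign bit set: make it negative
        value -= 1 << from_bits
    return value & ((1 << xlen) - 1)     # wrap (or truncate) to XLEN bits


def addi(ra_value, si, xlen=64):
    """Model of RT <- (RA|0) + EXTS(SI); ra_value is already (RA|0)."""
    mask = (1 << xlen) - 1
    return ((ra_value & mask) + exts(si, 16, xlen)) & mask
```

With `xlen=64` or `xlen=32` an immediate of `0xFFFF` acts as -1. With `xlen=16` the immediate passes through unchanged, and with `xlen=8` only its low 8 bits take part in the addition.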

## Compare Ranged Byte (cmprb BF,L,RA,RB)

```
    src1    <- EXTZ((RA)[XLEN-8:XLEN-1])
    src21hi <- EXTZ((RB)[XLEN-32:XLEN-25])
    src21lo <- EXTZ((RB)[XLEN-24:XLEN-17])
    src22hi <- EXTZ((RB)[XLEN-16:XLEN-9])
    src22lo <- EXTZ((RB)[XLEN-8:XLEN-1])
    if L=0 then
       in_range <-  (src22lo  <= src1) & (src1 <=  src22hi)
    else
       in_range <- (((src21lo  <= src1) & (src1 <=  src21hi)) |
                    ((src22lo  <= src1) & (src1 <=  src22hi)))
    CR[4*BF+32] <- 0b0
    CR[4*BF+33] <- in_range
    CR[4*BF+34] <- 0b0
    CR[4*BF+35] <- 0b0
```

Compare Ranged Byte takes either one or two ranges from RB as individual bytes,
thus requiring a minimum 16-bit (32-bit when L=1) operand RB.
src1 on the other hand is only
8 bits long: the least-significant byte of RA.

Therefore a little more thought is required. Should this simply be UNDEFINED
behaviour when XLEN=8 (where RB cannot hold even one range) and when XLEN=16
with L=1? When XLEN=16 and L=0 the instruction is still
valid. Would detecting these cases be costly at the Decoder?
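
The width requirement can be illustrated with a Python sketch of `cmprb`. The RFC leaves the too-narrow case as an open question, so raising `ValueError` here is purely an assumption made for illustration, as are the function name and the returned tuple of four CR field bits.

```python
def cmprb(l, ra, rb, xlen=64):
    """Model of Compare Ranged Byte for a given XLEN.

    Returns the four bits written to CR field BF as a tuple.
    Assumption: a too-narrow RB is reported with ValueError, standing
    in for whatever behaviour (e.g. UNDEFINED) is eventually chosen.
    """
    needed = 32 if l else 16           # two ranges need 32 bits, one needs 16
    if xlen < needed:
        raise ValueError(f"cmprb with L={l} needs XLEN >= {needed}")
    src1 = ra & 0xFF                   # least-significant byte of RA
    src22lo = rb & 0xFF
    src22hi = (rb >> 8) & 0xFF
    in_range = src22lo <= src1 <= src22hi
    if l:
        src21lo = (rb >> 16) & 0xFF
        src21hi = (rb >> 24) & 0xFF
        in_range = in_range or (src21lo <= src1 <= src21hi)
    return (0, int(in_range), 0, 0)
```

For example, `cmprb(0, 0x35, 0x3930, xlen=16)` tests whether the byte `0x35` lies in the range `0x30` to `0x39`, which only needs a 16-bit RB.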

## Trap Word Immediate

Like FP Single operations there also exist operations at "half of regfile width"
in the Integer realm.  They are discernible by the designation "Word" in their
title, such as "Trap WORD Immediate".

```
    a <- EXTS((RA)[XLEN/2:XLEN-1])
    if (a < EXTS(SI)) & TO[0]  then TRAP
    if (a > EXTS(SI)) & TO[1]  then TRAP
    if (a = EXTS(SI)) & TO[2]  then TRAP
    if (a <u EXTS(SI)) & TO[3] then TRAP
    if (a >u EXTS(SI)) & TO[4] then TRAP
```

Here, EXTS receives **half** of the bits of its input register operand, RA.
Note this is **not** "32 bit because a Word is 32-bit". Strictly, the name
"Trap Word Immediate" would have to be replaced with "Trap Half-register-width
Immediate", which is very clumsy.

When XLEN=8 "half register width" is clearly 4 bits, thus the least-significant
nibble is tested, but still sign-extended for comparison
against the 16-bit signed immediate.
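
A Python sketch of the "half register width" trap test follows. Two points are assumptions rather than anything the RFC states: the TO field is passed as a 5-bit integer with TO[0] as its most-significant bit, and the unsigned comparisons are carried out at `max(XLEN, 16)` bits so that the 16-bit immediate is never truncated. The helper names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def twi_traps(to, ra, si, xlen=64):
    """Return True if Trap Word Immediate would trap.

    a is the sign-extended low half of RA (XLEN/2 bits), compared
    against the sign-extended 16-bit immediate SI.
    """
    width = max(xlen, 16)                # assumption: comparison width
    mask = (1 << width) - 1
    a = sign_extend(ra, xlen // 2)       # (RA)[XLEN/2:XLEN-1]
    b = sign_extend(si, 16)
    au, bu = a & mask, b & mask          # unsigned views for <u and >u
    conditions = [a < b, a > b, a == b, au < bu, au > bu]
    to_bits = [(to >> (4 - i)) & 1 for i in range(5)]   # TO[0] is the MSB
    return any(c and t for c, t in zip(conditions, to_bits))
```

With `xlen=8` only the low nibble of RA is examined, so a value of `0xF` is treated as -1 by the signed comparisons.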

## Extend Sign byte/half/word

These instructions can likewise be redefined in terms of:

* "Word" meaning "Half of register width"
* "Half-word" meaning "Quarter of register width"
* "Byte" meaning "One-eighth of register width"

The following table results:

```
    XLEN=8:
                extsb: 1-bit -> 8-bit sign extension
                extsh: 2-bit -> 8-bit sign extension
                extsw: 4-bit -> 8-bit sign extension
    XLEN=16:
                extsb: 2-bit -> 16-bit sign extension
                extsh: 4-bit -> 16-bit sign extension
                extsw: 8-bit -> 16-bit sign extension
    XLEN=32:
                extsb: 4-bit -> 32-bit sign extension
                extsh: 8-bit -> 32-bit sign extension
                extsw: 16-bit -> 32-bit sign extension
    XLEN=64:
                extsb: 8-bit -> 64-bit sign extension
                extsh: 16-bit -> 64-bit sign extension
                extsw: 32-bit -> 64-bit sign extension
```

If the instructions were kept as presently defined then there
would be a loss of functionality and opportunity:

```
    XLEN=8: # completely wasted opportunity
                extsb: 8-bit  -> 8-bit does nothing
                extsh: 16-bit -> 8-bit truncates
                extsw: 32-bit -> 8-bit truncates
    XLEN=16: # wasted 2/3 of encoding
                extsb: 8-bit  -> 16-bit sign extension
                extsh: 16-bit -> 16-bit does nothing
                extsw: 32-bit -> 16-bit truncates
    XLEN=32: # wasted 1/3 of encoding
                extsb: 8-bit  -> 32-bit sign extension
                extsh: 16-bit -> 32-bit sign extension
                extsw: 32-bit -> 32-bit does nothing
    XLEN=64: # unchanged (default) behaviour
                extsb: 8-bit  -> 64-bit sign extension
                extsh: 16-bit -> 64-bit sign extension
                extsw: 32-bit -> 64-bit sign extension
```

The RTL for `extsb` becomes:

```
    in <- (RA)[XLEN-8:XLEN-1] # extract least-significant byte
    if XLEN = 8  then RT <- in[7] * 8             # 1->8
    if XLEN = 16 then RT <- in[6] * 15 || in[7]   # 2->16
    if XLEN = 32 then RT <- in[4] * 29 || in[5:7] # 4->32
    if XLEN = 64 then RT <- in[0] * 57 || in[1:7] # 8->64
```

`extsh` and `extsw` follow similar logic. Interestingly, redefining the
widths loses no functionality compared to keeping `extsb` as "byte
sign-extending" at every XLEN: each useful extension in the second table
also appears in the first. Ironically, the loss of opportunity lies in
keeping `extsb` the same (extending a *byte* regardless of XLEN).
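
The "fraction of register width" reading of the first table can be sketched in Python as below. This is an illustrative model only: the `WIDTH_DIVISOR` table and the `ext_sign` function are hypothetical names, registers are plain integers, and the source field is taken from the least-significant bits of RA as in the `extsb` RTL. For instance, `ext_sign("extsb", 0b10, 16)` sign-extends a 2-bit field and gives `0xFFFE`, matching the `in[6] * 15 || in[7]` case.

```python
# Hypothetical table: source width is XLEN divided by this value.
WIDTH_DIVISOR = {"extsb": 8, "extsh": 4, "extsw": 2}


def ext_sign(op, ra, xlen=64):
    """Sign-extend the low XLEN/8, XLEN/4 or XLEN/2 bits of ra to XLEN bits."""
    src_bits = xlen // WIDTH_DIVISOR[op]
    field = ra & ((1 << src_bits) - 1)   # low-order source field
    if field >> (src_bits - 1):          # top bit of the field is the sign
        field -= 1 << src_bits
    return field & ((1 << xlen) - 1)     # express the result in XLEN bits
```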

[[!tag opf_rfc]]

\newpage{}
