# 16 bit Compressed

Similar to VLE (but without immediate-prefixing) this encoding is designed
to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR
is recommended). Note that Compressed is *mutually incompatible*
with OpenPOWER v3.1B "prefixing" because it uses (requires) both EXT000
and EXT001. Hypothetically it could be made to use anything other than
EXT001, with some inconvenience (extra gates).  The incompatibility is
"fixed" by swapping out of "Compressed" Mode and back into "Normal"
(v3.1B) Mode, at runtime, as needed.

Although initially intended to be augmented by Simple-V Prefixing (to
add Vector context, width overrides, e.g. IEEE754 FP16, and predication)
without putting pressure on I-Cache power or size, this Compressed
Encoding is not critically dependent *on* SV Prefixing, and may be used
stand-alone.

See:

* <https://bugs.libre-soc.org/show_bug.cgi?id=238>
* <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding
* <http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html>

This one is a conundrum.  OpenPOWER ISA was never designed with 16
bit in mind.  VLE was added 10 years ago but only by way of marking
an entire 64k page as "VLE".  Because VLE is not maintained, it is not
fully compatible with current PowerISA.

Here, in order to embed 16 bit into a predominantly 32 bit stream,
using an entire 16 bits just to switch into Compressed mode
is itself a significant overhead.  The situation is made worse by
OpenPOWER ISA being fundamentally designed with 6 bits uniformly
taking up Major Opcode space, leaving only 10 bits to allocate
to actual instructions.

Contrast this with RVC which takes 3 out of 4 combinations of the first 2
bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and
uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing
standard 32 bit and 16 bit to intermingle cleanly.  To achieve the same
thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which
is clearly impractical: other schemes need to be devised.
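
The contrast between the two schemes can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this page, and the example words (`0x4501`, `0x00000013`, `0x38210010`) are ordinary RISC-V and OpenPOWER instructions picked to exercise the checks.

```python
def rvc_is_16bit(first_halfword):
    """RISC-V: any value other than 0b11 in the two LSBs marks a 16-bit insn."""
    return (first_halfword & 0b11) != 0b11


def power_major_opcode(insn32):
    """OpenPOWER: the Major Opcode is the top 6 bits (MSB0 bits 0-5)."""
    return (insn32 >> 26) & 0x3F


# RISC-V decides the length from two bits of the first halfword.
print(rvc_is_16bit(0x4501))                # True: LSBs are 0b01
print(rvc_is_16bit(0x00000013 & 0xFFFF))   # False: LSBs are 0b11

# OpenPOWER spends 6 bits on the Major Opcode before anything else.
print(power_major_opcode(0x38210010))      # 14, the addi Major Opcode
```

The point of the comparison is that the RISC-V test costs 2 bits of every halfword, whereas on OpenPOWER 6 bits are already consumed before any 16-bit opcode space begins.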

In addition we would like to add SV-C32 which is a Vectorised version
of 16 bit Compressed, and ideally have a variant that adds the 27-bit
prefix format from SV-P64, as well.

Potential ways to reduce pressure on the 16 bit space are:

* To use more than one v3.0B Major Opcode, preferably an odd-even
  contiguous pair
* To provide "paging".  This involves bank-switching to alternative
  optimised encodings for specific workloads
* To enter "16 bit mode" for durations specified at the start
* To reserve one bit of every 16 bit instruction to indicate that the
  16 bit mode is to continue to be sustained

This latter would be useful in the Vector context to have an alternative
meaning: as the bit which determines whether the instruction is 11-bit
prefixed or 27-bit prefixed:

    0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | 11 bit vector prefix|
    |16 bit opcode  alt vec. mode ^ |
    | extra vector prefix if alt set|

Using a major opcode to enter 16 bit mode leaves 11 bits for which
a use must be found:

    0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | what to do here   1 |
    |16 bit    stay in 16bit mode 1 |
    |16 bit    stay in 16bit mode 1 |
    |16 bit       exit 16bit mode 0 |

One possibility is that the 11 bits are used for bank selection,
with some room for additional context such as altering the registers
used for the 16 bit operations (bank selection of which scalar regs).
However the downside is that short sequences of Compressed instructions
become penalised by the fixed overhead.  Even a single 16 bit instruction
requires a 16 bit overhead to "gain access" to 16 bit "mode", making
the exercise pointless.
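
The fixed-overhead argument is simple arithmetic, sketched below. It assumes, purely for illustration, that every Compressed instruction replaces exactly one 32-bit instruction; the function names are hypothetical.

```python
def compressed_bits(n_insns, entry_overhead_bits=16):
    """Bits used by a run of n 16-bit insns plus the mode-entry halfword."""
    return entry_overhead_bits + 16 * n_insns


def saving_bits(n_insns):
    """Bits saved against encoding the same run as 32-bit instructions."""
    return 32 * n_insns - compressed_bits(n_insns)


for n in (1, 2, 4, 8):
    print(n, saving_bits(n))
```

Under that assumption a run of one instruction saves nothing at all (16 + 16 = 32 bits), and the saving only grows by 16 bits per additional instruction, which is why short runs are penalised.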

An alternative is to use the first 11 bits for only the most commonly
used instructions.  That being the case then one of those 11 bits could
be dedicated to saying if 16 bit mode is to be continued, at which
point *all* 16 bits can be used for Compressed.  10 bits remain for
actual opcodes, which is ridiculously tight, however the opportunity to
subsequently use all 16 bits is worth it.

The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:

    |0 1 2 3 4 5 6 7 8 9 a b c d e f|
    |major op..0| LO Half C space   |
    |major op..1| HI Half C space   |
    |N N N N N|<--11 bits C space-->|

If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this
saves gates at a critical part of the decode phase.
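
A minimal sketch of why the pair helps, using hypothetical helper names: an even/odd pair of Major Opcodes shares its top 5 bits, so only NNNNN needs comparing and the sixth bit simply becomes the top bit of an 11-bit C space.

```python
def shared_prefix(major_even, major_odd):
    """Return the common NNNNN bits of an even/odd contiguous opcode pair."""
    if major_even % 2 or major_odd != major_even + 1:
        raise ValueError("expected an even/odd contiguous pair")
    return major_even >> 1


def split_halfword(halfword):
    """Split a 16-bit halfword into (NNNNN, 11-bit C space)."""
    return halfword >> 11, halfword & 0x7FF


EXT000, EXT001 = 0b000000, 0b000001
print(shared_prefix(EXT000, EXT001))   # 0: one 5-bit compare covers both

nnnnn, c_space = split_halfword(0x0400)
print(nnnnn, bool(c_space & 0x400))    # 0 True: EXT001, the HI half
```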

## ABI considerations

Unlike RISC-V RVC, the above "context" encodings require state, to be stored
in the PCR, MSR, or a dedicated SPR.  These bits (just like LE/BE 32bit
mode and the IEEE754 FPSCR mode) all require taking that context into
consideration.

In particular it is critically important to recognise that context (in
general) is an implicit part of the ABI implemented for example by glibc.
Therefore (in specific) Compressed Mode Context **must not** be permitted
to cross into or out of a function call.

Thus it is the mandatory responsibility of the compiler to ensure that
context returns to "v3.0B Standard" prior to entering a function call
(responsibility of caller) and prior to exit from a function call
(responsibility of callee).

Trap Handlers also take responsibility for saving and restoring of
Compressed Mode state, just as they already take responsibility for
other critical state.  This makes traps transparent to functions as
far as Compressed Mode Context is concerned, just as traps are already
transparent to functions.

Note however that there are exceptions in a compiler to the otherwise
hard rule that Compressed Mode context not be permitted to cross function
boundaries: inline functions and static functions.  static functions,
if correctly identified as never to be called externally, may, as an
optimisation, disregard standard ABIs, bearing in mind that this will
be fraught (pointers to functions) and not easy to get right.

# Opcode Allocation Ideas

* one bit from the 16-bit mode is used to indicate that standard
  (v3.0B) mode is to be dropped into for only one single instruction
  <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>

## Opcodes exploration (Attempt 1)

Switching between different encoding modes is controlled by M (alone)
in 10-bit mode, and M and N in 16-bit mode.

* M in 10-bit mode if zero indicates that following instructions are
  standard OpenPOWER ISA 32-bit encoded (including, redundantly,
  further 10/16-bit instructions)
* M in 10-bit mode if 1 indicates that following instructions are
  in 16-bit encoding mode

Once in 16-bit mode (values are written N followed by M):

* 0b01 (N=0, M=1): stay in 16-bit mode
* 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
* 0b10: leave 16-bit mode for one instruction (return to standard
  OpenPOWER ISA, then come back)
* 0b11: free to be used for something completely different.

The current "top" idea for 0b11 is to use it for a new encoding format
of predominantly "immediates-based" 16-bit instructions (branch-conditional,
addi, mulli etc.)

* The Compressed Major Opcode is in bits 5-7.
* Minor opcode in bit 8.
* In some cases bit 9 is taken as an additional sub-opcode, followed
  by bits 0-4 (for CR operations)
* M+N mode-switching is not available for C-Major.minor 0b001.1
* 10 bit mode may be expanded by 16 bit mode, adding capabilities
  that do not fit in the extremely limited space.

Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit.
16-bit immediate mode remains in 16-bit.

    | 0 | 1234 | 567  8 | 9abcde | f | explanation
    | - | ---- | ------ | ------ | - | -----------
    | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
    | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
    | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
    | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
    | 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
    | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

Notes:

* Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
* EXT000 and EXT001 are v3.0B Major Opcodes.  The first 5 bits
  are zero, therefore the 6th bit is actually part of Cmaj.
* "10bit then 16bit" means "this instruction is encoded C 10bit
  and the following one in C 16bit"
* "16b, 1x v3.0B, 16b" means, "this instruction is encoded C 16bit,
  the following one is V3.0B Standard, and the one after that is
  back to 16bit".
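
The mode-switching table above can be read as a small state machine. The Python below is a sketch of that reading only, not a reference decoder: the `Mode` names and helper functions are invented for illustration, and the open question of a 10-bit Compressed instruction appearing in the "1x v3.0B" slot is deliberately left undecided.

```python
from enum import Enum


class Mode(Enum):
    V30B = "v3.0B"          # standard 32-bit encoding
    V30B_ONCE = "1x v3.0B"  # one standard instruction, then back to C16
    C16 = "C 16bit"         # full 16-bit Compressed encoding


def bit(halfword, n):
    """Bit n of a 16-bit halfword in MSB0 numbering (bit 0 is the MSB)."""
    return (halfword >> (15 - n)) & 1


def is_c10(halfword):
    """True when bits 0-4 are zero, i.e. the Major Opcode is EXT000/EXT001."""
    return (halfword >> 11) == 0


def after_c10(halfword):
    """Mode following a 10-bit Compressed instruction: M (bit f) decides."""
    return Mode.C16 if bit(halfword, 15) else Mode.V30B


def after_c16(halfword):
    """Mode following a 16-bit Compressed instruction: N (bit 0), M (bit f)."""
    n, m = bit(halfword, 0), bit(halfword, 15)
    if m:
        return Mode.C16  # N=0: plain 16bit, N=1: 16bit immediate form
    return Mode.V30B_ONCE if n else Mode.V30B


def after_v30b(mode):
    """Mode following one standard 32-bit instruction."""
    return Mode.C16 if mode is Mode.V30B_ONCE else Mode.V30B


print(after_c10(0x0001))             # Mode.C16: "10bit then 16bit"
print(after_c16(0x8000))             # Mode.V30B_ONCE: "16b, 1x v3.0B, 16b"
print(after_v30b(Mode.V30B_ONCE))    # Mode.C16: back to 16bit afterwards
```

Keeping "1x v3.0B" as its own state makes the one-instruction excursion explicit: the decoder needs to remember only which of three states it is in.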

### C Instruction Encoding types

10-bit Opcode formats (all start with v3.0B EXT000 or EXT001
Major Opcodes)

    | 01234    | 567  8 | 9  | a b | c  | d e | f | enc
    | E01      | Cmaj.m | fld1     | fld2     | M | 10b
    | E01      | Cmaj.m | offset              | M | 10b b
    | E01      | 001.1  | S1 | fd1 | S2 | fd2 | M | 10b sub
    | E01      | 111.m  | fld1     | fld2     | M | 10b LDST

16-bit Opcode formats (including 10/16/v3.0B Switching)

    | 0 | 1234 | 567  8 | 9  | a b | c  | d e | f | enc
    | N | immf | Cmaj.m | fld1     | fld2     | M | 16b
    | 1 | immf | Cmaj.m | fld1     | imm      | 1 | 16b imm
    | fd3      | 001.1  | S1 | fd1 | S2 | fd2 | M | 16b sub
    | N | fd4  | 111.m  | fld1     | fld2     | M | 16b LDST

Notes:

* fld1 and fld2 can contain reg numbers, immediates, or opcode
  fields (BO, BI, LK)
* S1 and S2 are further sub-selectors of C 001.1
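
As a sketch of the generic "16b" layout above, the fields can be pulled out of a halfword with shifts and masks. The `CFields` name and its attribute names are assumptions made for this example; in the 10-bit formats `n` and `top4` are simply the zero bits of the EXT000/EXT001 prefix.

```python
from collections import namedtuple

CFields = namedtuple("CFields", "n top4 cmaj minor fld1 fld2 m")


def decode_fields(halfword):
    """Split a 16-bit halfword into the generic 16b fields (MSB0 numbering)."""
    return CFields(
        n=(halfword >> 15) & 0x1,      # bit 0
        top4=(halfword >> 11) & 0xF,   # bits 1-4 (immf / flds)
        cmaj=(halfword >> 8) & 0x7,    # bits 5-7
        minor=(halfword >> 7) & 0x1,   # bit 8
        fld1=(halfword >> 4) & 0x7,    # bits 9-b
        fld2=(halfword >> 1) & 0x7,    # bits c-e
        m=halfword & 0x1,              # bit f
    )


hw = (1 << 15) | (0b1010 << 11) | (0b010 << 8) | (1 << 7) | (0b011 << 4) | (0b101 << 1) | 1
print(decode_fields(hw))
```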

### Immediate Opcodes

Only available in 16-bit mode, and only when M=1 and N=1
and when Cmaj.min is not 0b001.1.

Instruction counts from objdump on /bin/bash:

      466 extsw r1,r1
      649 stw r1,1(r1)
      691 lwz r1,1(r1)
      705 cmpdi r1,1
      791 cmpwi r1,1
      794 addis r1,r1,1
     1474 std r1,1(r1)
     1846 li r1,1
     2031 mr r1,r1
     2473 addi r1,r1,1
     3012 nop
     3028 ld r1,1(r1)

    | 0 | 1  | 2 | 3 4 | | 567.8 | 9ab  | cde | f |
    | 1 | 0  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
    | 1 | 0  |  sh2    | | 001.0 | RA   | sh  | 1 | sradi.
    | 1 | 1  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
    | 1 | 1  | 0 | sh2 | | 001.0 | RA   | sh  | 1 | srawi.
    | 1 | 1  | 1 |     | | 001.0 | 000  | imm | 1 | TBD
    | 1 | 1  | 1 | i2  | | 001.0 | RA!=0| imm | 1 | addis
    | 1 |              | | 010.0 | 000  |     | 1 | TBD
    | 1 | i2           | | 010.0 | RA!=0| imm | 1 | addi
    | 1 | 0  | i2      | | 010.1 | RA   | imm | 1 | cmpdi
    | 1 | 1  | i2      | | 010.1 | RA   | imm | 1 | cmpwi
    | 1 | 0  | i2      | | 011.0 | RT   | imm | 1 | ldspi
    | 1 | 1  | i2      | | 011.0 | RT   | imm | 1 | lwspi
    | 1 | 0  | i2      | | 011.1 | RT   | imm | 1 | stwspi
    | 1 | 1  | i2      | | 011.1 | RT   | imm | 1 | stdspi
    | 1 | i2 | RA      | | 100.0 | RT   | imm | 1 | stwi
    | 1 | i2 | RA      | | 100.1 | RT   | imm | 1 | stdi
    | 1 | i2 | RT      | | 101.0 | RA   | imm | 1 | ldi
    | 1 | i2 | RT      | | 101.1 | RA   | imm | 1 | lwi
    | 1 | i2 | RA      | | 110.0 | RT   | imm | 1 | fsti
    | 1 | i2 | RA      | | 110.1 | RT   | imm | 1 | fstdi
    | 1 | i2 | RT      | | 111.0 | RA   | imm | 1 | flwi
    | 1 | i2 | RT      | | 111.1 | RA   | imm | 1 | fldi

Construction of immediate:

* LD/ST r1 (SP) variants should be offset by -256
  see <https://bugs.libre-soc.org/show_bug.cgi?id=238#c43>
  - SP variants map to e.g. ld RT, imm(r1)
  - SV Prefixing can be used to map r1 to alternate regs
* addis here is not the same as v3.0B addis: the shift amount is smaller
  and actually still maps to within the v3.0B addi immediate range.
* addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
* addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1016 in
  increments of 8
* all others are EXTS(i2||imm) to give a 7-bit range -128 to +127
  (further for LD/ST due to word/dword-alignment)
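
The EXTS(a||b) notation in the list above is concatenation followed by sign extension. A minimal sketch, with hypothetical helper names and with the field widths passed in as parameters rather than read from the table:

```python
def concat(*fields):
    """Concatenate (value, width) pairs, first pair most significant."""
    value = width = 0
    for v, w in fields:
        value = (value << w) | (v & ((1 << w) - 1))
        width += w
    return value, width


def exts(value, width):
    """Sign-extend a `width`-bit field to a Python int."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


# A 4-bit immediate built from a 1-bit and a 3-bit field.
print(exts(*concat((0b1, 1), (0b000, 3))))   # -8
print(exts(*concat((0b0, 1), (0b111, 3))))   # 7

# An 11-bit immediate whose low three bits are always zero.
print(exts(*concat((0x80, 8), (0, 3))))      # -1024
print(exts(*concat((0x7F, 8), (0, 3))))      # 1016
```

The last two lines show why an 11-bit value with three zero LSBs moves in steps of 8 and tops out at +1016.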

Further Notes:

* bc also has an immediate mode, listed separately below in Branch section
* for LD/ST, offset is aligned.  8-byte: i2||imm||0b000, 4-byte: i2||imm||0b00
* SV Prefix over-rides help provide alternative bitwidths for LD/ST
* RA|0 if RA is zero, addi. becomes "li"
  - this only works if RT takes part of opcode
  - mv is also possible by specifying an immediate of zero

### Illegal, nop and attn

Note that illeg is all zeros, including in the 16-bit mode.
Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and
EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit
run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000
is "nop".

    | 16-bit mode | | 10-bit mode                 |
    | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
    | - | - | --- | | -----  | ----- | ------ | - |
    | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | illeg
    | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop

16 bit mode only:

    | - | - | --- | | -----  | ----- | ------ | - |
    | 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | nop
    | 1 | 1   000 | | 000.0  | 0  00 | 0   00 | 0 | attn
    | 1 | nonzero | | 000.0  | 0  00 | 0   00 | 0 | TBD

Notes:

* All-zeros being an illegal instruction is normal for ISAs.  Ensuring that
  this remains true at all times i.e. for both 10 bit and 16 bit mode is
  common sense.
* The 10-bit nop (bit 15, M=1) is intended for circumstances
  where alignment to 32-bit before returning to v3.0B is required.
  M=1 being an indication "return to Standard v3.0B Encoding Mode".
* The 16-bit nop (bit 0, N=1) is intended for circumstances where a
  return to Standard v3.0B Encoding is required for one cycle,
  and where that one cycle needs alignment to a 32-bit boundary.
  Examples of this would be to return to "strict" (non-C) mode
  where the PC may not be on a non-word-aligned boundary.
* If for any reason multiple 16 bit nops are needed in succession
  the M=1 variant can be used, because each one returns to
  Standard v3.0B Encoding Mode, each time.

In essence the 2 nops are needed due to there being 2 different C forms:
10 and 16 bit.

### Branch

TODO: document that branching whilst using mode-switching bits (M/N) is
perfectly well permitted, but that it is specifically and wholly the
compiler/assembler writer's responsibility to obey ABI rules and to
ensure that, even with branches and returns, at no time is an incorrect
mode entered or left that could result in any instruction being
misinterpreted.

    | 16-bit mode | | 10-bit mode                 |
    | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
    | - | - | --- | | -----  | ----- | ------ | - |
    | N | offs2   | | 000.LK | offs!=0        | M | b, bl
    | N |         | | 000.1  | 0  00 | 0   00 | M | TBD
    | 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl
    | N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl

16 bit mode:

* bc only available when N,M=0b11
* offs2 extends offset in MSBs
* BI3 extends BI in MSBs to allow selection of full CR
* BO3 extends BO
* bc offset constructed from oo as LSBs and offs2 as MSBs
* bc BI allows selection of all bits from CR0 or CR1
* bc CR check is always active (as if BO0=1) therefore BO1 inverts

10 bit mode:

* illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
* nop also covers part of branch (offs=0,M=0,LK=1)
* bc **not available** in 10-bit mode
* BO[0] enables CR check, BO[1] inverts check
* BI refers to CR0 only (4 bits of)
* no Branch Conditional with immediate
* no Absolute Address
* CTR mode allowed with BO[2] for b only.
* offs is signed and 2-byte aligned
* all branches are to 2-byte aligned addresses
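
A sketch of how a `b` target might be computed from the fields above. Several things here are assumptions rather than anything the tables state: that the offset is relative to the address of the branch itself (as for v3.0B `b` with AA=0), that offs2||offs is sign-extended as a whole, and the function names.

```python
def exts(value, width):
    """Sign-extend a `width`-bit field to a Python int."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def b_target(pc, halfword, mode16):
    """Target of a Compressed b/bl, in bytes (assumed PC-relative)."""
    offs = (halfword >> 1) & 0x3F          # bits 9-e: 6-bit offs
    width = 6
    if mode16:
        offs2 = (halfword >> 11) & 0xF     # bits 1-4: offs2 as MSBs
        offs |= offs2 << 6
        width = 10
    return pc + 2 * exts(offs, width)      # offsets count 2-byte units


print(hex(b_target(0x2000, 3 << 1, mode16=False)))       # 0x2006
print(hex(b_target(0x2000, 0x3F << 1, mode16=False)))    # 0x1ffe
print(hex(b_target(0x2000, 0b1000 << 11, mode16=True)))  # 0x1c00
```

Under these assumptions the 10-bit form reaches -64 to +62 bytes and the 16-bit form -1024 to +1022 bytes.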

### LD/ST

Note: for 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)

    | 16-bit mode    | | 10-bit mode             |
    | 0   | 1  | 234 | | 567.8 | 9 a b | c d e | f |
    | --- | -- | --- | | ----- | ----- | ----- | - |
    | RA2 | SZ |  RB | | 001.1 | 1  RA | 0  RT | M | st
    | RA2 | SZ |  RB | | 001.1 | 1  RA | 1  RT | M | fst
    | N   | SZ |  RT | | 111.0 |  RA   |  RB   | M | ld
    | N   | SZ |  RT | | 111.1 |  RA   |  RB   | M | fld

* elwidth overrides can set different widths

16 bit mode:

* SZ=1 is 64 bit, SZ=0 is 32 bit
* RA2 extends RA to 3 bits (MSB)
* RT2 extends RT to 3 bits (MSB)

10 bit mode:

* RA and RB are only 2 bit (0-3)
* for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
* for ST, there is no offset: "st RT, RA(0)"

### Arithmetic

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

    | 16-bit mode | | 10-bit mode             |
    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 0 | RT  | | 010.0 | RB  | RA!=0 | M | add
    | N | 0 | RT  | | 010.1 | RB  | RA|0  | M | sub.
    | N | 0 | BF  | | 011.0 | RB  | RA|0  | M | cmpl

Notes:

* sub. and cmpl: default CR target is CR0
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that sub. becomes neg. and cmp becomes cmpi against zero
* RT is implicitly RB: "add RT(=RB), RA, RB"
* Opcode 0b010.0 RA=0 is not missing from the above:
  it is a system-wide instruction, "cbank" (section below)

16 bit mode only:

    | 0 | 1 | 234 | | 567.8 | 9ab | cde   | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 1 | RA  | | 010.0 | RB  | RS    | M | sld.
    | N | 1 | RA  | | 010.1 | RB  | RS!=0 | M | srd.
    | N | 1 | RA  | | 010.1 | RB  | 000   | M | srad.
    | N | 1 | BF  | | 011.0 | RB  | RA|0  | M | cmpw

Notes:

* for srad, RS=RA: "srad. RA(=RS), RS, RB"

### Logical

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

    | 16-bit mode | | 10-bit mode             |
    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 0 |  RT | | 100.0 | RB  | RA!=0 | M | and
    | N | 0 |  RT | | 100.1 | RB  | RA!=0 | M | nand
    | N | 0 |  RT | | 101.0 | RB  | RA!=0 | M | or
    | N | 0 |  RT | | 101.1 | RB  | RA!=0 | M | nor/mr
    | N | 0 |  RT | | 100.0 | RB  | 0 0 0 | M | extsw
    | N | 0 |  RT | | 100.1 | RB  | 0 0 0 | M | cntlz
    | N | 0 |  RT | | 101.0 | RB  | 0 0 0 | M | popcnt
    | N | 0 |  RT | | 101.1 | RB  | 0 0 0 | M | not

16-bit mode only (note that bit 1 == 1):

    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 1 |  RT | | 100.0 | RB  | RA!=0 | M | TBD
    | N | 1 |  RT | | 100.1 | RB  | RA!=0 | M | TBD
    | N | 1 |  RT | | 101.0 | RB  | RA!=0 | M | xor
    | N | 1 |  RT | | 101.1 | RB  | RA!=0 | M | eqv (xnor)
    | N | 1 |  RT | | 100.0 | RB  | 0 0 0 | M | extsb
    | N | 1 |  RT | | 100.1 | RB  | 0 0 0 | M | cnttz
    | N | 1 |  RT | | 101.0 | RB  | 0 0 0 | M | TBD
    | N | 1 |  RT | | 101.1 | RB  | 0 0 0 | M | extsh

10 bit mode:

* idea: for 10bit mode, nor is actually 'mr' because mr is
  a more common operation.  In 16bit however, this encoding
  (Cmaj.min=0b101.1, N=0) is 'nor'
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that nor becomes not
* cntlz, popcnt, exts **not available** in 10-bit mode
* RT is implicitly RB: "and RT(=RB), RA, RB"

### Floating Point

Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

    | 16-bit mode | | 10-bit mode             |
    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N |   |  RT | | 011.1 | RB  | RA!=0 | M | fsub.
    | N | 0 |  RT | | 110.0 | RB  | RA!=0 | M | fadd
    | N | 0 |  RT | | 110.1 | RB  | RA!=0 | M | fmul
    | N | 0 |  RT | | 011.1 | RB  | 0 0 0 | M | fneg.
    | N | 0 |     | | 110.0 |     | 0 0 0 | M | TBD
    | N | 0 |     | | 110.1 |     | 0 0 0 | M | TBD

16-bit mode only (note that bit 1 == 1):

    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 1 |     | | 011.1 |     | RA!=0 | M | TBD
    | N | 1 |     | | 110.0 |     | RA!=0 | M | TBD
    | N | 1 |  RT | | 110.1 | RB  | RA!=0 | M | fdiv
    | N | 1 |  RT | | 011.1 | RB  | 0 0 0 | M | fabs.
    | N | 1 |  RT | | 110.0 | RB  | 0 0 0 | M | fmr.
    | N | 1 |     | | 110.1 |     | 0 0 0 | M | TBD

16 bit only, FP to INT convert (using C 0b001.1 subencoding)

    | 0123 | 4 | | 567.8 | 9 ab | cde  | f |
    | ---- | - | | ----- | ---- | ---- | - |
    | 0010 | X | | 001.1 | 0 RA | Y RT | M | fp2int
    | 0011 | X | | 001.1 | 0 RA | Y RT | M | int2fp

* X: signed=1, unsigned=0
* Y: FP32=0, FP64=1

10 bit mode:

* fsub. fneg. and fmr. default target is CR1
* fmr. is **not available** in 10-bit mode
* fdiv is **not available** in 10-bit mode

16 bit mode:

* fmr. copies RB to RT (and sets CR1)

### Condition Register

10-bit or 16 bit:

    | 16-bit mode| | 10-bit mode            |
    | 0 | 123 | 4   | | 567.8 | 9 ab | cde | f |
    | - | --- | --- | | ----- | ---- | --- | - |
    | N | 000 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf

16-bit only:

    | 0 | 1234 | | 567.8 | 9 ab | cde | f |
    | - | ---- | | ----- | ---- | --- | - |
    | N | 0010 | | 001.1 | 0 BA | BB  | M | crnor
    | N | 0011 | | 001.1 | 0 BA | BB  | M | crandc
    | N | 0100 | | 001.1 | 0 BA | BB  | M | crxor
    | N | 0101 | | 001.1 | 0 BA | BB  | M | crnand
    | N | 0110 | | 001.1 | 0 BA | BB  | M | crand
    | N | 0111 | | 001.1 | 0 BA | BB  | M | creqv
    | N | 1000 | | 001.1 | 0 BA | BB  | M | crorc
    | N | 1001 | | 001.1 | 0 BA | BB  | M | cror

Notes:

10 bit mode:

* mcrf BF is only 2 bits which means the destination is only CR0-CR3
* CR operations: **not available** in 10-bit mode (but mcrf is)

16 bit mode:

* mcrf BF2 extends BF (in MSB) to 3 bits
* CR operations: destination register is same as BA.
* CR operations: only possible on CR0 and CR1

SV (Vector Mode):

* CR operations: greatly extended reach/range (useful for predicates)

### System

cbank: Selection of Compressed-encoding "Bank".  Different "banks"
give different meanings to opcodes.  Example: CBank=0b001 is heavily
optimised to Audio/Video Encode/Decode.  cbank borrows from add's encoding
space (when RA==0)

    | 16-bit mode | | 10-bit mode             |
    | 0 | 1 2 3 4 | | 567.8 | 9ab   | cde | f |
    | - | ------- | | ----- | ----- | --- | - |
    | N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank

**not available** in 10-bit mode, **only** in 16-bit mode:

    | 0 | 1234 | | 567.8 | 9 ab | cde  | f |
    | - | ---- | | ----- | ---- | ---- | - |
    | N | 1110 | | 001.1 | 0 00 |  RT  | M | mtlr
    | N | 1110 | | 001.1 | 0 01 |  RT  | M | mtctr
    | N | 1110 | | 001.1 | 0 11 |  RT  | M | mtcr
    | N | 1111 | | 001.1 | 0 00 |  RA  | M | mflr
    | N | 1111 | | 001.1 | 0 01 |  RA  | M | mfctr
    | N | 1111 | | 001.1 | 0 11 |  RA  | M | mfcr

### Unallocated

    | 0 | 123 | 4 | | 567.8 | 9 ab | cde  | f |
    | - | --- | - | | ----- | ---- | ---- | - |
    | N | 101 |   | | 001.1 | 0    |      | M |
    | N | 101 |   | | 001.1 | 0    |      | M |
    | N | 110 |   | | 001.1 | 0    |      | M |
    | N | 111 |   | | 001.1 | 0 10 |      | M |

## Other ideas (Attempt 2)

### 8-bit mode-switching instructions, odd addresses for C mode

Drop the complexity of a 16-bit encoding that is further reduced to
10-bit, and use a single byte instead of two to switch between modes.  This
would place compressed (C) mode instructions at odd bytes, so the LSB
of the PC can be used for the processor to tell which mode it is in.

To switch from traditional to compressed mode, the single-byte
instruction would be at the MSByte, that holds the EXT bits.  (When we
break up a 32-bit instruction across words, the most significant half
should go in the word with the lower address.)

To switch from compressed mode to traditional mode, the single-byte
instruction would also be at the opcode/format portion, placed in the
lower-address word if split across words, so that the instruction can
be recognized as the mode-switching one without going for its second
byte.

The C-mode nop should be encoded so that its second byte encodes a
switch to compressed mode, if decoded in traditional mode.  This
enables such a nop to straddle across a label:

    8-bit first half of nop
    Label:
    8-bit second half of nop AKA switch to compressed mode
    16-bit insns...

so that if traditional code jumps to the word-aligned label (because
traditional branches drop the 2 LSB), it immediately switches to
compressed mode; if we fall-through, we remain in 16-bit mode; and if
we branch to it from compressed mode, whether we jump to the odd or
the even address, we end up in compressed mode as desired.

Tables explaining encoding:

    | byte 0 | byte 1 | byte 2 | byte 3 |
    | v3.0B standard 32 bit instruction |
    | EXT000 | 16 bit          | 16...  |
    | .. bit | 8nop   | v3.0B stand...  |
    | .. ard 32 bit   | EXT000 | 16...  |
    | .. bit | 16 bit          | 8nop   |
    | v3.0B standard 32 bit instruction |
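
The address arithmetic behind this idea can be sketched as follows. The function names and the example addresses are made up for illustration, and whether the LSB of the PC really is sufficient remains an open question for this scheme.

```python
def mode_from_pc(pc):
    """Compressed insns sit at odd byte addresses, traditional at even."""
    return "compressed" if pc & 1 else "traditional"


def traditional_branch_target(addr):
    """Traditional branches drop the 2 LSBs, so they land word-aligned."""
    return addr & ~0b11


label = 0x1000                          # word-aligned label inside the nop
landing = traditional_branch_target(label + 3)
print(hex(landing), mode_from_pc(landing))   # 0x1000 traditional

# The byte at the label is a one-byte "switch to compressed" instruction,
# so the next instruction starts at an odd address.
print(mode_from_pc(landing + 1))             # compressed
```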

# TODO

* make a preliminary assessment of branch in/out viability
* confirm FSM encoding (is LSB of PC really enough?)
* guesstimate opcode and register allocation (without necessarily doing
  a full encoding)
* write throwaway python program that estimates compression ratio from
  objdump raw parsing
* finally do full opcode allocation
* rerun objdump compression ratio estimates
* check in the FSM, for "return to v3.0B then 16bit", whether it is ok
  for the v3.0B instruction to be a 10bit Compressed one.  Should this be
  ignored and carry on?  Should a trap occur?

### Use 2- rather than 3-register opcodes

Successful compact ISAs have used 2- rather than 3-register insns, in
which the same register serves as input and output.  Some 20% of
general-purpose 3-register insns already use either input register as
output, without any effort by the compiler to do so.

Repurposing the 3 bits used to encode one of the input registers
in arithmetic, logical and floating-point insns, and the 2 bits
used to encode the mode of the next two insns, we could make the full
register files available to the opcodes already selected for
compressed mode, with one bit to spare to bring additional opcodes in.

An opcode could be assigned to an instruction that combines and
extends with the subsequent instruction, providing it with a separate
input operand to use rather than the output register, or with
additional range for immediate and offset operands, effectively
forming a 32-bit operation, enabling us to remain in compressed mode
even longer.
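
A figure such as the 20% quoted above could be re-measured with something like the sketch below. It is a rough illustration only: it looks at textual disassembly, treats only `rN,rN,rN` operand lists as 3-register insns, and the function names are hypothetical.

```python
import re

THREE_REG = re.compile(r"^\S+\s+(r\d+),(r\d+),(r\d+)$")


def reuses_input(insn):
    """True/False for 3-register insns, None for anything else."""
    match = THREE_REG.match(insn.strip())
    if match is None:
        return None
    dest, src1, src2 = match.groups()
    return dest in (src1, src2)


def reuse_fraction(insns):
    """Fraction of 3-register insns whose output is also an input."""
    results = [r for r in map(reuses_input, insns) if r is not None]
    return sum(results) / len(results) if results else 0.0


sample = ["add r3,r3,r4", "add r5,r3,r4", "nop",
          "or r9,r8,r9", "subf r1,r2,r3", "addi r3,r3,1"]
print(reuse_fraction(sample))   # 0.5 for this made-up sample
```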

# Appendix

## Analysis techniques and tools

    objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
      s/^[ x0-9A-Fa-f]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
      sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
      s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
      sort -n | less
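
The same normalisation can be sketched in Python, which may be a more convenient starting point for the throwaway compression-ratio estimator. This is an approximation of the pipeline above rather than a tested replacement; the function names and the sample disassembly lines are invented for the example.

```python
import re
from collections import Counter

# Address, colon, mnemonic, operands (objdump prints addresses in lowercase hex).
INSN = re.compile(r"^[ x0-9A-Fa-f]*: *([a-z.]+) *(.*)")
REG = re.compile(r"([, (])r[1-9][0-9]*")
NUM = re.compile(r"([ ,])-*[0-9]+([^0-9])")


def normalise(line):
    """Map one objdump line to 'mnemonic operands' with r1/1 placeholders."""
    match = INSN.match(line.replace("\t", " "))
    if match is None:
        return None                     # header, label or blank line
    text = f"{match.group(1)} {match.group(2)} "
    text = REG.sub(r"\g<1>r1", text)    # every rN (N >= 1) becomes r1
    text = NUM.sub(r"\g<1>1\g<2>", text)  # every immediate becomes 1
    return text.strip()


def count_insns(lines):
    """Count normalised instructions, like sort | uniq --count."""
    return Counter(n for n in map(normalise, lines) if n is not None)


sample = [
    "/bin/bash:     file format elf64-powerpcle",
    "0000000000012340 <main>:",
    "   12340:\taddi    r1,r1,-32",
    "   12344:\tld      r9,16(r31)",
    "   1234c:\taddi    r3,r4,8",
    "   12350:\tnop",
]
for insn, count in sorted(count_insns(sample).items(), key=lambda kv: kv[1]):
    print(count, insn)
```

To run it on a real binary, the lines would come from `objdump -d --no-show-raw-insn` output, for example read from standard input.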

## gcc register allocation

FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about
fixed registers (assigned to special purposes) and register allocation
order:

Special-purpose registers on ppc are:

    r0: constant zero/throw-away
    r1: stack pointer
    r2: thread-local storage pointer in 32-bit mode
    r2: non-minimal TOC register
    r10: EH return stack adjust register
    r11: static chain pointer
    r13: thread-local storage pointer in 64-bit mode
    r30: minimal-TOC/-fPIC/-fpic base register
    r31: frame pointer
    lr: return address register

The register allocation order in GCC (i.e., it takes the earliest
available register that fits the constraints) is:

    We allocate in the following order:

        fp0             (not saved or used for anything)
        fp13 - fp2      (not saved; incoming fp arg registers)
        fp1             (not saved; return value)
        fp31 - fp14     (saved; order given to save least number)
        cr7, cr5        (not saved or special)
        cr6             (not saved, but used for vector operations)
        cr1             (not saved, but used for FP operations)
        cr0             (not saved, but used for arithmetic operations)
        cr4, cr3, cr2   (saved)
        r9              (not saved; best for TImode)
        r10, r8-r4      (not saved; highest first for less conflict with params)
        r3              (not saved; return value register)
        r11             (not saved; later alloc to help shrink-wrap)
        r0              (not saved; cannot be base reg)
        r31 - r13       (saved; order given to save least number)
        r12             (not saved; if used for DImode or DFmode would use r13)
        ctr             (not saved; when we have the choice ctr is better)
        lr              (saved)
        r1, r2, ap, ca  (fixed)
        v0 - v1         (not saved or used for anything)
        v13 - v3        (not saved; incoming vector arg registers)
        v2              (not saved; incoming vector arg reg; return value)
        v19 - v14       (not saved or used for anything)
        v31 - v20       (saved; order given to save least number)
        vrsave, vscr    (fixed)
        sfp             (fixed)
 720
 721 ## Comparison to VLE
 722
 723 VLE was a means to reduce executable size through three interleaved methods:
 724
 725 * (1) invention of 16 bit encodings (of exactly 16 bits in length)
 726 * (2) invention of 16+16 bit encodings (a 16 bit instruction format but with
 727   an *additional* 16 bit immediate "tacked on" to the end, actually
 728   making a 32-bit instruction format)
 729 * (3) seamless and transparent embedding and intermingling of the
 730   above in amongst arbitrary v2.06/7 BE 32 bit instruction sequences,
 731   with no additional state,
 732   including when the PC was not aligned on a 4-byte boundary.
 733
 734 Whilst (1) and (3) make perfect sense, (2) makes no sense at all given that, as inspection of "ori" and others shows, D-Form 16 bit immediates are the "norm" for v2.06/7 and v3.0B standard instructions.  (2) in effect **is** a 32 bit instruction.  (2) **is not** a 16 bit instruction.
 735
 736 *Why "reinvent" an encoding that is 32 bit, when there already exists a 32 bit encoding that does the exact same job?*
 737
 738 Consequently, we do **not** envisage a scenario where (2) would ever be implemented, nor would this Compressed Encoding be extended beyond 16 bit in the future.  Compressed is Compressed and is **by definition** limited to precisely - and only - 16 bit.
 739
 740 An additional reason is that VLE is exceptionally complex to implement.  VLE was perfectly well matched to the single-issue, low clock rate "Embedded" environment for which it was originally designed.
 741
 742 However, this Compressed Encoding is designed for high performance multi-issue systems *as well* as Embedded scenarios.  Consequently, the complexity of "deep packet inspection" into a 16 bit sequence, in order to ascertain whether it might not be 16 bit after all, is wholly unacceptable.
 743
 744 By eliminating the 16+16 (actually, 32 bit conflation) tricks outlined in (2), Compressed is *specifically* designed to fit into a very small FSM, suitable for multi-issue, that in no way requires "deep-dive" analysis.  Yet, despite OpenPOWER never having been designed with 16 bit encodings in mind, Compressed is still suitable for retro-fitting onto it.
 745
 746 ## Compressed Decoder Phases
 747
 748 Phase 1 (stage 1 of a 2-stage pipelined decoder) is defined as the minimum necessary FSM required to determine instruction length and mode.  It is implemented with the absolute bare minimum of gates and is based on the 6 encodings involving N, M and EXTNNN (see table, below).
 749
 750 Phase 2 (stage 2 of a 2-stage pipelined decoder) is defined as the "full decoder" that includes taking into account the length and mode from Phase 1.  Given a 2-stage pipelined decoder it is categorically **impossible** for Phase 2 to go backwards in time and affect the decisions made in Phase 1.
 751
 752 These two phases are specifically designed to take multi-issue execution into account.  Phase 1 is intended to be part of an O(log N) algorithm that can use a form of carry-lookahead propagation. Phase 2 is intended to be on a 2nd pipelined clock cycle, comprising a separate suite of independent local-state-only parallel pipelines that do not require any inter-communication of any kind.
 753
 754 Table: Reminder of the six 16-bit encodings:
 755
 756     | 0 | 1234   | 567  8 | 9abcde | f | explanation
 757     | - | ------ | ------ | ------ | - | -----------
 758     | EXT000/1   | Cmaj.m | fields | 0 | 10bit then v3.0B
 759     | EXT000/1   | Cmaj.m | fields | 1 | 10bit then 16bit
 760     | 0 | flds   | Cmaj.m | fields | 0 | 16bit then v3.0B
 761     | 0 | flds   | Cmaj.m | fields | 1 | 16bit then 16bit
 762     | 1 | flds   | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
 763     | 1 | flds   | Cmaj.m | fields | 1 | 16b/imm then 16bit
 764
 765 ### Phase 1
 766
 767 The Phase 1 length/mode identification takes into account only 3 pieces of information:
 768
 769 * extc_id: insn[0:4] == EXTNNN (Compressed)
 770 * N: insn[0]
 771 * M: insn[15]
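
As a rough illustration of how these three inputs could be extracted, the following Python sketch pulls them out of a single 16 bit halfword.  It assumes MSB0 bit numbering (bit 0 is the most significant bit, as is conventional for OpenPOWER) and assumes that the Compressed tag is the five bits shared by EXT000 and EXT001, i.e. 0b00000.  The names `field`, `EXTC_TAG` and `phase1_inputs` are hypothetical and are not part of this specification.

```python
def field(insn: int, start: int, end: int, width: int = 16) -> int:
    """Return bits start..end (inclusive) of insn, using MSB0 numbering."""
    nbits = end - start + 1
    shift = (width - 1) - end
    return (insn >> shift) & ((1 << nbits) - 1)


# Assumption: the tag is the top five bits common to
# EXT000 (0b000000) and EXT001 (0b000001).
EXTC_TAG = 0b00000


def phase1_inputs(insn16: int) -> dict:
    """Extract the only three values that Phase 1 looks at."""
    return {
        "extc_id": field(insn16, 0, 4) == EXTC_TAG,
        "N": field(insn16, 0, 0),
        "M": field(insn16, 15, 15),
    }
```

Under these assumptions, `phase1_inputs(0x0581)` reports a Compressed tag with N = 0 and M = 1.  Nothing else in the halfword is inspected, which is what keeps Phase 1 small.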
 772
 773 The Phase 1 length/mode identification produces the following lengths/modes:
 774
 775 * 32 - v3.0B (includes v3.0B followed by 16bit)
 776 * 16 - 10bit
 777 * 16 - 16bit
 778
 779 **NOTE THAT FURTHER SUBIDENTIFICATION OF C MODES IS NOT CARRIED OUT AT PHASE 1**. In particular, note that 16 bit "immediate mode" is **not** part of the Phase 1 FSM, but is specifically isolated to Phase 2.
 780
 781 Pseudocode:
 782
 783     # starting point for FSM
 784     previ.mode = v3.0B
 785
 786     if previ.mode == v3.0B:
 787         # previous was v3.0B, look for compressed tag
 788         if extc_id:
 789             # found it.  move to 10bit mode
 790             nexti.length = 16
 791             nexti.mode = 10bit
 792         else:
 793             # nope. stay in v3.0B
 794             nexti.length = 32
 795             nexti.mode = v3.0B
 796
 797     elif previ.mode == 10bit:
 798         # previous was 10bit, move to v3.0B or 16bit?
 799         if M == 0:
 800             nexti.length = 32
 801             nexti.mode = v3.0B
 802         else:
 803             # otherwise move to 16bit mode
 804             nexti.length = 16
 805             nexti.mode = 16bit
 806
 807     elif previ.mode == 16bit:
 808         # previous was 16bit, stay there or move?
 809         if M == 0:
 810             # back to v3.0B
 811             nexti.length = 32
 812             if N == 1:
 813                 # ... but only for 1 insn
 814                 nexti.mode = v3.0B_then_16bit
 815             else:
 816                 nexti.mode = v3.0B
 817         else:
 818             # otherwise stay in 16bit mode
 819             nexti.length = 16
 820             nexti.mode = 16bit
 821
 822     # rest of FSM involving 3.0B to 16bit
 823     # and back transitions left to implementor
 824     # (or for someone else to add)
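
The pseudocode above can be turned into a small executable model.  The Python below is a minimal sketch, not a reference implementation.  It assumes that N and M are taken from the previous Compressed instruction, and that `extc_id` is tested on the instruction about to be classified.  It also fills in one transition that the pseudocode leaves open: after a `v3.0B_then_16bit` instruction the stream is assumed to return to 16bit mode, following the "16b, 1x v3.0B, 16b" row of the table.  `phase1_next` and `walk` are hypothetical names.

```python
V30B = "v3.0B"
V30B_THEN_16BIT = "v3.0B_then_16bit"
C10BIT = "10bit"
C16BIT = "16bit"


def phase1_next(prev_mode, prev_n, prev_m, extc_id):
    """Return (length, mode) of the next instruction.

    prev_n and prev_m are bits 0 and 15 of the previous Compressed
    instruction; extc_id says whether the next instruction carries
    the Compressed tag.
    """
    if prev_mode == V30B:
        return (16, C10BIT) if extc_id else (32, V30B)
    if prev_mode == C10BIT:
        return (32, V30B) if prev_m == 0 else (16, C16BIT)
    if prev_mode == C16BIT:
        if prev_m == 1:
            return (16, C16BIT)
        return (32, V30B_THEN_16BIT) if prev_n == 1 else (32, V30B)
    if prev_mode == V30B_THEN_16BIT:
        # Assumption (not in the pseudocode): exactly one v3.0B
        # instruction, then back to 16bit mode.
        return (16, C16BIT)
    raise ValueError(f"unknown mode: {prev_mode}")


def walk(halfwords):
    """Classify a stream of 16 bit halfwords, starting in v3.0B mode."""
    out, i, mode, n, m = [], 0, V30B, 0, 0
    while i < len(halfwords):
        hw = halfwords[i]
        # Assumption: the Compressed tag is "top five bits are zero".
        extc_id = (hw >> 11) == 0
        length, mode = phase1_next(mode, n, m, extc_id)
        if length == 16:
            # Remember N and M of this Compressed instruction.
            n, m = (hw >> 15) & 1, hw & 1
        out.append((i, length, mode))
        i += length // 16
    return out
```

Calling `walk` on a list of halfwords yields one `(offset, length, mode)` tuple per instruction.  Only the tag, N and M are consulted, so the model never looks at the remaining instruction bits.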
 825
 826 ### Phase 2: Compressed mode
 827
 828 In this phase the length is already known to be 16bit and the mode to be either 10bit or 16bit, so further analysis is required only to determine whether the 16bit immediate encoding is active, and so on.  This is a fully combinatorial block that **at no time** steps outside of the strict bounds already determined by Phase 1.
 829
 830     op_001_1 = insn[5:8] == 0b001.1
 831     if mode == 10bit:
 832         decode_10bit(insn)
 833     elif mode == 16bit:
 834         if N == 1 and M == 1 and not op_001_1:
 835             # see immediate opcodes table
 836             decode_16bit_immed_mode(insn)
 837         elif op_001_1:
 838             # see CR and System tables
 839             # (16 bit ones at least)
 840             decode_16bit_cr_or_sys(insn)
 841         else:
 842             decode_16bit_nonimmed_mode(insn)
 843
 844 From this point onwards each of the decode_xx functions performs straightforward combinatorial decoding of the 16 bits of "insn".  In some cases this involves further analysis of bit 1; in some cases (Cmaj.m = 0b010.1) even further deep-dive decoding is required (CR ops).  *All* of it is entirely combinatorial and at **no time** involves changing of, or interaction with, or disruption of, the Phase 1 determination of Length+Mode (that has *already taken place* in an earlier decoding pipeline time-schedule).
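
The Phase 2 selection can also be sketched in Python, as a pure function of the 16 bit instruction and the mode already fixed by Phase 1.  This is an illustration under stated assumptions, not the definitive decoder.  It assumes MSB0 bit numbering and assumes that Cmaj.m = 0b001.1 marks the CR and System encodings, with immediate mode applying only to the other Cmaj.m values when N = M = 1.  `field`, `CMAJ_M_CR_SYS` and `phase2_select` are hypothetical names, and the function returns the name of the decoder that would be used rather than decoding anything.

```python
def field(insn: int, start: int, end: int, width: int = 16) -> int:
    """Return bits start..end (inclusive) of insn, using MSB0 numbering."""
    nbits = end - start + 1
    shift = (width - 1) - end
    return (insn >> shift) & ((1 << nbits) - 1)


# Cmaj.m = 0b001.1, packed as the four bits insn[5:8].
CMAJ_M_CR_SYS = 0b0011


def phase2_select(insn16: int, mode: str) -> str:
    """Pick a decoder from the Phase 1 mode plus N, M and Cmaj.m only."""
    if mode == "10bit":
        return "decode_10bit"
    if mode != "16bit":
        raise ValueError(f"not a Compressed mode: {mode}")
    n = field(insn16, 0, 0)
    m = field(insn16, 15, 15)
    is_cr_sys = field(insn16, 5, 8) == CMAJ_M_CR_SYS
    if n == 1 and m == 1 and not is_cr_sys:
        # Assumed reading: immediate mode excludes Cmaj.m = 0b001.1.
        return "decode_16bit_immed_mode"
    if is_cr_sys:
        return "decode_16bit_cr_or_sys"
    return "decode_16bit_nonimmed_mode"
```

The function takes the mode as an input and never returns a new length or mode, mirroring the rule that Phase 2 cannot alter the Phase 1 determination.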
 845
 846 ### Phase 2: v3.0B mode
 847
 848 Standard v3.0B decoders are deployed.  Absolutely no interaction occurs with any 16 bit decoders or state.  Absolutely no interaction with the earlier Phase 1 decoding occurs.  Absolutely no interaction occurs whatsoever (assuming an implementation that does not perform macro-op fusion) between other multi-issued v3.0B instructions being decoded in parallel at this time.
 849
 850 ## Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode
 851
 852 [[demo]]
 853
 854 ### Efficient Decoding Algorithm
 855
 856 [[decoding]]