openpower/sv/16_bit_compressed.mdwn

   1 [[!tag standards]]
   2
   3 # 16 bit Compressed
   4
   5 Similar to VLE (but without immediate-prefixing) this encoding is designed
   6 to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR
   7 is recommended). Note that Compressed is *mutually incompatible*
   8 with OpenPOWER v3.1B "prefixing" due to using (requiring) both EXT000
   9 and EXT001. Hypothetically it could be made to use anything other than
  10 EXT001, with some inconvenience (extra gates).  The incompatibility is
  11 "fixed" by swapping out of "Compressed" Mode and back into "Normal"
  12 (v3.1B) Mode, at runtime, as needed.
  13
  14 Although initially intended to be augmented by Simple-V Prefixing (to
  15 add Vector context, width overrides, e.g. IEEE754 FP16, and predication) yet not put pressure on I-Cache power
  16 or size, this Compressed Encoding is not critically dependent
  17 *on* SV Prefixing, and may be used stand-alone.
  18
  19 See:
  20
  21 * <https://bugs.libre-soc.org/show_bug.cgi?id=238>
  22 * <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding
  23 * <http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html>
  24
  25 This one is a conundrum.  OpenPOWER ISA was never designed with 16
  26 bit in mind.  VLE was added 10 years ago but only by way of marking
  27 an entire 64k page as "VLE".  With VLE not maintained it is not
  28 fully compatible with current PowerISA.
  29
  30 Here, in order to embed 16 bit into a predominantly 32 bit stream,
  31 using an entire 16 bits just to switch into Compressed mode
  32 is itself a significant overhead.  The situation is made worse by
  33 OpenPOWER ISA being fundamentally designed with 6 bits uniformly
  34 taking up Major Opcode space, leaving only 10 bits to allocate
  35 to actual instructions.
  36
  37 Contrast this with RVC which takes 3 out of 4 combinations of the first 2
  38 bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and
  39 uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing
  40 standard 32 bit and 16 bit to intermingle cleanly.  To achieve the same
  41 thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which
  42 is clearly impractical: other schemes need to be devised.
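
For comparison, the RVC length rule described above fits in one line. The following is only an illustrative sketch (the function name is made up here, it is not from any library):

```python
def rvc_is_16bit(halfword):
    """Return True if the first 16-bit parcel of a RISC-V instruction
    marks a compressed (16-bit) instruction.

    The two LSBs select the length: 0b00, 0b01 and 0b10 mean 16-bit,
    0b11 is the escape to the standard 32-bit (or longer) encodings.
    """
    return (halfword & 0b11) != 0b11


# c.nop is 0x0001; the low halfword of "addi x0,x0,0" (0x00000013) is 0x0013
print(rvc_is_16bit(0x0001))  # True
print(rvc_is_16bit(0x0013))  # False
```

OpenPOWER has no equivalent of this test because the length-determining bits are the 6-bit Major Opcode at the other end of the word.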
  43
  44 In addition we would like to add SV-C32 which is a Vectorised version
  45 of 16 bit Compressed, and ideally have a variant that adds the 27-bit
  46 prefix format from SV-P64, as well.
  47
  48 Potential ways to reduce pressure on the 16 bit space are:
  49
  50 * To use more than one v3.0B Major Opcode, preferably an odd-even
  51   contiguous pair
  52 * To provide "paging".  This involves bank-switching to alternative
  53   optimised encodings for specific workloads
  54 * To enter "16 bit mode" for durations specified at the start
  55 * To reserve one bit of every 16 bit instruction to indicate that the
  56   16 bit mode is to continue to be sustained
  57
  58 This latter would be useful in the Vector context to have an alternative
  59 meaning: as the bit which determines whether the instruction is 11-bit
  60 prefixed or 27-bit prefixed:
  61
  62     0 1 2 3 4 5 6 7 8 9 a b c d e f |
  63     |major op | 11 bit vector prefix|
  64     |16 bit opcode  alt vec. mode ^ |
  65     | extra vector prefix if alt set|
  66
  67 Using a major opcode to enter 16 bit mode leaves 11 bits for which
  68 a use must be found:
  69
  70     0 1 2 3 4 5 6 7 8 9 a b c d e f |
  71     |major op | what to do here   1 |
  72     |16 bit    stay in 16bit mode 1 |
  73     |16 bit    stay in 16bit mode 1 |
  74     |16 bit       exit 16bit mode 0 |
  75
  76 One possibility is that the 11 bits are used for bank selection,
  77 with some room for additional context such as altering the registers
  78 used for the 16 bit operations (bank selection of which scalar regs).
  79 However the downside is that short sequences of Compressed instructions
  80 become penalised by the fixed overhead.  Even a single 16 bit instruction
  81 requires a 16 bit overhead to "gain access" to 16 bit "mode", making
  82 the exercise pointless.
  83
  84 An alternative is to use the first 11 bits for only the most commonly
  85 used instructions.  That being the case then one of those 11 bits could
  86 be dedicated to saying if 16 bit mode is to be continued, at which
  87 point *all* 16 bits can be used for Compressed.  10 bits remain for
  88 actual opcodes, which is ridiculously tight, however the opportunity to
  89 subsequently use all 16 bits is worth it.
  90
  91 The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:
  92
  93     |0 1 2 3 4 5 6 7 8 9 a b c d e f|
  94     |major op..0| LO Half C space   |
  95     |major op..1| HI Half C space   |
  96     |N N N N N|<--11 bits C space-->|
  97
  98 If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this
  99 saves gates at a critical part of the decode phase.
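
A minimal sketch of that check, assuming the pair chosen is EXT000/EXT001 (so NNNNN is all zeros) and MSB0 bit numbering as in the diagrams. The helper names `field` and `split_compressed` are hypothetical:

```python
def field(hw, first, last):
    """Extract bits first..last (inclusive, MSB0 numbering) of a 16-bit value."""
    width = last - first + 1
    return (hw >> (15 - last)) & ((1 << width) - 1)


def split_compressed(hw):
    """Return the 11-bit C space if hw starts with EXT000 or EXT001, else None.

    Assumption: with an even/odd pair of Major Opcodes only the top 5 bits
    (NNNNN) need comparing; bit 5 selects the LO or HI half of the C space.
    """
    if field(hw, 0, 4) != 0b00000:
        return None           # some other v3.0B Major Opcode
    return field(hw, 5, 15)   # 11 bits: LO half (bit 5 = 0) or HI half (bit 5 = 1)
```

A single 5-bit comparison covers both Major Opcodes, which is where the saving in gates comes from.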
 100
 101 ## ABI considerations
 102
 103 Unlike RISC-V RVC, the above "context" encodings require state, to be stored
 104 in the PCR, MSR, or a dedicated SPR.  These bits (just like LE/BE 32bit
 105 mode and the IEEE754 FPSCR mode) all require taking that context into
 106 consideration.
 107
 108 In particular it is critically important to recognise that context (in
 109 general) is an implicit part of the ABI implemented for example by glibc6.
 110 Therefore (in specific) Compressed Mode Context **must not** be permitted
 111 to cross into or out of a function call.
 112
 113 Thus it is the mandatory responsibility of the compiler to ensure that
 114 context returns to "v3.0B Standard" prior to entering a function call
 115 (responsibility of caller) and prior to exit from a function call
 116 (responsibility of callee) by setting appropriate M and N bits.
 117
 118 If however it is known to the compiler that certain static leaf node functions and their immediate callers will never, under any circumstances, be called by external ABI compliant code, then of course the compiler may choose to write such static functions as it sees fit.
 119
 120 Trap Handlers also take responsibility for saving and restoring of
 121 Compressed Mode state, just as they already take responsibility for
 122 other critical state.  This makes traps transparent to functions as
 123 far as Compressed Mode Context is concerned, just as traps are already
 124 transparent to functions.
 125
 126 Note however that there are exceptions in a compiler to the otherwise
 127 hard rule that Compressed Mode context not be permitted to cross function
 128 boundaries: inline functions and static functions.  static functions,
 129 if correctly identified as never to be called externally, may, as an
 130 optimisation, disregard standard ABIs, bearing in mind that this will
 131 be fraught (pointers to functions) and not easy to get right.
 132
 133 # Opcode Allocation Ideas
 134
 135 * one bit from the 16-bit mode is used to indicate that standard
 136   (v3.0B) mode is to be dropped into for only one single instruction
 137   <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>
 138
 139 ## Opcodes exploration (Attempt 1)
 140
 141 Switching between different encoding modes is controlled by M (alone)
 142 in 10-bit mode, and M and N in 16-bit mode.
 143
 144 * M in 10-bit mode if zero indicates that following instructions are
 145   standard OpenPOWER ISA 32-bit encoded (including, redundantly,
 146   further 10/16-bit instructions)
 147 * M in 10-bit mode if 1 indicates that following instructions are
 148   in 16-bit encoding mode
 149
 150 Once in 16-bit mode:
 151
 152 * 0b01 (M=1, N=0): stay in 16-bit mode
 153 * 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
 154 * 0b10: leave 16-bit mode for one cycle (return to standard OpenPOWER ISA)
 155 * 0b11: free to be used for something completely different.
 156
 157 The current "top" idea for 0b11 is to use it for a new encoding format
 158 of predominantly "immediates-based" 16-bit instructions (branch-conditional,
 159 addi, mulli etc.)
 160
 161 * The Compressed Major Opcode is in bits 5-7.
 162 * Minor opcode in bit 8.
 163 * In some cases bit 9 is taken as an additional sub-opcode, followed
 164   by bits 0-4 (for CR operations)
 165 * M+N mode-switching is not available for C-Major.minor 0b001.1
 166 * 10 bit mode may be expanded by 16 bit mode, adding capabilities
 167   that do not fit in the extremely limited space.
 168
 169 Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit.
 170 16-bit immediate mode remains in 16-bit.
 171
 172     | 0 | 1234 | 567  8 | 9abcde | f | explanation
 173     | - | ---- | ------ | ------ | - | -----------
 174     | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
 175     | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
 176     | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
 177     | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
 178     | 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
 179     | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit
 180
 181 Notes:
 182
 183 * Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
 184 * EXT000 and EXT001 are v3.0B Major Opcodes.  The first 5 bits
 185   are zero, therefore the 6th bit is actually part of Cmaj.
 186 * "10bit then 16bit" means "this instruction is encoded C 10bit
 187   and the following one in C 16bit"
 188 * "16b, 1x v3.0B, 16b" means, "this instruction is encoded C 16bit,
 189   the following one is V3.0B Standard, and the one after that is
 190   back to 16bit".
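
The FSM table above can be sketched as a small function. This is only an illustration of the table as written: the names are hypothetical, and it ignores any further switching done by the instructions that follow.

```python
V30B, C16 = "v3.0B", "C16"


def next_modes(hw, in_16bit_mode):
    """Return (mode of the next insn, mode of the one after that).

    hw is the current 16-bit Compressed instruction.  N is bit 0 and M is
    bit 15 (0xf) in MSB0 numbering.  N only exists in 16-bit mode: in
    10-bit mode bits 0-4 are taken up by EXT000/EXT001.
    """
    m = hw & 1
    n = (hw >> 15) & 1
    if not in_16bit_mode:        # current insn is C 10bit: M alone decides
        return (C16, C16) if m else (V30B, V30B)
    if n == 0:                   # 0b00 leave, 0b01 stay
        return (C16, C16) if m else (V30B, V30B)
    if m == 0:                   # 0b10: 16b, 1x v3.0B, 16b
        return (V30B, C16)
    return (C16, C16)            # 0b11: 16b/imm then 16bit
```

Here the N,M pair is read as `N||M`, matching the 0b01 (M=1, N=0) convention used in the list further up.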
 191
 192 ### C Instruction Encoding types
 193
 194 10-bit Opcode formats (all start with v3.0B EXT000 or EXT001
 195 Major Opcodes)
 196
 197     | 01234    | 567  8 | 9  | a b | c  | d e | f | enc
 198     | E01      | Cmaj.m | fld1     | fld2     | M | 10b
 199     | E01      | Cmaj.m | offset              | M | 10b b
 200     | E01      | 001.1  | S1 | fd1 | S2 | fd2 | M | 10b sub
 201     | E01      | 111.m  | fld1     | fld2     | M | 10b LDST
 202
 203 16-bit Opcode formats (including 10/16/v3.0B Switching)
 204
 205     | 0 | 1234 | 567  8 | 9  | a b | c  | d e | f | enc
 206     | N | immf | Cmaj.m | fld1     | fld2     | M | 16b
 207     | 1 | immf | Cmaj.m | fld1     | imm      | 1 | 16b imm
 208     | N | fd3  | 001.1  | S1 | fd1 | S2 | fd2 | M | 16b sub
 209     | N | fd4  | 111.m  | fld1     | fld2     | M | 16b LDST
 210
 211 Notes:
 212
 213 * fld1 and fld2 can contain reg numbers, immediates, or opcode
 214   fields (BO, BI, LK)
 215 * S1 and S2 are further sub-selectors of C 001.1
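
As an illustration, the generic "16b" row can be split into its fields like this. It is a sketch only: `C16Fields`, `field` and `decode_16b` are names invented for the example, and MSB0 numbering is assumed as in the tables.

```python
from collections import namedtuple

C16Fields = namedtuple("C16Fields", "N immf cmaj minor fld1 fld2 M")


def field(hw, first, last):
    """Extract bits first..last (inclusive, MSB0 numbering) of a 16-bit value."""
    width = last - first + 1
    return (hw >> (15 - last)) & ((1 << width) - 1)


def decode_16b(hw):
    """Split a halfword according to the generic 16b format row."""
    return C16Fields(
        N=field(hw, 0, 0),
        immf=field(hw, 1, 4),
        cmaj=field(hw, 5, 7),
        minor=field(hw, 8, 8),
        fld1=field(hw, 9, 11),    # bits 9, a, b
        fld2=field(hw, 12, 14),   # bits c, d, e
        M=field(hw, 15, 15),
    )


#               N immf cmaj m fld1 fld2 M
print(decode_16b(0b1_0101_010_1_011_110_1))
```

For the 10-bit formats the same function applies, with N and immf simply being the (all-zero) top bits of EXT000/EXT001.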
 216
 217 ### Immediate Opcodes
 218
 219 Only available in 16-bit mode, and only when M=1 and N=1
 220 and Cmaj.min is not 0b001.1.
 221
 222 instruction counts from objdump on /bin/bash:
 223
 224       466 extsw r1,r1
 225       649 stw r1,1(r1)
 226       691 lwz r1,1(r1)
 227       705 cmpdi r1,1
 228       791 cmpwi r1,1
 229       794 addis r1,r1,1
 230      1474 std r1,1(r1)
 231      1846 li r1,1
 232      2031 mr r1,r1
 233      2473 addi r1,r1,1
 234      3012 nop
 235      3028 ld r1,1(r1)
 236
 237
 238     | 0 | 1  | 2 | 3 4 | | 567.8 | 9ab  | cde | f |
 239     | 1 | 0  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
 240     | 1 | 0  |  sh2    | | 001.0 | RA   | sh  | 1 | sradi.
 241     | 1 | 1  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
 242     | 1 | 1  | 0 | sh2 | | 001.0 | RA   | sh  | 1 | srawi.
 243     | 1 | 1  | 1 |     | | 001.0 | 000  | imm | 1 | TBD
 244     | 1 | 1  | 1 | i2  | | 001.0 | RA!=0| imm | 1 | addis
 245     | 1 | 0  | i2      | | 010.0 | 000  | imm | 1 | setvli
 246     | 1 | 1  | i2      | | 010.0 | 000  | imm | 1 | setmvli
 247     | 1 | i2           | | 010.0 | RA!=0| imm | 1 | addi
 248     | 1 | 0  | i2      | | 010.1 | RA   | imm | 1 | cmpdi
 249     | 1 | 1  | i2      | | 010.1 | RA   | imm | 1 | cmpwi
 250     | 1 | 0  | i2      | | 011.0 | RT   | imm | 1 | ldspi
 251     | 1 | 1  | i2      | | 011.0 | RT   | imm | 1 | lwspi
 252     | 1 | 0  | i2      | | 011.1 | RT   | imm | 1 | stwspi
 253     | 1 | 1  | i2      | | 011.1 | RT   | imm | 1 | stdspi
 254     | 1 | i2 | RA      | | 100.0 | RT   | imm | 1 | stwi
 255     | 1 | i2 | RA      | | 100.1 | RT   | imm | 1 | stdi
 256     | 1 | i2 | RT      | | 101.0 | RA   | imm | 1 | ldi
 257     | 1 | i2 | RT      | | 101.1 | RA   | imm | 1 | lwi
 258     | 1 | i2 | RA      | | 110.0 | RT   | imm | 1 | fsti
 259     | 1 | i2 | RA      | | 110.1 | RT   | imm | 1 | fstdi
 260     | 1 | i2 | RT      | | 111.0 | RA   | imm | 1 | flwi
 261     | 1 | i2 | RT      | | 111.1 | RA   | imm | 1 | fldi
 262
 263 Construction of immediate:
 264
 265 * LD/ST r1 (SP) variants should be offset by -256
 266   see <https://bugs.libre-soc.org/show_bug.cgi?id=238#c43>
 267   - SP variants map to e.g ld RT, imm(r1)
 268   - SV Prefixing can be used to map r1 to alternate regs
 269 * [1] not the same as v3.0B addis: the shift amount is smaller and actually
 270   still maps to within the v3.0B addi immediate range.
 271 * addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
 272 * addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1023 in
 273   increments of 8
 274 * all others are EXTS(i2||imm) to give a 7-bit range -128 to +127
 275   (further for LD/ST due to word/dword-alignment)
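
The EXTS(a||b) notation can be sketched as follows. The helpers are hypothetical, and the field widths passed in the examples are chosen purely to reproduce the 4-bit and 11-bit ranges quoted above, they are not read off the table.

```python
def exts(value, bits):
    """Sign-extend the low `bits` bits of value (two's complement)."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def concat(*fields):
    """Concatenate (value, width) pairs MSB-first, like the || operator.

    Returns (value, total_width) so the result can be passed to exts().
    """
    out, total = 0, 0
    for value, width in fields:
        out = (out << width) | (value & ((1 << width) - 1))
        total += width
    return out, total


# 4 bits in total (assumed 1-bit i2, 3-bit imm): range -8 to +7
print(exts(*concat((1, 1), (0b000, 3))))   # -8
print(exts(*concat((0, 1), (0b111, 3))))   # 7

# three zero bits appended: results move in increments of 8
print(exts(*concat((0b11111111, 8), (0b000, 3))))  # -8
```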
 276
 277 Further Notes:
 278
 279 * bc also has an immediate mode, listed separately below in Branch section
 280 * for LD/ST, offset is aligned.  8-byte: i2||imm||0b000, 4-byte: i2||imm||0b00
 281 * SV Prefix over-rides help provide alternative bitwidths for LD/ST
 282 * RA|0 if RA is zero, addi. becomes "li"
 283   - this only works if RT takes part of opcode
 284   - mv is also possible by specifying an immediate of zero
 285
 286 ### Illegal, nop and attn
 287
 288 Note that illeg is all zeros, including in the 16-bit mode.
 289 Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and
 290 EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit
 291 run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000
 292 is "nop"
 293
 294     | 16-bit mode | | 10-bit mode                 |
 295     | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
 296     | - | - | --- | | -----  | ----- | ------ | - |
 297     | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | illeg
 298     | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop
 299
 300 16 bit mode only:
 301
 302     | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
 303     | - | - | --- | | -----  | ----- | ------ | - |
 304     | 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | nop
 305     | 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop
 306     | N | 1   000 | | 000.0  | 0  00 | 0   00 | M | attn
 307
 308 Notes:
 309
 310 * All-zeros being an illegal instruction is normal for ISAs.  Ensuring that
 311   this remains true at all times i.e. for both 10 bit and 16 bit mode is
 312   common sense.
 313 * The 10-bit nop (bit 15, M=1) is intended for circumstances
 314   where alignment to 32-bit before returning to v3.0B is required.
 315   M=1 being an indication "return to Standard v3.0B Encoding Mode".
 316 * The 16-bit nop (bit 0, N=1) is intended for circumstances where a
 317   return to Standard v3.0B Encoding is required for one cycle,
 318   and where that cycle needs alignment to a 32-bit boundary.
 319   Examples of this would be to return to "strict" (non-C) mode
 320   where the PC must be on a word-aligned boundary.
 321 * If for any reason multiple 16 bit nops are needed in succession
 322   the M=1 variant can be used, because each one returns to
 323   Standard v3.0B Encoding Mode, each time.
 324
 325 In essence the 2 nops are needed due to there being 2 different C forms:
 326 10 and 16 bit.
 327
 328 ### Branch
 329
 330 TODO: document that branching whilst using mode-switching bits (M/N) is perfectly well permitted, the caveat being: it is specifically and wholly the compiler/assembler writer's responsibility to obey ABI rules and ensure that, even with branches and returns, at no time is an incorrect mode entered or left that could result in any instruction being misinterpreted.
 331
 332     | 16-bit mode | | 10-bit mode                 |
 333     | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
 334     | - | - | --- | | -----  | ----- | ------ | - |
 335     | N | offs2   | | 000.LK | offs!=0        | M | b, bl
 336     | N |         | | 000.1  | 0  00 | 0   00 | M | TBD
 337     | 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl
 338     | N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl
 339
 340 16 bit mode:
 341
 342 * bc only available when N,M=0b11
 343 * offs2 extends offset in MSBs
 344 * BI3 extends BI in MSBs to allow selection of full CR
 345 * BO3 extends BO
 346 * bc offset constructed from oo as LSBs and offs2 as MSBs
 347 * bc BI allows selection of all bits from CR0 or CR1
 348 * bc CR check is always active (as if BO0=1) therefore BO1 inverts
 349
 350 10 bit mode:
 351
 352 * illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
 353 * nop also covers part of branch (offs=0,M=0,LK=1)
 354 * bc **not available** in 10-bit mode
 355 * BO[0] enables CR check, BO[1] inverts check
 356 * BI refers to CR0 only (4 bits of)
 357 * no Branch Conditional with immediate
 358 * no Absolute Address
 359 * CTR mode allowed with BO[2] for b only.
 360 * offs is signed and 2-byte aligned
 361 * all branch targets are 2-byte aligned
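
A sketch of how the b/bl offset might be assembled from the fields above. Assumptions, none of which are settled by the notes: offs is the 6 bits 9-e, offs2 is the 4 bits 1-4, and the "signed, in units of 2 bytes" rule listed for 10 bit mode also applies to the extended 16 bit form. Function names are hypothetical.

```python
def field(hw, first, last):
    """Extract bits first..last (inclusive, MSB0 numbering) of a 16-bit value."""
    width = last - first + 1
    return (hw >> (15 - last)) & ((1 << width) - 1)


def exts(value, bits):
    """Sign-extend the low `bits` bits of value (two's complement)."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def branch_byte_offset(hw, in_16bit_mode):
    """Byte offset encoded by a C-mode b/bl (sketch, see assumptions above)."""
    offs, width = field(hw, 9, 14), 6
    if in_16bit_mode:
        offs |= field(hw, 1, 4) << 6   # offs2 extends the offset in the MSBs
        width = 10
    return exts(offs, width) * 2       # offsets count 2-byte units
```

Under these assumptions 10 bit mode reaches -64 to +62 bytes, and 16 bit mode -1024 to +1022 bytes.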
 362
 363 ### LD/ST
 364
 365 Note: for 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
 366
 367     | 16-bit mode  | | 10-bit mode               |
 368     | 0 | 1  | 234 | | 567.8 | 9 a b | c d e | f |
 369     | - | -- | --- | | ----- | ----- | ----- | - |
 370     | N | SZ |  RB | | 001.1 | 1  RA | 0  RT | M | st
 371     | N | SZ |  RB | | 001.1 | 1  RA | 1  RT | M | fst
 372     | N | SZ |  RT | | 111.0 |  RA   |  RB   | M | ld
 373     | N | SZ |  RT | | 111.1 |  RA   |  RB   | M | fld
 374
 375 * elwidth overrides can set different widths
 376
 377 16 bit mode:
 378
 379 * SZ=1 is 64 bit, SZ=0 is 32 bit
 380
 381 10 bit mode:
 382
 383 * RA and RB are only 2 bit (0-3)
 384 * for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
 385 * for ST, there is no offset: "st RT, RA(0)"
 386
 387 ### Arithmetic
 388
 389 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
 390 * 16-bit: note that bit 1==0 (sub-sub-encoding)
 391
 392 10 and 16 bit:
 393
 394     | 16-bit mode | | 10-bit mode             |
 395     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 396     | - | - | --- | | ----- | --- | ----- | - |
 397     | N | 0 | RT  | | 010.0 | RB  | RA!=0 | M | add
 398     | N | 0 | RT  | | 010.1 | RB  | RA|0  | M | sub.
 399     | N | 0 | BF  | | 011.0 | RB  | RA|0  | M | cmpl
 400
 401 Notes:
 402
 403 * sub. and cmpl: default CR target is CR0
 404 * for (RA|0) when RA=0 the input is a zero immediate,
 405   meaning that sub. becomes neg. and cmp becomes cmpi against zero
 406 * RT is implicitly RB: "add RT(=RB), RA, RB"
 407 * Opcode 0b010.0 RA=0 is not missing from the above:
 408   it is a system-wide instruction, "cbank" (section below)
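
The "10 and 16 bit" table and its notes can be read as the following decode sketch. It is illustrative only: the function names and the tuple layout (mnemonic, then operands) are invented for the example.

```python
def field(hw, first, last):
    """Extract bits first..last (inclusive, MSB0 numbering) of a 16-bit value."""
    width = last - first + 1
    return (hw >> (15 - last)) & ((1 << width) - 1)


def decode_arith(hw, in_16bit_mode):
    """Decode add / sub. / cmpl (bit 1 == 0 rows); None if not one of them."""
    if in_16bit_mode and field(hw, 1, 1) != 0:
        return None                      # bit 1 == 1: the 16-bit-only table
    cmaj_m = field(hw, 5, 8)
    rb = field(hw, 9, 11)
    ra = field(hw, 12, 14)
    # bits 2-4 only exist in 16-bit mode; in 10-bit mode RT is implicitly RB
    # and the CR target defaults to CR0
    rt = field(hw, 2, 4) if in_16bit_mode else rb
    bf = field(hw, 2, 4) if in_16bit_mode else 0
    if cmaj_m == 0b0100:
        # RA == 0 is not an add: the slot is borrowed by cbank (CBank in 9ab)
        return ("cbank", rb) if ra == 0 else ("add", rt, ra, rb)
    if cmaj_m == 0b0101:
        # (RA|0): a zero RA field means a zero immediate, so sub. becomes neg.
        return ("neg.", rt, rb) if ra == 0 else ("sub.", rt, ra, rb)
    if cmaj_m == 0b0110:
        return ("cmpl", bf, ra, rb)      # ra == 0: compare against zero
    return None


#                   EXT   Cmaj.m RB  RA  M
print(decode_arith(0b00000_0100_011_001_0, False))  # ('add', 3, 1, 3)
```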
 409
 410 16 bit mode only:
 411
 412     | 0 | 1 | 234 | | 567.8 | 9ab | cde   | f |
 413     | - | - | --- | | ----- | --- | ----- | - |
 414     | N | 1 | RA  | | 010.0 | RB  | RS    | M | sld.
 415     | N | 1 | RA  | | 010.1 | RB  | RS!=0 | M | srd.
 416     | N | 1 | RA  | | 010.1 | RB  | 000   | M | srad.
 417     | N | 1 | BF  | | 011.0 | RB  | RA|0  | M | cmpw
 418
 419 Notes:
 420
 421 * for srad, RS=RA: "srad. RA(=RS), RS, RB"
 422
 423 ### Logical
 424
 425 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
 426 * 16-bit: note that bit 1==0 (sub-sub-encoding)
 427
 428 10 and 16 bit:
 429
 430     | 16-bit mode | | 10-bit mode             |
 431     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 432     | - | - | --- | | ----- | --- | ----- | - |
 433     | N | 0 |  RT | | 100.0 | RB  | RA!=0 | M | and
 434     | N | 0 |  RT | | 100.1 | RB  | RA!=0 | M | nand
 435     | N | 0 |  RT | | 101.0 | RB  | RA!=0 | M | or
 436     | N | 0 |  RT | | 101.1 | RB  | RA!=0 | M | nor/mr
 437     | N | 0 |  RT | | 100.0 | RB  | 0 0 0 | M | popcnt
 438     | N | 0 |  RT | | 100.1 | RB  | 0 0 0 | M | cntlz
 439     | N | 0 |  RT | | 101.0 | RB  | 0 0 0 | M | extsw
 440     | N | 0 |  RT | | 101.1 | RB  | 0 0 0 | M | not
 441
 442 16-bit mode only (note that bit 1 == 1):
 443
 444     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 445     | - | - | --- | | ----- | --- | ----- | - |
 446     | N | 1 |  RT | | 100.0 | RB  | RA!=0 | M | TBD
 447     | N | 1 |  RT | | 100.1 | RB  | RA!=0 | M | TBD
 448     | N | 1 |  RT | | 101.0 | RB  | RA!=0 | M | xor
 449     | N | 1 |  RT | | 101.1 | RB  | RA!=0 | M | eqv (xnor)
 450     | N | 1 |  RT | | 100.0 | RB  | 0 0 0 | M | setvl.
 451     | N | 1 |  RT | | 100.1 | RB  | 0 0 0 | M | cnttz
 452     | N | 1 |  RT | | 101.0 | RB  | 0 0 0 | M | extsb
 453     | N | 1 |  RT | | 101.1 | RB  | 0 0 0 | M | extsh
 454
 455 10 bit mode:
 456
 457 * idea: for 10bit mode, nor is actually 'mr' because mr is
 458   a more common operation.  in 16bit however, this encoding
 459   (Cmaj.min=0b101.1, N=0) is 'nor'
 460 * for (RA|0) when RA=0 the input is a zero immediate,
 461   meaning that nor becomes not
 462 * cntlz, popcnt, exts **not available** in 10-bit mode
 463 * RT is implicitly RB: "and RT(=RB), RA, RB"
 464
 465 ### Floating Point
 466
 467 Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64
 468
 469 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
 470 * 16-bit: note that bit 1==0 (sub-sub-encoding)
 471
 472 10 and 16 bit:
 473
 474     | 16-bit mode | | 10-bit mode             |
 475     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 476     | - | - | --- | | ----- | --- | ----- | - |
 477     | N | 0 |  RT | | 011.1 | RB  | RA!=0 | M | fsub.
 478     | N | 0 |  RT | | 110.0 | RB  | RA!=0 | M | fadd
 479     | N | 0 |  RT | | 110.1 | RB  | RA!=0 | M | fmul
 480     | N | 0 |  RT | | 011.1 | RB  | 0 0 0 | M | fneg.
 481     | N | 0 |     | | 110.0 |     | 0 0 0 | M | TBD
 482     | N | 0 |     | | 110.1 |     | 0 0 0 | M | TBD
 483
 484 16-bit mode only (note that bit 1 == 1):
 485
 486     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 487     | - | - | --- | | ----- | --- | ----- | - |
 488     | N | 1 |     | | 011.1 |     | RA!=0 | M | TBD
 489     | N | 1 |     | | 110.0 |     | RA!=0 | M | TBD
 490     | N | 1 |  RT | | 110.1 | RB  | RA!=0 | M | fdiv
 491     | N | 1 |  RT | | 011.1 | RB  | 0 0 0 | M | fabs.
 492     | N | 1 |  RT | | 110.0 | RB  | 0 0 0 | M | fmr.
 493     | N | 1 |     | | 110.1 |     | 0 0 0 | M | TBD
 494
 495 16 bit only, FP to INT convert (using C 0b001.1 subencoding)
 496
 497     | 0 | 123 | 4 | | 567.8 | 9 ab | cde  | f |
 498     | - | --- | - | | ----- | ---- | ---- | - |
 499     | N | 101 | X | | 001.1 | 0 RA | Y RT | M | fp2int
 500     | N | 110 | X | | 001.1 | 0 RA | Y RT | M | int2fp
 501
 502 * X: signed=1, unsigned=0
 503 * Y: FP32=0, FP64=1
 504
 505 10 bit mode:
 506
 507 * fsub. fneg. and fmr. default target is CR1
 508 * fmr. is **not available** in 10-bit mode
 509 * fdiv is **not available** in 10-bit mode
 510
 511 16 bit mode:
 512
 513 * fmr. copies RB to RT (and sets CR1)
 514
 515 ### Condition Register
 516
 517 10-bit or 16 bit:
 518
 519     | 16-bit mode| | 10-bit mode            |
 520     | 0 | 123 | 4   | | 567.8 | 9 ab | cde | f |
 521     | - | --- | --- | | ----- | ---- | --- | - |
 522     | N | 000 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf
 523
 524 16-bit only:
 525
 526     | 0 | 1234 | | 567.8 | 9 ab | cde | f |
 527     | - | ---- | | ----- | ---- | --- | - |
 528     | N | 0010 | | 001.1 | 0 BA | BB  | M | crnor
 529     | N | 0011 | | 001.1 | 0 BA | BB  | M | crandc
 530     | N | 0100 | | 001.1 | 0 BA | BB  | M | crxor
 531     | N | 0101 | | 001.1 | 0 BA | BB  | M | crnand
 532     | N | 0110 | | 001.1 | 0 BA | BB  | M | crand
 533     | N | 0111 | | 001.1 | 0 BA | BB  | M | creqv
 534     | N | 1000 | | 001.1 | 0 BA | BB  | M | crorc
 535     | N | 1001 | | 001.1 | 0 BA | BB  | M | cror
 536
 537 Notes:
 538
 539 10 bit mode:
 540
 541 * mcrf BF is only 2 bits which means the destination is only CR0-CR3
 542 * CR operations: **not available** in 10-bit mode (but mcrf is)
 543
 544 16 bit mode:
 545
 546 * mcrf BF2 extends BF (in MSB) to 3 bits
 547 * CR operations: destination register is same as BA.
 548 * CR operations: only possible on CR0 and CR1
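
The 16-bit-only CR table amounts to a lookup on bits 1-4 within C 0b001.1. A hedged sketch (names are hypothetical; the returned tuple is mnemonic, destination, BA, BB, with the destination being the same as BA as noted above):

```python
CR_OPS = {
    0b0010: "crnor", 0b0011: "crandc", 0b0100: "crxor", 0b0101: "crnand",
    0b0110: "crand", 0b0111: "creqv",  0b1000: "crorc", 0b1001: "cror",
}


def field(hw, first, last):
    """Extract bits first..last (inclusive, MSB0 numbering) of a 16-bit value."""
    width = last - first + 1
    return (hw >> (15 - last)) & ((1 << width) - 1)


def decode_cr_op(hw):
    """Decode a 16-bit-mode CR operation, or return None."""
    if field(hw, 5, 8) != 0b0011 or field(hw, 9, 9) != 0:
        return None                      # not C 001.1 with bit 9 clear
    name = CR_OPS.get(field(hw, 1, 4))
    if name is None:
        return None                      # mcrf, fp2int, int2fp, system...
    ba = field(hw, 10, 11)
    bb = field(hw, 12, 14)
    return (name, ba, ba, bb)            # destination register is BA


#                   N sub  001.1 0 BA BB  M
print(decode_cr_op(0b0_0100_0011_0_01_010_1))  # ('crxor', 1, 1, 2)
```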
 549
 550 SV (Vector Mode):
 551
 552 * CR operations: greatly extended reach/range (useful for predicates)
 553
 554 ### System
 555
 556 cbank: Selection of Compressed-encoding "Bank".  Different "banks"
 557 give different meanings to opcodes.  Example: CBank=0b001 is heavily
 558 optimised for Audio/Video Encode/Decode.  cbank borrows from add's encoding
 559 space (when RA==0)
 560
 561     | 16-bit mode | | 10-bit mode             |
 562     | 0 | 1 2 3 4 | | 567.8 | 9ab   | cde | f |
 563     | - | ------- | | ----- | ----- | --- | - |
 564     | N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank
 565
 566 **not available** in 10-bit mode, **only** in 16-bit mode:
 567
 568     | 0 | 1 | 234 | | 567.8 | 9 ab | cde  | f |
 569     | - | - | --- | | ----- | ---- | ---- | - |
 570     | N | 1 | 111 | | 001.1 | 0 00 |  RT  | M | mtlr
 571     | N | 1 | 111 | | 001.1 | 0 01 |  RT  | M | mtctr
 572     | N | 1 | 111 | | 001.1 | 0 00 |  RA  | M | mflr
 573     | N | 1 | 111 | | 001.1 | 0 01 |  RA  | M | mfctr
 574     | N | 0 RA!=0 | | 000.0 | 0 00 |  000 | M | mtcr
 575     | N | 1 RT!=0 | | 000.0 | 0 00 |  000 | M | mfcr
 576
 577 ### Unallocated
 578
 579 16-bit only:
 580
 581     | 0 | 1 | 234 | | 567.8 | 9 ab | cde  | f |
 582     | - | - | --- | | ----- | ---- | ---- | - |
 583     | N | 1 | 111 | | 001.1 | 0 10 |      | M |
 584     | N | 1 | 111 | | 001.1 | 0 11 |      | M |
 585
 586 # Other ideas (Attempt 2)
 587
 588 ## 8-bit mode-switching instructions, odd addresses for C mode
 589
 590 Drop the complexity of a 16-bit encoding that is further reduced to 10-bit,
 591 and use a single byte instead of two to switch between modes.  This
 592 would place compressed (C) mode instructions at odd bytes, so the LSB
 593 of the PC can be used for the processor to tell which mode it is in.
 594
 595 To switch from traditional to compressed mode, the single-byte
 596 instruction would be at the MSByte, that holds the EXT bits.  (When we
 597 break up a 32-bit instruction across words, the most significant half
 598 should go in the word with the lower address.)
 599
 600 To switch from compressed mode to traditional mode, the single-byte
 601 instruction would also be at the opcode/format portion, placed in the
 602 lower-address word if split across words, so that the instruction can
 603 be recognized as the mode-switching one without going for its second
 604 byte.
 605
 606 The C-mode nop should be encoded so that its second byte encodes a
 607 switch to compressed mode, if decoded in traditional mode.  This
 608 enables such a nop to straddle across a label:
 609
 610     8-bit first half of nop
 611     Label:
 612     8-bit second half of nop AKA switch to compressed mode
 613     16-bit insns...
 614
 615 so that if traditional code jumps to the word-aligned label (because
 616 traditional branches drop the 2 LSB), it immediately switches to
 617 compressed mode; if we fall-through, we remain in 16-bit mode; and if
 618 we branch to it from compressed mode, whether we jump to the odd or
 619 the even address, we end up in compressed mode as desired.
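
The address rule in this idea is small enough to sketch directly. Function names are made up for the illustration:

```python
def mode_from_pc(pc):
    """Attempt-2 idea: compressed (C) instructions sit at odd byte addresses,
    so the LSB of the PC tells the processor which mode it is in."""
    return "C" if pc & 1 else "v3.0B"


def traditional_branch_target(addr):
    """Traditional branches drop the 2 LSBs, so they always land word-aligned."""
    return addr & ~0b11


label = 0x1000                      # word-aligned Label from the example above
print(mode_from_pc(label - 1))      # C      (first half of the straddling nop)
print(mode_from_pc(traditional_branch_target(label + 3)))  # v3.0B
```

A traditional branch therefore always arrives in v3.0B mode, and it is the byte found at the label (the second half of the nop) that performs the switch to compressed mode.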
 620
 621 Tables explaining encoding:
 622
 623     | byte 0 | byte 1 | byte 2 | byte 3 |
 624     | v3.0B standard 32 bit instruction |
 625     | EXT000 | 16 bit          | 16...  |
 626     | .. bit | 8nop   | v3.0b stand...  |
 627     | .. ard 32 bit   | EXT000 | 16...  |
 628     | .. bit | 16 bit          | 8nop   |
 629     | v3.0B standard 32 bit instruction |
 630
 631 # Other ideas (v3)
 632
 633 FSM state switching and mode switching deemed too complex.  Instead cut back to:
 634
 635 1. 10bit only (actually, 11 bit)
 636 2. SV-Prefixed 16bit only (aka SV-C32)
 637
 638 Each will be entirely different which is a huge amount of work.
 639
 640 # TODO
 641
 642 * make a preliminary assessment of branch in/out viability
 643 * confirm FSM encoding (is LSB of PC really enough?)
 644 * guesstimate opcode and register allocation (without necessarily doing
 645   a full encoding)
 646 * write throwaway python program that estimates compression ratio from
 647   objdump raw parsing
 648 * finally do full opcode allocation
 649 * rerun objdump compression ratio estimates
 650 * check in the FSM, for "return to v3.0B then 16bit", whether it is ok for the v3.0B instruction to be a 10bit Compressed one.  should this be ignored and carry on? should a trap occur?
 651
 652 ### Use 2- rather than 3-register opcodes
 653
 654 Successful compact ISAs have used 2- rather than 3-register insns, in
 655 which the same register serves as input and output.  Some 20% of
 656 general-purpose 3-register insns already use either input register as
 657 output, without any effort by the compiler to do so.
 658
 659 Repurposing the 3 bits used to encode one of the input registers
 660 in arithmetic, logical and floating-point insns, and the 2 bits
 661 used to encode the mode of the next two insns, we could make the full
 662 register files available to the opcodes already selected for
 663 compressed mode, with one bit to spare to bring additional opcodes in.
 664
 665 An opcode could be assigned to an instruction that combines and
 666 extends with the subsequent instruction, providing it with a separate
 667 input operand to use rather than the output register, or with
 668 additional range for immediate and offset operands, effectively
 669 forming a 32-bit operation, enabling us to remain in compressed mode
 670 even longer.
 671
 672 # Appendix
 673
 674 ## Analysis techniques and tools
 675
 676     objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
 677       s/^[ x0-9A-Fa-f]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
 678       sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
 679       s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
 680       sort -n | less
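
A rough Python counterpart of the pipeline above, as a starting point for the throwaway analysis program mentioned in the TODO list. It is a sketch: the function names are hypothetical, the regular expressions are transliterated from the sed expressions, and the address pattern here accepts hex digits in either case.

```python
import re
from collections import Counter

# address column, colon, mnemonic, operands (objdump -d --no-show-raw-insn)
INSN = re.compile(r"^[ x0-9A-Fa-f]*: *([a-z.]+) *(.*)")


def normalise(line):
    """Reduce one objdump line to 'mnemonic operands' with every register
    other than r0 renamed to r1 and every number replaced by 1.
    Returns None for lines that are not instructions."""
    m = INSN.match(line.replace("\t", " "))
    if m is None:
        return None
    text = f"{m.group(1)} {m.group(2)} "
    text = re.sub(r"([, (])r[1-9][0-9]*", r"\1r1", text)
    text = re.sub(r"([ ,])-*[0-9]+([^0-9])", r"\g<1>1\2", text)
    return text.strip()


def count_insns(lines):
    """Count normalised instructions over an iterable of objdump lines."""
    counts = Counter()
    for line in lines:
        key = normalise(line)
        if key is not None:
            counts[key] += 1
    return counts
```

Feeding it `sys.stdin` with the objdump output piped in, then printing `sorted(counts.items(), key=lambda kv: kv[1])`, gives a listing of the same shape as the one produced by `sort | uniq --count | sort -n`.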

## gcc register allocation

For the record, the following information about fixed registers (those
assigned to special purposes) and register allocation order is extracted
from gcc's gcc/config/rs6000/rs6000.h.

Special-purpose registers on ppc are:

    r0: constant zero/throw-away
    r1: stack pointer
    r2: thread-local storage pointer in 32-bit mode
    r2: non-minimal TOC register
    r10: EH return stack adjust register
    r11: static chain pointer
    r13: thread-local storage pointer in 64-bit mode
    r30: minimal-TOC/-fPIC/-fpic base register
    r31: frame pointer
    lr: return address register

The register allocation order in GCC (i.e., it takes the earliest
available register that fits the constraints) is:

    We allocate in the following order:

        fp0             (not saved or used for anything)
        fp13 - fp2      (not saved; incoming fp arg registers)
        fp1             (not saved; return value)
        fp31 - fp14     (saved; order given to save least number)
        cr7, cr5        (not saved or special)
        cr6             (not saved, but used for vector operations)
        cr1             (not saved, but used for FP operations)
        cr0             (not saved, but used for arithmetic operations)
        cr4, cr3, cr2   (saved)
        r9              (not saved; best for TImode)
        r10, r8-r4      (not saved; highest first for less conflict with params)
        r3              (not saved; return value register)
        r11             (not saved; later alloc to help shrink-wrap)
        r0              (not saved; cannot be base reg)
        r31 - r13       (saved; order given to save least number)
        r12             (not saved; if used for DImode or DFmode would use r13)
        ctr             (not saved; when we have the choice ctr is better)
        lr              (saved)
        r1, r2, ap, ca  (fixed)
        v0 - v1         (not saved or used for anything)
        v13 - v3        (not saved; incoming vector arg registers)
        v2              (not saved; incoming vector arg reg; return value)
        v19 - v14       (not saved or used for anything)
        v31 - v20       (saved; order given to save least number)
        vrsave, vscr    (fixed)
        sfp             (fixed)

## Comparison to VLE

VLE was a means to reduce executable size through three interleaved methods:

* (1) invention of 16 bit encodings (of exactly 16 bit in length)
* (2) invention of 16+16 bit encodings (a 16 bit instruction format but with
  an *additional* 16 bit immediate "tacked on" to the end, actually
  making a 32-bit instruction format)
* (3) seamless and transparent embedding and intermingling of the
  above in amongst arbitrary v2.06/7 BE 32 bit instruction sequences,
  with no additional state,
  including when the PC was not aligned on a 4-byte boundary.

Whilst (1) and (3) make perfect sense, (2) makes no sense at all given that, as inspection of "ori" and others shows, D-Form 16 bit immediates are the "norm" for v2.06/7 and v3.0B standard instructions.  (2) in effect **is** a 32 bit instruction.  (2) **is not** a 16 bit instruction.

*Why "reinvent" an encoding that is 32 bit, when there already exists a 32 bit encoding that does the exact same job?*

Consequently, we do **not** envisage a scenario where (2) would ever be implemented, nor would this Compressed Encoding be extended beyond 16 bit in the future.  Compressed is Compressed and is **by definition** limited to precisely, and only, 16 bit.

An additional reason is that VLE is exceptionally complex to implement.  VLE was perfectly well matched to the single-issue, low clock rate "Embedded" environment for which it was originally designed.

However, this Compressed Encoding is designed for high-performance multi-issue systems *as well* as Embedded scenarios. Consequently, the complexity of "deep packet inspection" down into the depths of a 16 bit sequence, in order to ascertain if it might not be 16 bit after all, is wholly unacceptable.

By eliminating the 16+16 (actually 32 bit) conflation tricks outlined in (2), Compressed is *specifically* designed to fit into a very small FSM, suitable for multi-issue, that in no way requires "deep-dive" analysis. Yet, despite OpenPOWER never having been designed with 16 bit encodings in mind, Compressed is still suitable for retro-fitting onto it.

## Compressed Decoder Phases

Phase 1 (stage 1 of a 2-stage pipelined decoder) is defined as the minimum necessary FSM required to determine instruction length and mode.  This is implemented with the absolute bare minimum of gates and is based on the 6 encodings involving N, M and EXTNNN (see table below).

Phase 2 (stage 2 of a 2-stage pipelined decoder) is defined as the "full decoder" that includes taking into account the length and mode from Phase 1.  Given a 2-stage pipelined decoder it is categorically **impossible** for Phase 2 to go backwards in time and affect the decisions made in Phase 1.

These two phases are specifically designed to take multi-issue execution into account.  Phase 1 is intended to be part of an O(log N) algorithm that can use a form of carry-lookahead propagation (N here being the number of instructions decoded in parallel, not the N bit of the encoding). Phase 2 is intended to be on a 2nd pipelined clock cycle, comprising a separate suite of independent local-state-only parallel pipelines that do not require any inter-communication of any kind.

Table: reminder of the six 16-bit encodings:

    | 0 | 1234  | 567  8 | 9abcde | f | explanation
    | - | ----  | ------ | ------ | - | -----------
    | EXT000/1  | Cmaj.m | fields | 0 | 10bit then v3.0B
    | EXT000/1  | Cmaj.m | fields | 1 | 10bit then 16bit
    | 0 | flds  | Cmaj.m | fields | 0 | 16bit then v3.0B
    | 0 | flds  | Cmaj.m | fields | 1 | 16bit then 16bit
    | 1 | flds  | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
    | 1 | flds  | Cmaj.m | fields | 1 | 16b/imm then 16bit
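One possible reading of the "carry-lookahead" idea is a parallel prefix scan over per-halfword transition tables. The sketch below is only an illustration under stated assumptions: the state names, the `transfer`, `compose` and `states_before` helpers, and the sample stream are all hypothetical. It assumes that the Compressed tag is "bits 0 to 4 all zero" (EXT000/1 in the table above), and that the single v3.0B instruction in the "16b, 1x v3.0B, 16b" case is taken as 32 bits without any tag check. Because composing transition tables is associative, the state in force before every halfword can be found in O(log N) rounds, much like carry propagation in an adder.

```python
# Hypothetical state names for this sketch (not part of the specification):
#   V  : next halfword starts a v3.0B insn, or carries the Compressed tag
#   V2 : second half of a v3.0B insn, then back to V
#   C  : next halfword is a 16bit insn
#   O  : next halfword starts a single v3.0B insn ("16b, 1x v3.0B, 16b")
#   O2 : second half of that single v3.0B insn, then back to C
V, V2, C, O, O2 = range(5)

def transfer(hw):
    """Transition table for one halfword: result[s] is the state after hw,
    given state s before it.  Bits are MSB0-numbered within the halfword."""
    n, m = (hw >> 15) & 1, hw & 1        # N = insn[0], M = insn[15]
    tagged = (hw >> 11) == 0             # assumed tag: insn[0:4] all zero
    after_v = (C if m else V) if tagged else V2
    after_c = C if m else (O if n else V)
    return (after_v, V, after_c, O2, C)

def compose(first, second):
    """Table equivalent to applying `first`, then `second`."""
    return tuple(second[s] for s in first)

def states_before(halfwords, start=V):
    """State in force before each halfword, using a log-depth prefix scan."""
    tables = [transfer(hw) for hw in halfwords]
    step = 1
    while step < len(tables):
        # Each round reads only the previous round's tables, so all
        # positions could be evaluated in parallel: O(log N) rounds.
        tables = [compose(tables[i - step], tables[i]) if i >= step
                  else tables[i] for i in range(len(tables))]
        step *= 2
    return ([start] + [t[start] for t in tables])[:len(halfwords)]

# Hypothetical stream: v3.0B, 10bit, 16bit, 16bit, v3.0B, 16bit, v3.0B
stream = [0x3860, 0x0001, 0x0001, 0x4001, 0x8000,
          0x0000, 0x0000, 0x4000, 0x3860, 0x0001]
states = states_before(stream)
starts = [i for i, s in enumerate(states) if s not in (V2, O2)]
print(starts)
```

Every halfword whose incoming state is neither `V2` nor `O2` begins an instruction, so the instruction boundaries fall out of the scan without any position having to wait for a serial walk over its predecessors.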

### Phase 1

The Phase 1 length/mode identification takes into account only 3 pieces of information:

* extc_id: insn[0:4] == EXTNNN (Compressed)
* N: insn[0]
* M: insn[15]

The Phase 1 FSM produces the following lengths/modes:

* 32 - v3.0B (includes v3.0B followed by 16bit)
* 16 - 10bit
* 16 - 16bit

**NOTE THAT FURTHER SUBIDENTIFICATION OF C MODES IS NOT CARRIED OUT AT PHASE 1**. In particular note specifically that 16 bit "immediate mode" is **not** part of the Phase 1 FSM, but is specifically isolated to Phase 2.

Pseudocode:

    # starting point for FSM
    previ.mode = v3.0B

    if previ.mode == v3.0B:
        # previous was v3.0B, look for compressed tag
        if extc_id:
            # found it.  move to 10bit mode
            nexti.length = 16
            nexti.mode = 10bit
        else:
            # nope. stay in v3.0B
            nexti.length = 32
            nexti.mode = v3.0B

    elif previ.mode == 10bit:
        # previous was 10bit, move to v3.0B or 16bit?
        if M == 0:
            # back to v3.0B
            nexti.length = 32
            nexti.mode = v3.0B
        else:
            # otherwise move to 16bit mode
            nexti.length = 16
            nexti.mode = 16bit

    elif previ.mode == 16bit:
        # previous was 16bit, stay there or move?
        if M == 0:
            # back to v3.0B
            nexti.length = 32
            if N == 1:
                # ... but only for 1 insn
                nexti.mode = v3.0B_then_16bit
            else:
                nexti.mode = v3.0B
        else:
            # otherwise stay in 16bit mode
            nexti.length = 16
            nexti.mode = 16bit

    # rest of FSM involving 3.0B to 16bit
    # and back transitions left to implementor
    # (or for someone else to add)
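As a minimal runnable sketch of the FSM above, the following Python generator walks a stream of halfwords and reports the length and mode of each instruction. The function name `phase1_scan` and the sample stream are hypothetical. Two assumptions are made that the pseudocode does not settle: the Compressed tag is taken to be "bits 0 to 4 all zero" (EXT000/1), and after the single instruction of `v3.0B_then_16bit` the FSM is assumed to return directly to 16bit mode without checking for a tag.

```python
def phase1_scan(halfwords):
    """Yield (index, length_in_bits, mode) for each instruction found.

    `halfwords` is a sequence of 16-bit integers in stream order, with
    bit 0 of each halfword being its most significant bit (MSB0)."""
    state = "v3.0B"                        # starting point for the FSM
    i = 0
    while i < len(halfwords):
        hw = halfwords[i]
        n, m = (hw >> 15) & 1, hw & 1      # N = insn[0], M = insn[15]
        if state == "v3.0B" and (hw >> 11) == 0:
            # compressed tag found (assumed: insn[0:4] all zero)
            yield i, 16, "10bit"
            state = "16bit" if m else "v3.0B"
            i += 1
        elif state in ("v3.0B", "v3.0B_then_16bit"):
            yield i, 32, "v3.0B"
            # assumption: after the single v3.0B insn, return to 16bit
            state = "16bit" if state == "v3.0B_then_16bit" else "v3.0B"
            i += 2
        else:                              # state == "16bit"
            yield i, 16, "16bit"
            if m:
                state = "16bit"
            else:
                state = "v3.0B_then_16bit" if n else "v3.0B"
            i += 1

# Hypothetical stream: v3.0B, 10bit, 16bit, 16bit, v3.0B, 16bit, v3.0B
stream = [0x3860, 0x0001, 0x0001, 0x4001, 0x8000,
          0x0000, 0x0000, 0x4000, 0x3860, 0x0001]
for index, length, mode in phase1_scan(stream):
    print(index, length, mode)
```

Only N, M and the tag bits are ever inspected, which is the property that keeps Phase 1 small: no other field of a 16-bit instruction can alter the length or mode of what follows.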

### Phase 2: Compressed mode

At this phase, knowing that the length is 16bit and the mode is either 10b or 16b, further analysis is required to determine if the 16bit.immediate encoding is active, and so on.  This is a fully combinatorial block that **at no time** steps outside of the strict bounds already determined by Phase 1.

    # Cmaj.m == 0b001.1 selects the CR and System opcodes
    op_001_1 = insn[5:8] == 0b001.1
    if mode == 10bit:
        decode_10bit(insn)
    elif mode == 16bit:
        if op_001_1:
            # see CR and System tables
            # (16 bit ones at least)
            decode_16bit_cr_or_sys(insn)
        elif N == 1 and M == 1:
            # see immediate opcodes table
            decode_16bit_immed_mode(insn)
        else:
            decode_16bit_nonimmed_mode(insn)

From this point onwards each of the decode_xx functions performs straightforward combinatorial decoding of the 16 bits of "insn".  In some cases this involves further analysis of bit 1; in some cases (Cmaj.m = 0b010.1) even further deep-dive decoding is required (CR ops).  *All* of it is entirely combinatorial and at **no time** involves changing of, or interaction with, or disruption of, the Phase 1 determination of Length+Mode (which has *already taken place* in an earlier decoding pipeline stage).
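The selection step can be sketched as a pure function of the Phase 1 mode and the 16 instruction bits, which is what "entirely combinatorial" amounts to in software terms. The function name `phase2_select` is hypothetical, and the sketch assumes one particular reading of the pseudocode: that Cmaj.m equal to 0b001.1 selects the CR and System decoder, and that immediate mode applies when N and M are both 1 for any other Cmaj.m.

```python
def phase2_select(mode, insn):
    """Return the name of the decoder that Phase 2 would select.

    `mode` is the Phase 1 result ("10bit" or "16bit"); `insn` is the
    16-bit instruction with MSB0 bit numbering."""
    n, m = (insn >> 15) & 1, insn & 1     # N = insn[0], M = insn[15]
    cmaj_m = (insn >> 7) & 0b1111         # insn[5:8], i.e. Cmaj.m
    if mode == "10bit":
        return "decode_10bit"
    if mode != "16bit":
        raise ValueError("Phase 2 Compressed decode needs a 16-bit mode")
    if cmaj_m == 0b0011:                  # assumed meaning of 0b001.1
        return "decode_16bit_cr_or_sys"
    if n == 1 and m == 1:
        return "decode_16bit_immed_mode"
    return "decode_16bit_nonimmed_mode"

print(phase2_select("16bit", 0x8281))     # N=1, M=1, Cmaj.m=0b010.1
```

The function reads no state other than its two arguments and never revises the length or mode it was given, mirroring the separation between the two decoder phases.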

### Phase 2: v3.0B mode

Standard v3.0B decoders are deployed.  Absolutely no interaction occurs with any 16 bit decoders or state.  Absolutely no interaction with the earlier Phase 1 decoding occurs.  Absolutely no interaction occurs (assuming an implementation that does not perform macro-op fusion) with other multi-issued v3.0B instructions being decoded in parallel at this time.

## Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode

[[demo]]

### Efficient Decoding Algorithm

[[decoding]]