openpower/sv/16_bit_compressed.mdwn

   1 # 16 bit Compressed
   2
   3 Similar to VLE (but without immediate-prefixing) this encoding is designed
   4 to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR
   5 is recommended). Note that Compressed is *mutually exclusive*
   6 with OpenPOWER v3.1B "prefixing" due to using (requiring) both EXT000
   7 and EXT001. Hypothetically it could be made to use anything other than
   8 EXT001, with some inconvenience (extra gates).  The incompatibility is
   9 "fixed" by swapping out of "Compressed" Mode and back into "Normal"
  10 (v3.1B) Mode, at runtime, as needed.
  11
  12 Although initially intended to be augmented by Simple-V Prefixing (to add
  13 Vector context, width overrides, e.g. IEEE754 FP16, and predication) without
  14 putting pressure on I-Cache power or size, this Compressed Encoding is not
  15 critically dependent *on* SV Prefixing, and may be used stand-alone.
  16
  17 See:
  18
  19 * <https://bugs.libre-soc.org/show_bug.cgi?id=238>
  20 * <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding
  21 * <http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html>
  22
  23 This one is a conundrum.  OpenPOWER ISA was never designed with 16
  24 bit in mind.  VLE was added 10 years ago but only by way of marking
  25 an entire 64k page as "VLE".  With VLE not maintained it is not
  26 fully compatible with current PowerISA.
  27
  28 Here, in order to embed 16 bit into a predominantly 32 bit stream,
  29 using an entire 16 bits just to switch into Compressed mode
  30 is itself a significant overhead.  The situation is made worse by
  31 OpenPOWER ISA being fundamentally designed with 6 bits uniformly
  32 taking up Major Opcode space, leaving only 10 bits to allocate
  33 to actual instructions.
  34
  35 Contrast this with RVC which takes 3 out of 4 combinations of the first 2
  36 bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and
  37 uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing
  38 standard 32 bit and 16 bit to intermingle cleanly.  To achieve the same
  39 thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which
  40 is clearly impractical: other schemes need to be devised.
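
For comparison, the RVC length check described above amounts to a two-bit test. A minimal Python sketch follows; the function name is illustrative only, and RISC-V formats longer than 32 bits are deliberately ignored:

```python
def rvc_insn_length(first_halfword: int) -> int:
    """Length in bytes of a RISC-V instruction, given its first 16-bit parcel.

    Formats longer than 32 bits are ignored in this sketch.
    """
    # 0b00, 0b01 and 0b10 in the two LSBs mark a 16-bit (RVC) instruction;
    # 0b11 is the escape that marks a standard 32-bit instruction.
    return 4 if (first_halfword & 0b11) == 0b11 else 2
```

No equivalent two-bit test exists for OpenPOWER, because the 6-bit Major Opcode field sits where such an escape would have to go.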
  41
  42 In addition we would like to add SV-C32 which is a Vectorised version
  43 of 16 bit Compressed, and ideally have a variant that adds the 27-bit
  44 prefix format from SV-P64, as well.
  45
  46 Potential ways to reduce pressure on the 16 bit space are:
  47
  48 * To use more than one v3.0B Major Opcode, preferably an odd-even
  49   contiguous pair
  50 * To provide "paging".  This involves bank-switching to alternative
  51   optimised encodings for specific workloads
  52 * To enter "16 bit mode" for durations specified at the start
  53 * To reserve one bit of every 16 bit instruction to indicate that the
  54   16 bit mode is to continue to be sustained
  55
  56 This latter would be useful in the Vector context to have an alternative
  57 meaning: as the bit which determines whether the instruction is 11-bit
  58 prefixed or 27-bit prefixed:
  59
  60     0 1 2 3 4 5 6 7 8 9 a b c d e f |
  61     |major op | 11 bit vector prefix|
  62     |16 bit opcode  alt vec. mode ^ |
  63     | extra vector prefix if alt set|
  64
  65 Using a major opcode to enter 16 bit mode leaves 11 bits for which
  66 a use must be found:
  67
  68     0 1 2 3 4 5 6 7 8 9 a b c d e f |
  69     |major op | what to do here   1 |
  70     |16 bit    stay in 16bit mode 1 |
  71     |16 bit    stay in 16bit mode 1 |
  72     |16 bit       exit 16bit mode 0 |
  73
  74 One possibility is that the 11 bits are used for bank selection,
  75 with some room for additional context such as altering the registers
  76 used for the 16 bit operations (bank selection of which scalar regs).
  77 However the downside is that short sequences of Compressed instructions
  78 become penalised by the fixed overhead.  Even a single 16 bit instruction
  79 requires a 16 bit overhead to "gain access" to 16 bit "mode", making
  80 the exercise pointless.
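
The fixed-overhead argument can be checked with a little arithmetic. The sketch below assumes (purely for illustration) that every Compressed instruction replaces exactly one 32-bit instruction and that entering the mode costs one 16-bit slot carrying no useful work:

```python
def bits_saved(n_compressed: int, entry_overhead_bits: int = 16) -> int:
    """Bits saved by a run of n 16-bit instructions versus n 32-bit ones,
    after paying a fixed mode-entry overhead."""
    standard = 32 * n_compressed
    compressed = entry_overhead_bits + 16 * n_compressed
    return standard - compressed


for n in (1, 2, 4, 8):
    print(n, bits_saved(n))
```

Under these assumptions a run of one instruction saves nothing, and the saving only grows by 16 bits for each further instruction in the run.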
  81
  82 An alternative is to use the first 11 bits for only the most commonly
  83 used instructions.  That being the case then one of those 11 bits could
  84 be dedicated to saying if 16 bit mode is to be continued, at which
  85 point *all* 16 bits can be used for Compressed.  10 bits remain for
  86 actual opcodes, which is ridiculously tight, however the opportunity to
  87 subsequently use all 16 bits is worth it.
  88
  89 The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:
  90
  91     |0 1 2 3 4 5 6 7 8 9 a b c d e f|
  92     |major op..0| LO Half C space   |
  93     |major op..1| HI Half C space   |
  94     |N N N N N|<--11 bits C space-->|
  95
  96 If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this
  97 saves gates at a critical part of the decode phase.
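
A minimal sketch of why the contiguous pair helps: one 5-bit comparison identifies both Major Opcodes, and the sixth bit simply becomes the top bit of the 11-bit C space. Bit 0 is the MSB, as in the diagrams; the function name and the default `nnnnn` value (the EXT000/EXT001 pair used later on this page) are illustrative assumptions:

```python
def c_space(hw: int, nnnnn: int = 0b00000):
    """Return the 11-bit C space of a 16-bit halfword, or None if its
    top five bits do not match the shared NNNNN pattern."""
    if (hw >> 11) & 0b11111 != nnnnn:
        return None          # some other v3.0B Major Opcode
    return hw & 0x7FF        # bits 5..f: LO half if bit 5 is 0, HI half if 1


def is_hi_half(space: int) -> bool:
    """True when the C space came from the odd Major Opcode of the pair."""
    return bool((space >> 10) & 1)
```

With a non-contiguous pair, two separate 6-bit comparisons would be needed before the remaining bits could be interpreted.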
  98
  99 ## ABI considerations
 100
 101 Unlike RISC-V RVC, the above "context" encodings require state to be stored
 102 in the PCR, MSR, or a dedicated SPR.  These bits (just like LE/BE 32bit
 103 mode and the IEEE754 FPSCR mode) all require taking that context into
 104 consideration.
 105
 106 In particular it is critically important to recognise that context (in
 107 general) is an implicit part of the ABI implemented for example by glibc.
 108 Therefore (in specific) Compressed Mode Context **must not** be permitted
 109 to cross into or out of a function call.
 110
 111 Thus it is the mandatory responsibility of the compiler to ensure that
 112 context returns to "v3.0B Standard" prior to entering a function call
 113 (responsibility of caller) and prior to exit from a function call
 114 (responsibility of callee).
 115
 116 Trap Handlers also take responsibility for saving and restoring of
 117 Compressed Mode state, just as they already take responsibility for
 118 other critical state.  This makes traps transparent to functions as
 119 far as Compressed Mode Context is concerned, just as traps are already
 120 transparent to functions.
 121
 122 Note however that there are exceptions in a compiler to the otherwise
 123 hard rule that Compressed Mode context not be permitted to cross function
 124 boundaries: inline functions and static functions.  static functions,
 125 if correctly identified as never to be called externally, may, as an
 126 optimisation, disregard standard ABIs, bearing in mind that this will
 127 be fraught (pointers to functions) and not easy to get right.
 128
 129 # Opcode Allocation Ideas
 130
 131 * one bit from the 16-bit mode is used to indicate that standard
 132   (v3.0B) mode is to be dropped into for only one single instruction
 133   <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>
 134
 135 ## Opcodes exploration (Attempt 1)
 136
 137 Switching between different encoding modes is controlled by M (alone)
 138 in 10-bit mode, and M and N in 16-bit mode.
 139
 140 * M in 10-bit mode if zero indicates that following instructions are
 141   standard OpenPOWER ISA 32-bit encoded (including, redundantly,
 142   further 10/16-bit instructions)
 143 * M in 10-bit mode if 1 indicates that following instructions are
 144   in 16-bit encoding mode
 145
 146 Once in 16-bit mode:
 147
 148 * 0b01 (M=1, N=0): stay in 16-bit mode
 149 * 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
 150 * 0b10: leave 16-bit mode for one instruction (return to standard OpenPOWER ISA)
 151 * 0b11: free to be used for something completely different.
 152
 153 The current "top" idea for 0b11 is to use it for a new encoding format
 154 of predominantly "immediates-based" 16-bit instructions (branch-conditional,
 155 addi, mulli etc.)
 156
 157 * The Compressed Major Opcode is in bits 5-7.
 158 * Minor opcode in bit 8.
 159 * In some cases bit 9 is taken as an additional sub-opcode, followed
 160   by bits 0-4 (for CR operations)
 161 * M+N mode-switching is not available for C-Major.minor 0b001.1
 162 * 10 bit mode may be expanded by 16 bit mode, adding capabilities
 163   that do not fit in the extreme limited space.
 164
 165 Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit.
 166 16-bit immediate mode remains in 16-bit.
 167
 168     | 0 | 1234 | 567  8 | 9abcde | f | explanation
 169     | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
 170     | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
 171     | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
 172     | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
 173     | 1 | flds | Cmaj.m | fields | 0 | 16b then 1x v3.0B
 174     | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit
 175
 176 Notes:
 177
 178 * Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
 179 * EXT000 and EXT001 are v3.0B Major Opcodes.  The first 5 bits
 180   are zero, therefore the 6th bit is actually part of Cmaj.
 181 * "10bit then 16bit" means "this instruction is encoded C 10bit
 182   and the following one in C 16bit"
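
The mode-switching table can be read as a small state machine. The Python sketch below is one hedged interpretation of it, not a definitive decoder: the state names are invented here, it assumes that a "1x v3.0B" excursion falls back into 16-bit mode after one standard instruction, and it assumes that for Cmaj.m 0b001.1 (where bit 0 is a field rather than N) only M is consulted.

```python
V30B = "v3.0B"            # standard 32-bit encoding
C16 = "C16"               # 16-bit Compressed encoding
V30B_ONCE = "v3.0B-once"  # one v3.0B instruction, then back to C16


def bit(hw: int, n: int) -> int:
    """Bit n of a 16-bit halfword in MSB0 numbering (bit 0 is the MSB)."""
    return (hw >> (15 - n)) & 1


def next_mode(mode: str, hw: int) -> str:
    """Mode of the instruction following the one whose first 16 bits are hw."""
    m = bit(hw, 15)
    if mode in (V30B, V30B_ONCE):
        if (hw >> 11) == 0:
            # EXT000/EXT001: a 10-bit C instruction, M alone decides.
            return C16 if m else V30B
        # Ordinary 32-bit instruction (assumption: a "once" excursion
        # drops back into 16-bit mode afterwards).
        return C16 if mode == V30B_ONCE else V30B
    # 16-bit mode.  Assumption: for Cmaj.m 0b001.1 bit 0 is not N.
    is_sub = ((hw >> 7) & 0xF) == 0b0011
    n = 0 if is_sub else bit(hw, 0)
    if m:
        return C16  # N,M = 0b01 stays; 0b11 (immediate format) also stays
    return V30B_ONCE if n else V30B
```

For example, `next_mode(V30B, 0x0215)` enters 16-bit mode because the halfword starts with five zero bits and has M=1.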
 183
 184 ### C Instruction Encoding types
 185
 186 10-bit Opcode formats (all start with v3.0B EXT000 or EXT001
 187 Major Opcodes)
 188
 189     | 01234    | 567  8 | 9  | a b | c  | d e | f | enc
 190     | E01      | Cmaj.m | fld1     | fld2     | M | 10b
 191     | E01      | Cmaj.m | offset              | M | 10b b
 192     | E01      | 001.1  | S1 | fd1 | S2 | fd2 | M | 10b sub
 193     | E01      | 111.m  | fld1     | fld2     | M | 10b LDST
 194
 195 16-bit Opcode formats (including 10/16/v3.0B Switching)
 196
 197     | 0 | 1234 | 567  8 | 9  | a b | c  | d e | f | enc
 198     | N | immf | Cmaj.m | fld1     | fld2     | M | 16b
 199     | 1 | immf | Cmaj.m | fld1     | imm      | 1 | 16b imm
 200     | fd3      | 001.1  | S1 | fd1 | S2 | fd2 | M | 16b sub
 201     | N | fd4  | 111.m  | fld1     | fld2     | M | 16b LDST
 202
 203 Notes:
 204
 205 * fld1 and fld2 can contain reg numbers, immediates, or opcode
 206   fields (BO, BI, LK)
 207 * S1 and S2 are further sub-selectors of C 001.1
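
As an illustration of the generic "16b" layout in the table above, the following sketch pulls the fields out of a halfword using the same MSB0 bit numbering. The helper and class names are hypothetical; the sub, imm and LDST layouts would each need their own variant.

```python
from typing import NamedTuple


def field(hw: int, first: int, last: int) -> int:
    """Extract bits first..last (inclusive, MSB0) from a 16-bit halfword."""
    width = last - first + 1
    return (hw >> (15 - last)) & ((1 << width) - 1)


class C16Fields(NamedTuple):
    n: int
    immf: int
    cmaj: int
    cmin: int
    fld1: int
    fld2: int
    m: int


def decode_16b(hw: int) -> C16Fields:
    """Split a halfword according to the generic 16b format row."""
    return C16Fields(
        n=field(hw, 0, 0),
        immf=field(hw, 1, 4),
        cmaj=field(hw, 5, 7),
        cmin=field(hw, 8, 8),
        fld1=field(hw, 9, 11),
        fld2=field(hw, 12, 14),
        m=field(hw, 15, 15),
    )
```

In 10-bit mode the same function can be reused by ignoring `n` and `immf`, since bits 0-4 are then occupied by the EXT000/EXT001 pattern.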
 208
 209 ### Immediate Opcodes
 210
 211 Only available in 16-bit mode, and only when M=1 and N=1
 212 and when Cmaj.min is not 0b001.1.
 213
 214 Instruction counts from objdump on /bin/bash:
 215
 216       466 extsw r1,r1
 217       649 stw r1,1(r1)
 218       691 lwz r1,1(r1)
 219       705 cmpdi r1,1
 220       791 cmpwi r1,1
 221       794 addis r1,r1,1
 222      1474 std r1,1(r1)
 223      1846 li r1,1
 224      2031 mr r1,r1
 225      2473 addi r1,r1,1
 226      3012 nop
 227      3028 ld r1,1(r1)
 228
 229
 230     | 0 | 1  | 2 | 3 4 | | 567.8 | 9ab  | cde | f |
 231     | 1 | 0  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
 232     | 1 | 0  |  sh2    | | 001.0 | RA   | sh  | 1 | sradi.
 233     | 1 | 1  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
 234     | 1 | 1  | 0 | sh2 | | 001.0 | RA   | sh  | 1 | srawi.
 235     | 1 | 1  | 1 |     | | 001.0 | 000  | imm | 1 | TBD
 236     | 1 | 1  | 1 | i2  | | 001.0 | RA!=0| imm | 1 | addis
 237     | 1 |              | | 010.0 | 000  |     | 1 | TBD
 238     | 1 | i2           | | 010.0 | RA!=0| imm | 1 | addi
 239     | 1 | 0  | i2      | | 010.1 | RA   | imm | 1 | cmpdi
 240     | 1 | 1  | i2      | | 010.1 | RA   | imm | 1 | cmpwi
 241     | 1 | 0  | i2      | | 011.0 | RT   | imm | 1 | ldspi
 242     | 1 | 1  | i2      | | 011.0 | RT   | imm | 1 | lwspi
 243     | 1 | 0  | i2      | | 011.1 | RT   | imm | 1 | stwspi
 244     | 1 | 1  | i2      | | 011.1 | RT   | imm | 1 | stdspi
 245     | 1 | i2 | RA      | | 100.0 | RT   | imm | 1 | stwi
 246     | 1 | i2 | RA      | | 100.1 | RT   | imm | 1 | stdi
 247     | 1 | i2 | RT      | | 101.0 | RA   | imm | 1 | ldi
 248     | 1 | i2 | RT      | | 101.1 | RA   | imm | 1 | lwi
 249     | 1 | i2 | RA      | | 110.0 | RT   | imm | 1 | fsti
 250     | 1 | i2 | RA      | | 110.1 | RT   | imm | 1 | fstdi
 251     | 1 | i2 | RT      | | 111.0 | RA   | imm | 1 | flwi
 252     | 1 | i2 | RT      | | 111.1 | RA   | imm | 1 | fldi
 253
 254 Construction of immediate:
 255
 256 * LD/ST r1 (SP) variants should be offset by -256
 257   see <https://bugs.libre-soc.org/show_bug.cgi?id=238#c43>
 258   - SP variants map to e.g. ld RT, imm(r1)
 259   - SV Prefixing can be used to map r1 to alternate regs
 260 * [1] not the same as v3.0B addis: the shift amount is smaller and actually
 261   still maps to within the v3.0B addi immediate range.
 262 * addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
 263 * addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1023 in
 264   increments of 8
 265 * all others are EXTS(i2||imm) to give a 7-bit range -128 to +127
 266   (further for LD/ST due to word/dword-alignment)
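
The EXTS(a||b) notation used above is plain concatenate-then-sign-extend. A minimal sketch, with the field widths left as parameters because they differ from row to row of the table (the function names are illustrative):

```python
def exts(value: int, width: int) -> int:
    """Sign-extend a width-bit unsigned value to a Python int."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def build_imm(i2: int, i2_width: int, imm: int,
              imm_width: int = 3, low_zeros: int = 0) -> int:
    """Compute EXTS(i2 || imm || 0...0) with low_zeros trailing zero bits."""
    raw = ((i2 << imm_width) | imm) << low_zeros
    return exts(raw, i2_width + imm_width + low_zeros)
```

With four bits in total, `build_imm(1, 1, 0b000)` gives -8 and `build_imm(0, 1, 0b111)` gives +7, matching the "-8 to +7" range quoted for a 4-bit immediate.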
 267
 268 Further Notes:
 269
 270 * bc also has an immediate mode, listed separately below in Branch section
 271 * for LD/ST, offset is aligned.  8-byte: i2||imm||0b000 4-byte: 0b00
 272 * SV Prefix over-rides help provide alternative bitwidths for LD/ST
 273 * RA|0 if RA is zero, addi. becomes "li"
 274   - this only works if RT takes part of opcode
 275   - mv is also possible by specifying an immediate of zero
 276
 277 ### Illegal, nop and attn
 278
 279 Note that illeg is all zeros, including in the 16-bit mode.
 280 Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and
 281 EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit
 282 run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000
 283 is "nop"
 284
 285     | 16-bit mode | | 10-bit mode                 |
 286     | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
 287     | - | - | --- | | -----  | ----- | ------ | - |
 288     | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | illeg
 289     | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop
 290
 291 16 bit mode only:
 292
 293     | - | - | --- | | -----  | ----- | ------ | - |
 294     | 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | nop
 295     | 1 | 1   000 | | 000.0  | 0  00 | 0   00 | 0 | attn
 296     | 1 | nonzero | | 000.0  | 0  00 | 0   00 | 0 | TBD
 297
 298 Notes:
 299
 300 * All-zeros being an illegal instruction is normal for ISAs.  Ensuring that
 301   this remains true at all times i.e. for both 10 bit and 16 bit mode is
 302   common sense.
 303 * The 10-bit nop (bit 15, M=1) is intended for circumstances
 304   where alignment to 32-bit before returning to v3.0B is required.
 305   M=1 being an indication "return to Standard v3.0B Encoding Mode".
 306 * The 16-bit nop (bit 0, N=1) is intended for circumstances where a
 307   return to Standard v3.0B Encoding is required for one instruction,
 308   and that instruction must be aligned to a 32-bit boundary.
 309   An example would be a return to "strict" (non-C) mode,
 310   in which the PC must be word-aligned.
 311 * If for any reason multiple 16 bit nops are needed in succession
 312   the M=1 variant can be used, because each one returns to
 313   Standard v3.0B Encoding Mode, each time.
 314
 315 In essence the 2 nops are needed due to there being 2 different C forms:
 316 10 and 16 bit.
 317
 318 ### Branch
 319
 320     | 16-bit mode | | 10-bit mode                 |
 321     | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
 322     | - | - | --- | | -----  | ----- | ------ | - |
 323     | N | offs2   | | 000.LK | offs!=0        | M | b, bl
 324     | 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl
 325     | N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl
 326
 327 16 bit mode:
 328
 329 * bc only available when N,M=0b11
 330 * offs2 extends offset in MSBs
 331 * BI3 extends BI in MSBs to allow selection of full CR
 332 * BO3 extends BO
 333 * bc offset constructed from oo as LSBs and offs2 as MSBs
 334 * bc BI allows selection of all bits from CR0 or CR1
 335 * bc CR check is always active (as if BO0=1) therefore BO1 inverts
 336
 337 10 bit mode:
 338
 339 * illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
 340 * nop also covers part of branch (offs=0,M=0,LK=1)
 341 * bc **not available** in 10-bit mode
 342 * BO[0] enables CR check, BO[1] inverts check
 343 * BI refers to CR0 only (4 bits of)
 344 * no Branch Conditional with immediate
 345 * no Absolute Address
 346 * CTR mode allowed with BO[2] for b only.
 347 * offs is a signed, 2 byte aligned offset
 348 * all branch targets are 2 byte aligned
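
A hedged sketch of how the `b`/`bl` displacement might be assembled from the fields above. It assumes that offs is the 6-bit field in bits 9-e, that offs2 is the 4-bit field in bits 1-4 supplying the MSBs in 16-bit mode, and that the displacement is relative to the address of the branch itself; none of these details are confirmed beyond the notes in this section.

```python
def exts(value: int, width: int) -> int:
    """Sign-extend a width-bit unsigned value to a Python int."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def b_target(pc: int, offs: int, offs2=None) -> int:
    """Target of a compressed b/bl.

    offs2 is None in 10-bit mode, where only the 6-bit offs exists.
    """
    if offs2 is None:
        disp = exts(offs, 6)
    else:
        disp = exts((offs2 << 6) | offs, 10)  # offs2 supplies the MSBs
    return pc + 2 * disp  # displacement counts 2-byte units
```

Under these assumptions the 10-bit form reaches -64 to +62 bytes and the 16-bit form -1024 to +1022 bytes.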
 349
 350 ### LD/ST
 351
 352 Note: for 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
 353
 354     | 16-bit mode    | | 10-bit mode             |
 355     | 0   | 1  | 234 | | 567.8 | 9 a b | c d e | f |
 356     | --- | -- | --- | | ----- | ----- | ----- | - |
 357     | RA2 | SZ |  RB | | 001.1 | 1  RA | 0  RT | M | st
 358     | RA2 | SZ |  RB | | 001.1 | 1  RA | 1  RT | M | fst
 359     | N   | SZ |  RT | | 111.0 |  RA   |  RB   | M | ld
 360     | N   | SZ |  RT | | 111.1 |  RA   |  RB   | M | fld
 361
 362 * elwidth overrides can set different widths
 363
 364 16 bit mode:
 365
 366 * SZ=1 is 64 bit, SZ=0 is 32 bit
 367 * RA2 extends RA to 3 bits (MSB)
 368 * RT2 extends RT to 3 bits (MSB)
 369
 370 10 bit mode:
 371
 372 * RA and RB are only 2 bit (0-3)
 373 * for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
 374 * for ST, there is no offset: "st RT, RA(0)"
 375
 376 ### Arithmetic
 377
 378 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
 379 * 16-bit: note that bit 1==0 (sub-sub-encoding)
 380
 381 10 and 16 bit:
 382
 383     | 16-bit mode | | 10-bit mode             |
 384     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 385     | - | - | --- | | ----- | --- | ----- | - |
 386     | N | 0 | RT  | | 010.0 | RB  | RA!=0 | M | add
 387     | N | 0 | RT  | | 010.1 | RB  | RA|0  | M | sub.
 388     | N | 0 | BF  | | 011.0 | RB  | RA|0  | M | cmpl
 389
 390 Notes:
 391
 392 * sub. and cmpl: default CR target is CR0
 393 * for (RA|0) when RA=0 the input is a zero immediate,
 394   meaning that sub. becomes neg. and cmp becomes cmpi against zero
 395 * RT is implicitly RB: "add RT(=RB), RA, RB"
 396 * Opcode 0b010.0 RA=0 is not missing from the above:
 397   it is a system-wide instruction, "cbank" (section below)
 398
 399 16 bit mode only:
 400
 401     | 0 | 1 | 234 | | 567.8 | 9ab | cde   | f |
 402     | - | - | --- | | ----- | --- | ----- | - |
 403     | N | 1 | RA  | | 010.0 | RB  | RS    | 0 | sld.
 404     | N | 1 | RA  | | 010.1 | RB  | RS!=0 | 0 | srd.
 405     | N | 1 | RA  | | 010.1 | RB  | 000   | 0 | srad.
 406     | N | 1 | BF  | | 011.0 | RB  | RA|0  | 0 | cmpw
 407
 408 Notes:
 409
 410 * for srad, RS=RA: "srad. RA(=RS), RS, RB"
 411
 412 ### Logical
 413
 414 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
 415 * 16-bit: note that bit 1==0 (sub-sub-encoding)
 416
 417 10 and 16 bit:
 418
 419     | 16-bit mode | | 10-bit mode             |
 420     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 421     | - | - | --- | | ----- | --- | ----- | - |
 422     | N | 0 |  RT | | 100.0 | RB  | RA!=0 | M | and
 423     | N | 0 |  RT | | 100.1 | RB  | RA!=0 | M | nand
 424     | N | 0 |  RT | | 101.0 | RB  | RA!=0 | M | or
 425     | N | 0 |  RT | | 101.1 | RB  | RA!=0 | M | nor/mr
 426     | N | 0 |  RT | | 100.0 | RB  | 0 0 0 | M | extsw
 427     | N | 0 |  RT | | 100.1 | RB  | 0 0 0 | M | cntlz
 428     | N | 0 |  RT | | 101.0 | RB  | 0 0 0 | M | popcnt
 429     | N | 0 |  RT | | 101.1 | RB  | 0 0 0 | M | not
 430
 431 16-bit mode only (note that bit 1 == 1):
 432
 433     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 434     | - | - | --- | | ----- | --- | ----- | - |
 435     | N | 1 |  RT | | 100.0 | RB  | RA!=0 | 0 | TBD
 436     | N | 1 |  RT | | 100.1 | RB  | RA!=0 | 0 | TBD
 437     | N | 1 |  RT | | 101.0 | RB  | RA!=0 | 0 | xor
 438     | N | 1 |  RT | | 101.1 | RB  | RA!=0 | 0 | eqv (xnor)
 439     | N | 1 |  RT | | 100.0 | RB  | 0 0 0 | 0 | extsb
 440     | N | 1 |  RT | | 100.1 | RB  | 0 0 0 | 0 | cnttz
 441     | N | 1 |  RT | | 101.0 | RB  | 0 0 0 | 0 | TBD
 442     | N | 1 |  RT | | 101.1 | RB  | 0 0 0 | 0 | extsh
 443
 444 10 bit mode:
 445
 446 * idea: for 10bit mode, nor is actually 'mr' because mr is
 447   a more common operation.  in 16bit however, this encoding
 448   (Cmaj.min=0b101.1, N=0) is 'nor'
 449 * for (RA|0) when RA=0 the input is a zero immediate,
 450   meaning that nor becomes not
 451 * cntlz, popcnt, exts **not available** in 10-bit mode
 452 * RT is implicitly RB: "and RT(=RB), RA, RB"
 453
 454 ### Floating Point
 455
 456 Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64
 457
 458 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
 459 * 16-bit: note that bit 1==0 (sub-sub-encoding)
 460
 461 10 and 16 bit:
 462
 463     | 16-bit mode | | 10-bit mode             |
 464     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 465     | - | - | --- | | ----- | --- | ----- | - |
 466     | N |   |  RT | | 011.1 | RB  | RA!=0 | M | fsub.
 467     | N | 0 |  RT | | 110.0 | RB  | RA!=0 | M | fadd
 468     | N | 0 |  RT | | 110.1 | RB  | RA!=0 | M | fmul
 469     | N | 0 |  RT | | 011.1 | RB  | 0 0 0 | M | fneg.
 470     | N | 0 |  RT | | 110.0 | RB  | 0 0 0 | M |
 471     | N | 0 |  RT | | 110.1 | RB  | 0 0 0 | M |
 472
 473 16-bit mode only (note that bit 1 == 1):
 474
 475     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 476     | - | - | --- | | ----- | --- | ----- | - |
 477     | N | 1 |  RT | | 011.1 | RB  | RA!=0 | 0 |
 478     | N | 1 |  RT | | 110.0 | RB  | RA!=0 | 0 |
 479     | N | 1 |  RT | | 110.1 | RB  | RA!=0 | 0 | fdiv
 480     | N | 1 |  RT | | 011.1 | RB  | 0 0 0 | 0 | fabs.
 481     | N | 1 |  RT | | 110.0 | RB  | 0 0 0 | 0 | fmr.
 482     | N | 1 |  RT | | 110.1 | RB  | 0 0 0 | 0 |
 483
 484 16 bit only, FP to INT convert (using C 0b001.1 subencoding)
 485
 486     | 0123 | 4 | | 567.8 | 9 ab | cde  | f |
 487     | ---- | - | | ----- | ---- | ---- | - |
 488     | 0010 | X | | 001.1 | 0 RA | Y RT | M | fp2int
 489     | 0011 | X | | 001.1 | 0 RA | Y RT | M | int2fp
 490
 491 * X: signed=1, unsigned=0
 492 * Y: FP32=0, FP64=1
 493
 494 10 bit mode:
 495
 496 * fsub. fneg. and fmr. default target is CR1
 497 * fmr. is **not available** in 10-bit mode
 498 * fdiv is **not available** in 10-bit mode
 499
 500 16 bit mode:
 501
 502 * fmr. copies RB to RT (and sets CR1)
 503
 504 ### Condition Register
 505
 506     | 16-bit mode   | | 10-bit mode            |
 507     | 0 1 2 3 | 4   | | 567.8 | 9 ab | cde | f |
 508     | ------- | --- | | ----- | ---- | --- | - |
 509     | 0 0 0 0 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf
 510     | 0 0 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | crnor
 511     | 0 1 0 0 | BA2 | | 001.1 | 0 BA | BB  | M | crandc
 512     | 0 1 1 0 | BA2 | | 001.1 | 0 BA | BB  | M | crxor
 513     | 0 1 1 1 | BA2 | | 001.1 | 0 BA | BB  | M | crnand
 514     | 1 0 0 0 | BA2 | | 001.1 | 0 BA | BB  | M | crand
 515     | 1 0 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | creqv
 516     | 1 1 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | crorc
 517     | 1 1 1 0 | BA2 | | 001.1 | 0 BA | BB  | M | cror
 518
 519 10 bit mode:
 520
 521 * mcrf BF is only 2 bits which means the destination is only CR0-CR3
 522 * CR operations: **not available** in 10-bit mode (but mcrf is)
 523
 524 16 bit mode:
 525
 526 * mcrf BF2 extends BF (in MSB) to 3 bits
 527 * CR operations: destination register is same as BA.
 528 * CR operations: only possible on CR0 and CR1
 529
 530 SV (Vector Mode):
 531
 532 * CR operations: greatly extended reach/range (useful for predicates)
 533
 534 ### System
 535
 536 cbank: Selection of Compressed-encoding "Bank".  Different "banks"
 537 give different meanings to opcodes.  Example: CBank=0b001 is heavily
 538 optimised for Audio/Video Encode/Decode.  cbank borrows from add's encoding
 539 space (when RA==0)
 540
 541     | 16-bit mode | | 10-bit mode             |
 542     | 0 | 1 2 3 4 | | 567.8 | 9ab   | cde | f |
 543     | - | ------- | | ----- | ----- | --- | - |
 544     | N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank
 545
 546 **not available** in 10-bit mode, **only** in 16-bit mode:
 547
 548     | 0 1 2 3 | 4 | | 567.8 | 9 ab | cde  | f |
 549     | ------- | - | | ----- | ---- | ---- | - |
 550     | 1 1 1 1 | 0 | | 001.1 | 0 00 |  RT  | M | mtlr
 551     | 1 1 1 1 | 0 | | 001.1 | 0 01 |  RT  | M | mtctr
 552     | 1 1 1 1 | 0 | | 001.1 | 0 11 |  RT  | M | mtcr
 553     | 1 1 1 1 | 1 | | 001.1 | 0 00 |  RA  | M | mflr
 554     | 1 1 1 1 | 1 | | 001.1 | 0 01 |  RA  | M | mfctr
 555     | 1 1 1 1 | 1 | | 001.1 | 0 11 |  RA  | M | mfcr
 556
 557 ### Unallocated
 558
 559     | 0 1 2 3 | 4 | | 567.8 | 9 ab | cde  | f |
 560     | ------- | - | | ----- | ---- | ---- | - |
 561     | 0 1 0 1 |   | | 001.1 | 0    |      | M |
 562     | 1 0 1 0 |   | | 001.1 | 0    |      | M |
 563     | 1 0 1 1 |   | | 001.1 | 0    |      | M |
 564     | 1 1 0 0 |   | | 001.1 | 0    |      | M |
 565     | 1 1 1 1 |   | | 001.1 | 0 10 |      | M |
 566
 567 ## Other ideas (Attempt 2)
 568
 569 ### 8-bit mode-switching instructions, odd addresses for C mode
 570
 571 Drop the complexity of a 16-bit encoding that is further reduced to 10-bit,
 572 and use a single byte instead of two to switch between modes.  This
 573 would place compressed (C) mode instructions at odd bytes, so the LSB
 574 of the PC can be used for the processor to tell which mode it is in.
 575
 576 To switch from traditional to compressed mode, the single-byte
 577 instruction would be at the MSByte, that holds the EXT bits.  (When we
 578 break up a 32-bit instruction across words, the most significant half
 579 should go in the word with the lower address.)
 580
 581 To switch from compressed mode to traditional mode, the single-byte
 582 instruction would also be at the opcode/format portion, placed in the
 583 lower-address word if split across words, so that the instruction can
 584 be recognized as the mode-switching one without going for its second
 585 byte.
 586
 587 The C-mode nop should be encoded so that its second byte encodes a
 588 switch to compressed mode, if decoded in traditional mode.  This
 589 enables such a nop to straddle across a label:
 590
 591     8-bit first half of nop
 592     Label:
 593     8-bit second half of nop AKA switch to compressed mode
 594     16-bit insns...
 595
 596 so that if traditional code jumps to the word-aligned label (because
 597 traditional branches drop the 2 LSB), it immediately switches to
 598 compressed mode; if we fall-through, we remain in 16-bit mode; and if
 599 we branch to it from compressed mode, whether we jump to the odd or
 600 the even address, we end up in compressed mode as desired.
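
The label-straddling trick can be illustrated with two tiny helpers. This is only a sketch of the idea as described: the addresses are made up, and the function names are not part of any proposal.

```python
def mode_from_pc(pc: int) -> str:
    """Odd byte addresses mean compressed (C) mode, even ones traditional."""
    return "C" if pc & 1 else "v3.0B"


def traditional_branch_target(addr: int) -> int:
    """Traditional branches drop the two LSBs of the target address."""
    return addr & ~0b11


# Hypothetical layout: the nop's first byte sits at 0x0FFF, the label (and
# the nop's second byte, doubling as "switch to compressed") at 0x1000,
# and the 16-bit insns start at 0x1001.
label = 0x1000
print(mode_from_pc(label - 1))                         # fall-through path
print(mode_from_pc(traditional_branch_target(label)))  # traditional jump
print(mode_from_pc(label + 1))                         # after the switch byte
```

A traditional jump lands on the even address, decodes the switch byte in v3.0B mode, and then continues at the odd address in C mode, which is where the fall-through path arrives as well.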
 601
 602 Tables explaining encoding:
 603
 604     | byte 0 | byte 1 | byte 2 | byte 3 |
 605     | v3.0B standard 32 bit instruction |
 606     | EXT000 | 16 bit          | 16...  |
 607     | .. bit | 8nop   | v3.0b stand...  |
 608     | .. ard 32 bit   | EXT000 | 16...  |
 609     | .. bit | 16 bit          | 8nop   |
 610     | v3.0B standard 32 bit instruction |
 611
 612
 613 # TODO
 614
 615 * make a preliminary assessment of branch in/out viability
 616 * confirm FSM encoding (is LSB of PC really enough?)
 617 * guesstimate opcode and register allocation (without necessarily doing
 618   a full encoding)
 619 * write throwaway python program that estimates compression ratio from
 620   objdump raw parsing
 621 * finally do full opcode allocation
 622 * rerun objdump compression ratio estimates
 623
 624 ### Use 2- rather than 3-register opcodes
 625
 626 Successful compact ISAs have used 2- rather than 3-register insns, in
 627 which the same register serves as input and output.  Some 20% of
 628 general-purpose 3-register insns already use either input register as
 629 output, without any effort by the compiler to do so.
 630
 631 Repurposing the 3 bits used to encode one of the input registers
 632 in arithmetic, logical and floating-point insns, and the 2 bits
 633 used to encode the mode of the next two insns, we could make the full
 634 register files available to the opcodes already selected for
 635 compressed mode, with one bit to spare to bring additional opcodes in.
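
The bit budget behind that claim works out as follows, assuming 32-entry register files (5-bit register numbers) and the 3-bit register fields of Attempt 1:

```python
REG_BITS_FULL = 5     # assumption: 32 registers in the full file
REG_BITS_COMPACT = 3  # register field width used in Attempt 1

freed = 3 + 2         # one dropped input-register field, plus M and N
needed = 2 * (REG_BITS_FULL - REG_BITS_COMPACT)  # widen both remaining fields
spare = freed - needed

print(freed, needed, spare)
```

Five bits are freed, four are consumed by widening the two remaining register fields, and one is left over for additional opcodes.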
 636
 637 An opcode could be assigned to an instruction that combines and
 638 extends with the subsequent instruction, providing it with a separate
 639 input operand to use rather than the output register, or with
 640 additional range for immediate and offset operands, effectively
 641 forming a 32-bit operation, enabling us to remain in compressed mode
 642 even longer.
 643
 644 # Appendix
 645
 646 ## Analysis techniques and tools
 647
 648     objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
 649       s/^[ x0-9A-Fa-f]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
 650       sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
 651       s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
 652       sort -n | less
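
As a possible starting point for the throwaway Python program mentioned in the TODO list, here is a sketch that mirrors the normalisation done by the pipeline above: every register other than r0 becomes r1 and every number becomes 1. The function names are illustrative, and the address pattern accepts both upper-case and lower-case hex digits.

```python
import re
from collections import Counter

INSN_RE = re.compile(r"^[ x0-9A-Fa-f]*: *([a-z.]+) *(.*)")
REG_RE = re.compile(r"([, (])r[1-9][0-9]*")
NUM_RE = re.compile(r"([ ,])-*[0-9]+([^0-9])")


def normalise(line: str):
    """Return the normalised instruction text, or None for non-insn lines."""
    line = line.rstrip("\n").replace("\t", " ")
    match = INSN_RE.match(line)
    if match is None:
        return None
    # The trailing space lets NUM_RE match a number at the end of the line.
    text = f"{match.group(1)} {match.group(2)} "
    text = REG_RE.sub(r"\1r1", text)      # r1..r31 all become r1
    text = NUM_RE.sub(r"\g<1>1\2", text)  # every literal becomes 1
    return text.strip()


def histogram(lines) -> Counter:
    """Count normalised instructions across objdump -d output lines."""
    return Counter(n for n in map(normalise, lines) if n is not None)
```

Feeding it the output of `objdump -d --no-show-raw-insn` and calling `histogram(...).most_common()` should give a table comparable to the counts listed earlier.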
 653
 654 ## gcc register allocation
 655
 656 FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about
 657 fixed registers (assigned to special purposes) and register allocation
 658 order:
 659
 660 Special-purpose registers on ppc are:
 661
 662     r0: constant zero/throw-away
 663     r1: stack pointer
 664     r2: thread-local storage pointer in 32-bit mode
 665     r2: non-minimal TOC register
 666     r10: EH return stack adjust register
 667     r11: static chain pointer
 668     r13: thread-local storage pointer in 64-bit mode
 669     r30: minimal-TOC/-fPIC/-fpic base register
 670     r31: frame pointer
 671     lr: return address register
 672
 673 the register allocation order in GCC (i.e., it takes the earliest
 674 available register that fits the constraints) is:
 675
 676     We allocate in the following order:
 677
 678         fp0             (not saved or used for anything)
 679         fp13 - fp2      (not saved; incoming fp arg registers)
 680         fp1             (not saved; return value)
 681         fp31 - fp14     (saved; order given to save least number)
 682         cr7, cr5        (not saved or special)
 683         cr6             (not saved, but used for vector operations)
 684         cr1             (not saved, but used for FP operations)
 685         cr0             (not saved, but used for arithmetic operations)
 686         cr4, cr3, cr2   (saved)
 687         r9              (not saved; best for TImode)
 688         r10, r8-r4      (not saved; highest first for less conflict with params)
 689         r3              (not saved; return value register)
 690         r11             (not saved; later alloc to help shrink-wrap)
 691         r0              (not saved; cannot be base reg)
 692         r31 - r13       (saved; order given to save least number)
 693         r12             (not saved; if used for DImode or DFmode would use r13)
 694         ctr             (not saved; when we have the choice ctr is better)
 695         lr              (saved)
 696         r1, r2, ap, ca  (fixed)
 697         v0 - v1         (not saved or used for anything)
 698         v13 - v3        (not saved; incoming vector arg registers)
 699         v2              (not saved; incoming vector arg reg; return value)
 700         v19 - v14       (not saved or used for anything)
 701         v31 - v20       (saved; order given to save least number)
 702         vrsave, vscr    (fixed)
 703         sfp             (fixed)
 704
 705 ## Comparison to VLE
 706
 707 VLE was a means to reduce executable size through three interleaved methods:
 708
 709 * (1) invention of 16 bit encodings (of exactly 16 bit in length)
 710 * (2) invention of 16+16 bit encodings (a 16 bit instruction format but with
 711   an *additional* 16 bit immediate "tacked on" to the end, actually
 712   making a 32-bit instruction format)
 713 * (3) seamless and transparent embedding and intermingling of the
 714   above in amongst arbitrary v2.06/7 BE 32 bit instruction sequences,
 715   with no additional state,
 716   including when the PC was not aligned on a 4-byte boundary.
 717
 718 Whilst (1) and (3) make perfect sense, (2) makes no sense at all given that, as inspection of "ori" and others shows, D-Form 16 bit immediates are the "norm" for v2.06/7 and v3.0B standard instructions.  (2) in effect **is** a 32 bit instruction.  (2) **is not** a 16 bit instruction.
 719
 720 *Why "reinvent" an encoding that is 32 bit, when there already exists a 32 bit encoding that does the exact same job?*
 721
 722 Consequently, we do **not** envisage a scenario where (2) would ever be implemented, nor in the future would this Compressed Encoding be extended beyond 16 bit.  Compressed is Compressed and is **by definition** limited to precisely - and only - 16 bit.
 723
 724 An additional reason is that VLE is exceptionally complex to implement.  For the single-issue, low clock rate "Embedded" environment for which it was originally designed, VLE was perfectly well matched.
 725
 726 However this Compressed Encoding is designed for High performance multi-issue systems *as well* as Embedded scenarios, and consequently, the complexity of "deep packet inspection" down into the depths of a 16 bit sequence in order to ascertain if it might not be 16 bit after all, is wholly unacceptable.
 727
 728 By eliminating the 16+16 (actually, 32bit conflation) tricks outlined in (2), Compressed is *specifically* designed to fit into a very small FSM, suitable for multi-issue, that in no way requires "deep-dive" analysis. Yet, despite OpenPOWER never having been designed with 16 bit encodings in mind, Compressed is still suitable for retro-fitting onto it.
 729
 730 ## Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode
 731
 732 [[demo]]
 733
 734 ### Efficient Decoding Algorithm
 735
 736 [[decoding]]