openpower/sv/16_bit_compressed.mdwn

   1 # 16 bit Compressed
   2
Similar to VLE (but without immediate-prefixing), this encoding is designed
to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR
is recommended). Note that Compressed is *mutually exclusive*
with OpenPOWER v3.1B "prefixing" because it uses (requires) both EXT000
and EXT001. Hypothetically it could be made to use anything other than
EXT001, with some inconvenience (extra gates).  The incompatibility is
"fixed" by swapping out of "Compressed" Mode and back into "Normal"
(v3.1B) Mode, at runtime, as needed.
  11
Although initially intended to be augmented by Simple-V Prefixing (to
add Vector context, width overrides, e.g. IEEE754 FP16, and predication)
without putting pressure on I-Cache power or size, this Compressed
Encoding is not critically dependent on SV Prefixing, and may be used
stand-alone.
  16
  17 See:
  18
  19 * <https://bugs.libre-soc.org/show_bug.cgi?id=238>
  20 * <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding
  21 * <http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html>
  22
  23 This one is a conundrum.  OpenPOWER ISA was never designed with 16
  24 bit in mind.  VLE was added 10 years ago but only by way of marking
  25 an entire 64k page as "VLE".  With VLE not maintained it is not
  26 fully compatible with current PowerISA.
  27
Here, in order to embed 16 bit into a predominantly 32 bit stream, using
an entire 16 bits just to switch into Compressed mode is itself a
significant overhead.  The situation is made worse by OpenPOWER ISA
being fundamentally designed with 6 bits uniformly taking up Major
Opcode space, leaving only 10 bits to allocate to actual instructions.
  34
  35 Contrast this with RVC which takes 3 out of 4
  36 combinations of the first 2 bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing standard 32 bit and 16 bit to intermingle cleanly.  To achieve the same thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which is clearly impractical: other schemes need to be devised.
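For comparison, the RVC length rule described above can be sketched in a few lines. This is an illustration only: the function name `rvc_insn_length` is made up for this page, and RISC-V encodings longer than 32 bits are deliberately ignored.

```python
def rvc_insn_length(first_halfword: int) -> int:
    """Return the instruction length in bytes, given its first 16 bits.

    Hypothetical helper for illustration; encodings longer than 32 bits
    are not handled.
    """
    low2 = first_halfword & 0b11
    # 0b00, 0b01 and 0b10 select a 16-bit (compressed) instruction;
    # 0b11 is the escape to the standard 32-bit encoding.
    return 4 if low2 == 0b11 else 2
```

No mode state is needed: the length is decided by the instruction itself, which is exactly what a uniform 6-bit Major Opcode field makes expensive on OpenPOWER.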
  37
  38 In addition we would like to add SV-C32 which is a Vectorised version
  39 of 16 bit Compressed, and ideally have a variant that adds the 27-bit
  40 prefix format from SV-P64, as well.
  41
  42 Potential ways to reduce pressure on the 16 bit space are:
  43
  44 * To use more than one v3.0B Major Opcode, preferably an odd-even
  45   contiguous pair
  46 * To provide "paging".  This involves bank-switching to alternative optimised encodings for specific workloads
  47 * To enter "16 bit mode" for durations specified at the start
  48 * To reserve one bit of every 16 bit instruction to indicate that the 16 bit mode is to continue to be sustained
  49
  50 This latter would be useful in the Vector context to have an alternative
  51 meaning: as the bit which determines whether the instruction is 11-bit
  52 prefixed or 27-bit prefixed:
  53
  54     0 1 2 3 4 5 6 7 8 9 a b c d e f |
  55     |major op | 11 bit vector prefix|
  56     |16 bit opcode  alt vec. mode ^ |
  57     | extra vector prefix if alt set|
  58
Using a major opcode to enter 16 bit mode leaves 11 bits for which a
use must be found:
  61
  62     0 1 2 3 4 5 6 7 8 9 a b c d e f |
  63     |major op | what to do here   1 |
  64     |16 bit    stay in 16bit mode 1 |
  65     |16 bit    stay in 16bit mode 1 |
  66     |16 bit       exit 16bit mode 0 |
  67
  68 One possibility is that the 11 bits are used for bank selection, with
  69 some room for additional context such as altering the registers used
  70 for the 16 bit operations (bank selection of which scalar regs).
  71 However the downside is that short sequences of Compressed instructions
  72 become penalised by the fixed overhead.  Even a single 16 bit instruction requires a 16 bit overhead to "gain access" to 16 bit "mode", making the exercise pointless.
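The fixed-overhead argument can be put into numbers with a small sketch. It assumes (this is an assumption, not something established above) that each 16 bit instruction replaces exactly one 32 bit v3.0B instruction, and that entering 16 bit mode costs one 16 bit halfword. The function names are illustrative only.

```python
def run_size_bits(n_insns: int, entry_overhead_bits: int = 16) -> int:
    """Size of a run of n_insns 16 bit instructions, including the entry cost."""
    return entry_overhead_bits + 16 * n_insns


def size_ratio(n_insns: int) -> float:
    """Compressed size divided by the size of the same run in 32 bit encoding.

    Assumes one 16 bit instruction replaces one 32 bit instruction.
    """
    return run_size_bits(n_insns) / (32 * n_insns)
```

With these assumptions a run of one instruction gives a ratio of 1.0 (no saving at all), and the ratio only approaches 0.5 for long runs.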
  73
An alternative is to use the first 11 bits for only the most commonly used
instructions.  That being the case, one of those 11 bits could
be dedicated to saying whether 16 bit mode is to be continued, at which
point *all* 16 bits can be used for Compressed.
10 bits remain for actual opcodes, which is ridiculously tight;
however the opportunity to subsequently use all 16 bits is worth it.
  80
  81 The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:
  82
  83     |0 1 2 3 4 5 6 7 8 9 a b c d e f|
  84     |major op..0| LO Half C space   |
  85     |major op..1| HI Half C space   |
  86     |N N N N N|<--11 bits C space-->|
  87
  88 If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this saves gates at a critical part of the decode phase.
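A minimal sketch of why the contiguous pair helps: the decoder only has to compare the top 5 bits against one constant, and the remaining 11 bits (including the bit that distinguishes the two Major Opcodes) fall straight through as Compressed opcode space. The function name `c_space` and the `pair_prefix` parameter are hypothetical.

```python
def c_space(halfword: int, pair_prefix: int = 0b00000):
    """Return the 11 bit C space of a halfword, or None if it is not Compressed.

    pair_prefix is the shared NNNNN value of the two contiguous Major
    Opcodes (0b00000 for EXT000/EXT001).  Hypothetical helper.
    """
    top5 = (halfword >> 11) & 0b11111
    if top5 != pair_prefix:
        return None
    # Bit 5 (MSB0), the LSB of the Major Opcode, selects LO or HI half.
    return halfword & 0x7FF
```

For example `c_space(0x0400)` returns `0x400`, i.e. the first entry of the HI half, whereas `c_space(0x0800)` returns `None` because the top 5 bits differ.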
  89
  90 ## ABI considerations
  91
Unlike RVC, the above "context" encodings require state, to be stored in the PCR, MSR, or a dedicated SPR.  These bits (just like LE/BE 32bit mode and the IEEE754 FPSCR mode) are all context that must be taken into consideration.
  93
  94 In particular it is critically important to recognise that context (in general) is an implicit part of the ABI implemented for example by glibc6.  Therefore (in specific) Compressed Mode Context **must not** be permitted to cross into or out of a function call.
  95
  96 Thus it is the mandatory responsibility of the compiler to ensure that context returns to "v3.0B Standard" prior to entering a function call (responsibility of caller) and prior to exit from a function call (responsibility of callee).
  97
  98 Trap Handlers also take responsibility for saving and restoring of Compressed Mode state, just as they already take responsibility for other critical state.  This makes traps transparent to functions as far as Compressed Mode Context is concerned, just as traps are already transparent to functions.
  99
Note however that there are exceptions in a compiler to the otherwise hard rule that Compressed Mode context not be permitted to cross function boundaries: inline functions and static functions.  Static functions, if correctly identified as never to be called externally, may, as an optimisation, disregard standard ABIs, bearing in mind that this will be fraught (pointers to functions) and not easy to get right.
 101
 102 # Opcode Allocation Ideas
 103
 104 * one bit from the 16-bit mode is used to indicate that standard
 105   (v3.0B) mode is to be dropped into for only one single instruction
 106   <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>
 107
 108 ## Opcodes exploration (Attempt 1)
 109
 110 Switching between different encoding modes is controlled by M (alone)
 111 in 10-bit mode, and M and N in 16-bit mode.
 112
 113 * M in 10-bit mode if zero indicates that following instructions are
 114   standard OpenPOWER ISA 32-bit encoded (including, redundantly,
 115   further 10/16-bit instructions)
 116 * M in 10-bit mode if 1 indicates that following instructions are
 117   in 16-bit encoding mode
 118
 119 Once in 16-bit mode:
 120
 121 * 0b01 (M=1, N=0): stay in 16-bit mode
 122 * 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
 123 * 0b10: leave 16-bit mode for one cycle (return to standard OpenPOWER ISA)
 124 * 0b11: free to be used for something completely different.
 125
 126 The current "top" idea for 0b11 is to use it for a new encoding format
 127 of predominantly "immediates-based" 16-bit instructions (branch-conditional,
 128 addi, mulli etc.)
 129
 130 * The Compressed Major Opcode is in bits 5-7.
 131 * Minor opcode in bit 8.
 132 * In some cases bit 9 is taken as an additional sub-opcode, followed
 133   by bits 0-4 (for CR operations)
 134 * M+N mode-switching is not available for C-Major.minor 0b001.1
* 10 bit mode may be expanded by 16 bit mode, adding capabilities
  that do not fit in the extremely limited space.
 137
 138 Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit.
 139 16-bit immediate mode remains in 16-bit.
 140
 141     | 0 | 1234 | 567  8 | 9abcde | f | explanation
 142     | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
 143     | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
 144     | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
 145     | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
 146     | 1 | flds | Cmaj.m | fields | 0 | 16b then 1x v3.0B
 147     | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit
 148
 149 Notes:
 150
 151 * Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
 152 * EXT000 and EXT001 are v3.0B Major Opcodes.  The first 5 bits
 153   are zero, therefore the 6th bit is actually part of Cmaj.
 154 * "10bit then 16bit" means "this instruction is encoded C 10bit
 155   and the following one in C 16bit"
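The mode-switching table above can be sketched as a small state machine. This is one reading of the table, not a definitive decoder: it ignores the C 0b001.1 sub-encoding (where bit 0 is not N), it treats "1x v3.0B" as a separate state that returns to 16-bit mode after one instruction, and all names (`Mode`, `bit`, `next_mode`) are invented for the example.

```python
from enum import Enum


class Mode(Enum):
    V30B = "v3.0B"          # standard 32-bit encoding
    C16 = "C 16bit"         # 16-bit Compressed
    V30B_ONCE = "1x v3.0B"  # one 32-bit instruction, then back to C16


def bit(halfword: int, n: int) -> int:
    """MSB0 bit n of a 16-bit halfword (bit 0 is the MSB, bit 0xf the LSB)."""
    return (halfword >> (15 - n)) & 1


def next_mode(mode: Mode, halfword: int = 0) -> Mode:
    """Mode of the instruction following the current one (illustrative)."""
    if mode is Mode.V30B_ONCE:
        return Mode.C16
    if mode is Mode.V30B:
        if (halfword >> 11) != 0:
            return Mode.V30B            # not EXT000/EXT001: ordinary v3.0B
        # C 10bit: M alone decides what follows.
        return Mode.C16 if bit(halfword, 0xF) else Mode.V30B
    # C 16bit: N is bit 0, M is bit 0xf.
    n, m = bit(halfword, 0), bit(halfword, 0xF)
    if m:
        return Mode.C16                 # N=1,M=1 is the immediate form
    return Mode.V30B_ONCE if n else Mode.V30B
```

In `Mode.V30B` the `halfword` argument is the first (most significant) half of the 32-bit word being decoded.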
 156
 157 ### C Instruction Encoding types
 158
 159 10-bit Opcode formats (all start with v3.0B EXT000 or EXT001
 160 Major Opcodes)
 161
 162     | 01234    | 567  8 | 9  | a b | c  | d e | f | enc
 163     | E01      | Cmaj.m | fld1     | fld2     | M | 10b
 164     | E01      | Cmaj.m | offset              | M | 10b b
 165     | E01      | 001.1  | S1 | fd1 | S2 | fd2 | M | 10b sub
 166     | E01      | 111.m  | fld1     | fld2     | M | 10b LDST
 167
 168 16-bit Opcode formats (including 10/16/v3.0B Switching)
 169
 170     | 0 | 1234 | 567  8 | 9  | a b | c  | d e | f | enc
 171     | N | immf | Cmaj.m | fld1     | fld2     | M | 16b
 172     | 1 | immf | Cmaj.m | fld1     | imm      | 1 | 16b imm
 173     | fd3      | 001.1  | S1 | fd1 | S2 | fd2 | M | 16b sub
 174     | N | fd4  | 111.m  | fld1     | fld2     | M | 16b LDST
 175
 176 Notes:
 177
 178 * fld1 and fld2 can contain reg numbers, immediates, or opcode
 179   fields (BO, BI, LK)
 180 * S1 and S2 are further sub-selectors of C 001.1
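As an illustration of the generic "16b" row above, the fields can be pulled out of a halfword using MSB0 bit numbering. The helper names `field`, `C16Fields` and `decode_16b` are hypothetical, and the sub-encoded forms (0b001.1, LDST) are not handled.

```python
from typing import NamedTuple


class C16Fields(NamedTuple):
    N: int
    immf: int
    cmaj: int
    cmin: int
    fld1: int
    fld2: int
    M: int


def field(halfword: int, start: int, end: int) -> int:
    """Extract MSB0 bits start..end (inclusive) from a 16-bit halfword."""
    width = end - start + 1
    return (halfword >> (15 - end)) & ((1 << width) - 1)


def decode_16b(halfword: int) -> C16Fields:
    """Split a halfword according to the generic 16b row (illustrative)."""
    return C16Fields(
        N=field(halfword, 0, 0),
        immf=field(halfword, 1, 4),
        cmaj=field(halfword, 5, 7),
        cmin=field(halfword, 8, 8),
        fld1=field(halfword, 9, 11),
        fld2=field(halfword, 12, 14),
        M=field(halfword, 15, 15),
    )
```

For instance `decode_16b(0xAABD)` yields N=1, immf=0b0101, cmaj=0b010, cmin=1, fld1=0b011, fld2=0b110 and M=1.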
 181
 182 ### Immediate Opcodes
 183
Only available in 16-bit mode, and only when M=1 and N=1
and when Cmaj.min is not 0b001.1.
 186
 187 instruction counts from objdump on /bin/bash:
 188
 189       466 extsw r1,r1
 190       649 stw r1,1(r1)
 191       691 lwz r1,1(r1)
 192       705 cmpdi r1,1
 193       791 cmpwi r1,1
 194       794 addis r1,r1,1
 195      1474 std r1,1(r1)
 196      1846 li r1,1
 197      2031 mr r1,r1
 198      2473 addi r1,r1,1
 199      3012 nop
 200      3028 ld r1,1(r1)
 201
 202
 203     | 0 | 1  | 2 | 3 4 | | 567.8 | 9ab  | cde | f |
 204     | 1 | 0  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
 205     | 1 | 0  |  sh2    | | 001.0 | RA   | sh  | 1 | sradi.
 206     | 1 | 1  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
 207     | 1 | 1  | 0 | sh2 | | 001.0 | RA   | sh  | 1 | srawi.
 208     | 1 | 1  | 1 |     | | 001.0 | 000  | imm | 1 | TBD
 209     | 1 | 1  | 1 | i2  | | 001.0 | RA!=0| imm | 1 | addis
 210     | 1 |              | | 010.0 | 000  |     | 1 | TBD
 211     | 1 | i2           | | 010.0 | RA!=0| imm | 1 | addi
 212     | 1 | 0  | i2      | | 010.1 | RA   | imm | 1 | cmpdi
 213     | 1 | 1  | i2      | | 010.1 | RA   | imm | 1 | cmpwi
 214     | 1 | 0  | i2      | | 011.0 | RT   | imm | 1 | ldspi
 215     | 1 | 1  | i2      | | 011.0 | RT   | imm | 1 | lwspi
 216     | 1 | 0  | i2      | | 011.1 | RT   | imm | 1 | stwspi
 217     | 1 | 1  | i2      | | 011.1 | RT   | imm | 1 | stdspi
 218     | 1 | i2 | RA      | | 100.0 | RT   | imm | 1 | stwi
 219     | 1 | i2 | RA      | | 100.1 | RT   | imm | 1 | stdi
 220     | 1 | i2 | RT      | | 101.0 | RA   | imm | 1 | ldi
 221     | 1 | i2 | RT      | | 101.1 | RA   | imm | 1 | lwi
 222     | 1 | i2 | RA      | | 110.0 | RT   | imm | 1 | fsti
 223     | 1 | i2 | RA      | | 110.1 | RT   | imm | 1 | fstdi
 224     | 1 | i2 | RT      | | 111.0 | RA   | imm | 1 | flwi
 225     | 1 | i2 | RT      | | 111.1 | RA   | imm | 1 | fldi
 226
 227 Construction of immediate:
 228
 229 * LD/ST r1 (SP) variants should be offset by -256
 230  see <https://bugs.libre-soc.org/show_bug.cgi?id=238#c43>
 231   - SP variants map to e.g ld RT, imm(r1)
 232   - SV Prefixing can be used to map r1 to alternate regs
 233 * [1] not the same as v3.0B addis: the shift amount is smaller and actually
 234   still maps to within the v3.0B addi immediate range.
 235 * addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
 236 * addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1023 in increments of 8
 237 * all others are EXTS(i2||imm) to give a 7-bit range -128 to +127
 238   (further for LD/ST due to word/dword-alignment)
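The EXTS(a||b) notation used above can be sketched as follows. Field widths are passed in explicitly rather than taken from the table, since the allocation is still in flux; `exts` and `concat` are hypothetical helper names.

```python
def exts(value: int, width: int) -> int:
    """Sign-extend the low `width` bits of value to a Python int."""
    value &= (1 << width) - 1
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def concat(*fields):
    """Concatenate (value, width) pairs, most significant first.

    Returns (value, total_width) so the result can be fed to exts().
    """
    total, total_width = 0, 0
    for value, width in fields:
        total = (total << width) | (value & ((1 << width) - 1))
        total_width += width
    return total, total_width
```

A 1-bit field followed by a 3-bit field gives the 4-bit range: `exts(*concat((1, 1), (0, 3)))` is -8 and `exts(*concat((0, 1), (7, 3)))` is 7. Appending `(0, 3)` models the `||000` shift used by the addis-style immediate.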
 239
 240 Further Notes:
 241
 242 * bc also has an immediate mode, listed separately below in Branch section
* for LD/ST, the offset is aligned: 8-byte uses i2||imm||0b000, 4-byte uses i2||imm||0b00
 244 * SV Prefix over-rides help provide alternative bitwidths for LD/ST
 245 * RA|0 if RA is zero, addi. becomes "li"
 246   - this only works if RT takes part of opcode
 247   - mv is also possible by specifying an immediate of zero
 248
 249 ### Illegal, nop and attn
 250
 251 Note that illeg is all zeros, including in the 16-bit mode.
 252 Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and
 253 EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit
 254 run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000
 255 is "nop"
 256
 257     | 16-bit mode | | 10-bit mode                 |
 258     | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
 259     | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | illeg
 260     | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop
 261
 262 16 bit mode only:
 263
 264     | 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | nop
 265     | 1 | 1   000 | | 000.0  | 0  00 | 0   00 | 0 | attn
 266     | 1 | nonzero | | 000.0  | 0  00 | 0   00 | 0 | TBD
 267
 268 Notes:
 269
 270 * All-zeros being an illegal instruction is normal for ISAs.  Ensuring that
 271   this remains true at all times i.e. for both 10 bit and 16 bit mode is
 272   common sense.
 273 * The 10-bit nop (bit 15, M=1) is intended for circumstances
 274   where alignment to 32-bit before returning to v3.0B is required.
 275   M=1 being an indication "return to Standard v3.0B Encoding Mode".
* The 16-bit nop (bit 0, N=1) is intended for circumstances where a
  return to Standard v3.0B Encoding is required for one cycle,
  and where that one cycle must be aligned to a 32-bit boundary.
 279   Examples of this would be to return to "strict" (non-C) mode
 280   where the PC may not be on a non-word-aligned boundary.
 281 * If for any reason multiple 16 bit nops are needed in succession
 282   the M=1 variant can be used, because each one returns to
 283   Standard v3.0B Encoding Mode, each time.
 284
 285 In essence the 2 nops are needed due to there being 2 different C forms: 10 and 16 bit.
 286
 287 ### Branch
 288
 289     | 16-bit mode | | 10-bit mode                 |
 290     | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
 291     | N | offs2   | | 000.LK | offs!=0        | M | b, bl
 292     | 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl
 293     | N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl
 294
 295 16 bit mode:
 296
 297 * bc only available when N,M=0b11
 298 * offs2 extends offset in MSBs
 299 * BI3 extends BI in MSBs to allow selection of full CR
 300 * BO3 extends BO
 301 * bc offset constructed from oo as LSBs and offs2 as MSBs
 302 * bc BI allows selection of all bits from CR0 or CR1
 303 * bc CR check is always active (as if BO0=1) therefore BO1 inverts
 304
 305 10 bit mode:
 306
 307 * illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
 308 * nop also covers part of branch (offs=0,M=0,LK=1)
 309 * bc **not available** in 10-bit mode
 310 * BO[0] enables CR check, BO[1] inverts check
 311 * BI refers to CR0 only (4 bits of)
 312 * no Branch Conditional with immediate
 313 * no Absolute Address
 314 * CTR mode allowed with BO[2] for b only.
* offs is signed, in units of 2 bytes
* all branch targets are 2 byte aligned
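A sketch of how the `b`/`bl` target might be formed from the fields in the Branch table: 6 bits of `offs` in 10-bit mode, extended in the MSBs by the 4 bits of `offs2` in 16-bit mode, sign-extended and scaled by 2. That the offset is relative to the address of the branch itself is an assumption carried over from v3.0B, and `b_target` is a made-up name.

```python
def b_target(pc: int, offs: int, offs2=None) -> int:
    """Compute a compressed branch target (illustrative only).

    pc    -- address of the branch instruction (assumed base, as in v3.0B)
    offs  -- 6-bit offset field (bits 9..e)
    offs2 -- 4-bit MSB extension (bits 1..4), 16-bit mode only
    """
    if offs2 is None:
        raw, width = offs & 0x3F, 6
    else:
        raw, width = ((offs2 & 0xF) << 6) | (offs & 0x3F), 10
    sign = 1 << (width - 1)
    signed = (raw ^ sign) - sign
    return pc + 2 * signed      # offsets count 2-byte units
```

Under these assumptions the 10-bit form reaches -64 to +62 bytes and the 16-bit form -1024 to +1022 bytes.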
 317
 318 ### LD/ST
 319
 320     | 16-bit mode      | | 10-bit mode               |
 321     | 0   | 1  | 2 3 4 | | 567.8 | 9 a b | c d e | f |
 322     | RA2 | SZ |  RB   | | 001.1 | 1  RA | 0  RT | M | st
 323     | RA2 | SZ |  RB   | | 001.1 | 1  RA | 1  RT | M | fst
 324     | N   | SZ |  RT   | | 111.0 |  RA   |  RB   | M | ld
 325     | N   | SZ |  RT   | | 111.1 |  RA   |  RB   | M | fld
 326
 327 * elwidth overrides can set different widths
 328
 329 16 bit mode:
 330
 331 * SZ=1 is 64 bit, SZ=0 is 32 bit
 332 * RA2 extends RA to 3 bits (MSB)
 333 * RT2 extends RT to 3 bits (MSB)
 334
 335 10 bit mode:
 336
 337 * RA and RB are only 2 bit (0-3)
 338 * for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
 339 * for ST, there is no offset: "st RT, RA(0)"
 340
 341 ### Arithmetic
 342
 343     | 16-bit mode | | 10-bit mode             |
 344     | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
 345     | N | 0 | RT  | | 010.0 | RB  | RA!=0 | M | add
 346     | N | 0 | RT  | | 010.1 | RB  | RA|0  | M | sub.
 347     | N | 0 | BF  | | 011.0 | RB  | RA|0  | M | cmpl
 348
 349 Notes:
 350
 351 * sub. and cmpl: default CR target is CR0
 352 * for (RA|0) when RA=0 the input is a zero immediate,
 353   meaning that sub. becomes neg. and cmp becomes cmpi against zero
 354 * RT is implicitly RB: "add RT(=RB), RA, RB"
 355 * Opcode 0b010.0 RA=0 is not missing from the above:
 356   it is a system-wide instruction, "cbank" (section below)
 357
 358 16 bit mode only:
 359
 360     | 0 | 1 | 234 | | 567.8 | 9ab | cde   | f |
 361     | N | 1 | RA  | | 010.0 | RB  | RS    | 0 | sld.
 362     | N | 1 | RA  | | 010.1 | RB  | RS!=0 | 0 | srd.
 363     | N | 1 | RA  | | 010.1 | RB  | 000   | 0 | srad.
 364     | N | 1 | BF  | | 011.0 | RB  | RA|0  | 0 | cmpw
 365
 366 Notes:
 367
 368 * for srad, RS=RA: "srad. RA(=RS), RS, RB"
 369
 370
 371 ### Logical
 372
 373     | 16-bit mode   | | 10-bit mode             |
 374     | 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
 375     | N | 0 |  RT   | | 100.0 | RB  | RA!=0 | M | and
 376     | N | 0 |  RT   | | 100.1 | RB  | RA!=0 | M | nand
 377     | N | 0 |  RT   | | 101.0 | RB  | RA!=0 | M | or
 378     | N | 0 |  RT   | | 101.1 | RB  | RA!=0 | M | nor/mr
 379     | N | 0 |  RT   | | 100.0 | RB  | 0 0 0 | M | extsw
 380     | N | 0 |  RT   | | 100.1 | RB  | 0 0 0 | M | cntlz
 381     | N | 0 |  RT   | | 101.0 | RB  | 0 0 0 | M | popcnt
 382     | N | 0 |  RT   | | 101.1 | RB  | 0 0 0 | M | not
 383
 384 16-bit mode only:
 385
 386     | 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
 387     | N | 1 |  RT   | | 100.0 | RB  | RA!=0 | 0 | TBD
 388     | N | 1 |  RT   | | 100.1 | RB  | RA!=0 | 0 | TBD
 389     | N | 1 |  RT   | | 101.0 | RB  | RA!=0 | 0 | xor
 390     | N | 1 |  RT   | | 101.1 | RB  | RA!=0 | 0 | eqv (xnor)
 391     | N | 1 |  RT   | | 100.0 | RB  | 0 0 0 | 0 | extsb
 392     | N | 1 |  RT   | | 100.1 | RB  | 0 0 0 | 0 | cnttz
 393     | N | 1 |  RT   | | 101.0 | RB  | 0 0 0 | 0 | TBD
 394     | N | 1 |  RT   | | 101.1 | RB  | 0 0 0 | 0 | extsh
 395
 396 10 bit mode:
 397
 398 * idea: for 10bit mode, nor is actually 'mr' because mr is
 399   a more common operation.  in 16bit however, this encoding
 400   (Cmaj.min=0b101.1, N=0) is 'nor'
 401 * for (RA|0) when RA=0 the input is a zero immediate,
 402   meaning that nor becomes not
 403 * cntlz, popcnt, exts **not available** in 10-bit mode
 404 * RT is implicitly RB: "and RT(=RB), RA, RB"
 405
 406 ### Floating Point
 407
 408 Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64
 409
 410     | 16-bit mode   | | 10-bit mode             |
 411     | 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
 412     | N |   |  RT   | | 011.1 | RB  | RA!=0 | M | fsub.
 413     | N | 0 |  RT   | | 110.0 | RB  | RA!=0 | M | fadd
 414     | N | 0 |  RT   | | 110.1 | RB  | RA!=0 | M | fmul
 415     | N | 0 |  RT   | | 011.1 | RB  | 0 0 0 | M | fneg.
 416     | N | 0 |  RT   | | 110.0 | RB  | 0 0 0 | M |
 417     | N | 0 |  RT   | | 110.1 | RB  | 0 0 0 | M |
 418
 419 16-bit mode only:
 420
 421     | 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
 422     | N | 1 |  RT   | | 011.1 | RB  | RA!=0 | 0 |
 423     | N | 1 |  RT   | | 110.0 | RB  | RA!=0 | 0 |
 424     | N | 1 |  RT   | | 110.1 | RB  | RA!=0 | 0 | fdiv
 425     | N | 1 |  RT   | | 011.1 | RB  | 0 0 0 | 0 | fabs.
 426     | N | 1 |  RT   | | 110.0 | RB  | 0 0 0 | 0 | fmr.
 427     | N | 1 |  RT   | | 110.1 | RB  | 0 0 0 | 0 |
 428
 429 16 bit only, FP to INT convert (using C 0b001.1 subencoding)
 430
 431     | 0123 | 4 | | 567.8 | 9 ab | cde  | f |
 432     | 0010 | X | | 001.1 | 0 RA | Y RT | M | fp2int
 433     | 0011 | X | | 001.1 | 0 RA | Y RT | M | int2fp
 434
 435 * X: signed=1, unsigned=0
 436 * Y: FP32=0, FP64=1
 437
 438 10 bit mode:
 439
 440 * fsub. fneg. and fmr. default target is CR1
 441 * fmr. is **not available** in 10-bit mode
 442 * fdiv is **not available** in 10-bit mode
 443
 444 16 bit mode:
 445
 446 * fmr. copies RB to RT (and sets CR1)
 447
 448 ### Condition Register
 449
 450     | 16-bit mode   | | 10-bit mode            |
 451     | 0 1 2 3 | 4   | | 567.8 | 9 ab | cde | f |
 452     | 0 0 0 0 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf
 453     | 0 0 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | crnor
 454     | 0 1 0 0 | BA2 | | 001.1 | 0 BA | BB  | M | crandc
 455     | 0 1 1 0 | BA2 | | 001.1 | 0 BA | BB  | M | crxor
 456     | 0 1 1 1 | BA2 | | 001.1 | 0 BA | BB  | M | crnand
 457     | 1 0 0 0 | BA2 | | 001.1 | 0 BA | BB  | M | crand
 458     | 1 0 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | creqv
 459     | 1 1 0 1 | BA2 | | 001.1 | 0 BA | BB  | M | crorc
 460     | 1 1 1 0 | BA2 | | 001.1 | 0 BA | BB  | M | cror
 461
 462 10 bit mode:
 463
 464 * mcrf BF is only 2 bits which means the destination is only CR0-CR3
 465 * CR operations: **not available** in 10-bit mode (but mcrf is)
 466
 467 16 bit mode:
 468
 469 * mcrf BF2 extends BF (in MSB) to 3 bits
 470 * CR operations: destination register is same as BA.
 471 * CR operations: only possible on CR0 and CR1
 472
 473 SV (Vector Mode):
 474
 475 * CR operations: greatly extended reach/range (useful for predicates)
 476
 477 ### System
 478
cbank: Selection of Compressed-encoding "Bank".  Different "banks"
give different meanings to opcodes.  Example: CBank=0b001 is heavily
optimised for Audio/Video Encode/Decode.  cbank borrows from add's encoding
space (when RA==0).
 483
 484     | 16-bit mode | | 10-bit mode             |
 485     | 0 | 1 2 3 4 | | 567.8 | 9ab   | cde | f |
 486     | N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank
 487
 488 **not available** in 10-bit mode:
 489
 490     | 0 1 2 3 | 4  | | 567.8 | 9 ab | cde  | f |
 491     | 1 1 1 1 | 0  | | 001.1 | 0 00 |  RT  | M | mtlr
 492     | 1 1 1 1 | 0  | | 001.1 | 0 01 |  RT  | M | mtctr
 493     | 1 1 1 1 | 0  | | 001.1 | 0 11 |  RT  | M | mtcr
 494     | 1 1 1 1 | 1  | | 001.1 | 0 00 |  RA  | M | mflr
 495     | 1 1 1 1 | 1  | | 001.1 | 0 01 |  RA  | M | mfctr
 496     | 1 1 1 1 | 1  | | 001.1 | 0 11 |  RA  | M | mfcr
 497
 498 ### Unallocated
 499
 500     | 0 1 2 3 | 4  | | 567.8 | 9 ab | cde  | f |
 501     | 0 1 0 1 |    | | 001.1 | 0    |      | M |
 502     | 1 0 1 0 |    | | 001.1 | 0    |      | M |
 503     | 1 0 1 1 |    | | 001.1 | 0    |      | M |
 504     | 1 1 0 0 |    | | 001.1 | 0    |      | M |
 505     | 1 1 1 1 |    | | 001.1 | 0 10 |      | M |
 506
 507 ## Other ideas (Attempt 2)
 508
 509 ### 8-bit mode-switching instructions, odd addresses for C mode
 510
Drop the complexity of the 16-bit encoding that is further reduced to
10-bit, and use a single byte instead of two to switch between modes.  This
would place compressed (C) mode instructions at odd bytes, so the
processor can use the LSB of the PC to tell which mode it is in.
 515
 516 To switch from traditional to compressed mode, the single-byte
 517 instruction would be at the MSByte, that holds the EXT bits.  (When we
 518 break up a 32-bit instruction across words, the most significant half
 519 should go in the word with the lower address.)
 520
 521 To switch from compressed mode to traditional mode, the single-byte
 522 instruction would also be at the opcode/format portion, placed in the
 523 lower-address word if split across words, so that the instruction can
 524 be recognized as the mode-switching one without going for its second
 525 byte.
 526
 527 The C-mode nop should be encoded so that its second byte encodes a
 528 switch to compressed mode, if decoded in traditional mode.  This
 529 enables such a nop to straddle across a label:
 530
 531     8-bit first half of nop
 532     Label:
 533     8-bit second half of nop AKA switch to compressed mode
 534     16-bit insns...
 535
 536 so that if traditional code jumps to the word-aligned label (because
 537 traditional branches drop the 2 LSB), it immediately switches to
 538 compressed mode; if we fall-through, we remain in 16-bit mode; and if
 539 we branch to it from compressed mode, whether we jump to the odd or
 540 the even address, we end up in compressed mode as desired.
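The label trick can be checked with a little address arithmetic. The sketch below assumes the label is word-aligned, that the mode-switching instruction is one byte and that compressed instructions (including the nop) are two bytes; all function names are invented for the example.

```python
SWITCH_INSN_BYTES = 1   # the mode-switching instruction is a single byte
C_INSN_BYTES = 2        # compressed instructions, including the nop


def mode_for_pc(pc: int) -> str:
    """Odd PC means compressed (C) mode, even PC means traditional mode."""
    return "C" if pc & 1 else "v3.0B"


def after_traditional_branch(label: int) -> int:
    """PC after a traditional branch lands on the label and executes the switch."""
    pc = label & ~0b11                  # traditional branches drop the 2 LSBs
    assert mode_for_pc(pc) == "v3.0B"
    return pc + SWITCH_INSN_BYTES


def after_fallthrough(label: int) -> int:
    """PC after falling through the C-mode nop that straddles the label."""
    nop_start = label - 1               # first half of the nop precedes the label
    assert mode_for_pc(nop_start) == "C"
    return nop_start + C_INSN_BYTES
```

Both paths arrive at `label + 1`, an odd address, so execution continues in compressed mode either way.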
 541
 542 Tables explaining encoding:
 543
 544     | byte 0 | byte 1 | byte 2 | byte 3 |
 545     | v3.0B standard 32 bit instruction |
 546     | EXT000 | 16 bit          | 16...  |
 547     | .. bit | 8nop   | v3.0b stand...  |
 548     | .. ard 32 bit   | EXT000 | 16...  |
 549     | .. bit | 16 bit          | 8nop   |
 550     | v3.0B standard 32 bit instruction |
 551
 552
 553 ### TODO
 554
 555 * make a preliminary assessment of branch in/out viability
 556 * confirm FSM encoding (is LSB of PC really enough?)
* guesstimate opcode and register allocation (without necessarily doing a full encoding)
 558 * write throwaway python program that estimates compression ratio from objdump raw parsing
 559 * finally do full opcode allocation
 560 * rerun objdump compression ratio estimates
 561
 562 ### Use 2- rather than 3-register opcodes
 563
 564 Successful compact ISAs have used 2- rather than 3-register insns, in
 565 which the same register serves as input and output.  Some 20% of
 566 general-purpose 3-register insns already use either input register as
 567 output, without any effort by the compiler to do so.
 568
Repurposing the 3 bits used to encode one of the input registers
in arithmetic, logical and floating-point instructions, and the 2 bits
used to encode the mode of the next two insns, we could make the full
register files available to the opcodes already selected for
compressed mode, with one bit to spare to bring additional opcodes in.
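One way to read the bit budget in the paragraph above, assuming the two remaining register fields grow from 3 bits to the 5 bits needed to address a full 32-entry register file (the function name is illustrative):

```python
def spare_bits(reg_bits_now: int = 3, reg_bits_full: int = 5,
               mode_bits: int = 2, remaining_regs: int = 2) -> int:
    """Bits left over after widening the remaining register fields."""
    freed = reg_bits_now + mode_bits    # dropped input register + mode bits
    needed = remaining_regs * (reg_bits_full - reg_bits_now)
    return freed - needed
```

With the defaults this is 5 freed bits against 4 needed, leaving the single spare bit mentioned above.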
 574
 575 An opcode could be assigned to an instruction that combines and
 576 extends with the subsequent instruction, providing it with a separate
 577 input operand to use rather than the output register, or with
 578 additional range for immediate and offset operands, effectively
 579 forming a 32-bit operation, enabling us to remain in compressed mode
 580 even longer.
 581
 582 # Analysis techniques and tools
 583
    # Count instruction forms in /bin/bash, with register numbers
    # (other than r0) and immediates all normalised to 1.
    objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
      s/^[ x0-9A-Fa-f]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
      sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
      s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
      sort -n | less
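The same normalisation can be sketched in Python, as a possible starting point for the throwaway compression-ratio estimator. It mirrors the sed expressions above but accepts lowercase hex addresses; `normalise` and `count_forms` are hypothetical names, and feeding it the output of `objdump -d --no-show-raw-insn` is left to the caller.

```python
import re
from collections import Counter

# address, colon, mnemonic, operands
INSN = re.compile(r"^[ x0-9A-Fa-f]*: *([a-z.]+) *(.*)")


def normalise(line: str):
    """Reduce one objdump line to a canonical form, or None if not an insn."""
    m = INSN.match(line.replace("\t", " "))
    if not m:
        return None
    text = f"{m.group(1)} {m.group(2)} "
    # Every register other than r0 becomes r1.
    text = re.sub(r"([, (])r[1-9][0-9]*", r"\1r1", text)
    # Every (possibly negative) immediate becomes 1.
    text = re.sub(r"([ ,])-*[0-9]+([^0-9])", r"\g<1>1\2", text)
    return text.strip()


def count_forms(lines) -> Counter:
    """Histogram of normalised instruction forms."""
    return Counter(n for n in map(normalise, lines) if n)
```

Calling `count_forms(text.splitlines()).most_common()` on the disassembly gives the same kind of histogram as the shell pipeline, most frequent first.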
 589
 590 # gcc register allocation
 591
 592 FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed registers (assigned to special purposes) and register allocation order:
 593
 594 Special-purpose registers on ppc are:
 595
 596     r0: constant zero/throw-away
 597     r1: stack pointer
 598     r2: thread-local storage pointer in 32-bit mode
 599     r2: non-minimal TOC register
 600     r10: EH return stack adjust register
 601     r11: static chain pointer
 602     r13: thread-local storage pointer in 64-bit mode
 603     r30: minimal-TOC/-fPIC/-fpic base register
 604     r31: frame pointer
 605     lr: return address register
 606
 607 the register allocation order in GCC (i.e., it takes the earliest available register that fits the constraints) is:
 608
 609      We allocate in the following order:
 610         fp0             (not saved or used for anything)
 611         fp13 - fp2      (not saved; incoming fp arg registers)
 612         fp1             (not saved; return value)
 613         fp31 - fp14     (saved; order given to save least number)
 614         cr7, cr5        (not saved or special)
 615         cr6             (not saved, but used for vector operations)
 616         cr1             (not saved, but used for FP operations)
 617         cr0             (not saved, but used for arithmetic operations)
 618         cr4, cr3, cr2   (saved)
 619         r9              (not saved; best for TImode)
 620         r10, r8-r4      (not saved; highest first for less conflict with params)
 621         r3              (not saved; return value register)
 622         r11             (not saved; later alloc to help shrink-wrap)
 623         r0              (not saved; cannot be base reg)
 624         r31 - r13       (saved; order given to save least number)
 625         r12             (not saved; if used for DImode or DFmode would use r13)
 626         ctr             (not saved; when we have the choice ctr is better)
 627         lr              (saved)
 628         r1, r2, ap, ca  (fixed)
 629         v0 - v1         (not saved or used for anything)
 630         v13 - v3        (not saved; incoming vector arg registers)
 631         v2              (not saved; incoming vector arg reg; return value)
 632         v19 - v14       (not saved or used for anything)
 633         v31 - v20       (saved; order given to save least number)
 634         vrsave, vscr    (fixed)
 635         sfp             (fixed)
 636
 637 # Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode
 638
 639 [[demo]]
 640
 641 # Efficient Decoding Algorithm
 642
 643 [[decoding]]