[[!tag standards]] # 16 bit Compressed Similar to VLE (but without immediate-prefixing) this encoding is designed to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR is recommended). Note that Compressed is *mutually exclusively incompatible* with OpenPOWER v3.1B "prefixing" due to using (requiring) both EXT000 and EXT001. Hypothetically it could be made to use anything other than EXT001, with some inconvenience (extra gates). The incompatibility is "fixed" by swapping out of "Compressed" Mode and back into "Normal" (v3.1B) Mode, at runtime, as needed. Although initially intended to be augmented by Simple-V Prefixing (to add Vector context, width overrides, e.g IEEE754 FP16, and predication) yet not put pressure on I-Cache power or size, this Compressed Encoding is not critically dependent *on* SV Prefixing, and may be used stand-alone. See: * * VLE Encoding * This one is a conundrum. OpenPOWER ISA was never designed with 16 bit in mind. VLE was added 10 years ago but only by way of marking an entire 64k page as "VLE". With VLE not maintained it is not fully compatible with current PowerISA. Here, in order to embed 16 bit into a predominantly 32 bit stream the overhead of using an entire 16 bits just to switch into Compressed mode is itself a significant overhead. The situation is made worse by OpenPOWER ISA being fundamentally designed with 6 bits uniformly taking up Major Opcode space, leaving only 10 bits to allocate to actual instructions. Contrast this with RVC which takes 3 out of 4 combinations of the first 2 bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing standard 32 bit and 16 bit to intermingle cleanly. To achieve the same thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which is clearly impractical: other schemes need to be devised. In addition we would like to add SV-C32 which is a Vectorized version of 16 bit Compressed, and ideally have a variant that adds the 27-bit prefix format from SV-P64, as well. Potential ways to reduce pressure on the 16 bit space are: * To use more than one v3.0B Major Opcode, preferably an odd-even contiguous pair * To provide "paging". This involves bank-switching to alternative optimised encodings for specific workloads * To enter "16 bit mode" for durations specified at the start * To reserve one bit of every 16 bit instruction to indicate that the 16 bit mode is to continue to be sustained This latter would be useful in the Vector context to have an alternative meaning: as the bit which determines whether the instruction is 11-bit prefixed or 27-bit prefixed: 0 1 2 3 4 5 6 7 8 9 a b c d e f | |major op | 11 bit vector prefix| |16 bit opcode alt vec. mode ^ | | extra vector prefix if alt set| Using a major opcode to enter 16 bit mode, leaves 11 bits to find something to use them for: 0 1 2 3 4 5 6 7 8 9 a b c d e f | |major op | what to do here 1 | |16 bit stay in 16bit mode 1 | |16 bit stay in 16bit mode 1 | |16 bit exit 16bit mode 0 | One possibility is that the 11 bits are used for bank selection, with some room for additional context such as altering the registers used for the 16 bit operations (bank selection of which scalar regs). However the downside is that short sequences of Compressed instructions become penalised by the fixed overhead. Even a single 16 bit instruction requires a 16 bit overhead to "gain access" to 16 bit "mode", making the exercise pointless. An alternative is to use the first 11 bits for only the utmost commonly used instructions. That being the case then one of those 11 bits could be dedicated to saying if 16 bit mode is to be continued, at which point *all* 16 bits can be used for Compressed. 10 bits remain for actual opcodes, which is ridiculously tight, however the opportunity to subsequently use all 16 bits is worth it. The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below: |0 1 2 3 4 5 6 7 8 9 a b c d e f| |major op..0| LO Half C space | |major op..1| HI Half C space | |N N N N N|<--11 bits C space-->| If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this saves gates at a critical part of the decode phase. ## ABI considerations Unlike RISC-V RVC, the above "context" encodings require state, to be stored in the PCR, MSR, or a dedicated SPR. These bits (just like LE/BE 32bit mode and the IEEE754 FPCSR mode) all require taking that context into consideration. In particular it is critically important to recognise that context (in general) is an implicit part of the ABI implemented for example by glibc6. Therefore (in specific) Compressed Mode Context **must not** be permitted to cross into or out of a function call. Thus it is the mandatory responsibility of the compiler to ensure that context returns to "v3.0B Standard" prior to entering a function call (responsibility of caller) and prior to exit from a function call (responsibility of callee) by setting appropriate M and N bits. If however it is known to the compiler that certain static leaf node functions and their immediate callers will never, under any circumstances, be called by externsl ABI compliant code, then of course the compiler may choose to write such static functions as it sees fit. Trap Handlers also take responsibility for saving and restoring of Compressed Mode state, just as they already take responsibility for other critical state. This makes traps transparent to functions as far as Compressed Mode Context is concerned, just as traps are already transparent to functions. Note however that there are exceptions in a compiler to the otherwise hard rule that Compressed Mode context not be permitted to cross function boundaries: inline functions and static functions. static functions, if correctly identified as never to be called externally, may, as an optimisation, disregard standard ABIs, bearing in mind that this will be fraught (pointers to functions) and not easy to get right. # Opcode Allocation Ideas * one bit from the 16-bit mode is used to indicate that standard (v3.0B) mode is to be dropped into for only one single instruction ## Opcodes exploration (Attempt 1) Switching between different encoding modes is controlled by M (alone) in 10-bit mode, and M and N in 16-bit mode. * M in 10-bit mode if zero indicates that following instructions are standard OpenPOWER ISA 32-bit encoded (including, redundantly, further 10/16-bit instructions) * M in 10-bit mode if 1 indicates that following instructions are in 16-bit encoding mode Once in 16-bit mode: * 0b01 (M=1, N=0): stay in 16-bit mode * 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA) * 0b10: leave 16-bit mode for one cycle (return to standard OpenPOWER ISA) * 0b11: free to be used for something completely different. The current "top" idea for 0b11 is to use it for a new encoding format of predominantly "immediates-based" 16-bit instructions (branch-conditional, addi, mulli etc.) * The Compressed Major Opcode is in bits 5-7. * Minor opcode in bit 8. * In some cases bit 9 is taken as an additional sub-opcode, followed by bits 0-4 (for CR operations) * M+N mode-switching is not available for C-Major.minor 0b001.1 * 10 bit mode may be expanded by 16 bit mode, adding capabilities that do not fit in the extreme limited space. Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit. 16-bit immediate mode remains in 16-bit. | 0 | 1234 | 567 8 | 9abcde | f | explanation | - | ---- | ------ | ------ | - | ----------- | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit | 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit Notes: * Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor * EXT000 and EXT001 are v3.0B Major Opcodes. The first 5 bits are zero, therefore the 6th bit is actually part of Cmaj. * "10bit then 16bit" means "this instruction is encoded C 10bit and the following one in C 16bit" * "16b, 1x v3.0B, 16b" means, "this instruction is encoded C 16bit, the following one is V3.0B Standard, and the one after that is back to 16bit". ### C Instruction Encoding types 10-bit Opcode formats (all start with v3.0B EXT000 or EXT001 Major Opcodes) | 01234 | 567 8 | 9 | a b | c | d e | f | enc | E01 | Cmaj.m | fld1 | fld2 | M | 10b | E01 | Cmaj.m | offset | M | 10b b | E01 | 001.1 | S1 | fd1 | S2 | fd2 | M | 10b sub | E01 | 111.m | fld1 | fld2 | M | 10b LDST 16-bit Opcode formats (including 10/16/v3.0B Switching) | 0 | 1234 | 567 8 | 9 | a b | c | d e | f | enc | N | immf | Cmaj.m | fld1 | fld2 | M | 16b | 1 | immf | Cmaj.m | fld1 | imm | 1 | 16b imm | N | fd3 | 001.1 | S1 | fd1 | S2 | fd2 | M | 16b sub | N | fd4 | 111.m | fld1 | fld2 | M | 16b LDST Notes: * fld1 and fld2 can contain reg numbers, immediates, or opcode fields (BO, BI, LK) * S1 and S2 are further sub-selectors of C 001.1 ### Immediate Opcodes only available in 16-bit mode, only available when M=1 and N=1 and when Cmaj.min is not 0b001.1. instruction counts from objdump on /bin/bash: 466 extsw r1,r1 649 stw r1,1(r1) 691 lwz r1,1(r1) 705 cmpdi r1,1 791 cmpwi r1,1 794 addis r1,r1,1 1474 std r1,1(r1) 1846 li r1,1 2031 mr r1,r1 2473 addi r1,r1,1 3012 nop 3028 ld r1,1(r1) | 0 | 1 | 2 | 3 4 | | 567.8 | 9ab | cde | f | | 1 | 0 | 0 0 0 | | 001.0 | | 000 | 1 | TBD | 1 | 0 | sh2 | | 001.0 | RA | sh | 1 | sradi. | 1 | 1 | 0 0 0 | | 001.0 | | 000 | 1 | TBD | 1 | 1 | 0 | sh2 | | 001.0 | RA | sh | 1 | srawi. | 1 | 1 | 1 | | | 001.0 | 000 | imm | 1 | TBD | 1 | 1 | 1 | i2 | | 001.0 | RA!=0| imm | 1 | addis | 1 | 0 | i2 | | 010.0 | 000 | imm | 1 | setvli | 1 | 1 | i2 | | 010.0 | 000 | imm | 1 | setmvli | 1 | i2 | | 010.0 | RA!=0| imm | 1 | addi | 1 | 0 | i2 | | 010.1 | RA | imm | 1 | cmpdi | 1 | 1 | i2 | | 010.1 | RA | imm | 1 | cmpwi | 1 | 0 | i2 | | 011.0 | RT | imm | 1 | ldspi | 1 | 1 | i2 | | 011.0 | RT | imm | 1 | lwspi | 1 | 0 | i2 | | 011.1 | RT | imm | 1 | stwspi | 1 | 1 | i2 | | 011.1 | RT | imm | 1 | stdspi | 1 | i2 | RA | | 100.0 | RT | imm | 1 | stwi | 1 | i2 | RA | | 100.1 | RT | imm | 1 | stdi | 1 | i2 | RT | | 101.0 | RA | imm | 1 | ldi | 1 | i2 | RT | | 101.1 | RA | imm | 1 | lwi | 1 | i2 | RA | | 110.0 | RT | imm | 1 | fsti | 1 | i2 | RA | | 110.1 | RT | imm | 1 | fstdi | 1 | i2 | RT | | 111.0 | RA | imm | 1 | flwi | 1 | i2 | RT | | 111.1 | RA | imm | 1 | fldi Construction of immediate: * LD/ST r1 (SP) variants should be offset by -256 see - SP variants map to e.g ld RT, imm(r1) - SV Prefixing can be used to map r1 to alternate regs * [1] not the same as v3.0B addis: the shift amount is smaller and actually still maps to within the v3.0B addi immediate range. * addi is EXTS(i2||imm) to give a 4-bit range -8 to +7 * addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1023 in increments of 8 * all others are EXTS(i2||imm) to give a 7-bit range -128 to +127 (further for LD/ST due to word/dword-alignment) Further Notes: * bc also has an immediate mode, listed separately below in Branch section * for LD/ST, offset is aligned. 8-byte: i2||imm||0b000 4-byte: 0b00 * SV Prefix over-rides help provide alternative bitwidths for LD/ST * RA|0 if RA is zero, addi. becomes "li" - this only works if RT takes part of opcode - mv is also possible by specifying an immediate of zero ### Illegal, nop and attn Note that illeg is all zeros, including in the 16-bit mode. Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000 is "nop" | 16-bit mode | | 10-bit mode | | 0 | 1 | 234 | | 567.8 | 9 ab | c de | f | | - | - | --- | | ----- | ----- | ------ | - | | 0 | 0 000 | | 000.0 | 0 00 | 0 00 | 0 | illeg | 0 | 0 000 | | 000.0 | 0 00 | 0 00 | 1 | nop 16 bit mode only: | 0 | 1 | 234 | | 567.8 | 9 ab | c de | f | | - | - | --- | | ----- | ----- | ------ | - | | 1 | 0 000 | | 000.0 | 0 00 | 0 00 | 0 | nop | 1 | 0 000 | | 000.0 | 0 00 | 0 00 | 1 | nop | N | 1 000 | | 000.0 | 0 00 | 0 00 | M | attn Notes: * All-zeros being an illegal instruction is normal for ISAs. Ensuring that this remains true at all times i.e. for both 10 bit and 16 bit mode is common sense. * The 10-bit nop (bit 15, M=1) is intended for circumstances where alignment to 32-bit before returning to v3.0B is required. M=1 being an indication "return to Standard v3.0B Encoding Mode". * The 16-bit nop (bit 0, N=1) is intended for circumstances where a return to Standard v3.0B Encoding is required for one cycle but one cycle where alignment to a 32-bit boundary is needed. Examples of this would be to return to "strict" (non-C) mode where the PC may not be on a non-word-aligned boundary. * If for any reason multiple 16 bit nops are needed in succession the M=1 variant can be used, because each one returns to Standard v3.0B Encoding Mode, each time. In essence the 2 nops are needed due to there being 2 different C forms: 10 and 16 bit. ### Branch TODO: document that branching whilst using mode-switching bits (M/N) is perfectly well permitted, the caveat being: it is specifically and wholly the complier/assembler writers responsibility to obey ABI rules and ensure that even with branches and returns that, at no time, is an incorrect mode entered or left that could result in any instruction being misinterpreted. | 16-bit mode | | 10-bit mode | | 0 | 1 | 234 | | 567.8 | 9 ab | c de | f | | - | - | --- | | ----- | ----- | ------ | - | | N | offs2 | | 000.LK | offs!=0 | M | b, bl | N | | | 000.1 | 0 00 | 0 00 | M | TBD | 1 | offs2 | | 000.LK | BI | BO1 oo | 1 | bc, bcl | N | BO3 BI3 | | 001.0 | LK BI | BO | M | bclr, bclrl 16 bit mode: * bc only available when N,M=0b11 * offs2 extends offset in MSBs * BI3 extends BI in MSBs to allow selection of full CR * BO3 extends BO * bc offset constructed from oo as LSBs and offs2 as MSBs * bc BI allows selection of all bits from CR0 or CR1 * bc CR check is always active (as if BO0=1) therefore BO1 inverts 10 bit mode: * illegal (all zeros) covers part of branch (offs=0,M=0,LK=0) * nop also covers part of branch (offs=0,M=0,LK=1) * bc **not available** in 10-bit mode * BO[0] enables CR check, BO[1] inverts check * BI refers to CR0 only (4 bits of) * no Branch Conditional with immediate * no Absolute Address * CTR mode allowed with BO[2] for b only. * offs is to 2 byte (signed) aligned * all branches to 2 byte aligned ### LD/ST Note: for 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed) | 16-bit mode | | 10-bit mode | | 0 | 1 | 234 | | 567.8 | 9 a b | c d e | f | | - | -- | --- | | ----- | ----- | ----- | - | | N | SZ | RB | | 001.1 | 1 RA | 0 RT | M | st | N | SZ | RB | | 001.1 | 1 RA | 1 RT | M | fst | N | SZ | RT | | 111.0 | RA | RB | M | ld | N | SZ | RT | | 111.1 | RA | RB | M | fld * elwidth overrides can set different widths 16 bit mode: * SZ=1 is 64 bit, SZ=0 is 32 bit 10 bit mode: * RA and RB are only 2 bit (0-3) * for LD, RT is implicitly RB: "ld RT=RB, RA(RB)" * for ST, there is no offset: "st RT, RA(0)" ### Arithmetic * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed) * 16-bit: note that bit 1==0 (sub-sub-encoding) 10 and 16 bit: | 16-bit mode | | 10-bit mode | | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f | | - | - | --- | | ----- | --- | ----- | - | | N | 0 | RT | | 010.0 | RB | RA!=0 | M | add | N | 0 | RT | | 010.1 | RB | RA|0 | M | sub. | N | 0 | BF | | 011.0 | RB | RA|0 | M | cmpl Notes: * sub. and cmpl: default CR target is CR0 * for (RA|0) when RA=0 the input is a zero immediate, meaning that sub. becomes neg. and cmp becomes cmpi against zero * RT is implicitly RB: "add RT(=RB), RA, RB" * Opcode 0b010.0 RA=0 is not missing from the above: it is a system-wide instruction, "cbank" (section below) 16 bit mode only: | 0 | 1 | 234 | | 567.8 | 9ab | cde | f | | - | - | --- | | ----- | --- | ----- | - | | N | 1 | RA | | 010.0 | RB | RS | M | sld. | N | 1 | RA | | 010.1 | RB | RS!=0 | M | srd. | N | 1 | RA | | 010.1 | RB | 000 | M | srad. | N | 1 | BF | | 011.0 | RB | RA|0 | M | cmpw Notes: * for srad, RS=RA: "srad. RA(=RS), RS, RB" ### Logical * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed) * 16-bit: note that bit 1==0 (sub-sub-encoding) 10 and 16 bit: | 16-bit mode | | 10-bit mode | | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f | | - | - | --- | | ----- | --- | ----- | - | | N | 0 | RT | | 100.0 | RB | RA!=0 | M | and | N | 0 | RT | | 100.1 | RB | RA!=0 | M | nand | N | 0 | RT | | 101.0 | RB | RA!=0 | M | or | N | 0 | RT | | 101.1 | RB | RA!=0 | M | nor/mr | N | 0 | RT | | 100.0 | RB | 0 0 0 | M | popcnt | N | 0 | RT | | 100.1 | RB | 0 0 0 | M | cntlz | N | 0 | RT | | 101.0 | RB | 0 0 0 | M | extsw | N | 0 | RT | | 101.1 | RB | 0 0 0 | M | not 16-bit mode only (note that bit 1 == 1): | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f | | - | - | --- | | ----- | --- | ----- | - | | N | 1 | RT | | 100.0 | RB | RA!=0 | M | TBD | N | 1 | RT | | 100.1 | RB | RA!=0 | M | TBD | N | 1 | RT | | 101.0 | RB | RA!=0 | M | xor | N | 1 | RT | | 101.1 | RB | RA!=0 | M | eqv (xnor) | N | 1 | RT | | 100.0 | RB | 0 0 0 | M | setvl. | N | 1 | RT | | 100.1 | RB | 0 0 0 | M | cnttz | N | 1 | RT | | 101.0 | RB | 0 0 0 | M | extsb | N | 1 | RT | | 101.1 | RB | 0 0 0 | M | extsh 10 bit mode: * idea: for 10bit mode, nor is actually 'mr' because mr is a more common operation. in 16bit however, this encoding (Cmaj.min=0b101.1, N=0) is 'nor' * for (RA|0) when RA=0 the input is a zero immediate, meaning that nor becomes not * cntlz, popcnt, exts **not available** in 10-bit mode * RT is implicitly RB: "and RT(=RB), RA, RB" ### Floating Point Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed) * 16-bit: note that bit 1==0 (sub-sub-encoding) 10 and 16 bit: | 16-bit mode | | 10-bit mode | | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f | | - | - | --- | | ----- | --- | ----- | - | | N | | RT | | 011.1 | RB | RA!=0 | M | fsub. | N | 0 | RT | | 110.0 | RB | RA!=0 | M | fadd | N | 0 | RT | | 110.1 | RB | RA!=0 | M | fmul | N | 0 | RT | | 011.1 | RB | 0 0 0 | M | fneg. | N | 0 | | | 110.0 | | 0 0 0 | M | TBD | N | 0 | | | 110.1 | | 0 0 0 | M | TND 16-bit mode only (note that bit 1 == 1): | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f | | - | - | --- | | ----- | --- | ----- | - | | N | 1 | | | 011.1 | | RA!=0 | M | TBD | N | 1 | | | 110.0 | | RA!=0 | M | TBD | N | 1 | RT | | 110.1 | RB | RA!=0 | M | fdiv | N | 1 | RT | | 011.1 | RB | 0 0 0 | M | fabs. | N | 1 | RT | | 110.0 | RB | 0 0 0 | M | fmr. | N | 1 | | | 110.1 | | 0 0 0 | M | TBD 16 bit only, FP to INT convert (using C 0b001.1 subencoding) | 0 | 123 | 4 | | 567.8 | 9 ab | cde | f | | - | --- | - | | ----- | ---- | ---- | - | | N | 101 | X | | 001.1 | 0 RA | Y RT | M | fp2int | N | 110 | X | | 001.1 | 0 RA | Y RT | M | int2fp * X: signed=1, unsigned=0 * Y: FP32=0, FP64=1 10 bit mode: * fsub. fneg. and fmr. default target is CR1 * fmr. is **not available** in 10-bit mode * fdiv is **not available** in 10-bit mode 16 bit mode: * fmr. copies RB to RT (and sets CR1) ### Condition Register 10-bit or 16 bit: | 16-bit mode| | 10-bit mode | | 0 | 123 | 4 | | 567.8 | 9 ab | cde | f | | - | --- | --- | | ----- | ---- | --- | - | | N | 000 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf 16-bit only: | 0 | 1234 | | 567.8 | 9 ab | cde | f | | - | ---- | | ----- | ---- | --- | - | | N | 0010 | | 001.1 | 0 BA | BB | M | crnor | N | 0011 | | 001.1 | 0 BA | BB | M | crandc | N | 0100 | | 001.1 | 0 BA | BB | M | crxor | N | 0101 | | 001.1 | 0 BA | BB | M | crnand | N | 0110 | | 001.1 | 0 BA | BB | M | crand | N | 0111 | | 001.1 | 0 BA | BB | M | creqv | N | 1000 | | 001.1 | 0 BA | BB | M | crorc | N | 1001 | | 001.1 | 0 BA | BB | M | cror Notes 10 bit mode: * mcrf BF is only 2 bits which means the destination is only CR0-CR3 * CR operations: **not available** in 10-bit mode (but mcrf is) 16 bit mode: * mcrf BF2 extends BF (in MSB) to 3 bits * CR operations: destination register is same as BA. * CR operations: only possible on CR0 and CR1 SV (Vector Mode): * CR operations: greatly extended reach/range (useful for predicates) ### System cbank: Selection of Compressed-encoding "Bank". Different "banks" give different meanings to opcodes. Example: CBank=0b001 is heavily optimised to A/Video Encode/Decode. cbank borrows from add's encoding space (when RA==0) | 16-bit mode | | 10-bit mode | | 0 | 1 2 3 4 | | 567.8 | 9ab | cde | f | | - | ------- | | ----- | ----- | --- | - | | N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank **not available** in 10-bit mode, **only** in 16-bit mode: | 0 | 1 | 234 | | 567.8 | 9 ab | cde | f | | - | ------- | | ----- | ---- | ---- | - | | N | 1 | 111 | | 001.1 | 0 00 | RT | M | mtlr | N | 1 | 111 | | 001.1 | 0 01 | RT | M | mtctr | N | 1 | 111 | | 001.1 | 0 00 | RA | M | mflr | N | 1 | 111 | | 001.1 | 0 01 | RA | M | mfctr | N | 0 RA!=0 | | 000.0 | 0 00 | 000 | M | mtcr | N | 1 RT!=0 | | 000.0 | 0 00 | 000 | M | mfcr ### Unallocated 16-bit only: | 0 | 1 | 234 | | 567.8 | 9 ab | cde | f | | - | - | --- | | ----- | ---- | ---- | - | | N | 1 | 111 | | 001.1 | 0 10 | | M | | N | 1 | 111 | | 001.1 | 0 11 | | M | # Other ideas (Attempt 2) ## 8-bit mode-switching instructions, odd addresses for C mode Drop the complexity of the 16-bit encoding further reduced to 10-bit, and use a single byte instead of two to switch between modes. This would place compressed (C) mode instructions at odd bytes, so the LSB of the PC can be used for the processor to tell which mode it is in. To switch from traditional to compressed mode, the single-byte instruction would be at the MSByte, that holds the EXT bits. (When we break up a 32-bit instruction across words, the most significant half should go in the word with the lower address.) To switch from compressed mode to traditional mode, the single-byte instruction would also be at the opcode/format portion, placed in the lower-address word if split across words, so that the instruction can be recognized as the mode-switching one without going for its second byte. The C-mode nop should be encoded so that its second byte encodes a switch to compressed mode, if decoded in traditional mode. This enables such a nop to straddle across a label: 8-bit first half of nop Label: 8-bit second half of nop AKA switch to compressed mode 16-bit insns... so that if traditional code jumps to the word-aligned label (because traditional branches drop the 2 LSB), it immediately switches to compressed mode; if we fall-through, we remain in 16-bit mode; and if we branch to it from compressed mode, whether we jump to the odd or the even address, we end up in compressed mode as desired. Tables explaining encoding: | byte 0 | byte 1 | byte 2 | byte 3 | | v3.0B standard 32 bit instruction | | EXT000 | 16 bit | 16... | | .. bit | 8nop | v3.0b stand... | | .. ard 32 bit | EXT000 | 16... | | .. bit | 16 bit | 8nop | | v3.0B standard 32 bit instruction | # Other ideas (v3) FSM state switching and mode switching deemed too complex. Instead cut back to 1. 10bit only (actually, 11 bit) 2. SV-Prefixed 16bit only (aka SV-C32) Each will be entirely different which is a huge amount of work. # TODO * make a preliminary assessment of branch in/out viability * confirm FSM encoding (is LSB of PC really enough?) * guestimate opcode and register allocation (without necessarily doing a full encoding) * write throwaway python program that estimates compression ratio from objdump raw parsing * finally do full opcode allocation * rerun objdump compression ratio estimates * check in FSM if "return to v3.0B then 16bit" if it is ok to have the v3.0B be a 10bit Compressed. should this be ignored and carry on? should a trap occur? ### Use 2- rather than 3-register opcodes Successful compact ISAs have used 2- rather than 3-register insns, in which the same register serves as input and output. Some 20% of general-purpose 3-register insns already use either input register as output, without any effort by the compiler to do so. Repurposing the 3 bits used to encode one one of the input registers in arithmetic, logical and floating-pointer registers, and the 2 bits used to encode the mode of the next two insns, we could make the full register files available to the opcodes already selected for compressed mode, with one bit to spare to bring additional opcodes in. An opcode could be assigned to an instruction that combines and extends with the subsequent instruction, providing it with a separate input operand to use rather than the output register, or with additional range for immediate and offset operands, effectively forming a 32-bit operation, enabling us to remain in compressed mode even longer. # Appendix ## Analysis techniques and tools objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /; s/^[ x0-9A-F]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' | sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g; s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count | sort -n | less ## gcc register allocation FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed registers (assigned to special purposes) and register allocation order: Special-purpose registers on ppc are: r0: constant zero/throw-away r1: stack pointer r2: thread-local storage pointer in 32-bit mode r2: non-minimal TOC register r10: EH return stack adjust register r11: static chain pointer r13: thread-local storage pointer in 64-bit mode r30: minimal-TOC/-fPIC/-fpic base register r31: frame pointer lr: return address register the register allocation order in GCC (i.e., it takes the earliest available register that fits the constraints) is: We allocate in the following order: fp0 (not saved or used for anything) fp13 - fp2 (not saved; incoming fp arg registers) fp1 (not saved; return value) fp31 - fp14 (saved; order given to save least number) cr7, cr5 (not saved or special) cr6 (not saved, but used for vector operations) cr1 (not saved, but used for FP operations) cr0 (not saved, but used for arithmetic operations) cr4, cr3, cr2 (saved) r9 (not saved; best for TImode) r10, r8-r4 (not saved; highest first for less conflict with params) r3 (not saved; return value register) r11 (not saved; later alloc to help shrink-wrap) r0 (not saved; cannot be base reg) r31 - r13 (saved; order given to save least number) r12 (not saved; if used for DImode or DFmode would use r13) ctr (not saved; when we have the choice ctr is better) lr (saved) r1, r2, ap, ca (fixed) v0 - v1 (not saved or used for anything) v13 - v3 (not saved; incoming vector arg registers) v2 (not saved; incoming vector arg reg; return value) v19 - v14 (not saved or used for anything) v31 - v20 (saved; order given to save least number) vrsave, vscr (fixed) sfp (fixed) ## Comparison to VLE VLE was a means to reduce executable size through three interleaved methods: * (1) invention of 16 bit encodings (of exactly 16 bit in length) * (2) invention of 16+16 bit encodings (a 16 bit instruction format but with an *additional* 16 bit immediate "tacked on" to the end, actually making a 32-bit instruction format) * (3) seamless and transparent embedding and intermingling of the above in amongst arbitrary v2.06/7 BE 32 bit instruction sequences, with no additional state, including when the PC was not aligned on a 4-byte boundary. Whilst (1) and (3) make perfect sense, (2) makes no sense at all given that, as inspection of "ori" and others show, I-Form 16 bit immediates is the "norm" for v2.06/7 and v3.0B standard instructions. (2) in effect **is** a 32 bit instruction. (2) **is not** a 16 bit instruction. *Why "reinvent" an encoding that is 32 bit, when there already exists a 32 bit encoding that does the exact same job?* Consequently, we do **not** envisage a scenario where (2) would ever be implemented, nor in the future would this Compressed Encoding be extended beyond 16 bit. Compressed is Compressed and is **by definition** limited to precisely - and only - 16 bit. The additional reason why that is the case is because VLE is exceptionally complex to implement. In a single-issue, low clock rate "Embedded" environment for which VLE was originally designed, VLE was perfectly well matched. However this Compressed Encoding is designed for High performance multi-issue systems *as well* as Embedded scenarios, and consequently, the complexity of "deep packet inspection" down into the depths of a 16 bit sequence in order to ascertain if it might not be 16 bit after all, is wholly unacceptable. By eliminating such 16+16 (actually, 32bit conflation) tricks outlined in (2), Compressed is *specifically* designed to fit into a very small FSM, suitable for multi-issue, that in no way requires "deep-dive" analysis. Yet, despite it never being designed with 16 bit encodings in mind, is still suitable for retro-fitting onto OpenPOWER. ## Compressed Decoder Phases Phase 1 (stage 1 of a 2-stage pipelined decoder) is defined as the minimum necessary FSM required to determine instruction length and mode. This is implemented with the absolute bare minimum of gates and is based on the 6 encodings involving N, M and EXTNNN (see table, below) Phase 2 (stage 2 of a 2-stage pipelined decoder) is defined as the "full decoder" that includes taking into account the length and mode from Phase 1. Given a 2-stage pipelined decoder it is categorically **impossible** for Phase 2 to go backwards in time and affect the decisions made in Phase 1. These two phases are specifically designed to take multi-issue execution into account. Phase 1 is intended to be part of an O(log N) algorithm that can use a form of carry-lookahead propagation. Phase 2 is intended to be on a 2nd pipelined clock cycle, comprising a separate suite of independent local-state-only parallel pipelines that do not require any inter-communication of any kind. Table: Reminder of the 6 16-bit encodings: | 0 | 1234 | 567 8 | 9abcde | f | explanation | - | ---- | ------ | ------ | - | ----------- | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit | 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit ### Phase 1 The Phase 1 length/mode identification takes into account only 3 pieces of information: * extc_id: insn[0:4] == EXTNNN (Compressed) * N: insn[0] * M: insn[15] The Phase 1 length/mode produces the following lengths/modes: * 32 - v3.0B (includes v3.0B followed by 16bit) * 16 - 10bit * 16 - 16bit **NOTE THAT FURTHER SUBIDENTIFICATION OF C MODES IS NOT CARRIED OUT AT PHASE 1**. In particular note specifically that 16 bit "immediate mode" is **not** part of the Phase 1 FSM, but is specifically isolated to Phase 2. Pseudocode: # starting point for FSM previ = v3.0B if previ.mode == v3.0B: # previous was v3.0B, look for compressed tag if extc_id: # found it. move to 10bit mode nexti.length = 16 nexti.mode = 10bit else: # nope. stay in v3.0B nexti.length = 32 nexti.mode = v3.0B elif previ.mode == 10bit: # previous was v3.0B, move to v3.0B or 16bit? if M == 0: next.length = 32 nexti.mode = v3.0B else: # otherwise stay in 16bit mode nexti.length = 16 nexti.mode = 16bit elif previ.mode == 16bit: # previous was 16bit, stay there or move? if M == 0: # back to v3.0B next.length = 32 if N == 1: # ... but only for 1 insn nexti.mode = v3.0B_then_16bit else: nexti.mode = v3.0B else: # otherwise stay in 16bit mode nexti.length = 16 nexti.mode = 16bit # rest of FSM involving 3.0B to 16bit # and back transitions left to implementor # (or for someone else to add) ### Phase 2: Compressed mode At this phase, knowing that the length is 16bit and the mode is either 10b or 16b, further analysis is required to determine if the 16bit.immediate encoding is active, and so on. This is a fully combinatorial block that **at no time** steps outside of the strict bounds already determined by Phase 1. op_001_1 = insn[5:8] != 0b001.1 if mode == 10bit: decode_10bit(insn) elif mode == 16bit: if N == 1 & M == 1 & op_001_1 # see immediate opcodes table decode_16bit_immed_mode(insn) if op_001_1: # see CR and System tables # (16 bit ones at least) decode_16bit_cr_or_sys(insn) else: decode_16bit_nonimmed_mode(insn) From this point onwards each of the decode_xx functions perform straightforward combinatorial decoding of the 16 bits of "insn". In sone cases this involves further analysis of bit 1, in some cases (Cmaj.m = 0b010.1) even further deep-dive decoding is required (CR ops). *All* of it is entirely combinatorial and at **no time** involves changing of, or interaction with, or disruption of, the Phase 1 determination of Length+Mode (that has *already taken place* in an earlier decoding pipeline time-schedule) ### Phase 2: v3.0B mode Standard v3.0B decoders are deployed. Absolutely no interaction occurs with any 16 bit decoders or state. Absolutely no interaction with the earlier Phase 1 decoding occurs. Absolutely no interaction occurs whatsoever (assuming an implementation that does not perform macro-op fusion) between other multi-issued v3.0B instructions being decoded in parallel at this time. ## Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode [[demo]] ### Efficient Decoding Algorithm [[decoding]]