[[!tag standards]]

# 16 bit Compressed

Similar to VLE (but without immediate-prefixing) this encoding is designed
to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR
is recommended). Note that Compressed is *mutually incompatible*
with OpenPOWER v3.1B "prefixing" due to using (requiring) both EXT000
and EXT001. Hypothetically it could be made to use anything other than
EXT001, with some inconvenience (extra gates).  The incompatibility is
"fixed" by swapping out of "Compressed" Mode and back into "Normal"
(v3.1B) Mode, at runtime, as needed.

Although initially intended to be augmented by Simple-V Prefixing (to
add Vector context, width overrides such as IEEE754 FP16, and predication)
without putting pressure on I-Cache power or size, this Compressed Encoding
is not critically dependent *on* SV Prefixing, and may be used stand-alone.

See:

* <https://bugs.libre-soc.org/show_bug.cgi?id=238>
* <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding
* <http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html>

This one is a conundrum.  OpenPOWER ISA was never designed with 16
bit in mind.  VLE was added 10 years ago but only by way of marking
an entire 64k page as "VLE".  With VLE not maintained it is not
fully compatible with current PowerISA.

Here, in order to embed 16 bit into a predominantly 32 bit stream,
spending an entire 16 bits just to switch into Compressed mode
is a significant overhead.  The situation is made worse by
OpenPOWER ISA being fundamentally designed with 6 bits uniformly
taking up Major Opcode space, leaving only 10 bits to allocate
to actual instructions.

Contrast this with RVC which takes 3 out of 4 combinations of the first 2
bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and
uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing
standard 32 bit and 16 bit to intermingle cleanly.  To achieve the same
thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which
is clearly impractical: other schemes need to be devised.
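
As a rough illustration of the RVC scheme described above, the following
sketch classifies a RISC-V instruction by the two least-significant bits
of its first 16-bit parcel.  The function name is hypothetical, and only
the 16/32-bit split is modelled (longer RISC-V encodings are ignored).

```python
def rvc_length(parcel):
    """Length in bits of a RISC-V instruction, given its first 16-bit parcel.

    0b00, 0b01 and 0b10 in the two LSBs mean "16-bit"; 0b11 is the escape
    to the standard 32-bit encoding.  Longer encodings are not modelled.
    """
    return 32 if (parcel & 0b11) == 0b11 else 16


print(rvc_length(0x0001))  # a compressed parcel (LSBs 0b01)
print(rvc_length(0x0013))  # low parcel of a 32-bit instruction (LSBs 0b11)
```

No equivalent two-bit test exists on OpenPOWER, where the length decision
would have to be made on the full 6-bit Major Opcode.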

In addition we would like to add SV-C32 which is a Vectorized version
of 16 bit Compressed, and ideally have a variant that adds the 27-bit
prefix format from SV-P64, as well.

Potential ways to reduce pressure on the 16 bit space are:

* To use more than one v3.0B Major Opcode, preferably an odd-even
  contiguous pair
* To provide "paging".  This involves bank-switching to alternative
  optimised encodings for specific workloads
* To enter "16 bit mode" for durations specified at the start
* To reserve one bit of every 16 bit instruction to indicate that the
  16 bit mode is to continue to be sustained

This latter would be useful in the Vector context to have an alternative
meaning: as the bit which determines whether the instruction is 11-bit
prefixed or 27-bit prefixed:

    0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | 11 bit vector prefix|
    |16 bit opcode  alt vec. mode ^ |
    | extra vector prefix if alt set|

Using a major opcode to enter 16 bit mode leaves 11 bits for which a
use has to be found:

    0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | what to do here   1 |
    |16 bit    stay in 16bit mode 1 |
    |16 bit    stay in 16bit mode 1 |
    |16 bit       exit 16bit mode 0 |

One possibility is that the 11 bits are used for bank selection,
with some room for additional context such as altering the registers
used for the 16 bit operations (bank selection of which scalar regs).
However the downside is that short sequences of Compressed instructions
become penalised by the fixed overhead.  Even a single 16 bit instruction
requires a 16 bit overhead to "gain access" to 16 bit "mode", making
the exercise pointless.

An alternative is to use the first 11 bits for only the most commonly
used instructions.  That being the case then one of those 11 bits could
be dedicated to saying if 16 bit mode is to be continued, at which
point *all* 16 bits can be used for Compressed.  10 bits remain for
actual opcodes, which is ridiculously tight, however the opportunity to
subsequently use all 16 bits is worth it.
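
The trade-off between the two approaches can be put into numbers.  The
sketch below is purely illustrative: it assumes that every instruction in
a run has a Compressed form and ignores any alignment padding.  All
function names are hypothetical.

```python
def bits_uncompressed(n):
    """n standard 32-bit instructions."""
    return 32 * n


def bits_with_entry_marker(n):
    """A dedicated 16-bit 'enter 16 bit mode' marker, then n 16-bit insns."""
    return 16 + 16 * n


def bits_with_10bit_entry(n):
    """The first insn is itself useful (10-bit form), the rest are 16-bit."""
    return 16 * n


for n in (1, 2, 4, 8):
    print(n, bits_uncompressed(n), bits_with_entry_marker(n),
          bits_with_10bit_entry(n))
```

With a dedicated marker, a run of one instruction costs exactly as much
as the 32-bit original, so nothing is gained; with the 10-bit entry form
even a single instruction saves 16 bits.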

The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:

    |0 1 2 3 4 5 6 7 8 9 a b c d e f|
    |major op..0| LO Half C space   |
    |major op..1| HI Half C space   |
    |N N N N N|<--11 bits C space-->|

If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this
saves gates at a critical part of the decode phase.
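
A minimal sketch of why the contiguous pair helps: only the top 5 bits
need comparing, and the 6th bit of the Major Opcode simply becomes the
top bit of the 11-bit C space.  The value of NNNNN used here (all zeros,
i.e. EXT000/EXT001) is an assumption taken from the exploration further
down this page; the function names are hypothetical.

```python
PAIR_PREFIX = 0b00000  # assumed NNNNN: EXT000 and EXT001 share these 5 bits


def is_compressed_tag(hw):
    """True if bits 0-4 (MSB0 numbering) of a 16-bit halfword match NNNNN."""
    return (hw >> 11) == PAIR_PREFIX


def c_space(hw):
    """The remaining 11 bits: low bit of the Major Opcode plus bits 6-15."""
    return hw & 0x7FF


def major_opcode(hw):
    """The full 6-bit Major Opcode, for comparison."""
    return hw >> 10
```

Both `0x0000` (EXT000) and `0x0400` (EXT001) pass `is_compressed_tag`
with a single 5-bit compare.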

## ABI considerations

Unlike RISC-V RVC, the above "context" encodings require state, to be stored
in the PCR, MSR, or a dedicated SPR.  These bits (just like LE/BE 32bit
mode and the IEEE754 FPSCR mode) all require taking that context into
consideration.

In particular it is critically important to recognise that context (in
general) is an implicit part of the ABI implemented for example by glibc6.
Therefore (in specific) Compressed Mode Context **must not** be permitted
to cross into or out of a function call.

Thus it is the mandatory responsibility of the compiler to ensure that
context returns to "v3.0B Standard" prior to entering a function call
(responsibility of caller) and prior to exit from a function call
(responsibility of callee) by setting appropriate M and N bits.

If however it is known to the compiler that certain static leaf node
functions and their immediate callers will never, under any circumstances,
be called by external ABI-compliant code, then of course the compiler may
choose to write such static functions as it sees fit.

Trap Handlers also take responsibility for saving and restoring of
Compressed Mode state, just as they already take responsibility for
other critical state.  This makes traps transparent to functions as
far as Compressed Mode Context is concerned, just as traps are already
transparent to functions.

Note however that there are exceptions in a compiler to the otherwise
hard rule that Compressed Mode context not be permitted to cross function
boundaries: inline functions and static functions.  Static functions,
if correctly identified as never to be called externally, may, as an
optimisation, disregard standard ABIs, bearing in mind that this will
be fraught (pointers to functions) and not easy to get right.

# Opcode Allocation Ideas

* one bit from the 16-bit mode is used to indicate that standard
  (v3.0B) mode is to be dropped into for only one single instruction
  <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>

## Opcodes exploration (Attempt 1)

Switching between different encoding modes is controlled by M (alone)
in 10-bit mode, and M and N in 16-bit mode.

* M in 10-bit mode if zero indicates that following instructions are
  standard OpenPOWER ISA 32-bit encoded (including, redundantly,
  further 10/16-bit instructions)
* M in 10-bit mode if 1 indicates that following instructions are
  in 16-bit encoding mode

Once in 16-bit mode:

* 0b01 (M=1, N=0): stay in 16-bit mode
* 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
* 0b10: leave 16-bit mode for one cycle (return to standard OpenPOWER ISA)
* 0b11: free to be used for something completely different.

The current "top" idea for 0b11 is to use it for a new encoding format
of predominantly "immediates-based" 16-bit instructions (branch-conditional,
addi, mulli etc.)

* The Compressed Major Opcode is in bits 5-7.
* Minor opcode in bit 8.
* In some cases bit 9 is taken as an additional sub-opcode, followed
  by bits 0-4 (for CR operations)
* M+N mode-switching is not available for C-Major.minor 0b001.1
* 10 bit mode may be expanded by 16 bit mode, adding capabilities
  that do not fit in the extreme limited space.

Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit.
16-bit immediate mode remains in 16-bit.

    | 0 | 1234 | 567  8 | 9abcde | f | explanation
    | - | ---- | ------ | ------ | - | -----------
    | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
    | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
    | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
    | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
    | 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
    | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

Notes:

* Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
* EXT000 and EXT001 are v3.0B Major Opcodes.  The first 5 bits
  are zero, therefore the 6th bit is actually part of Cmaj.
* "10bit then 16bit" means "this instruction is encoded C 10bit
  and the following one in C 16bit"
* "16b, 1x v3.0B, 16b" means, "this instruction is encoded C 16bit,
  the following one is V3.0B Standard, and the one after that is
  back to 16bit".
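
The six rows of the table can be exercised with a small Python model.
This is a sketch only, not a definitive implementation: it tracks
instruction length and encoding, nothing else.  One assumption is made
loudly: during the "1x v3.0B" slot the instruction is taken to be a
plain 32-bit one without looking for the EXT000/1 tag, which is an open
question in this design.  All names are hypothetical.

```python
V30B, C10, C16 = "v3.0B", "10bit", "16bit"
V30B_ONCE = "v3.0B_then_16bit"


def bit(hw, n):
    """Bit n of a 16-bit halfword in MSB0 numbering (bit 0 is the MSB)."""
    return (hw >> (15 - n)) & 1


def walk(halfwords):
    """Return [(encoding, length_in_bits), ...] for a list of halfwords."""
    expect = V30B  # what the next instruction is expected to be
    out, i = [], 0
    while i < len(halfwords):
        hw = halfwords[i]
        if expect == C16:
            n, m = bit(hw, 0), bit(hw, 15)
            out.append((C16, 16))
            i += 1
            if m:
                expect = C16        # N=0: plain 16bit, N=1: immediate form
            elif n:
                expect = V30B_ONCE  # one v3.0B instruction, then back
            else:
                expect = V30B
        elif expect == V30B and (hw >> 11) == 0:
            # EXT000/EXT001 tag found: a 10-bit Compressed instruction
            out.append((C10, 16))
            i += 1
            expect = C16 if bit(hw, 15) else V30B
        else:
            # ordinary 32-bit instruction, occupying two halfwords
            # (ASSUMPTION: the tag is not checked in the V30B_ONCE slot)
            out.append((V30B, 32))
            i += 2
            expect = C16 if expect == V30B_ONCE else V30B
    return out


stream = [0x0001, 0x0101, 0x8100, 0x7C00, 0x0000, 0x0100, 0x3860, 0x0001]
for entry in walk(stream):
    print(entry)
```

The example stream enters via a 10-bit instruction with M=1, stays in
16bit for one instruction, takes a single v3.0B instruction (N=1, M=0),
returns to 16bit and finally drops back to v3.0B (N=0, M=0).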

### C Instruction Encoding types

10-bit Opcode formats (all start with v3.0B EXT000 or EXT001
Major Opcodes)

    | 01234    | 567  8 | 9  | a b | c  | d e | f | enc
    | E01      | Cmaj.m | fld1     | fld2     | M | 10b
    | E01      | Cmaj.m | offset              | M | 10b b
    | E01      | 001.1  | S1 | fd1 | S2 | fd2 | M | 10b sub
    | E01      | 111.m  | fld1     | fld2     | M | 10b LDST

16-bit Opcode formats (including 10/16/v3.0B Switching)

    | 0 | 1234 | 567  8 | 9  | a b | c  | d e | f | enc
    | N | immf | Cmaj.m | fld1     | fld2     | M | 16b
    | 1 | immf | Cmaj.m | fld1     | imm      | 1 | 16b imm
    | N | fd3  | 001.1  | S1 | fd1 | S2 | fd2 | M | 16b sub
    | N | fd4  | 111.m  | fld1     | fld2     | M | 16b LDST

Notes:

* fld1 and fld2 can contain reg numbers, immediates, or opcode
  fields (BO, BI, LK)
* S1 and S2 are further sub-selectors of C 001.1
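
As an aid to reading the tables, here is a sketch that splits a halfword
into the fields of the plain "16b" format, using the MSB0 bit numbering
of the column headers.  The helper names are hypothetical and the field
boundaries are read directly off the table above.

```python
def field(hw, first, last):
    """Bits first..last (inclusive, MSB0 numbering) of a 16-bit halfword."""
    width = last - first + 1
    return (hw >> (15 - last)) & ((1 << width) - 1)


def split_16b(hw):
    """Split a halfword according to the '16b' row of the format table."""
    return {
        "N": field(hw, 0, 0),
        "immf": field(hw, 1, 4),
        "Cmaj": field(hw, 5, 7),
        "m": field(hw, 8, 8),
        "fld1": field(hw, 9, 11),
        "fld2": field(hw, 12, 14),
        "M": field(hw, 15, 15),
    }


print(split_16b(0b1_0101_010_1_011_100_1))
```

For the 10-bit formats the same helper applies, with bits 0-4 occupied by
the EXT000/EXT001 tag instead of N and immf.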

### Immediate Opcodes

only available in 16-bit mode, only available when M=1 and N=1
and when Cmaj.min is not 0b001.1.

instruction counts from objdump on /bin/bash:

      466 extsw r1,r1
      649 stw r1,1(r1)
      691 lwz r1,1(r1)
      705 cmpdi r1,1
      791 cmpwi r1,1
      794 addis r1,r1,1
     1474 std r1,1(r1)
     1846 li r1,1
     2031 mr r1,r1
     2473 addi r1,r1,1
     3012 nop
     3028 ld r1,1(r1)


    | 0 | 1  | 2 | 3 4 | | 567.8 | 9ab  | cde | f |
    | 1 | 0  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
    | 1 | 0  |  sh2    | | 001.0 | RA   | sh  | 1 | sradi.
    | 1 | 1  | 0   0 0 | | 001.0 |      | 000 | 1 | TBD
    | 1 | 1  | 0 | sh2 | | 001.0 | RA   | sh  | 1 | srawi.
    | 1 | 1  | 1 |     | | 001.0 | 000  | imm | 1 | TBD
    | 1 | 1  | 1 | i2  | | 001.0 | RA!=0| imm | 1 | addis
    | 1 | 0  | i2      | | 010.0 | 000  | imm | 1 | setvli
    | 1 | 1  | i2      | | 010.0 | 000  | imm | 1 | setmvli
    | 1 | i2           | | 010.0 | RA!=0| imm | 1 | addi
    | 1 | 0  | i2      | | 010.1 | RA   | imm | 1 | cmpdi
    | 1 | 1  | i2      | | 010.1 | RA   | imm | 1 | cmpwi
    | 1 | 0  | i2      | | 011.0 | RT   | imm | 1 | ldspi
    | 1 | 1  | i2      | | 011.0 | RT   | imm | 1 | lwspi
    | 1 | 0  | i2      | | 011.1 | RT   | imm | 1 | stwspi
    | 1 | 1  | i2      | | 011.1 | RT   | imm | 1 | stdspi
    | 1 | i2 | RA      | | 100.0 | RT   | imm | 1 | stwi
    | 1 | i2 | RA      | | 100.1 | RT   | imm | 1 | stdi
    | 1 | i2 | RT      | | 101.0 | RA   | imm | 1 | ldi
    | 1 | i2 | RT      | | 101.1 | RA   | imm | 1 | lwi
    | 1 | i2 | RA      | | 110.0 | RT   | imm | 1 | fsti
    | 1 | i2 | RA      | | 110.1 | RT   | imm | 1 | fstdi
    | 1 | i2 | RT      | | 111.0 | RA   | imm | 1 | flwi
    | 1 | i2 | RT      | | 111.1 | RA   | imm | 1 | fldi

Construction of immediate:

* LD/ST r1 (SP) variants should be offset by -256,
  see <https://bugs.libre-soc.org/show_bug.cgi?id=238#c43>
  - SP variants map to e.g ld RT, imm(r1)
  - SV Prefixing can be used to map r1 to alternate regs
* addis is not the same as v3.0B addis: the shift amount is smaller and
  actually still maps to within the v3.0B addi immediate range.
* addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
* addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1023 in
  increments of 8
* all others are EXTS(i2||imm) to give a 7-bit range -128 to +127
  (further for LD/ST due to word/dword-alignment)
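
The EXTS(a||b) notation used above can be sketched generically.  Field
widths are passed as parameters rather than hard-coded, because the
widths of i2 and imm differ from row to row in this draft table.  The
helper names are hypothetical.

```python
def exts(value, width):
    """Sign-extend the low `width` bits of value (two's complement)."""
    value &= (1 << width) - 1
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def concat(*fields):
    """Concatenate (value, width) pairs, most significant first.

    Returns a (value, width) pair suitable for passing to exts().
    """
    value = width = 0
    for v, w in fields:
        value = (value << w) | (v & ((1 << w) - 1))
        width += w
    return value, width


# a 1-bit i2 and a 3-bit imm: EXTS(i2||imm)
print(exts(*concat((0b1, 1), (0b000, 3))))
# the addis form, EXTS(i2||imm||000), here with a 2-bit i2
print(exts(*concat((0b11, 2), (0b111, 3), (0, 3))))
```

Appending zero bits before sign-extending is what gives the aligned
(multiple-of-8, or multiple-of-4) steps mentioned for addis and LD/ST.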

Further Notes:

* bc also has an immediate mode, listed separately below in Branch section
* for LD/ST, offset is aligned.  8-byte: i2||imm||0b000 4-byte: 0b00
* SV Prefix over-rides help provide alternative bitwidths for LD/ST
* RA|0 if RA is zero, addi. becomes "li"
  - this only works if RT takes part of opcode
  - mv is also possible by specifying an immediate of zero

### Illegal, nop and attn

Note that illeg is all zeros, including in the 16-bit mode.
Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and
EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit
run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000
is "nop"

    | 16-bit mode | | 10-bit mode                 |
    | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
    | - | - | --- | | -----  | ----- | ------ | - |
    | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | illeg
    | 0 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop

16 bit mode only:

    | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
    | - | - | --- | | -----  | ----- | ------ | - |
    | 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 0 | nop
    | 1 | 0   000 | | 000.0  | 0  00 | 0   00 | 1 | nop
    | N | 1   000 | | 000.0  | 0  00 | 0   00 | M | attn

Notes:

* All-zeros being an illegal instruction is normal for ISAs.  Ensuring that 
  this remains true at all times i.e. for both 10 bit and 16 bit mode is
  common sense.
* The 10-bit nop (bit 15, M=1) is intended for circumstances
  where alignment to 32-bit before returning to v3.0B is required.
  M=1 being an indication "return to Standard v3.0B Encoding Mode". 
* The 16-bit nop (bit 0, N=1) is intended for circumstances where a
  return to Standard v3.0B Encoding is required for one cycle
  but one cycle where alignment to a 32-bit boundary is needed.
  Examples of this would be to return to "strict" (non-C) mode
  where the PC may not be on a non-word-aligned boundary.
* If for any reason multiple 16 bit nops are needed in succession
  the M=1 variant can be used, because each one returns to
  Standard v3.0B Encoding Mode, each time.

In essence the 2 nops are needed due to there being 2 different C forms:
10 and 16 bit.

### Branch

TODO: document that branching whilst using mode-switching bits (M/N) is
perfectly well permitted, the caveat being: it is specifically and wholly
the compiler/assembler writer's responsibility to obey ABI rules and ensure
that, even with branches and returns, at no time is an incorrect mode
entered or left that could result in any instruction being misinterpreted.

    | 16-bit mode | | 10-bit mode                 |
    | 0 | 1 | 234 | | 567.8  | 9  ab | c   de | f |
    | - | - | --- | | -----  | ----- | ------ | - |
    | N | offs2   | | 000.LK | offs!=0        | M | b, bl
    | N |         | | 000.1  | 0  00 | 0   00 | M | TBD
    | 1 | offs2   | | 000.LK | BI    | BO1 oo | 1 | bc, bcl
    | N | BO3 BI3 | | 001.0  | LK BI | BO     | M | bclr, bclrl

16 bit mode:

* bc only available when N,M=0b11
* offs2 extends offset in MSBs
* BI3 extends BI in MSBs to allow selection of full CR
* BO3 extends BO
* bc offset constructed from oo as LSBs and offs2 as MSBs
* bc BI allows selection of all bits from CR0 or CR1
* bc CR check is always active (as if BO0=1) therefore BO1 inverts

10 bit mode:

* illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
* nop also covers part of branch (offs=0,M=0,LK=1)
* bc **not available** in 10-bit mode
* BO[0] enables CR check, BO[1] inverts check
* BI refers to CR0 only (4 bits of)
* no Branch Conditional with immediate
* no Absolute Address
* CTR mode allowed with BO[2] for b only.
* offs is to 2 byte (signed) aligned
* all branches to 2 byte aligned

### LD/ST

Note: for 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)

    | 16-bit mode  | | 10-bit mode               |
    | 0 | 1  | 234 | | 567.8 | 9 a b | c d e | f |
    | - | -- | --- | | ----- | ----- | ----- | - |
    | N | SZ |  RB | | 001.1 | 1  RA | 0  RT | M | st
    | N | SZ |  RB | | 001.1 | 1  RA | 1  RT | M | fst
    | N | SZ |  RT | | 111.0 |  RA   |  RB   | M | ld
    | N | SZ |  RT | | 111.1 |  RA   |  RB   | M | fld

* elwidth overrides can set different widths

16 bit mode:

* SZ=1 is 64 bit, SZ=0 is 32 bit

10 bit mode:

* RA and RB are only 2 bit (0-3)
* for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
* for ST, there is no offset: "st RT, RA(0)"

### Arithmetic

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

    | 16-bit mode | | 10-bit mode             |
    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 0 | RT  | | 010.0 | RB  | RA!=0 | M | add
    | N | 0 | RT  | | 010.1 | RB  | RA|0  | M | sub.
    | N | 0 | BF  | | 011.0 | RB  | RA|0  | M | cmpl

Notes:

* sub. and cmpl: default CR target is CR0
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that sub. becomes neg. and cmp becomes cmpi against zero
* RT is implicitly RB: "add RT(=RB), RA, RB"
* Opcode 0b010.0 RA=0 is not missing from the above:
  it is a system-wide instruction, "cbank" (section below)
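
The (RA|0) convention in the notes above can be sketched as follows.
The operand order of sub. is an assumption, chosen so that RA=0 yields a
negation as the notes state; the write-back to RT(=RB) and the CR0 update
are not modelled.  The function names are hypothetical.

```python
def ra_or_zero(regs, ra):
    """(RA|0): the register contents, or constant zero when the field is 0."""
    return 0 if ra == 0 else regs[ra]


def c_sub(regs, ra, rb, xlen=64):
    """ASSUMED operand order: (RA|0) - RB, wrapped to xlen bits."""
    return (ra_or_zero(regs, ra) - regs[rb]) & ((1 << xlen) - 1)


regs = [99, 5, 3]            # regs[0] is deliberately non-zero
print(c_sub(regs, 1, 2))     # ordinary subtract
print(hex(c_sub(regs, 0, 2)))  # RA=0: behaves as neg, regs[0] is ignored
```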

16 bit mode only:

    | 0 | 1 | 234 | | 567.8 | 9ab | cde   | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 1 | RA  | | 010.0 | RB  | RS    | M | sld.
    | N | 1 | RA  | | 010.1 | RB  | RS!=0 | M | srd.
    | N | 1 | RA  | | 010.1 | RB  | 000   | M | srad.
    | N | 1 | BF  | | 011.0 | RB  | RA|0  | M | cmpw

Notes:

* for srad, RS=RA: "srad. RA(=RS), RS, RB"

### Logical

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

    | 16-bit mode | | 10-bit mode             |
    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 0 |  RT | | 100.0 | RB  | RA!=0 | M | and
    | N | 0 |  RT | | 100.1 | RB  | RA!=0 | M | nand
    | N | 0 |  RT | | 101.0 | RB  | RA!=0 | M | or
    | N | 0 |  RT | | 101.1 | RB  | RA!=0 | M | nor/mr
    | N | 0 |  RT | | 100.0 | RB  | 0 0 0 | M | popcnt
    | N | 0 |  RT | | 100.1 | RB  | 0 0 0 | M | cntlz
    | N | 0 |  RT | | 101.0 | RB  | 0 0 0 | M | extsw
    | N | 0 |  RT | | 101.1 | RB  | 0 0 0 | M | not

16-bit mode only (note that bit 1 == 1):

    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 1 |  RT | | 100.0 | RB  | RA!=0 | M | TBD
    | N | 1 |  RT | | 100.1 | RB  | RA!=0 | M | TBD
    | N | 1 |  RT | | 101.0 | RB  | RA!=0 | M | xor
    | N | 1 |  RT | | 101.1 | RB  | RA!=0 | M | eqv (xnor)
    | N | 1 |  RT | | 100.0 | RB  | 0 0 0 | M | setvl.
    | N | 1 |  RT | | 100.1 | RB  | 0 0 0 | M | cnttz
    | N | 1 |  RT | | 101.0 | RB  | 0 0 0 | M | extsb
    | N | 1 |  RT | | 101.1 | RB  | 0 0 0 | M | extsh

10 bit mode:

* idea: for 10bit mode, nor is actually 'mr' because mr is
  a more common operation.  in 16bit however, this encoding
  (Cmaj.min=0b101.1, N=0) is 'nor'
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that nor becomes not
* cntlz, popcnt, exts **not available** in 10-bit mode
* RT is implicitly RB: "and RT(=RB), RA, RB"

### Floating Point

Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

10 and 16 bit:

    | 16-bit mode | | 10-bit mode             |
    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N |   |  RT | | 011.1 | RB  | RA!=0 | M | fsub.
    | N | 0 |  RT | | 110.0 | RB  | RA!=0 | M | fadd
    | N | 0 |  RT | | 110.1 | RB  | RA!=0 | M | fmul
    | N | 0 |  RT | | 011.1 | RB  | 0 0 0 | M | fneg.
    | N | 0 |     | | 110.0 |     | 0 0 0 | M | TBD
    | N | 0 |     | | 110.1 |     | 0 0 0 | M | TBD

16-bit mode only (note that bit 1 == 1):

    | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
    | - | - | --- | | ----- | --- | ----- | - |
    | N | 1 |     | | 011.1 |     | RA!=0 | M | TBD
    | N | 1 |     | | 110.0 |     | RA!=0 | M | TBD
    | N | 1 |  RT | | 110.1 | RB  | RA!=0 | M | fdiv
    | N | 1 |  RT | | 011.1 | RB  | 0 0 0 | M | fabs.
    | N | 1 |  RT | | 110.0 | RB  | 0 0 0 | M | fmr.
    | N | 1 |     | | 110.1 |     | 0 0 0 | M | TBD

16 bit only, FP to INT convert (using C 0b001.1 subencoding)

    | 0 | 123 | 4 | | 567.8 | 9 ab | cde  | f |
    | - | --- | - | | ----- | ---- | ---- | - |
    | N | 101 | X | | 001.1 | 0 RA | Y RT | M | fp2int
    | N | 110 | X | | 001.1 | 0 RA | Y RT | M | int2fp

* X: signed=1, unsigned=0
* Y: FP32=0, FP64=1

10 bit mode:

* fsub. fneg. and fmr. default target is CR1
* fmr. is **not available** in 10-bit mode
* fdiv is **not available** in 10-bit mode

16 bit mode:

* fmr. copies RB to RT (and sets CR1)

### Condition Register

10-bit or 16 bit:

    | 16-bit mode   | | 10-bit mode            |
    | 0 | 123 | 4   | | 567.8 | 9 ab | cde | f |
    | - | --- | --- | | ----- | ---- | --- | - |
    | N | 000 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf

16-bit only:

    | 0 | 1234 | | 567.8 | 9 ab | cde | f |
    | - | ---- | | ----- | ---- | --- | - |
    | N | 0010 | | 001.1 | 0 BA | BB  | M | crnor
    | N | 0011 | | 001.1 | 0 BA | BB  | M | crandc
    | N | 0100 | | 001.1 | 0 BA | BB  | M | crxor
    | N | 0101 | | 001.1 | 0 BA | BB  | M | crnand
    | N | 0110 | | 001.1 | 0 BA | BB  | M | crand
    | N | 0111 | | 001.1 | 0 BA | BB  | M | creqv
    | N | 1000 | | 001.1 | 0 BA | BB  | M | crorc
    | N | 1001 | | 001.1 | 0 BA | BB  | M | cror

Notes

10 bit mode:

* mcrf BF is only 2 bits which means the destination is only CR0-CR3
* CR operations: **not available** in 10-bit mode (but mcrf is)

16 bit mode:

* mcrf BF2 extends BF (in MSB) to 3 bits
* CR operations: destination register is same as BA.
* CR operations: only possible on CR0 and CR1

SV (Vector Mode):

* CR operations: greatly extended reach/range (useful for predicates)

### System

cbank: Selection of Compressed-encoding "Bank".  Different "banks"
give different meanings to opcodes.  Example: CBank=0b001 is heavily
optimised to Audio/Video Encode/Decode.  cbank borrows from add's encoding
space (when RA==0).

    | 16-bit mode | | 10-bit mode             |
    | 0 | 1 2 3 4 | | 567.8 | 9ab   | cde | f |
    | - | ------- | | ----- | ----- | --- | - |
    | N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank

**not available** in 10-bit mode, **only** in 16-bit mode:

    | 0 | 1 | 234 | | 567.8 | 9 ab | cde  | f |
    | - | - | --- | | ----- | ---- | ---- | - |
    | N | 1 | 111 | | 001.1 | 0 00 |  RT  | M | mtlr
    | N | 1 | 111 | | 001.1 | 0 01 |  RT  | M | mtctr
    | N | 1 | 111 | | 001.1 | 0 00 |  RA  | M | mflr
    | N | 1 | 111 | | 001.1 | 0 01 |  RA  | M | mfctr
    | N | 0 RA!=0 | | 000.0 | 0 00 |  000 | M | mtcr
    | N | 1 RT!=0 | | 000.0 | 0 00 |  000 | M | mfcr

### Unallocated

16-bit only:

    | 0 | 1 | 234 | | 567.8 | 9 ab | cde  | f |
    | - | - | --- | | ----- | ---- | ---- | - |
    | N | 1 | 111 | | 001.1 | 0 10 |      | M |
    | N | 1 | 111 | | 001.1 | 0 11 |      | M |

# Other ideas (Attempt 2)

## 8-bit mode-switching instructions, odd addresses for C mode

Drop the complexity of a 16-bit encoding that is further reduced to 10-bit,
and use a single byte instead of two to switch between modes.  This
would place compressed (C) mode instructions at odd bytes, so the LSB
of the PC can be used for the processor to tell which mode it is in.

To switch from traditional to compressed mode, the single-byte
instruction would be at the MSByte, that holds the EXT bits.  (When we
break up a 32-bit instruction across words, the most significant half
should go in the word with the lower address.)

To switch from compressed mode to traditional mode, the single-byte
instruction would also be at the opcode/format portion, placed in the
lower-address word if split across words, so that the instruction can
be recognized as the mode-switching one without going for its second
byte.

The C-mode nop should be encoded so that its second byte encodes a
switch to compressed mode, if decoded in traditional mode.  This
enables such a nop to straddle across a label:

    8-bit first half of nop
    Label:
    8-bit second half of nop AKA switch to compressed mode
    16-bit insns...

so that if traditional code jumps to the word-aligned label (because
traditional branches drop the 2 LSB), it immediately switches to
compressed mode; if we fall-through, we remain in 16-bit mode; and if
we branch to it from compressed mode, whether we jump to the odd or
the even address, we end up in compressed mode as desired.

Tables explaining encoding:

    | byte 0 | byte 1 | byte 2 | byte 3 |
    | v3.0B standard 32 bit instruction |
    | EXT000 | 16 bit          | 16...  |
    | .. bit | 8nop   | v3.0b stand...  |
    | .. ard 32 bit   | EXT000 | 16...  |
    | .. bit | 16 bit          | 8nop   |
    | v3.0B standard 32 bit instruction |
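
A minimal sketch of the two address rules this idea relies on: the LSB of
the PC selecting the mode, and traditional branches dropping the 2 LSBs
of their target.  The function names are hypothetical.

```python
def mode_for_pc(pc):
    """Compressed instructions sit at odd byte addresses."""
    return "compressed" if pc & 1 else "v3.0B"


def traditional_branch_target(addr):
    """A traditional (v3.0B) branch drops the 2 LSBs of the target."""
    return addr & ~0b11


label = 0x1000
print(mode_for_pc(label))       # word-aligned label: decoded as v3.0B
print(mode_for_pc(label + 1))   # byte after the switch: compressed
print(hex(traditional_branch_target(0x1003)))
```

Because a traditional branch can only ever land on a word-aligned (even)
address, it always arrives in v3.0B mode, which is why the byte at the
label has to decode as "switch to compressed mode".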

# Other ideas (v3)

FSM state switching and mode switching deemed too complex.  Instead cut back to:

1. 10bit only (actually, 11 bit)
2. SV-Prefixed 16bit only (aka SV-C32)

Each will be entirely different which is a huge amount of work.

# TODO 

* make a preliminary assessment of branch in/out viability
* confirm FSM encoding (is LSB of PC really enough?)
* guesstimate opcode and register allocation (without necessarily doing
  a full encoding)
* write throwaway python program that estimates compression ratio from
  objdump raw parsing
* finally do full opcode allocation
* rerun objdump compression ratio estimates
* check in FSM, for "return to v3.0B then 16bit", whether it is ok for
  the v3.0B instruction to be a 10bit Compressed one.  Should this be
  ignored and carry on?  Should a trap occur?

### Use 2- rather than 3-register opcodes

Successful compact ISAs have used 2- rather than 3-register insns, in
which the same register serves as input and output.  Some 20% of
general-purpose 3-register insns already use either input register as
output, without any effort by the compiler to do so.

Repurposing the 3 bits used to encode one of the input registers
in arithmetic, logical and floating-point instructions, and the 2 bits
used to encode the mode of the next two insns, we could make the full
register files available to the opcodes already selected for
compressed mode, with one bit to spare to bring additional opcodes in.

An opcode could be assigned to an instruction that combines and
extends with the subsequent instruction, providing it with a separate
input operand to use rather than the output register, or with
additional range for immediate and offset operands, effectively
forming a 32-bit operation, enabling us to remain in compressed mode
even longer.
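
The bit budget behind the "one bit to spare" claim can be checked with a
few lines of arithmetic.  The constants are assumptions read off the
Attempt 1 tables (3-bit register fields, N and M as the two mode bits)
and a 32-entry register file.

```python
REG_FIELD_BITS = 3   # assumed width of a Compressed register field
FULL_REG_BITS = 5    # bits needed to address all 32 registers
MODE_BITS = 2        # N and M

# dropping the third register field and the mode bits frees:
freed = REG_FIELD_BITS + MODE_BITS
# widening the two remaining register fields to full width costs:
needed = 2 * (FULL_REG_BITS - REG_FIELD_BITS)
spare = freed - needed

print(freed, needed, spare)
```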

# Appendix

## Analysis techniques and tools

    objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
      s/^[ x0-9A-F]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
      sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
      s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
      sort -n | less
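
For further processing (such as the compression-ratio estimate listed in
the TODO section) a Python version of the same normalisation may be more
convenient.  This is a sketch that approximately mirrors the sed pipeline
above, with two deliberate differences: it accepts hex addresses in
either case, and it strips trailing spaces.  It works on lines of
`objdump -d --no-show-raw-insn` output supplied by the caller; the
function names are hypothetical.

```python
import re
from collections import Counter

# address, colon, mnemonic, operands
INSN = re.compile(r"^[ x0-9A-Fa-f]*: *([a-z.]+) *(.*)")


def normalise(line):
    """Return e.g. 'addi r1,r1,1' for an objdump line, None if not an insn."""
    m = INSN.match(line.replace("\t", " "))
    if m is None:
        return None
    text = f"{m.group(1)} {m.group(2)} "
    # r1..r31 all become r1 (r0 is left alone, as in the sed script)
    text = re.sub(r"([, (])r[1-9][0-9]*", r"\g<1>r1", text)
    # every decimal immediate (optionally negative) becomes 1
    text = re.sub(r"([ ,])-*[0-9]+([^0-9])", r"\g<1>1\g<2>", text)
    return text.strip()


def histogram(lines):
    """Count normalised instructions, skipping non-instruction lines."""
    return Counter(n for n in map(normalise, lines) if n is not None)


sample = [
    "0000000000001000 <main>:",
    "    1000:\taddi    r3,r4,16",
    "    1004:\taddi    r5,r6,-32",
    "    1008:\tld      r3,-8(r1)",
    "    100c:\tnop",
]
for insn, count in sorted(histogram(sample).items(), key=lambda kv: kv[1]):
    print(count, insn)
```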

## gcc register allocation

FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about
fixed registers (assigned to special purposes) and register allocation
order:

Special-purpose registers on ppc are:

    r0: constant zero/throw-away
    r1: stack pointer
    r2: thread-local storage pointer in 32-bit mode
    r2: non-minimal TOC register
    r10: EH return stack adjust register
    r11: static chain pointer
    r13: thread-local storage pointer in 64-bit mode
    r30: minimal-TOC/-fPIC/-fpic base register
    r31: frame pointer
    lr: return address register

the register allocation order in GCC (i.e., it takes the earliest
available register that fits the constraints) is:

    We allocate in the following order:

	fp0		(not saved or used for anything)
	fp13 - fp2	(not saved; incoming fp arg registers)
	fp1		(not saved; return value)
	fp31 - fp14	(saved; order given to save least number)
	cr7, cr5	(not saved or special)
	cr6		(not saved, but used for vector operations)
	cr1		(not saved, but used for FP operations)
	cr0		(not saved, but used for arithmetic operations)
	cr4, cr3, cr2	(saved)
	r9		(not saved; best for TImode)
	r10, r8-r4	(not saved; highest first for less conflict with params)
	r3		(not saved; return value register)
	r11		(not saved; later alloc to help shrink-wrap)
	r0		(not saved; cannot be base reg)
	r31 - r13	(saved; order given to save least number)
	r12		(not saved; if used for DImode or DFmode would use r13)
	ctr		(not saved; when we have the choice ctr is better)
	lr		(saved)
	r1, r2, ap, ca	(fixed)
	v0 - v1		(not saved or used for anything)
	v13 - v3	(not saved; incoming vector arg registers)
	v2		(not saved; incoming vector arg reg; return value)
	v19 - v14	(not saved or used for anything)
	v31 - v20	(saved; order given to save least number)
	vrsave, vscr	(fixed)
	sfp		(fixed)

## Comparison to VLE

VLE was a means to reduce executable size through three interleaved methods:

* (1) invention of 16 bit encodings (of exactly 16 bit in length)
* (2) invention of 16+16 bit encodings (a 16 bit instruction format but with
  an *additional* 16 bit immediate "tacked on" to the end, actually
  making a 32-bit instruction format)
* (3) seamless and transparent embedding and intermingling of the
  above in amongst arbitrary v2.06/7 BE 32 bit instruction sequences,
  with no additional state,
  including when the PC was not aligned on a 4-byte boundary.

Whilst (1) and (3) make perfect sense, (2) makes no sense at all given that, as inspection of "ori" and others shows, D-Form 16 bit immediates are the "norm" for v2.06/7 and v3.0B standard instructions.  (2) in effect **is** a 32 bit instruction.  (2) **is not** a 16 bit instruction.

*Why "reinvent" an encoding that is 32 bit, when there already exists a 32 bit encoding that does the exact same job?*

Consequently, we do **not** envisage a scenario where (2) would ever be implemented, nor in the future would this Compressed Encoding be extended beyond 16 bit.  Compressed is Compressed and is **by definition** limited to precisely  - and only - 16 bit.

The additional reason why that is the case is because VLE is exceptionally complex to implement.  In a single-issue, low clock rate "Embedded" environment for which VLE was originally designed, VLE was perfectly well matched.

However this Compressed Encoding is designed for High performance multi-issue systems *as well* as Embedded scenarios, and consequently, the complexity of "deep packet inspection" down into the depths of a 16 bit sequence in order to ascertain if it might not be 16 bit after all, is wholly unacceptable.

By eliminating such 16+16 (actually, 32bit conflation) tricks outlined in (2), Compressed is *specifically* designed to fit into a very small FSM, suitable for multi-issue, that in no way requires "deep-dive" analysis.  Yet, despite OpenPOWER never having been designed with 16 bit encodings in mind, Compressed is still suitable for retro-fitting onto it.

## Compressed Decoder Phases

Phase 1 (stage 1 of a 2-stage pipelined decoder) is defined as the minimum necessary FSM required to determine instruction length and mode.  This is implemented with the absolute bare minimum of gates and is based on the 6 encodings involving N, M and EXTNNN (see table, below)

Phase 2 (stage 2 of a 2-stage pipelined decoder) is defined as the "full decoder" that includes taking into account the length and mode from Phase 1.  Given a 2-stage pipelined decoder it is categorically **impossible** for Phase 2 to go backwards in time and affect the decisions made in Phase 1.

These two phases are specifically designed to take multi-issue execution into account.  Phase 1 is intended to be part of an O(log N) algorithm that can use a form of carry-lookahead propagation. Phase 2 is intended to be on a 2nd pipelined clock cycle, comprising a separate suite of independent local-state-only parallel pipelines that do not require any inter-communication of any kind.

Table: reminder of the six 16-bit encodings:

    | 0 | 1234 | 567  8 | 9abcde | f | explanation
    | - | ---- | ------ | ------ | - | -----------
    | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
    | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
    | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
    | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
    | 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
    | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

### Phase 1

The Phase 1 length/mode identification takes into account only 3 pieces of information:

* extc_id: insn[0:4] == EXTNNN (Compressed)
* N: insn[0]
* M: insn[15]
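
As an illustration, the three inputs listed above can be extracted from a 16-bit halfword as sketched below.  The helper names (`bits`, `phase1_inputs`) and the tag value `EXTC_TAG = 0b00000` are assumptions made for this sketch only: the tag is taken to be the five upper bits that EXT000 and EXT001 have in common, and the halfword is assumed to be presented with ISA bit 0 as its most significant bit (MSB0 numbering, as in the table above).

```python
# ASSUMPTION: EXTNNN is the five-bit pattern shared by EXT000 and EXT001.
EXTC_TAG = 0b00000


def bits(hw, start, end):
    """Return bits start..end (inclusive, MSB0 numbering) of a 16-bit halfword."""
    width = end - start + 1
    return (hw >> (15 - end)) & ((1 << width) - 1)


def phase1_inputs(hw):
    """Extract the only three values that Phase 1 looks at."""
    extc_id = bits(hw, 0, 4) == EXTC_TAG  # insn[0:4] == EXTNNN
    n = bits(hw, 0, 0)                    # N: insn[0]
    m = bits(hw, 15, 15)                  # M: insn[15]
    return extc_id, n, m


if __name__ == "__main__":
    print(phase1_inputs(0x0001))  # tag present, N=0, M=1
    print(phase1_inputs(0x8000))  # tag absent,  N=1, M=0
```

Note that N overlaps the first bit of the tag field: which of the two is meaningful depends entirely on the mode that Phase 1 is currently in.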

The Phase 1 length/mode produces the following lengths/modes:

* 32 - v3.0B (includes v3.0B followed by 16bit)
* 16 - 10bit
* 16 - 16bit

**NOTE THAT FURTHER SUBIDENTIFICATION OF C MODES IS NOT CARRIED OUT AT PHASE 1**. In particular, note that the 16 bit "immediate mode" is **not** part of the Phase 1 FSM, but is specifically isolated to Phase 2.

Pseudocode:

    # starting point for FSM
    previ.mode = v3.0B

    if previ.mode == v3.0B:
        # previous was v3.0B, look for compressed tag
        if extc_id:
            # found it.  move to 10bit mode
            nexti.length = 16
            nexti.mode = 10bit
        else:
            # nope. stay in v3.0B
            nexti.length = 32
            nexti.mode = v3.0B

    elif previ.mode == 10bit:
        # previous was 10bit, move to v3.0B or 16bit?
        if M == 0:
            # back to v3.0B
            nexti.length = 32
            nexti.mode = v3.0B
        else:
            # otherwise move to 16bit mode
            nexti.length = 16
            nexti.mode = 16bit

    elif previ.mode == 16bit:
        # previous was 16bit, stay there or move?
        if M == 0:
            # back to v3.0B
            nexti.length = 32
            if N == 1:
                # ... but only for 1 insn
                nexti.mode = v3.0B_then_16bit
            else:
                nexti.mode = v3.0B
        else:
            # otherwise stay in 16bit mode
            nexti.length = 16
            nexti.mode = 16bit

    # rest of FSM involving 3.0B to 16bit
    # and back transitions left to implementor
    # (or for someone else to add)
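
The pseudocode above can be turned into a small runnable Python model, sketched here purely as an illustration and not as a reference implementation.  It assumes that `extc_id` is tested on the instruction being classified, whilst N and M are taken from the *previous* 16-bit instruction.  The handling of `v3.0B_then_16bit` (returning to 16bit after exactly one v3.0B instruction) is an assumption drawn from the "16b, 1x v3.0B, 16b" row of the table, as the pseudocode leaves that transition to the implementor.  The function names and the `(extc_id, N, M)` tuple format are hypothetical.

```python
V30B, V30B_THEN_16BIT = "v3.0B", "v3.0B_then_16bit"
C10BIT, C16BIT = "10bit", "16bit"


def phase1_next(prev_mode, prev_n, prev_m, extc_id):
    """Return (length, mode) of the next instruction.

    prev_n and prev_m are N and M of the previous 16-bit instruction;
    extc_id is the Compressed tag test on the instruction being classified.
    """
    if prev_mode == V30B:
        return (16, C10BIT) if extc_id else (32, V30B)
    if prev_mode == C10BIT:
        return (16, C16BIT) if prev_m else (32, V30B)
    if prev_mode == C16BIT:
        if prev_m:
            return (16, C16BIT)
        return (32, V30B_THEN_16BIT if prev_n else V30B)
    if prev_mode == V30B_THEN_16BIT:
        # ASSUMPTION (not in the pseudocode): after the single v3.0B
        # instruction, decoding returns to 16bit mode.
        return (16, C16BIT)
    raise ValueError("unknown mode: %r" % (prev_mode,))


def phase1_scan(fields):
    """Sequentially classify a stream.

    fields holds one (extc_id, N, M) tuple per 16-bit halfword.
    Returns a list of (halfword_index, length, mode).
    """
    out = []
    prev_mode, prev_n, prev_m = V30B, 0, 0  # starting point for FSM
    i = 0
    while i < len(fields):
        extc_id, n, m = fields[i]
        length, mode = phase1_next(prev_mode, prev_n, prev_m, extc_id)
        out.append((i, length, mode))
        prev_mode, prev_n, prev_m = mode, n, m
        i += length // 16  # a 32-bit instruction consumes two halfwords
    return out


if __name__ == "__main__":
    stream = [
        (True, 0, 1),    # tag found, M=1: 10bit then 16bit
        (False, 1, 0),   # 16bit with N=1, M=0: one v3.0B, then 16bit
        (False, 0, 0),   # v3.0B, first half
        (False, 0, 0),   # v3.0B, second half
        (False, 0, 0),   # 16bit with N=0, M=0: back to v3.0B
        (False, 0, 0),   # v3.0B, first half
        (False, 0, 0),   # v3.0B, second half
    ]
    for entry in phase1_scan(stream):
        print(entry)
```

This model walks the stream sequentially for clarity.  It makes no attempt to show the O(log N) carry-lookahead arrangement mentioned earlier, only the per-instruction transition that such an arrangement would be built from.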

### Phase 2: Compressed mode

At this phase, knowing that the length is 16bit and the mode is either 10bit or 16bit, further analysis is required to determine if the 16bit.immediate encoding is active, and so on.  This is a fully combinatorial block that **at no time** steps outside of the strict bounds already determined by Phase 1.

    op_001_1 = insn[5:8] == 0b001.1
    if mode == 10bit:
        decode_10bit(insn)
    elif mode == 16bit:
        if N == 1 and M == 1 and not op_001_1:
            # see immediate opcodes table
            decode_16bit_immed_mode(insn)
        elif op_001_1:
            # see CR and System tables
            # (16 bit ones at least)
            decode_16bit_cr_or_sys(insn)
        else:
            decode_16bit_nonimmed_mode(insn)

From this point onwards each of the decode_xx functions performs straightforward combinatorial decoding of the 16 bits of "insn".  In some cases this involves further analysis of bit 1; in other cases (Cmaj.m = 0b010.1) even further deep-dive decoding is required (CR ops).  *All* of it is entirely combinatorial and at **no time** involves changing of, or interaction with, or disruption of, the Phase 1 determination of Length+Mode (that has *already taken place* in an earlier decoding pipeline time-schedule).
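
The Compressed-mode selection can be sketched in Python as a pure function of the Phase 1 mode and the 16 bits of the instruction.  This is a hedged illustration only: it assumes that immediate mode applies when N and M are both 1 and Cmaj.m is *not* 0b001.1, and that Cmaj.m = 0b001.1 selects the CR and System group.  The function name `phase2_select` is hypothetical, and it returns the *name* of the decoder that would be used rather than performing any decoding.

```python
def phase2_select(mode, hw):
    """Pick the Phase 2 decoder for a 16-bit instruction.

    mode is the Phase 1 result ("10bit" or "16bit"); hw is the halfword,
    with ISA bit 0 as its most significant bit (MSB0 numbering).
    """
    n = (hw >> 15) & 1           # N: insn[0]
    m = hw & 1                   # M: insn[15]
    cmaj_m = (hw >> 7) & 0xF     # Cmaj.m: insn[5:8]
    op_001_1 = cmaj_m == 0b0011  # ASSUMPTION: the CR / System group

    if mode == "10bit":
        return "decode_10bit"
    if mode == "16bit":
        if n == 1 and m == 1 and not op_001_1:
            return "decode_16bit_immed_mode"
        if op_001_1:
            return "decode_16bit_cr_or_sys"
        return "decode_16bit_nonimmed_mode"
    # Phase 1 never hands anything else to the Compressed decoders.
    raise ValueError("not a Compressed mode: %r" % (mode,))


if __name__ == "__main__":
    print(phase2_select("16bit", 0x8001))  # N=1, M=1, Cmaj.m=0b000.0
    print(phase2_select("16bit", 0x0180))  # Cmaj.m=0b001.1
```

The function takes no previous-instruction state and returns no length, which mirrors the requirement that Phase 2 cannot alter the Length+Mode decision already made by Phase 1.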

### Phase 2: v3.0B mode

Standard v3.0B decoders are deployed.  Absolutely no interaction occurs with any 16 bit decoders or state.  Absolutely no interaction with the earlier Phase 1 decoding occurs.  Absolutely no interaction occurs whatsoever (assuming an implementation that does not perform macro-op fusion) between other multi-issued v3.0B instructions being decoded in parallel at this time.
 
## Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode

[[demo]]

### Efficient Decoding Algorithm

[[decoding]]