# 16 bit Compressed

See:

* <https://bugs.libre-soc.org/show_bug.cgi?id=238>
* <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding

This one is a conundrum.  OpenPOWER ISA was never designed with 16
bit in mind.  VLE was added 10 years ago, but only by way of marking
an entire 64k page as "VLE".  Because VLE has not been maintained, it
is no longer fully compatible with current PowerISA.

Here, in order to embed 16 bit into a predominantly 32 bit stream,
spending an entire 16 bits just to switch into Compressed mode is itself
a significant overhead.  The situation is made worse by 5 bits being
taken up by Major Opcode space, leaving only 11 bits to allocate to
actual instructions.

In addition we would like to add SV-C32 which is a Vectorised version
of 16 bit Compressed, and ideally have a variant that adds the 27-bit
prefix format from SV-P64, as well.

Potential ways to reduce pressure on the 16 bit space are:

* To provide "paging".  This involves bank-switching to alternative optimised encodings for specific workloads
* To enter "16 bit mode" for durations specified at the start
* To reserve one bit of every 16 bit instruction to indicate that the 16 bit mode is to continue to be sustained

In the Vector context this last bit could be given an alternative
meaning: it would determine whether the instruction is 11-bit
prefixed or 27-bit prefixed:

    0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | 11 bit vector prefix|
    |16 bit opcode  alt vec. mode ^ |
    | extra vector prefix if alt set|

Using a major opcode to enter 16 bit mode leaves 11 bits in that first
halfword, for which a use still has to be found:

    0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | what to do here   1 |
    |16 bit    stay in 16bit mode 1 |
    |16 bit    stay in 16bit mode 1 |
    |16 bit       exit 16bit mode 0 |

One possibility is that the 11 bits are used for bank selection, with
some room for additional context such as altering the registers used
for the 16 bit operations (bank selection of which scalar regs).

Another is to use the 11 bits for only the most commonly used
instructions.  In that case one of those 11 bits would still need to
be dedicated to saying whether 16 bit mode is to be continued, so only
10 bits remain for actual opcodes!
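
The stay/exit bit shown in the diagram above can be sketched as a small walker over the halfwords that follow the entry instruction.  This is an illustration only: `compressed_run_length`, `STAY_BIT` and `OPCODE_BITS` are hypothetical names, and bit f is assumed to be the least significant bit (MSB0 numbering).

```python
MAJOR_OPCODE_BITS = 5   # as stated in the text above
STAY_BIT = 0x0001       # bit "f" of the diagrams, assumed to be the LSB

# Bits left for opcodes once the major opcode and the stay bit are removed.
OPCODE_BITS = 16 - MAJOR_OPCODE_BITS - 1


def compressed_run_length(halfwords):
    """Count the 16-bit instructions executed before 16-bit mode ends.

    `halfwords` are the 16-bit values following the entry halfword.
    Each one carries its own stay (1) / exit (0) flag in bit f.
    """
    count = 0
    for hw in halfwords:
        count += 1                 # this halfword is a 16-bit instruction
        if not hw & STAY_BIT:      # flag clear: it is also the last one
            break
    return count
```

For the sequence in the diagram (entry, stay, stay, exit) the three halfwords after the entry are all treated as 16-bit instructions, the third being the one that carries the exit flag.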

# Opcode Allocation Ideas

* one bit of each 16-bit mode instruction is used to indicate that
  32-bit mode is to be dropped into for one single instruction only
  <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>

## Opcodes exploration (Attempt 1)

Switching between different encoding modes is controlled by M (alone)
in 10-bit mode, and M and N in 16-bit mode.

* In 10-bit mode, M=0 indicates that the following instructions are
  standard OpenPOWER ISA 32-bit encoded (including, redundantly,
  further 10/16-bit instructions)
* In 10-bit mode, M=1 indicates that the following instructions are
  in 16-bit encoding mode

Once in 16-bit mode:

* 0b01 (M=1, N=0): stay in 16-bit mode
* 0b00 (M=0, N=0): leave 16-bit mode permanently (return to standard OpenPOWER ISA)
* 0b10 (M=0, N=1): leave 16-bit mode for one instruction (return to standard OpenPOWER ISA)
* 0b11 (M=1, N=1): free to be used for something completely different.

The current "top" idea for 0b11 is to use it for a new encoding format
of predominantly "immediates-based" 16-bit instructions (branch-conditional,
addi, mulli etc.)
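
The M/N rules above amount to a small state machine, sketched here under stated assumptions.  The mode names and `next_mode` are hypothetical.  The sketch reads 0b10 as "run one 32-bit instruction, then return to 16-bit mode", and it assumes that the 0b11 immediate format keeps the decoder in 16-bit mode, which the text does not state.

```python
MODE_32 = "32-bit"            # standard OpenPOWER ISA encoding
MODE_16 = "16-bit"            # compressed encoding
MODE_32_ONCE = "32-bit-once"  # a single 32-bit instruction, then back to 16-bit


def next_mode(mode, m, n=0):
    """Return the mode for the instruction after the current one.

    In MODE_32 this is only consulted for an embedded 10-bit instruction,
    where M alone applies.  Ordinary 32-bit instructions leave the mode alone.
    """
    if mode == MODE_32:
        return MODE_16 if m else MODE_32
    if mode == MODE_32_ONCE:
        return MODE_16            # the one 32-bit instruction has been used up
    # mode == MODE_16: N and M together select the transition
    if (n, m) == (0, 1):
        return MODE_16            # 0b01: stay
    if (n, m) == (0, 0):
        return MODE_32            # 0b00: leave permanently
    if (n, m) == (1, 0):
        return MODE_32_ONCE       # 0b10: leave for one instruction
    return MODE_16                # 0b11: immediate format (assumed to stay)
```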

The Compressed Major Opcode is in bits 5-7.

* M+N mode-switching is not available for C-Major 0b000 or 0b111
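
As a sketch of how the fields named so far could be pulled out of a halfword, the helper below extracts bit ranges using MSB0 numbering (bit 0 is the most significant bit, bit f the least).  The numbering convention and the helper names `field` and `mode_fields` are assumptions made for illustration.

```python
def field(hw, start, end):
    """Extract bits start..end (inclusive, MSB0 numbering) of a 16-bit value."""
    width = end - start + 1
    shift = 15 - end
    return (hw >> shift) & ((1 << width) - 1)


def mode_fields(hw):
    """Return (C-Major, M, N) for a halfword.

    N is reported as None for C-Major 0b000 and 0b111, where bit 0 is
    not available for mode-switching.
    """
    cmajor = field(hw, 5, 7)
    m = field(hw, 15, 15)
    n = field(hw, 0, 0) if cmajor not in (0b000, 0b111) else None
    return cmajor, m, n
```

For example `mode_fields(0x8201)` gives C-Major 0b010 with M=1 and N=1, while `mode_fields(0x8001)` gives C-Major 0b000 with M=1 and no N.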

### Immediate Opcodes

Only available in 16-bit mode, and only when M=1 and N=1.

    | 0 | 1  | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
    | 1 | offs2      | | 001 | o  BI | o BO | LK | 1 | bc
    | 1 | o2 |  RT   | | 010 | RB    | offs      | 1 | addis
    | 1 | o2 |  RT   | | 011 | RB    | offs      | 1 | mulis
    | 1 | o2 |       | | 100 |       | offs      | 1 | 
    | 1 | o2 |  RT   | | 101 | RA    | offs      | 1 | ldi
    | 1 | o2 |  RT   | | 110 | RA    | offs      | 1 | sti

### Branch

10 bit mode may be expanded by 16 bit mode later, adding capabilities
that do not fit in the extremely limited space.

    | 16-bit mode | | 10-bit mode              |
    | 0 | 1 | 2 3 4 | | 567 | 8 9 a | b c d | e  | f |
    | BO2   | BI3   | | 000 | 0  BI | 0  BO | LK | M | bclr
    | BO2   | BI3   | | 000 | 0  BI | 1  BO | LK | M | bctar
    | N | offs2     | | 001 |    offs       | LK | M | b

16 bit mode:

* offs2 extends offset in MSBs
* BI3 extends BI in MSBs to allow selection of full CR
* BO2 extends BO

10 bit mode:

* BO[0] enables CR check, BO[1] inverts check
* BI refers to CR0 only (selecting one of its 4 bits)
* no Branch Conditional with immediate
* no Absolute Address
* no CTR mode (and no bctr)
* offs is a signed offset in units of 2 bytes
* all branch targets are 2 byte aligned
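
The 10-bit branch rules above can be sketched as follows.  Several details are assumptions rather than part of the proposal: BO[0] is taken to be the upper bit of the 2-bit BO field, BI indexes CR0 in MSB0 order, an uninverted check branches when the selected CR0 bit is set, and the offset is relative to the address of the branch itself.  The 6-bit default width corresponds to bits 8-d of the `b` row.

```python
def branch_taken(bo, bi, cr0):
    """Decide a 10-bit conditional branch.

    bo:  2-bit BO field, BO[0] enables the CR check, BO[1] inverts it
    bi:  2-bit BI field selecting one bit of CR0
    cr0: 4-bit CR0 value
    """
    check = (bo >> 1) & 1
    invert = bo & 1
    if not check:
        return True                     # no CR check: always taken
    bit = (cr0 >> (3 - bi)) & 1         # MSB0 indexing into CR0 (assumed)
    return bool(bit ^ invert)


def branch_target(pc, offs, width=6):
    """Sign-extend a `width`-bit offset and scale it by 2 bytes."""
    sign = 1 << (width - 1)
    signed = (offs ^ sign) - sign
    return pc + 2 * signed
```

Scaling by 2 is what keeps every target 2 byte aligned as long as the branch itself is.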

### LD/ST

    | 16-bit mode       | | 10-bit mode             |
    | 0   | 1   | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
    | RB2 | RA2 |  RT   | | 001 | 1  RA | 1  RB | 0 | M | fld
    | RA2 | RT2 |  RB   | | 001 | 1  RA | 1  RT | 1 | M | fst
    |     |     |  RT   | | 111 |  RA   |  RB   | 0 | M | ld
    |     |     |  RB   | | 111 |  RA   |  RT   | 1 | M | st

* elwidth overrides can set different widths

16 bit mode:

* F=1 is FLD, FST
* RA2 extends RA to 3 bits (MSB)
* RT2 extends RT to 3 bits (MSB)

10 bit mode:

* RA and RB are only 2 bit (0-3)
* for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
* for ST, there is no offset: "st RT, RA(0)"
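
A minimal sketch of the two ideas above: extending a 2-bit register field with an extra MSB in 16-bit mode, and expanding the implicit operands of the 10-bit forms.  The function names and the dictionary layout are hypothetical and chosen only for illustration.

```python
def extend_reg(ext_bit, low_bits, low_width=2):
    """Prepend a 16-bit-mode extension bit (e.g. RA2) as the MSB of a register number."""
    return (ext_bit << low_width) | low_bits


def expand_ld_10bit(ra, rb):
    """10-bit ld: RT is implicitly RB, giving "ld RT=RB, RA(RB)"."""
    return {"op": "ld", "RT": rb, "RA": ra, "RB": rb}


def expand_st_10bit(rt, ra):
    """10-bit st: there is no offset, giving "st RT, RA(0)"."""
    return {"op": "st", "RT": rt, "RA": ra, "offset": 0}
```

With RA2=1 and RA=0b10, `extend_reg` selects register 6.  With RA2=0 the same field stays within registers 0-3, which matches the 10-bit range.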

### Arithmetic

    | 16-bit mode   | | 10-bit mode             |
    | 0 | 1 | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
    | N |   |  RT   | | 010 | RB    | RA!=0 | 0 | M | add
    | N |   |  RT   | | 011 | RB    | RA!=0 | 0 | M | sub.
    | N |   |  RT   | | 010 | RB    | RA    | 1 | M | mul
    | N |   |  RT   | | 011 | RB    | 0 0 0 | 0 | M | neg.

10 bit mode:

* sub. default CR target is CR0
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that sub. becomes neg.
* RT is implicitly RB: "add RT(=RB), RA, RB"
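
The 10-bit half of the Arithmetic table can be read as a small decoder, sketched below.  Bit positions assume MSB0 numbering within the halfword, and `decode_arith_10bit` is a hypothetical name.  The encoding with C-Major 0b010, RA=0 and bit e=0 is deliberately rejected, since the System section assigns it to cbank.

```python
def decode_arith_10bit(hw):
    """Decode a 10-bit arithmetic instruction into (mnemonic, RT, RA, RB).

    Returns None when the halfword is not one of the four arithmetic rows.
    """
    cmajor = (hw >> 8) & 0b111   # bits 5-7
    rb = (hw >> 5) & 0b111       # bits 8-a
    ra = (hw >> 2) & 0b111       # bits b-d
    e = (hw >> 1) & 1            # bit e

    if cmajor == 0b010 and e == 1:
        op = "mul"
    elif cmajor == 0b010 and ra != 0:
        op = "add"
    elif cmajor == 0b011 and e == 0:
        # RA=0 is a zero immediate, so sub. turns into neg.
        op = "sub." if ra != 0 else "neg."
    else:
        return None              # cbank, floating point, or another group
    return (op, rb, ra, rb)      # RT is implicitly RB in 10-bit mode
```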

### Logical

    | 16-bit mode   | | 10-bit mode             |
    | 0 | 1 | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
    | N |   |  RT   | | 100 | RB    | RA!=0 | 0 | M | and
    | N |   |  RT   | | 100 | RB    | RA!=0 | 1 | M | nand
    | N |   |  RT   | | 101 | RB    | RA!=0 | 0 | M | or
    | N |   |  RT   | | 101 | RB    | RA!=0 | 1 | M | nor
    | N |   |  RT   | | 100 | RB    | 0 0 0 | 0 | M | exts
    | N |   |  RT   | | 100 | RB    | 0 0 0 | 1 | M | cntlz
    | N |   |  RT   | | 101 | RB    | 0 0 0 | 0 | M | popcnt
    | N |   |  RT   | | 101 | RB    | 0 0 0 | 1 | M | not

10 bit mode:

* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that nor becomes not
* cntlz, popcnt, exts **not available** in 10-bit mode
* RT is implicitly RB: "and RT(=RB), RA, RB"
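
The claim that nor with a zero immediate becomes not can be checked directly.  The sketch assumes 64-bit registers, and `nor64` and `not64` are illustrative helpers, not part of the proposal.

```python
MASK64 = (1 << 64) - 1   # assumed register width


def nor64(a, b):
    """Bitwise NOR truncated to 64 bits."""
    return ~(a | b) & MASK64


def not64(b):
    """Bitwise NOT truncated to 64 bits."""
    return ~b & MASK64


# With the (RA|0) input forced to zero, nor and not agree for every RB.
for rb in (0, 1, 0xDEADBEEF, MASK64):
    assert nor64(0, rb) == not64(rb)
```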

### Floating Point

    | 16-bit mode   | | 10-bit mode             |
    | 0 | 1 | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
    | N |   |  RT   | | 011 | RB    | RA!=0 | 1 | M | fsub.
    | N |   |  RT   | | 110 | RB    | RA!=0 | 0 | M | fadd
    | N | 0 |  RT   | | 110 | RB    | RA!=0 | 1 | M | fmul
    | N | 1 |  RT   | | 110 | RB    | RA!=0 | 1 | M | fdiv
    | N |   |  RT   | | 011 | RB    | 0 0 0 | 1 | M | fneg.
    | N |   |  RT   | | 110 | RB    | 0 0 0 | 0 | M | fabs
    | N |   |  RT   | | 110 | RB    | 0 0 0 | 1 | M | fmr.

10 bit mode:

* fsub., fneg. and fmr. default CR target is CR1
* fmr. is **not available** in 10-bit mode
* fdiv is **not available** in 10-bit mode

16 bit mode:

* fmr. copies RB to RT (and sets CR1)

### Condition Register

    | 16-bit mode   | | 10-bit mode           |
    | 0 1 2 3 | 4   | | 567 | 8 9 a | b c d e | f |
    | 0 0 0 0 | BF2 | | 000 | 1  BF | 0  BFA  | M | mcrf
    | 0 0 0 1 | BA2 | | 000 | 1  BA | 0  BB   | M | crnor
    | 0 1 0 0 | BA2 | | 000 | 1  BA | 0  BB   | M | crandc
    | 0 1 1 0 | BA2 | | 000 | 1  BA | 0  BB   | M | crxor
    | 0 1 1 1 | BA2 | | 000 | 1  BA | 0  BB   | M | crnand
    | 1 0 0 0 | BA2 | | 000 | 1  BA | 0  BB   | M | crand
    | 1 0 0 1 | BA2 | | 000 | 1  BA | 0  BB   | M | creqv
    | 1 1 0 1 | BA2 | | 000 | 1  BA | 0  BB   | M | crorc
    | 1 1 1 0 | BA2 | | 000 | 1  BA | 0  BB   | M | cror

10 bit mode:

* mcrf BF is only 2 bits which means the destination is only CR0-CR3
* CR operations: **not available** in 10-bit mode

16 bit mode:

* mcrf BF2 extends BF (in MSB) to 3 bits
* CR operations: destination register is same as BA.
* CR operations: only possible on CR0 and CR1
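
The 16-bit CR operations can be sketched as in-place updates: the result overwrites the bit selected by BA.  The sketch assumes that the 3-bit BA and BB values index the eight bits of CR0 and CR1 in order, and it uses the usual OpenPOWER meanings of the eight CR logical operations.  `CR_OPS` and `cr_op` are hypothetical names.

```python
# Single-bit semantics of the CR logical operations (operands are 0 or 1).
CR_OPS = {
    "crand":  lambda a, b: a & b,
    "cror":   lambda a, b: a | b,
    "crxor":  lambda a, b: a ^ b,
    "crnand": lambda a, b: (a & b) ^ 1,
    "crnor":  lambda a, b: (a | b) ^ 1,
    "creqv":  lambda a, b: (a ^ b) ^ 1,
    "crandc": lambda a, b: a & (b ^ 1),
    "crorc":  lambda a, b: a | (b ^ 1),
}


def cr_op(name, cr_bits, ba, bb):
    """Apply a CR operation where the destination is the same bit as BA.

    cr_bits: list of 8 bits, CR0 followed by CR1.
    Returns a new list and leaves the input untouched.
    """
    result = list(cr_bits)
    result[ba] = CR_OPS[name](cr_bits[ba], cr_bits[bb])
    return result
```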

SV (Vector Mode):

* CR operations: greatly extended reach/range (useful for predicates)

### System

Selection of Compressed-encoding "Bank".  Different "banks" give different
meanings to opcodes.  Example: CBank=0b001 is heavily optimised for
Audio/Video Encode/Decode.

    | 16-bit mode | | 10-bit mode             |
    | 0 1 | 2 3 4 | | 567 | 8 9 a | b c d | e | f |
    |       Bank2 | | 010 | CBank | 0 0 0 | 0 | M | cbank

**not available** in 10-bit mode:

    | 0 1 2 3 | 4  | | 567 | 8 9 a | b c d e  | f |
    | 1 1 1 1 | 0  | | 000 | 1  00 | 0  RT    | M | mtlr
    | 1 1 1 1 | 0  | | 000 | 1  01 | 0  RT    | M | mtctr
    | 1 1 1 1 | 0  | | 000 | 1  10 | 0  RT    | M | mttar
    | 1 1 1 1 | 0  | | 000 | 1  11 | 0  RT    | M | mtcr
    | 1 1 1 1 | 1  | | 000 | 1  00 | 0  RA    | M | mflr
    | 1 1 1 1 | 1  | | 000 | 1  01 | 0  RA    | M | mfctr
    | 1 1 1 1 | 1  | | 000 | 1  10 | 0  RA    | M | mftar
    | 1 1 1 1 | 1  | | 000 | 1  11 | 0  RA    | M | mfcr
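
The eight move instructions in the table above differ only in bit 4 (direction) and bits 9-a (which special register).  A sketch of that mapping, with the hypothetical names `SPECIAL_REGS` and `system_mnemonic`:

```python
# Bits 9-a select the special register, in the order the table lists them.
SPECIAL_REGS = ["lr", "ctr", "tar", "cr"]


def system_mnemonic(bit4, sel):
    """Return the mnemonic for bits 0-3 = 0b1111.

    bit4: 0 selects a move-to (mt*), 1 selects a move-from (mf*)
    sel:  2-bit value of bits 9-a
    """
    prefix = "mf" if bit4 else "mt"
    return prefix + SPECIAL_REGS[sel]
```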

### Unallocated

    | 0 1 2 3 | 4  | | 567 | 8 9 a | b c d e  | f |
    | 0 0 1 0 |    | | 000 | 1     | 0        | M |
    | 0 0 1 1 |    | | 000 | 1     | 0        | M |
    | 0 1 0 1 |    | | 000 | 1     | 0        | M |
    | 1 0 1 0 |    | | 000 | 1     | 0        | M |
    | 1 0 1 1 |    | | 000 | 1     | 0        | M |
    | 1 1 0 0 |    | | 000 | 1     | 0        | M |