openpower/sv/major_opcode_allocation.mdwn

# Major Opcode Allocation

SimpleV Prefix, 16-bit Compressed, and SV VBLOCK all require considerable
opcode space.  The approach is similar to OpenPOWER v3.1 "prefixes"; the
key difference is that the driving goal here is to reduce overall
instruction size, and thus greatly reduce I-Cache size and, in turn,
power consumption.

Consequently, rather than settle for a v3.1-style 32 bit prefix, 8 major
opcodes are taken up and given new meanings.  There are two options:

* Taking 8 arbitrary unused major opcodes as-is
* Moving anything in the range 0-7 elsewhere

This applies **only** in "LibreSOC Mode".  Candidates for moving elsewhere
include mulli, twi and tdi.

The 8 major opcodes are allocated as follows:

* 2 opcodes for 16-bit Compressed instructions, with 11 bits available
* 2 opcodes to give SV-P48 (and SV-C32) the 11 bits needed for prefixing
* 2 opcodes likewise for SV-P64 (and SV-C48), to have 27 bits available
* 2 opcodes for SV VBLOCK

With only 11 bits for 16-bit Compressed, it may be better to use the
opportunity to switch into "16 bit mode".  Interestingly, SV-P32 could
likewise switch into the same mode.
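
The bit counts in the list above can be checked with a little arithmetic. The sketch below is only illustrative: it assumes the standard 6-bit Power ISA primary opcode field, assumes each pair of major opcodes is an aligned pair (so one bit of the field becomes usable), and assumes the SV-P48 and SV-P64 prefixes occupy 16 and 32 bits respectively. `free_bits` is a hypothetical helper name.

```python
MAJOR_OPCODE_BITS = 6  # width of the Power ISA primary opcode field


def free_bits(container_bits: int, opcodes_used: int) -> int:
    """Bits left in a container once the major opcode is accounted for.

    ASSUMPTION: the opcodes form an aligned power-of-two group, so the
    low bits of the opcode field that distinguish them become usable.
    """
    if opcodes_used < 1 or opcodes_used & (opcodes_used - 1):
        raise ValueError("opcodes_used must be a power of two")
    reclaimed = opcodes_used.bit_length() - 1
    return container_bits - (MAJOR_OPCODE_BITS - reclaimed)


# ASSUMPTION: 16-bit prefix for SV-P48/SV-C32, 32-bit for SV-P64/SV-C48.
budget = {
    "16-bit Compressed": free_bits(16, 2),
    "SV-P48 / SV-C32 prefix": free_bits(16, 2),
    "SV-P64 / SV-C48 prefix": free_bits(32, 2),
}
```

Under these assumptions a pair of opcodes fixes 5 bits, which is where the 11 and 27 bit figures come from.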

# LE/BE complications

See <https://bugs.libre-soc.org/show_bug.cgi?id=529> for discussion.

With the Major Opcode being at the opposite end of the sequential byte
order when read from memory in LE mode, a solution which allows 16 and
48 bit instructions to co-exist with 32 bit ones is to look at bytes 2
and 3 *before* looking at 0 and 1.

Option 1:

A 16 bit instruction would therefore be in bytes 2 and 3, and would be
removed from the instruction stream *ahead* of bytes 0 and 1, which would
remain where they were.  The next instruction would repeat the analysis,
starting now instead at the *new* bytes 2 and 3.

A 48 bit instruction would again use bytes 2 and 3 to read the major
opcode, and bytes 0 through 5 would be extracted from the stream.  However,
the 48 bit instruction would be constructed from bytes 2,3,0,1,4,5.  Again:
after these 6 bytes are extracted from the stream, the analysis begins
again for the next instruction at bytes 2 and 3.
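
Option 1 can be sketched as a loop over a byte buffer. This is a minimal illustration, not a proposed implementation: `toy_length_of` is a hypothetical classifier whose opcode values (0 for 16 bit, 1 for 48 bit) are placeholders rather than a real allocation, and taking a 32 bit instruction as bytes 0 to 3 unchanged is an assumption.

```python
from typing import Callable, List, Tuple


def toy_length_of(halfword: bytes) -> int:
    """HYPOTHETICAL classifier: instruction length in bytes from bytes 2-3.

    In LE mode the most significant byte of the halfword is its second
    byte; its top 5 bits stand in for the major opcode here.  The values
    0 (16 bit) and 1 (48 bit) are placeholders, not a real allocation.
    """
    major = halfword[1] >> 3
    return {0: 2, 1: 6}.get(major, 4)


def extract_option1(stream: bytes,
                    length_of: Callable[[bytes], int] = toy_length_of
                    ) -> Tuple[List[bytes], bytes]:
    """Split an LE byte stream into instructions following Option 1."""
    buf = bytearray(stream)
    out: List[bytes] = []
    while len(buf) >= 4:
        n = length_of(bytes(buf[2:4]))
        if n == 2:
            # 16 bit: bytes 2-3 leave the stream, bytes 0-1 stay in place.
            out.append(bytes(buf[2:4]))
            del buf[2:4]
        elif n == 6:
            if len(buf) < 6:
                break  # incomplete 48 bit instruction at end of stream
            # 48 bit: assembled from bytes 2,3,0,1,4,5.
            out.append(bytes(buf[2:4] + buf[0:2] + buf[4:6]))
            del buf[0:6]
        else:
            # 32 bit (ASSUMPTION): bytes 0-3 are taken unchanged.
            out.append(bytes(buf[0:4]))
            del buf[0:4]
    return out, bytes(buf)  # second item: bytes not yet decodable
```

Note how, after a 16 bit instruction is removed, the retained bytes 0 and 1 end up as the low half of whatever instruction is found at the new bytes 2 and 3.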

Option 2:

When reading from memory, before handing to the instruction decoder, bytes
0 and 1 are swapped unconditionally with bytes 2 and 3.  Effectively this
is near-identical to LE/BE byte-level swapping on a 32-bit block, except
that this time it is half-word (16 bit) swapping on a 32-bit block.

With the Major Opcode then always being in the first 2 bytes, it becomes
much simpler for the pre-analysis phase to determine instruction length,
regardless of what that length is (16/32/48/64/VBLOCK).
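
The unconditional swap in Option 2 is small enough to show directly. The following is a sketch only; `swap_halfwords` is a hypothetical name, and requiring the input to be a whole number of 32-bit blocks is an assumption made to keep the example simple.

```python
def swap_halfwords(data: bytes) -> bytes:
    """Swap bytes 0-1 with bytes 2-3 in every 32-bit block of `data`."""
    if len(data) % 4:
        raise ValueError("expected a whole number of 32-bit blocks")
    out = bytearray()
    for i in range(0, len(data), 4):
        # The upper halfword moves to the front, the lower one follows.
        out += data[i + 2:i + 4] + data[i:i + 2]
    return bytes(out)
```

Applying the function twice returns the original bytes, which is the same property that byte-level LE/BE swapping has.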

# 16 bit Compressed

This one is a conundrum.  OpenPOWER ISA was never designed with 16
bit in mind.  VLE was added 10 years ago but only by way of marking
an entire 64k page as "VLE".  With no means to mix 32 bit and 16 bit,
jumping between the two would have been painful and taken up space.

Here, in order to embed 16 bit into a predominantly 32 bit stream,
using an entire 16 bits just to switch into Compressed mode is itself
a significant overhead.  The situation is made worse by 5 bits
being taken up by Major Opcode space, leaving only 11 bits to allocate
to actual instructions.

In addition we would like to add SV-C32, which is a Vectorised version
of 16 bit Compressed, and ideally have a variant that adds the 27-bit
prefix format from SV-P64 as well.

Potential ways to reduce pressure on the 16 bit space are:

* To provide "paging".  This involves bank-switching to alternative optimised encodings for specific workloads
* To enter "16 bit mode" for durations specified at the start
* To reserve one bit of every 16 bit instruction to indicate that the 16 bit mode is to continue to be sustained

In the Vector context it would be useful to give this last bit an
alternative meaning: it determines whether the instruction is 11-bit
prefixed or 27-bit prefixed:

    0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | 11 bit vector prefix|
    |16 bit opcode  alt vec. mode ^ |
    | extra vector prefix if alt set|
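
As a rough illustration of the diagram above, the alternative-mode bit can be read off the 16 bit opcode halfword to find the prefix width and total instruction length. The sketch assumes MSB0 numbering (bit 0 is the most significant, so bit `f` is the least significant bit of the halfword) and assumes the extra prefix is a third 16 bit halfword. The function names are hypothetical.

```python
def bit_msb0(halfword: int, index: int) -> int:
    """Return bit `index` of a 16-bit value, bit 0 being the most significant."""
    return (halfword >> (15 - index)) & 1


def vector_prefix_bits(opcode_halfword: int) -> int:
    """11 prefix bits normally, 11 + 16 = 27 when the alt bit (bit f) is set."""
    return 27 if bit_msb0(opcode_halfword, 0xF) else 11


def total_length_bytes(opcode_halfword: int) -> int:
    """Prefix halfword + 16 bit opcode, plus one extra halfword if alt is set."""
    return 6 if bit_msb0(opcode_halfword, 0xF) else 4
```

The two resulting lengths, 4 and 6 bytes, line up with the names SV-C32 and SV-C48.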

Using a major opcode to enter 16 bit mode leaves 11 bits in the entry
halfword for which a use must be found:

    0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | what to do here   1 |
    |16 bit    stay in 16bit mode 1 |
    |16 bit    stay in 16bit mode 1 |
    |16 bit       exit 16bit mode 0 |

One possibility is that the 11 bits are used for bank selection, with
some room for additional context such as altering the registers used
for the 16 bit operations (bank selection of which scalar regs).

Another is to use the 11 bits for only the most commonly used
instructions.  In that case, one of those 11 bits would also need to be
dedicated to saying whether 16 bit mode is to be continued, so only
10 bits remain for actual opcodes!
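
The stay/exit bit from the diagram above can be sketched as a scan over halfwords. This is only an illustration under stated assumptions: bit `f` is taken to be the least significant bit (MSB0 numbering), the entry halfword is assumed to carry the same continuation bit as the 16 bit instructions that follow it, and `split_16bit_run` is a hypothetical name.

```python
from typing import List, Tuple


def split_16bit_run(halfwords: List[int]) -> Tuple[List[int], List[int]]:
    """Return (halfwords consumed in 16 bit mode, remaining halfwords).

    halfwords[0] is the entry halfword carrying the major opcode.  A set
    bit f (the LSB here) means the next halfword is also a 16 bit
    instruction; a clear bit f marks the last halfword of the run.
    """
    end = 0
    for hw in halfwords:
        end += 1
        if (hw & 1) == 0:
            break  # this halfword exits 16 bit mode
    return halfwords[:end], halfwords[end:]
```

For example, an entry halfword followed by two "stay" instructions and one "exit" instruction consumes four halfwords, after which decoding would resume in 32 bit mode.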

## 16 bit Compressed opcodes exploration

### Branch

10 bit mode may be expanded by 16 bit mode later, adding capabilities
that do not fit in the extremely limited space.

    | 0 1 2 3 4 | | 5 6 7 | 8 9 | a b | c d | e  | f |
    |           | | 0 0 0 |     offs        | LK | 1 | b
    |           | | 0 0 1 | 00  | BI  | BO  | LK | 1 | bclr
    |           | | 0 0 1 | 01  | BI  | BO  | LK | 1 | bctar

10 bit mode:

* BO[0] enables CR check, BO[1] inverts check
* BI selects one of the 4 bits of CR0 only
* no Branch Conditional with immediate
* no Absolute Address
* no CTR mode (and no bctr)
* offs is a signed offset, 2-byte aligned
* all branch targets are 2-byte aligned
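
The branch table above can be turned into a small field decoder. This is a sketch under assumptions, not a definitive decoder: it uses MSB0 bit numbering as in the table, treats `offs` as a signed 6-bit count of 2-byte units, and the names `field` and `decode_branch` are hypothetical.

```python
def field(hw: int, start: int, end: int) -> int:
    """Extract bits start..end (inclusive, MSB0 numbering) of a 16-bit value."""
    width = end - start + 1
    return (hw >> (15 - end)) & ((1 << width) - 1)


def decode_branch(hw: int):
    """Decode the 10 bit mode branch encodings; return None if not a branch."""
    group = field(hw, 5, 7)
    lk = field(hw, 0xE, 0xE)
    if group == 0b000:
        offs = field(hw, 8, 0xD)
        if offs & 0x20:
            offs -= 0x40  # sign-extend the 6-bit field
        # ASSUMPTION: offs counts 2-byte units, hence the multiply by 2.
        return {"op": "b", "byte_offset": offs * 2, "LK": lk}
    if group == 0b001:
        names = {0b00: "bclr", 0b01: "bctar"}
        sub = field(hw, 8, 9)
        if sub in names:
            return {"op": names[sub], "BI": field(hw, 0xA, 0xB),
                    "BO": field(hw, 0xC, 0xD), "LK": lk}
    return None
```

With 6 bits of `offs`, this interpretation would give an unconditional branch range of -64 to +62 bytes.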

### LD/ST

    | 0 1 2 3 4 | | 5 6 7 | 8 9 | a b | c d | e | f |
    |           | | 0 0 1 | 11  | RB  | RA  | 0 | 1 | ld
    |           | | 0 0 1 | 11  | RB  | RA  | 1 | 1 | st

* elwidth overrides can set different widths

### Arithmetic

    | 0 1 2 3 4 | | 5 6 7 | 8 9 a | b c d | e | f |
    |           | | 0 1 0 | RB    | RA    | 0 | 1 | add
    |           | | 0 1 0 | RB    | RA    | 1 | 1 | mul
    |           | | 0 1 1 | RB    | (RA|0)| 0 | 1 | sub
    |           | | 0 1 1 | RB    | (RA|0)| 1 | 1 | cmp

10 bit mode:

* cmp default target is CR0
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that sub becomes neg, and cmp becomes cmp-against-zero

### Logical

    | 0 1 2 3 4 | | 5 6 7 | 8 9 a | b c d | e | f |
    |           | | 1 0 0 | RB    | RA    | 0 | 1 | and
    |           | | 1 0 0 | RB    | RA    | 1 | 1 | nand
    |           | | 1 0 1 | RB    | RA    | 0 | 1 | or
    |           | | 1 0 1 | RB    | (RA|0)| 1 | 1 | nor

10 bit mode:

* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that nor becomes not

### Floating Point

    | 0 1 2 3 4 | | 5 6 7 | 8 9 a | b c d | e | f |
    |           | | 1 1 0 | RB    | RA!=0 | 0 | 1 | fadd
    |           | | 1 1 0 | RB    | 0 0 0 | 0 | 1 | fabs
    |           | | 1 1 0 | RB    | RA    | 1 | 1 | fmul
    |           | | 1 1 1 | RB    | (RA|0)| 0 | 1 | fsub
    |           | | 1 1 1 | RB    | (RA|0)| 1 | 1 | fcmp

10 bit mode:

* fcmp default target is CR1
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that fsub becomes fneg, and fcmp becomes fcmp-against-zero
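
The Arithmetic, Logical and Floating Point tables share one layout (bits 5-7 select the group, bit `e` selects within it), so they lend themselves to a table-driven lookup. The sketch below is illustrative only: it assumes MSB0 numbering, does not check bit `f`, and the names `MNEMONICS`, `ZERO_RA_ALIASES` and `decode_rr` are hypothetical.

```python
# (bits 5-7, bit e) -> mnemonic, transcribed from the tables above
MNEMONICS = {
    (0b010, 0): "add",  (0b010, 1): "mul",
    (0b011, 0): "sub",  (0b011, 1): "cmp",
    (0b100, 0): "and",  (0b100, 1): "nand",
    (0b101, 0): "or",   (0b101, 1): "nor",
    (0b110, 0): "fadd", (0b110, 1): "fmul",
    (0b111, 0): "fsub", (0b111, 1): "fcmp",
}

# Encodings whose meaning changes when the RA field is zero
ZERO_RA_ALIASES = {
    "sub": "neg", "cmp": "cmp-against-zero", "nor": "not",
    "fadd": "fabs", "fsub": "fneg", "fcmp": "fcmp-against-zero",
}


def decode_rr(hw: int):
    """Return (mnemonic, RB, RA) for a register-register halfword, else None."""
    group = (hw >> 8) & 0b111  # bits 5-7
    rb = (hw >> 5) & 0b111     # bits 8-a
    ra = (hw >> 2) & 0b111     # bits b-d
    sel = (hw >> 1) & 1        # bit e
    name = MNEMONICS.get((group, sel))
    if name is None:
        return None
    if ra == 0:
        name = ZERO_RA_ALIASES.get(name, name)
    return name, rb, ra
```

Keeping the RA=0 special cases in a separate table makes it easy to see which encodings are affected: add, mul, and, nand, or and fmul are not.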

### Condition Register

    | 0 1 2 3 4 | | 5 6 7 | 8 9 | a b | c d e  | f |
    |           | | 0 0 1 | 10  | BF  | BFA    | 1 | mcrf

10 bit mode:

* BF is only 2 bits which means the destination is only CR0-CR3