1 # 16 bit Compressed
2
Similar to VLE (but without immediate-prefixing) this encoding is designed
to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR
is recommended). Note that Compressed is *mutually incompatible* with
OpenPOWER v3.1B "prefixing" because it requires both EXT000 and EXT001.
Hypothetically it could be made to use anything other than EXT001, with
some inconvenience (extra gates). The incompatibility is "fixed" by
swapping out of "Compressed" Mode and back into "Normal" (v3.1B) Mode,
at runtime, as needed.
11
Although initially intended to be augmented by Simple-V Prefixing (which
adds Vector context, width overrides such as IEEE754 FP16, and predication)
without putting pressure on I-Cache power or size, this Compressed Encoding
is not critically dependent on SV Prefixing, and may be used stand-alone.
16
17 See:
18
19 * <https://bugs.libre-soc.org/show_bug.cgi?id=238>
20 * <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding
21 * <http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html>
22
23 This one is a conundrum. OpenPOWER ISA was never designed with 16
24 bit in mind. VLE was added 10 years ago but only by way of marking
25 an entire 64k page as "VLE". With VLE not maintained it is not
26 fully compatible with current PowerISA.
27
Here, in order to embed 16 bit into a predominantly 32 bit stream, using
an entire 16 bits just to switch into Compressed mode is itself a
significant overhead. The situation is made worse by OpenPOWER ISA being
fundamentally designed with 6 bits uniformly taking up Major Opcode
space, leaving only 10 bits to allocate to actual instructions.
34
35 Contrast this with RVC which takes 3 out of 4 combinations of the first 2
36 bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and
37 uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing
38 standard 32 bit and 16 bit to intermingle cleanly. To achieve the same
39 thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which
40 is clearly impractical: other schemes need to be devised.
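The contrast can be made concrete with a minimal sketch. The RVC rule (two LSBs other than 0b11 mean a 16-bit instruction) is as described above; the helper names `rvc_is_16bit` and `ppc_major_opcode` are illustrative only and are not part of any specification.

```python
def rvc_is_16bit(parcel: int) -> bool:
    """RISC-V: a 16-bit parcel whose two LSBs are not 0b11 is a compressed insn."""
    return (parcel & 0b11) != 0b11


def ppc_major_opcode(word: int) -> int:
    """OpenPOWER: the Major Opcode is the top 6 bits (MSB0 bits 0-5) of a 32-bit word."""
    return (word >> 26) & 0x3F


# RVC needs only two bits to tell the lengths apart ...
print(rvc_is_16bit(0x4501))          # low bits 0b01: 16-bit
print(rvc_is_16bit(0x0013))          # low bits 0b11: escape to 32-bit
# ... whereas every OpenPOWER instruction spends six bits on the Major Opcode.
print(ppc_major_opcode(0x38210001))  # addi r1,r1,1 uses Major Opcode 14
```

The point of the comparison is that RVC's length decision is a 2-bit test, whilst any OpenPOWER equivalent has to be expressed in terms of whole 6-bit Major Opcodes.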
41
42 In addition we would like to add SV-C32 which is a Vectorised version
43 of 16 bit Compressed, and ideally have a variant that adds the 27-bit
44 prefix format from SV-P64, as well.
45
46 Potential ways to reduce pressure on the 16 bit space are:
47
48 * To use more than one v3.0B Major Opcode, preferably an odd-even
49 contiguous pair
50 * To provide "paging". This involves bank-switching to alternative
51 optimised encodings for specific workloads
52 * To enter "16 bit mode" for durations specified at the start
53 * To reserve one bit of every 16 bit instruction to indicate that the
54 16 bit mode is to continue to be sustained
55
56 This latter would be useful in the Vector context to have an alternative
57 meaning: as the bit which determines whether the instruction is 11-bit
58 prefixed or 27-bit prefixed:
59
60 0 1 2 3 4 5 6 7 8 9 a b c d e f |
61 |major op | 11 bit vector prefix|
62 |16 bit opcode alt vec. mode ^ |
63 | extra vector prefix if alt set|
64
Using a major opcode to enter 16 bit mode leaves 11 bits for which
a use must be found:
67
68 0 1 2 3 4 5 6 7 8 9 a b c d e f |
69 |major op | what to do here 1 |
70 |16 bit stay in 16bit mode 1 |
71 |16 bit stay in 16bit mode 1 |
72 |16 bit exit 16bit mode 0 |
73
74 One possibility is that the 11 bits are used for bank selection,
75 with some room for additional context such as altering the registers
76 used for the 16 bit operations (bank selection of which scalar regs).
77 However the downside is that short sequences of Compressed instructions
78 become penalised by the fixed overhead. Even a single 16 bit instruction
79 requires a 16 bit overhead to "gain access" to 16 bit "mode", making
80 the exercise pointless.
81
An alternative is to use the first 11 bits for only the most commonly
used instructions. That being the case, one of those 11 bits could
be dedicated to saying whether 16 bit mode is to be continued, at which
point *all* 16 bits can be used for Compressed. 10 bits remain for
actual opcodes, which is ridiculously tight; however the opportunity to
subsequently use all 16 bits is worth it.
88
89 The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:
90
91 |0 1 2 3 4 5 6 7 8 9 a b c d e f|
92 |major op..0| LO Half C space |
93 |major op..1| HI Half C space |
94 |N N N N N|<--11 bits C space-->|
95
96 If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this
97 saves gates at a critical part of the decode phase.
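As a rough illustration of the gate saving, the following sketch assumes the contiguous pair is EXT000/EXT001 (as used elsewhere on this page), so that NNNNN is five zero bits. The names `EXT_PAIR_PREFIX`, `is_compressed_tag` and `c_space` are hypothetical.

```python
EXT_PAIR_PREFIX = 0b00000  # assumption: EXT000 and EXT001 share these five MSBs


def is_compressed_tag(halfword: int) -> bool:
    """One 5-bit comparison recognises both Major Opcodes of the pair."""
    return (halfword >> 11) == EXT_PAIR_PREFIX


def c_space(halfword: int) -> int:
    """The remaining 11 bits: the LSB of the Major Opcode plus the 10 bits after it."""
    return halfword & 0x7FF


print(is_compressed_tag(0x07FF))  # EXT001 with all C-space bits set
print(is_compressed_tag(0x0800))  # Major Opcode 2 is outside the pair
print(hex(c_space(0x07FF)))
```

With a non-contiguous pair the decoder would instead need two separate 6-bit comparisons, and the 11-bit C space would no longer be a single contiguous field.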
98
99 ## ABI considerations
100
Unlike RISC-V RVC, the above "context" encodings require state, which
must be stored in the PCR, MSR, or a dedicated SPR. These bits (just like
LE/BE 32bit mode and the IEEE754 FPSCR mode) are context that must be
taken into consideration.
105
106 In particular it is critically important to recognise that context (in
107 general) is an implicit part of the ABI implemented for example by glibc6.
108 Therefore (in specific) Compressed Mode Context **must not** be permitted
109 to cross into or out of a function call.
110
111 Thus it is the mandatory responsibility of the compiler to ensure that
112 context returns to "v3.0B Standard" prior to entering a function call
113 (responsibility of caller) and prior to exit from a function call
114 (responsibility of callee).
115
116 Trap Handlers also take responsibility for saving and restoring of
117 Compressed Mode state, just as they already take responsibility for
118 other critical state. This makes traps transparent to functions as
119 far as Compressed Mode Context is concerned, just as traps are already
120 transparent to functions.
121
Note however that there are exceptions in a compiler to the otherwise
hard rule that Compressed Mode context not be permitted to cross function
boundaries: inline functions and static functions. Static functions,
if correctly identified as never being called externally, may, as an
optimisation, disregard standard ABIs, bearing in mind that this will
be fraught (pointers to functions) and not easy to get right.
128
129 # Opcode Allocation Ideas
130
131 * one bit from the 16-bit mode is used to indicate that standard
132 (v3.0B) mode is to be dropped into for only one single instruction
133 <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>
134
135 ## Opcodes exploration (Attempt 1)
136
137 Switching between different encoding modes is controlled by M (alone)
138 in 10-bit mode, and M and N in 16-bit mode.
139
140 * M in 10-bit mode if zero indicates that following instructions are
141 standard OpenPOWER ISA 32-bit encoded (including, redundantly,
142 further 10/16-bit instructions)
143 * M in 10-bit mode if 1 indicates that following instructions are
144 in 16-bit encoding mode
145
146 Once in 16-bit mode:
147
* 0b01 (N=0, M=1): stay in 16-bit mode
* 0b00 (N=0, M=0): leave 16-bit mode permanently (return to standard OpenPOWER ISA)
* 0b10 (N=1, M=0): leave 16-bit mode for one instruction (one standard
  OpenPOWER ISA instruction, then back to 16-bit)
* 0b11 (N=1, M=1): free to be used for something completely different.
152
153 The current "top" idea for 0b11 is to use it for a new encoding format
154 of predominantly "immediates-based" 16-bit instructions (branch-conditional,
155 addi, mulli etc.)
156
* The Compressed Major Opcode is in bits 5-7.
* The Minor opcode is in bit 8.
* In some cases bit 9 is taken as an additional sub-opcode, followed
  by bits 0-4 (for CR operations)
* M+N mode-switching is not available for C-Major.minor 0b001.1
* 10 bit mode may be expanded by 16 bit mode, adding capabilities
  that do not fit in the extremely limited space.
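The field positions listed above use MSB0 numbering (bit 0 is the most significant bit of the 16-bit halfword, bit f the least). A minimal extraction sketch follows; the helper names `bit`, `bits` and `decode_fields` are illustrative assumptions, not an existing API.

```python
def bit(halfword: int, i: int) -> int:
    """Return MSB0 bit i (0..15) of a 16-bit halfword."""
    return (halfword >> (15 - i)) & 1


def bits(halfword: int, start: int, end: int) -> int:
    """Return the inclusive MSB0 range start..end as an unsigned integer."""
    width = end - start + 1
    return (halfword >> (15 - end)) & ((1 << width) - 1)


def decode_fields(halfword: int) -> dict:
    """Pull out the mode bits and the C major/minor opcode."""
    return {
        "N": bit(halfword, 0),         # only meaningful in 16-bit mode
        "cmaj": bits(halfword, 5, 7),  # Compressed Major Opcode
        "cmin": bit(halfword, 8),      # Minor opcode
        "M": bit(halfword, 15),
    }


# 0b1_0000_101_1_000000_1: N=1, Cmaj.m=101.1, M=1
print(decode_fields(0b1000010110000001))
```

In 10-bit mode bits 0-4 belong to the EXT000/EXT001 tag, so only `cmaj`, `cmin` and `M` carry meaning there.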
164
165 Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit.
166 16-bit immediate mode remains in 16-bit.
167
168 | 0 | 1234 | 567 8 | 9abcde | f | explanation
169 | - | ---- | ------ | ------ | - | -----------
170 | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
171 | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
172 | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
173 | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
174 | 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
175 | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit
176
177 Notes:
178
179 * Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
180 * EXT000 and EXT001 are v3.0B Major Opcodes. The first 5 bits
181 are zero, therefore the 6th bit is actually part of Cmaj.
182 * "10bit then 16bit" means "this instruction is encoded C 10bit
183 and the following one in C 16bit"
184 * "16b, 1x v3.0B, 16b" means, "this instruction is encoded C 16bit,
185 the following one is V3.0B Standard, and the one after that is
186 back to 16bit".
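The six rows of the table can be restated as a small lookup, shown here as a hedged sketch. The mode names and the functions `next_after_10bit` and `next_after_16bit` are illustrative; the second element of each tuple records the mode that is forced for the instruction *after* the next one, if any.

```python
V30B, C10, C16 = "v3.0B", "10bit", "16bit"


def next_after_10bit(m: int) -> str:
    """A 10-bit instruction has only M: 0 returns to v3.0B, 1 enters 16-bit."""
    return C16 if m else V30B


# (N, M) -> (mode of next instruction, forced mode of the one after, or None)
AFTER_16BIT = {
    (0, 0): (V30B, None),  # 16bit then v3.0B
    (0, 1): (C16, None),   # 16bit then 16bit
    (1, 0): (V30B, C16),   # 16b, 1x v3.0B, 16b
    (1, 1): (C16, None),   # 16b/imm then 16bit
}


def next_after_16bit(n: int, m: int) -> tuple:
    return AFTER_16BIT[(n, m)]


print(next_after_10bit(1))
print(next_after_16bit(1, 0))
```

Note that entering 10-bit mode is not driven by M or N at all: it happens whenever a v3.0B-mode instruction carries the EXT000/EXT001 tag.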
187
188 ### C Instruction Encoding types
189
190 10-bit Opcode formats (all start with v3.0B EXT000 or EXT001
191 Major Opcodes)
192
193 | 01234 | 567 8 | 9 | a b | c | d e | f | enc
194 | E01 | Cmaj.m | fld1 | fld2 | M | 10b
195 | E01 | Cmaj.m | offset | M | 10b b
196 | E01 | 001.1 | S1 | fd1 | S2 | fd2 | M | 10b sub
197 | E01 | 111.m | fld1 | fld2 | M | 10b LDST
198
199 16-bit Opcode formats (including 10/16/v3.0B Switching)
200
201 | 0 | 1234 | 567 8 | 9 | a b | c | d e | f | enc
202 | N | immf | Cmaj.m | fld1 | fld2 | M | 16b
203 | 1 | immf | Cmaj.m | fld1 | imm | 1 | 16b imm
204 | N | fd3 | 001.1 | S1 | fd1 | S2 | fd2 | M | 16b sub
205 | N | fd4 | 111.m | fld1 | fld2 | M | 16b LDST
206
207 Notes:
208
209 * fld1 and fld2 can contain reg numbers, immediates, or opcode
210 fields (BO, BI, LK)
211 * S1 and S2 are further sub-selectors of C 001.1
212
213 ### Immediate Opcodes
214
Immediate opcodes are only available in 16-bit mode, and only when
M=1 and N=1 and Cmaj.min is not 0b001.1.
217
218 instruction counts from objdump on /bin/bash:
219
220 466 extsw r1,r1
221 649 stw r1,1(r1)
222 691 lwz r1,1(r1)
223 705 cmpdi r1,1
224 791 cmpwi r1,1
225 794 addis r1,r1,1
226 1474 std r1,1(r1)
227 1846 li r1,1
228 2031 mr r1,r1
229 2473 addi r1,r1,1
230 3012 nop
231 3028 ld r1,1(r1)
232
233
234 | 0 | 1 | 2 | 3 4 | | 567.8 | 9ab | cde | f |
235 | 1 | 0 | 0 0 0 | | 001.0 | | 000 | 1 | TBD
236 | 1 | 0 | sh2 | | 001.0 | RA | sh | 1 | sradi.
237 | 1 | 1 | 0 0 0 | | 001.0 | | 000 | 1 | TBD
238 | 1 | 1 | 0 | sh2 | | 001.0 | RA | sh | 1 | srawi.
239 | 1 | 1 | 1 | | | 001.0 | 000 | imm | 1 | TBD
240 | 1 | 1 | 1 | i2 | | 001.0 | RA!=0| imm | 1 | addis
241 | 1 | | | 010.0 | 000 | | 1 | TBD
242 | 1 | i2 | | 010.0 | RA!=0| imm | 1 | addi
243 | 1 | 0 | i2 | | 010.1 | RA | imm | 1 | cmpdi
244 | 1 | 1 | i2 | | 010.1 | RA | imm | 1 | cmpwi
245 | 1 | 0 | i2 | | 011.0 | RT | imm | 1 | ldspi
246 | 1 | 1 | i2 | | 011.0 | RT | imm | 1 | lwspi
247 | 1 | 0 | i2 | | 011.1 | RT | imm | 1 | stwspi
248 | 1 | 1 | i2 | | 011.1 | RT | imm | 1 | stdspi
249 | 1 | i2 | RA | | 100.0 | RT | imm | 1 | stwi
250 | 1 | i2 | RA | | 100.1 | RT | imm | 1 | stdi
251 | 1 | i2 | RT | | 101.0 | RA | imm | 1 | ldi
252 | 1 | i2 | RT | | 101.1 | RA | imm | 1 | lwi
253 | 1 | i2 | RA | | 110.0 | RT | imm | 1 | fsti
254 | 1 | i2 | RA | | 110.1 | RT | imm | 1 | fstdi
255 | 1 | i2 | RT | | 111.0 | RA | imm | 1 | flwi
256 | 1 | i2 | RT | | 111.1 | RA | imm | 1 | fldi
257
258 Construction of immediate:
259
* LD/ST r1 (SP) variants should be offset by -256,
  see <https://bugs.libre-soc.org/show_bug.cgi?id=238#c43>
  - SP variants map to e.g. ld RT, imm(r1)
  - SV Prefixing can be used to map r1 to alternate regs
* the Compressed addis is not the same as v3.0B addis: the shift amount is
  smaller and the result still maps to within the v3.0B addi immediate range.
* addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
* addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1023 in
  increments of 8
* all others are EXTS(i2||imm) to give a 7-bit range -128 to +127
  (further for LD/ST due to word/dword-alignment)
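The EXTS(a||b) constructions above can be tried out with a short sketch. Because the field widths in this draft are not final, widths are passed in explicitly; `concat`, `exts` and `signed_range` are hypothetical helper names.

```python
def concat(*fields):
    """Concatenate (value, width) pairs MSB-first; returns (value, total_width)."""
    value, width = 0, 0
    for v, w in fields:
        value = (value << w) | (v & ((1 << w) - 1))
        width += w
    return value, width


def exts(value: int, width: int) -> int:
    """Sign-extend a width-bit two's complement value to a Python int."""
    sign = 1 << (width - 1)
    value &= (1 << width) - 1
    return (value ^ sign) - sign


def signed_range(width: int, shift: int = 0) -> tuple:
    """Smallest and largest values of a width-bit signed field shifted left by shift."""
    return (-(1 << (width - 1))) << shift, ((1 << (width - 1)) - 1) << shift


# a 1-bit i2 of 1 followed by a 3-bit imm of 0 sign-extends to -8
print(exts(*concat((0b1, 1), (0b000, 3))))
print(signed_range(4))     # plain 4-bit immediate
print(signed_range(8, 3))  # 8-bit immediate followed by ||000
```

In general an n-bit field spans -2^(n-1) to 2^(n-1)-1, and appending k zero bits multiplies both ends by 2^k, so the largest reachable value is always a multiple of 2^k.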
271
272 Further Notes:
273
* bc also has an immediate mode, listed separately below in the Branch section
* for LD/ST, the offset is aligned: 8-byte uses i2||imm||0b000,
  4-byte uses i2||imm||0b00
* SV Prefix overrides help provide alternative bitwidths for LD/ST
* RA|0: if RA is zero, addi. becomes "li"
  - this only works if RT takes part of opcode
  - mv is also possible by specifying an immediate of zero
280
281 ### Illegal, nop and attn
282
283 Note that illeg is all zeros, including in the 16-bit mode.
284 Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and
285 EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit
286 run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000
287 is "nop"
288
289 | 16-bit mode | | 10-bit mode |
290 | 0 | 1 | 234 | | 567.8 | 9 ab | c de | f |
291 | - | - | --- | | ----- | ----- | ------ | - |
292 | 0 | 0 000 | | 000.0 | 0 00 | 0 00 | 0 | illeg
293 | 0 | 0 000 | | 000.0 | 0 00 | 0 00 | 1 | nop
294
295 16 bit mode only:
296
297 | 0 | 1 | 234 | | 567.8 | 9 ab | c de | f |
298 | - | - | --- | | ----- | ----- | ------ | - |
299 | 1 | 0 000 | | 000.0 | 0 00 | 0 00 | 0 | nop
300 | 1 | 0 000 | | 000.0 | 0 00 | 0 00 | 1 | nop
301 | N | 1 000 | | 000.0 | 0 00 | 0 00 | M | attn
302
303 Notes:
304
* All-zeros being an illegal instruction is normal for ISAs. Ensuring that
  this remains true at all times, i.e. for both 10 bit and 16 bit mode, is
  common sense.
* The 10-bit nop (bit 15, M=1) is intended for circumstances
  where alignment to 32-bit before returning to v3.0B is required.
  M=1 being an indication "return to Standard v3.0B Encoding Mode".
* The 16-bit nop (bit 0, N=1) is intended for circumstances where a
  return to Standard v3.0B Encoding is required for one instruction,
  and that instruction needs to be aligned to a 32-bit boundary.
  An example of this would be a return to "strict" (non-C) mode,
  where the PC must be word-aligned.
* If for any reason multiple 16 bit nops are needed in succession,
  the M=1 variant can be used, because each one returns to
  Standard v3.0B Encoding Mode, each time.
319
320 In essence the 2 nops are needed due to there being 2 different C forms:
321 10 and 16 bit.
322
323 ### Branch
324
TODO: document that branching whilst using mode-switching bits (M/N) is perfectly well permitted, but that it is specifically and wholly the compiler/assembler writer's responsibility to obey ABI rules and to ensure that, even with branches and returns, at no time is an incorrect mode entered or left that could result in any instruction being misinterpreted.
326
327 | 16-bit mode | | 10-bit mode |
328 | 0 | 1 | 234 | | 567.8 | 9 ab | c de | f |
329 | - | - | --- | | ----- | ----- | ------ | - |
330 | N | offs2 | | 000.LK | offs!=0 | M | b, bl
331 | N | | | 000.1 | 0 00 | 0 00 | M | TBD
332 | 1 | offs2 | | 000.LK | BI | BO1 oo | 1 | bc, bcl
333 | N | BO3 BI3 | | 001.0 | LK BI | BO | M | bclr, bclrl
334
335 16 bit mode:
336
337 * bc only available when N,M=0b11
338 * offs2 extends offset in MSBs
339 * BI3 extends BI in MSBs to allow selection of full CR
340 * BO3 extends BO
341 * bc offset constructed from oo as LSBs and offs2 as MSBs
342 * bc BI allows selection of all bits from CR0 or CR1
343 * bc CR check is always active (as if BO0=1) therefore BO1 inverts
344
345 10 bit mode:
346
347 * illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
348 * nop also covers part of branch (offs=0,M=0,LK=1)
349 * bc **not available** in 10-bit mode
350 * BO[0] enables CR check, BO[1] inverts check
351 * BI refers to CR0 only (4 bits of)
352 * no Branch Conditional with immediate
353 * no Absolute Address
354 * CTR mode allowed with BO[2] for b only.
355 * offs is to 2 byte (signed) aligned
356 * all branches to 2 byte aligned
357
358 ### LD/ST
359
360 Note: for 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
361
362 | 16-bit mode | | 10-bit mode |
363 | 0 | 1 | 234 | | 567.8 | 9 a b | c d e | f |
364 | - | -- | --- | | ----- | ----- | ----- | - |
365 | N | SZ | RB | | 001.1 | 1 RA | 0 RT | M | st
366 | N | SZ | RB | | 001.1 | 1 RA | 1 RT | M | fst
367 | N | SZ | RT | | 111.0 | RA | RB | M | ld
368 | N | SZ | RT | | 111.1 | RA | RB | M | fld
369
370 * elwidth overrides can set different widths
371
372 16 bit mode:
373
374 * SZ=1 is 64 bit, SZ=0 is 32 bit
375
376 10 bit mode:
377
378 * RA and RB are only 2 bit (0-3)
379 * for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
380 * for ST, there is no offset: "st RT, RA(0)"
381
382 ### Arithmetic
383
384 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
385 * 16-bit: note that bit 1==0 (sub-sub-encoding)
386
387 10 and 16 bit:
388
389 | 16-bit mode | | 10-bit mode |
390 | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
391 | - | - | --- | | ----- | --- | ----- | - |
392 | N | 0 | RT | | 010.0 | RB | RA!=0 | M | add
393 | N | 0 | RT | | 010.1 | RB | RA|0 | M | sub.
394 | N | 0 | BF | | 011.0 | RB | RA|0 | M | cmpl
395
396 Notes:
397
398 * sub. and cmpl: default CR target is CR0
399 * for (RA|0) when RA=0 the input is a zero immediate,
400 meaning that sub. becomes neg. and cmp becomes cmpi against zero
401 * RT is implicitly RB: "add RT(=RB), RA, RB"
402 * Opcode 0b010.0 RA=0 is not missing from the above:
403 it is a system-wide instruction, "cbank" (section below)
404
405 16 bit mode only:
406
407 | 0 | 1 | 234 | | 567.8 | 9ab | cde | f |
408 | - | - | --- | | ----- | --- | ----- | - |
409 | N | 1 | RA | | 010.0 | RB | RS | M | sld.
410 | N | 1 | RA | | 010.1 | RB | RS!=0 | M | srd.
411 | N | 1 | RA | | 010.1 | RB | 000 | M | srad.
412 | N | 1 | BF | | 011.0 | RB | RA|0 | M | cmpw
413
414 Notes:
415
416 * for srad, RS=RA: "srad. RA(=RS), RS, RB"
417
418 ### Logical
419
420 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
421 * 16-bit: note that bit 1==0 (sub-sub-encoding)
422
423 10 and 16 bit:
424
425 | 16-bit mode | | 10-bit mode |
426 | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
427 | - | - | --- | | ----- | --- | ----- | - |
428 | N | 0 | RT | | 100.0 | RB | RA!=0 | M | and
429 | N | 0 | RT | | 100.1 | RB | RA!=0 | M | nand
430 | N | 0 | RT | | 101.0 | RB | RA!=0 | M | or
431 | N | 0 | RT | | 101.1 | RB | RA!=0 | M | nor/mr
432 | N | 0 | RT | | 100.0 | RB | 0 0 0 | M | extsw
433 | N | 0 | RT | | 100.1 | RB | 0 0 0 | M | cntlz
434 | N | 0 | RT | | 101.0 | RB | 0 0 0 | M | popcnt
435 | N | 0 | RT | | 101.1 | RB | 0 0 0 | M | not
436
437 16-bit mode only (note that bit 1 == 1):
438
439 | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
440 | - | - | --- | | ----- | --- | ----- | - |
441 | N | 1 | RT | | 100.0 | RB | RA!=0 | M | TBD
442 | N | 1 | RT | | 100.1 | RB | RA!=0 | M | TBD
443 | N | 1 | RT | | 101.0 | RB | RA!=0 | M | xor
444 | N | 1 | RT | | 101.1 | RB | RA!=0 | M | eqv (xnor)
445 | N | 1 | RT | | 100.0 | RB | 0 0 0 | M | extsb
446 | N | 1 | RT | | 100.1 | RB | 0 0 0 | M | cnttz
447 | N | 1 | RT | | 101.0 | RB | 0 0 0 | M | TBD
448 | N | 1 | RT | | 101.1 | RB | 0 0 0 | M | extsh
449
450 10 bit mode:
451
452 * idea: for 10bit mode, nor is actually 'mr' because mr is
453 a more common operation. in 16bit however, this encoding
454 (Cmaj.min=0b101.1, N=0) is 'nor'
455 * for (RA|0) when RA=0 the input is a zero immediate,
456 meaning that nor becomes not
457 * cntlz, popcnt, exts **not available** in 10-bit mode
458 * RT is implicitly RB: "and RT(=RB), RA, RB"
459
460 ### Floating Point
461
462 Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64
463
464 * 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
465 * 16-bit: note that bit 1==0 (sub-sub-encoding)
466
467 10 and 16 bit:
468
| 16-bit mode | | 10-bit mode |
| 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
| - | - | --- | | ----- | --- | ----- | - |
| N | 0 | RT | | 011.1 | RB | RA!=0 | M | fsub.
| N | 0 | RT | | 110.0 | RB | RA!=0 | M | fadd
| N | 0 | RT | | 110.1 | RB | RA!=0 | M | fmul
| N | 0 | RT | | 011.1 | RB | 0 0 0 | M | fneg.
| N | 0 | | | 110.0 | | 0 0 0 | M | TBD
| N | 0 | | | 110.1 | | 0 0 0 | M | TBD
478
479 16-bit mode only (note that bit 1 == 1):
480
481 | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
482 | - | - | --- | | ----- | --- | ----- | - |
483 | N | 1 | | | 011.1 | | RA!=0 | M | TBD
484 | N | 1 | | | 110.0 | | RA!=0 | M | TBD
485 | N | 1 | RT | | 110.1 | RB | RA!=0 | M | fdiv
486 | N | 1 | RT | | 011.1 | RB | 0 0 0 | M | fabs.
487 | N | 1 | RT | | 110.0 | RB | 0 0 0 | M | fmr.
488 | N | 1 | | | 110.1 | | 0 0 0 | M | TBD
489
490 16 bit only, FP to INT convert (using C 0b001.1 subencoding)
491
492 | 0 | 123 | 4 | | 567.8 | 9 ab | cde | f |
493 | - | --- | - | | ----- | ---- | ---- | - |
494 | N | 101 | X | | 001.1 | 0 RA | Y RT | M | fp2int
495 | N | 110 | X | | 001.1 | 0 RA | Y RT | M | int2fp
496
497 * X: signed=1, unsigned=0
498 * Y: FP32=0, FP64=1
499
500 10 bit mode:
501
502 * fsub. fneg. and fmr. default target is CR1
503 * fmr. is **not available** in 10-bit mode
504 * fdiv is **not available** in 10-bit mode
505
506 16 bit mode:
507
508 * fmr. copies RB to RT (and sets CR1)
509
510 ### Condition Register
511
512 10-bit or 16 bit:
513
514 | 16-bit mode| | 10-bit mode |
515 | 0 | 123 | 4 | | 567.8 | 9 ab | cde | f |
516 | - | --- | --- | | ----- | ---- | --- | - |
517 | N | 000 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf
518
519 16-bit only:
520
521 | 0 | 1234 | | 567.8 | 9 ab | cde | f |
522 | - | ---- | | ----- | ---- | --- | - |
523 | N | 0010 | | 001.1 | 0 BA | BB | M | crnor
524 | N | 0011 | | 001.1 | 0 BA | BB | M | crandc
525 | N | 0100 | | 001.1 | 0 BA | BB | M | crxor
526 | N | 0101 | | 001.1 | 0 BA | BB | M | crnand
527 | N | 0110 | | 001.1 | 0 BA | BB | M | crand
528 | N | 0111 | | 001.1 | 0 BA | BB | M | creqv
529 | N | 1000 | | 001.1 | 0 BA | BB | M | crorc
530 | N | 1001 | | 001.1 | 0 BA | BB | M | cror
531
532 Notes
533
534 10 bit mode:
535
536 * mcrf BF is only 2 bits which means the destination is only CR0-CR3
537 * CR operations: **not available** in 10-bit mode (but mcrf is)
538
539 16 bit mode:
540
541 * mcrf BF2 extends BF (in MSB) to 3 bits
542 * CR operations: destination register is same as BA.
543 * CR operations: only possible on CR0 and CR1
544
545 SV (Vector Mode):
546
547 * CR operations: greatly extended reach/range (useful for predicates)
548
549 ### System
550
551 cbank: Selection of Compressed-encoding "Bank". Different "banks"
552 give different meanings to opcodes. Example: CBank=0b001 is heavily
553 optimised to A/Video Encode/Decode. cbank borrows from add's encoding
554 space (when RA==0)
555
556 | 16-bit mode | | 10-bit mode |
557 | 0 | 1 2 3 4 | | 567.8 | 9ab | cde | f |
558 | - | ------- | | ----- | ----- | --- | - |
559 | N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank
560
561 **not available** in 10-bit mode, **only** in 16-bit mode:
562
563 | 0 | 1 | 234 | | 567.8 | 9 ab | cde | f |
564 | - | ------- | | ----- | ---- | ---- | - |
565 | N | 1 | 111 | | 001.1 | 0 00 | RT | M | mtlr
566 | N | 1 | 111 | | 001.1 | 0 01 | RT | M | mtctr
567 | N | 1 | 111 | | 001.1 | 0 00 | RA | M | mflr
568 | N | 1 | 111 | | 001.1 | 0 01 | RA | M | mfctr
569 | N | 0 RA!=0 | | 000.0 | 0 00 | 000 | M | mtcr
570 | N | 1 RT!=0 | | 000.0 | 0 00 | 000 | M | mfcr
571
572 ### Unallocated
573
574 16-bit only:
575
576 | 0 | 1 | 234 | | 567.8 | 9 ab | cde | f |
577 | - | - | --- | | ----- | ---- | ---- | - |
578 | N | 1 | 111 | | 001.1 | 0 10 | | M |
579 | N | 1 | 111 | | 001.1 | 0 11 | | M |
580
581 ## Other ideas (Attempt 2)
582
583 ### 8-bit mode-switching instructions, odd addresses for C mode
584
Drop the complexity of a 16-bit encoding that is further reduced to 10-bit,
and use a single byte instead of two to switch between modes. This
would place compressed (C) mode instructions at odd bytes, so the LSB
of the PC can be used by the processor to tell which mode it is in.
589
590 To switch from traditional to compressed mode, the single-byte
591 instruction would be at the MSByte, that holds the EXT bits. (When we
592 break up a 32-bit instruction across words, the most significant half
593 should go in the word with the lower address.)
594
595 To switch from compressed mode to traditional mode, the single-byte
596 instruction would also be at the opcode/format portion, placed in the
597 lower-address word if split across words, so that the instruction can
598 be recognized as the mode-switching one without going for its second
599 byte.
600
601 The C-mode nop should be encoded so that its second byte encodes a
602 switch to compressed mode, if decoded in traditional mode. This
603 enables such a nop to straddle across a label:
604
605 8-bit first half of nop
606 Label:
607 8-bit second half of nop AKA switch to compressed mode
608 16-bit insns...
609
610 so that if traditional code jumps to the word-aligned label (because
611 traditional branches drop the 2 LSB), it immediately switches to
612 compressed mode; if we fall-through, we remain in 16-bit mode; and if
613 we branch to it from compressed mode, whether we jump to the odd or
614 the even address, we end up in compressed mode as desired.
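The address arithmetic behind the label-straddling nop can be sketched as follows. This assumes only what is stated above (compressed instructions sit at odd byte addresses, and traditional branches drop the 2 LSBs of the target); the function names and the example address are illustrative.

```python
def mode_from_pc(pc: int) -> str:
    """Attempt 2 idea: the LSB of the PC alone identifies the current mode."""
    return "compressed" if pc & 1 else "traditional"


def traditional_branch_target(address: int) -> int:
    """Traditional (32-bit) branches drop the 2 LSBs of the target."""
    return address & ~0b11


label = 0x1000               # hypothetical word-aligned label
nop_first_byte = label - 1   # odd: the nop starts in compressed mode
switch_byte = label          # even: second half of nop, doubles as mode switch
first_c_insn = label + 1     # odd: 16-bit instructions resume here

print(mode_from_pc(nop_first_byte), mode_from_pc(switch_byte), mode_from_pc(first_c_insn))
print(hex(traditional_branch_target(label + 3)))
```

A traditional branch therefore always lands on the even switch byte, whilst fall-through execution reaches the same byte as the second half of the nop.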
615
616 Tables explaining encoding:
617
618 | byte 0 | byte 1 | byte 2 | byte 3 |
619 | v3.0B standard 32 bit instruction |
620 | EXT000 | 16 bit | 16... |
621 | .. bit | 8nop | v3.0b stand... |
622 | .. ard 32 bit | EXT000 | 16... |
623 | .. bit | 16 bit | 8nop |
624 | v3.0B standard 32 bit instruction |
625
626
627 # TODO
628
629 * make a preliminary assessment of branch in/out viability
630 * confirm FSM encoding (is LSB of PC really enough?)
631 * guestimate opcode and register allocation (without necessarily doing
632 a full encoding)
633 * write throwaway python program that estimates compression ratio from
634 objdump raw parsing
635 * finally do full opcode allocation
636 * rerun objdump compression ratio estimates
637 * check in FSM if "return to v3.0B then 16bit" if it is ok to have the v3.0B be a 10bit Compressed. should this be ignored and carry on? should a trap occur?
638
639 ### Use 2- rather than 3-register opcodes
640
641 Successful compact ISAs have used 2- rather than 3-register insns, in
642 which the same register serves as input and output. Some 20% of
643 general-purpose 3-register insns already use either input register as
644 output, without any effort by the compiler to do so.
645
Repurposing the 3 bits used to encode one of the input registers
in arithmetic, logical and floating-point insns, and the 2 bits
used to encode the mode of the next two insns, we could make the full
register files available to the opcodes already selected for
compressed mode, with one bit to spare to bring additional opcodes in.
651
652 An opcode could be assigned to an instruction that combines and
653 extends with the subsequent instruction, providing it with a separate
654 input operand to use rather than the output register, or with
655 additional range for immediate and offset operands, effectively
656 forming a 32-bit operation, enabling us to remain in compressed mode
657 even longer.
658
659 # Appendix
660
661 ## Analysis techniques and tools
662
663 objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
664 s/^[ x0-9A-F]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
665 sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
666 s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
667 sort -n | less
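A rough Python approximation of the same normalisation (registers other than r0 collapsed to `r1`, numeric operands collapsed to `1`) might look like the sketch below. It is not a line-for-line port of the sed script: the address pattern here also accepts lower-case hex, and the names `INSN`, `normalise` and `histogram` are made up for illustration.

```python
import re
from collections import Counter

# "<hex address>: <mnemonic> <operands>" as printed with --no-show-raw-insn
INSN = re.compile(r"^\s*[0-9a-fA-F]+:\s+([a-z.]+)\s*(.*)$")


def normalise(line: str):
    """Return 'mnemonic operands' with registers and numbers collapsed, or None."""
    m = INSN.match(line.replace("\t", " "))
    if m is None:
        return None  # headers, labels, blank lines
    mnemonic, operands = m.group(1), m.group(2).strip()
    operands = re.sub(r"(^|[, (])r[1-9][0-9]*", r"\1r1", operands)
    operands = re.sub(r"(^|[ ,])-*[0-9]+(?![0-9])", r"\g<1>1", operands)
    return f"{mnemonic} {operands}".strip()


def histogram(lines) -> Counter:
    """Count normalised instructions, like 'sort | uniq --count'."""
    return Counter(n for n in map(normalise, lines) if n is not None)


sample = [
    "000000000002f5c0 <main>:",
    "   2f5c0:\tld      r3,16(r1)",
    "   2f5c4:\taddi    r1,r1,-32",
    "   2f5c8:\tnop",
    "   2f5cc:\tld      r9,8(r31)",
]
for insn, count in sorted(histogram(sample).items(), key=lambda kv: kv[1]):
    print(count, insn)
```

The lines would normally come from the output of `objdump -d --no-show-raw-insn`; the `sample` list above is invented purely to exercise the function.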
668
669 ## gcc register allocation
670
671 FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about
672 fixed registers (assigned to special purposes) and register allocation
673 order:
674
675 Special-purpose registers on ppc are:
676
677 r0: constant zero/throw-away
678 r1: stack pointer
679 r2: thread-local storage pointer in 32-bit mode
680 r2: non-minimal TOC register
681 r10: EH return stack adjust register
682 r11: static chain pointer
683 r13: thread-local storage pointer in 64-bit mode
684 r30: minimal-TOC/-fPIC/-fpic base register
685 r31: frame pointer
686 lr: return address register
687
688 the register allocation order in GCC (i.e., it takes the earliest
689 available register that fits the constraints) is:
690
691 We allocate in the following order:
692
693 fp0 (not saved or used for anything)
694 fp13 - fp2 (not saved; incoming fp arg registers)
695 fp1 (not saved; return value)
696 fp31 - fp14 (saved; order given to save least number)
697 cr7, cr5 (not saved or special)
698 cr6 (not saved, but used for vector operations)
699 cr1 (not saved, but used for FP operations)
700 cr0 (not saved, but used for arithmetic operations)
701 cr4, cr3, cr2 (saved)
702 r9 (not saved; best for TImode)
703 r10, r8-r4 (not saved; highest first for less conflict with params)
704 r3 (not saved; return value register)
705 r11 (not saved; later alloc to help shrink-wrap)
706 r0 (not saved; cannot be base reg)
707 r31 - r13 (saved; order given to save least number)
708 r12 (not saved; if used for DImode or DFmode would use r13)
709 ctr (not saved; when we have the choice ctr is better)
710 lr (saved)
711 r1, r2, ap, ca (fixed)
712 v0 - v1 (not saved or used for anything)
713 v13 - v3 (not saved; incoming vector arg registers)
714 v2 (not saved; incoming vector arg reg; return value)
715 v19 - v14 (not saved or used for anything)
716 v31 - v20 (saved; order given to save least number)
717 vrsave, vscr (fixed)
718 sfp (fixed)
719
720 ## Comparison to VLE
721
722 VLE was a means to reduce executable size through three interleaved methods:
723
724 * (1) invention of 16 bit encodings (of exactly 16 bit in length)
725 * (2) invention of 16+16 bit encodings (a 16 bit instruction format but with
726 an *additional* 16 bit immediate "tacked on" to the end, actually
727 making a 32-bit instruction format)
728 * (3) seamless and transparent embedding and intermingling of the
729 above in amongst arbitrary v2.06/7 BE 32 bit instruction sequences,
730 with no additional state,
731 including when the PC was not aligned on a 4-byte boundary.
732
Whilst (1) and (3) make perfect sense, (2) makes no sense at all given that, as inspection of "ori" and others shows, D-Form 16 bit immediates are the "norm" for v2.06/7 and v3.0B standard instructions. (2) in effect **is** a 32 bit instruction. (2) **is not** a 16 bit instruction.
734
735 *Why "reinvent" an encoding that is 32 bit, when there already exists a 32 bit encoding that does the exact same job?*
736
737 Consequently, we do **not** envisage a scenario where (2) would ever be implemented, nor in the future would this Compressed Encoding be extended beyond 16 bit. Compressed is Compressed and is **by definition** limited to precisely - and only - 16 bit.
738
739 The additional reason why that is the case is because VLE is exceptionally complex to implement. In a single-issue, low clock rate "Embedded" environment for which VLE was originally designed, VLE was perfectly well matched.
740
741 However this Compressed Encoding is designed for High performance multi-issue systems *as well* as Embedded scenarios, and consequently, the complexity of "deep packet inspection" down into the depths of a 16 bit sequence in order to ascertain if it might not be 16 bit after all, is wholly unacceptable.
742
By eliminating such 16+16 (actually, 32bit conflation) tricks outlined in (2), Compressed is *specifically* designed to fit into a very small FSM, suitable for multi-issue, that in no way requires "deep-dive" analysis. Yet, despite OpenPOWER never having been designed with 16 bit encodings in mind, Compressed is still suitable for retro-fitting onto it.
744
745 ## Compressed Decoder Phases
746
747 Phase 1 (stage 1 of a 2-stage pipelined decoder) is defined as the minimum necessary FSM required to determine instruction length and mode. This is implemented with the absolute bare minimum of gates and is based on the 6 encodings involving N, M and EXTNNN (see table, below)
748
749 Phase 2 (stage 2 of a 2-stage pipelined decoder) is defined as the "full decoder" that includes taking into account the length and mode from Phase 1. Given a 2-stage pipelined decoder it is categorically **impossible** for Phase 2 to go backwards in time and affect the decisions made in Phase 1.
750
751 These two phases are specifically designed to take multi-issue execution into account. Phase 1 is intended to be part of an O(log N) algorithm that can use a form of carry-lookahead propagation. Phase 2 is intended to be on a 2nd pipelined clock cycle, comprising a separate suite of independent local-state-only parallel pipelines that do not require any inter-communication of any kind.
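One way to read the "carry-lookahead" remark is as a parallel prefix scan over per-halfword transition functions. The following Python sketch illustrates that idea only; it is not the project's implementation. It takes the three Phase 1 inputs (extc_id, N, M, defined later in this section) as already extracted for each 16-bit halfword. The state names `V`, `F`, `V2`, `C`, `O`, `O2`, the helper names, and the assumption that a one-off v3.0B instruction is always 32 bits and returns to 16bit mode are all hypothetical choices made for the sketch.

```python
# Hypothetical states: what the *next* halfword is expected to be.
V = 0    # start of an insn in v3.0B context (may be a 10bit tag)
F = 1    # first half of a mandatory 32-bit v3.0B insn
V2 = 2   # second half of a 32-bit v3.0B insn, then back to V
C = 3    # a 16bit Compressed insn
O = 4    # first half of the single v3.0B insn after 16bit (N=1, M=0)
O2 = 5   # second half of that insn, then back to C (assumption)


def transition(extc_id, n, m):
    """Map each state to its successor after consuming one halfword."""
    if extc_id:
        after_v = C if m else F      # 10bit insn: M picks what follows
    else:
        after_v = V2                 # ordinary 32-bit v3.0B insn
    if m:
        after_c = C                  # stay in 16bit mode
    else:
        after_c = O if n else F      # leave 16bit mode, maybe for one insn
    return (after_v, V2, V, after_c, O2, C)


def compose(f, g):
    """Transition function equivalent to applying f first, then g."""
    return tuple(g[s] for s in f)


def prefix_states(halfwords, start=V):
    """State in force *before* each halfword, via a log-depth scan."""
    if not halfwords:
        return []
    fns = [transition(*h) for h in halfwords]
    step = 1
    while step < len(fns):
        # Hillis-Steele inclusive scan: each round doubles the span covered.
        fns = [fns[i] if i < step else compose(fns[i - step], fns[i])
               for i in range(len(fns))]
        step *= 2
    return [start] + [f[start] for f in fns[:-1]]


def sequential_states(halfwords, start=V):
    """Reference: the same answer computed one halfword at a time."""
    states, state = [], start
    for h in halfwords:
        states.append(state)
        state = transition(*h)[state]
    return states
```

A halfword begins a new instruction whenever its state is not `V2` or `O2`. Because composition of transition functions is associative, the scan needs only about log2(N) rounds for N halfwords, which is the property the text asks of Phase 1.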
752
Table: Reminder of the 6 16-bit encodings (in the first two rows EXT000/1 occupies bits 0-4):

| 0 | 1234 | 567 8 | 9abcde | f | explanation
| - | ---- | ------ | ------ | - | -----------
| EXT000/1 (bits 0-4) | | Cmaj.m | fields | 0 | 10bit then v3.0B
| EXT000/1 (bits 0-4) | | Cmaj.m | fields | 1 | 10bit then 16bit
| 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
| 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
| 1 | flds | Cmaj.m | fields | 0 | 16b, 1x v3.0B, 16b
| 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit
763
764 ### Phase 1
765
766 The Phase 1 length/mode identification takes into account only 3 pieces of information:
767
768 * extc_id: insn[0:4] == EXTNNN (Compressed)
769 * N: insn[0]
770 * M: insn[15]
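
As a minimal sketch of how these three inputs could be pulled out of one 16-bit halfword, the Python below uses OpenPOWER's MSB0 numbering (bit 0 is the most significant bit). The helper names are hypothetical, and the test `insn[0:4] == 0b00000` assumes that EXTNNN means EXT000 or EXT001, whose 6-bit major opcodes share the same top five bits.

```python
def msb0_bit(hw: int, i: int, width: int = 16) -> int:
    """Return bit i of hw using MSB0 numbering (bit 0 is the MSB)."""
    return (hw >> (width - 1 - i)) & 1


def msb0_field(hw: int, start: int, end: int, width: int = 16) -> int:
    """Return bits start..end (inclusive, MSB0 numbering) as an integer."""
    nbits = end - start + 1
    return (hw >> (width - 1 - end)) & ((1 << nbits) - 1)


def phase1_inputs(hw: int):
    """Extract (extc_id, N, M) from one 16-bit halfword."""
    # Assumption: EXT000 and EXT001 both start with five zero bits.
    extc_id = msb0_field(hw, 0, 4) == 0b00000
    n = msb0_bit(hw, 0)    # N: insn[0]
    m = msb0_bit(hw, 15)   # M: insn[15]
    return extc_id, n, m
```

Note that whenever `extc_id` is true, N is necessarily 0, since bit 0 is part of the EXT000/1 tag.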
771
772 The Phase 1 length/mode produces the following lengths/modes:
773
774 * 32 - v3.0B (includes v3.0B followed by 16bit)
775 * 16 - 10bit
776 * 16 - 16bit
777
778 **NOTE THAT FURTHER SUBIDENTIFICATION OF C MODES IS NOT CARRIED OUT AT PHASE 1**. In particular note specifically that 16 bit "immediate mode" is **not** part of the Phase 1 FSM, but is specifically isolated to Phase 2.
779
780 Pseudocode:
781
    # starting point for FSM
    previ.mode = v3.0B

    if previ.mode == v3.0B:
        # previous was v3.0B, look for compressed tag
        if extc_id:
            # found it.  move to 10bit mode
            nexti.length = 16
            nexti.mode = 10bit
        else:
            # nope. stay in v3.0B
            nexti.length = 32
            nexti.mode = v3.0B

    elif previ.mode == 10bit:
        # previous was 10bit, move to v3.0B or 16bit?
        if M == 0:
            # back to v3.0B
            nexti.length = 32
            nexti.mode = v3.0B
        else:
            # otherwise move to 16bit mode
            nexti.length = 16
            nexti.mode = 16bit

    elif previ.mode == 16bit:
        # previous was 16bit, stay there or move?
        if M == 0:
            # back to v3.0B
            nexti.length = 32
            if N == 1:
                # ... but only for 1 insn
                nexti.mode = v3.0B_then_16bit
            else:
                nexti.mode = v3.0B
        else:
            # otherwise stay in 16bit mode
            nexti.length = 16
            nexti.mode = 16bit

    # rest of FSM involving 3.0B to 16bit
    # and back transitions left to implementor
    # (or for someone else to add)
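
The pseudocode can be turned into a small runnable Python function, sketched here under stated assumptions rather than as a reference implementation. The function name and argument names are hypothetical. N and M are taken from the *previous* instruction and extc_id from the *next* one, which is how the table reads. The final branch, returning to 16bit mode after the one-off v3.0B instruction, is an assumption drawn from the "16b, 1x v3.0B, 16b" table row, since the pseudocode leaves that transition to the implementor.

```python
V30B = "v3.0B"
TENBIT = "10bit"
SIXTEENBIT = "16bit"
V30B_THEN_16BIT = "v3.0B_then_16bit"


def phase1_next(prev_mode, prev_n, prev_m, next_extc_id):
    """Return (length, mode) of the next instruction."""
    if prev_mode == V30B:
        # previous was v3.0B: look for the Compressed tag
        if next_extc_id:
            return 16, TENBIT
        return 32, V30B
    if prev_mode == TENBIT:
        # M of the 10bit insn decides what follows it
        if prev_m == 0:
            return 32, V30B
        return 16, SIXTEENBIT
    if prev_mode == SIXTEENBIT:
        if prev_m == 0:
            if prev_n == 1:
                # back to v3.0B, but only for one instruction
                return 32, V30B_THEN_16BIT
            return 32, V30B
        return 16, SIXTEENBIT
    if prev_mode == V30B_THEN_16BIT:
        # Assumption: after the single v3.0B insn, resume 16bit mode.
        return 16, SIXTEENBIT
    raise ValueError(f"unknown mode: {prev_mode!r}")
```

For example, `phase1_next(SIXTEENBIT, 1, 0, False)` yields a 32-bit instruction in `v3.0B_then_16bit` mode, and the call after that returns to 16bit regardless of the bits seen.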
824
825 ### Phase 2: Compressed mode
826
827 At this phase, knowing that the length is 16bit and the mode is either 10b or 16b, further analysis is required to determine if the 16bit.immediate encoding is active, and so on. This is a fully combinatorial block that **at no time** steps outside of the strict bounds already determined by Phase 1.
828
    # true when Cmaj.m (bits 5-8) is 0b001.1
    op_001_1 = insn[5:8] == 0b001.1
    if mode == 10bit:
        decode_10bit(insn)
    elif mode == 16bit:
        if N == 1 and M == 1 and not op_001_1:
            # see immediate opcodes table
            decode_16bit_immed_mode(insn)
        elif op_001_1:
            # see CR and System tables
            # (16 bit ones at least)
            decode_16bit_cr_or_sys(insn)
        else:
            decode_16bit_nonimmed_mode(insn)
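
The selection step can be sketched in Python as a pure function of the Phase 1 mode and the 16 instruction bits. This follows one possible reading of the pseudocode, namely that immediate mode applies when N=1, M=1 and Cmaj.m is not 0b001.1, and that Cmaj.m = 0b001.1 selects the CR/System group. The function name is hypothetical and it returns the name of the chosen decoder rather than calling it.

```python
CMAJ_M_001_1 = 0b0011  # Cmaj.m = 0b001.1, bits 5-8 in MSB0 numbering


def phase2_select(mode: str, hw: int) -> str:
    """Name the Phase 2 decoder for a 16-bit halfword (a sketch)."""
    n = (hw >> 15) & 1        # N: insn[0], the most significant bit
    m = hw & 1                # M: insn[15], the least significant bit
    cmaj_m = (hw >> 7) & 0xF  # Cmaj.m: insn[5:8]
    if mode == "10bit":
        return "decode_10bit"
    if mode == "16bit":
        if cmaj_m == CMAJ_M_001_1:
            return "decode_16bit_cr_or_sys"
        if n == 1 and m == 1:
            return "decode_16bit_immed_mode"
        return "decode_16bit_nonimmed_mode"
    raise ValueError(f"not a Compressed mode: {mode!r}")
```

The function reads only `mode` and `hw` and never alters either, mirroring the requirement that Phase 2 stays within the bounds set by Phase 1.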
842
From this point onwards each of the decode_xx functions performs straightforward combinatorial decoding of the 16 bits of "insn". In some cases this involves further analysis of bit 1; in some cases (Cmaj.m = 0b010.1) even further deep-dive decoding is required (CR ops). *All* of it is entirely combinatorial and at **no time** involves changing of, or interaction with, or disruption of, the Phase 1 determination of Length+Mode (that has *already taken place* in an earlier decoding pipeline time-schedule).
844
845 ### Phase 2: v3.0B mode
846
847 Standard v3.0B decoders are deployed. Absolutely no interaction occurs with any 16 bit decoders or state. Absolutely no interaction with the earlier Phase 1 decoding occurs. Absolutely no interaction occurs whatsoever (assuming an implementation that does not perform macro-op fusion) between other multi-issued v3.0B instructions being decoded in parallel at this time.
848
849 ## Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode
850
851 [[demo]]
852
853 ### Efficient Decoding Algorithm
854
855 [[decoding]]