# 16 bit Compressed

Similar to VLE (but without immediate-prefixing), this encoding is designed
to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR
is recommended). Note that Compressed is *incompatible* with OpenPOWER
v3.1B "prefixing" because it uses (requires) both EXT000 and EXT001.
Hypothetically it could be made to use anything other than EXT001, with
some inconvenience (extra gates). The incompatibility is "fixed" by
swapping out of "Compressed" Mode and back into "Normal" (v3.1B) Mode,
at runtime, as needed.

Although initially intended to be augmented by Simple-V Prefixing (to
add Vector context, width overrides such as IEEE754 FP16, and
predication) without putting pressure on I-Cache power or size, this
Compressed Encoding is not critically dependent *on* SV Prefixing, and
may be used stand-alone.

See:

* <https://bugs.libre-soc.org/show_bug.cgi?id=238>
* <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding
* <http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html>

This one is a conundrum. OpenPOWER ISA was never designed with 16
bit in mind. VLE was added 10 years ago but only by way of marking
an entire 64k page as "VLE". With VLE not maintained it is not
fully compatible with current PowerISA.

Here, in order to embed 16 bit into a predominantly 32 bit stream,
using an entire 16 bits just to switch into Compressed mode is itself
a significant overhead. The situation is made worse by OpenPOWER ISA
being fundamentally designed with 6 bits uniformly taking up Major
Opcode space, leaving only 10 bits to allocate to actual instructions.

Contrast this with RVC, which takes 3 out of 4 combinations of the first 2
bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and
uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing
standard 32 bit and 16 bit to intermingle cleanly. To achieve the same
thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes, which
is clearly impractical: other schemes need to be devised.

In addition we would like to add SV-C32, which is a Vectorised version
of 16 bit Compressed, and ideally have a variant that adds the 27-bit
prefix format from SV-P64 as well.

Potential ways to reduce pressure on the 16 bit space are:

* To use more than one v3.0B Major Opcode, preferably an odd-even
  contiguous pair
* To provide "paging". This involves bank-switching to alternative
  optimised encodings for specific workloads
* To enter "16 bit mode" for durations specified at the start
* To reserve one bit of every 16 bit instruction to indicate that the
  16 bit mode is to continue to be sustained

This latter would be useful in the Vector context to have an alternative
meaning: as the bit which determines whether the instruction is 11-bit
prefixed or 27-bit prefixed:

     0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | 11 bit vector prefix |
    |16 bit opcode   alt vec. mode ^ |
    | extra vector prefix if alt set |

Using a major opcode to enter 16 bit mode leaves 11 bits for which a
use has to be found:

     0 1 2 3 4 5 6 7 8 9 a b c d e f |
    |major op | what to do here    1 |
    |16 bit     stay in 16bit mode 1 |
    |16 bit     stay in 16bit mode 1 |
    |16 bit        exit 16bit mode 0 |

One possibility is that the 11 bits are used for bank selection,
with some room for additional context such as altering the registers
used for the 16 bit operations (bank selection of which scalar regs).
However the downside is that short sequences of Compressed instructions
become penalised by the fixed overhead. Even a single 16 bit instruction
requires a 16 bit overhead to "gain access" to 16 bit "mode", making
the exercise pointless.

An alternative is to use the first 11 bits for only the most commonly
used instructions. That being the case, one of those 11 bits could
be dedicated to saying if 16 bit mode is to be continued, at which
point *all* 16 bits can be used for Compressed. 10 bits remain for
actual opcodes, which is ridiculously tight; however, the opportunity to
subsequently use all 16 bits is worth it.

The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:

    |0 1 2 3 4 5 6 7 8 9 a b c d e f|
    |major op..0|  LO Half C space  |
    |major op..1|  HI Half C space  |
    |N N N N N|<--11 bits C space-->|

If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this
saves gates at a critical part of the decode phase.

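As a rough illustration of why a contiguous pair helps, the sketch below recognises either opcode of the pair with a single shift-and-compare on the shared NNNNN bits, then returns the 11 bits of C space. The names `c_space` and `major_opcode`, the default `nnnnn=0b00000` (the EXT000/EXT001 pair used elsewhere on this page) and the MSB0 bit numbering taken from the diagram are assumptions made for illustration only.

```python
def major_opcode(halfword):
    """6-bit v3.0B Major Opcode: bits 0-5 in MSB0 numbering."""
    return (halfword >> 10) & 0x3F


def c_space(halfword, nnnnn=0b00000):
    """Return the 11-bit C-space payload (bits 5-f), or None when the
    halfword is not in the Major Opcode pair that shares `nnnnn`.

    `nnnnn` defaults to the EXT000/EXT001 pair (an assumption here).
    """
    if not 0 <= halfword < (1 << 16):
        raise ValueError("expected a 16-bit value")
    if (halfword >> 11) != nnnnn:      # one 5-bit compare covers both opcodes
        return None
    return halfword & 0x7FF            # bit 5 selects the LO or HI half


print(c_space(0x0400), major_opcode(0x0400))
```

With a non-contiguous pair the compare would need to look at all 6 bits twice, which is the extra gate cost the text refers to.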
## Comparison to VLE

VLE was a means to reduce executable size through three interleaved methods:

* (1) invention of 16 bit encodings (of exactly 16 bit in length)
* (2) invention of 16+16 bit encodings (a 16 bit instruction format but with
  an *additional* 16 bit immediate "tacked on" to the end, actually
  making a 32-bit instruction format)
* (3) seamless and transparent embedding and intermingling of the
  above in amongst arbitrary v2.06/7 BE 32 bit instruction sequences,
  with no additional state,
  including when the PC was not aligned on a 4-byte boundary.

Whilst (1) and (3) make perfect sense, (2) makes no sense at all given that, as inspection of "ori" and others shows, D-Form 16 bit immediates are the "norm" for v2.06/7 and v3.0B standard instructions. (2) in effect **is** a 32 bit instruction. (2) **is not** a 16 bit instruction.

*Why "reinvent" an encoding that is 32 bit, when there already exists a 32 bit encoding that does the exact same job?*

Consequently, we do **not** envisage a scenario where (2) would ever be implemented, nor in the future would this Compressed Encoding be extended beyond 16 bit. Compressed is Compressed and is **by definition** limited to precisely - and only - 16 bit.

An additional reason is that VLE is exceptionally complex to implement. In a single-issue, low clock rate "Embedded" environment for which VLE was originally designed, VLE was perfectly well matched.

However this Compressed Encoding is designed for High performance multi-issue systems *as well* as Embedded scenarios, and consequently, the complexity of "deep packet inspection" down into the depths of a 16 bit sequence in order to ascertain if it might not be 16 bit after all, is wholly unacceptable.

By eliminating such 16+16 (actually, 32bit conflation) tricks outlined in (2), Compressed is *specifically* designed to fit into a very small FSM, suitable for multi-issue, that in no way requires "deep-dive" analysis. Yet, despite OpenPOWER never having been designed with 16 bit encodings in mind, Compressed is still suitable for retro-fitting onto it.

## ABI considerations

Unlike RISC-V RVC, the above "context" encodings require state, to be stored
in the PCR, MSR, or a dedicated SPR. These bits (just like LE/BE 32bit
mode and the IEEE754 FPSCR mode) all require taking that context into
consideration.

In particular it is critically important to recognise that context (in
general) is an implicit part of the ABI implemented for example by glibc6.
Therefore (in specific) Compressed Mode Context **must not** be permitted
to cross into or out of a function call.

Thus it is the mandatory responsibility of the compiler to ensure that
context returns to "v3.0B Standard" prior to entering a function call
(responsibility of caller) and prior to exit from a function call
(responsibility of callee).

Trap Handlers also take responsibility for saving and restoring of
Compressed Mode state, just as they already take responsibility for
other critical state. This makes traps transparent to functions as
far as Compressed Mode Context is concerned, just as traps are already
transparent to functions.

Note however that there are exceptions in a compiler to the otherwise
hard rule that Compressed Mode context not be permitted to cross function
boundaries: inline functions and static functions. Static functions,
if correctly identified as never to be called externally, may, as an
optimisation, disregard standard ABIs, bearing in mind that this will
be fraught (pointers to functions) and not easy to get right.

# Opcode Allocation Ideas

* one bit from the 16-bit mode is used to indicate that standard
  (v3.0B) mode is to be dropped into for only one single instruction
  <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>

## Opcodes exploration (Attempt 1)

Switching between different encoding modes is controlled by M (alone)
in 10-bit mode, and M and N in 16-bit mode.

* M in 10-bit mode if zero indicates that following instructions are
  standard OpenPOWER ISA 32-bit encoded (including, redundantly,
  further 10/16-bit instructions)
* M in 10-bit mode if 1 indicates that following instructions are
  in 16-bit encoding mode

Once in 16-bit mode (values written N first, then M):

* 0b01 (N=0, M=1): stay in 16-bit mode
* 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
* 0b10: leave 16-bit mode for one cycle (return to standard OpenPOWER ISA)
* 0b11: free to be used for something completely different.

The current "top" idea for 0b11 is to use it for a new encoding format
of predominantly "immediates-based" 16-bit instructions (branch-conditional,
addi, mulli etc.)

* The Compressed Major Opcode is in bits 5-7.
* Minor opcode in bit 8.
* In some cases bit 9 is taken as an additional sub-opcode, followed
  by bits 0-4 (for CR operations)
* M+N mode-switching is not available for C-Major.minor 0b001.1
* 10 bit mode may be expanded by 16 bit mode, adding capabilities
  that do not fit in the extremely limited space.

Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit.
16-bit immediate mode remains in 16-bit.

    | 0 | 1234 | 567  8 | 9abcde | f | explanation
    | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
    | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
    | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
    | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
    | 1 | flds | Cmaj.m | fields | 0 | 16b then 1x v3.0B
    | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit

Notes:

* Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
* EXT000 and EXT001 are v3.0B Major Opcodes. The first 5 bits
  are zero, therefore the 6th bit is actually part of Cmaj.
* "10bit then 16bit" means "this instruction is encoded C 10bit
  and the following one in C 16bit"

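The FSM table above can be sketched as three small transition functions. This is only an illustrative model: the state names (`V30B`, `C16`, `V30B_ONCE`) and function names are invented here, the C-Major.minor 0b001.1 exception to M+N switching is ignored, and the behaviour after the "1x v3.0B" instruction (returning to 16-bit mode) is an assumption drawn from the "leave 16-bit mode for one cycle" wording.

```python
V30B = "v3.0B"            # standard 32-bit encoding
C16 = "C16"               # 16-bit Compressed encoding
V30B_ONCE = "v3.0B-once"  # exactly one standard instruction, then C16 again


def after_c10(halfword):
    """Mode following a 10-bit C instruction: M (bit f) alone decides."""
    m = halfword & 1
    return C16 if m else V30B


def after_c16(halfword):
    """Mode following a 16-bit C instruction: N (bit 0) and M (bit f)."""
    n = (halfword >> 15) & 1
    m = halfword & 1
    table = {
        (0, 0): V30B,       # 16bit then v3.0B
        (0, 1): C16,        # 16bit then 16bit
        (1, 0): V30B_ONCE,  # 16b then 1x v3.0B
        (1, 1): C16,        # 16b/imm then 16bit
    }
    return table[(n, m)]


def after_v30b(mode):
    """Mode following a standard 32-bit instruction (assumed behaviour)."""
    return C16 if mode == V30B_ONCE else V30B


print(after_c10(0x0081), after_c16(0x8000), after_v30b(V30B_ONCE))
```

Because every transition depends on at most two fixed bit positions of the current instruction, the next mode is known without inspecting any other field.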
### C Instruction Encoding types

10-bit Opcode formats (all start with v3.0B EXT000 or EXT001
Major Opcodes)

    | 01234 | 567  8 | 9   | a b | c   | d e | f | enc
    | E01   | Cmaj.m | fld1      | fld2      | M | 10b
    | E01   | Cmaj.m | offset                | M | 10b b
    | E01   | 001.1  | S1  | fd1 | S2  | fd2 | M | 10b sub
    | E01   | 111.m  | fld1      | fld2      | M | 10b LDST

16-bit Opcode formats (including 10/16/v3.0B Switching)

    | 0 | 1234 | 567  8 | 9   | a b | c   | d e | f | enc
    | N | immf | Cmaj.m | fld1      | fld2      | M | 16b
    | 1 | immf | Cmaj.m | fld1      | imm       | 1 | 16b imm
    | fd3      | 001.1  | S1  | fd1 | S2  | fd2 | M | 16b sub
    | N | fd4  | 111.m  | fld1      | fld2      | M | 16b LDST

Notes:

* fld1 and fld2 can contain reg numbers, immediates, or opcode
  fields (BO, BI, LK)
* S1 and S2 are further sub-selectors of C 001.1

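A minimal sketch of pulling the plain "10b" format apart, assuming the MSB0 bit numbering used in the tables (bit 0 is the most significant bit of the halfword). The helper names `field` and `decode_10b` are hypothetical; only the bit positions come from the table above.

```python
def field(halfword, first, last):
    """Extract bits first..last inclusive, MSB0 numbering (bit 0 = MSB)."""
    width = last - first + 1
    return (halfword >> (15 - last)) & ((1 << width) - 1)


def decode_10b(halfword):
    """Split a plain '10b' format instruction into its fields."""
    return {
        "cmaj": field(halfword, 5, 7),     # Compressed Major Opcode
        "minor": field(halfword, 8, 8),    # Minor opcode
        "fld1": field(halfword, 9, 11),    # bits 9-b
        "fld2": field(halfword, 12, 14),   # bits c-e
        "M": field(halfword, 15, 15),      # mode-switch bit
    }


# 00000 | 010.0 | 011 | 101 | 1
print(decode_10b(0b00000_010_0_011_101_1))
```

The other formats (b, sub, LDST and the 16-bit variants) reuse the same bit positions with different groupings, so the same `field` helper would apply.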
### Immediate Opcodes

Only available in 16-bit mode, only available when M=1 and N=1
and when Cmaj.min is not 0b001.1.

Instruction counts from objdump on /bin/bash:

     466 extsw r1,r1
     649 stw r1,1(r1)
     691 lwz r1,1(r1)
     705 cmpdi r1,1
     791 cmpwi r1,1
     794 addis r1,r1,1
    1474 std r1,1(r1)
    1846 li r1,1
    2031 mr r1,r1
    2473 addi r1,r1,1
    3012 nop
    3028 ld r1,1(r1)


    | 0 | 1  | 2 | 3 4 |  | 567.8 | 9ab   | cde | f |
    | 1 | 0  | 0 0 0   |  | 001.0 |       | 000 | 1 | TBD
    | 1 | 0  | sh2     |  | 001.0 | RA    | sh  | 1 | sradi.
    | 1 | 1  | 0 0 0   |  | 001.0 |       | 000 | 1 | TBD
    | 1 | 1  | 0 | sh2 |  | 001.0 | RA    | sh  | 1 | srawi.
    | 1 | 1  | 1 |     |  | 001.0 | 000   | imm | 1 | TBD
    | 1 | 1  | 1 | i2  |  | 001.0 | RA!=0 | imm | 1 | addis
    | 1 |              |  | 010.0 | 000   |     | 1 | TBD
    | 1 | i2           |  | 010.0 | RA!=0 | imm | 1 | addi
    | 1 | 0  | i2      |  | 010.1 | RA    | imm | 1 | cmpdi
    | 1 | 1  | i2      |  | 010.1 | RA    | imm | 1 | cmpwi
    | 1 | 0  | i2      |  | 011.0 | RT    | imm | 1 | ldspi
    | 1 | 1  | i2      |  | 011.0 | RT    | imm | 1 | lwspi
    | 1 | 0  | i2      |  | 011.1 | RT    | imm | 1 | stwspi
    | 1 | 1  | i2      |  | 011.1 | RT    | imm | 1 | stdspi
    | 1 | i2 | RA      |  | 100.0 | RT    | imm | 1 | stwi
    | 1 | i2 | RA      |  | 100.1 | RT    | imm | 1 | stdi
    | 1 | i2 | RT      |  | 101.0 | RA    | imm | 1 | ldi
    | 1 | i2 | RT      |  | 101.1 | RA    | imm | 1 | lwi
    | 1 | i2 | RA      |  | 110.0 | RT    | imm | 1 | fsti
    | 1 | i2 | RA      |  | 110.1 | RT    | imm | 1 | fstdi
    | 1 | i2 | RT      |  | 111.0 | RA    | imm | 1 | flwi
    | 1 | i2 | RT      |  | 111.1 | RA    | imm | 1 | fldi

Construction of immediate:

* LD/ST r1 (SP) variants should be offset by -256
  see <https://bugs.libre-soc.org/show_bug.cgi?id=238#c43>
    - SP variants map to e.g. ld RT, imm(r1)
    - SV Prefixing can be used to map r1 to alternate regs
* [1] not the same as v3.0B addis: the shift amount is smaller and actually
  still maps to within the v3.0B addi immediate range.
* addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
* addis is EXTS(i2||imm||000) to give an 11-bit range -1024 to +1023 in
  increments of 8
* all others are EXTS(i2||imm) to give a 7-bit range -128 to +127
  (further for LD/ST due to word/dword-alignment)

Further Notes:

* bc also has an immediate mode, listed separately below in Branch section
* for LD/ST, offset is aligned. 8-byte: i2||imm||0b000, 4-byte: i2||imm||0b00
* SV Prefix over-rides help provide alternative bitwidths for LD/ST
* RA|0: if RA is zero, addi. becomes "li"
    - this only works if RT takes part of opcode
    - mv is also possible by specifying an immediate of zero

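The EXTS(i2||imm) and EXTS(i2||imm||000) constructions listed under "Construction of immediate" can be sketched as below. The helper names `exts` and `concat_imm` are hypothetical, and the field widths are passed in as parameters rather than fixed, since the allocation on this page is still exploratory.

```python
def exts(value, width):
    """Sign-extend a `width`-bit two's-complement value to a Python int."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def concat_imm(i2, i2_width, imm, imm_width, low_zeros=0):
    """Compute EXTS(i2 || imm || 0b0...0) with `low_zeros` trailing zeros.

    i2 supplies the most significant bits, imm the least significant.
    """
    raw = ((i2 << imm_width) | imm) << low_zeros
    return exts(raw, i2_width + imm_width + low_zeros)


# A 4-bit immediate (1 bit of i2, 3 bits of imm) spans -8 to +7.
print(concat_imm(1, 1, 0b000, 3), concat_imm(0, 1, 0b111, 3))
# The same fields with ||0b000 appended give 8-byte-aligned offsets.
print(concat_imm(1, 1, 0b000, 3, low_zeros=3), concat_imm(0, 1, 0b111, 3, low_zeros=3))
```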
### Illegal, nop and attn

Note that illeg is all zeros, including in the 16-bit mode.
Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and
EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit
run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000
is "nop"

    | 16-bit mode |  | 10-bit mode                |
    | 0 | 1 | 234 |  | 567.8 | 9  ab | c   de | f |
    | - | - | --- |  | ----- | ----- | ------ | - |
    | 0 | 0   000 |  | 000.0 | 0  00 | 0   00 | 0 | illeg
    | 0 | 0   000 |  | 000.0 | 0  00 | 0   00 | 1 | nop

16 bit mode only:

    | - | - | --- |  | ----- | ----- | ------ | - |
    | 1 | 0   000 |  | 000.0 | 0  00 | 0   00 | 0 | nop
    | 1 | 1   000 |  | 000.0 | 0  00 | 0   00 | 0 | attn
    | 1 | nonzero |  | 000.0 | 0  00 | 0   00 | 0 | TBD

Notes:

* All-zeros being an illegal instruction is normal for ISAs. Ensuring that
  this remains true at all times i.e. for both 10 bit and 16 bit mode is
  common sense.
* The 10-bit nop (bit 15, M=1) is intended for circumstances
  where alignment to 32-bit before returning to v3.0B is required.
  M=1 being an indication "return to Standard v3.0B Encoding Mode".
* The 16-bit nop (bit 0, N=1) is intended for circumstances where a
  return to Standard v3.0B Encoding is required for one cycle,
  and where alignment to a 32-bit boundary is needed for that cycle.
  Examples of this would be to return to "strict" (non-C) mode
  where the PC must be word-aligned.
* If for any reason multiple 16 bit nops are needed in succession
  the M=1 variant can be used, because each one returns to
  Standard v3.0B Encoding Mode, each time.

In essence the 2 nops are needed due to there being 2 different C forms:
10 and 16 bit.

### Branch

    | 16-bit mode |  | 10-bit mode                 |
    | 0 | 1 | 234 |  | 567.8  | 9  ab | c   de | f |
    | - | - | --- |  | ------ | ----- | ------ | - |
    | N | offs2   |  | 000.LK | offs!=0        | M | b, bl
    | 1 | offs2   |  | 000.LK | BI    | BO1 oo | 1 | bc, bcl
    | N | BO3 BI3 |  | 001.0  | LK BI | BO     | M | bclr, bclrl

16 bit mode:

* bc only available when N,M=0b11
* offs2 extends offset in MSBs
* BI3 extends BI in MSBs to allow selection of full CR
* BO3 extends BO
* bc offset constructed from oo as LSBs and offs2 as MSBs
* bc BI allows selection of all bits from CR0 or CR1
* bc CR check is always active (as if BO0=1) therefore BO1 inverts

10 bit mode:

* illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
* nop also covers part of branch (offs=0,M=0,LK=1)
* bc **not available** in 10-bit mode
* BO[0] enables CR check, BO[1] inverts check
* BI refers to CR0 only (4 bits of)
* no Branch Conditional with immediate
* no Absolute Address
* CTR mode allowed with BO[2] for b only.
* offs is (signed) 2-byte aligned
* all branches are to 2-byte aligned addresses

### LD/ST

Note: for 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)

    | 16-bit mode    |  | 10-bit mode               |
    | 0   | 1  | 234 |  | 567.8 | 9 a b | c d e | f |
    | --- | -- | --- |  | ----- | ----- | ----- | - |
    | RA2 | SZ | RB  |  | 001.1 | 1  RA | 0  RT | M | st
    | RA2 | SZ | RB  |  | 001.1 | 1  RA | 1  RT | M | fst
    | N   | SZ | RT  |  | 111.0 | RA    | RB    | M | ld
    | N   | SZ | RT  |  | 111.1 | RA    | RB    | M | fld

* elwidth overrides can set different widths

16 bit mode:

* SZ=1 is 64 bit, SZ=0 is 32 bit
* RA2 extends RA to 3 bits (MSB)
* RT2 extends RT to 3 bits (MSB)

10 bit mode:

* RA and RB are only 2 bit (0-3)
* for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
* for ST, there is no offset: "st RT, RA(0)"

### Arithmetic

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

    | 16-bit mode |  | 10-bit mode             |
    | 0 | 1 | 234 |  | 567.8 | 9ab | c d e | f |
    | - | - | --- |  | ----- | --- | ----- | - |
    | N | 0 | RT  |  | 010.0 | RB  | RA!=0 | M | add
    | N | 0 | RT  |  | 010.1 | RB  | RA|0  | M | sub.
    | N | 0 | BF  |  | 011.0 | RB  | RA|0  | M | cmpl

Notes:

* sub. and cmpl: default CR target is CR0
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that sub. becomes neg. and cmp becomes cmpi against zero
* RT is implicitly RB: "add RT(=RB), RA, RB"
* Opcode 0b010.0 RA=0 is not missing from the above:
  it is a system-wide instruction, "cbank" (section below)

16 bit mode only:

    | 0 | 1 | 234 |  | 567.8 | 9ab | cde   | f |
    | - | - | --- |  | ----- | --- | ----- | - |
    | N | 1 | RA  |  | 010.0 | RB  | RS    | 0 | sld.
    | N | 1 | RA  |  | 010.1 | RB  | RS!=0 | 0 | srd.
    | N | 1 | RA  |  | 010.1 | RB  | 000   | 0 | srad.
    | N | 1 | BF  |  | 011.0 | RB  | RA|0  | 0 | cmpw

Notes:

* for srad, RS=RA: "srad. RA(=RS), RS, RB"

### Logical

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

    | 16-bit mode |  | 10-bit mode             |
    | 0 | 1 | 234 |  | 567.8 | 9ab | c d e | f |
    | - | - | --- |  | ----- | --- | ----- | - |
    | N | 0 | RT  |  | 100.0 | RB  | RA!=0 | M | and
    | N | 0 | RT  |  | 100.1 | RB  | RA!=0 | M | nand
    | N | 0 | RT  |  | 101.0 | RB  | RA!=0 | M | or
    | N | 0 | RT  |  | 101.1 | RB  | RA!=0 | M | nor/mr
    | N | 0 | RT  |  | 100.0 | RB  | 0 0 0 | M | extsw
    | N | 0 | RT  |  | 100.1 | RB  | 0 0 0 | M | cntlz
    | N | 0 | RT  |  | 101.0 | RB  | 0 0 0 | M | popcnt
    | N | 0 | RT  |  | 101.1 | RB  | 0 0 0 | M | not

16-bit mode only (note that bit 1 == 1):

    | 0 | 1 | 234 |  | 567.8 | 9ab | c d e | f |
    | - | - | --- |  | ----- | --- | ----- | - |
    | N | 1 | RT  |  | 100.0 | RB  | RA!=0 | 0 | TBD
    | N | 1 | RT  |  | 100.1 | RB  | RA!=0 | 0 | TBD
    | N | 1 | RT  |  | 101.0 | RB  | RA!=0 | 0 | xor
    | N | 1 | RT  |  | 101.1 | RB  | RA!=0 | 0 | eqv (xnor)
    | N | 1 | RT  |  | 100.0 | RB  | 0 0 0 | 0 | extsb
    | N | 1 | RT  |  | 100.1 | RB  | 0 0 0 | 0 | cnttz
    | N | 1 | RT  |  | 101.0 | RB  | 0 0 0 | 0 | TBD
    | N | 1 | RT  |  | 101.1 | RB  | 0 0 0 | 0 | extsh

10 bit mode:

* idea: for 10bit mode, nor is actually 'mr' because mr is
  a more common operation. In 16bit however, this encoding
  (Cmaj.min=0b101.1, N=0) is 'nor'
* for (RA|0) when RA=0 the input is a zero immediate,
  meaning that nor becomes not
* cntlz, popcnt, exts **not available** in 10-bit mode
* RT is implicitly RB: "and RT(=RB), RA, RB"

### Floating Point

Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64

* 10-bit, ignore bits 0-4 (used by EXTNNN=Compressed)
* 16-bit: note that bit 1==0 (sub-sub-encoding)

    | 16-bit mode |  | 10-bit mode             |
    | 0 | 1 | 234 |  | 567.8 | 9ab | c d e | f |
    | - | - | --- |  | ----- | --- | ----- | - |
    | N |   | RT  |  | 011.1 | RB  | RA!=0 | M | fsub.
    | N | 0 | RT  |  | 110.0 | RB  | RA!=0 | M | fadd
    | N | 0 | RT  |  | 110.1 | RB  | RA!=0 | M | fmul
    | N | 0 | RT  |  | 011.1 | RB  | 0 0 0 | M | fneg.
    | N | 0 | RT  |  | 110.0 | RB  | 0 0 0 | M |
    | N | 0 | RT  |  | 110.1 | RB  | 0 0 0 | M |

16-bit mode only (note that bit 1 == 1):

    | 0 | 1 | 234 |  | 567.8 | 9ab | c d e | f |
    | - | - | --- |  | ----- | --- | ----- | - |
    | N | 1 | RT  |  | 011.1 | RB  | RA!=0 | 0 |
    | N | 1 | RT  |  | 110.0 | RB  | RA!=0 | 0 |
    | N | 1 | RT  |  | 110.1 | RB  | RA!=0 | 0 | fdiv
    | N | 1 | RT  |  | 011.1 | RB  | 0 0 0 | 0 | fabs.
    | N | 1 | RT  |  | 110.0 | RB  | 0 0 0 | 0 | fmr.
    | N | 1 | RT  |  | 110.1 | RB  | 0 0 0 | 0 |

16 bit only, FP to INT convert (using C 0b001.1 subencoding)

    | 0123 | 4 |  | 567.8 | 9 ab | cde  | f |
    | ---- | - |  | ----- | ---- | ---- | - |
    | 0010 | X |  | 001.1 | 0 RA | Y RT | M | fp2int
    | 0011 | X |  | 001.1 | 0 RA | Y RT | M | int2fp

* X: signed=1, unsigned=0
* Y: FP32=0, FP64=1

10 bit mode:

* fsub. fneg. and fmr. default target is CR1
* fmr. is **not available** in 10-bit mode
* fdiv is **not available** in 10-bit mode

16 bit mode:

* fmr. copies RB to RT (and sets CR1)

### Condition Register

    | 16-bit mode   |  | 10-bit mode            |
    | 0 1 2 3 | 4   |  | 567.8 | 9 ab | cde | f |
    | ------- | --- |  | ----- | ---- | --- | - |
    | 0 0 0 0 | BF2 |  | 001.1 | 0 BF | BFA | M | mcrf
    | 0 0 0 1 | BA2 |  | 001.1 | 0 BA | BB  | M | crnor
    | 0 1 0 0 | BA2 |  | 001.1 | 0 BA | BB  | M | crandc
    | 0 1 1 0 | BA2 |  | 001.1 | 0 BA | BB  | M | crxor
    | 0 1 1 1 | BA2 |  | 001.1 | 0 BA | BB  | M | crnand
    | 1 0 0 0 | BA2 |  | 001.1 | 0 BA | BB  | M | crand
    | 1 0 0 1 | BA2 |  | 001.1 | 0 BA | BB  | M | creqv
    | 1 1 0 1 | BA2 |  | 001.1 | 0 BA | BB  | M | crorc
    | 1 1 1 0 | BA2 |  | 001.1 | 0 BA | BB  | M | cror

10 bit mode:

* mcrf BF is only 2 bits which means the destination is only CR0-CR3
* CR operations: **not available** in 10-bit mode (but mcrf is)

16 bit mode:

* mcrf BF2 extends BF (in MSB) to 3 bits
* CR operations: destination register is same as BA.
* CR operations: only possible on CR0 and CR1

SV (Vector Mode):

* CR operations: greatly extended reach/range (useful for predicates)

### System

cbank: Selection of Compressed-encoding "Bank". Different "banks"
give different meanings to opcodes. Example: CBank=0b001 is heavily
optimised to Audio/Video Encode/Decode. cbank borrows from add's encoding
space (when RA==0)

    | 16-bit mode |  | 10-bit mode             |
    | 0 | 1 2 3 4 |  | 567.8 | 9ab   | cde | f |
    | - | ------- |  | ----- | ----- | --- | - |
    | N | 0 Bank2 |  | 010.0 | CBank | 000 | M | cbank

**not available** in 10-bit mode, **only** in 16-bit mode:

    | 0 1 2 3 | 4 |  | 567.8 | 9 ab | cde  | f |
    | ------- | - |  | ----- | ---- | ---- | - |
    | 1 1 1 1 | 0 |  | 001.1 | 0 00 | RT   | M | mtlr
    | 1 1 1 1 | 0 |  | 001.1 | 0 01 | RT   | M | mtctr
    | 1 1 1 1 | 0 |  | 001.1 | 0 11 | RT   | M | mtcr
    | 1 1 1 1 | 1 |  | 001.1 | 0 00 | RA   | M | mflr
    | 1 1 1 1 | 1 |  | 001.1 | 0 01 | RA   | M | mfctr
    | 1 1 1 1 | 1 |  | 001.1 | 0 11 | RA   | M | mfcr

### Unallocated

    | 0 1 2 3 | 4 |  | 567.8 | 9 ab | cde  | f |
    | ------- | - |  | ----- | ---- | ---- | - |
    | 0 1 0 1 |   |  | 001.1 | 0    |      | M |
    | 1 0 1 0 |   |  | 001.1 | 0    |      | M |
    | 1 0 1 1 |   |  | 001.1 | 0    |      | M |
    | 1 1 0 0 |   |  | 001.1 | 0    |      | M |
    | 1 1 1 1 |   |  | 001.1 | 0 10 |      | M |

## Other ideas (Attempt 2)

### 8-bit mode-switching instructions, odd addresses for C mode

Drop the complexity of the 16-bit encoding further reduced to 10-bit,
and use a single byte instead of two to switch between modes. This
would place compressed (C) mode instructions at odd bytes, so the LSB
of the PC can be used for the processor to tell which mode it is in.

To switch from traditional to compressed mode, the single-byte
instruction would be at the MSByte, that holds the EXT bits. (When we
break up a 32-bit instruction across words, the most significant half
should go in the word with the lower address.)

To switch from compressed mode to traditional mode, the single-byte
instruction would also be at the opcode/format portion, placed in the
lower-address word if split across words, so that the instruction can
be recognized as the mode-switching one without going for its second
byte.

The C-mode nop should be encoded so that its second byte encodes a
switch to compressed mode, if decoded in traditional mode. This
enables such a nop to straddle across a label:

    8-bit first half of nop
    Label:
    8-bit second half of nop AKA switch to compressed mode
    16-bit insns...

so that if traditional code jumps to the word-aligned label (because
traditional branches drop the 2 LSB), it immediately switches to
compressed mode; if we fall-through, we remain in 16-bit mode; and if
we branch to it from compressed mode, whether we jump to the odd or
the even address, we end up in compressed mode as desired.

Tables explaining encoding:

    | byte 0 | byte 1 | byte 2 | byte 3 |
    | v3.0B standard 32 bit instruction |
    | EXT000 | 16 bit          | 16...  |
    | .. bit | 8nop   | v3.0b stand...  |
    | .. ard 32 bit   | EXT000 | 16...  |
    | .. bit | 16 bit          | 8nop   |
    | v3.0B standard 32 bit instruction |


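The label-straddling nop described in this section can be checked with a toy model in which only the PC matters. Everything here is an illustrative assumption: the `LABEL` address, the function names, and the reading of "the odd address" as the byte just after the label.

```python
LABEL = 0x1000  # hypothetical word-aligned label address


def mode_of(pc):
    """Odd PC means compressed (C) mode, even PC means traditional mode."""
    return "compressed" if pc & 1 else "traditional"


def traditional_branch(target):
    """Traditional branches drop the 2 LSBs of the target address."""
    return target & ~0b11


def step_switch_byte(pc):
    """Execute the single-byte mode switch: the PC advances by one byte."""
    return pc + 1


def step_c_insn(pc):
    """Execute one 16-bit compressed instruction."""
    return pc + 2


# The C-mode nop occupies LABEL-1 (first byte) and LABEL (second byte).
arrivals = {
    "traditional branch": step_switch_byte(traditional_branch(LABEL + 1)),
    "fall-through": step_c_insn(LABEL - 1),
    "C branch, even address": step_switch_byte(LABEL),
    "C branch, odd address": LABEL + 1,
}
for how, pc in arrivals.items():
    print(how, hex(pc), mode_of(pc))
```

All four paths arrive at the same odd address, so the 16-bit instructions after the label are always decoded in compressed mode.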
### TODO

* make a preliminary assessment of branch in/out viability
* confirm FSM encoding (is LSB of PC really enough?)
* guesstimate opcode and register allocation (without necessarily doing
  a full encoding)
* write throwaway python program that estimates compression ratio from
  objdump raw parsing
* finally do full opcode allocation
* rerun objdump compression ratio estimates

### Use 2- rather than 3-register opcodes

Successful compact ISAs have used 2- rather than 3-register insns, in
which the same register serves as input and output. Some 20% of
general-purpose 3-register insns already use either input register as
output, without any effort by the compiler to do so.

Repurposing the 3 bits used to encode one of the input registers
in arithmetic, logical and floating-point instructions, and the 2 bits
used to encode the mode of the next two insns, we could make the full
register files available to the opcodes already selected for
compressed mode, with one bit to spare to bring additional opcodes in.

An opcode could be assigned to an instruction that combines and
extends with the subsequent instruction, providing it with a separate
input operand to use rather than the output register, or with
additional range for immediate and offset operands, effectively
forming a 32-bit operation, enabling us to remain in compressed mode
even longer.

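One way the "some 20%" figure quoted in this section could be re-measured on a given binary is sketched below. It assumes instructions have already been split into `(mnemonic, operands)` pairs, and that the first operand of a 3-register instruction is its output. That second assumption does not hold for indexed stores, for example, so the result is only an estimate. The function names are hypothetical.

```python
import re

REG = re.compile(r"^r\d+$")


def is_three_register(operands):
    """True for operand strings such as 'r3,r4,r5' (three GPRs, no immediate)."""
    parts = [p.strip() for p in operands.split(",")]
    return len(parts) == 3 and all(REG.match(p) for p in parts)


def reuses_input(operands):
    """True when the first (assumed output) register is also an input."""
    parts = [p.strip() for p in operands.split(",")]
    return parts[0] in parts[1:]


def reuse_fraction(insns):
    """Fraction of 3-register insns whose output is one of their inputs."""
    three = [ops for _mnemonic, ops in insns if is_three_register(ops)]
    if not three:
        return 0.0
    return sum(reuses_input(ops) for ops in three) / len(three)


sample = [
    ("add", "r3,r3,r4"),
    ("add", "r5,r6,r7"),
    ("addi", "r3,r3,1"),   # has an immediate: not counted
    ("or", "r9,r8,r9"),
    ("subf", "r1,r2,r3"),
]
print(reuse_fraction(sample))
```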
# Analysis techniques and tools

    objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
    s/^[ x0-9A-F]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
    sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
    s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
    sort -n | less

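For the throwaway Python program mentioned in the TODO list, the sed normalisation above could be approximated as follows. This is a sketch, not a drop-in replacement: the regular expressions are transcribed from the pipeline, except that the address pattern here also accepts lowercase hex digits (an assumption about objdump's output), and the function names are hypothetical.

```python
import re
from collections import Counter

# "  1000:   addi   r3,r4,-16"  ->  mnemonic, operands
INSN = re.compile(r"^[ x0-9A-Fa-f]*: *([a-z.]+) *(.*)")


def normalise(line):
    """Collapse registers r1..r31 to r1 and every integer literal to 1.

    Returns None for lines that are not disassembled instructions.
    """
    match = INSN.match(line.replace("\t", " "))
    if match is None:
        return None
    text = f"{match.group(1)} {match.group(2)} "
    text = re.sub(r"([, (])r[1-9][0-9]*", r"\1r1", text)
    text = re.sub(r"([ ,])-*[0-9]+([^0-9])", r"\g<1>1\2", text)
    return text.strip()


def count_insns(lines):
    """Histogram of normalised instructions, like `sort | uniq --count`."""
    counts = Counter()
    for line in lines:
        key = normalise(line)
        if key is not None:
            counts[key] += 1
    return counts


sample = [
    "/bin/bash:     file format elf64-powerpc",
    "  1000:\taddi    r3,r4,-16",
    "  1004:\taddi    r9,r9,8",
    "  1008:\tld      r3,8(r1)",
    "  100c:\tnop",
]
for insn, n in sorted(count_insns(sample).items(), key=lambda kv: kv[1]):
    print(n, insn)
```

Feeding it the lines of `objdump -d --no-show-raw-insn` output (for example via `subprocess.run(..., capture_output=True, text=True).stdout.splitlines()`) would give counts in the same shape as the /bin/bash table earlier on this page.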
# gcc register allocation

FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about
fixed registers (assigned to special purposes) and register allocation
order:

Special-purpose registers on ppc are:

    r0: constant zero/throw-away
    r1: stack pointer
    r2: thread-local storage pointer in 32-bit mode
    r2: non-minimal TOC register
    r10: EH return stack adjust register
    r11: static chain pointer
    r13: thread-local storage pointer in 64-bit mode
    r30: minimal-TOC/-fPIC/-fpic base register
    r31: frame pointer
    lr: return address register

The register allocation order in GCC (i.e., it takes the earliest
available register that fits the constraints) is:

    We allocate in the following order:

    fp0 (not saved or used for anything)
    fp13 - fp2 (not saved; incoming fp arg registers)
    fp1 (not saved; return value)
    fp31 - fp14 (saved; order given to save least number)
    cr7, cr5 (not saved or special)
    cr6 (not saved, but used for vector operations)
    cr1 (not saved, but used for FP operations)
    cr0 (not saved, but used for arithmetic operations)
    cr4, cr3, cr2 (saved)
    r9 (not saved; best for TImode)
    r10, r8-r4 (not saved; highest first for less conflict with params)
    r3 (not saved; return value register)
    r11 (not saved; later alloc to help shrink-wrap)
    r0 (not saved; cannot be base reg)
    r31 - r13 (saved; order given to save least number)
    r12 (not saved; if used for DImode or DFmode would use r13)
    ctr (not saved; when we have the choice ctr is better)
    lr (saved)
    r1, r2, ap, ca (fixed)
    v0 - v1 (not saved or used for anything)
    v13 - v3 (not saved; incoming vector arg registers)
    v2 (not saved; incoming vector arg reg; return value)
    v19 - v14 (not saved or used for anything)
    v31 - v20 (saved; order given to save least number)
    vrsave, vscr (fixed)
    sfp (fixed)

# Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode

[[demo]]

# Efficient Decoding Algorithm

[[decoding]]