1 # 16 bit Compressed
2
3 Similar to VLE (but without immediate-prefixing) this encoding is designed
4 to fit on top of OpenPOWER ISA v3.0B when a "Modeswitch" bit is set (PCR
5 is recommended). Note that Compressed is *mutually exclusive*
6 with OpenPOWER v3.1B "prefixing" because it uses (requires) both EXT000
7 and EXT001. Hypothetically it could be made to use anything other than
8 EXT001, with some inconvenience (extra gates). The incompatibility is
9 "fixed" by swapping out of "Compressed" Mode and back into "Normal"
10 (v3.1B) Mode, at runtime, as needed.
11
12 Although initially intended to be augmented by Simple-V Prefixing (to
13 add Vector context, width overrides, e.g. IEEE754 FP16, and predication) without putting pressure on I-Cache power
14 or size, this Compressed Encoding is not critically dependent
15 *on* SV Prefixing, and may be used stand-alone.
16
17 See:
18
19 * <https://bugs.libre-soc.org/show_bug.cgi?id=238>
20 * <https://ftp.libre-soc.org/VLE_314-68105.pdf> VLE Encoding
21 * <http://lists.mailinglist.openpowerfoundation.org/pipermail/openpower-hdl-cores/2020-November/000210.html>
22
23 This one is a conundrum. OpenPOWER ISA was never designed with 16
24 bit in mind. VLE was added 10 years ago but only by way of marking
25 an entire 64k page as "VLE". With VLE not maintained it is not
26 fully compatible with current PowerISA.
27
28 Here, in order to embed 16 bit into a predominantly 32 bit stream,
29 using an entire 16 bits just to switch into Compressed mode
30 is itself a significant overhead. The situation is made worse by
31 OpenPOWER ISA being fundamentally designed with 6 bits uniformly
32 taking up Major Opcode space, leaving only 10 bits to allocate
33 to actual instructions.
34
35 Contrast this with RVC which takes 3 out of 4
36 combinations of the first 2 bits for indicating 16-bit (anything with 0b00 to 0b10 in the LSBs), and uses the 4th (0b11) as a Huffman-style escape-sequence, easily allowing standard 32 bit and 16 bit to intermingle cleanly. To achieve the same thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which is clearly impractical: other schemes need to be devised.
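
For comparison, the RVC length rule described above can be sketched in a few lines of Python. The function name is illustrative only, and the sketch ignores RISC-V encodings longer than 32 bits.

```python
def rvc_insn_length(first_halfword):
    """Length in bytes of a RISC-V instruction, from its first 16 bits.

    Sketch only: covers the 16-bit and 32-bit cases discussed above
    and ignores the longer RISC-V encodings.
    """
    low2 = first_halfword & 0b11
    # 0b11 is the escape sequence for standard 32-bit instructions
    return 4 if low2 == 0b11 else 2


# three of the four 2-bit patterns select a 16-bit instruction
lengths = [rvc_insn_length(x) for x in (0b00, 0b01, 0b10, 0b11)]
```

No state is needed here: the length is known from the instruction itself, which is exactly what the 6-bit Major Opcode layout of OpenPOWER makes expensive.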
37
38 In addition we would like to add SV-C32 which is a Vectorised version
39 of 16 bit Compressed, and ideally have a variant that adds the 27-bit
40 prefix format from SV-P64, as well.
41
42 Potential ways to reduce pressure on the 16 bit space are:
43
44 * To use more than one v3.0B Major Opcode, preferably an odd-even
45 contiguous pair
46 * To provide "paging". This involves bank-switching to alternative optimised encodings for specific workloads
47 * To enter "16 bit mode" for durations specified at the start
48 * To reserve one bit of every 16 bit instruction to indicate that the 16 bit mode is to continue to be sustained
49
50 This latter would be useful in the Vector context to have an alternative
51 meaning: as the bit which determines whether the instruction is 11-bit
52 prefixed or 27-bit prefixed:
53
54 0 1 2 3 4 5 6 7 8 9 a b c d e f |
55 |major op | 11 bit vector prefix|
56 |16 bit opcode alt vec. mode ^ |
57 | extra vector prefix if alt set|
58
59 Using a major opcode to enter 16 bit mode leaves 11 bits for which
60 a use still has to be found:
61
62 0 1 2 3 4 5 6 7 8 9 a b c d e f |
63 |major op | what to do here 1 |
64 |16 bit stay in 16bit mode 1 |
65 |16 bit stay in 16bit mode 1 |
66 |16 bit exit 16bit mode 0 |
67
68 One possibility is that the 11 bits are used for bank selection, with
69 some room for additional context such as altering the registers used
70 for the 16 bit operations (bank selection of which scalar regs).
71 However the downside is that short sequences of Compressed instructions
72 become penalised by the fixed overhead. Even a single 16 bit instruction requires a 16 bit overhead to "gain access" to 16 bit "mode", making the exercise pointless.
73
74 An alternative is to use the first 11 bits for only the most commonly used
75 instructions. That being the case then one of those 11 bits could
76 be dedicated to saying if 16 bit mode is to be continued, at which
77 point *all* 16 bits can be used for Compressed.
78 10 bits remain for actual opcodes, which is ridiculously tight,
79 however the opportunity to subsequently use all 16 bits is worth it.
80
81 The reason for picking 2 contiguous Major v3.0B opcodes is illustrated below:
82
83 |0 1 2 3 4 5 6 7 8 9 a b c d e f|
84 |major op..0| LO Half C space |
85 |major op..1| HI Half C space |
86 |N N N N N|<--11 bits C space-->|
87
88 If NNNNN is the same value (two contiguous Major v3.0B Opcodes) this saves gates at a critical part of the decode phase.
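
A minimal Python sketch of why the contiguous pair is cheap to detect, assuming NNNNN is 0b00000 (EXT000 and EXT001, as stated at the top of this page) and MSB0 bit numbering as in the diagram. The function names are illustrative only.

```python
def is_compressed_prefix(halfword):
    """True if a 16-bit halfword starts with EXT000 or EXT001.

    Bit 0 is the most significant bit (MSB0 numbering), so NNNNN
    is the top five bits of the halfword.
    """
    return (halfword >> 11) == 0  # a single 5-bit compare covers both opcodes


def c_space(halfword):
    """The 11 bits of C space left over once NNNNN has matched."""
    return halfword & 0x7FF


def c_half(halfword):
    """0 for the LO half (EXT000), 1 for the HI half (EXT001)."""
    return (halfword >> 10) & 1
```

With two unrelated Major Opcodes the decoder would need two 6-bit compares instead of one 5-bit compare, and the sixth bit could not simply be passed through as part of the C space.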
89
90 ## ABI considerations
91
92 Unlike RVC, the above "context" encodings require state, to be stored in the PCR, MSR, or a dedicated SPR. These bits (just like LE/BE 32bit mode and the IEEE754 FPSCR mode) all require taking that context into consideration.
93
94 In particular it is critically important to recognise that context (in general) is an implicit part of the ABI implemented for example by glibc6. Therefore (in specific) Compressed Mode Context **must not** be permitted to cross into or out of a function call.
95
96 Thus it is the mandatory responsibility of the compiler to ensure that context returns to "v3.0B Standard" prior to entering a function call (responsibility of caller) and prior to exit from a function call (responsibility of callee).
97
98 Trap Handlers also take responsibility for saving and restoring of Compressed Mode state, just as they already take responsibility for other critical state. This makes traps transparent to functions as far as Compressed Mode Context is concerned, just as traps are already transparent to functions.
99
100 Note however that there are exceptions in a compiler to the otherwise hard rule that Compressed Mode context not be permitted to cross function boundaries: inline functions and static functions. Static functions, if correctly identified as never to be called externally, may, as an optimisation, disregard standard ABIs, bearing in mind that this will be fraught (pointers to functions) and not easy to get right.
101
102 # Opcode Allocation Ideas
103
104 * one bit from the 16-bit mode is used to indicate that standard
105 (v3.0B) mode is to be dropped into for only one single instruction
106 <https://bugs.libre-soc.org/show_bug.cgi?id=238#c2>
107
108 ## Opcodes exploration (Attempt 1)
109
110 Switching between different encoding modes is controlled by M (alone)
111 in 10-bit mode, and M and N in 16-bit mode.
112
113 * M in 10-bit mode if zero indicates that following instructions are
114 standard OpenPOWER ISA 32-bit encoded (including, redundantly,
115 further 10/16-bit instructions)
116 * M in 10-bit mode if 1 indicates that following instructions are
117 in 16-bit encoding mode
118
119 Once in 16-bit mode (values are N followed by M):
120
121 * 0b01 (N=0, M=1): stay in 16-bit mode
122 * 0b00: leave 16-bit mode permanently (return to standard OpenPOWER ISA)
123 * 0b10: leave 16-bit mode for one instruction (return to standard OpenPOWER ISA)
124 * 0b11: free to be used for something completely different.
125
126 The current "top" idea for 0b11 is to use it for a new encoding format
127 of predominantly "immediates-based" 16-bit instructions (branch-conditional,
128 addi, mulli etc.)
129
130 * The Compressed Major Opcode is in bits 5-7.
131 * Minor opcode in bit 8.
132 * In some cases bit 9 is taken as an additional sub-opcode, followed
133 by bits 0-4 (for CR operations)
134 * M+N mode-switching is not available for C-Major.minor 0b001.1
135 * 10 bit mode may be expanded by 16 bit mode, adding capabilities
136 that do not fit in the extremely limited space.
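
The field positions listed above can be sketched as a small Python helper. Bit numbering is MSB0 (bit 0 is the most significant bit of the 16-bit halfword), and the names `field` and `cmaj_min` are illustrative, not part of any existing tool.

```python
def field(halfword, first, last):
    """Extract bits first..last (inclusive, MSB0 numbering) of a 16-bit value."""
    width = last - first + 1
    return (halfword >> (15 - last)) & ((1 << width) - 1)


def cmaj_min(halfword):
    """Return (C major opcode, C minor opcode): bits 5-7 and bit 8."""
    return field(halfword, 5, 7), field(halfword, 8, 8)


# bits 5, 6, 7 and 8 all set, everything else clear
major, minor = cmaj_min(0b0000_0111_1000_0000)
```

The same helper covers N (`field(hw, 0, 0)`) and M (`field(hw, 15, 15)`).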
137
138 Mode-switching FSM showing relationship between v3.0B, C 10bit and C 16bit.
139 16-bit immediate mode remains in 16-bit.
140
141 | 0 | 1234 | 567 8 | 9abcde | f | explanation
142 | EXT000/1 | Cmaj.m | fields | 0 | 10bit then v3.0B
143 | EXT000/1 | Cmaj.m | fields | 1 | 10bit then 16bit
144 | 0 | flds | Cmaj.m | fields | 0 | 16bit then v3.0B
145 | 0 | flds | Cmaj.m | fields | 1 | 16bit then 16bit
146 | 1 | flds | Cmaj.m | fields | 0 | 16b then 1x v3.0B
147 | 1 | flds | Cmaj.m | fields | 1 | 16b/imm then 16bit
148
149 Notes:
150
151 * Cmaj.m is the C major/minor opcode: 3 bits for major, 1 for minor
152 * EXT000 and EXT001 are v3.0B Major Opcodes. The first 5 bits
153 are zero, therefore the 6th bit is actually part of Cmaj.
154 * "10bit then 16bit" means "this instruction is encoded C 10bit
155 and the following one in C 16bit"
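
The table above can be read as a lookup on M (10-bit mode) or on N and M (16-bit mode). The following is a minimal Python sketch of that reading, not a definitive decoder: the state names are made up for illustration, and the C 0b001.1 exception (where M+N switching is not available) is deliberately not modelled.

```python
V30B = "v3.0B"
C16 = "C 16bit"
ONE_V30B = "1x v3.0B, then C 16bit"


def next_after_10bit(halfword):
    """Encoding of the next instruction after a C 10bit instruction."""
    m = halfword & 1  # bit f
    return C16 if m else V30B


def next_after_16bit(halfword):
    """Encoding of the next instruction after a C 16bit instruction."""
    n = (halfword >> 15) & 1  # bit 0
    m = halfword & 1          # bit f
    table = {
        (0, 0): V30B,      # 16bit then v3.0B
        (0, 1): C16,       # 16bit then 16bit
        (1, 0): ONE_V30B,  # 16b then 1x v3.0B
        (1, 1): C16,       # 16b/imm then 16bit
    }
    return table[(n, m)]
```

Only two bits of the current instruction are inspected, so the next-mode decision does not depend on decoding the rest of the instruction.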
156
157 ### C Instruction Encoding types
158
159 10-bit Opcode formats (all start with v3.0B EXT000 or EXT001
160 Major Opcodes)
161
162 | 01234 | 567 8 | 9 | a b | c | d e | f | enc
163 | E01 | Cmaj.m | fld1 | fld2 | M | 10b
164 | E01 | Cmaj.m | offset | M | 10b b
165 | E01 | 001.1 | S1 | fd1 | S2 | fd2 | M | 10b sub
166 | E01 | 111.m | fld1 | fld2 | M | 10b LDST
167
168 16-bit Opcode formats (including 10/16/v3.0B Switching)
169
170 | 0 | 1234 | 567 8 | 9 | a b | c | d e | f | enc
171 | N | immf | Cmaj.m | fld1 | fld2 | M | 16b
172 | 1 | immf | Cmaj.m | fld1 | imm | 1 | 16b imm
173 | fd3 | 001.1 | S1 | fd1 | S2 | fd2 | M | 16b sub
174 | N | fd4 | 111.m | fld1 | fld2 | M | 16b LDST
175
176 Notes:
177
178 * fld1 and fld2 can contain reg numbers, immediates, or opcode
179 fields (BO, BI, LK)
180 * S1 and S2 are further sub-selectors of C 001.1
181
182 ### Immediate Opcodes
183
184 Only available in 16-bit mode, and only when M=1 and N=1
185 and when Cmaj.min is not 0b001.1.
186
187 Instruction counts from objdump on /bin/bash:
188
189 466 extsw r1,r1
190 649 stw r1,1(r1)
191 691 lwz r1,1(r1)
192 705 cmpdi r1,1
193 791 cmpwi r1,1
194 794 addis r1,r1,1
195 1474 std r1,1(r1)
196 1846 li r1,1
197 2031 mr r1,r1
198 2473 addi r1,r1,1
199 3012 nop
200 3028 ld r1,1(r1)
201
202
203 | 0 | 1 | 2 | 3 4 | | 567.8 | 9ab | cde | f |
204 | 1 | 0 | 0 0 0 | | 001.0 | | 000 | 1 | TBD
205 | 1 | 0 | sh2 | | 001.0 | RA | sh | 1 | sradi.
206 | 1 | 1 | 0 0 0 | | 001.0 | | 000 | 1 | TBD
207 | 1 | 1 | 0 | sh2 | | 001.0 | RA | sh | 1 | srawi.
208 | 1 | 1 | 1 | | | 001.0 | | | 1 | TBD
209 | 1 | i2 | RT | | 010.0 | RA|0 | imm | 1 | addi
210 | 1 | 0 | i2 | | 010.1 | RA | imm | 1 | cmpdi
211 | 1 | 1 | i2 | | 010.1 | RA | imm | 1 | cmpwi
212 | 1 | 0 | i2 | | 011.0 | RT | imm | 1 | ldspi
213 | 1 | 1 | i2 | | 011.0 | RT | imm | 1 | lwspi
214 | 1 | 0 | i2 | | 011.1 | RT | imm | 1 | stwspi
215 | 1 | 1 | i2 | | 011.1 | RT | imm | 1 | stdspi
216 | 1 | i2 | RA | | 100.0 | RT | imm | 1 | stwi
217 | 1 | i2 | RA | | 100.1 | RT | imm | 1 | stdi
218 | 1 | i2 | RT | | 101.0 | RA | imm | 1 | ldi
219 | 1 | i2 | RT | | 101.1 | RA | imm | 1 | lwi
220 | 1 | i2 | RA | | 110.0 | RT | imm | 1 | fsti
221 | 1 | i2 | RA | | 110.1 | RT | imm | 1 | fstdi
222 | 1 | i2 | RT | | 111.0 | RA | imm | 1 | flwi
223 | 1 | i2 | RT | | 111.1 | RA | imm | 1 | fldi
224
225 Construction of immediate:
226
227 * LD/ST r1 (SP) variants should be offset by -256
228 see <https://bugs.libre-soc.org/show_bug.cgi?id=238#c43>
229 - SP variants map to e.g ld RT, imm(r1)
230 - SV Prefixing can be used to map r1 to alternate regs
231 * [1] not the same as v3.0B addis: the shift amount is smaller and actually
232 still maps to within the v3.0B addi immediate range.
233 * addi is EXTS(i2||imm) to give a 4-bit range -8 to +7
234 * addis is EXTS(i2||imm||000) to give a 11-bit range -1024 to +1023 in increments of 8
235 * all others are EXTS(i2||imm) to give a 7-bit range -128 to +127
236 (further for LD/ST due to word/dword-alignment)
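
As a worked example of the EXTS construction, here is a short Python sketch for the addi case only, where i2 is a single bit and imm is 3 bits. The helper names are illustrative.

```python
def exts(value, bits):
    """Sign-extend the low `bits` bits of value (two's complement)."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def addi_immediate(i2, imm):
    """addi: EXTS(i2 || imm), with i2 as the MSB of a 4-bit value."""
    return exts((i2 << 3) | imm, 4)


# every encodable addi immediate, in encoding order
all_values = [addi_immediate(i2, imm) for i2 in (0, 1) for imm in range(8)]
```

Because i2 is concatenated on the left it acts as the sign bit, which is why the range comes out as -8 to +7.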
237
238 Further Notes:
239
240 * bc also has an immediate mode, listed separately below in Branch section
241 * for LD/ST, offset is aligned. 8-byte: i2||imm||0b000, 4-byte: i2||imm||0b00
242 * SV Prefix over-rides help provide alternative bitwidths for LD/ST
243 * RA|0 if RA is zero, addi. becomes "li"
244 - this only works if RT takes part of opcode
245 - mv is also possible by specifying an immediate of zero
246
247 ### Illegal, nop and attn
248
249 Note that illeg is all zeros, including in the 16-bit mode.
250 Given that C is allocated to OpenPOWER ISA Major opcodes EXT000 and
251 EXT001 this ensures that in both 10-bit *and* 16-bit mode, a 16-bit
252 run of all zeros is considered "illegal" whilst 0b0000.0000.1000.0000
253 is "nop"
254
255 | 16-bit mode | | 10-bit mode |
256 | 0 | 1 | 234 | | 567.8 | 9 ab | c de | f |
257 | 0 | 0 000 | | 000.0 | 0 00 | 0 00 | 0 | illeg
258 | 0 | 0 000 | | 000.0 | 0 00 | 0 00 | 1 | nop
259
260 16 bit mode only:
261
262 | 1 | 0 000 | | 000.0 | 0 00 | 0 00 | 0 | nop
263 | 1 | 1 000 | | 000.0 | 0 00 | 0 00 | 0 | attn
264 | 1 | nonzero | | 000.0 | 0 00 | 0 00 | 0 | TBD
265
266 Notes:
267
268 * All-zeros being an illegal instruction is normal for ISAs. Ensuring that
269 this remains true at all times i.e. for both 10 bit and 16 bit mode is
270 common sense.
271 * The 10-bit nop (bit 15, M=1) is intended for circumstances
272 where alignment to 32-bit before returning to v3.0B is required.
273 M=1 being an indication "return to Standard v3.0B Encoding Mode".
274 * The 16-bit nop (bit 0, N=1) is intended for circumstances where a
275 return to Standard v3.0B Encoding is required for one cycle,
276 specifically a cycle where alignment to a 32-bit boundary is needed.
277 Examples of this would be to return to "strict" (non-C) mode
278 where the PC may not be on a non-word-aligned boundary.
279 * If for any reason multiple 16 bit nops are needed in succession
280 the M=1 variant can be used, because each one returns to
281 Standard v3.0B Encoding Mode, each time.
282
283 In essence the 2 nops are needed due to there being 2 different C forms: 10 and 16 bit.
284
285 ### Branch
286
287 | 16-bit mode | | 10-bit mode |
288 | 0 | 1 | 234 | | 567.8 | 9 ab | c de | f |
289 | N | offs2 | | 000.LK | offs!=0 | M | b, bl
290 | 1 | offs2 | | 000.LK | BI | BO1 oo | 1 | bc, bcl
291 | N | BO3 BI3 | | 001.0 | LK BI | BO | M | bclr, bclrl
292
293 16 bit mode:
294
295 * bc only available when N,M=0b11
296 * offs2 extends offset in MSBs
297 * BI3 extends BI in MSBs to allow selection of full CR
298 * BO3 extends BO
299 * bc offset constructed from oo as LSBs and offs2 as MSBs
300 * bc BI allows selection of all bits from CR0 or CR1
301 * bc CR check is always active (as if BO0=1) therefore BO1 inverts
302
303 10 bit mode:
304
305 * illegal (all zeros) covers part of branch (offs=0,M=0,LK=0)
306 * nop also covers part of branch (offs=0,M=0,LK=1)
307 * bc **not available** in 10-bit mode
308 * BO[0] enables CR check, BO[1] inverts check
309 * BI refers to CR0 only (its 4 bits)
310 * no Branch Conditional with immediate
311 * no Absolute Address
312 * CTR mode allowed with BO[2] for b only.
313 * offs is a signed offset, 2-byte aligned
314 * all branch targets are 2-byte aligned
315
316 ### LD/ST
317
318 | 16-bit mode | | 10-bit mode |
319 | 0 | 1 | 2 3 4 | | 567.8 | 9 a b | c d e | f |
320 | RA2 | SZ | RB | | 001.1 | 1 RA | 0 RT | M | st
321 | RA2 | SZ | RB | | 001.1 | 1 RA | 1 RT | M | fst
322 | N | SZ | RT | | 111.0 | RA | RB | M | ld
323 | N | SZ | RT | | 111.1 | RA | RB | M | fld
324
325 * elwidth overrides can set different widths
326
327 16 bit mode:
328
329 * SZ=1 is 64 bit, SZ=0 is 32 bit
330 * RA2 extends RA to 3 bits (MSB)
331 * RT2 extends RT to 3 bits (MSB)
332
333 10 bit mode:
334
335 * RA and RB are only 2 bit (0-3)
336 * for LD, RT is implicitly RB: "ld RT=RB, RA(RB)"
337 * for ST, there is no offset: "st RT, RA(0)"
338
339 ### Arithmetic
340
341 | 16-bit mode | | 10-bit mode |
342 | 0 | 1 | 234 | | 567.8 | 9ab | c d e | f |
343 | N | 0 | RT | | 010.0 | RB | RA!=0 | M | add
344 | N | 0 | RT | | 010.1 | RB | RA|0 | M | sub.
345 | N | 0 | BF | | 011.0 | RB | RA|0 | M | cmpl
346
347 Notes:
348
349 * sub. and cmpl: default CR target is CR0
350 * for (RA|0) when RA=0 the input is a zero immediate,
351 meaning that sub. becomes neg. and cmp becomes cmpi against zero
352 * RT is implicitly RB: "add RT(=RB), RA, RB"
353 * Opcode 0b010.0 RA=0 is not missing from the above:
354 it is a system-wide instruction, "cbank" (section below)
355
356 16 bit mode only:
357
358 | 0 | 1 | 234 | | 567.8 | 9ab | cde | f |
359 | N | 1 | RA | | 010.0 | RB | RS | 0 | sld.
360 | N | 1 | RA | | 010.1 | RB | RS!=0 | 0 | srd.
361 | N | 1 | RA | | 010.1 | RB | 000 | 0 | srad.
362 | N | 1 | BF | | 011.0 | RB | RA|0 | 0 | cmpw
363
364 Notes:
365
366 * for srad, RS=RA: "srad. RA(=RS), RS, RB"
367
368
369 ### Logical
370
371 | 16-bit mode | | 10-bit mode |
372 | 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
373 | N | 0 | RT | | 100.0 | RB | RA!=0 | M | and
374 | N | 0 | RT | | 100.1 | RB | RA!=0 | M | nand
375 | N | 0 | RT | | 101.0 | RB | RA!=0 | M | or
376 | N | 0 | RT | | 101.1 | RB | RA!=0 | M | nor
377 | N | 0 | RT | | 100.0 | RB | 0 0 0 | M | extsw
378 | N | 0 | RT | | 100.1 | RB | 0 0 0 | M | cntlz
379 | N | 0 | RT | | 101.0 | RB | 0 0 0 | M | popcnt
380 | N | 0 | RT | | 101.1 | RB | 0 0 0 | M | not
381
382 16-bit mode only:
383
384 | 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
385 | N | 1 | RT | | 100.0 | RB | RA!=0 | 0 | TBD
386 | N | 1 | RT | | 100.1 | RB | RA!=0 | 0 | TBD
387 | N | 1 | RT | | 101.0 | RB | RA!=0 | 0 | xor
388 | N | 1 | RT | | 101.1 | RB | RA!=0 | 0 | eqv (xnor)
389 | N | 1 | RT | | 100.0 | RB | 0 0 0 | 0 | extsb
390 | N | 1 | RT | | 100.1 | RB | 0 0 0 | 0 | cnttz
391 | N | 1 | RT | | 101.0 | RB | 0 0 0 | 0 | TBD
392 | N | 1 | RT | | 101.1 | RB | 0 0 0 | 0 | extsh
393
394 10 bit mode:
395
396 * for (RA|0) when RA=0 the input is a zero immediate,
397 meaning that nor becomes not
398 * cntlz, popcnt, exts **not available** in 10-bit mode
399 * RT is implicitly RB: "and RT(=RB), RA, RB"
400
401 ### Floating Point
402
403 Note here that elwidth overrides (SV Prefix) can be used to select FP16/32/64
404
405 | 16-bit mode | | 10-bit mode |
406 | 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
407 | N | | RT | | 011.1 | RB | RA!=0 | M | fsub.
408 | N | 0 | RT | | 110.0 | RB | RA!=0 | M | fadd
409 | N | 0 | RT | | 110.1 | RB | RA!=0 | M | fmul
410 | N | 0 | RT | | 011.1 | RB | 0 0 0 | M | fneg.
411 | N | 0 | RT | | 110.0 | RB | 0 0 0 | M |
412 | N | 0 | RT | | 110.1 | RB | 0 0 0 | M |
413
414 16-bit mode only:
415
416 | 0 | 1 | 2 3 4 | | 567.8 | 9ab | c d e | f |
417 | N | 1 | RT | | 011.1 | RB | RA!=0 | 0 |
418 | N | 1 | RT | | 110.0 | RB | RA!=0 | 0 |
419 | N | 1 | RT | | 110.1 | RB | RA!=0 | 0 | fdiv
420 | N | 1 | RT | | 011.1 | RB | 0 0 0 | 0 | fabs.
421 | N | 1 | RT | | 110.0 | RB | 0 0 0 | 0 | fmr.
422 | N | 1 | RT | | 110.1 | RB | 0 0 0 | 0 |
423
424 16 bit only, FP to INT convert (using C 0b001.1 subencoding)
425
426 | 0123 | 4 | | 567.8 | 9 ab | cde | f |
427 | 0010 | X | | 001.1 | 0 RA | Y RT | M | fp2int
428 | 0011 | X | | 001.1 | 0 RA | Y RT | M | int2fp
429
430 * X: signed=1, unsigned=0
431 * Y: FP32=0, FP64=1
432
433 10 bit mode:
434
435 * fsub. fneg. and fmr. default target is CR1
436 * fmr. is **not available** in 10-bit mode
437 * fdiv is **not available** in 10-bit mode
438
439 16 bit mode:
440
441 * fmr. copies RB to RT (and sets CR1)
442
443 ### Condition Register
444
445 | 16-bit mode | | 10-bit mode |
446 | 0 1 2 3 | 4 | | 567.8 | 9 ab | cde | f |
447 | 0 0 0 0 | BF2 | | 001.1 | 0 BF | BFA | M | mcrf
448 | 0 0 0 1 | BA2 | | 001.1 | 0 BA | BB | M | crnor
449 | 0 1 0 0 | BA2 | | 001.1 | 0 BA | BB | M | crandc
450 | 0 1 1 0 | BA2 | | 001.1 | 0 BA | BB | M | crxor
451 | 0 1 1 1 | BA2 | | 001.1 | 0 BA | BB | M | crnand
452 | 1 0 0 0 | BA2 | | 001.1 | 0 BA | BB | M | crand
453 | 1 0 0 1 | BA2 | | 001.1 | 0 BA | BB | M | creqv
454 | 1 1 0 1 | BA2 | | 001.1 | 0 BA | BB | M | crorc
455 | 1 1 1 0 | BA2 | | 001.1 | 0 BA | BB | M | cror
456
457 10 bit mode:
458
459 * mcrf BF is only 2 bits which means the destination is only CR0-CR3
460 * CR operations: **not available** in 10-bit mode (but mcrf is)
461
462 16 bit mode:
463
464 * mcrf BF2 extends BF (in MSB) to 3 bits
465 * CR operations: destination register is same as BA.
466 * CR operations: only possible on CR0 and CR1
467
468 SV (Vector Mode):
469
470 * CR operations: greatly extended reach/range (useful for predicates)
471
472 ### System
473
474 cbank: Selection of Compressed-encoding "Bank". Different "banks"
475 give different meanings to opcodes. Example: CBank=0b001 is heavily
476 optimised for Audio/Video Encode/Decode. cbank borrows from add's encoding
477 space (when RA==0)
478
479 | 16-bit mode | | 10-bit mode |
480 | 0 | 1 2 3 4 | | 567.8 | 9ab | cde | f |
481 | N | 0 Bank2 | | 010.0 | CBank | 000 | M | cbank
482
483 **not available** in 10-bit mode:
484
485 | 0 1 2 3 | 4 | | 567.8 | 9 ab | cde | f |
486 | 1 1 1 1 | 0 | | 001.1 | 0 00 | RT | M | mtlr
487 | 1 1 1 1 | 0 | | 001.1 | 0 01 | RT | M | mtctr
488 | 1 1 1 1 | 0 | | 001.1 | 0 11 | RT | M | mtcr
489 | 1 1 1 1 | 1 | | 001.1 | 0 00 | RA | M | mflr
490 | 1 1 1 1 | 1 | | 001.1 | 0 01 | RA | M | mfctr
491 | 1 1 1 1 | 1 | | 001.1 | 0 11 | RA | M | mfcr
492
493 ### Unallocated
494
495 | 0 1 2 3 | 4 | | 567.8 | 9 ab | cde | f |
496 | 0 1 0 1 | | | 001.1 | 0 | | M |
497 | 1 0 1 0 | | | 001.1 | 0 | | M |
498 | 1 0 1 1 | | | 001.1 | 0 | | M |
499 | 1 1 0 0 | | | 001.1 | 0 | | M |
500 | 1 1 1 1 | | | 001.1 | 0 10 | | M |
501
502 ## Other ideas (Attempt 2)
503
504 ### 8-bit mode-switching instructions, odd addresses for C mode
505
506 Drop the complexity of a 16-bit encoding that is further reduced to 10-bit,
507 and use a single byte instead of two to switch between modes. This
508 would place compressed (C) mode instructions at odd bytes, so the LSB
509 of the PC can be used for the processor to tell which mode it is in.
510
511 To switch from traditional to compressed mode, the single-byte
512 instruction would be at the MSByte, that holds the EXT bits. (When we
513 break up a 32-bit instruction across words, the most significant half
514 should go in the word with the lower address.)
515
516 To switch from compressed mode to traditional mode, the single-byte
517 instruction would also be at the opcode/format portion, placed in the
518 lower-address word if split across words, so that the instruction can
519 be recognized as the mode-switching one without going for its second
520 byte.
521
522 The C-mode nop should be encoded so that its second byte encodes a
523 switch to compressed mode, if decoded in traditional mode. This
524 enables such a nop to straddle across a label:
525
526 8-bit first half of nop
527 Label:
528 8-bit second half of nop AKA switch to compressed mode
529 16-bit insns...
530
531 so that if traditional code jumps to the word-aligned label (because
532 traditional branches drop the 2 LSB), it immediately switches to
533 compressed mode; if we fall-through, we remain in 16-bit mode; and if
534 we branch to it from compressed mode, whether we jump to the odd or
535 the even address, we end up in compressed mode as desired.
536
537 Tables explaining encoding:
538
539 | byte 0 | byte 1 | byte 2 | byte 3 |
540 | v3.0B standard 32 bit instruction |
541 | EXT000 | 16 bit | 16... |
542 | .. bit | 8nop | v3.0b stand... |
543 | .. ard 32 bit | EXT000 | 16... |
544 | .. bit | 16 bit | 8nop |
545 | v3.0B standard 32 bit instruction |
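
The table can be checked with a small Python sketch that walks the same byte layout and derives the mode purely from the LSB of the PC. This is only an illustration of the idea under discussion: the function name and the list of lengths are assumptions, with 4 bytes for a standard instruction, 1 byte for a mode switch (EXT000 or 8nop) and 2 bytes for a compressed instruction.

```python
def trace(lengths, pc=0):
    """Return [(address, mode)] for a run of instructions of the given byte lengths.

    The mode in which each instruction is decoded is taken from the
    LSB of its address alone: odd means compressed (C).
    """
    out = []
    for n in lengths:
        out.append((pc, "C" if pc & 1 else "v3.0B"))
        pc += n
    return out


# same sequence as the table: 32-bit, switch, 16, 16, 8nop, 32-bit, ...
layout = trace([4, 1, 2, 2, 1, 4, 1, 2, 2, 1, 4])
```

In this walk the EXT000 switch byte sits at an even address and is decoded in traditional mode, whilst every 16-bit instruction and every 8nop sits at an odd address. Note that the 32-bit instruction following an 8nop lands on a 2-byte rather than a 4-byte boundary.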
546
547
548 ### TODO
549
550 * make a preliminary assessment of branch in/out viability
551 * confirm FSM encoding (is LSB of PC really enough?)
552 * guesstimate opcode and register allocation (without necessarily doing a full encoding)
553 * write throwaway python program that estimates compression ratio from objdump raw parsing
554 * finally do full opcode allocation
555 * rerun objdump compression ratio estimates
556
557 ### Use 2- rather than 3-register opcodes
558
559 Successful compact ISAs have used 2- rather than 3-register insns, in
560 which the same register serves as input and output. Some 20% of
561 general-purpose 3-register insns already use either input register as
562 output, without any effort by the compiler to do so.
563
564 Repurposing the 3 bits used to encode one of the input registers
565 in arithmetic, logical and floating-point instructions, and the 2 bits
566 used to encode the mode of the next two insns, we could make the full
567 register files available to the opcodes already selected for
568 compressed mode, with one bit to spare to bring additional opcodes in.
569
570 An opcode could be assigned to an instruction that combines and
571 extends with the subsequent instruction, providing it with a separate
572 input operand to use rather than the output register, or with
573 additional range for immediate and offset operands, effectively
574 forming a 32-bit operation, enabling us to remain in compressed mode
575 even longer.
576
577 # Analysis techniques and tools
578
579 objdump -d --no-show-raw-insn /bin/bash | sed 'y/\t/ /;
580 s/^[ x0-9A-F]*: *\([a-z.]\+\) *\(.*\)/\1 \2 /p; d' |
581 sed 's/\([, (]\)r[1-9][0-9]*/\1r1/g;
582 s/\([ ,]\)-*[0-9]\+\([^0-9]\)/\11\2/g' | sort | uniq --count |
583 sort -n | less
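
For further processing it may be convenient to have the same normalisation in Python. The following is a rough equivalent of the sed expressions above, not a byte-for-byte port: it accepts lowercase hex addresses, and the names `normalise` and `histogram` are illustrative. It expects lines as printed by `objdump -d --no-show-raw-insn`.

```python
import re
from collections import Counter

# "  address:  mnemonic operands"
INSN = re.compile(r"^\s*[0-9a-fA-F]+:\s+([a-z.]+)\s*(.*)$")


def normalise(line):
    """Return 'mnemonic operands' with registers and immediates collapsed, or None."""
    match = INSN.match(line.replace("\t", " "))
    if match is None:
        return None  # header, label or blank line
    text = f"{match.group(1)} {match.group(2)} "
    # r1..r31 all become r1 (r0 is left alone, as in the sed version)
    text = re.sub(r"([, (])r[1-9][0-9]*", r"\1r1", text)
    # decimal immediates and offsets all become 1
    text = re.sub(r"([ ,])-*[0-9]+(?=[^0-9])", r"\g<1>1", text)
    return text.strip()


def histogram(lines):
    """Count normalised instructions, like sort | uniq --count."""
    return Counter(n for n in map(normalise, lines) if n is not None)


sample = [
    "/bin/bash:     file format elf64-powerpcle",
    "0000000000030000 <main>:",
    "   30000:\tld      r9,16(r1)",
    "   30004:\taddi    r1,r1,-32",
    "   30008:\tld      r10,8(r3)",
    "   3000c:\tnop",
]
counts = histogram(sample)
```

`counts.most_common()` then gives the entries in descending order of frequency, i.e. the reverse of `sort -n`.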
584
585 # gcc register allocation
586
587 FTR, information extracted from gcc's gcc/config/rs6000/rs6000.h about fixed registers (assigned to special purposes) and register allocation order:
588
589 Special-purpose registers on ppc are:
590
591 r0: constant zero/throw-away
592 r1: stack pointer
593 r2: thread-local storage pointer in 32-bit mode
594 r2: non-minimal TOC register
595 r10: EH return stack adjust register
596 r11: static chain pointer
597 r13: thread-local storage pointer in 64-bit mode
598 r30: minimal-TOC/-fPIC/-fpic base register
599 r31: frame pointer
600 lr: return address register
601
602 the register allocation order in GCC (i.e., it takes the earliest available register that fits the constraints) is:
603
604 We allocate in the following order:
605 fp0 (not saved or used for anything)
606 fp13 - fp2 (not saved; incoming fp arg registers)
607 fp1 (not saved; return value)
608 fp31 - fp14 (saved; order given to save least number)
609 cr7, cr5 (not saved or special)
610 cr6 (not saved, but used for vector operations)
611 cr1 (not saved, but used for FP operations)
612 cr0 (not saved, but used for arithmetic operations)
613 cr4, cr3, cr2 (saved)
614 r9 (not saved; best for TImode)
615 r10, r8-r4 (not saved; highest first for less conflict with params)
616 r3 (not saved; return value register)
617 r11 (not saved; later alloc to help shrink-wrap)
618 r0 (not saved; cannot be base reg)
619 r31 - r13 (saved; order given to save least number)
620 r12 (not saved; if used for DImode or DFmode would use r13)
621 ctr (not saved; when we have the choice ctr is better)
622 lr (saved)
623 r1, r2, ap, ca (fixed)
624 v0 - v1 (not saved or used for anything)
625 v13 - v3 (not saved; incoming vector arg registers)
626 v2 (not saved; incoming vector arg reg; return value)
627 v19 - v14 (not saved or used for anything)
628 v31 - v20 (saved; order given to save least number)
629 vrsave, vscr (fixed)
630 sfp (fixed)
631
632 # Demo of encoding that's backward-compatible with PowerISA v3.1 in both LE and BE mode
633
634 [[demo]]
635
636 # Efficient Decoding Algorithm
637
638 [[decoding]]