# OPF ISA WG External RFC ls005 v1: XLEN

* RFC Author: Luke Kenneth Casson Leighton.
* RFC Contributors/Ideas: Jacob Lifshay, Toshaan Bharvani
* Funded by NLnet under the NGI Zero Entrust EU Horizon Europe Grant 101069594

**URLs**:

* <https://libre-soc.org/openpower/sv/rfc/ls005/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=988>
* <https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=openpower/isa;hb=HEAD>
* <https://git.openpower.foundation/isa/PowerISA/issues/104>

**Severity**: Major

**Status**: New

**Date**: 22 Dec 2022 v2 TODO

**Target** v3.2B

**Books and Section affected**:

```
Everything (in a consistent, regular and systematic fashion)
```

**Summary**

```
Exactly as is already done in RISC-V, convert the entire use of 64-bit hard-coding to "XLEN".
Exactly as is in RISC-V, options then include PowerISA-32, PowerISA-64 and PowerISA-128.
Unlike in RISC-V, the concept of PowerISA-16 and PowerISA-8 is also floated, for Embedded,
AI, Edge, Processing-in-Memory, Distributed Computing and other purposes.
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
Entirely new processors, entirely new markets.
```

**Impact on software**:

```
Massive but regular, consistent, and systematic.
```

**Keywords**:

```
XLEN
```

**Motivation**

The Power ISA is far too massive, making it wholly unsuited for Embedded
markets and adversely impacting its reach and potential. The RISC paradigm
it is based on has gone too far into PackedSIMD (128-bit). Fixing this is
relatively and conceptually straightforward: allow 32-bit and even 16-bit
and 8-bit implementations, and use the opportunity to allow future Scalar
128-bit implementations in the exact same strategic way that RISC-V has RV128.

Register files are redefined to XLEN width but are permitted to "group"
registers together to create 16-bit, 32-bit and 64-bit addresses.
In this way, the limitations that would otherwise restrict the usefulness
of a severely-targeted application-specific processor may be overcome,
making it still possible (at reduced performance) to run
general-purpose applications.
An AI application-specific processor, Processing-In-Memory design or other
specialist design may therefore, for example, focus its balance
of raw computing power heavily onto 8-bit or 16-bit computation, but still
gain the benefit of the Power ISA and everything it brings. Contrast
this with the more "normal" approach of creating heavily-focussed
specialist "AI" Engines incapable of Turing-completeness and the benefits
are clear.

Note 1: SVP64 **requires** this change as a 100% critical dependency.
SIMD back-end ALUs process Vectors of "Elements" at 8, 16 and 32-bit (and
64-bit), read from, processed, and returned to, the standard **Scalar**
Register Files, with byte-level write-enable lines. The proposal is
therefore made as an opportunity for others interested in Scalar ISA
8/16/32-bit (and future 128-bit variants of Scalar Power ISA) to take
**and complete** that work in an incremental fashion, without having
to be faced with a massive bulk and body of work as a prerequisite.

For example, whilst an SVP64 Prefixed `lbz` instruction
(`sv.lbz`) is well-defined and has strict well-defined behaviour,
a pure **Scalar-only** (non-SVP64) over-ridden `lbz` instruction
has not been so well-defined, and would require a Stakeholder interested
in 8/16/32-bit (and future 128-bit) to think through the implications
and incrementally submit further OPF ISA RFCs. With RISC-V **already
having done this type of work** it is not technically difficult: it
just requires another Stakeholder to do it.

Note 2: one alternative to this proposal, as far as SVP64 is concerned,
is to literally duplicate the entirety of Chapters 3 and 4 of Book III,
and to create - and then maintain - multiple identical copies of the
instructions including identical copies of the pseudocode except for
substitution of occurrences of "64" with a "32" variant, "16" variant,
"8" variant (and future "128" variant), and so on. This would add
over 700 additional pages to the Power ISA Specification and it should
be clear that it would become a maintenance nightmare.

Another alternative is to poison and irredeemably damage the Power ISA
(as a powerful and lean RISC ISA) by adding several hundred (close to 1,000)
additional specific 8-bit, 16-bit and 32-bit (and in future 128-bit) Scalar
instructions. Given that the 32-bit Opcode Allocation Space is already
under pressure, such a move would be extremely unwise for that reason alone.

**Changes**

For all pseudocode right across the board in all Scalar operations, replace
hard-coded "64" with "XLEN". **This work is already underway as sponsored
by NLnet in the Libre-SOC Power ISA Pseudocode**. The default is obviously
recommended to be "XLEN=64" in order to create zero disruption.

Definitions of the Register File(s) for GPR and FPR are then changed to be
"XLEN" wide. However, for Embedded purposes (XLEN=32/16/8), an SPR controls
whether (and how many) sequentially-grouped registers are taken together to
create 16-bit, 32-bit and 64-bit addresses (depending on application need).
GPR is obvious, FPR is quirky. SVP64 redefines FP ops (those not ending in "s")
to be "full width" and all ops ending in "s" to be "half of
the full width".

* XLEN=64 keeps FPR "full width" exactly as presently defined, and
  "half width" exactly as presently defined.
* XLEN=32 overrides FPR "full width" operations to
  full BFP32, and "half width" to be "BFP16 stored in a BFP32".
* XLEN=16 redefines FPR "full width" operations to full [IEEE BFP16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) and leaves
  "half width" RESERVED (there is no IEEE version of [FP8](https://web.archive.org/web/20221223085833/https://wccftech.com/nvidia-intel-arm-bet-their-ai-future-on-fp8-whitepaper-for-8-bit-fp-published/)).
* XLEN=8 redefines FPR "full width" operations to [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) and leaves
  "half width" RESERVED.
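
The FPR rules in the list above can be summarised as a small lookup table.
This is only an illustrative Python sketch: `FP_FORMATS` and `fp_format` are
hypothetical names that do not appear in the RFC or in any Libre-SOC code,
and the format labels are copied directly from the bullets.

```python
# Hypothetical summary of the FPR width rules listed above.
# "full" applies to FP ops not ending in "s", "half" to ops ending in "s".
FP_FORMATS = {
    64: {"full": "as presently defined", "half": "as presently defined"},
    32: {"full": "BFP32", "half": "BFP16 stored in a BFP32"},
    16: {"full": "BFP16", "half": "RESERVED"},
    8: {"full": "bfloat16", "half": "RESERVED"},
}


def fp_format(xlen: int, single_suffix: bool) -> str:
    """Return the format label used by an FP op at the given XLEN.

    single_suffix is True for mnemonics ending in "s".
    """
    return FP_FORMATS[xlen]["half" if single_suffix else "full"]
```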

----------------

# Examples

## pseudocode examples demonstrating modification.

before for popcntb:

```
do i = 0 to 7
    n <- 0
    do j = 0 to 7
        if (RS)[(i*8)+j] = 1 then
            n <- n+1
    RA[(i*8):(i*8)+7] <- n
```

after:

```
do i = 0 to ((XLEN/8)-1)
    n <- 0
    do j = 0 to 7
        if (RS)[(i*8)+j] = 1 then
            n <- n+1
    RA[(i*8):(i*8)+7] <- n
```

Here, as the instruction's intent is to count the set bits in each byte, and
RA contains on a per-byte basis a SIMD-style count of each byte's 1s, it
becomes possible to simply process fewer bytes.
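
As a rough illustration of the XLEN-parameterised pseudocode above, the
following Python sketch models a register as a plain integer. The function
name `popcntb` and its `xlen` keyword are illustrative assumptions, not an
existing Libre-SOC API. Because each count is written back to the same byte
position it was read from, the result does not depend on the bit-numbering
convention used to index the bytes.

```python
def popcntb(rs: int, xlen: int = 64) -> int:
    """Per-byte population count over an XLEN-bit register value."""
    assert xlen % 8 == 0, "sketch assumes XLEN is a whole number of bytes"
    rs &= (1 << xlen) - 1              # keep only XLEN bits of the source
    ra = 0
    for i in range(xlen // 8):         # XLEN/8 bytes instead of a fixed 8
        byte = (rs >> (8 * i)) & 0xFF
        count = bin(byte).count("1")   # number of 1 bits in this byte
        ra |= count << (8 * i)         # store the count in the same byte lane
    return ra
```

With `xlen=16` only two bytes are processed, so `popcntb(0xFF01, xlen=16)`
gives `0x0801`.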

Would it be more useful to redefine popcntb in terms of always returning
eight results? For example, should `sv.popcntb/w=16` return eight 2-bit
counts of the number of set bits in each 2-bit group in RS?

## no modification needed, but function changes

For the `addi` instruction there is no apparent change:

```
RT <- (RA|0) + EXTS(SI)
```

However, behind the scenes, RA is XLEN bits wide, therefore EXTS performs an
increase in bitlength not to exactly 64 but to XLEN. Obviously for XLEN=16
there is no sign-extension, and for XLEN=8 truncation of `SI` will occur.
This illustrates that there are subtle quirks involved, requiring some thought.

The reason for keeping as many bits of the Immediate as possible should be clear.
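
The effect of XLEN on `EXTS` can be sketched in Python as follows. The helper
names `exts` and `addi` are hypothetical, registers are modelled as unsigned
integers, and the `(RA|0)` selection is assumed to have already been resolved
by the caller.

```python
def exts(value: int, from_bits: int, xlen: int) -> int:
    """Sign-extend a from_bits-wide field to xlen bits.

    When xlen is smaller than from_bits the result is simply truncated.
    """
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):       # sign bit of the source field is set
        value -= 1 << from_bits        # convert to a negative Python integer
    return value & ((1 << xlen) - 1)   # wrap into an XLEN-bit register


def addi(ra_value: int, si: int, xlen: int = 64) -> int:
    """RT <- (RA|0) + EXTS(SI), with SI being the 16-bit immediate."""
    return (ra_value + exts(si, 16, xlen)) & ((1 << xlen) - 1)
```

At `xlen=16` the immediate passes through unchanged, and at `xlen=8` only its
low eight bits survive: `exts(0x1234, 16, 8)` gives `0x34`.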

## Compare Ranged Byte (cmprb BF,L,RA,RB)

```
src1    <- EXTZ((RA)[XLEN-8:XLEN-1])
src21hi <- EXTZ((RB)[XLEN-32:XLEN-25])
src21lo <- EXTZ((RB)[XLEN-24:XLEN-17])
src22hi <- EXTZ((RB)[XLEN-16:XLEN-9])
src22lo <- EXTZ((RB)[XLEN-8:XLEN-1])
if L=0 then
    in_range <- (src22lo <= src1) & (src1 <= src22hi)
else
    in_range <- (((src21lo <= src1) & (src1 <= src21hi)) |
                 ((src22lo <= src1) & (src1 <= src22hi)))
CR[4*BF+32] <- 0b0
CR[4*BF+33] <- in_range
CR[4*BF+34] <- 0b0
CR[4*BF+35] <- 0b0
```

Compare Ranged Byte takes either one or two ranges from RB as individual bytes,
thus requiring a minimum 16-bit (32-bit when L=1) operand RB.
src1, on the other hand, is only 8 bits long: the first byte of RA
(bits XLEN-8 to XLEN-1, the least-significant byte).

Therefore a little more thought is required. Should this simply be UNDEFINED
behaviour when XLEN=8/16 and L=1? When XLEN=16, L=0 the instruction is still
valid. Would it be costly at the Decoder?
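
The width requirement can be made concrete with a short Python sketch. The
function name `cmprb_in_range` is hypothetical, and raising `ValueError` when
RB is too narrow is only one possible answer to the open question above,
chosen here to make the constraint visible. It is not a decision taken by
this RFC.

```python
def cmprb_in_range(ra: int, rb: int, l_bit: int, xlen: int = 64) -> bool:
    """Compute in_range for cmprb, with registers as unsigned integers."""
    needed = 32 if l_bit else 16       # RB bits consumed by the range bytes
    if xlen < needed:
        # Assumption for this sketch only: reject rather than leave UNDEFINED.
        raise ValueError(f"cmprb with L={l_bit} needs XLEN >= {needed}")
    src1 = ra & 0xFF                   # (RA)[XLEN-8:XLEN-1]
    src22lo = rb & 0xFF                # (RB)[XLEN-8:XLEN-1]
    src22hi = (rb >> 8) & 0xFF         # (RB)[XLEN-16:XLEN-9]
    in_range = src22lo <= src1 <= src22hi
    if l_bit:
        src21lo = (rb >> 16) & 0xFF    # (RB)[XLEN-24:XLEN-17]
        src21hi = (rb >> 24) & 0xFF    # (RB)[XLEN-32:XLEN-25]
        in_range = in_range or (src21lo <= src1 <= src21hi)
    return in_range
```

For instance, with `xlen=16` and `l_bit=0`, an RB value of `0x3930` tests
whether the low byte of RA lies between `0x30` and `0x39` inclusive.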

## Trap Word Immediate

Like FP Single operations, there also exist operations at "half of regfile width"
in the Integer realm. They are discernible by the designation "Word" in their
title, such as "Trap WORD Immediate".

```
a <- EXTS((RA)[XLEN/2:XLEN-1])
if (a < EXTS(SI)) & TO[0] then TRAP
if (a > EXTS(SI)) & TO[1] then TRAP
if (a = EXTS(SI)) & TO[2] then TRAP
if (a <u EXTS(SI)) & TO[3] then TRAP
if (a >u EXTS(SI)) & TO[4] then TRAP
```

Here, EXTS receives **half** of the bits of its input register operand, RA.
Note this is **not** "32 bit because a Word is 32-bit". The definition
"Trap Word Immediate" has to be replaced with "Trap Half-register-width Immediate"
but this is very clumsy.

When XLEN=8, "half register width" is clearly 4 bits, thus the LSB nibble is tested,
but still sign-extended for comparison
against the 16-bit signed immediate.
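
A minimal Python sketch of the half-register-width test follows. The names
`sign_extend` and `twi_traps` are hypothetical. `TO` is passed as a 5-bit
integer whose most significant bit is `TO[0]`, following the MSB0 numbering
of the pseudocode. The width used for the two unsigned comparisons,
`max(XLEN, 16)`, is purely an assumption of this sketch, as the text above
does not specify it.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def twi_traps(ra: int, si: int, to: int, xlen: int = 64) -> bool:
    """Return True if Trap Word Immediate would trap."""
    a = sign_extend(ra, xlen // 2)     # EXTS((RA)[XLEN/2:XLEN-1])
    b = sign_extend(si, 16)            # EXTS(SI), 16-bit signed immediate
    width = max(xlen, 16)              # assumed width for <u and >u
    mask = (1 << width) - 1
    au, bu = a & mask, b & mask
    conds = [a < b, a > b, a == b, au < bu, au > bu]
    # TO[0] is the most significant bit of the 5-bit TO field
    return any(c and (to >> (4 - i)) & 1 for i, c in enumerate(conds))
```

With `xlen=8`, an RA value of `0x0F` has a low nibble of `0b1111`, which is
treated as -1, so `twi_traps(0x0F, 0, 0b10000, xlen=8)` reports a trap on the
signed less-than condition.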

## Extend Sign byte/half/word

These instructions can again be redefined in terms of:

* "Word" meaning "Half of register width"
* "Half-word" meaning "Quarter of register width"
* "Byte" meaning "One-eighth of register width"

And a table results as follows:

```
XLEN=8:
    extsb: 1-bit -> 8-bit sign extension
    extsh: 2-bit -> 8-bit sign extension
    extsw: 4-bit -> 8-bit sign extension
XLEN=16:
    extsb: 2-bit -> 16-bit sign extension
    extsh: 4-bit -> 16-bit sign extension
    extsw: 8-bit -> 16-bit sign extension
XLEN=32:
    extsb: 4-bit -> 32-bit sign extension
    extsh: 8-bit -> 32-bit sign extension
    extsw: 16-bit -> 32-bit sign extension
XLEN=64:
    extsb: 8-bit -> 64-bit sign extension
    extsh: 16-bit -> 64-bit sign extension
    extsw: 32-bit -> 64-bit sign extension
```

If the instructions were kept as presently defined then there
is a loss of functionality and opportunity:

```
XLEN=8: # completely wasted opportunity
    extsb: 8-bit -> 8-bit does nothing
    extsh: 16-bit -> 8-bit truncates
    extsw: 32-bit -> 8-bit truncates
XLEN=16: # wasted 2/3 of encoding
    extsb: 8-bit -> 16-bit sign extension
    extsh: 16-bit -> 16-bit does nothing
    extsw: 32-bit -> 16-bit truncates
XLEN=32: # wasted 1/3 of encoding
    extsb: 8-bit -> 32-bit sign extension
    extsh: 16-bit -> 32-bit sign extension
    extsw: 32-bit -> 32-bit does nothing
XLEN=64: # unchanged (default) behaviour
    extsb: 8-bit -> 64-bit sign extension
    extsh: 16-bit -> 64-bit sign extension
    extsw: 32-bit -> 64-bit sign extension
```

The RTL for `extsb` becomes:

```
in <- (RA)[XLEN-8:XLEN-1]                      # extract first byte
if XLEN = 8  then RT <- in[7] * 8               # 1->8
if XLEN = 16 then RT <- in[6] * 15 || in[7]     # 2->16
if XLEN = 32 then RT <- in[4] * 29 || in[5:7]   # 4->32
if XLEN = 64 then RT <- in[0] * 57 || in[1:7]   # 8->64
```

`extsh` and `extsw` follow similar logic. Interestingly, there is
no loss of functionality compared to keeping `extsb` always as "byte
sign-extending"; ironically, the loss of opportunity *is* to keep
`extsb` the same (extend *byte* regardless of XLEN).
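
The scaled interpretation in the first table reduces to one rule: each
instruction sign-extends a fixed fraction of the register width. A minimal
Python sketch, in which `FRACTION` and `exts_scaled` are hypothetical names
introduced only for illustration:

```python
# Source field width is XLEN divided by this value (see the first table).
FRACTION = {"extsb": 8, "extsh": 4, "extsw": 2}


def exts_scaled(op: str, ra: int, xlen: int = 64) -> int:
    """Sign-extend the low XLEN/8, XLEN/4 or XLEN/2 bits of RA to XLEN."""
    src_bits = xlen // FRACTION[op]
    field = ra & ((1 << src_bits) - 1)   # low-order source field of RA
    if field >> (src_bits - 1):          # top bit of the field is the sign
        field -= 1 << src_bits
    return field & ((1 << xlen) - 1)     # wrap into an XLEN-bit register
```

At `xlen=64` this matches the present definitions, whilst at `xlen=8` the
call `exts_scaled("extsb", 0x01, xlen=8)` replicates a single bit across the
register and gives `0xFF`.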

[[!tag opf_rfc]]

\newpage{}