simple_v_extension/sv_prefix_proposal/discussion.rst

   1 RVC
   2 ===
   3
   4 The comment in the RVC section says that the Opcodes will be evaluated to see which are most useful to provide.
   5
   6 That evaluation takes a huge amount of time and, if the result is not *exactly* RVC, it would require a special decode engine, which costs extra gates as well as development time.
   7
   8 Far better to just embed RVC into the opcode and prefix it. This is in line with the strategic principle behind SV: "No new opcodes, only prefixed augmentation".
   9
  10 Taking an entire major 32-bit opcode (or two) seems logical (RV128 space): I-type, with funct3 specifying the C type page and the 12-bit immediate carrying the operation.
  11
  12 Or, just say "to hell with it": take the entire opcode and stuff C into it, with no regard for R/I/U/S, and do whatever we like.
  13
  14
  15 +----------+------+---------------------+---------------------+-------+--------+
  16 | 15 14 13 |  12  |   11 10 9     8   7 | 6    5    4   3   2 | 1   0 | format |
  17 +----------+------+---------------------+---------------------+-------+--------+
  18 |    funct4       |     rd/rs1          |      rs2            | op    | CR     |
  19 +----------+------+---------------------+---------------------+-------+--------+
  20 |funct3    | imm  |     rd/rs1          |     imm             | op    | CI     |
  21 +----------+------+---------------------+---------------------+-------+--------+
  22 |funct3    |          imm               |      rs2            | op    | CSS    |
  23 +----------+----------------------------+---------+-----------+-------+--------+
  24 |funct3    |              imm                     |  rd'      | op    | CIW    |
  25 +----------+----------------+-----------+---------+-----------+-------+--------+
  26 |funct3    |    imm         | rs1'      | imm     |  rd'      | op    | CL     |
  27 +----------+----------------+-----------+---------+-----------+-------+--------+
  28 |funct3    |    imm         | rs1'      | imm     |  rs2'     | op    | CS     |
  29 +----------+----------------+-----------+---------+-----------+-------+--------+
  30 |       funct6              | rd'/rs1'  | funct2  |  rs2'     | op    | CA     |
  31 +----------+----------------+-----------+---------+-----------+-------+--------+
  32 |funct3    |   offset       |  rs1'     |     offset          | op    | CB     |
  33 +----------+----------------+-----------+---------------------+-------+--------+
  34 |funct3    |                jump target                       | op    | CJ     |
  35 +----------+--------------------------------------------------+-------+--------+
  36
  37 * top 14 bits of RVC to go into "MAJOR OPCODE 0-2" to represent
  38   RVC op[1:0] == 0b00, 0b01 and 0b10.  Therefore,
  39   18 bits remain in the 32-bit opcode space
  40 * 32-bit opcode prefix takes 7 bits, therefore 11 bits remain to fit
  41   an SVPrefix.
  42 * the P48 prefix (bits 17:7) also needs 11 bits, so we have a match.
  43
  44 P48:
  45
  46 +---------------+--------+--------+----------+-----+--------+-------------+------+
  47 | Encoding      | 17     | 16     | 15       | 14  | 13     | 12          | 11:7 |
  48 +---------------+--------+--------+----------+-----+--------+-------------+------+
  49 | P32C-LD-type  | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0]         |
  50 +---------------+--------+--------+----------+-----+--------+-------------+------+
  51 | P32C-ST-type  |vitp7[6]| rs1[5] | rs2[5]   | vs2 | vs1    | vitp7[5:0]         |
  52 +---------------+--------+--------+----------+-----+--------+-------------+------+
  53 | P32C-R-type   | rd[5]  | rs1[5] | rs2[5]   | vs2 | vs1    | vitp6              |
  54 +---------------+--------+--------+----------+-----+--------+--------------------+
  55 | P32C-I-type   | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0]         |
  56 +---------------+--------+--------+----------+-----+--------+--------------------+
  57 | P32C-U-type   | rd[5]  | *Rsvd* | *Rsvd*   | vd  | *Rsvd* | vitp6              |
  58 +---------------+--------+--------+----------+-----+--------+-------------+------+
  59 | P32C-FR-type  | rd[5]  | rs1[5] | rs2[5]   | vs2 | vs1    | *Rsvd*      | vtp5 |
  60 +---------------+--------+--------+----------+-----+--------+-------------+------+
  61 | P32C-FI-type  | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0]         |
  62 +---------------+--------+--------+----------+-----+--------+-------------+------+
  63 | P32C-FR4-type | rd[5]  | rs1[5] | rs2[5]   | vs2 | rs3[5] | vs3 [#fr4]_ | vtp5 |
  64 +---------------+--------+--------+----------+-----+--------+-------------+------+
  65
  66 P32C Prefix:
  67
  68 +---------------+--------+--------+----------+-----+--------+------------+
  69 | Encoding      | 31     | 30     | 29       | 28  | 27     | 26:21      |
  70 +---------------+--------+--------+----------+-----+--------+------------+
  71 | P32C-CL-type  | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0] |
  72 +---------------+--------+--------+----------+-----+--------+------------+
  73 | P32C-CS-type  | rs2[5] | rs1[5] | vitp7[6] | vs2 | vs1    | vitp7[5:0] |
  74 +---------------+--------+--------+----------+-----+--------+------------+
  75 | P32C-CR-type  | rd[5]  | rs1[5] | *Rsvd*   | vd  | vs1    | vitp6      |
  76 +---------------+--------+--------+----------+-----+--------+------------+
  77 | P32C-CI1-type | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0] |
  78 +---------------+--------+--------+----------+-----+--------+------------+
  79 | P32C-CI2-type | rd[5]  | *Rsvd* | *Rsvd*   | vd  | *Rsvd* | vitp6      |
  80 +---------------+--------+--------+----------+-----+--------+------------+
  81 | P32C-CB-type  | *Rsvd* | rs1[5] | vitp7[6] |*Rsv*| vs1    | vitp7[5:0] |
  82 +---------------+--------+--------+----------+-----+--------+------------+
  83 | P32C-CMv-type | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0] |
  84 +---------------+--------+--------+----------+-----+--------+------------+
  85
  86 Mapping P32-* Quadrants 0-2 to CUSTOM OPCODEs 0-2:
  87
  88 +-------------+--------+-----------+----------+
  89 | Encoding    | 31:21  | 20:7      | 6:0      |
  90 +-------------+--------+-----------+----------+
  91 | P32C RVC-Q0 | P32-*  | RVC[15:2] | OPCODE-0 |
  92 +-------------+--------+-----------+----------+
  93 | P32C RVC-Q1 | P32-*  | RVC[15:2] | OPCODE-1 |
  94 +-------------+--------+-----------+----------+
  95 | P32C RVC-Q2 | P32-*  | RVC[15:2] | OPCODE-2 |
  96 +-------------+--------+-----------+----------+
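
The field layout in the mapping table can be sketched in C. This is an illustrative sketch only, not part of the proposal: the function names ``p32c_pack``, ``p32c_rvc`` and ``p32c_prefix`` are hypothetical, and the 7-bit major opcode is passed in as a parameter because the table does not give numeric values for OPCODE-0 to OPCODE-2.

.. code:: C

    #include <stdint.h>

    /* Pack an 11-bit SV prefix, the top 14 bits of an RVC instruction and a
     * 7-bit major opcode into one 32-bit word, following the table above:
     * bits 31:21 = prefix, bits 20:7 = RVC[15:2], bits 6:0 = major opcode.
     * RVC[1:0] (the quadrant) is dropped because the major opcode implies it. */
    static uint32_t p32c_pack(uint32_t prefix11, uint16_t rvc, uint32_t major7)
    {
        uint32_t rvc_hi = ((uint32_t)rvc >> 2) & 0x3fffu;   /* RVC[15:2] */
        return ((prefix11 & 0x7ffu) << 21) | (rvc_hi << 7) | (major7 & 0x7fu);
    }

    /* Recover the original 16-bit RVC instruction, given its quadrant (0..2). */
    static uint16_t p32c_rvc(uint32_t insn, unsigned quadrant)
    {
        return (uint16_t)((((insn >> 7) & 0x3fffu) << 2) | (quadrant & 0x3u));
    }

    /* Recover the 11-bit prefix from bits 31:21. */
    static uint32_t p32c_prefix(uint32_t insn)
    {
        return insn >> 21;
    }

The 14 + 11 + 7 bits add up to exactly 32, which is the "match" noted in the list above: no bit of the 32-bit word is left over.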
  97
  98 Questions
  99 =========
 100
 101 Confirmation needed as to whether subvector extraction can be covered
 102 by twin predication (it probably can; that is one of the many purposes
 103 it is intended for).
 104
 105 Answer:
 106
 107 Yes, it can, but VL needs to be changed for it to work, since predicates
 108 work at the size of a whole subvector instead of an element of that
 109 subvector. To avoid needing to constantly change VL, and since swizzles
 110 are a very common operation, I think we should have a separate instruction
 111 -- a subvector element swizzle instruction::
 112
 113     velswizzle x32, x64, SRCSUBVL=3, DESTSUBVL=4, ELTYPE=u8, elements=[0, 0, 2, 1]
 114
 115 Answer:
 116
 117     > ok, i like that idea - adding to TODO list
 118     > see MV.X_
 119
 120 .. _MV.X: http://libre-riscv.org/simple_v_extension/specification/mv.x/
 121
 122 Example pseudocode:
 123
 124 .. code:: C
 125
 126     // processor state:
 127     uint64_t regs[128];
 128     int VL = 5;
 129
 130     typedef uint8_t ELTYPE;
 131     const int SRCSUBVL = 3;
 132     const int DESTSUBVL = 4;
 133     const int elements[] = {0, 0, 2, 1};
 134     ELTYPE *rd = (ELTYPE *)&regs[32];
 135     ELTYPE *rs1 = (ELTYPE *)&regs[64];
 136     for(int i = 0; i < VL; i++)
 137     {
 138         rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
 139         rd[i * DESTSUBVL + 1] = rs1[i * SRCSUBVL + elements[1]];
 140         rd[i * DESTSUBVL + 2] = rs1[i * SRCSUBVL + elements[2]];
 141         rd[i * DESTSUBVL + 3] = rs1[i * SRCSUBVL + elements[3]];
 142     }
 143
 144 To use the subvector element swizzle instruction to extract a subvector element,
 145 all that needs to be done is to have DESTSUBVL be 1::
 146
 147     // extract element index 2
 148     velswizzle rd, rs1, SRCSUBVL=4, DESTSUBVL=1, ELTYPE=u32, elements=[2]
 149
 150 Example pseudocode:
 151
 152 .. code:: C
 153
 154     // processor state:
 155     uint64_t regs[128];
 156     int VL = 5;
 157
 158     typedef uint32_t ELTYPE;
 159     const int SRCSUBVL = 4;
 160     const int DESTSUBVL = 1;
 161     const int elements[] = {2};
 162     ELTYPE *rd = (ELTYPE *)&regs[...];
 163     ELTYPE *rs1 = (ELTYPE *)&regs[...];
 164     for(int i = 0; i < VL; i++)
 165     {
 166         rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
 167     }
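
Both pseudocode examples follow the same pattern, which can be written as one runnable C function. This is only a sketch: the name ``velswizzle_u8``, its signature, and the use of plain byte arrays in place of the register file are assumptions made for illustration, and ELTYPE is fixed to ``uint8_t``.

.. code:: C

    #include <stddef.h>
    #include <stdint.h>

    /* For each of the vl sub-vectors, build a destination sub-vector of
     * dest_subvl elements by picking source elements according to the
     * selector list. Each elements[j] must be less than src_subvl, and
     * dst and src must not overlap. */
    static void velswizzle_u8(uint8_t *dst, const uint8_t *src, size_t vl,
                              size_t src_subvl, size_t dest_subvl,
                              const unsigned *elements)
    {
        for (size_t i = 0; i < vl; i++)
            for (size_t j = 0; j < dest_subvl; j++)
                dst[i * dest_subvl + j] = src[i * src_subvl + elements[j]];
    }

With a source of ``{10, 11, 12, 20, 21, 22}``, vl = 2, src_subvl = 3, dest_subvl = 4 and selectors ``{0, 0, 2, 1}``, the destination becomes ``{10, 10, 12, 11, 20, 20, 22, 21}``. Setting dest_subvl to 1 gives the extraction case.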
 168
 169 ----
 170
 171 What is SUBVL and how does it work?
 172
 173 Answer:
 174
 175 SUBVL is the instruction field in P48 instructions that specifies
 176 the sub-vector length. The sub-vector length is the number of scalars
 177 that are grouped together and treated like an element by both VL and
 178 predication. This is used to support operations where the elements are
 179 short vectors (2-4 elements) in Vulkan and OpenGL. Those short vectors
 180 are mostly used as mathematical vectors to handle directions, positions,
 181 and colors, rather than as a pure optimization.
 182
 183 For example, when VL is 5::
 184
 185     add x32, x48, x64, SUBVL=3, ELTYPE=u16, PRED=!x9
 186
 187 performs the following operation:
 188
 189 .. code:: C
 190
 191     // processor state:
 192     uint64_t regs[128];
 193     int VL = 5;
 194
 195     // instruction fields:
 196     typedef uint16_t ELTYPE;
 197     const int SUBVL = 3;
 198     ELTYPE *rd = (ELTYPE *)&regs[32];
 199     ELTYPE *rs1 = (ELTYPE *)&regs[48];
 200     ELTYPE *rs2 = (ELTYPE *)&regs[64];
 201     for(int i = 0; i < VL; i++)
 202     {
 203         if((~regs[9] >> i) & 0x1) // inverted predicate: bit i of x9 is clear
 204         {
 205             rd[i * SUBVL + 0] = rs1[i * SUBVL + 0] + rs2[i * SUBVL + 0];
 206             rd[i * SUBVL + 1] = rs1[i * SUBVL + 1] + rs2[i * SUBVL + 1];
 207             rd[i * SUBVL + 2] = rs1[i * SUBVL + 2] + rs2[i * SUBVL + 2];
 208         }
 209     }
 210
 211 ----
 212
 213 SVorig goes to a lot of effort to keep 1 <= VL <= MAXVL, with MAXVL in
 214 the range 1..64, so that both CSRs can be stored internally in only 6 bits.
 215
 216 Thus, CSRRWI, with its 5-bit immediate, can reach 1..32 for VL and MAXVL.
 217
 218 In addition, rather than setting a hardware loop to zero and turning
 219 instructions into NOPs, um, just branch over them: start the first loop
 220 at the end, on the test for the loop variable being zero, a la C
 221 "while do" instead of "do while".
 222
 223 Or, does it not matter that VL only goes up to 31 on a CSRRWI, and that
 224 it only goes to a max of 63 rather than 64?
 225
 226 Answer:
 227
 228 I think supporting SETVL where VL would be set to 0 should be done. That
 229 way, the branch can be put after SETVL, allowing SETVL to execute
 230 earlier and giving more time for VL to propagate to the instruction
 231 decoder (preventing stalling).  I have no problem with having 0 stored to
 232 VL via CSRW resulting in VL=64 (or whatever maximum value is supported
 233 in hardware).
 234
 235 One related idea would be to support VL > XLEN but to only allow unpredicated
 236 instructions when VL > XLEN. This would allow later implementing register
 237 pairs/triplets/etc. as predicates as an extension.
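
The arithmetic behind the question can be made concrete with a small sketch. The helper names are hypothetical, and the third function is only one reading of the suggestion that a CSRW of 0 could mean the maximum VL; none of this is specified behaviour.

.. code:: C

    /* SVorig-style storage: a 6-bit field holds VL-1, so field values
     * 0..63 mean VL = 1..64 and VL = 0 cannot be expressed. */
    static unsigned vl_from_offset_field(unsigned field6)
    {
        return (field6 & 0x3fu) + 1u;
    }

    /* Direct storage: the 6-bit field holds VL itself, so VL = 0..63. */
    static unsigned vl_from_direct_field(unsigned field6)
    {
        return field6 & 0x3fu;
    }

    /* One reading of the answer: a CSRW of 0 selects the hardware maximum
     * (hw_max, e.g. 64); any other value is taken as-is. */
    static unsigned vl_from_csrw(unsigned value, unsigned hw_max)
    {
        return value == 0u ? hw_max : value;
    }

CSRRWI carries a 5-bit immediate (0..31), so it reaches VL = 1..32 under the first convention and VL = 0..31 under the second.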
 238
 239 ----
 240
 241 Is MV.X good enough a substitute for swizzle?
 242
 243 Answer:
 244
 245 no, since the swizzle instruction specifies in the opcode which elements are
 246 used and where they go, so it can run much faster since the execution engine
 247 doesn't need to pessimize. Additionally, swizzles almost always have constant
 248 element selectors. MV.X is meant more as a last-resort instruction that is
 249 better than load/store, but worse than everything else.
 250
 251     > ok, then we'll need a way to do that.  given that it needs to apply
 252     > to, well... everything, basically, i'm tempted to recommend it be
 253     > done as a CSR and/or as (another) table in VBLOCK.
 254     > the reason is, it's just too much to expect to massively duplicate
 255     > literally every single opcode in existence, just to add swizzle
 256     > when there's no room in the opcode space to do so.
 257     > not sure what alternatives there might be.
 258
 259 ----
 260
 261 Is vectorised srcbase ok as a gather/scatter, and an ok substitute for
 262 register stride? 5 dependency registers (reg stride being the 5th)
 263 is quite scary.
 264
 265 ----
 266
 267 Why are integer conversion instructions needed, when the main SV spec
 268 covers them by allowing elwidth to be set on both src and dest regs?
 269
 270 ----
 271
 272 Why are the SETVL rules so complex? What is the reason, how are loops
 273 carried out?
 274
 275 Partial Answer:
 276
 277 The idea is that the compiler knows maxVL at compile time since it allocated the
 278 backing registers, so SETVL has the maxVL as an immediate value. There is no
 279 maxVL CSR needed for just SVPrefix.
 280
 281     > when looking at a loop assembly sequence
 282     > i think you'll find this approach will not work.
 283     > RVV loops, on which SV loops are directly based, need an understanding
 284     > of the use of MIN within the actual SETVL instruction.
 285     > Yes MVL is known at compile time
 286     > however unless MVL is communicated to the hardware, SETVL just
 287     > does not work: it has absolutely no way of knowing when to stop
 288     > processing.  The point being: it's not *MVL* that's the problem
 289     > if MVL is not a CSR, it's *VL* that becomes the problem.
 290     > The only other option which does work is to set a mandatory
 291     > hardcoded MVL baked into the actual hardware.
 292     > That results in loss of flexibility and defeats the purpose of SV.
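
The role of MIN inside SETVL, as the reply describes it, can be sketched as a strip-mined loop in C. This is a software model under stated assumptions, not the SV or RVV definition: ``setvl``, ``vec_add_u32`` and the value of ``MVL`` are hypothetical, and the inner ``for`` loop stands in for a single vector instruction operating on ``vl`` elements.

.. code:: C

    #include <stddef.h>
    #include <stdint.h>

    #define MVL 8  /* hypothetical maximum vector length known to the hardware */

    /* Model of SETVL: the hardware clamps the requested length to MVL. */
    static size_t setvl(size_t requested, size_t mvl)
    {
        return requested < mvl ? requested : mvl;
    }

    /* Add two arrays of n elements, at most MVL elements per iteration. */
    static void vec_add_u32(uint32_t *dst, const uint32_t *a,
                            const uint32_t *b, size_t n)
    {
        while (n > 0) {
            size_t vl = setvl(n, MVL);      /* vl = min(remaining, MVL) */
            for (size_t i = 0; i < vl; i++) /* one vector add of vl elements */
                dst[i] = a[i] + b[i];
            dst += vl;
            a += vl;
            b += vl;
            n -= vl;                        /* the last pass handles the tail */
        }
    }

With n = 19 and MVL = 8 the loop runs with vl = 8, 8 and 3. If ``setvl`` had no MVL to clamp against, the first pass would request all 19 elements, which is the failure the reply points out.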
 293
 294 ----
 295
 296 With SUBVL (sub-vector length) being both a CSR and also part of the
 297 48/64-bit opcode, how does that work?
 298
 299 Answer:
 300
 301 I think we should just ignore the SUBVL CSR and use the value from the
 302 SUBVL field when executing 48/64-bit instructions. For just SVPrefix,
 303 I would say that the only user-visible CSR needed is VL. This is ignoring
 304 all the state for context-switching and exception handling.
 305
 306     > the consequence of that would be that P48/64 would need
 307     > its own CSR State to track the subelement index.
 308     > or that any exceptions would need to occur on a group
 309     > basis, which is less than ideal,
 310     > and interrupts would have to be stalled.
 311     > interacting with SUBVL and requiring P48/64 to save the
 312     > STATE CSR if needed is a workable compromise that
 313     > does not result in huge CSR proliferation
 314
 315 ----
 316
 317 What are the interaction rules when a 48/64 prefix opcode has a rd/rs
 318 that already has a Vector Context for either predication or a register?
 319
 320 It would perhaps make sense (and for svlen as well) to make 48/64 isolated
 321 and unaffected by VLIW context, with the exception of VL/MVL.
 322
 323 MVL and VL should be modifiable by the 64-bit prefix as they are global
 324 in nature.
 325
 326 Possible solution: svlen and VLtyp are allowed to share the STATE CSR;
 327 however, the programmer becomes responsible for push and pop of state
 328 during use of a sequence of P48 and P64 ops.
 329
 330 ----
 331
 332 Can bit 60 of P64 be put to use (in all but the FR4 case)?
 333
 334
 335
 336 Experiment VLtyp
 337 ================
 338
 339 experiment 1:
 340
 341 +-----------+-------------+--------------+------------+----------------------+
 342 | VLtyp[11] | VLtyp[10:6] | VLtyp[5:3]   | VLtyp[2:0] | comment              |
 343 +-----------+-------------+--------------+------------+----------------------+
 344 | 0         |  00000      | 000          |  000       | no change to VL/MVL  |
 345 +-----------+-------------+--------------+------------+----------------------+
 346 | 0         |  imm        | 000          |  rs'!=0    |                      |
 347 +-----------+-------------+--------------+------------+----------------------+
 348 | 0         |  imm        | rd'!=0       |  000       |                      |
 349 +-----------+-------------+--------------+------------+----------------------+
 350 | 0         |  imm        | rd'!=0       |  rs'!=0    |                      |
 351 +-----------+-------------+--------------+------------+----------------------+
 352 | 1         |  imm        | 000          |  000       |                      |
 353 +-----------+-------------+--------------+------------+----------------------+
 354 | 1         |  imm        | 000          |  rs'!=0    |                      |
 355 +-----------+-------------+--------------+------------+----------------------+
 356 | 1         |  imm        | rd'!=0       | 000        |                      |
 357 +-----------+-------------+--------------+------------+----------------------+
 358 | 1         |  imm        | rd'!=0       |  rs'!=0    |                      |
 359 +-----------+-------------+--------------+------------+----------------------+
 360
 361
 362 experiment 2:
 363
 364 +----+------+-----+-------+----------+-----------------------------------------------+
 365 | 11 | 10:6 | 5   | 4:3   | 2:0      | comment                                       |
 366 +----+------+-----+-------+----------+-----------------------------------------------+
 367 | 0  |  000 | 000         |  000     | no change to VL/MVL                           |
 368 +----+------+-------------+----------+-----------------------------------------------+
 369 | 0  |  imm | 000         |  rs'!=0  | MVL = imm; vl = min(r[rs'], MVL)              |
 370 +----+------+-------------+----------+-----------------------------------------------+
 371 | 0  |  imm | rd'!=0      |  000     | MVL = imm; vl = MVL; r[rd'] = vl              |
 372 +----+------+-------------+----------+-----------------------------------------------+
 373 | 0  |  imm | rd'!=0      |  rs'!=0  | MVL = imm; vl = min(r[rs'], MVL); r[rd'] = vl |
 374 +----+------+-----+-------+----------+-----------------------------------------------+
 375 | 1  |  imm | 0   |  00      000     | MVL = imm; vl = MVL;                          |
 376 +----+------+-----+------------------+-----------------------------------------------+
 377 | 1  |  imm | 0   |  rd[4:0]         | MVL = imm; vl = MVL; r[rd] = vl               |
 378 +----+------+-----+------------------+-----------------------------------------------+
 379 | 1  |  imm | 1   |  00      000     | reserved                                      |
 380 +----+------+-----+------------------+-----------------------------------------------+
 381 | 1  |  imm | 1   |  rs1[4:0]        | MVL = imm; vl = min(r[rs1], MVL)              |
 382 +----+------+-----+------------------+-----------------------------------------------+
 383
 384 Interestingly, "VLtyp[11] = 0" fits the sv.setvl pseudocode really well.
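
The experiment 2 table can be read as a small decoder, sketched here in C. The names ``vl_state`` and ``vltyp_apply`` are hypothetical. The table does not say how the 3-bit rd' and rs' fields map onto architectural registers, so the sketch simply uses the raw field value as the register index. It also rejects the one VLtyp[11] = 0 combination the table does not list (non-zero imm with rd' = rs' = 0).

.. code:: C

    #include <stdbool.h>
    #include <stdint.h>

    struct vl_state {
        uint64_t r[32];  /* integer register file */
        unsigned mvl;    /* MVL */
        unsigned vl;     /* VL */
    };

    static unsigned min_u(uint64_t a, unsigned b)
    {
        return a < b ? (unsigned)a : b;
    }

    /* Apply a 12-bit VLtyp field as laid out in the experiment 2 table.
     * Returns false for encodings the table marks reserved or does not list. */
    static bool vltyp_apply(struct vl_state *s, unsigned vltyp)
    {
        unsigned imm = (vltyp >> 6) & 0x1fu;      /* bits 10:6 */

        if (((vltyp >> 11) & 1u) == 0u) {         /* VLtyp[11] = 0 */
            unsigned rd = (vltyp >> 3) & 0x7u;    /* rd', bits 5:3 */
            unsigned rs = vltyp & 0x7u;           /* rs', bits 2:0 */
            if (rd == 0u && rs == 0u)
                return imm == 0u;                 /* all zero: no change */
            s->mvl = imm;
            s->vl = rs ? min_u(s->r[rs], s->mvl) : s->mvl;
            if (rd)
                s->r[rd] = s->vl;
            return true;
        }

        unsigned reg = vltyp & 0x1fu;             /* bits 4:0 */
        if (((vltyp >> 5) & 1u) == 0u) {          /* bit 5 = 0: rd[4:0] */
            s->mvl = imm;
            s->vl = s->mvl;
            if (reg)
                s->r[reg] = s->vl;
            return true;
        }
        if (reg == 0u)                            /* bit 5 = 1, rs1 = 0 */
            return false;                         /* reserved */
        s->mvl = imm;
        s->vl = min_u(s->r[reg], s->mvl);
        return true;
    }

The VLtyp[11] = 0 branch collapses three table rows into one path: MVL is always set from imm, VL is clamped only when rs' is non-zero, and the result is written back only when rd' is non-zero.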