simple_v_extension/sv_prefix_proposal/discussion.rst

   1 RVC
   2 ===
   3
   4 The comment in the RVC section says that the Opcodes will be evaluated
   5 to see which are most useful to provide.
   6
   7 This takes a huge amount of time and, if not *exactly* RVC, would require
   8 a special decode engine, taking up extra gates as well as needing time
   9 to develop.
  10
  11 Far better to just embed RVC into the opcode and prefix it. This is
  12 in line with the strategic principle behind SV: "No new opcodes, only
  13 prefixed augmentation".
  14
  15 Taking an entire major 32 bit opcode (or two) seems logical (RV128
  16 space). Use the I-type format: funct3 specifies the C type page, and
  17 the 12-bit Imm holds the operation.
  18
  19 Or, just "to hell with it" and just take the entire opcode and stuff C
  20 into it, no regard for R/I/U/S and instead do whatever we like.
  21
  22
  23 +----------+------+---------------------+---------------------+-------+--------+
  24 | 15 14 13 |  12  |   11 10 9     8   7 | 6    5    4   3   2 | 1   0 | format |
  25 +----------+------+---------------------+---------------------+-------+--------+
  26 |    funct4       |     rd/rs1          |      rs2            | op    | CR     |
  27 +----------+------+---------------------+---------------------+-------+--------+
  28 |funct3    | imm  |     rd/rs1          |     imm             | op    | CI     |
  29 +----------+------+---------------------+---------------------+-------+--------+
  30 |funct3    |          imm               |      rs2            | op    | CSS    |
  31 +----------+----------------------------+---------+-----------+-------+--------+
  32 |funct3    |              imm                     |  rd'      | op    | CIW    |
  33 +----------+----------------+-----------+---------+-----------+-------+--------+
  34 |funct3    |    imm         | rs1'      | imm     |  rd'      | op    | CL     |
  35 +----------+----------------+-----------+---------+-----------+-------+--------+
  36 |funct3    |    imm         | rs1'      | imm     |  rs2'     | op    | CS     |
  37 +----------+----------------+-----------+---------+-----------+-------+--------+
  38 |       funct6              | rd'/rs1'  | funct2  |  rs2'     | op    | CA     |
  39 +----------+----------------+-----------+---------+-----------+-------+--------+
  40 |funct3    |   offset       |  rs1'     |     offset          | op    | CB     |
  41 +----------+----------------+-----------+---------------------+-------+--------+
  42 |funct3    |                jump target                       | op    | CJ     |
  43 +----------+--------------------------------------------------+-------+--------+
  44
  45 * top 14 bits of RVC to go into "MAJOR OPCODE 0-2" to represent
  46   RVC op[1:0] == 0b00, 0b01 and 0b10.  Therefore,
  47   18 bits remain in the 32-bit opcode space
  48 * 32-bit opcode prefix takes 7 bits, therefore 11 bits remain to fit
  49   an SVPrefix.
  50 * compared to P48, 11 bits are needed, and we have a match.
  51
  52 P48:
  53
  54 +---------------+--------+--------+----------+-----+--------+-------------+------+
  55 | Encoding      | 17     | 16     | 15       | 14  | 13     | 12          | 11:7 |
  56 +---------------+--------+--------+----------+-----+--------+-------------+------+
  57 | P32C-LD-type  | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0]         |
  58 +---------------+--------+--------+----------+-----+--------+-------------+------+
  59 | P32C-ST-type  |vitp7[6]| rs1[5] | rs2[5]   | vs2 | vs1    | vitp7[5:0]         |
  60 +---------------+--------+--------+----------+-----+--------+-------------+------+
  61 | P32C-R-type   | rd[5]  | rs1[5] | rs2[5]   | vs2 | vs1    | vitp6              |
  62 +---------------+--------+--------+----------+-----+--------+--------------------+
  63 | P32C-I-type   | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0]         |
  64 +---------------+--------+--------+----------+-----+--------+--------------------+
  65 | P32C-U-type   | rd[5]  | *Rsvd* | *Rsvd*   | vd  | *Rsvd* | vitp6              |
  66 +---------------+--------+--------+----------+-----+--------+-------------+------+
  67 | P32C-FR-type  | rd[5]  | rs1[5] | rs2[5]   | vs2 | vs1    | *Rsvd*      | vtp5 |
  68 +---------------+--------+--------+----------+-----+--------+-------------+------+
  69 | P32C-FI-type  | rd[5]  | rs1[5] | vitp7[6] | vd  | vs1    | vitp7[5:0]         |
  70 +---------------+--------+--------+----------+-----+--------+-------------+------+
  71 | P32C-FR4-type | rd[5]  | rs1[5] | rs2[5]   | vs2 | rs3[5] | vs3 [#fr4]_ | vtp5 |
  72 +---------------+--------+--------+----------+-----+--------+-------------+------+
  73
  74 P32C Prefix:
  75
  76 +---------------+--------+--------+-----+--------+-----+------------+
  77 | Encoding      | 31     | 30     | 29  | 28     | 27  | 26:21      |
  78 +---------------+--------+--------+-----+--------+-----+------------+
  79 | P32C-CL-type  | rd[5]  | rs1[5] | vd  | vs1    | vitp7[5:0]       |
  80 +---------------+--------+--------+-----+--------+------------------+
  81 | P32C-CS-type  | rs2[5] | rs1[5] | vs2 | vs1    | vitp7[5:0]       |
  82 +---------------+--------+--------+-----+--------+-----+------------+
  83 | P32C-CR-type  | rd[5]  | rs1[5] | vd  | vs1    |*Rsv*| vitp6      |
  84 +---------------+--------+--------+-----+--------+-----+------------+
  85 | P32C-CI1-type | rd[5]  | rs1[5] | vd  | vs1    | vitp7[5:0]       |
  86 +---------------+--------+--------+-----+--------+-----+------------+
  87 | P32C-CI2-type | rd[5]  | *Rsvd* | vd  | *Rsvd* |*Rsv*| vitp6      |
  88 +---------------+--------+--------+-----+--------+-----+------------+
  89 | P32C-CB-type  | *Rsvd* | rs1[5] |*Rsv*| vs1    | vitp7[5:0]       |
  90 +---------------+--------+--------+-----+--------+------------------+
  91 | P32C-CMv-type | rd[5]  | rs1[5] | vd  | vs1    | vitp7[5:0]       |
  92 +---------------+--------+--------+-----+--------+------------------+
  93
  94 Mapping P32-* Quadrants 0-2 to CUSTOM OPCODEs 0-2:
  95
  96 +-------------+--------+-----------+----------+
  97 | Encoding    | 31:21  | 20:7      | 6:0      |
  98 +-------------+--------+-----------+----------+
  99 | P32C RVC-Q0 | P32-*  | RVC[15:2] | OPCODE-0 |
 100 +-------------+--------+-----------+----------+
 101 | P32C RVC-Q1 | P32-*  | RVC[15:2] | OPCODE-1 |
 102 +-------------+--------+-----------+----------+
 103 | P32C RVC-Q2 | P32-*  | RVC[15:2] | OPCODE-2 |
 104 +-------------+--------+-----------+----------+
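
The mapping table above can be read as plain bit packing. Below is a minimal
sketch, assuming the layout exactly as tabulated (prefix in 31:21, RVC[15:2]
in 20:7, major opcode in 6:0). The helper names ``p32c_pack``,
``p32c_unpack_rvc`` and ``p32c_prefix`` are hypothetical, and the 7-bit
opcode value is a caller-supplied placeholder because this discussion does
not assign the actual major opcodes.

```c
#include <stdint.h>

/* Hypothetical helper: build a 32-bit P32C word from an 11-bit prefix,
 * a 16-bit RVC instruction and a 7-bit major opcode (one per quadrant). */
static inline uint32_t p32c_pack(uint32_t prefix11, uint16_t rvc,
                                 uint32_t opcode7)
{
    uint32_t rvc_hi = ((uint32_t)rvc >> 2) & 0x3FFFu; /* RVC[15:2], 14 bits */
    return ((prefix11 & 0x7FFu) << 21)                /* bits 31:21 */
         | (rvc_hi << 7)                              /* bits 20:7  */
         | (opcode7 & 0x7Fu);                         /* bits 6:0   */
}

/* Recover the 16-bit RVC instruction. The quadrant (RVC op[1:0]) is implied
 * by which major opcode matched, so the caller passes it back in. */
static inline uint16_t p32c_unpack_rvc(uint32_t insn, unsigned quadrant)
{
    return (uint16_t)((((insn >> 7) & 0x3FFFu) << 2) | (quadrant & 0x3u));
}

/* Extract the 11-bit prefix field from bits 31:21. */
static inline uint32_t p32c_prefix(uint32_t insn)
{
    return insn >> 21;
}
```

The bit budget works out as 11 + 14 + 7 = 32, which is the "we have a match"
observation in the bullet list above.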
 105
 106 Notes:
 107
 108 * Branch type requires 2 predicate registers as the second
 109   is used to store the combined results of the comparisons
 110   (not as twin-predication).  The tpred field is therefore
 111   used to determine whether x10 is enabled as the second
 112   register.  TBD, there may be a better (unique) encoding.
 113
 114 Questions
 115 =========
 116
 117 Confirmation needed as to whether subvector extraction can be covered
 118 by twin predication (it probably can, it is one of the many purposes it
 119 is for).
 120
 121 Answer:
 122
 123 Yes, it can, but VL needs to be changed for it to work, since predicates
 124 work at the size of a whole subvector instead of an element of that
 125 subvector. To avoid needing to constantly change VL, and since swizzles
 126 are a very common operation, I think we should have a separate instruction
 127 -- a subvector element swizzle instruction::
 128
 129     velswizzle x32, x64, SRCSUBVL=3, DESTSUBVL=4, ELTYPE=u8, elements=[0, 0, 2, 1]
 130
 131 Answer:
 132
 133     > ok, i like that idea - adding to TODO list
 134     > see MV.X_
 135
 136 .. _MV.X: http://libre-riscv.org/simple_v_extension/specification/mv.x/
 137
 138 Example pseudocode:
 139
 140 .. code:: C
 141
 142     // processor state:
 143     uint64_t regs[128];
 144     int VL = 5;
 145
 146     typedef uint8_t ELTYPE;
 147     const int SRCSUBVL = 3;
 148     const int DESTSUBVL = 4;
 149     const int elements[] = {0, 0, 2, 1};
 150     ELTYPE *rd = (ELTYPE *)&regs[32];
 151     ELTYPE *rs1 = (ELTYPE *)&regs[64]; // x64, matching the instruction above
 152     for(int i = 0; i < VL; i++)
 153     {
 154         rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
 155         rd[i * DESTSUBVL + 1] = rs1[i * SRCSUBVL + elements[1]];
 156         rd[i * DESTSUBVL + 2] = rs1[i * SRCSUBVL + elements[2]];
 157         rd[i * DESTSUBVL + 3] = rs1[i * SRCSUBVL + elements[3]];
 158     }
 159
 160 To use the subvector element swizzle instruction to extract a subvector element,
 161 all that needs to be done is to have DESTSUBVL be 1::
 162
 163     // extract element index 2
 164     velswizzle rd, rs1, SRCSUBVL=4, DESTSUBVL=1, ELTYPE=u32, elements=[2]
 165
 166 Example pseudocode:
 167
 168 .. code:: C
 169
 170     // processor state:
 171     uint64_t regs[128];
 172     int VL = 5;
 173
 174     typedef uint32_t ELTYPE;
 175     const int SRCSUBVL = 4;
 176     const int DESTSUBVL = 1;
 177     const int elements[] = {2};
 178     ELTYPE *rd = (ELTYPE *)&regs[...];
 179     ELTYPE *rs1 = (ELTYPE *)&regs[...];
 180     for(int i = 0; i < VL; i++)
 181     {
 182         rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
 183     }
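
Both pseudocode examples above are instances of one loop. A minimal sketch
of that general form, assuming u8 elements and non-overlapping source and
destination; the function name ``velswizzle_u8`` and its parameter list are
hypothetical and not part of any proposal:

```c
#include <stdint.h>

/* Sketch only: elements[] holds dest_subvl selectors, each < src_subvl.
 * For every one of the vl subvectors, destination lane j is copied from
 * source lane elements[j] of the same subvector. */
static void velswizzle_u8(uint8_t *rd, const uint8_t *rs1, int vl,
                          int src_subvl, int dest_subvl,
                          const int *elements)
{
    for (int i = 0; i < vl; i++)
        for (int j = 0; j < dest_subvl; j++)
            rd[i * dest_subvl + j] = rs1[i * src_subvl + elements[j]];
}
```

Calling it with ``dest_subvl`` of 1 and a single selector gives the
subvector element extraction case.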
 184
 185 ----
 186
 187 What is SUBVL and how does it work?
 188
 189 Answer:
 190
 191 SUBVL is the instruction field in P48 instructions that specifies
 192 the sub-vector length. The sub-vector length is the number of scalars
 193 that are grouped together and treated like an element by both VL and
 194 predication. This is used to support operations where the elements are
 195 short vectors (2-4 elements) in Vulkan and OpenGL. Those short vectors
 196 are mostly used as mathematical vectors to handle directions, positions,
 197 and colors, rather than as a pure optimization.
 198
 199 For example, when VL is 5::
 200
 201     add x32, x48, x64, SUBVL=3, ELTYPE=u16, PRED=!x9
 202
 203 performs the following operation:
 204
 205 .. code:: C
 206
 207     // processor state:
 208     uint64_t regs[128];
 209     int VL = 5;
 210
 211     // instruction fields:
 212     typedef uint16_t ELTYPE;
 213     const int SUBVL = 3;
 214     ELTYPE *rd = (ELTYPE *)&regs[32];
 215     ELTYPE *rs1 = (ELTYPE *)&regs[48];
 216     ELTYPE *rs2 = (ELTYPE *)&regs[64];
 217     for(int i = 0; i < VL; i++)
 218     {
 219         if((~regs[9] >> i) & 0x1) // PRED=!x9: inverted bit i gates subvector i
 220         {
 221             rd[i * SUBVL + 0] = rs1[i * SUBVL + 0] + rs2[i * SUBVL + 0];
 222             rd[i * SUBVL + 1] = rs1[i * SUBVL + 1] + rs2[i * SUBVL + 1];
 223             rd[i * SUBVL + 2] = rs1[i * SUBVL + 2] + rs2[i * SUBVL + 2];
 224         }
 225     }
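
To make the predicate granularity explicit, here is a small parameterised
sketch of the same operation. It assumes one predicate bit per subvector
(not per scalar) and VL of at most 64; the name ``subvl_add_u16`` and the
``invert`` flag are hypothetical stand-ins for the PRED field.

```c
#include <stdint.h>

/* Sketch: bit i of the (optionally inverted) predicate enables or skips
 * the whole i-th subvector of subvl scalars. Assumes vl <= 64. */
static void subvl_add_u16(uint16_t *rd, const uint16_t *rs1,
                          const uint16_t *rs2, int vl, int subvl,
                          uint64_t pred, int invert)
{
    if (invert)
        pred = ~pred;
    for (int i = 0; i < vl; i++) {
        if (!((pred >> i) & 1))
            continue; /* subvector i is left untouched */
        for (int j = 0; j < subvl; j++)
            rd[i * subvl + j] =
                (uint16_t)(rs1[i * subvl + j] + rs2[i * subvl + j]);
    }
}
```

With VL=2, SUBVL=3, an inverted predicate and a predicate register value of
1, only the second group of three scalars is written.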
 226
 227 ----
 228
 229 SVorig goes to a lot of effort to ensure 1 <= VL <= MAXVL and MAXVL in
 230 1..64, where both CSRs may be stored internally in only 6 bits.
 231
 232 Thus, CSRRWI can reach 1..32 for VL and MAXVL.
 233
 234 In addition, rather than setting a hardware loop to zero and turning
 235 instructions into NOPs, um, just branch over them: start the first loop
 236 at the end, on the test for the loop variable being zero, a la C "while"
 237 instead of "do while".
 238
 239 Or, does it not matter that VL only goes up to 31 on a CSRRWI, and that
 240 it only goes to a max of 63 rather than 64?
 241
 242 Answer:
 243
 244 I think supporting SETVL where VL would be set to 0 should be done. That
 245 way, the branch can be put after SETVL, allowing SETVL to execute
 246 earlier giving more time for VL to propagate (preventing stalling)
 247 to the instruction decoder.  I have no problem with having 0 stored to
 248 VL via CSRW resulting in VL=64 (or whatever maximum value is supported
 249 in hardware).
 250
 251 One related idea would be to support VL > XLEN but to only allow unpredicated
 252 instructions when VL > XLEN. This would allow later implementing register
 253 pairs/triplets/etc. as predicates as an extension.
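
The "0 stored via CSRW means maximum" idea can be sketched as follows. This
is only an illustration, assuming a 6-bit internal field and a hardware
maximum of 64 as discussed above; ``HW_MAX_VL`` and ``vl_from_csr_write``
are hypothetical names.

```c
/* Assumed hardware limit from the discussion above. */
enum { HW_MAX_VL = 64 };

/* Sketch: only 6 bits are stored, and a stored value of 0 is interpreted
 * as the maximum, so the reachable range is 1..64. */
static unsigned vl_from_csr_write(unsigned written)
{
    unsigned field = written & 0x3Fu; /* 6-bit storage (assumption) */
    return field == 0 ? HW_MAX_VL : field;
}
```

Note that under this scheme a direct CSR write cannot produce VL=0, so a
zero-length loop would have to come from SETVL itself.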
 254
 255 ----
 256
 257 Is MV.X good enough a substitute for swizzle?
 258
 259 Answer:
 260
 261 No, since the swizzle instruction specifies in the opcode which elements are
 262 used and where they go, so it can run much faster since the execution engine
 263 doesn't need to pessimize. Additionally, swizzles almost always have constant
 264 element selectors. MV.X is meant more as a last-resort instruction that is
 265 better than load/store, but worse than everything else.
 266
 267     > ok, then we'll need a way to do that.  given that it needs to apply
 268     > to, well... everything, basically, i'm tempted to recommend it be
 269     > done as a CSR and/or as (another) table in VBLOCK.
 270     > the reason is, it's just too much to expect to massively duplicate
 271     > literally every single opcode in existence, just to add swizzle
 272     > when there's no room in the opcode space to do so.
 273     > not sure what alternatives there might be.
 274
 275 ----
 276
 277 Is vectorised srcbase ok as a gather scatter and ok substitute for
 278 register stride? 5 dependency registers (reg stride being the 5th)
 279 is quite scary.
 280
 281 ----
 282
 283 Why are integer conversion instructions needed, when the main SV spec
 284 covers them by allowing elwidth to be set on both src and dest regs?
 285
 286 ----
 287
 288 Why are the SETVL rules so complex? What is the reason, how are loops
 289 carried out?
 290
 291 Partial Answer:
 292
 293 The idea is that the compiler knows maxVL at compile time since it allocated the
 294 backing registers, so SETVL has the maxVL as an immediate value. There is no
 295 maxVL CSR needed for just SVPrefix.
 296
 297     > when looking at a loop assembly sequence
 298     > i think you'll find this approach will not work.
 299     > RVV loops on which SV loops are directly based needs understanding
 300     > of the use of MIN within the actual SETVL instruction.
 301     > Yes MVL is known at compile time
 302     > however unless MVL is communicated to the hardware, SETVL just
 303     > does not work: it has absolutely no way of knowing when to stop
 304     > processing.  The point being: it's not *MVL* that's the problem
 305     > if MVL is not a CSR, it's *VL* that becomes the problem.
 306     > The only other option which does work is to set a mandatory
 307     > hardcoded MVL baked into the actual hardware.
 308     > That results in loss of flexibility and defeats the purpose of SV.
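
The role of MIN in the reply above can be illustrated with a strip-mined
loop. This is a sketch in plain C rather than SV assembly, assuming SETVL
behaves as vl = min(requested, MVL); the function names ``setvl`` and
``vec_add`` are hypothetical and the inner ``for`` stands in for a single
vectorised instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of SETVL under the stated assumption: clamp the request to MVL. */
static size_t setvl(size_t requested, size_t mvl)
{
    return requested < mvl ? requested : mvl;
}

/* Strip-mined a[i] += b[i]. mvl must be nonzero or the loop cannot make
 * progress. Returns the number of passes taken. */
static size_t vec_add(uint32_t *a, const uint32_t *b, size_t n, size_t mvl)
{
    size_t passes = 0;
    while (n > 0) {                     /* test first, a la "while" */
        size_t vl = setvl(n, mvl);      /* hardware needs MVL to do this */
        for (size_t i = 0; i < vl; i++) /* one vector op of length vl */
            a[i] += b[i];
        a += vl;
        b += vl;
        n -= vl;
        passes++;
    }
    return passes;
}
```

Without some MVL known to the hardware, the clamp inside ``setvl`` has
nothing to compare against, which is the objection raised in the reply.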
 309
 310 ----
 311
 312 With SUBVL (sub vector len) being both a CSR and also part of the 48/64
 313 bit opcode, how does that work?
 314
 315 Answer:
 316
 317 I think we should just ignore the SUBVL CSR and use the value from the
 318 SUBVL field when executing 48/64-bit instructions. For just SVPrefix,
 319 I would say that the only user-visible CSR needed is VL. This is ignoring
 320 all the state for context-switching and exception handling.
 321
 322     > the consequence of that would be that P48/64 would need
 323     > its own CSR State to track the subelement index.
 324     > or that any exceptions would need to occur on a group
 325     > basis, which is less than ideal,
 326     > and interrupts would have to be stalled.
 327     > interacting with SUBVL and requiring P48/64 to save the
 328     > STATE CSR if needed is a workable compromise that
 329     > does not result in huge CSR proliferation
 330
 331 ----
 332
 333 What are the interaction rules when a 48/64 prefix opcode has a rd/rs
 334 that already has a Vector Context for either predication or a register?
 335
 336 It would perhaps make sense (and for svlen as well) to make 48/64 isolated
 337 and unaffected by VLIW context, with the exception of VL/MVL.
 338
 339 MVL and VL should be modifiable by 64 bit prefix as they are global
 340 in nature.
 341
 342 Possible solution: svlen and VLtyp are allowed to share the STATE CSR;
 343 however, the programmer becomes responsible for push and pop of state
 344 during use of a sequence of P48 and P64 ops.
 345
 346 ----
 347
 348 Can bit 60 of P64 be put to use (in all but the FR4 case)?
 349
 350
 351
 352 experiment VLtyp
 353 ================
 354
 355 experiment 1:
 356
 357 +-----------+-------------+--------------+------------+----------------------+
 358 | VLtyp[11] | VLtyp[10:6] | VLtyp[5:3]   | VLtyp[2:0] | comment              |
 359 +-----------+-------------+--------------+------------+----------------------+
 360 | 0         |  00000      | 000          |  000       | no change to VL/MVL  |
 361 +-----------+-------------+--------------+------------+----------------------+
 362 | 0         |  imm        | 000          |  rs'!=0    |                      |
 363 +-----------+-------------+--------------+------------+----------------------+
 364 | 0         |  imm        | rd'!=0       |  000       |                      |
 365 +-----------+-------------+--------------+------------+----------------------+
 366 | 0         |  imm        | rd'!=0       |  rs'!=0    |                      |
 367 +-----------+-------------+--------------+------------+----------------------+
 368 | 1         |  imm        | 000          |  000       |                      |
 369 +-----------+-------------+--------------+------------+----------------------+
 370 | 1         |  imm        | 000          |  rs'!=0    |                      |
 371 +-----------+-------------+--------------+------------+----------------------+
 372 | 1         |  imm        | rd'!=0       | 000        |                      |
 373 +-----------+-------------+--------------+------------+----------------------+
 374 | 1         |  imm        | rd'!=0       |  rs'!=0    |                      |
 375 +-----------+-------------+--------------+------------+----------------------+
 376
 377
 378 experiment 2:
 379
 380 +----+------+-----+-------+----------+-----------------------------------------------+
 381 | 11 | 10:6 | 5   | 4:3   | 2:0      | comment                                       |
 382 +----+------+-----+-------+----------+-----------------------------------------------+
 383 | 0  |  000 | 000         |  000     | no change to VL/MVL                           |
 384 +----+------+-------------+----------+-----------------------------------------------+
 385 | 0  |  imm | 000         |  rs'!=0  | MVL = imm; vl = min(r[rs'], MVL)              |
 386 +----+------+-------------+----------+-----------------------------------------------+
 387 | 0  |  imm | rd'!=0      |  000     | MVL = imm; vl = MVL; r[rd'] = vl              |
 388 +----+------+-------------+----------+-----------------------------------------------+
 389 | 0  |  imm | rd'!=0      |  rs'!=0  | MVL = imm; vl = min(r[rs'], MVL); r[rd'] = vl |
 390 +----+------+-----+-------+----------+-----------------------------------------------+
 391 | 1  |  imm | 0   |  00      000     | MVL = imm; vl = MVL;                          |
 392 +----+------+-----+------------------+-----------------------------------------------+
 393 | 1  |  imm | 0   |  rd[4:0]         | MVL = imm; vl = MVL; r[rd] = vl               |
 394 +----+------+-----+------------------+-----------------------------------------------+
 395 | 1  |  imm | 1   |  00      000     | reserved                                      |
 396 +----+------+-----+------------------+-----------------------------------------------+
 397 | 1  |  imm | 1   |  rs1[4:0]        | MVL = imm; vl = min(r[rs], MVL)               |
 398 +----+------+-----+------------------+-----------------------------------------------+
 399
 400 interestingly, "VLtyp[11] = 0" fits the sv.setvl pseudocode really well.
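
A rough decode sketch of the experiment 2 table, to check that the rows are
mutually exclusive. Several things here are assumptions and not part of the
proposal: the ``sv_state`` struct, the result enum, indexing the register
file directly with the 3-bit rd'/rs' values (the table does not say how the
primed fields map to registers), taking MVL = imm literally, and reporting
the bit-11 = 0 case with both register fields zero but imm nonzero as
"unlisted" because the table has no row for it.

```c
#include <stdint.h>

typedef struct {
    uint64_t r[32]; /* integer registers */
    uint64_t vl, mvl;
} sv_state;

enum vltyp_result { VLTYP_OK, VLTYP_NOCHANGE, VLTYP_RESERVED, VLTYP_UNLISTED };

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

static enum vltyp_result vltyp_apply(sv_state *s, uint32_t vltyp)
{
    uint32_t imm = (vltyp >> 6) & 0x1Fu;       /* VLtyp[10:6] */

    if (((vltyp >> 11) & 1u) == 0) {           /* 3-bit rd' and rs' forms */
        uint32_t rd = (vltyp >> 3) & 0x7u;     /* VLtyp[5:3] */
        uint32_t rs = vltyp & 0x7u;            /* VLtyp[2:0] */
        if (rd == 0 && rs == 0)
            return imm == 0 ? VLTYP_NOCHANGE : VLTYP_UNLISTED;
        s->mvl = imm;
        s->vl = rs ? min_u64(s->r[rs], s->mvl) : s->mvl;
        if (rd)
            s->r[rd] = s->vl;
        return VLTYP_OK;
    }

    uint32_t reg = vltyp & 0x1Fu;              /* VLtyp[4:0], full 5-bit reg */
    if ((vltyp >> 5) & 1u) {                   /* source-register form */
        if (reg == 0)
            return VLTYP_RESERVED;
        s->mvl = imm;
        s->vl = min_u64(s->r[reg], s->mvl);
        return VLTYP_OK;
    }
    s->mvl = imm;                              /* destination-register form */
    s->vl = s->mvl;
    if (reg)
        s->r[reg] = s->vl;
    return VLTYP_OK;
}
```

The bit-11 = 0 branch collapses three table rows into one code path, which
is one way of seeing why it lines up with a setvl-style min operation.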