RVC
===

The comment in the RVC section says that the opcodes will be evaluated to
see which are most useful to provide.

That evaluation takes a huge amount of time and, if the result is not
*exactly* RVC, it would require a special decode engine, taking up extra
gates as well as needing time to develop.

Far better to just embed RVC into the opcode and prefix it. This is in line
with the strategic principle behind SV: "No new opcodes, only prefixed
augmentation".

Taking an entire major 32-bit opcode (or two) seems logical (RV128 space):
I-type, with funct3 specifying the C-type page and the 12-bit immediate
specifying the operation.
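
As a rough illustration of the I-type idea, the sketch below pulls funct3
and the 12-bit immediate out of a 32-bit instruction word, using the standard
RISC-V I-type field positions (funct3 in bits 14:12, imm[11:0] in bits 31:20).
The names ``embedded_rvc``, ``rvc_page`` and ``rvc_op`` are hypothetical and
only label the two roles suggested above; no encoding has been decided.

.. code:: C

    #include <stdint.h>

    /* Hypothetical view of an RVC operation embedded in an I-type word. */
    struct embedded_rvc {
        unsigned rvc_page; /* funct3: which C-type "page" (assumed role) */
        unsigned rvc_op;   /* imm[11:0]: operation within the page (assumed role) */
    };

    static struct embedded_rvc decode_embedded_rvc(uint32_t insn)
    {
        struct embedded_rvc d;
        d.rvc_page = (insn >> 12) & 0x7;   /* I-type funct3, bits 14:12 */
        d.rvc_op = (insn >> 20) & 0xfff;   /* I-type immediate, bits 31:20 */
        return d;
    }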

Or, just "to hell with it": take the entire opcode and stuff C into it,
with no regard for R/I/U/S, and do whatever we like.


Questions
=========

Confirmation needed as to whether subvector extraction can be covered
by twin predication (it probably can, it is one of the many purposes it
is for).

Answer:

Yes, it can, but VL needs to be changed for it to work, since predicates
work at the size of a whole subvector instead of an element of that
subvector. To avoid needing to constantly change VL, and since swizzles
are a very common operation, I think we should have a separate instruction
-- a subvector element swizzle instruction::

    velswizzle x32, x64, SRCSUBVL=3, DESTSUBVL=4, ELTYPE=u8, elements=[0, 0, 2, 1]

Answer:

    > ok, i like that idea - adding to TODO list
    > see MV.X_

.. _MV.X: http://libre-riscv.org/simple_v_extension/specification/mv.x/

Example pseudocode:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint8_t ELTYPE;
    const int SRCSUBVL = 3;
    const int DESTSUBVL = 4;
    const int elements[] = {0, 0, 2, 1};
    ELTYPE *rd = (ELTYPE *)&regs[32];
    ELTYPE *rs1 = (ELTYPE *)&regs[64];
    for(int i = 0; i < VL; i++)
    {
        // each destination slot takes the selected source subvector element
        rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
        rd[i * DESTSUBVL + 1] = rs1[i * SRCSUBVL + elements[1]];
        rd[i * DESTSUBVL + 2] = rs1[i * SRCSUBVL + elements[2]];
        rd[i * DESTSUBVL + 3] = rs1[i * SRCSUBVL + elements[3]];
    }

To use the subvector element swizzle instruction to extract a subvector element,
all that needs to be done is to have DESTSUBVL be 1::

    // extract element index 2
    velswizzle rd, rs1, SRCSUBVL=4, DESTSUBVL=1, ELTYPE=u32, elements=[2]

Example pseudocode:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint32_t ELTYPE;
    const int SRCSUBVL = 4;
    const int DESTSUBVL = 1;
    const int elements[] = {2};
    ELTYPE *rd = (ELTYPE *)&regs[...];
    ELTYPE *rs1 = (ELTYPE *)&regs[...];
    for(int i = 0; i < VL; i++)
    {
        rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
    }
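
Both unrolled examples follow the same pattern, which can be sketched as one
loop over the destination slots. This is only an illustration for ``u8``
elements: the function name ``velswizzle_u8`` and its parameter names are made
up here, and plain arrays stand in for the register file.

.. code:: C

    #include <stdint.h>

    /* Hypothetical generic form of the pseudocode above, for u8 elements.
       sel[j] picks the source subvector element that lands in
       destination slot j. */
    static void velswizzle_u8(uint8_t *rd, const uint8_t *rs1, int vl,
                              int src_subvl, int dest_subvl, const int *sel)
    {
        for (int i = 0; i < vl; i++)
            for (int j = 0; j < dest_subvl; j++)
                rd[i * dest_subvl + j] = rs1[i * src_subvl + sel[j]];
    }

Extraction is then the special case ``dest_subvl = 1`` with a one-entry
``sel``, for example ``{2}`` to pull element index 2 out of every subvector.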

----

What is SUBVL and how does it work?

Answer:

SUBVL is the instruction field in P48 instructions that specifies
the sub-vector length. The sub-vector length is the number of scalars
that are grouped together and treated like an element by both VL and
predication. This is used to support operations where the elements are
short vectors (2-4 elements) in Vulkan and OpenGL. Those short vectors
are mostly used as mathematical vectors to handle directions, positions,
and colors, rather than as a pure optimization.

For example, when VL is 5::

    add x32, x48, x64, SUBVL=3, ELTYPE=u16, PRED=!x9

performs the following operation:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint16_t ELTYPE;
    const int SUBVL = 3;
    ELTYPE *rd = (ELTYPE *)&regs[32];
    ELTYPE *rs1 = (ELTYPE *)&regs[48];
    ELTYPE *rs2 = (ELTYPE *)&regs[64];
    for(int i = 0; i < VL; i++)
    {
        // PRED=!x9: one predicate bit per subvector, inverted
        if((~regs[9] >> i) & 0x1)
        {
            rd[i * SUBVL + 0] = rs1[i * SUBVL + 0] + rs2[i * SUBVL + 0];
            rd[i * SUBVL + 1] = rs1[i * SUBVL + 1] + rs2[i * SUBVL + 1];
            rd[i * SUBVL + 2] = rs1[i * SUBVL + 2] + rs2[i * SUBVL + 2];
        }
    }

----

SVorig goes to a lot of effort to keep VL within 1 <= VL <= MAXVL and MAXVL
within 1..64, so that both CSRs may be stored internally in only 6 bits.

Thus, CSRRWI can reach 1..32 for VL and MAXVL.

In addition, rather than setting a hardware loop to zero and turning the
instructions into NOPs, um, just branch over them: start the first loop at
the end, on the test for the loop variable being zero, a la C "while do"
instead of "do while".

Or, does it not matter that VL only goes up to 31 on a CSRRWI, and that
it only goes to a max of 63 rather than 64?

Answer:

I think supporting SETVL where VL would be set to 0 should be done. That
way, the branch can be put after SETVL, allowing SETVL to execute
earlier, giving more time for VL to propagate to the instruction decoder
(preventing stalling). I have no problem with having 0 stored to
VL via CSRW resulting in VL=64 (or whatever maximum value is supported
in hardware).

One related idea would be to support VL > XLEN but to only allow unpredicated
instructions when VL > XLEN. This would allow later implementing register
pairs/triplets/etc. as predicates as an extension.
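
One possible reading of the "0 written via CSRW gives VL=64" suggestion in
the answer is sketched below. It is an assumption, not agreed behaviour: the
name ``vl_after_csr_write``, the constant ``HW_MAX_VL`` and the choice to keep
only the low 6 bits of the written value are all made up for illustration.

.. code:: C

    #include <stdint.h>

    enum { HW_MAX_VL = 64 };  /* assumed hardware maximum */

    /* VL after a CSR write, ASSUMING only the low 6 bits are kept
       and an all-zero result is read as HW_MAX_VL. */
    static unsigned vl_after_csr_write(uint64_t written)
    {
        unsigned low6 = (unsigned)(written & 0x3f);
        return low6 == 0 ? HW_MAX_VL : low6;
    }

Under these assumptions a write of 64 and a write of 0 both give VL=64, and
a 5-bit CSRRWI immediate reaches 1..31 plus 64, but not 32..63.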

----

Is MV.X good enough a substitute for swizzle?

Answer:

No, since the swizzle instruction specifies in the opcode which elements are
used and where they go, so it can run much faster since the execution engine
doesn't need to pessimize. Additionally, swizzles almost always have constant
element selectors. MV.X is meant more as a last-resort instruction that is
better than load/store, but worse than everything else.

    > ok, then we'll need a way to do that.  given that it needs to apply
    > to, well... everything, basically, i'm tempted to recommend it be
    > done as a CSR and/or as (another) table in VBLOCK.
    > the reason is, it's just too much to expect to massively duplicate
    > literally every single opcode in existence, just to add swizzle
    > when there's no room in the opcode space to do so.
    > not sure what alternatives there might be.

----

Is vectorised srcbase ok as a gather-scatter, and an ok substitute for
register stride? 5 dependency registers (reg stride being the 5th)
is quite scary.

----

Why are integer conversion instructions needed, when the main SV spec
covers them by allowing elwidth to be set on both src and dest regs?

----

Why are the SETVL rules so complex? What is the reason, how are loops
carried out?

Partial Answer:

The idea is that the compiler knows maxVL at compile time since it allocated the
backing registers, so SETVL has the maxVL as an immediate value. There is no
maxVL CSR needed for just SVPrefix.

    > when looking at a loop assembly sequence
    > i think you'll find this approach will not work.
    > RVV loops, on which SV loops are directly based, need understanding
    > of the use of MIN within the actual SETVL instruction.
    > Yes MVL is known at compile time
    > however unless MVL is communicated to the hardware, SETVL just
    > does not work: it has absolutely no way of knowing when to stop
    > processing.  The point being: it's not *MVL* that's the problem
    > if MVL is not a CSR, it's *VL* that becomes the problem.
    > The only other option which does work is to set a mandatory
    > hardcoded MVL baked into the actual hardware.
    > That results in loss of flexibility and defeats the purpose of SV.
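
The role of MIN in such a loop can be sketched in plain C. This is only a
model of the usual strip-mining pattern, not SV or RVV code: ``setvl`` and
``vec_add`` are illustrative names, the inner ``for`` stands in for a single
vector instruction, and ``mvl`` is assumed to be at least 1.

.. code:: C

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for SETVL: VL = min(requested, MVL). */
    static size_t setvl(size_t requested, size_t mvl)
    {
        return requested < mvl ? requested : mvl;
    }

    /* Add two arrays of n elements, at most mvl elements per pass. */
    static void vec_add(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                        size_t n, size_t mvl)
    {
        while (n > 0) {
            size_t vl = setvl(n, mvl);        /* MVL must be known here */
            for (size_t i = 0; i < vl; i++)   /* models one vector add */
                dst[i] = a[i] + b[i];
            dst += vl;
            a += vl;
            b += vl;
            n -= vl;
        }
    }

Whether ``mvl`` comes from an immediate, a CSR or a hardwired constant, the
loop only terminates correctly because each pass is clamped to it.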

----

With SUBVL (sub vector len) being both a CSR and also part of the 48/64
bit opcode, how does that work?

Answer:

I think we should just ignore the SUBVL CSR and use the value from the
SUBVL field when executing 48/64-bit instructions. For just SVPrefix,
I would say that the only user-visible CSR needed is VL. This is ignoring
all the state for context-switching and exception handling.

    > the consequence of that would be that P48/64 would need
    > its own CSR State to track the subelement index.
    > or that any exceptions would need to occur on a group
    > basis, which is less than ideal,
    > and interrupts would have to be stalled.
    > interacting with SUBVL and requiring P48/64 to save the
    > STATE CSR if needed is a workable compromise that
    > does not result in huge CSR proliferation

----

What are the interaction rules when a 48/64 prefix opcode has a rd/rs
that already has a Vector Context for either predication or a register?

It would perhaps make sense (and for svlen as well) to make 48/64 isolated
and unaffected by VLIW context, with the exception of VL/MVL.

MVL and VL should be modifiable by 64 bit prefix as they are global
in nature.

Possible solution: svlen and VLtyp are allowed to share the STATE CSR;
however, the programmer becomes responsible for push and pop of state
during use of a sequence of P48 and P64 ops.

----

Can bit 60 of P64 be put to use (in all but the FR4 case)?


experiment VLtyp
================

experiment 1:

+-----------+-------------+--------------+------------+----------------------+
| VLtyp[11] | VLtyp[10:6] | VLtyp[5:3]   | VLtyp[2:0] | comment              |
+-----------+-------------+--------------+------------+----------------------+
| 0         |  00000      | 000          |  000       | no change to VL/MVL  |
+-----------+-------------+--------------+------------+----------------------+
| 0         |  imm        | 000          |  rs'!=0    |                      |
+-----------+-------------+--------------+------------+----------------------+
| 0         |  imm        | rd'!=0       |  000       |                      |
+-----------+-------------+--------------+------------+----------------------+
| 0         |  imm        | rd'!=0       |  rs'!=0    |                      |
+-----------+-------------+--------------+------------+----------------------+
| 1         |  imm        | 000          |  000       |                      |
+-----------+-------------+--------------+------------+----------------------+
| 1         |  imm        | 000          |  rs'!=0    |                      |
+-----------+-------------+--------------+------------+----------------------+
| 1         |  imm        | rd'!=0       |  000       |                      |
+-----------+-------------+--------------+------------+----------------------+
| 1         |  imm        | rd'!=0       |  rs'!=0    |                      |
+-----------+-------------+--------------+------------+----------------------+


experiment 2:

+----+-------+-----+-------+----------+-----------------------------------------------+
| 11 | 10:6  | 5   | 4:3   | 2:0      | comment                                       |
+----+-------+-----+-------+----------+-----------------------------------------------+
| 0  | 00000 | 000         | 000      | no change to VL/MVL                           |
+----+-------+-------------+----------+-----------------------------------------------+
| 0  | imm   | 000         | rs'!=0   | MVL = imm; vl = min(r[rs'], MVL)              |
+----+-------+-------------+----------+-----------------------------------------------+
| 0  | imm   | rd'!=0      | 000      | MVL = imm; vl = MVL; r[rd'] = vl              |
+----+-------+-------------+----------+-----------------------------------------------+
| 0  | imm   | rd'!=0      | rs'!=0   | MVL = imm; vl = min(r[rs'], MVL); r[rd'] = vl |
+----+-------+-----+-------+----------+-----------------------------------------------+
| 1  | imm   | 0   | 00000            | MVL = imm; vl = MVL;                          |
+----+-------+-----+------------------+-----------------------------------------------+
| 1  | imm   | 0   | rd[4:0]          | MVL = imm; vl = MVL; r[rd] = vl               |
+----+-------+-----+------------------+-----------------------------------------------+
| 1  | imm   | 1   | 00000            | reserved                                      |
+----+-------+-----+------------------+-----------------------------------------------+
| 1  | imm   | 1   | rs1[4:0]         | MVL = imm; vl = min(r[rs1], MVL)              |
+----+-------+-----+------------------+-----------------------------------------------+

Interestingly, "VLtyp[11] = 0" fits the sv.setvl pseudocode really well.
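
The experiment 2 table can be written out as a small decoder, as a way of
checking that the rows do not overlap. This is a sketch under assumptions
that the table does not state: ``sv_state`` and ``apply_vltyp`` are
hypothetical names, ``imm`` is taken to hold MVL directly, the 3-bit
rd'/rs' fields are taken to select x8..x15 in the way RVC register fields
do, and any encoding the table does not list is rejected.

.. code:: C

    #include <stdint.h>

    struct sv_state {
        uint64_t r[32];      /* integer register file */
        unsigned vl, mvl;
    };

    static unsigned min_u(uint64_t a, unsigned b)
    {
        return a < b ? (unsigned)a : b;
    }

    /* Apply a 12-bit VLtyp field. Returns 0 on success, or -1 for the
       reserved encoding and for encodings the table does not list. */
    static int apply_vltyp(struct sv_state *s, unsigned vltyp)
    {
        unsigned imm = (vltyp >> 6) & 0x1f;        /* VLtyp[10:6] */

        if (((vltyp >> 11) & 0x1) == 0) {
            unsigned rdp = (vltyp >> 3) & 0x7;     /* rd', VLtyp[5:3] */
            unsigned rsp = vltyp & 0x7;            /* rs', VLtyp[2:0] */
            if (rdp == 0 && rsp == 0)
                return imm == 0 ? 0 : -1;          /* all zero: no change */
            s->mvl = imm;
            /* ASSUMPTION: rd'/rs' map to x8..x15, as in RVC */
            s->vl = rsp ? min_u(s->r[8 + rsp], s->mvl) : s->mvl;
            if (rdp)
                s->r[8 + rdp] = s->vl;
            return 0;
        }

        unsigned reg = vltyp & 0x1f;               /* rd or rs1, VLtyp[4:0] */
        if ((vltyp >> 5) & 0x1) {
            if (reg == 0)
                return -1;                         /* reserved */
            s->mvl = imm;
            s->vl = min_u(s->r[reg], s->mvl);
        } else {
            s->mvl = imm;
            s->vl = s->mvl;
            if (reg)
                s->r[reg] = s->vl;
        }
        return 0;
    }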