simple_v_extension/sv_prefix_proposal/discussion.rst

Questions
=========

Confirmation needed as to whether subvector extraction can be covered
by twin predication (it probably can, it is one of the many purposes it
is for).

Answer:

Yes, it can, but VL needs to be changed for it to work, since predicates
work at the size of a whole subvector instead of an element of that
subvector. To avoid needing to constantly change VL, and since swizzles
are a very common operation, I think we should have a separate instruction,
a subvector element swizzle instruction::

    velswizzle x32, x64, SRCSUBVL=3, DESTSUBVL=4, ELTYPE=u8, elements=[0, 0, 2, 1]

Answer:

    > ok, i like that idea - adding to TODO list
    > see MV.X_

.. _MV.X: http://libre-riscv.org/simple_v_extension/specification/mv.x/

Example pseudocode:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint8_t ELTYPE;
    const int SRCSUBVL = 3;
    const int DESTSUBVL = 4;
    const int elements[] = {0, 0, 2, 1};
    ELTYPE *rd = (ELTYPE *)&regs[32];
    ELTYPE *rs1 = (ELTYPE *)&regs[64];  // x64, matching the instruction above
    for(int i = 0; i < VL; i++)
    {
        // destination element j takes source element elements[j]
        rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
        rd[i * DESTSUBVL + 1] = rs1[i * SRCSUBVL + elements[1]];
        rd[i * DESTSUBVL + 2] = rs1[i * SRCSUBVL + elements[2]];
        rd[i * DESTSUBVL + 3] = rs1[i * SRCSUBVL + elements[3]];
    }

To use the subvector element swizzle instruction to extract a subvector element,
all that needs to be done is to have DESTSUBVL be 1::

    // extract element index 2
    velswizzle rd, rs1, SRCSUBVL=4, DESTSUBVL=1, ELTYPE=u32, elements=[2]

Example pseudocode:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint32_t ELTYPE;
    const int SRCSUBVL = 4;
    const int DESTSUBVL = 1;
    const int elements[] = {2};
    ELTYPE *rd = (ELTYPE *)&regs[...];
    ELTYPE *rs1 = (ELTYPE *)&regs[...];
    for(int i = 0; i < VL; i++)
    {
        // each source subvector contributes only its element index 2
        rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
    }

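Both pseudocode examples follow the same pattern, so the swizzle can be
sketched as one generic routine. This is only a minimal runnable sketch, not
the proposed hardware behaviour: the function name ``velswizzle_u8``, its
argument order, and the use of flat non-overlapping arrays in place of the
register file are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the proposal): for each of the vl
 * subvectors, destination element j is copied from source element
 * elements[j] of the corresponding source subvector.
 * rd and rs1 are assumed not to overlap. */
static void velswizzle_u8(uint8_t *rd, const uint8_t *rs1, int vl,
                          int srcsubvl, int destsubvl, const int *elements)
{
    for (int i = 0; i < vl; i++) {
        for (int j = 0; j < destsubvl; j++) {
            /* a selector must name an element inside the source subvector */
            assert(elements[j] >= 0 && elements[j] < srcsubvl);
            rd[i * destsubvl + j] = rs1[i * srcsubvl + elements[j]];
        }
    }
}
```

Calling it with ``destsubvl`` set to 1 and a single selector gives the
extraction case; with ``destsubvl`` larger than ``srcsubvl`` it replicates
source elements, as in the ``[0, 0, 2, 1]`` example.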
----

What is SUBVL and how does it work?

Answer:

SUBVL is the instruction field in P48 instructions that specifies
the sub-vector length. The sub-vector length is the number of scalars
that are grouped together and treated like a single element by both VL
and predication. This is used to support operations where the elements are
short vectors (2-4 elements) in Vulkan and OpenGL. Those short vectors
are mostly used as mathematical vectors to handle directions, positions,
and colors, rather than as a pure optimization.

For example, when VL is 5::

    add x32, x48, x64, SUBVL=3, ELTYPE=u16, PRED=!x9

performs the following operation:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint16_t ELTYPE;
    const int SUBVL = 3;
    ELTYPE *rd = (ELTYPE *)&regs[32];
    ELTYPE *rs1 = (ELTYPE *)&regs[48];
    ELTYPE *rs2 = (ELTYPE *)&regs[64];
    for(int i = 0; i < VL; i++)
    {
        // PRED=!x9: bit i of the inverted x9 enables the whole subvector i
        if((~regs[9] >> i) & 0x1)
        {
            rd[i * SUBVL + 0] = rs1[i * SUBVL + 0] + rs2[i * SUBVL + 0];
            rd[i * SUBVL + 1] = rs1[i * SUBVL + 1] + rs2[i * SUBVL + 1];
            rd[i * SUBVL + 2] = rs1[i * SUBVL + 2] + rs2[i * SUBVL + 2];
        }
    }

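To make the role of the predicate explicit: one predicate bit governs a whole
subvector, so SUBVL scalars are written or skipped together. A minimal
runnable sketch follows. The function name ``sv_add_u16``, the flat-array
arguments, and passing the predicate as a plain ``uint64_t`` are assumptions
for illustration; masked-out destination elements are simply left unchanged
here, which is also an assumption, since this discussion does not say what
happens to them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: predicated add over vl subvectors of subvl
 * uint16_t scalars each.  Bit i of pred enables subvector i; pass the
 * already-inverted register value to model PRED=!x9. */
static void sv_add_u16(uint16_t *rd, const uint16_t *rs1, const uint16_t *rs2,
                       int vl, int subvl, uint64_t pred)
{
    assert(vl >= 0 && vl <= 64);    /* one predicate bit per subvector */
    for (int i = 0; i < vl; i++) {
        if (!((pred >> i) & 1))
            continue;               /* the whole subvector is skipped */
        for (int j = 0; j < subvl; j++)
            rd[i * subvl + j] =
                (uint16_t)(rs1[i * subvl + j] + rs2[i * subvl + j]);
    }
}
```

With VL=2, SUBVL=3 and x9=1, the inverted predicate masks out subvector 0 and
enables subvector 1, so only scalars 3..5 of the destination are written.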
----

SVorig goes to a lot of effort to keep 1 <= VL <= MAXVL and MAXVL in the
range 1..64, so that both CSRs may be stored internally in only 6 bits.

Thus, CSRRWI can reach 1..32 for VL and MAXVL.

In addition, instead of setting a hardware loop to zero and turning
instructions into NOPs, um, just branch over them: start the first loop
at the end, on the test for the loop variable being zero, a la C "while"
instead of "do while".

Or, does it not matter that VL only goes up to 31 on a CSRRWI, and that
it only goes to a max of 63 rather than 64?

Answer:

I think supporting SETVL where VL would be set to 0 should be done. That
way, the branch can be put after SETVL, allowing SETVL to execute
earlier, giving more time for VL to propagate to the instruction decoder
(preventing stalling).  I have no problem with having 0 stored to
VL via CSRW resulting in VL=64 (or whatever maximum value is supported
in hardware).

One related idea would be to support VL > XLEN but to only allow unpredicated
instructions when VL > XLEN. This would allow later implementing register
pairs/triplets/etc. as predicates as an extension.

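The arithmetic behind the two options can be sketched as follows. This is an
illustration only: the function names and the ``HW_MAX_VL`` constant are
hypothetical, and the second function models just the "0 written via CSRW
means maximum" idea from the answer, not a complete definition of the CSR.
The 5-bit range of the CSRRWI immediate (0..31) is the standard RISC-V one.

```c
#include <assert.h>

enum { HW_MAX_VL = 64 };  /* assumed hardware maximum, as in the discussion */

/* SVorig-style encoding: the field holds VL - 1, so 6 bits cover 1..64
 * and a 5-bit CSRRWI immediate (0..31) reaches 1..32.  VL = 0 cannot be
 * expressed. */
static unsigned vl_from_offset_field(unsigned field)
{
    assert(field < 64);
    return field + 1;
}

/* Alternative raised in the answer: the written value is VL itself,
 * except that writing 0 through CSRW selects the hardware maximum. */
static unsigned vl_from_csrw(unsigned value)
{
    assert(value <= HW_MAX_VL);
    return value == 0 ? HW_MAX_VL : value;
}
```

Under the second scheme a CSRRWI immediate reaches 1..31 directly, plus the
maximum through the value 0, while VL=0 is produced only by SETVL.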
----

Is MV.X good enough a substitute for swizzle?

Answer:

No: the swizzle instruction specifies in the opcode which elements are
used and where they go, so it can run much faster, since the execution engine
doesn't need to pessimize. Additionally, swizzles almost always have constant
element selectors. MV.X is meant more as a last-resort instruction that is
better than load/store, but worse than everything else.

    > ok, then we'll need a way to do that.  given that it needs to apply
    > to, well... everything, basically, i'm tempted to recommend it be
    > done as a CSR and/or as (another) table in VBLOCK.
    > the reason is, it's just too much to expect to massively duplicate
    > literally every single opcode in existence, just to add swizzle
    > when there's no room in the opcode space to do so.
    > not sure what alternatives there might be.

----

Is vectorised srcbase ok as a gather/scatter, and an ok substitute for
register stride? 5 dependency registers (reg stride being the 5th)
is quite scary.

----

Why are integer conversion instructions needed, when the main SV spec
covers them by allowing elwidth to be set on both src and dest regs?

----

Why are the SETVL rules so complex? What is the reason, how are loops
carried out?

Partial Answer:

The idea is that the compiler knows maxVL at compile time since it allocated the
backing registers, so SETVL has the maxVL as an immediate value. There is no
maxVL CSR needed for just SVPrefix.

    > when looking at a loop assembly sequence
    > i think you'll find this approach will not work.
    > RVV loops, on which SV loops are directly based, need an understanding
    > of the use of MIN within the actual SETVL instruction.
    > Yes MVL is known at compile time
    > however unless MVL is communicated to the hardware, SETVL just
    > does not work: it has absolutely no way of knowing when to stop
    > processing.  The point being: it's not *MVL* that's the problem
    > if MVL is not a CSR, it's *VL* that becomes the problem.
    > The only other option which does work is to set a mandatory
    > hardcoded MVL baked into the actual hardware.
    > That results in loss of flexibility and defeats the purpose of SV.

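The role of MIN in such a loop can be sketched as below. This is a software
model under stated assumptions, not the SV or RVV definition: ``setvl`` and
``vec_add_u32`` are hypothetical names, MVL is passed as an ordinary argument
(which corresponds to either an immediate or a CSR), and the inner ``for``
loop stands in for a single vector instruction executed with the current VL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of SETVL as described in the reply: VL = MIN(requested, MVL).
 * A request of 0 yields VL = 0. */
static size_t setvl(size_t requested, size_t mvl)
{
    return requested < mvl ? requested : mvl;
}

/* Hypothetical strip-mined loop: adds n elements in chunks of at most
 * mvl, and returns how many chunks were processed. */
static size_t vec_add_u32(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                          size_t n, size_t mvl)
{
    size_t chunks = 0;
    assert(mvl > 0);                    /* MVL = 0 would never make progress */
    for (;;) {
        size_t vl = setvl(n, mvl);      /* SETVL is issued first ... */
        if (vl == 0)                    /* ... then the exit test ("while" form) */
            break;
        for (size_t i = 0; i < vl; i++) /* stands in for one vector add */
            dst[i] = a[i] + b[i];
        dst += vl;
        a += vl;
        b += vl;
        n -= vl;
        chunks++;
    }
    return chunks;
}
```

The last chunk is shorter than MVL whenever n is not a multiple of it, and
that is exactly the case where the hardware has to know both values in order
to stop at the right element.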
----

With SUBVL (sub vector len) being both a CSR and also part of the 48/64
bit opcode, how does that work?

Answer:

I think we should just ignore the SUBVL CSR and use the value from the
SUBVL field when executing 48/64-bit instructions. For just SVPrefix,
I would say that the only user-visible CSR needed is VL. This is ignoring
all the state for context-switching and exception handling.

    > the consequence of that would be that P48/64 would need
    > its own CSR State to track the subelement index.
    > or that any exceptions would need to occur on a group
    > basis, which is less than ideal,
    > and interrupts would have to be stalled.
    > interacting with SUBVL and requiring P48/64 to save the
    > STATE CSR if needed is a workable compromise that
    > does not result in huge CSR proliferation

----

What are the interaction rules when a 48/64 prefix opcode has a rd/rs
that already has a Vector Context for either predication or a register?

It would perhaps make sense (and for svlen as well) to make 48/64 isolated
and unaffected by VLIW context, with the exception of VL/MVL.

MVL and VL should be modifiable by 64 bit prefix as they are global
in nature.

Possible solution: svlen and VLtyp are allowed to share the STATE CSR;
however, the programmer becomes responsible for push and pop of state
during use of a sequence of P48 and P64 ops.

----

Can bit 60 of P64 be put to use (in all but the FR4 case)?



experiment VLtyp
================

experiment 1:

+-----------+-------------+--------------+------------+----------------------+
| VLtyp[11] | VLtyp[10:6] | VLtyp[5:3]   | VLtyp[2:0] | comment              |
+-----------+-------------+--------------+------------+----------------------+
| 0         |  00000      | 000          |  000       | no change to VL/MVL  |
+-----------+-------------+--------------+------------+----------------------+
| 0         |  imm        | 000          |  rs'!=0    |                      |
+-----------+-------------+--------------+------------+----------------------+
| 0         |  imm        | rd'!=0       |  000       |                      |
+-----------+-------------+--------------+------------+----------------------+
| 0         |  imm        | rd'!=0       |  rs'!=0    |                      |
+-----------+-------------+--------------+------------+----------------------+
| 1         |  imm        | 000          |  000       |                      |
+-----------+-------------+--------------+------------+----------------------+
| 1         |  imm        | 000          |  rs'!=0    |                      |
+-----------+-------------+--------------+------------+----------------------+
| 1         |  imm        | rd'!=0       |  000       |                      |
+-----------+-------------+--------------+------------+----------------------+
| 1         |  imm        | rd'!=0       |  rs'!=0    |                      |
+-----------+-------------+--------------+------------+----------------------+


experiment 2:

+-----------+-------------+------------+--------------+------------+----------------------+
| VLtyp[11] | VLtyp[10:6] | VLtyp[5]   | VLtyp[4:3]   | VLtyp[2:0] | comment              |
+-----------+-------------+------------+--------------+------------+----------------------+
| 0         |  00000      | 0            00           |  000       | no change to VL/MVL  |
+-----------+-------------+---------------------------+------------+----------------------+
| 0         |  imm        | 000                       |  rs'!=0    | sv.setvl immed mode  |
+-----------+-------------+---------------------------+------------+----------------------+
| 0         |  imm        | rd'!=0                    |  000       | not sure             |
+-----------+-------------+---------------------------+------------+----------------------+
| 0         |  imm        | rd'!=0                    |  rs'!=0    | sv.setvl rd, rs, MVL |
+-----------+-------------+------------+--------------+------------+----------------------+
| 1         |  imm        | 0          |  00000                    | set MVL immed        |
+-----------+-------------+------------+---------------------------+----------------------+
| 1         |  imm        | 1          |  rd[4:0]                  | sv.setvl rd, immed   |
+-----------+-------------+------------+---------------------------+----------------------+
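
As a way of checking that the rows of experiment 2 do not collide, the table
can be transcribed into a small decoder. This is only a sketch of the table
as written: the enum labels and the function name ``vltyp_decode`` are
hypothetical, and any bit pattern that the table does not list is reported as
``VLTYP_UNLISTED`` rather than given a meaning.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the rows of the experiment 2 table. */
enum vltyp_mode {
    VLTYP_NO_CHANGE,     /* all fields zero */
    VLTYP_SETVL_IMM,     /* bit 11 = 0, rd' = 0,  rs' != 0 */
    VLTYP_UNDECIDED,     /* bit 11 = 0, rd' != 0, rs' = 0 ("not sure") */
    VLTYP_SETVL_RD_RS,   /* bit 11 = 0, rd' != 0, rs' != 0 */
    VLTYP_SET_MVL_IMM,   /* bit 11 = 1, bit 5 = 0, bits 4:0 = 0 */
    VLTYP_SETVL_RD_IMM,  /* bit 11 = 1, bit 5 = 1, bits 4:0 = rd */
    VLTYP_UNLISTED       /* combination not present in the table */
};

static enum vltyp_mode vltyp_decode(uint16_t vltyp)
{
    unsigned top = (vltyp >> 11) & 0x1;  /* VLtyp[11]        */
    unsigned imm = (vltyp >> 6) & 0x1f;  /* VLtyp[10:6]      */
    unsigned rd3 = (vltyp >> 3) & 0x7;   /* VLtyp[5:3], rd'  */
    unsigned rs3 = vltyp & 0x7;          /* VLtyp[2:0], rs'  */

    assert(vltyp < (1u << 12));          /* VLtyp is a 12-bit field */
    if (top == 0) {
        if (rd3 == 0 && rs3 == 0)
            return imm == 0 ? VLTYP_NO_CHANGE : VLTYP_UNLISTED;
        if (rd3 == 0)
            return VLTYP_SETVL_IMM;
        if (rs3 == 0)
            return VLTYP_UNDECIDED;
        return VLTYP_SETVL_RD_RS;
    }
    if ((vltyp >> 5) & 0x1)
        return VLTYP_SETVL_RD_IMM;
    return (vltyp & 0x1f) == 0 ? VLTYP_SET_MVL_IMM : VLTYP_UNLISTED;
}
```

Two groups of patterns come out as unlisted: bit 11 clear with a nonzero
immediate but both register fields zero, and bit 11 set with bit 5 clear and
nonzero bits 4:0.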