simple_v_extension/sv_prefix_proposal/discussion.mdwn

# questions
# =========

Confirmation needed as to whether subvector extraction can be covered
by twin predication (it probably can, it is one of the many purposes it
is for).

Answer:

Yes, it can, but VL needs to be changed for it to work, since predicates work at
the size of a whole subvector instead of an element of that subvector. To avoid
needing to constantly change VL, and since swizzles are a very common operation, I
think we should have a separate instruction -- a subvector element swizzle
instruction::

    velswizzle x32, x64, SRCSUBVL=3, DESTSUBVL=4, ELTYPE=u8, elements=[0, 0, 2, 1]

Example pseudocode:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint8_t ELTYPE;
    const int SRCSUBVL = 3;
    const int DESTSUBVL = 4;
    // source element index used for each destination element
    const int elements[] = {0, 0, 2, 1};
    ELTYPE *rd = (ELTYPE *)&regs[32];  // x32
    ELTYPE *rs1 = (ELTYPE *)&regs[64]; // x64
    for(int i = 0; i < VL; i++)
    {
        rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
        rd[i * DESTSUBVL + 1] = rs1[i * SRCSUBVL + elements[1]];
        rd[i * DESTSUBVL + 2] = rs1[i * SRCSUBVL + elements[2]];
        rd[i * DESTSUBVL + 3] = rs1[i * SRCSUBVL + elements[3]];
    }

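The pseudocode above can also be written as a small self-contained C function, which may be easier to experiment with. This is only a sketch: `velswizzle_u8` is a hypothetical helper name, the element type is fixed to `uint8_t`, and the source and destination buffers are assumed not to overlap (the proposal does not say what happens when they do).

```c
#include <stdint.h>

/* Hypothetical helper, not part of the proposal: applies the same element
 * selection to each of `vl` subvectors. `elements[j]` is the index, within
 * the source subvector, of the element copied to destination element j.
 * Assumes rd and rs1 do not overlap. */
static void velswizzle_u8(uint8_t *rd, const uint8_t *rs1, int vl,
                          int src_subvl, int dest_subvl,
                          const int *elements)
{
    for (int i = 0; i < vl; i++) {
        for (int j = 0; j < dest_subvl; j++) {
            rd[i * dest_subvl + j] = rs1[i * src_subvl + elements[j]];
        }
    }
}
```
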
To use the subvector element swizzle instruction to extract a subvector element,
all that needs to be done is to have DESTSUBVL be 1::

    // extract element index 2
    velswizzle rd, rs1, SRCSUBVL=4, DESTSUBVL=1, ELTYPE=u32, elements=[2]

Example pseudocode:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint32_t ELTYPE;
    const int SRCSUBVL = 4;
    const int DESTSUBVL = 1;
    const int elements[] = {2};
    ELTYPE *rd = (ELTYPE *)&regs[...];
    ELTYPE *rs1 = (ELTYPE *)&regs[...];
    for(int i = 0; i < VL; i++)
    {
        rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
    }

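With DESTSUBVL fixed at 1, the swizzle reduces to a strided copy of one element per subvector. A minimal sketch of that special case follows, assuming `uint32_t` elements and non-overlapping buffers; `subvec_extract_u32` is a hypothetical name used only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: copies element `index` of each of `vl` source
 * subvectors (each `src_subvl` elements long) into a packed destination. */
static void subvec_extract_u32(uint32_t *rd, const uint32_t *rs1, int vl,
                               int src_subvl, int index)
{
    for (int i = 0; i < vl; i++) {
        rd[i] = rs1[i * src_subvl + index];
    }
}
```
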
--

What is SUBVL and how does it work?

Answer:

SUBVL is the instruction field in P48 instructions that specifies the sub-vector
length. The sub-vector length is the number of scalars that are grouped together
and treated like an element by both VL and predication. This is used to support
operations where the elements are short vectors (2-4 elements) in Vulkan and
OpenGL. Those short vectors are mostly used as mathematical vectors to handle
directions, positions, and colors, rather than as a pure optimization.

For example, when VL is 5::

    add x32, x48, x64, SUBVL=3, ELTYPE=u16, PRED=!x9

performs the following operation:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint16_t ELTYPE;
    const int SUBVL = 3;
    ELTYPE *rd = (ELTYPE *)&regs[32];
    ELTYPE *rs1 = (ELTYPE *)&regs[48];
    ELTYPE *rs2 = (ELTYPE *)&regs[64];
    for(int i = 0; i < VL; i++)
    {
        // PRED=!x9: subvector i is processed when bit i of x9 is clear
        if((~regs[9] >> i) & 0x1)
        {
            rd[i * SUBVL + 0] = rs1[i * SUBVL + 0] + rs2[i * SUBVL + 0];
            rd[i * SUBVL + 1] = rs1[i * SUBVL + 1] + rs2[i * SUBVL + 1];
            rd[i * SUBVL + 2] = rs1[i * SUBVL + 2] + rs2[i * SUBVL + 2];
        }
    }

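The same operation can be sketched as a standalone C function to make the granularity of predication explicit: one predicate bit gates a whole subvector, not an individual scalar. The sketch assumes that predicate bit i corresponds to subvector i, that masked-out subvectors leave the destination unchanged, and that `vl` is at most 64; `subvl_add_u16` and its `inverted` parameter (standing in for the `!` in `PRED=!x9`) are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical helper: adds `vl` subvectors of `subvl` uint16_t scalars.
 * Bit i of the (optionally inverted) predicate gates subvector i as a
 * whole. Masked-out subvectors leave rd untouched (an assumption). */
static void subvl_add_u16(uint16_t *rd, const uint16_t *rs1,
                          const uint16_t *rs2, int vl, int subvl,
                          uint64_t pred_reg, int inverted)
{
    uint64_t mask = inverted ? ~pred_reg : pred_reg;
    for (int i = 0; i < vl; i++) {
        if (!((mask >> i) & 1)) {
            continue; /* skip the whole subvector */
        }
        for (int j = 0; j < subvl; j++) {
            rd[i * subvl + j] =
                (uint16_t)(rs1[i * subvl + j] + rs2[i * subvl + j]);
        }
    }
}
```
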
--

SVorig goes to a lot of effort to keep VL in the range 1..MAXVL and MAXVL in
the range 1..64, so that both CSRs can be stored internally in only 6 bits.

Thus, CSRRWI can reach 1..32 for VL and MAXVL.

In addition, instead of setting a hardware loop to zero and turning the
instructions into NOPs, um, just branch over them: start the first loop at
the end, on the test for the loop variable being zero, a la C "while"
instead of "do while".

Or, does it not matter that VL only goes up to 31 on a CSRRWI, and that
it only goes to a max of 63 rather than 64?

Answer:

I think supporting SETVL where VL would be set to 0 should be done. That way,
the branch can be put after SETVL, allowing SETVL to execute earlier and giving
more time for VL to propagate to the instruction decoder (preventing stalls).
I have no problem with having 0 stored to VL via CSRW resulting in VL=64
(or whatever maximum value is supported in hardware).

One related idea would be to support VL > XLEN but to only allow unpredicated
instructions when VL > XLEN. This would allow later implementing register
pairs/triplets/etc. as predicates as an extension.

--

Should these questions be moved to the Discussion subpage?

Answer:

Probably, I'll let Luke do that if desired.

--

Is MV.X a good enough substitute for swizzle?

Answer:

No, since the swizzle instruction specifies in the opcode which elements are
used and where they go, so it can run much faster since the execution engine
doesn't need to pessimize for element selectors that are only known at run
time. Additionally, swizzles almost always have constant element selectors.
MV.X is meant more as a last-resort instruction that is better than
load/store, but worse than everything else.

--

Is vectorised srcbase ok as a gather/scatter and an ok substitute for
register stride? 5 dependency registers (reg stride being the 5th)
is quite scary.

--

Why are integer conversion instructions needed, when the main SV spec
covers them by allowing elwidth to be set on both src and dest regs?

--

Why are the SETVL rules so complex? What is the reason, how are loops
carried out?

Partial Answer:

The idea is that the compiler knows maxVL at compile time since it allocated the
backing registers, so SETVL has the maxVL as an immediate value. There is no
maxVL CSR needed for just SVPrefix.

> When looking at a loop assembly sequence,
> I think you'll find this approach will not work.
> RVV loops, on which SV loops are directly based, require an understanding
> of the use of MIN. Yes, MVL is known at compile time;
> however, unless MVL is communicated to the hardware, SETVL just
> does not work.
> The only other option which does work is to set a mandatory
> hardcoded MVL baked into the actual hardware.
> That results in loss of flexibility and defeats the purpose of SV.

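The role of MIN in an RVV-style loop can be illustrated with a strip-mined loop in C. This is a sketch under stated assumptions, not the SV or RVV definition: `setvl` models SETVL as returning min(requested, MVL), the inner `for` loop stands in for a single vector add, `mvl` must be greater than zero, and both function names are hypothetical. Whether MVL reaches `setvl` as an immediate or from a CSR is exactly the point under discussion; the sketch only shows that the hardware needs the value from somewhere.

```c
#include <stddef.h>

/* Hypothetical model of SETVL: VL = min(requested, MVL). */
static size_t setvl(size_t requested, size_t mvl)
{
    return requested < mvl ? requested : mvl;
}

/* Strip-mined dst = a + b over n elements, at most mvl per pass.
 * Returns the number of passes. Written as a "while" loop, so n == 0
 * executes no vector work at all. Requires mvl > 0. */
static size_t strip_mine_add(int *dst, const int *a, const int *b,
                             size_t n, size_t mvl)
{
    size_t passes = 0;
    while (n > 0) {
        size_t vl = setvl(n, mvl);
        for (size_t i = 0; i < vl; i++) { /* one vector instruction */
            dst[i] = a[i] + b[i];
        }
        dst += vl;
        a += vl;
        b += vl;
        n -= vl;
        passes++;
    }
    return passes;
}
```
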
--

With SUBVL (sub-vector length) being both a CSR and also part of the 48/64
bit opcode, how does that work?

Answer:

I think we should just ignore the SUBVL CSR and use the value from the SUBVL field when
executing 48/64-bit instructions. For just SVPrefix, I would say that the only
user-visible CSR needed is VL. This is ignoring all the state for
context-switching and exception handling.

> The consequence of that would be that P48/64 would need
> its own CSR State to track the subelement index,
> or that any exceptions would need to occur on a group
> basis, which is less than ideal,
> and interrupts would have to be stalled.
> Interacting with SUBVL and requiring P48/64 to save the
> STATE CSR if needed is a workable compromise that
> does not result in huge CSR proliferation.

--

What are the interaction rules when a 48/64 prefix opcode has a rd/rs
that already has a Vector Context for either predication or a register?

It would perhaps make sense (and for svlen as well) to make 48/64 isolated
and unaffected by VLIW context, with the exception of VL/MVL.

MVL and VL should be modifiable by 64 bit prefix as they are global
in nature.

Possible solution: svlen and VLtyp are allowed to share the STATE CSR; however,
the programmer becomes responsible for push and pop of state during use of
a sequence of P48 and P64 ops.

--

Can bit 60 of P64 be put to use (in all but the FR4 case)?
