simple_v_extension/specification/mv.x.rst

   1 [[!tag standards]]
   2
   3 MV.X and MV.swizzle
   4 ===================
   5
   6 swizzle needs a MV (there are 2 of them: swizzle and swizzle2).
   7 see below for a potential way to use the funct7 to do a swizzle in rs2.
   8
   9 +---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
  10 | Encoding      | 31:27       | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
  11 +---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
  12 | RV32-I-type   + imm[11:0]                      + rs1[4:0] + funct3 | rd[4:0]  + opcode + 0b11   |
  13 +---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
  14 | RV32-I-type   + fn4[3:0]    + swizzle[7:0]     + rs1[4:0] + 0b000  | rd[4:0]  + OP-V   + 0b11   |
  15 +---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
  16
  17 * funct3 = MV: 0b000 for FP, 0b001 for INT
  18 * OP-V = 0b1010111
  19 * fn4 = 4 bit function.
  20 * fn4 = 0b0000 - MV-SWIZZLE
  21 * fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
  22 * fn4 = 0bNN11 - MV-X.SUBVL NN=elwidth (default/8/16/32)
  23
  24 swizzle (only active on SV or P48/P64 when SUBVL!=0):
  25
  26 +-----+-----+-----+-----+
  27 | 7:6 | 5:4 | 3:2 | 1:0 |
  28 +-----+-----+-----+-----+
  29 |   w |   z |   y |   x |
  30 +-----+-----+-----+-----+
  31
  32 MV.X has two modes: SUBVL mode applies the element offsets only within a SUBVL inner loop. This can be used for transposition.
  33
  34 ::
  35
  36   for i in range(VL):
  37      for j in range(SUBVL):
  38         regs[rd] = regs[rd+regs[rs+j]]
  39
  40 Normal mode will apply the element offsets incrementally:
  41
  42 ::
  43
  44   for i in range(VL):
  45      for j in range(SUBVL):
  46         regs[rd] = regs[rd+regs[rs+k]]
  47           k++
  48
  49
  50 Pseudocode for element width part of MV.X:
  51
  52 ::
  53
  54   def mv_x(rd, rs1, funct4):
  55       elwidth = (funct4>>2) & 0x3
  56       bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
  57       bytewidth = bitwidth / 8 # get bytes per el
  58       for i in range(VL):
  59           addr = (unsigned char *)&regs[rs1]
  60           offset = addr + bytewidth # get offset within regfile as SRAM
  61           # TODO, actually, needs to respect rd and rs1 element width,
  62           # here, as well.  this pseudocode just illustrates that the
  63           # MV.X operation contains a way to compact the indices into
  64           # less space.
  65           regs[rd] = (unsigned char*)(regs)[offset]
  66
  67 The idea here is to allow 8-bit indices to be stored inside XLEN-sized
  68 registers, such that rather than doing this:
  69
  70 .. parsed-literal::
  71     ldimm x8, 1
  72     ldimm x9, 3
  73     ldimm x10, 2
  74     ldimm x11, 0
  75     {SVP.VL=4} MV.X x3, x8, elwidth=default
  76
  77 The alternative is this:
  78
  79 .. parsed-literal::
  80     ldimm x8, 0x00020301
  81     {SVP.VL=4} MV.X x3, x8, elwidth=8
  82
  83 Thus compacting four indices into the one register.  x3 and x8's element
  84 width are *independent* of the MV.X elwidth, thus allowing both source
  85 and element element widths of the *elements* to be moved to be over-ridden,
  86 whilst *at the same time* allowing the *indices* to be compacted, as well.
  87
  88 ----
  89
  90 potential MV.X?  register-version of MV-swizzle?
  91
  92 +-------------+-------+-------+----------+----------+--------+----------+--------+--------+
  93 | Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
  94 +-------------+-------+-------+----------+----------+--------+----------+--------+--------+
  95 | RV32-R-type + funct7        + rs2[4:0] + rs1[4:0] + funct3 | rd[4:0]  + opcode + 0b11   |
  96 +-------------+-------+-------+----------+----------+--------+----------+--------+--------+
  97 | RV32-R-type + 0b0000000     + rs2[4:0] + rs1[4:0] + 0b001  | rd[4:0]  + OP-V   + 0b11   |
  98 +-------------+-------+-------+----------+----------+--------+----------+--------+--------+
  99
 100 * funct3 = MV.X
 101 * OP-V = 0b1010111
 102 * funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
 103 * funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
 104 * funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
 105 * funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?
 106
 107 question: do we need a swizzle MV.X as well?
 108
 109 MV.X with 3 operands
 110 ====================
 111
 112 regs[rd] = regs[rs1 + regs[rs2]]
 113
 114 Similar to LD/ST with the same twin predication rules
 115
 116 macro-op fusion
 117 ===============
 118
 119 there is the potential for macro-op fusion of mv-swizzle with the following instruction and/or preceding instruction.
 120 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>
 121
 122 VBLOCK context?
 123 ===============
 124
 125 additional idea: a VBLOCK context that says that if a given register is used, it indicates that the
 126 register is to be "swizzled", and the VBLOCK swizzle context contains the swizzling to be carried out.
 127
 128 mm_shuffle_ps?
 129 ==============
 130
 131 __m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
 132        _MM_SHUFFLE(hi3,hi2,lo1,lo0))
 133 Interleave inputs into low 2 floats and high 2 floats of output. Basically
 134    out[0]=lo[lo0];
 135    out[1]=lo[lo1];
 136    out[2]=hi[hi2];
 137    out[3]=hi[hi3];
 138
 139 For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
 140 a[i] into all 4 output floats.
 141
 142 Transpose
 143 =========
 144
 145 assuming a vector of 4x4 matrixes is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using):
 146 using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step to give 8 instructions in total:
 147
 148 input:
 149 | m00 m10 m20 m30 |
 150 | m01 m11 m21 m31 |
 151 | m02 m12 m22 m32 |
 152 | m03 m13 m23 m33 |
 153
 154 transpose 4 corner 2x2 matrices
 155
 156 intermediate:
 157 | m00 m01 m20 m21 |
 158 | m10 m11 m30 m31 |
 159 | m02 m03 m22 m23 |
 160 | m12 m13 m32 m33 |
 161
 162 finish transpose
 163
 164 output:
 165 | m00 m01 m02 m03 |
 166 | m10 m11 m12 m13 |
 167 | m20 m21 m22 m23 |
 168 | m30 m31 m32 m33 |
 169
 170 <http://web.archive.org/web/20100111104515/http://www.randombit.net:80/bitbashing/programming/integer_matrix_transpose_in_sse2.html>
 171
 172
 173 ::
 174
 175    __m128i T0 = _mm_unpacklo_epi32(I0, I1);
 176    __m128i T1 = _mm_unpacklo_epi32(I2, I3);
 177    __m128i T2 = _mm_unpackhi_epi32(I0, I1);
 178    __m128i T3 = _mm_unpackhi_epi32(I2, I3);
 179
 180    /* Assigning transposed values back into I[0-3] */
 181    I0 = _mm_unpacklo_epi64(T0, T1);
 182    I1 = _mm_unpackhi_epi64(T0, T1);
 183    I2 = _mm_unpacklo_epi64(T2, T3);
 184    I3 = _mm_unpackhi_epi64(T2, T3);
 185
 186 Transforms for DCT
 187 ==================
 188
 189 <https://opencores.org/websvn/filedetails?repname=mpeg2fpga&path=%2Fmpeg2fpga%2Ftrunk%2Frtl%2Fmpeg2%2Fidct.v>
 190
 191 Table to evaluate
 192 =================
 193
 194 swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector)
 195
 196 +-----------+-------+-------+-------+-------+-------+------+
 197 |           | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
 198 +===========+=======+=======+=======+=======+=======+======+
 199 | swizzle2  | rs3   | 00    | rs2   | rs1   | 000   | rd   |
 200 +-----------+-------+-------+-------+-------+-------+------+
 201 | fswizzle2 | rs3   | 01    | rs2   | rs1   | 000   | rd   |
 202 +-----------+-------+-------+-------+-------+-------+------+
 203 | swizzle   | 0     | 10    | rs2   | rs1   | 000   | rd   |
 204 +-----------+-------+-------+-------+-------+-------+------+
 205 | fswizzle  | 0     | 11    | rs2   | rs1   | 000   | rd   |
 206 +-----------+-------+-------+-------+-------+-------+------+
 207 | swizzlei  | imm                   | rs1   | 001   | rd   |
 208 +-----------+                       +-------+-------+------+
 209 | fswizzlei |                       | rs1   | 010   | rd   |
 210 +-----------+-------+-------+-------+-------+-------+------+
 211
 212 More:
 213
 214 swizzlei would still need the 12-bit format due to not having enough immediate bits. we can get away with only 3 i-type funct3s used for [f]swizzlei by having one funct3 for destsubvl 1 through 3 for int and fp versions and a separate one for destsubvl = 4 that's shared between int/fp:
 215
 216 +--------+-----------+----+-----------+----------+-------+-------+------+
 217 | int/fp | DESTSUBVL | 31 | 30:29     | 28:20    | 19:15 | 14:12 | 11:7 |
 218 +========+===========+====+===========+==========+=======+=======+======+
 219 | int    | 1 to 3    | 0  | DESTSUBVL | selector | rs    | 000   | rd   |
 220 +--------+-----------+----+-----------+----------+-------+-------+------+
 221 | fp     | 1 to 3    | 1  | DESTSUBVL | selector | rs    | 000   | rd   |
 222 +--------+-----------+----+-----------+----------+-------+-------+------+
 223 | int    | 4         | selector[11:0]            | rs    | 001   | rd   |
 224 +--------+-----------+---------------------------+-------+-------+------+
 225 | fp     | 4         | selector[11:0]            | rs    | 010   | rd   |
 226 +--------+-----------+---------------------------+-------+-------+------+
 227
 228 the rest could be encoded as follows:
 229
 230 +-----------+-------+-----------+-------+-------+-------+------+
 231 |           | 31:27 | 26:25     | 24:20 | 19:15 | 14:12 | 11:7 |
 232 +===========+=======+===========+=======+=======+=======+======+
 233 | swizzle2  | rs3   | DESTSUBVL | rs2   | rs1   | 100   | rd   |
 234 +-----------+-------+-----------+-------+-------+-------+------+
 235 | swizzle   | rs1   | DESTSUBVL | rs2   | rs1   | 100   | rd   |
 236 +-----------+-------+-----------+-------+-------+-------+------+
 237 | fswizzle2 | rs3   | DESTSUBVL | rs2   | rs1   | 101   | rd   |
 238 +-----------+-------+-----------+-------+-------+-------+------+
 239 | fswizzle  | rs1   | DESTSUBVL | rs2   | rs1   | 101   | rd   |
 240 +-----------+-------+-----------+-------+-------+-------+------+
 241
 242 note how for [f]swizzle, rs3 == rs1
 243
 244 so it uses 5 funct3 values overall, which is appropriate, since swizzle is probably right after muladd in usage in graphics shaders.
 245
 246 Matrix 4x4 Vector mul
 247 =====================
 248
 249 ::
 250
 251     pfscale,3 F2, F1, F10
 252     pfscaleadd,2 F2, F1, F11, F2
 253     pfscaleadd,1 F2, F1, F12, F2
 254     pfscaleadd,0 F2, F1, F13, F2
 255
 256 pfscale is a 4 vec mv.shuffle followed by a fmul. pfscaleadd is a 4 vec mv.shuffle followed by a fmac.
 257
 258 In effect what this is doing is:
 259
 260 ::
 261
 262     fmul f2, f1.xxxx, f10
 263     fmac f2, f1.yyyy, f11, f2
 264     fmac f2, f1.zzzz, f12, f2
 265     fmac f2, f1.wwww, f13, f2
 266
 267 Where all of f2, f1, and f10-13 are vec4, and f1.x-w are copied (fixed index) where the other vec4 indices progress.
 268
 269 Pseudocode
 270 ==========
 271
 272 Swizzle:
 273
 274 ::
 275
 276     pub trait SwizzleConstants: Copy + 'static {
 277         const CONSTANTS: &'static [Self; 4];
 278     }
 279
 280     impl SwizzleConstants for u8 {
 281         const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
 282     }
 283
 284     impl SwizzleConstants for u16 {
 285         const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
 286     }
 287
 288     impl SwizzleConstants for f32 {
 289         const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
 290     }
 291
 292     // impl for other types too...
 293
 294     pub fn swizzle<Elm, Selector>(
 295         rd: &mut [Elm],
 296         rs1: &[Elm],
 297         rs2: &[Selector],
 298         vl: usize,
 299         destsubvl: usize,
 300         srcsubvl: usize)
 301     where
 302         Elm: SwizzleConstants,
 303         // Selector is a copyable type that can be converted into u64
 304         Selector: Copy + Into<u64>,
 305     {
 306         const FIELD_SIZE: usize = 3;
 307         const FIELD_MASK: u64 = 0b111;
 308         for vindex in 0..vl {
 309             let selector = rs2[vindex].into();
 310             // selector's type is u64
 311             if selector >> (FIELD_SIZE * destsubvl) != 0 {
 312                 // handle illegal instruction trap
 313             }
 314             for i in 0..destsubvl {
 315                 let mut sel_field = selector >> (FIELD_SIZE * i);
 316                 sel_field &= FIELD_MASK;
 317                 let src = if (sel_field & 0b100) == 0 {
 318                     &rs1[(vindex * srcsubvl)..]
 319                 } else {
 320                     SwizzleConstants::CONSTANTS
 321                 };
 322                 sel_field &= 0b11;
 323                 if sel_field as usize >= srcsubvl {
 324                     // handle illegal instruction trap
 325                 }
 326                 let value = src[sel_field as usize];
 327                 rd[vindex * destsubvl + i] = value;
 328             }
 329         }
 330     }
 331
 332 Swizzle2:
 333
 334 ::
 335
 336     fn swizzle2<Elm, Selector>(
 337         rd: &mut [Elm],
 338         rs1: &[Elm],
 339         rs2: &[Selector],
 340         rs3: &[Elm],
 341         vl: usize,
 342         destsubvl: usize,
 343         srcsubvl: usize)
 344     where
 345         // Elm is a copyable type
 346         Elm: Copy,
 347         // Selector is a copyable type that can be converted into u64
 348         Selector: Copy + Into<u64>,
 349     {
 350         const FIELD_SIZE: usize = 3;
 351         const FIELD_MASK: u64 = 0b111;
 352         for vindex in 0..vl {
 353             let selector = rs2[vindex].into();
 354             // selector's type is u64
 355             if selector >> (FIELD_SIZE * destsubvl) != 0 {
 356                 // handle illegal instruction trap
 357             }
 358             for i in 0..destsubvl {
 359                 let mut sel_field = selector >> (FIELD_SIZE * i);
 360                 sel_field &= FIELD_MASK;
 361                 let src = if (sel_field & 0b100) != 0 {
 362                     rs1
 363                 } else {
 364                     rs3
 365                 };
 366                 sel_field &= 0b11;
 367                 if sel_field as usize >= srcsubvl {
 368                     // handle illegal instruction trap
 369                 }
 370                 let value = src[vindex * srcsubvl + (sel_field as usize)];
 371                 rd[vindex * destsubvl + i] = value;
 372             }
 373         }
 374     }
 375