MV.X and MV.swizzle
===================

Swizzle needs a MV instruction.  See the R-type encoding further below for a potential way to use funct7 to indicate that rs2 holds a swizzle.

+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| Encoding      | 31:27       | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   | imm[11:0]                      | rs1[4:0] | funct3 | rd[4:0]  | opcode | 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   | fn4[3:0]    | swizzle[7:0]     | rs1[4:0] | 0b000  | rd[4:0]  | OP-V   | 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV (0b000)
* OP-V = 0b1010111 (the full 7-bit opcode, bits 6:0, including the trailing 0b11)
* fn4 = 4-bit function field
* fn4 = 0b0000 - INT MV-SWIZZLE ?
* fn4 = 0b0001 - FP MV-SWIZZLE ?
* fn4 = 0bNN10 - INT MV-X, NN=elwidth (default/8/16/32)
* fn4 = 0bNN11 - FP MV-X, NN=elwidth (default/8/16/32)
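
The fn4 assignments listed above can be sketched as a small decoder. This is only an illustration of the tentative encoding: the function name decode_fn4 and the returned tuple layout are hypothetical, and values that the list does not assign are treated as errors.

```python
def decode_fn4(fn4):
    """Decode the tentative 4-bit fn4 field.

    Returns (operation, is_fp, elwidth), where elwidth is "default",
    8, 16 or 32 for MV.X and None for MV.swizzle.
    """
    elwidths = {0: "default", 1: 8, 2: 16, 3: 32}
    is_fp = bool(fn4 & 0b1)              # bit 0: INT (0) or FP (1)
    if fn4 & 0b10:                       # 0bNN10 / 0bNN11: MV.X
        return ("MV.X", is_fp, elwidths[(fn4 >> 2) & 0b11])
    if fn4 in (0b0000, 0b0001):          # MV-SWIZZLE
        return ("MV.swizzle", is_fp, None)
    raise ValueError("fn4 value not assigned in the list above")
```

For instance, decode_fn4(0b0110) describes an INT MV.X with 8-bit elwidth.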

Swizzle field (only active on SV or P48/P64 when SUBVL != 0):

+-----+-----+-----+-----+
| 7:6 | 5:4 | 3:2 | 1:0 |
+-----+-----+-----+-----+
|   w |   z |   y |   x |
+-----+-----+-----+-----+
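
A minimal sketch of how such an 8-bit swizzle field could be applied to one 4-element sub-vector. It assumes (this is not stated above) that each 2-bit field selects which source element feeds that output lane, with 0 = x, 1 = y, 2 = z, 3 = w. The name apply_swizzle is hypothetical.

```python
def apply_swizzle(src, swizzle):
    """Permute a 4-element sub-vector using an 8-bit swizzle field.

    Assumed layout: output lane x reads its selector from bits 1:0,
    y from bits 3:2, z from bits 5:4 and w from bits 7:6.
    """
    return [src[(swizzle >> (2 * lane)) & 0b11] for lane in range(4)]
```

Under that assumption 0b11100100 is the identity swizzle, 0b00011011 reverses the sub-vector, and 0b00000000 broadcasts x to all four lanes.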

Pseudocode for the element-width part of MV.X:

::

  def mv_x(rd, rs1, funct4):
      elwidth = (funct4 >> 2) & 0x3
      bitwidth = {0: XLEN, 1: 8, 2: 16, 3: 32}[elwidth]  # get bits per el
      bytewidth = bitwidth // 8                          # get bytes per el
      regbytes = XLEN // 8                               # bytes per register
      for i in range(VL):
          # regfile_bytes is the regfile viewed as byte-addressed SRAM
          # (in C terms: (unsigned char *)regs)
          offset = rs1 * regbytes + i * bytewidth  # byte offset of el i
          # TODO, actually, needs to respect rd and rs1 element width,
          # here, as well.  this pseudocode just illustrates that the
          # MV.X operation contains a way to compact the indices into
          # less space.
          regs[rd + i] = regfile_bytes[offset]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized
registers, so that rather than doing this:

.. parsed-literal::

    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

the alternative is this:

.. parsed-literal::

    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

This compacts four indices into one register.  The element widths of x3
and x8 are *independent* of the MV.X elwidth, thus allowing both the source
and destination element widths of the *elements* to be moved to be over-ridden,
whilst *at the same time* allowing the *indices* to be compacted as well.
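
The compaction can be checked with a short sketch that unpacks the immediate from the second example back into its indices. It assumes that index 0 sits in the least-significant byte of the register; the helper name unpack_indices is hypothetical.

```python
def unpack_indices(value, elwidth_bits, vl):
    """Split a register value into vl indices of elwidth_bits each.

    Assumes index 0 occupies the least-significant bits.
    """
    mask = (1 << elwidth_bits) - 1
    return [(value >> (i * elwidth_bits)) & mask for i in range(vl)]


print(unpack_indices(0x00020301, 8, 4))  # [1, 3, 2, 0]
```

The result matches the four separate ldimm values (1, 3, 2, 0) of the first example.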

----

Potential MV.X?  A register version of MV-swizzle?

+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type | funct7        | rs2[4:0] | rs1[4:0] | funct3 | rd[4:0]  | opcode | 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type | 0b0000000     | rs2[4:0] | rs1[4:0] | 0b001  | rd[4:0]  | OP-V   | 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV.X (0b001)
* OP-V = 0b1010111
* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?
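
Read bit by bit, the funct7 patterns above put the swizzle flag in bit 0, the INT/FP flag in bit 1 and the elwidth in bits 3:2. The sketch below spells that reading out; decode_funct7 is a hypothetical name and any pattern not listed above is rejected.

```python
def decode_funct7(funct7):
    """Decode the tentative R-type funct7 field into (op, is_fp, elwidth)."""
    elwidths = {0: "default", 1: 8, 2: 16, 3: 32}
    if funct7 >> 4:                       # bits 6:4 are 0b000 in every pattern
        raise ValueError("funct7 value not assigned in the list above")
    is_fp = bool(funct7 & 0b10)           # bit 1: INT (0) or FP (1)
    if funct7 & 0b1:                      # bit 0: rs2 is a swizzle argument
        if funct7 not in (0b0000001, 0b0000011):
            raise ValueError("funct7 value not assigned in the list above")
        return ("MV.swizzle", is_fp, None)
    return ("MV.X", is_fp, elwidths[(funct7 >> 2) & 0b11])
```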

Question: do we need a swizzle MV.X as well?

Macro-op fusion
===============

There is the potential for macro-op fusion of mv-swizzle with the following instruction and/or the preceding instruction.
<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

VBLOCK context?
===============

Additional idea: a VBLOCK context which says that, if a given register is used, that
register is to be "swizzled".  The VBLOCK swizzle context contains the swizzling to be carried out.

mm_shuffle_ps?
==============

::

  __m128 _mm_shuffle_ps(__m128 lo, __m128 hi,
                        _MM_SHUFFLE(hi3, hi2, lo1, lo0))

Interleave the inputs into the low 2 floats and the high 2 floats of the output. Basically::

  out[0] = lo[lo0];
  out[1] = lo[lo1];
  out[2] = hi[hi2];
  out[3] = hi[hi3];

For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
a[i] into all 4 output floats.
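
The same selection written as a small Python model, purely to make the lane mapping easy to experiment with. It models only the four assignments above, using plain lists in place of __m128; the function name mm_shuffle_ps is a stand-in, not the intrinsic itself.

```python
def mm_shuffle_ps(lo, hi, hi3, hi2, lo1, lo0):
    """Model of the lane selection: low half from lo, high half from hi."""
    return [lo[lo0], lo[lo1], hi[hi2], hi[hi3]]


a = [10.0, 11.0, 12.0, 13.0]
print(mm_shuffle_ps(a, a, 2, 2, 2, 2))  # [12.0, 12.0, 12.0, 12.0]
```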

Transpose
=========

Assume a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using).
Using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, giving 8 instructions in total.

Input::

  | m00 m10 m20 m30 |
  | m01 m11 m21 m31 |
  | m02 m12 m22 m32 |
  | m03 m13 m23 m33 |

Step 1: transpose the 4 corner 2x2 matrices.

Intermediate::

  | m00 m01 m20 m21 |
  | m10 m11 m30 m31 |
  | m02 m03 m22 m23 |
  | m12 m13 m32 m33 |

Step 2: finish the transpose.

Output::

  | m00 m01 m02 m03 |
  | m10 m11 m12 m13 |
  | m20 m21 m22 m23 |
  | m30 m31 m32 m33 |
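
The two steps can be sketched for a single matrix as follows. The helper swizzle2 is a hypothetical stand-in for a (4+4) -> 4 swizzle instruction: it picks 4 elements out of the 8-element concatenation of its two inputs, with selector values 0-3 addressing the first input and 4-7 the second. A vectorized version would apply the same 8 swizzles across all VL matrices.

```python
def swizzle2(a, b, sel):
    """(4+4) -> 4 swizzle: select 4 elements from the concatenation a + b."""
    src = a + b
    return [src[s] for s in sel]


def transpose4x4(r0, r1, r2, r3):
    # Step 1: transpose the four corner 2x2 matrices (4 swizzles).
    t0 = swizzle2(r0, r1, (0, 4, 2, 6))
    t1 = swizzle2(r0, r1, (1, 5, 3, 7))
    t2 = swizzle2(r2, r3, (0, 4, 2, 6))
    t3 = swizzle2(r2, r3, (1, 5, 3, 7))
    # Step 2: swap the off-diagonal 2x2 blocks to finish (4 swizzles).
    o0 = swizzle2(t0, t2, (0, 1, 4, 5))
    o1 = swizzle2(t1, t3, (0, 1, 4, 5))
    o2 = swizzle2(t0, t2, (2, 3, 6, 7))
    o3 = swizzle2(t1, t3, (2, 3, 6, 7))
    return [o0, o1, o2, o3]
```

Feeding in the four input rows shown above yields the four output rows, with t0..t3 equal to the intermediate rows.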