[[!tag standards]]

MV.X and MV.swizzle
===================

Swizzle needs a MV instruction (there are 2 of them: swizzle and swizzle2).
See below for a potential way to use the funct7 field to indicate a swizzle in rs2.

+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| Encoding      | 31:27       | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   | imm[11:0]                      | rs1[4:0] | funct3 | rd[4:0]  | opcode | 0b11   |
+---------------+-------------+------------------+----------+--------+----------+--------+--------+
| RV32-I-type   | fn4[3:0]    | swizzle[7:0]     | rs1[4:0] | 0b000  | rd[4:0]  | OP-V   | 0b11   |
+---------------+-------------+------------------+----------+--------+----------+--------+--------+

* funct3 = MV: 0b000 for FP, 0b001 for INT
* OP-V = 0b1010111
* fn4 = 4 bit function.
* fn4 = 0b0000 - MV-SWIZZLE
* fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
* fn4 = 0bNN11 - MV-X.SUBVL NN=elwidth (default/8/16/32)

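The fn4 list above can be read as a small decoder. The following is a minimal sketch only: the function name, the returned tuple and the "reserved" label for encodings the list does not assign are all assumptions made for illustration, not part of the proposal.

```python
# Hypothetical helper: names and return layout are illustrative only.
ELWIDTH_NAMES = {0: "default", 1: "8", 2: "16", 3: "32"}


def decode_fn4(fn4):
    """Split a 4-bit fn4 value into (operation, elwidth name)."""
    if not 0 <= fn4 <= 0xF:
        raise ValueError("fn4 is a 4-bit field")
    if fn4 == 0b0000:
        return ("MV-SWIZZLE", None)
    low = fn4 & 0b11        # bottom two bits select the MV.X variant
    nn = (fn4 >> 2) & 0b11  # top two bits (NN) select the element width
    if low == 0b01:
        return ("MV-X", ELWIDTH_NAMES[nn])
    if low == 0b11:
        return ("MV-X.SUBVL", ELWIDTH_NAMES[nn])
    # ASSUMPTION: anything else is unassigned by the list above
    return ("reserved", None)
```

For instance, `decode_fn4(0b0101)` gives `("MV-X", "8")` under these assumptions.
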
swizzle (only active on SV or P48/P64 when SUBVL!=0):

+-----+-----+-----+-----+
| 7:6 | 5:4 | 3:2 | 1:0 |
+-----+-----+-----+-----+
|   w |   z |   y |   x |
+-----+-----+-----+-----+

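As an illustration of the field layout above, the sketch below splits the 8-bit swizzle immediate into its four 2-bit selectors. The function names are hypothetical, and the interpretation that each selector names the source element copied into the x, y, z or w destination lane is an assumption: the table only gives the bit positions.

```python
def decode_swizzle(swizzle):
    """Return the four 2-bit selectors in (x, y, z, w) order."""
    if not 0 <= swizzle <= 0xFF:
        raise ValueError("swizzle is an 8-bit field")
    # x is bits 1:0, y is bits 3:2, z is bits 5:4, w is bits 7:6
    return tuple((swizzle >> shift) & 0b11 for shift in (0, 2, 4, 6))


def apply_swizzle(subvector, swizzle):
    """Apply the selectors to a 4-element sub-vector (SUBVL=4 assumed)."""
    if len(subvector) != 4:
        raise ValueError("this sketch only models SUBVL=4")
    # ASSUMPTION: selector n picks the source element written to lane n
    return [subvector[sel] for sel in decode_swizzle(swizzle)]
```

Under that assumption `0b11100100` is the identity swizzle and `0b00011011` reverses the four elements.
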
MV.X has two modes. SUBVL mode applies the element offsets only within the SUBVL inner loop, so the same SUBVL index registers are reused for every outer iteration. This can be used for transposition.

::

  for i in range(VL):
     for j in range(SUBVL):
        # index register restarts at rs for every i
        regs[rd] = regs[rd+regs[rs+j]]

Normal mode will apply the element offsets incrementally, across the whole vector:

::

  k = 0
  for i in range(VL):
     for j in range(SUBVL):
        # index register keeps advancing: rs, rs+1, rs+2, ...
        regs[rd] = regs[rd+regs[rs+k]]
        k += 1

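The only difference between the two modes is which index register is read at each step. A minimal sketch of that difference follows; the function name and the `subvl_mode` flag are hypothetical, and it deliberately models only the index-register sequence, not the data movement.

```python
def index_register_offsets(vl, subvl, subvl_mode):
    """List the offset from rs of the index register read at each step."""
    offsets = []
    k = 0
    for i in range(vl):
        for j in range(subvl):
            # SUBVL mode restarts at rs for every i; normal mode keeps counting
            offsets.append(j if subvl_mode else k)
            k += 1
    return offsets
```

With VL=2 and SUBVL=2, SUBVL mode reads offsets 0, 1, 0, 1 while normal mode reads 0, 1, 2, 3.
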
Pseudocode for element width part of MV.X:

::

  def mv_x(rd, rs1, funct4):
      elwidth = (funct4>>2) & 0x3
      bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
      bytewidth = bitwidth // 8 # get bytes per el (integer division)
      for i in range(VL):
          addr = (unsigned char *)&regs[rs1]
          offset = addr + i * bytewidth # get offset within regfile as SRAM
          # TODO, actually, needs to respect rd and rs1 element width,
          # here, as well.  this pseudocode just illustrates that the
          # MV.X operation contains a way to compact the indices into
          # less space.
          regs[rd] = (unsigned char*)(regs)[offset]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized
registers, such that rather than doing this:

.. parsed-literal::

    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

The alternative is this:

.. parsed-literal::

    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

Thus compacting four indices into the one register.  x3 and x8's element
widths are *independent* of the MV.X elwidth, thus allowing both the source
and destination element widths of the *elements* to be moved to be over-ridden,
whilst *at the same time* allowing the *indices* to be compacted, as well.

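To make the compaction concrete, here is a minimal sketch of how packed indices could be pulled back out of a single register value. The function name is hypothetical, and placing element 0 in the least significant bits is an assumption, chosen because it reproduces the x8..x11 values of the uncompacted example.

```python
def unpack_indices(reg_value, elwidth_bits, count):
    """Extract `count` indices of `elwidth_bits` each from one register."""
    mask = (1 << elwidth_bits) - 1
    # ASSUMPTION: element 0 occupies the least significant bits
    return [(reg_value >> (n * elwidth_bits)) & mask for n in range(count)]


# 0x00020301 holds the same four indices as x8=1, x9=3, x10=2, x11=0
indices = unpack_indices(0x00020301, 8, 4)
```

The same helper with `elwidth_bits=16` would unpack two indices per 32 bits, which is the trade-off the elwidth field exposes.
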
----

potential MV.X?  register-version of MV-swizzle?

+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type | funct7        | rs2[4:0] | rs1[4:0] | funct3 | rd[4:0]  | opcode | 0b11   |
+-------------+---------------+----------+----------+--------+----------+--------+--------+
| RV32-R-type | 0b0000000     | rs2[4:0] | rs1[4:0] | 0b001  | rd[4:0]  | OP-V   | 0b11   |
+-------------+---------------+----------+----------+--------+----------+--------+--------+

* funct3 = MV.X
* OP-V = 0b1010111
* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?

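The funct7 patterns listed above could be decoded along the following lines. This is a sketch under assumptions: the function name and return tuple are hypothetical, and every pattern the list does not mention is treated as reserved, which the proposal itself does not state.

```python
# Hypothetical decoder for the proposed R-type funct7 patterns.
ELWIDTH_NAMES = {0: "default", 1: "8", 2: "16", 3: "32"}


def decode_funct7(funct7):
    """Return (operation, register file, elwidth name) for a 7-bit funct7."""
    if not 0 <= funct7 <= 0x7F:
        raise ValueError("funct7 is a 7-bit field")
    reserved = ("reserved", None, None)
    if funct7 >> 4:            # the top three bits are 000 in every pattern
        return reserved
    nn = (funct7 >> 2) & 0b11  # NN, the element width selector
    low = funct7 & 0b11
    regfile = "FP" if low & 0b10 else "INT"
    if low & 0b01:
        # the swizzle patterns are only listed with NN = 00
        return ("MV.swizzle", regfile, None) if nn == 0 else reserved
    return ("MV.X", regfile, ELWIDTH_NAMES[nn])
```
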
Question: do we need a swizzle MV.X as well?

macro-op fusion
===============

There is the potential for macro-op fusion of mv-swizzle with the following and/or preceding instruction.
<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

VBLOCK context?
===============

Additional idea: a VBLOCK context that says that if a given register is used, it indicates that the
register is to be "swizzled", and the VBLOCK swizzle context contains the swizzling to be carried out.

mm_shuffle_ps?
==============

::

  __m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
         _MM_SHUFFLE(hi3,hi2,lo1,lo0))

Interleave inputs into low 2 floats and high 2 floats of output. Basically:

::

  out[0]=lo[lo0];
  out[1]=lo[lo1];
  out[2]=hi[hi2];
  out[3]=hi[hi3];

For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
a[i] into all 4 output floats.

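For comparison purposes, the behaviour described above can be modelled in a few lines of Python. This is only a software model with lists standing in for `__m128` values; the Python function names are chosen here for illustration, while the bit packing follows the `_MM_SHUFFLE` macro.

```python
def mm_shuffle(hi3, hi2, lo1, lo0):
    """Pack four 2-bit selectors the way the _MM_SHUFFLE macro does."""
    return (hi3 << 6) | (hi2 << 4) | (lo1 << 2) | lo0


def mm_shuffle_ps(lo, hi, imm8):
    """Model of _mm_shuffle_ps on two 4-element lists."""
    return [
        lo[imm8 & 3],         # out[0] comes from the first operand
        lo[(imm8 >> 2) & 3],  # out[1] comes from the first operand
        hi[(imm8 >> 4) & 3],  # out[2] comes from the second operand
        hi[(imm8 >> 6) & 3],  # out[3] comes from the second operand
    ]
```

Note that the selector byte has the same four 2-bit fields as the swizzle immediate, but the two low lanes and the two high lanes read from different inputs.
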
Transpose
=========

Assuming a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using),
and using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, giving 8 instructions in total.

input:

::

  | m00 m10 m20 m30 |
  | m01 m11 m21 m31 |
  | m02 m12 m22 m32 |
  | m03 m13 m23 m33 |

transpose 4 corner 2x2 matrices

intermediate:

::

  | m00 m01 m20 m21 |
  | m10 m11 m30 m31 |
  | m02 m03 m22 m23 |
  | m12 m13 m32 m33 |

finish transpose

output:

::

  | m00 m01 m02 m03 |
  | m10 m11 m12 m13 |
  | m20 m21 m22 m23 |
  | m30 m31 m32 m33 |

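The two steps can be checked with a small data-movement model. The sketch below works on one 4x4 matrix held as a list of rows and uses hypothetical helper names; it models only where each element ends up, not the 8 swizzle instructions themselves.

```python
def transpose_2x2_blocks(m):
    """Step 1: transpose each of the four 2x2 corner blocks in place."""
    out = [row[:] for row in m]
    for r in (0, 2):
        for c in (0, 2):
            # swap the two off-diagonal elements of the block at (r, c)
            out[r][c + 1], out[r + 1][c] = m[r + 1][c], m[r][c + 1]
    return out


def swap_off_diagonal_blocks(m):
    """Step 2: exchange the top-right and bottom-left 2x2 blocks."""
    out = [row[:] for row in m]
    for r in range(2):
        for c in range(2):
            out[r][c + 2], out[r + 2][c] = m[r + 2][c], m[r][c + 2]
    return out


# Same element naming as the matrices above: row r holds m0r m1r m2r m3r
matrix = [[f"m{c}{r}" for c in range(4)] for r in range(4)]
intermediate = transpose_2x2_blocks(matrix)
result = swap_off_diagonal_blocks(intermediate)
```
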
<http://web.archive.org/web/20100111104515/http://www.randombit.net:80/bitbashing/programming/integer_matrix_transpose_in_sse2.html>


::

   __m128i T0 = _mm_unpacklo_epi32(I0, I1);
   __m128i T1 = _mm_unpacklo_epi32(I2, I3);
   __m128i T2 = _mm_unpackhi_epi32(I0, I1);
   __m128i T3 = _mm_unpackhi_epi32(I2, I3);

   /* Assigning transposed values back into I[0-3] */
   I0 = _mm_unpacklo_epi64(T0, T1);
   I1 = _mm_unpackhi_epi64(T0, T1);
   I2 = _mm_unpacklo_epi64(T2, T3);
   I3 = _mm_unpackhi_epi64(T2, T3);

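The SSE2 sequence above can be traced with a list-based model of the four unpack intrinsics, which may help when comparing it against a swizzle-based version. The Python names are chosen here for illustration; each 128-bit register is modelled as a list of four 32-bit lanes, lane 0 first.

```python
def unpacklo_epi32(a, b):
    return [a[0], b[0], a[1], b[1]]  # interleave the low two 32-bit lanes


def unpackhi_epi32(a, b):
    return [a[2], b[2], a[3], b[3]]  # interleave the high two 32-bit lanes


def unpacklo_epi64(a, b):
    return [a[0], a[1], b[0], b[1]]  # low 64 bits of a, then low 64 bits of b


def unpackhi_epi64(a, b):
    return [a[2], a[3], b[2], b[3]]  # high 64 bits of a, then high 64 bits of b


def transpose_4x4(i0, i1, i2, i3):
    """Same eight operations as the SSE2 listing, on lists of four lanes."""
    t0 = unpacklo_epi32(i0, i1)
    t1 = unpacklo_epi32(i2, i3)
    t2 = unpackhi_epi32(i0, i1)
    t3 = unpackhi_epi32(i2, i3)
    return [
        unpacklo_epi64(t0, t1),
        unpackhi_epi64(t0, t1),
        unpacklo_epi64(t2, t3),
        unpackhi_epi64(t2, t3),
    ]
```
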
Transforms for DCT
==================

<https://opencores.org/websvn/filedetails?repname=mpeg2fpga&path=%2Fmpeg2fpga%2Ftrunk%2Frtl%2Fmpeg2%2Fidct.v>

Table to evaluate
=================

swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector)

+-----------+-------+-------+-------+-------+-------+------+
|           | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+=======+=======+=======+=======+======+
| swizzle2  | rs3   | 00    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle2 | rs3   | 01    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzle   | 0     | 10    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle  | 0     | 11    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzlei  | imm                   | rs1   | 001   | rd   |
+-----------+                       +-------+-------+------+
| fswizzlei |                       | rs1   | 010   | rd   |
+-----------+-------+-------+-------+-------+-------+------+