+[[!tag oldstandards]]
+
MV.X and MV.swizzle
===================
-swizzle needs a MV. see below for a potential way to use the funct7 to do a swizzle in rs2.
+swizzle needs a MV (there are 2 of them: swizzle and swizzle2).
+see below for a potential way to use the funct7 to do a swizzle in rs2.
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| Encoding | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0 |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type + imm[11:0] + rs1[4:0] + funct3 | rd[4:0] + opcode + 0b11 |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
-| RV32-I-type + fn4[11:8] swizzle[7:0] + rs1[4:0] + 0b000 | rd[4:0] + OP-V + 0b11 |
+| RV32-I-type + fn4[3:0] + swizzle[7:0] + rs1[4:0] + 0b000 | rd[4:0] + OP-V + 0b11 |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
-* funct3 = MV
+* funct3 = MV: 0b000 for FP, 0b001 for INT
* OP-V = 0b1010111
* fn4 = 4 bit function.
-* fn4 = 0b0000 - INT MV-SWIZZLE ?
-* fn4 = 0b0001 - FP MV-SWIZZLE ?
+* fn4 = 0b0000 - MV-SWIZZLE
+* fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
+* fn4 = 0bNN11 - MV-X.SUBVL NN=elwidth (default/8/16/32)
swizzle (only active on SV or P48/P64 when SUBVL!=0):
-+-----+---+
-| 1:0 | x |
-+-----+---+
-| 3:2 | y |
-+-----+---+
-| 5:4 | z |
-+-----+---+
-| 7:6 | w |
-+-----+---+
++-----+-----+-----+-----+
+| 7:6 | 5:4 | 3:2 | 1:0 |
++-----+-----+-----+-----+
+| w | z | y | x |
++-----+-----+-----+-----+
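
As a rough illustration of the field layout above, the 8-bit swizzle immediate can be unpacked into four 2-bit selectors. This is only a sketch: the names `decode_swizzle` and `apply_swizzle`, and the convention that selector values 0..3 pick source elements x, y, z, w, are assumptions made for the example and are not part of the encoding proposal.

```python
def decode_swizzle(swizzle):
    """Split an 8-bit swizzle immediate into (x, y, z, w) 2-bit selectors."""
    assert 0 <= swizzle <= 0xFF
    x = (swizzle >> 0) & 0b11  # bits 1:0
    y = (swizzle >> 2) & 0b11  # bits 3:2
    z = (swizzle >> 4) & 0b11  # bits 5:4
    w = (swizzle >> 6) & 0b11  # bits 7:6
    return x, y, z, w


def apply_swizzle(src, swizzle):
    """Assumed semantics: destination element i copies src[selector i]."""
    return [src[sel] for sel in decode_swizzle(swizzle)]
```

Under these assumptions, `0b11100100` is the identity swizzle and `0b00011011` reverses a 4-element sub-vector.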
+
+MV.X has two modes: SUBVL mode applies the element offsets only within a SUBVL inner loop. This can be used for transposition.
+
+::
+
+    for i in range(VL):
+        for j in range(SUBVL):
+            regs[rd] = regs[rd+regs[rs+j]]
+
+Normal mode will apply the element offsets incrementally:
+
+::
+
+    k = 0
+    for i in range(VL):
+        for j in range(SUBVL):
+            regs[rd] = regs[rd+regs[rs+k]]
+            k += 1
+
+
+Pseudocode for element width part of MV.X:
+
+::
+
+    def mv_x(rd, rs1, funct4):
+        elwidth = (funct4>>2) & 0x3
+        bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
+        bytewidth = bitwidth // 8 # get bytes per el
+        for i in range(VL):
+            addr = (unsigned char *)&regs[rs1]
+            offset = addr + i * bytewidth # offset of index i within regfile as SRAM
+            # TODO, actually, needs to respect rd and rs1 element width,
+            # here, as well. this pseudocode just illustrates that the
+            # MV.X operation contains a way to compact the indices into
+            # less space.
+            regs[rd] = (unsigned char*)(regs)[offset]
+
+The idea here is to allow 8-bit indices to be stored inside XLEN-sized
+registers, such that rather than doing this:
+
+.. parsed-literal::
+ ldimm x8, 1
+ ldimm x9, 3
+ ldimm x10, 2
+ ldimm x11, 0
+ {SVP.VL=4} MV.X x3, x8, elwidth=default
+
+The alternative is this:
+
+.. parsed-literal::
+ ldimm x8, 0x00020301
+ {SVP.VL=4} MV.X x3, x8, elwidth=8
+
+This compacts four indices into one register. The element widths of x3 and
+x8 are *independent* of the MV.X elwidth, so the source and destination
+element widths of the *elements* being moved can still be overridden,
+whilst *at the same time* the *indices* are compacted.
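
The packing in the example can be checked with a short sketch. It assumes that index 0 sits in the lowest-order bits of the register (which is what makes `0x00020301` correspond to the indices 1, 3, 2, 0); the helper name `unpack_indices` is hypothetical.

```python
def unpack_indices(value, elwidth_bits, vl):
    """Extract vl indices of elwidth_bits each, lowest-order bits first.

    Assumption: element 0 occupies the least significant bits.
    """
    mask = (1 << elwidth_bits) - 1
    return [(value >> (i * elwidth_bits)) & mask for i in range(vl)]


def gather(src, indices):
    """Indexed move over a plain list, standing in for the register file."""
    return [src[idx] for idx in indices]
```

With `elwidth=8` and `VL=4`, the single register value `0x00020301` yields the same four indices that the first listing loads into x8 to x11.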
----
potential MV.X? register-version of MV-swizzle?
-+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
-| Encoding | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0 |
-+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
-| RV32-R-type + funct7 + rs2[4:0] + rs1[4:0] + funct3 | rd[4:0] + opcode + 0b11 |
-+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
-| RV32-R-type + 0b0000000 + rs2[4:0] + rs1[4:0] + 0b001 | rd[4:0] + OP-V + 0b11 |
-+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
++-------------+-------+-------+----------+----------+--------+----------+--------+--------+
+| Encoding | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:2 | 1:0 |
++-------------+-------+-------+----------+----------+--------+----------+--------+--------+
+| RV32-R-type + funct7 + rs2[4:0] + rs1[4:0] + funct3 | rd[4:0] + opcode + 0b11 |
++-------------+-------+-------+----------+----------+--------+----------+--------+--------+
+| RV32-R-type + 0b0000000 + rs2[4:0] + rs1[4:0] + 0b001 | rd[4:0] + OP-V + 0b11 |
++-------------+-------+-------+----------+----------+--------+----------+--------+--------+
* funct3 = MV.X
* OP-V = 0b1010111
-* funct7 = 0b0000000 - INT MV.X
-* funct7 = 0b0000001 - FP MV.X
-* funct7 = 0b0000010 - INT MV.swizzle to say that rs2 is a swizzle argument?
+* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
+* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
+* funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?
question: do we need a swizzle MV.X as well?
+MV.X with 3 operands
+====================
+
+regs[rd] = regs[rs1 + regs[rs2]]
+
+Similar to LD/ST with the same twin predication rules
+
macro-op fusion
===============
additional idea: a VBLOCK context that says that if a given register is used, it indicates that the
register is to be "swizzled", and the VBLOCK swizzle context contains the swizzling to be carried out.
+
+mm_shuffle_ps?
+==============
+
+__m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
+ _MM_SHUFFLE(hi3,hi2,lo1,lo0))
+Interleave inputs into low 2 floats and high 2 floats of output. Basically
+ out[0]=lo[lo0];
+ out[1]=lo[lo1];
+ out[2]=hi[hi2];
+ out[3]=hi[hi3];
+
+For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
+a[i] into all 4 output floats.
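
The selection rule above can be modelled on plain 4-element lists. This is a behavioural sketch only: `mm_shuffle` and `shuffle_ps` are hypothetical Python names standing in for the `_MM_SHUFFLE` macro and the `_mm_shuffle_ps` intrinsic.

```python
def mm_shuffle(hi3, hi2, lo1, lo0):
    """Pack four 2-bit selectors into one byte, lo0 in bits 1:0."""
    return (hi3 << 6) | (hi2 << 4) | (lo1 << 2) | lo0


def shuffle_ps(lo, hi, imm8):
    """Low two outputs come from lo, high two outputs come from hi."""
    return [
        lo[(imm8 >> 0) & 3],
        lo[(imm8 >> 2) & 3],
        hi[(imm8 >> 4) & 3],
        hi[(imm8 >> 6) & 3],
    ]
```

Passing the same vector for both inputs with all four selectors equal reproduces the broadcast described above.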
+
+Transpose
+=========
+
+Assuming a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using), a vectorized matrix transpose can be done using standard (4+4) -> 4 swizzle instructions, each taking 2 input vectors with subvl=4 and producing 1 output vector with subvl=4.
+The operation takes 2 steps with 4 instructions per step, giving 8 instructions in total:
+
+input:
+| m00 m10 m20 m30 |
+| m01 m11 m21 m31 |
+| m02 m12 m22 m32 |
+| m03 m13 m23 m33 |
+
+transpose 4 corner 2x2 matrices
+
+intermediate:
+| m00 m01 m20 m21 |
+| m10 m11 m30 m31 |
+| m02 m03 m22 m23 |
+| m12 m13 m32 m33 |
+
+finish transpose
+
+output:
+| m00 m01 m02 m03 |
+| m10 m11 m12 m13 |
+| m20 m21 m22 m23 |
+| m30 m31 m32 m33 |
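
The two steps can be checked on a single 4x4 matrix held as a list of rows. This sketch models only the data movement, not the swizzle instructions themselves; the function names are hypothetical.

```python
def transpose_2x2_blocks(m):
    """Step 1: transpose each of the four 2x2 corner blocks."""
    out = [row[:] for row in m]
    for r in (0, 2):
        for c in (0, 2):
            out[r][c + 1], out[r + 1][c] = m[r + 1][c], m[r][c + 1]
    return out


def swap_off_diagonal_blocks(m):
    """Step 2: exchange the top-right and bottom-left 2x2 blocks."""
    out = [row[:] for row in m]
    for r in range(2):
        for c in range(2):
            out[r][c + 2], out[r + 2][c] = m[r + 2][c], m[r][c + 2]
    return out


def transpose_4x4(m):
    return swap_off_diagonal_blocks(transpose_2x2_blocks(m))
```

Each output row in either step draws on exactly two input rows, which is why a (4+4) -> 4 swizzle per row is sufficient.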
+
+<http://web.archive.org/web/20100111104515/http://www.randombit.net:80/bitbashing/programming/integer_matrix_transpose_in_sse2.html>
+
+
+::
+
+ __m128i T0 = _mm_unpacklo_epi32(I0, I1);
+ __m128i T1 = _mm_unpacklo_epi32(I2, I3);
+ __m128i T2 = _mm_unpackhi_epi32(I0, I1);
+ __m128i T3 = _mm_unpackhi_epi32(I2, I3);
+
+ /* Assigning transposed values back into I[0-3] */
+ I0 = _mm_unpacklo_epi64(T0, T1);
+ I1 = _mm_unpackhi_epi64(T0, T1);
+ I2 = _mm_unpacklo_epi64(T2, T3);
+ I3 = _mm_unpackhi_epi64(T2, T3);
+
+Transforms for DCT
+==================
+
+<https://opencores.org/websvn/filedetails?repname=mpeg2fpga&path=%2Fmpeg2fpga%2Ftrunk%2Frtl%2Fmpeg2%2Fidct.v>
+
+Table to evaluate
+=================
+
+swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector)
+
++-----------+-------+-------+-------+-------+-------+------+
+| | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
++===========+=======+=======+=======+=======+=======+======+
+| swizzle2 | rs3 | 00 | rs2 | rs1 | 000 | rd |
++-----------+-------+-------+-------+-------+-------+------+
+| fswizzle2 | rs3 | 01 | rs2 | rs1 | 000 | rd |
++-----------+-------+-------+-------+-------+-------+------+
+| swizzle | 0 | 10 | rs2 | rs1 | 000 | rd |
++-----------+-------+-------+-------+-------+-------+------+
+| fswizzle | 0 | 11 | rs2 | rs1 | 000 | rd |
++-----------+-------+-------+-------+-------+-------+------+
+| swizzlei | imm | rs1 | 001 | rd |
++-----------+ +-------+-------+------+
+| fswizzlei | | rs1 | 010 | rd |
++-----------+-------+-------+-------+-------+-------+------+
+
+More:
+
+swizzlei would still need the 12-bit format due to not having enough immediate bits. We can get away with only 3 I-type funct3s for [f]swizzlei by having one funct3 for destsubvl 1 through 3, shared between the int and fp versions (which are distinguished by bit 31), and separate int and fp funct3s for destsubvl = 4:
+
++--------+-----------+----+-----------+----------+-------+-------+------+
+| int/fp | DESTSUBVL | 31 | 30:29 | 28:20 | 19:15 | 14:12 | 11:7 |
++========+===========+====+===========+==========+=======+=======+======+
+| int | 1 to 3 | 0 | DESTSUBVL | selector | rs | 000 | rd |
++--------+-----------+----+-----------+----------+-------+-------+------+
+| fp | 1 to 3 | 1 | DESTSUBVL | selector | rs | 000 | rd |
++--------+-----------+----+-----------+----------+-------+-------+------+
+| int | 4 | selector[11:0] | rs | 001 | rd |
++--------+-----------+---------------------------+-------+-------+------+
+| fp | 4 | selector[11:0] | rs | 010 | rd |
++--------+-----------+---------------------------+-------+-------+------+
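
A sketch of how the 12-bit immediate in the table above could be assembled. It assumes 3-bit selector fields packed lowest field first (as in the Swizzle pseudocode on this page) and that the 2-bit DESTSUBVL field holds the values 1 to 3 directly; the latter is an assumption, and the function names are hypothetical.

```python
FIELD_SIZE = 3  # bits per selector field


def pack_selector(fields):
    """Pack selector fields, field i at bits [3*i+2 : 3*i]."""
    selector = 0
    for i, field in enumerate(fields):
        assert 0 <= field < (1 << FIELD_SIZE)
        selector |= field << (FIELD_SIZE * i)
    return selector


def encode_swizzlei(fields, fp):
    """Return (imm12, funct3) following the table above."""
    destsubvl = len(fields)
    selector = pack_selector(fields)
    if destsubvl == 4:
        # all 12 immediate bits hold the selector; funct3 picks int/fp
        return selector, (0b010 if fp else 0b001)
    assert 1 <= destsubvl <= 3
    # imm[11] = int/fp, imm[10:9] = DESTSUBVL (assumed direct), imm[8:0] = selector
    imm12 = (int(fp) << 11) | (destsubvl << 9) | selector
    return imm12, 0b000
```

Three 3-bit fields fill the 9-bit selector exactly, and four fields fill all 12 bits, which is why destsubvl = 4 cannot share the first format.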
+
+the rest could be encoded as follows:
+
++-----------+-------+-----------+-------+-------+-------+------+
+| | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
++===========+=======+===========+=======+=======+=======+======+
+| swizzle2 | rs3 | DESTSUBVL | rs2 | rs1 | 100 | rd |
++-----------+-------+-----------+-------+-------+-------+------+
+| swizzle | rs1 | DESTSUBVL | rs2 | rs1 | 100 | rd |
++-----------+-------+-----------+-------+-------+-------+------+
+| fswizzle2 | rs3 | DESTSUBVL | rs2 | rs1 | 101 | rd |
++-----------+-------+-----------+-------+-------+-------+------+
+| fswizzle | rs1 | DESTSUBVL | rs2 | rs1 | 101 | rd |
++-----------+-------+-----------+-------+-------+-------+------+
+
+note how for [f]swizzle, rs3 == rs1
+
+so it uses 5 funct3 values overall, which is appropriate, since swizzle is probably right after muladd in usage in graphics shaders.
+
+Alternative immed encoding
+
++--------+-----------+----------+-------+-------+------+
+| int/fp | 31:28 | 27:20 | 19:15 | 14:12 | 11:7 |
++========+===========+==========+=======+=======+======+
+| int | DESTMASK | selector | rs | 000 | rd |
++--------+-----------+----------+-------+-------+------+
+| fp | DESTMASK | selector | rs | 001 | rd |
++--------+-----------+----------+-------+-------+------+
+| int | DESTMASK | constsel | rs | 010 | rd |
++--------+-----------+----------+-------+-------+------+
+| fp | DESTMASK | constsel | rs | 011 | rd |
++--------+-----------+----------+-------+-------+------+
+
+Allows setting of arbitrary dest (xz, yw) without needing register-versions. Saves on instruction count.
+Needs 4 funct3 to express.
+
+Matrix 4x4 Vector mul
+=====================
+
+::
+
+ pfscale,3 F2, F1, F10
+ pfscaleadd,2 F2, F1, F11, F2
+ pfscaleadd,1 F2, F1, F12, F2
+ pfscaleadd,0 F2, F1, F13, F2
+
+pfscale is a 4 vec mv.shuffle followed by a fmul. pfscaleadd is a 4 vec mv.shuffle followed by a fmac.
+
+In effect what this is doing is:
+
+::
+
+ fmul f2, f1.xxxx, f10
+ fmac f2, f1.yyyy, f11, f2
+ fmac f2, f1.zzzz, f12, f2
+ fmac f2, f1.wwww, f13, f2
+
+Here f2, f1, and f10-13 are all vec4. Each of f1.x through f1.w is broadcast (fixed index) across the four lanes, whilst the indices of the other vec4 operands progress as normal.
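
The fmul/fmac sequence can be modelled on plain lists as follows. This is a sketch that assumes f10 to f13 hold the four vec4 operands that the broadcast elements of f1 are scaled against; the helper names are hypothetical.

```python
def broadcast(vec, index):
    """Model of f1.xxxx / .yyyy / .zzzz / .wwww: one element in every lane."""
    return [vec[index]] * len(vec)


def fmul(a, b):
    return [x * y for x, y in zip(a, b)]


def fmac(a, b, acc):
    return [x * y + z for x, y, z in zip(a, b, acc)]


def mat4_vec_mul(f1, f10_to_f13):
    """f2 = f1.x*f10 + f1.y*f11 + f1.z*f12 + f1.w*f13."""
    f2 = fmul(broadcast(f1, 0), f10_to_f13[0])
    for k in range(1, 4):
        f2 = fmac(broadcast(f1, k), f10_to_f13[k], f2)
    return f2
```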
+
+Pseudocode
+==========
+
+Swizzle:
+
+::
+
+    pub trait SwizzleConstants: Copy + 'static {
+        const CONSTANTS: &'static [Self; 4];
+    }
+
+    impl SwizzleConstants for u8 {
+        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
+    }
+
+    impl SwizzleConstants for u16 {
+        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
+    }
+
+    impl SwizzleConstants for f32 {
+        const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
+    }
+
+    // impl for other types too...
+
+    pub fn swizzle<Elm, Selector>(
+        rd: &mut [Elm],
+        rs1: &[Elm],
+        rs2: &[Selector],
+        vl: usize,
+        destsubvl: usize,
+        srcsubvl: usize)
+    where
+        Elm: SwizzleConstants,
+        // Selector is a copyable type that can be converted into u64
+        Selector: Copy + Into<u64>,
+    {
+        const FIELD_SIZE: usize = 3;
+        const FIELD_MASK: u64 = 0b111;
+        for vindex in 0..vl {
+            let selector: u64 = rs2[vindex].into();
+            // any bits set above the last used field are illegal
+            if selector >> (FIELD_SIZE * destsubvl) != 0 {
+                // handle illegal instruction trap
+            }
+            for i in 0..destsubvl {
+                let mut sel_field = selector >> (FIELD_SIZE * i);
+                sel_field &= FIELD_MASK;
+                // bit 2 of the field selects between rs1 and the constants
+                let src: &[Elm] = if (sel_field & 0b100) == 0 {
+                    &rs1[(vindex * srcsubvl)..]
+                } else {
+                    &Elm::CONSTANTS[..]
+                };
+                sel_field &= 0b11;
+                if sel_field as usize >= srcsubvl {
+                    // handle illegal instruction trap
+                }
+                let value = src[sel_field as usize];
+                rd[vindex * destsubvl + i] = value;
+            }
+        }
+    }
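
For quick experimentation, the same behaviour can be modelled in Python. This is a loose port of the Swizzle pseudocode above and not a reference implementation: it hard-codes the f32 constants, returns the destination as a new list instead of writing into `rd`, and raises `ValueError` where the pseudocode would take an illegal instruction trap.

```python
FIELD_SIZE = 3
FIELD_MASK = 0b111
F32_CONSTANTS = [0.0, 1.0, -1.0, 0.5]  # the f32 CONSTANTS from the pseudocode


def swizzle(rs1, rs2, vl, destsubvl, srcsubvl):
    rd = []
    for vindex in range(vl):
        selector = rs2[vindex]
        if selector >> (FIELD_SIZE * destsubvl):
            raise ValueError("illegal instruction: unused selector bits set")
        for i in range(destsubvl):
            field = (selector >> (FIELD_SIZE * i)) & FIELD_MASK
            index = field & 0b11
            if index >= srcsubvl:
                raise ValueError("illegal instruction: index out of range")
            if field & 0b100:
                rd.append(F32_CONSTANTS[index])  # constant selected
            else:
                rd.append(rs1[vindex * srcsubvl + index])
    return rd
```

For example, a selector of `0b000_001_010_011` reverses each vec4, and a field value of `0b101` inserts the constant 1.0.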
+
+Swizzle2:
+
+::
+
+    fn swizzle2<Elm, Selector>(
+        rd: &mut [Elm],
+        rs1: &[Elm],
+        rs2: &[Selector],
+        rs3: &[Elm],
+        vl: usize,
+        destsubvl: usize,
+        srcsubvl: usize)
+    where
+        // Elm is a copyable type
+        Elm: Copy,
+        // Selector is a copyable type that can be converted into u64
+        Selector: Copy + Into<u64>,
+    {
+        const FIELD_SIZE: usize = 3;
+        const FIELD_MASK: u64 = 0b111;
+        for vindex in 0..vl {
+            let selector: u64 = rs2[vindex].into();
+            // any bits set above the last used field are illegal
+            if selector >> (FIELD_SIZE * destsubvl) != 0 {
+                // handle illegal instruction trap
+            }
+            for i in 0..destsubvl {
+                let mut sel_field = selector >> (FIELD_SIZE * i);
+                sel_field &= FIELD_MASK;
+                // bit 2 of the field selects between the two source vectors
+                let src = if (sel_field & 0b100) != 0 {
+                    rs1
+                } else {
+                    rs3
+                };
+                sel_field &= 0b11;
+                if sel_field as usize >= srcsubvl {
+                    // handle illegal instruction trap
+                }
+                let value = src[vindex * srcsubvl + (sel_field as usize)];
+                rd[vindex * destsubvl + i] = value;
+            }
+        }
+    }
+