[[!tag standards]]

# REMAP

REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
from a linear form to a 2D or 3D transposed form, or "offset" to permit
arbitrary access to elements, independently on each Vector src or dest
register.

Its primary uses are Matrix Multiplication and in-place reordering of
sequential data. Four SPRs are provided so that a single FMAC may be
used in a single loop to perform 4x4 times 4x4 Matrix multiplication,
generating 64 FMACs. Additional uses include regular "Structure Packing"
such as RGB pixel data extraction and reforming.

REMAP, like all of SV, is abstracted out. Traditional Vector ISAs
typically have only a limited set of instructions that can be
structure-packed (usually LD/ST), whereas REMAP may be applied to
literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.

Note that REMAP does not apply to sub-vector elements: that is what
swizzle is for. Swizzle *can* however be applied to the same instruction
as REMAP.

REMAP is quite expensive to set up, and on some implementations
introduces latency, so it should realistically be used only where it is
worthwhile.

# SHAPE 1D/2D/3D vector-matrix remapping SPRs

There are four "shape" SPRs, SHAPE0-3, 32 bits each, which all have the
same format.

[[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]

The algorithm below shows how REMAP works more clearly, and may be
executed as a python program:

    xdim = 3 # changeme
    ydim = 4
    zdim = 1

    lims = [xdim, ydim, zdim]
    idxs = [0,0,0] # starting indices
    order = [0,1,2] # experiment with different permutations, here
    offset = 2      # experiment with different offset, here
    VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling"
    applydim = 0
    invxyz = [0,0,0]

    # run for offset iterations before actually starting
    for idx in range(offset):
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
            idxs[order[i]] = 0

    break_count = 0 # only used to break the printed output into rows

    for idx in range(VL):
        ix = [0] * 3
        for i in range(3):
            if i >= applydim:
                ix[i] = idxs[i]
            if invxyz[i]:
                ix[i] = lims[i] - 1 - ix[i]
        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
        print(new_idx, end=" ")
        break_count += 1
        if break_count == lims[order[0]]:
            print()
            break_count = 0
        # advance the indices: same stepping as the offset loop above
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
            idxs[order[i]] = 0

Each element index from the for-loop `0..VL-1` is run through the above
algorithm to work out the **actual** element index, instead. Given
that there are four possible SHAPE entries, up to four separate
registers in any given operation may be simultaneously remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<

Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`,
a 2D REMAP allows:

* the column bytes (as a vec4) to be iterated over as an inner loop,
  progressing vertically (`a00 a10 a20 a30`)
* the columns themselves to be iterated as an outer loop
* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.

This is done entirely in-place, without special 128-bit opcodes.

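
The column-wise traversal described above can be illustrated with a
short Python sketch. It is not the REMAP hardware algorithm, only a
model of the element order that such a 2D REMAP would produce, assuming
the 16 bytes are stored row-major as `a00 a01 .. a33`. The function name
`column_order` is hypothetical and chosen purely for illustration.

```python
def column_order(xdim=4, ydim=4):
    """Indices that visit a row-major xdim-by-ydim matrix column by column."""
    indices = []
    for col in range(xdim):        # outer loop: the columns themselves
        for row in range(ydim):    # inner loop: progress vertically
            indices.append(row * xdim + col)
    return indices

# label each stored byte a<row><col>, in row-major storage order
names = ["a%d%d" % (r, c) for r in range(4) for c in range(4)]
order = column_order()

# print one vec4 (one column) per line
for start in range(0, 16, 4):
    print(" ".join(names[i] for i in order[start:start + 4]))
```

The first vec4 printed is `a00 a10 a20 a30`, matching the vertical
progression given in the list above.
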
Below is the pseudocode for [[!wikipedia Rijndael MixColumns]]

```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here
    // bytelevel elwidth overrides and vec4
    // These may then each be 4x 8bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```

The above code assumes that the column bytes have already been
turned around (vertical rather than horizontal). SUBVL.REMAP may
transparently fill that role, in-place, without a complex byte-level
mv operation.

The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 individual byte-level
operations. Given wide enough SIMD backends in hardware, these 32 bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.

The other alternative is to simply perform the actual 4x4 GF(256) Matrix
Multiply using the MDS Matrix.
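
As an illustration of that alternative, the following is a minimal
Python sketch of a single-column GF(256) multiply by the MixColumns MDS
matrix. It is a scalar reference model only, not a description of how
an SV implementation would schedule the operations. The names `MDS`,
`gf_mul` and `mix_column` are hypothetical; the reduction constant
`0x1B` is the same one used in `gmix_column` above.

```python
# MixColumns MDS matrix: each row is a rotation of (2, 3, 1, 1)
MDS = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing with the Rijndael 0x1B constant."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        carry = a & 0x80
        a = (a << 1) & 0xFF      # multiply a by x
        if carry:
            a ^= 0x1B            # reduce when the top bit overflowed
        b >>= 1
    return result

def mix_column(col):
    """Multiply one 4-byte column by the MDS matrix over GF(2^8)."""
    out = []
    for row in MDS:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)   # addition in GF(2^8) is XOR
        out.append(acc)
    return out

print(" ".join("%02x" % v for v in mix_column([0xdb, 0x13, 0x53, 0x45])))
```

For the input column `db 13 53 45` this prints `8e 4d a1 bc`, the same
result `gmix_column` produces for those four bytes.
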