This section is under revision (and is optional)
-# REMAP CSR <a name="remap" />
+# REMAP <a name="remap" />
-There is one 32-bit CSR which may be used to indicate which registers,
-if used in any operation, must be "reshaped" (re-mapped) from a linear
+REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped) from a linear
form to a 2D or 3D transposed form, or "offset" to permit arbitrary
-access to elements within a register.
+access to elements, independently on each Vector src or dest register.
-Their primary use is for Matrix Multiplication, reordering of sequential data in-place. Three CSRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs.
-
-The 32-bit REMAP CSR may reshape up to 3 registers:
-
-| 29..28 | 27..26 | 25..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
-| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
-| shape2 | shape1 | shape0 | 0 | regidx2 | 0 | regidx1 | 0 | regidx0 |
-
-regidx0-2 refer not to the Register CSR CAM entry but to the underlying
-*real* register (see regidx, the value) and consequently is 7-bits wide.
-When set to zero (referring to x0), clearly reshaping x0 is pointless,
-so is used to indicate "disabled".
-shape0-2 refers to one of three SHAPE CSRs. A value of 0x3 is reserved.
-Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
-
-It is anticipated that these specialist CSRs not be very often used.
-Unlike the CSR Register and Predication tables, the REMAP CSRs use
-the full 7-bit regidx so that they can be set once and left alone,
-whilst the CSR Register entries pointing to them are disabled, instead.
+Its primary uses are Matrix Multiplication and in-place reordering of sequential data. Four CSRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs. Additional uses include regular "Structure Packing", such as RGB pixel data extraction and reforming.
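The "single FMAC in a single loop" idea can be illustrated with a minimal Python sketch. It does not use the actual SHAPE encodings: the helper functions `idx_a`, `idx_b` and `idx_c` are hypothetical stand-ins for the per-register index sequences that remapping would produce, and the matrices are assumed to be stored as flat, row-major lists of 16 elements.

```python
N = 4  # 4x4 matrices, stored flat (row-major) as 16-element lists

# Hypothetical per-operand index schedules: each maps the linear loop
# counter (0..63) to the element that operand should use on that step.
def idx_c(step):  # destination element C[i][j]
    return (step // (N * N)) * N + (step // N) % N

def idx_a(step):  # source element A[i][k]
    return (step // (N * N)) * N + step % N

def idx_b(step):  # source element B[k][j]
    return (step % N) * N + (step // N) % N

def matmul_single_loop(a, b):
    """Multiply two flat 4x4 matrices using one loop of multiply-accumulates."""
    c = [0] * (N * N)
    fmacs = 0
    for step in range(N * N * N):
        # one FMAC per iteration; only the element indices are remapped
        c[idx_c(step)] += a[idx_a(step)] * b[idx_b(step)]
        fmacs += 1
    return c, fmacs

identity = [1 if r == col else 0 for r in range(N) for col in range(N)]
result, count = matmul_single_loop(list(range(16)), identity)
print(count)   # 64
print(result == list(range(16)))  # True
```

The loop body never changes; all of the matrix structure lives in the three index schedules, which is the role the remapping plays for each source and destination register.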
# SHAPE 1D/2D/3D vector-matrix remapping CSRs
-There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
-which have the same format. When each SHAPE CSR is set entirely to zeros,
-remapping is disabled: the register's elements are a linear (1D) vector.
-
-| 31..30 | 29..24 | 23..21 | 20..18 | 17..12 | 11..6 | 5..0 |
-| -------- | ------ | ------- | ------- | ------- | -------- | ------- |
-| applydim |modulo | invxyz | permute | zdimsz | ydimsz | xdimsz |
-
-applydim will set to zero the dimensions less than this. applydim=0 applies all three. applydim=1 applies y and z. applydim=2 applys only z. applydim=3 is reserved.
-
-invxyz will invert the start index of each of x, y or z. If invxyz[0] is zero then x-dimensional counting begins from 0 and increments, otherwise it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.
-
-modulo will cause the output to wrap and remain within the range 0 to modulo. The value zero disables modulus application. Note that modulo arithmetic is applied after all other remapping calculations.
-
-xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
-that the array dimensionality for that dimension is 1. A value of xdimsz=2
-would indicate that in the first dimension there are 3 elements in the
-array. The format of the array is therefore as follows:
-
- array[xdim+1][ydim+1][zdim+1]
+There are four "shape" CSRs, SHAPE0-3, each 32 bits wide,
+all of which have the same format.
-However whilst illustrative of the dimensionality, that does not take the
-"permute" setting into account. "permute" may be any one of six values
-(0-5, with values of 6 and 7 being reserved, and not legal). The table
-below shows how the permutation dimensionality order works:
+[[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]
-| permute | order | array format |
-| ------- | ----- | ------------------------ |
-| 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
-| 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
-| 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
-| 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
-| 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
-| 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
+The algorithm below shows how REMAP works more clearly, and may be
+executed as a python program:
-In other words, the "permute" option changes the order in which
-nested for-loops over the array would be done. The algorithm below
-shows this more clearly, and may be executed as a python program:
-
-    # mapidx = REMAP.shape2
-    xdim = 3 # SHAPE[mapidx].xdim_sz+1
-    ydim = 4 # SHAPE[mapidx].ydim_sz+1
-    zdim = 5 # SHAPE[mapidx].zdim_sz+1
+    xdim = 3
+    ydim = 4
+    zdim = 1
    lims = [xdim, ydim, zdim]
    idxs = [0,0,0] # starting indices
-    order = [1,0,2] # experiment with different permutations, here
-    modulo = 64 # experiment with different modulus, here
-    applydim=0
-    invxyz = [0,0,0]
+    order = [0,1,2] # experiment with different permutations, here
+    offset = 2 # experiment with different offset, here
+    VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling"
+    applydim = 0
+    invxyz = [0,0,0]
+
+    # run for offset iterations before actually starting
+    for idx in range(offset):
+        for i in range(3):
+            idxs[order[i]] = idxs[order[i]] + 1
+            if (idxs[order[i]] != lims[order[i]]):
+                break
+            idxs[order[i]] = 0
+
+    break_count = 0
-    for idx in range(xdim * ydim * zdim):
+    for idx in range(VL):
        ix = [0] * 3
        for i in range(3):
            if i >= applydim:
                ix[i] = idxs[i]
            if invxyz[i]:
-                ix[i] = lims[i] - ix[i]
+                ix[i] = lims[i] - 1 - ix[i]
        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
-        print new_idx % modulo
+        print new_idx,
+        break_count += 1
+        if break_count == lims[order[0]]:
+            print
+            break_count = 0
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
-            print
            idxs[order[i]] = 0
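The listing above uses Python 2 `print` statements. A Python 3 sketch of the same index generation, wrapped as a generator so that the sequence can be inspected or tested, might look as follows. The name `remap_indices` and its keyword parameters are illustrative assumptions, not part of the specification; the arithmetic follows the revised listing (including the `lims[i] - 1 - ix[i]` inversion).

```python
def remap_indices(dims, order=(0, 1, 2), offset=0, vl=None,
                  applydim=0, invxyz=(0, 0, 0)):
    """Yield remapped element indices for a (xdim, ydim, zdim) shape."""
    xdim, ydim, zdim = dims
    lims = [xdim, ydim, zdim]
    idxs = [0, 0, 0]                 # current x, y, z counters
    if vl is None:
        vl = xdim * ydim * zdim      # a larger VL gives "cycling"

    def step():
        # advance the nested counters; order[0] is the innermost loop
        for i in range(3):
            idxs[order[i]] += 1
            if idxs[order[i]] != lims[order[i]]:
                break
            idxs[order[i]] = 0

    # run for offset iterations before actually starting
    for _ in range(offset):
        step()

    for _ in range(vl):
        ix = [0, 0, 0]
        for i in range(3):
            if i >= applydim:
                ix[i] = idxs[i]
            if invxyz[i]:
                ix[i] = lims[i] - 1 - ix[i]
        yield ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
        step()

print(list(remap_indices((3, 4, 1), offset=2)))
# → [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1]
print(list(remap_indices((3, 4, 1), order=(1, 0, 2))))
# → [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```

With the default order the sequence is linear and merely rotated by `offset`; with `order=(1, 0, 2)` the y dimension becomes the innermost loop, giving a strided walk through the same twelve elements.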
Here, it is assumed that this algorithm is run within all pseudo-code
wherever a for-loop would otherwise run from 0 to VL-1 to refer to contiguous register
elements; instead, where REMAP indicates to do so, the element index
is run through the above algorithm to work out the **actual** element
-index, instead. Given that there are three possible SHAPE entries, up to
-three separate registers in any given operation may be simultaneously
+index, instead. Given that there are four possible SHAPE entries, up to
+four separate registers in any given operation may be simultaneously
remapped:
function op_add(rd, rs1, rs2) # add not VADD!
By changing remappings, 2D matrices may be transposed "in-place" for one
operation, followed by setting a different permutation order without
-having to move the values in the registers to or from memory. Also,
-the reason for having REMAP separate from the three SHAPE CSRs is so
-that in a chain of matrix multiplications and additions, for example,
-the SHAPE CSRs need only be set up once; only the REMAP CSR need be
-changed to target different registers.
+having to move the values in the registers to or from memory.
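As a small illustration of "in-place" transposition, the following Python sketch keeps the data fixed and changes only the order in which elements are visited. It is a simplified two-dimensional model rather than the full algorithm: `permuted_order` is a hypothetical helper, and the 3x4 shape and letter values are chosen purely for demonstration.

```python
def permuted_order(xdim, ydim, order):
    """Element indices of an xdim-by-ydim array, with order[0] innermost."""
    lims = [xdim, ydim]
    out = []
    for outer in range(lims[order[1]]):
        for inner in range(lims[order[0]]):
            pos = [0, 0]
            pos[order[0]] = inner
            pos[order[1]] = outer
            # elements are stored linearly, x varying fastest
            out.append(pos[0] + pos[1] * xdim)
    return out

regs = list("abcdefghijkl")          # 12 elements; never moved or copied

linear = permuted_order(3, 4, order=(0, 1))
transposed = permuted_order(3, 4, order=(1, 0))

print("".join(regs[i] for i in linear))      # abcdefghijkl
print("".join(regs[i] for i in transposed))  # adgjbehkcfil
```

Switching from one permutation order to the other changes which element each loop step touches, so one operation can see the matrix and the next its transpose without any load, store or register-to-register move.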
Note that:
applies (i.e. it offsets elements *within* registers rather than
entire registers).
* If permute option 000 is utilised, the actual order of the
- reindexing does not change!
+ reindexing does not change. However, modulo MVL still occurs
+ which will result in repeated operations (use with caution).
* If two or more dimensions are set to zero, the actual order does not change!
* The above algorithm is pseudo-code **only**. Actual implementations
will need to take into account the fact that the element for-looping