[[!tag standards]]
# NOTE
This section is under revision (and is optional).
# REMAP CSR
There is one 32-bit CSR which may be used to indicate which registers,
if used in any operation, must be "reshaped" (re-mapped) from a linear
form to a 2D or 3D transposed form, or "offset" to permit arbitrary
access to elements within a register.
Its primary uses are Matrix Multiplication and the in-place reordering of sequential data. Three SHAPE CSRs are provided so that a single FMAC may be used in a single loop to perform a 4x4 by 4x4 Matrix Multiplication, generating 64 FMACs.
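As a rough illustration of the 64-FMAC figure, the sketch below multiplies two 4x4 matrices with one flat loop in plain Python. It does not model REMAP itself: the name `matmul_flat` and the flat row-major layout are assumptions made for the example. It only shows that a single loop counter has to be turned into a different element index for each of the three operands, which is the role the text assigns to the three CSRs.

```python
N = 4  # matrix dimension used in the text (4x4 times 4x4)


def matmul_flat(a, b):
    """Multiply two N x N matrices stored as flat row-major lists.

    Returns the flat result and the number of multiply-accumulates done.
    """
    c = [0] * (N * N)
    fmacs = 0
    for n in range(N * N * N):   # one flat loop: 64 iterations for N = 4
        k = n % N                # position along the dot product
        j = (n // N) % N         # column of the result
        i = n // (N * N)         # row of the result
        # one multiply-accumulate, with three different element indices
        c[i * N + j] += a[i * N + k] * b[k * N + j]
        fmacs += 1
    return c, fmacs
```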
The 32-bit REMAP CSR may reshape up to 3 registers:
| 29..28 | 27..26 | 25..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
| shape2 | shape1 | shape0 | 0 | regidx2 | 0 | regidx1 | 0 | regidx0 |
regidx0-2 refer not to the Register CSR CAM entry but to the underlying
*real* register (see regidx, the value), and consequently are 7 bits wide.
Reshaping x0 would be pointless, so a regidx of zero (which would otherwise
refer to x0) is used to indicate "disabled".
shape0-2 each refer to one of the three SHAPE CSRs. A value of 0x3 is reserved.
Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
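The field layout above can be sketched as a small decoder. This is an illustrative helper only: the function name `decode_remap` and the dictionary keys are assumptions, not part of the specification, and raising an error on reserved bits is one possible way of treating "must be set to zero".

```python
def decode_remap(csr):
    """Split a 32-bit REMAP value into three regidx/shape entries."""
    # bits 7, 15, 23, 30 and 31 are reserved and must be zero
    reserved_mask = (1 << 7) | (1 << 15) | (1 << 23) | (0b11 << 30)
    if csr & reserved_mask:
        raise ValueError("reserved REMAP bits must be zero")
    entries = []
    for n in range(3):
        regidx = (csr >> (8 * n)) & 0x7F       # regidx0: 6..0, regidx1: 14..8, regidx2: 22..16
        shape = (csr >> (24 + 2 * n)) & 0x3    # shape0: 25..24, shape1: 27..26, shape2: 29..28
        if shape == 0x3:
            raise ValueError("shape value 0x3 is reserved")
        # a regidx of zero (x0) means this entry is disabled
        entries.append({"regidx": regidx, "shape": shape, "enabled": regidx != 0})
    return entries
```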
It is anticipated that these specialist CSRs will not be used very often.
Unlike the CSR Register and Predication tables, the REMAP CSRs use
the full 7-bit regidx so that they can be set once and left alone,
whilst the CSR Register entries pointing to them are disabled, instead.
# SHAPE 1D/2D/3D vector-matrix remapping CSRs
There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide,
which share the same format. When a SHAPE CSR is set entirely to zeros,
remapping is disabled: the register's elements are a linear (1D) vector.
| 31..30 | 29..24 | 23..21 | 20..18 | 17..12 | 11..6 | 5..0 |
| -------- | ------ | ------- | ------- | ------- | -------- | ------- |
| applydim |modulo | invxyz | permute | zdimsz | ydimsz | xdimsz |
applydim sets to zero the index of every dimension numbered lower than its value (x=0, y=1, z=2). applydim=0 applies all three dimensions. applydim=1 applies only y and z. applydim=2 applies only z. applydim=3 is reserved.
invxyz inverts the start index of each of x, y or z. If invxyz[0] is zero then x-dimensional counting begins from 0 and increments, otherwise it begins from the last x index (the number of x elements minus one) and iterates down to zero. Likewise for y and z.
modulo causes the output to wrap so that it remains within the range 0 to modulo-1. A value of zero disables the modulus. Note that modulo arithmetic is applied after all other remapping calculations.
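The SHAPE bit layout can be unpacked along the following lines. This is a sketch only: the name `decode_shape` is hypothetical, the fields are returned as raw values exactly as stored, and treating bit 21 as invxyz[0] (the x bit) is an assumption, since the table gives only the 23..21 range.

```python
def decode_shape(csr):
    """Split a 32-bit SHAPE value into its named fields (raw values)."""
    return {
        "xdimsz": csr & 0x3F,              # bits 5..0
        "ydimsz": (csr >> 6) & 0x3F,       # bits 11..6
        "zdimsz": (csr >> 12) & 0x3F,      # bits 17..12
        "permute": (csr >> 18) & 0x7,      # bits 20..18
        # bits 23..21; ASSUMPTION: bit 21 is invxyz[0] (x), bit 23 is z
        "invxyz": [(csr >> (21 + i)) & 1 for i in range(3)],
        "modulo": (csr >> 24) & 0x3F,      # bits 29..24
        "applydim": (csr >> 30) & 0x3,     # bits 31..30
    }
```

An all-zero SHAPE value decodes to all-zero fields, matching the "remapping disabled" case described above.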
xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
that the size of that dimension is 1. A value of xdimsz=2 would indicate
that there are 3 elements in the first dimension of the array. The format
of the array is therefore as follows:

    array[xdimsz+1][ydimsz+1][zdimsz+1]

However, whilst illustrative of the dimensionality, that does not take the
"permute" setting into account. "permute" may be any one of six values
(0-5; values 6 and 7 are reserved and not legal). The table
below shows how the permutation dimensionality order works:
| permute | order | array format                   |
| ------- | ----- | ------------------------------ |
| 000     | 0,1,2 | (xdimsz+1)(ydimsz+1)(zdimsz+1) |
| 001     | 0,2,1 | (xdimsz+1)(zdimsz+1)(ydimsz+1) |
| 010     | 1,0,2 | (ydimsz+1)(xdimsz+1)(zdimsz+1) |
| 011     | 1,2,0 | (ydimsz+1)(zdimsz+1)(xdimsz+1) |
| 100     | 2,0,1 | (zdimsz+1)(xdimsz+1)(ydimsz+1) |
| 101     | 2,1,0 | (zdimsz+1)(ydimsz+1)(xdimsz+1) |
In other words, the "permute" option changes the order in which
nested for-loops over the array would be done. The algorithm below
shows this more clearly, and may be executed as a Python program:

    # mapidx = REMAP.shape2
    xdim = 3  # SHAPE[mapidx].xdim_sz+1
    ydim = 4  # SHAPE[mapidx].ydim_sz+1
    zdim = 5  # SHAPE[mapidx].zdim_sz+1

    lims = [xdim, ydim, zdim]
    idxs = [0, 0, 0]   # starting indices
    order = [1, 0, 2]  # experiment with different permutations, here
    modulo = 64        # experiment with different modulus, here (0 disables)
    applydim = 0
    invxyz = [0, 0, 0]

    for idx in range(xdim * ydim * zdim):
        ix = [0] * 3
        for i in range(3):
            if i >= applydim:  # dimensions below applydim stay at zero
                ix[i] = idxs[i]
                if invxyz[i]:  # count down from the last index instead
                    ix[i] = lims[i] - 1 - ix[i]
        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
        if modulo:  # modulo is applied last; zero means no wrapping
            new_idx = new_idx % modulo
        print(new_idx)
        # advance the nested-loop counters in the permuted order
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if idxs[order[i]] != lims[order[i]]:
                break
            print()  # blank line each time a dimension wraps
            idxs[order[i]] = 0
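For experimenting with different settings, the same algorithm can be wrapped in a function that returns the remapped indices as a list. The names `remap_indices` and `PERMUTE_ORDER` are hypothetical, the sizes are passed as actual element counts (the CSR fields plus one), and inversion follows the textual description of invxyz (counting down from the last index of a dimension). It is a sketch for exploring the behaviour, not a reference implementation.

```python
# Loop orders selected by "permute", copied from the permute table.
PERMUTE_ORDER = [(0, 1, 2), (0, 2, 1), (1, 0, 2),
                 (1, 2, 0), (2, 0, 1), (2, 1, 0)]


def remap_indices(dims, permute=0, invxyz=(0, 0, 0), applydim=0, modulo=0):
    """Return the remapped element index for each step of a linear loop.

    dims is (xdim, ydim, zdim) given as actual sizes.
    """
    order = PERMUTE_ORDER[permute]
    idxs = [0, 0, 0]
    out = []
    for _ in range(dims[0] * dims[1] * dims[2]):
        ix = [0, 0, 0]
        for i in range(3):
            if i >= applydim:            # lower dimensions stay at zero
                ix[i] = idxs[i]
                if invxyz[i]:            # count down from the last index
                    ix[i] = dims[i] - 1 - ix[i]
        new_idx = ix[0] + ix[1] * dims[0] + ix[2] * dims[0] * dims[1]
        out.append(new_idx % modulo if modulo else new_idx)
        for i in range(3):               # advance counters in permuted order
            idxs[order[i]] += 1
            if idxs[order[i]] != dims[order[i]]:
                break
            idxs[order[i]] = 0
    return out


# A 2-row by 3-column matrix stored with x varying fastest, read with y
# varying fastest instead (permute=2): this walks it in transposed order.
src = list("abcdef")
transposed = [src[i] for i in remap_indices((3, 2, 1), permute=2)]
```

With these inputs `transposed` holds the elements in the order a, d, b, e, c, f.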
It is assumed that this algorithm is run within all pseudo-code
throughout this document where a (parallelism) for-loop would normally
run from 0 to VL-1 to refer to contiguous register elements. Where
REMAP indicates to do so, the element index is instead run through the
above algorithm to work out the **actual** element index. Given that
there are three possible SHAPE entries, up to three separate registers
in any given operation may be simultaneously remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<