From c38a2d9661c87453ed3bbc11bc3974f07d5cc130 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Tue, 16 Oct 2018 01:31:36 +0100
Subject: [PATCH] add reshaping section

---
 simple_v_extension/specification.mdwn | 135 +++++++++++++++++++++++---
 1 file changed, 122 insertions(+), 13 deletions(-)

diff --git a/simple_v_extension/specification.mdwn b/simple_v_extension/specification.mdwn
index fef048cd2..b46d937f1 100644
--- a/simple_v_extension/specification.mdwn
+++ b/simple_v_extension/specification.mdwn
@@ -65,6 +65,7 @@ tables which are used at the register decode phase.
 
 * A register CSR key-value table (8 32-bit CSRs of 2 16-bits each)
 * A predication CSR key-value table (again, 8 32-bit CSRs of 2 16-bits each)
+* A "reshaping"
 
 There are also four additional CSRs for User-Mode:
 
@@ -416,6 +417,111 @@ for the storage of comparisions: in these specific circumstances the
 requirement for there to be an active CSR *register* entry is removed.
 
+## REMAP CSR
+
+There is one 32-bit CSR which may be used to indicate which registers,
+if used in any operation, must be "reshaped" (re-mapped) from a linear
+form to a 2D or 3D transposed form.  The 32-bit REMAP CSR may reshape
+up to 3 registers:
+
+| 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
+| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
+| shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |
+
+regidx0-2 refer not to the Register CSR CAM entry but to the underlying
+*real* register (see regidx, the value) and consequently are 7 bits wide.
+shape0-2 each refer to one of the three SHAPE CSRs.  A value of 0x3 is reserved.
+Bits 7, 15, 23, 30 and 31 are reserved, and must be set to zero.
+
+## SHAPE 1D/2D/3D vector-matrix remapping CSRs
+
+There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
+which have the same format. When each SHAPE CSR is set entirely to zeros,
+remapping is disabled: the register's elements are a linear (1D) vector.
+
+| 26..24  | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
+| ------- | -- | ------- | -- | ------- | -- | ------- |
+| permute | 0  | zdimsz  | 0  | ydimsz  | 0  | xdimsz  |
+
+xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
+that the array dimensionality for that dimension is 1.  A value of xdimsz=2
+would indicate that in the first dimension there are 3 elements in the
+array.  The format of the array is therefore as follows:
+
+    array[xdimsz+1][ydimsz+1][zdimsz+1]
+
+However, whilst illustrative of the dimensionality, that does not take the
+"permute" setting into account.  "permute" may be any one of six values
+(0-5, with values of 6 and 7 being reserved, and not legal).  The table
+below shows how the permutation dimensionality order works:
+
+| permute | order | array format                   |
+| ------- | ----- | ------------------------------ |
+| 000     | 0,1,2 | (xdimsz+1)(ydimsz+1)(zdimsz+1) |
+| 001     | 0,2,1 | (xdimsz+1)(zdimsz+1)(ydimsz+1) |
+| 010     | 1,0,2 | (ydimsz+1)(xdimsz+1)(zdimsz+1) |
+| 011     | 1,2,0 | (ydimsz+1)(zdimsz+1)(xdimsz+1) |
+| 100     | 2,0,1 | (zdimsz+1)(xdimsz+1)(ydimsz+1) |
+| 101     | 2,1,0 | (zdimsz+1)(ydimsz+1)(xdimsz+1) |
+
+In other words, the "permute" option changes the order in which
+nested for-loops over the array would be done.  The algorithm below
+shows this more clearly, and may be executed as a python (3) program:
+
+    # mapidx = REMAP.shape2
+    xdim = 3 # SHAPE[mapidx].xdimsz+1
+    ydim = 4 # SHAPE[mapidx].ydimsz+1
+    zdim = 5 # SHAPE[mapidx].zdimsz+1
+
+    lims = [xdim, ydim, zdim]
+    idxs = [0,0,0]   # current x, y, z position
+    order = [1,0,2]  # permute=010: y increments fastest, then x, then z
+
+    for idx in range(xdim * ydim * zdim):
+        new_idx = idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
+        print(new_idx, end=" ")
+        for i in range(3):
+            idxs[order[i]] = idxs[order[i]] + 1
+            if (idxs[order[i]] != lims[order[i]]):
+                break
+            print()  # a dimension wrapped: start a new output row
+            idxs[order[i]] = 0
+
+Here, it is assumed that this algorithm is run within all pseudo-code
+throughout this document where a (parallelism) for-loop would normally
+run from 0 to VL-1 and then use that to refer to contiguous register
+elements; instead, where REMAP indicates to do so, the element index
+is run through the above algorithm to work out the **actual** element
+index.  Given that there are three possible SHAPE entries, up to
+three separate registers in any given operation may be simultaneously
+remapped.
+
+In this way, 2D matrices may be transposed "in-place" for one operation,
+followed by setting a different permutation order without having to
+move the values in the registers to or from memory.  Also, the reason
+for having REMAP separate from the three SHAPE CSRs is so that in a
+chain of matrix multiplications and additions, for example, the SHAPE
+CSRs need only be set up once; only the REMAP CSR need be changed to
+target different
+
+Note that:
+
+* If permute option 000 is utilised, the actual order of the
+  reindexing does not change!
+* If two or more of the dimsz fields are zero, the actual order does not change!
+* Setting the total elements (xdimsz+1) times (ydimsz+1) times (zdimsz+1) to
+  less than MVL is **perfectly legal**, albeit very obscure.  It permits
+  entries to be regularly presented to operands **more than once**, thus
+  allowing the same underlying registers to act as an accumulator of
+  multiple vector or matrix operations, for example.
+
+Clearly some considerable care needs to be taken here, as the remapping
+could hypothetically create arithmetic operations that target the
+exact same underlying registers, resulting in data corruption due to
+pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
+register-renaming will have an easier time dealing with this than
+DSP-style SIMD micro-architectures.
+
 # Instruction Execution Order
 
 Simple-V behaves as if it is a hardware-level "macro expansion system",
@@ -484,6 +590,22 @@ All other operations using registers are automatically parallelised.
 This includes AMOMAX, AMOSWAP and so on, where particular care and attention
 must be paid.
 
+Example pseudo-code for an integer ADD operation (including scalar operations).
+Floating-point uses fp csrs.
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      predval = get_pred_val(FALSE, rd);
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      for (i = 0; i < VL; i++)
+        if (predval & 1<