# Note about naming

the original assessment for SVP from 18 months ago concluded that it should be easy for scalar (non-SV) instructions to get at the exact same scalar registers when in SVP mode.  otherwise scalar v3.0B code would need to restrict itself to a massively truncated subset of the scalar registers numbered 0-31 (only r0, r4, r8...), which interferes with ABIs to such an extent that it would compromise SV.

question: has anything changed about the assessment that was done, which concluded that for scalar SVP regs they should overlap completely with scalar ISA regs?


# Notes on requirements for bit allocations

do not try to jam VL or MAXVL in.  go with the flow of 24 bits spare.

* 2: SUBVL
* 2: elwidth
* 2: twin-predication (src, dest) elwidth
* 1: select INT or CR predication
* 3: predicate selection and inversion (QTY 2 for tpred)
* 4x2 or 3x3: src1/2/3/dest Vector/Scalar reg
* 5: mode

totals: 24 bits (dest elwidth shared)

http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001434.html

## All zeros indicates "disable SVP"

The defaults for all capabilities of SVP should be zero to indicate "no action": SUBVL=1 is encoded as 0b00, register name prefixes as scalar=0b0, elwidth overrides as DEFAULT=0b00, predication off as 0b000, etc.

this way SV may be entirely disabled, leaving an "all zeros" to indicate to v3.1B 64bit prefixing that the standard OpenPOWER v3.1B encodings are in full effect (and that SV is not). As all zeros meshes with current "reserved" encodings this should work well.


## twin predication

twin predication and twin elwidth overrides are extremely important: they make it possible to override both the src and dest elwidth yet keep the underlying scalar operation intact.  for example, mr with elwidth=8 and VL=8 on the src will take a byte at a time from one 64-bit reg and place it into 8x 64-bit regs, zero-extended.  more complex operations involve SUBVL and Audio/Video DSP operations, see [[av_opcodes]]
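
The mr example can be sketched in Python as a plain loop.  This is an illustration only: the name `mr_sew8` is hypothetical, and taking element 0 from the least significant byte of the source is an assumption, not something these notes state.

```python
def mr_sew8(src, vl=8):
    """Unpack bytes of one 64-bit value into vl zero-extended 64-bit values.

    Assumption: element 0 is the least significant byte of src.
    """
    dest = []
    for i in range(vl):
        byte = (src >> (8 * i)) & 0xFF  # src elwidth=8: one byte per element
        dest.append(byte)               # dest elwidth=64: zero-extended
    return dest


regs = mr_sew8(0x0807060504030201)
print([hex(r) for r in regs])
```

The underlying scalar operation (a register move) is unchanged; only the element widths on each side differ.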

something like:

| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 12 | 13 18 | 19 23 |
|-------|-----|-----|------|------|-------|-------|-------|
| subvl | sew | dew | ptyp | psrc | pdst  | vspec | mode  |

table:

* subvl - 1 to 4 scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 / 32 element width
* ptyp - predication INT / CR
* psrc / pdst - predicate mask selector and inversion
* vspec - 3 bit src / dest scalar-vector extension
* mode: 5 bits
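
A minimal sketch of packing and unpacking this layout, to check that the fields tile the 24 bits exactly.  The field positions come from the table; `FIELDS`, `pack` and `unpack` are hypothetical names, and treating bit 0 as the most significant bit (MSB0 numbering) is an assumption.

```python
# (name, first bit, last bit) copied from the twin-predication table.
# Assumption: bit 0 is the most significant of the 24 bits (MSB0 numbering).
FIELDS = [
    ("subvl", 0, 1),
    ("sew", 2, 3),
    ("dew", 4, 5),
    ("ptyp", 6, 6),
    ("psrc", 7, 9),
    ("pdst", 10, 12),
    ("vspec", 13, 18),
    ("mode", 19, 23),
]
TOTAL_BITS = 24


def pack(values):
    """Pack a dict of field values into a 24-bit integer; missing fields are 0."""
    word = 0
    for name, first, last in FIELDS:
        width = last - first + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        shift = TOTAL_BITS - 1 - last  # distance of the field's LSB from bit 23
        word |= value << shift
    return word


def unpack(word):
    """Split a 24-bit integer back into its named fields."""
    out = {}
    for name, first, last in FIELDS:
        width = last - first + 1
        shift = TOTAL_BITS - 1 - last
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out


prefix = pack({"subvl": 0b01, "sew": 0b01, "psrc": 0b010, "mode": 0b00001})
print(f"{prefix:024b}", unpack(prefix))
print(pack({}) == 0)  # every field at its default gives the all-zeros pattern
```

With every field left at zero the packed value is zero, which is the "disable SVP" pattern described earlier.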

## twin predication, CR based.

separate src and dest predicates are a critical part of SV for provision of VEXPAND, VCOMPRESS, VSPLAT, VINSERT and many more operations.
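
To show why separate predicates give VCOMPRESS and VEXPAND, here is a sketch of a twin-predicated move.  It assumes both predicates are plain integer bitmasks where bit i controls element i, and that masked-out destination elements are left unchanged; `twin_pred_mv` is a hypothetical name, not an instruction.

```python
def twin_pred_mv(src, dest, srcmask, destmask, vl):
    """Copy src to dest, with separate element-skipping on each side.

    srcmask and destmask are integers; bit i set means element i is active.
    """
    result = list(dest)
    i = j = 0
    while i < vl and j < vl:
        # step over masked-out source elements
        while i < vl and not (srcmask >> i) & 1:
            i += 1
        # step over masked-out destination elements
        while j < vl and not (destmask >> j) & 1:
            j += 1
        if i == vl or j == vl:
            break
        result[j] = src[i]
        i += 1
        j += 1
    return result


src = [10, 11, 12, 13]
# sparse source, dense destination: behaves like VCOMPRESS
print(twin_pred_mv(src, [0] * 4, 0b1010, 0b1111, 4))  # [11, 13, 0, 0]
# dense source, sparse destination: behaves like VEXPAND
print(twin_pred_mv(src, [0] * 4, 0b1111, 0b1010, 4))  # [0, 10, 0, 11]
```

The same loop covers both cases; only the two masks change.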

Twin CR predication could be done in two ways:

* start from different CRs for the src and dest
* start from the same CR.

With different bits being selectable (CR[0..3]), starting from the same CR makes some sense.

# standard arith ops (single predication)

these are of the form res = op(src1, src2, ...)

| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 18 | 19  23 |
|-------|-----|-----|------|------|-------|--------|
| subvl | sew | dew | ptyp | pred | vspec | mode   |

table:

* subvl - 1 to 4 scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 / 32 element width
* ptyp - predication INT / CR
* pred - predicate mask selector and inversion
* vspec - 2/3 bit src / dest scalar-vector extension
* mode - 5 bit

For 2 op (dest/src1/src2) the vspec may be 3 bits per reg: total 9 bits.  For 3 op (dest/src1/2/3) the vspec may be 2 bits per reg: total 8 bits.
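
A quick arithmetic check of the vspec budget, using the field widths from the table above.  The dictionary names are made up for the example.

```python
# Field widths in bits, taken from the single-predication table.
FIXED = {"subvl": 2, "sew": 2, "dew": 2, "ptyp": 1, "pred": 3, "mode": 5}
PREFIX_BITS = 24

vspec_room = PREFIX_BITS - sum(FIXED.values())  # bits left over for vspec

# (number of registers, bits of scalar/vector extension per register)
FORMATS = {"2 op (dest/src1/src2)": (3, 3), "3 op (dest/src1/2/3)": (4, 2)}

for name, (nregs, bits_per_reg) in FORMATS.items():
    used = nregs * bits_per_reg
    print(f"{name}: vspec uses {used} of {vspec_room} bits")
```

Both forms fit in the 9 bits of the 10-18 field; the 3 op form leaves one bit unused.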

Note:

* the operation should always be done at max(srcwidth, dstwidth), unless it can
  be proven that using the lower width leads to the same result
* saturation is done on the result at the **dest** elwidth

Some examples on different operation widths:

    u16 / u16 = u8
    256 / 2 = 128 # if we used the smaller width, we'd get 0. Wrong

    u8 * u8 = u16
    255 * 2 = 510 # if we used the smaller width, we'd get 254. Wrong

    u16 + u16 = u8
    256 + 2 = 2 # this is correct whether we use the larger or smaller width
                # aka hw can optimize narrowing addition
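
The three cases above can be reproduced with a small Python model.  It assumes unsigned values and plain truncation (wrap-around) rather than saturation; `wrap` and `op_at_width` are hypothetical helpers.

```python
import operator


def wrap(value, bits):
    """Truncate an unsigned value to the given element width."""
    return value & ((1 << bits) - 1)


def op_at_width(op, a, b, opwidth, destwidth):
    """Truncate inputs and result to opwidth, then store at destwidth."""
    result = wrap(op(wrap(a, opwidth), wrap(b, opwidth)), opwidth)
    return wrap(result, destwidth)


cases = [
    ("u16 / u16 = u8", operator.floordiv, 256, 2, 8),
    ("u8 * u8 = u16", operator.mul, 255, 2, 16),
    ("u16 + u16 = u8", operator.add, 256, 2, 8),
]
for label, op, a, b, destwidth in cases:
    wide = op_at_width(op, a, b, 16, destwidth)
    narrow = op_at_width(op, a, b, 8, destwidth)
    print(f"{label}: at 16 bits -> {wide}, at 8 bits -> {narrow}")
```

Only the addition gives the same answer at both widths, which is the case where hardware could safely use the narrower one.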


# Notes about Swizzle

Basically, there isn't enough room to fit swizzle for two sources (src1/2) plus SV, even into the 64 bit prefix (actually 24 bits spare), without severely compromising on the number of bits allocated to either swizzle, or SV, or both.

therefore the strategy proposed is:

* design 16bit scalar ops
* use the 11 bit old SV prefix to create 32bit insns
* when those are embedded into v3.1B 64 prefix, the 24 bits are entirely allocated to swizzle.

with 2x12 this would mean no need to have complex encoding of swizzle.

if we really do need 2 bits spare then the complex encoder of swizzle could be deployed.  (*an analysis shows this to be very unlikely. 7^4 is around 2400 which still requires 12 bits to encode*)  (that's miscalculated, see Single Swizzle section below.)  (it isn't, because you missed out predicate mask skip as the 7th option.)

## Single Swizzle

I expect swizzle to not be common enough to warrant 2 swizzles in a single instruction, therefore the above swizzle strategy is probably unnecessary.

Also, if a swizzle supports up to subvl=4, then 11 bits is sufficient since each swizzle element needs to be able to select 1 of 6 different values: 0, 1, x, y, z, w. 6^4 = 1296 which easily fits in 11 bits (only by dropping "predicate mask" from the list of options, which makes 7 options not 6.  see [[mv.swizzle]])

What about subvl=4 that skips one element?  src vec is 4 but one of the elements is to be left alone?  This is not 6 options, it is 7 options (including "skip", i.e. combining with a predicate mask in effect).  note that this is not the same as a vec3-with-a-skip

What could hypothetically be done is: when SUBVL=3 a different encoding is used, one that allows the "skip" to be specified.  X Y skip W for example.  this would then be interpreted, "actually the vector is vec4 but one element is skipped"

the problem with that is that now SUBVL has become critically dependent on the swizzle, worse than that the swizzle is embedded in the instruction, even worse than that it's encoded in a complex multi-gate fashion.

all of which screams, "this is going in completely the wrong direction".  keep it simple.  7 options, 3 bits, 4x3, 12 bits for swizzle, ignore some if SUBVL is 1 2 or 3.
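
The bit counts in this discussion can be checked directly, and the simple 4x3 form sketched.  The numeric order of `OPTIONS` and the placement of element 0 in the lowest 3 bits are hypothetical choices made for this example only.

```python
# The seven per-element choices discussed above. The numeric order is a
# hypothetical assignment made up for this example.
OPTIONS = ["0", "1", "x", "y", "z", "w", "skip"]


def bits_needed(count):
    """Smallest number of bits that can distinguish `count` values."""
    return (count - 1).bit_length()


print(6 ** 4, bits_needed(6 ** 4))  # dense encoding of 6 options, 4 elements
print(7 ** 4, bits_needed(7 ** 4))  # dense encoding of 7 options, 4 elements


def encode_swizzle(selectors):
    """Pack four selectors at 3 bits each: the simple 4x3 = 12 bit form."""
    if len(selectors) != 4:
        raise ValueError("exactly four selectors are expected")
    word = 0
    for pos, sel in enumerate(selectors):
        word |= OPTIONS.index(sel) << (3 * pos)
    return word


print(f"{encode_swizzle(['x', 'y', 'skip', 'w']):012b}")
```

With 7 options even a dense encoding needs 12 bits, so the simple 3-bits-per-element form costs nothing extra.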

# note about INT predicate

001	ALWAYS (implicit)	Operation is not masked

this means by default that 001 will always be in nonpredicated ops, which seems anomalous.  would 000 be better to indicate "no predication"?

000 would indicate "the predicate is an immediate of all 1s" i.e. "no operation is masked out"

programmerjake:
I picked 0001 to indicate ALWAYS since that matches with the other semantics: the LSB bit is invert-the-mask, and you can think about the table as-if it is really:

this is the opposite of what feels natural.  inversion should switch *off* something.  also 000 is the canonical "this feature is off by default" number.

the constant should be an immediate of all 1s (not r0), which is the natural way to think of "predication is off".

i get the idea "r0 to be used therefore it is all zeros" but that makes 001 the "default", not 000.

| Value | Mnemonic    |
|-------|-------------|
| 000  | R0 (zero)   set to all 1s, naturally means "no predication" |
| 001  | ~R0 (~zero) |
| 010  | R3          |
| 011  | ~R3         |
| 100  | R10         |
| 101  | ~R10        |
| 110  | R30         |
| 111  | ~R30        |
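
A sketch of decoding the 3-bit field as laid out in the table: the upper two bits pick the register and the LSB is the invert flag.  `decode_int_pred` is a hypothetical helper, and it does not model the special "all 1s" meaning of 000, since that is the point still under discussion.

```python
# Upper two bits pick the register, the LSB inverts the mask (table above).
PRED_REGS = ["R0", "R3", "R10", "R30"]


def decode_int_pred(value):
    """Return (register name, invert flag) for a 3-bit predicate field."""
    if not 0 <= value <= 0b111:
        raise ValueError("predicate field is 3 bits")
    return PRED_REGS[value >> 1], bool(value & 1)


for value in range(8):
    reg, invert = decode_int_pred(value)
    print(f"{value:03b}", ("~" if invert else "") + reg)
```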


# CR Vectorisation

Some thoughts on this: the sensible (sane) number of CRs to have is 64.  A case could be made for having 128 but it is an awful lot.  64 CRs also has the advantage that it is only 4x 64 bit registers on a context-switch (programmerjake: yeah, but we already have 256 64-bit registers, a few more won't change much).

A practical issue stems from the fact that accessing the CR regfile on a non-aligned 8-CR boundary during Vector operations would significantly increase internal routing.  By aligning Vector Reads/Writes to 8 CRs this requires only 32 bit aligned read/writes. (programmerjake: simple solution -- rename them internally such that CR6 is the first one)

How to number them as vectors gets particularly interesting.  A case could be made for treating the 64 CRs as a square, and using CR numbering (CR0-7) to begin VL for-loop incrementing first by row and when rolling over to increment the column.  CR6 CR14 ... CR62 then CR7 CR15 ...

When the SV prefix marks them with 2 bits, one of those could be used to indicate scalar, and the other to indicate whether the 3 bit CR number is to be treated as a horizontal vector (CR incrementing straight by 1) or a vertical vector (incrementing by 8)

When there are 3 bits it would be possible to indicate whether to begin from a position offset by 4 (middle of matrix, edge of matrix).

Note: considerable care needs to be taken when putting these horiz/vertical CRs through the Dependency Matrices

Indexing algorithm illustrating how the H/V modes would work.  Note that BA is the 3 bit CR register field that normally, in scalar ISA, would reference only CR0-7 as CR[BA].

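    for i in range(VL):
        y = i % 8   # position within a group of 8
        x = i // 8  # which group of 8
        if verticalmode:
            CRINDEX = BA + y*8 + x  # step by 8, then move on by 1
        else:
            CRINDEX = BA*8 + i      # step straight by 1
        CR[CRINDEX] = ...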
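
The same indexing, wrapped in a function so the visiting order can be printed.  `cr_indices` is a hypothetical helper; it applies no wrap-around or range check, because these notes do not say what should happen past the last CR.

```python
def cr_indices(BA, VL, verticalmode):
    """List the CR numbers visited for a given 3-bit BA field and VL."""
    out = []
    for i in range(VL):
        y = i % 8
        x = i // 8
        if verticalmode:
            out.append(BA + y * 8 + x)
        else:
            out.append(BA * 8 + i)
    return out


print(cr_indices(6, 10, verticalmode=True))   # [6, 14, 22, 30, 38, 46, 54, 62, 7, 15]
print(cr_indices(6, 4, verticalmode=False))   # [48, 49, 50, 51]
```

The vertical case reproduces the CR6 CR14 ... CR62 then CR7 CR15 ... order mentioned earlier.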

# Should twin-predication (src=1, dest=1) have DEST SUBVL?

this is tricky: there isn't really enough space unless the reg scalar-vector extension (currently 3 bits per reg) is compacted to only 2 bits each, which would provide 2 extra bits.

so before adding this, an evaluation is needed: *is it necessary*?

what actual operations out of this list need - and work - with a separate SRC and DEST SUBVL?

* mv (the usual way that V* operations are created)
* exts* sign-extension
* rwlinm and other RS-RA shift operations
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

Evaluation:

* mv: yes.  these may need merge/split
* exts: no.  no transformation.
* rlwinm shift operations: no
* LD and ST: no
* FP ops: no
* CR ops: maybe on mvs, not on arithmetic.

therefore it makes no sense to have DEST SUBVL; it is better to have special mv operations instead.  see [[mv.vec]]