This extension amalgamates bit-manipulation primitives from many sources,
including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
Also included are DSP/Multimedia operations suitable for Audio/Video.
Vectorization and SIMD are removed: these are straight scalar (element)
operations making them suitable for embedded applications. Vectorization
Context is provided by [[openpower/sv]].
When combined with SV, scalar variants of bitmanip operations found in
    for i in range(64):
        RT[i] = lut2(CRs{BFA}, RB[i], RA[i])
When Vectorized with SVP64, as usual both source and destination may be
Vector or Scalar.
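The per-bit operation above can be modelled in a few lines of Python (a sketch, not the normative pseudo-code; the CR field `CRs{BFA}` is represented here as a plain 4-bit integer `lut4`):

```python
def lut2(lut4, a, b):
    # 4-bit lookup table indexed by the two input bits, LSB0 order
    idx = (b << 1) | a
    return (lut4 >> idx) & 1

def binlut(lut4, RB, RA, width=64):
    # RT[i] = lut2(CRs{BFA}, RB[i], RA[i]) for every element bit i
    RT = 0
    for i in range(width):
        RT |= lut2(lut4, (RB >> i) & 1, (RA >> i) & 1) << i
    return RT
```

With `lut4=0b1000` the lookup computes AND, `0b0110` gives XOR, and so on: all 16 two-input boolean functions are reachable from one dynamic instruction.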
*Programmer's note: a dynamic ternary lookup may be synthesised from
a pair of `binlut` instructions followed by a merging `ternlogi`.*
## crternlogi
another mode selection would be CRs not Ints.
CRB-Form:

| 0.5|6.8 |9.10|11.13|14.15|16.18|19.25|26.30| 31|
|----|----|----|-----|-----|-----|-----|-----|---|
| NN | BF | msk|BFA | msk | BFB | TLI | XO |TLI|
    for i in range(4):
        a,b,c = CRs[BF][i], CRs[BFA][i], CRs[BFB][i]
        if msk[i]: CRs[BF][i] = lut3(TLI, a, b, c)
This instruction is remarkably similar to the existing crops, `crand` etc.
which have been noted to be a 4-bit (binary) LUT. In effect `crternlogi`
is the ternary LUT version of crops, having an 8-bit LUT. However it
is an overwrite instruction in order to save on register file ports,
due to the mask requiring the contents of BF to be both read and
written.

Programmer's note: This instruction is useful when combined with Matrix REMAP
in "Inner Product" Mode, creating Warshall Transitive Closure that has many
applications in Computer Science.
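The pseudo-code can be cross-checked with a small Python model (a sketch; CR fields are modelled here as 4-bit integers in LSB0 bit order, and `TLI` is the 8-bit lookup immediate):

```python
def lut3(tli, a, b, c):
    # 8-bit lookup table indexed by the three input bits
    idx = (c << 2) | (b << 1) | a
    return (tli >> idx) & 1

def crternlogi(bf, bfa, bfb, tli, msk):
    # overwrite form: BF is read as the first operand and only the
    # mask-selected bits of BF are written back
    out = bf
    for i in range(4):
        if (msk >> i) & 1:
            a, b, c = (bf >> i) & 1, (bfa >> i) & 1, (bfb >> i) & 1
            out = (out & ~(1 << i) & 0xF) | (lut3(tli, a, b, c) << i)
    return out
```

For example `tli=0x96` (bits set at odd-popcount indices) selects 3-input XOR, with the mask leaving the deselected bits of BF untouched.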
## crbinlog
With ternary (LUT3) dynamic instructions being very costly,
and CR Fields being only 4 bits, a binary (LUT2) variant is better
CRB-Form:

| 0.5|6.8 |9.10|11.13|14.15|16.18|19.25|26.30| 31|
|----|----|----|-----|-----|-----|-----|-----|---|
| NN | BF | msk|BFA | msk | BFB | // | XO | //|
    for i in range(4):
        a,b = CRs[BF][i], CRs[BFA][i]
        if msk[i]: CRs[BF][i] = lut2(CRs[BFB], a, b)
When SVP64 Vectorized, any of the 4 operands may be Scalar or
Vector, including `BFB`, meaning that multiple different dynamic
lookups may be performed with a single instruction. Note that
this instruction is deliberately an overwrite in order to reduce
the number of register file ports required: like `crternlogi`
the contents of `BF` **must** be read due to the mask only
writing back to non-masked-out bits of `BF`.
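A matching Python sketch of the operation (CR fields again modelled as 4-bit LSB0 integers, with the contents of `BFB` acting as the dynamic 4-bit lookup table):

```python
def crbinlog(bf, bfa, bfb, msk):
    # dynamic LUT2: BFB's four bits are the truth table; overwrite
    # form, writing only the mask-selected bits back into BF
    out = bf
    for i in range(4):
        if (msk >> i) & 1:
            idx = (((bfa >> i) & 1) << 1) | ((bf >> i) & 1)
            out = (out & ~(1 << i) & 0xF) | (((bfb >> idx) & 1) << i)
    return out
```

Loading `BFB` with `0b1000` performs AND across the two CR fields, `0b1110` performs OR, and a Vector `BFB` applies a different boolean function per element.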
*Programmer's note: just as with binlut and ternlogi, a pair
of crbinlog instructions followed by a merging crternlogi may
be used to synthesise a dynamic ternary (LUT3) lookup.*
## min/max

required for the [[sv/av_opcodes]]
Signed and unsigned min/max for integers.
Signed/unsigned min/max gives more flexibility.
\[un]signed min/max instructions are specifically needed for vector
reduce min/max operations, which are pretty common.
+
X-Form
* PO=19, XO=----000011 `minmax RT, RA, RB, MMM`
* PO=19, XO=----000011 `minmax. RT, RA, RB, MMM`

see [[openpower/sv/rfc/ls013]] for `MMM` definition and pseudo-code.
implements all of (and more):
```
uint_xlen_t mins(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
}
```

Pseudo-code (shadd):

    n <- (RB)
    m <- sm + 1
    RT <- (n[m:XLEN-1] || [0]*m) + (RA)

Pseudo-code (shaddw):

    shift <- sm + 1                   # Shift is between 1-4
    n <- EXTS((RB)[XLEN/2:XLEN-1])    # Only use lower XLEN/2 bits of RB
    RT <- (n << shift) + (RA)         # Shift n, add RA

Pseudo-code (shadduw):

    n <- ([0]*(XLEN/2)) || (RB)[XLEN/2:XLEN-1]
    m <- sm + 1
    RT <- (n[m:XLEN-1] || [0]*m) + (RA)
```
uint_xlen_t shadd(uint_xlen_t RA, uint_xlen_t RB, uint8_t sm) {
    return (RB << (sm+1)) + RA;
}

uint_xlen_t shaddw(uint_xlen_t RA, uint_xlen_t RB, uint8_t sm) {
    uint_xlen_t n = (int_xlen_t)(RB << XLEN / 2) >> XLEN / 2;
    sm = sm & 0x3;
    return (n << (sm+1)) + RA;
}

uint_xlen_t shadduw(uint_xlen_t RA, uint_xlen_t RB, uint8_t sm) {
    uint_xlen_t n = RB & 0xFFFFFFFF;
    sm = sm & 0x3;
    return (n << (sm+1)) + RA;
}
```
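The C fragments can be cross-checked with a Python model (a sketch assuming XLEN=64; `sm` is the 2-bit shift-amount field, giving shifts of 1-4):

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def shadd(RA, RB, sm):
    # RT <- (RB << (sm+1)) + RA, modulo 2^XLEN
    return ((RB << ((sm & 3) + 1)) + RA) & MASK

def shadduw(RA, RB, sm):
    # zero-extend the low 32 bits of RB before the shift-and-add
    n = RB & 0xFFFFFFFF
    return ((n << ((sm & 3) + 1)) + RA) & MASK

def shaddw(RA, RB, sm):
    # sign-extend the low 32 bits of RB before the shift-and-add
    n = RB & 0xFFFFFFFF
    if n & 0x80000000:
        n -= 1 << 32
    return ((n << ((sm & 3) + 1)) + RA) & MASK
```

The three variants differ only in how RB is prepared before the shift: as-is, zero-extended from 32 bits, or sign-extended from 32 bits.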
Demo code: [[openpower/sv/grevlut.py]]
```
def lut2(imm, a, b):
    idx = b << 1 | a
    return (imm>>idx) & 1

def dorow(imm8, step_i, chunk_size):
    step_o = 0
    for j in range(64):
        if (j&chunk_size) == 0:
            imm = (imm8 & 0b1111)
        else:
            imm = (imm8>>4)
        a = (step_i>>j)&1
        b = (step_i>>(j ^ chunk_size))&1
        res = lut2(imm, a, b)
        step_o |= (res<<j)
    return step_o

def grevlut64(RA, RB, imm, iv):
    if RA is None: # RA=0
        x = 0x5555555555555555
    else:
        x = RA
    if iv:
        x = ~x
    shamt = RB & 63
    for i in range(6):
        step = 1<<i
        if (shamt & step):
            x = dorow(imm, x, step)
    return x & ((1<<64)-1)
```
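As a usage example (restating the demo functions so the snippet runs standalone): with `imm=0xCC` both LUT halves output the partner bit `b`, so each row simply swaps bits `j` and `j^step`, and grevlut degenerates into the classic generalized bit-reverse (grev); `RB=63` then reverses all 64 bits.

```python
def lut2(imm, a, b):
    return (imm >> ((b << 1) | a)) & 1

def dorow(imm8, step_i, chunk_size):
    step_o = 0
    for j in range(64):
        imm = imm8 & 0b1111 if (j & chunk_size) == 0 else imm8 >> 4
        a = (step_i >> j) & 1
        b = (step_i >> (j ^ chunk_size)) & 1
        step_o |= lut2(imm, a, b) << j
    return step_o

def grevlut64(RA, RB, imm, iv):
    x = 0x5555555555555555 if RA is None else RA
    if iv:
        x = ~x
    for i in range(6):
        step = 1 << i
        if (RB & 63) & step:
            x = dorow(imm, x, step)
    return x & ((1 << 64) - 1)

def bitrev64(x):
    # reference 64-bit bit-reversal, for comparison only
    return int('{:064b}'.format(x)[::-1], 2)
```

Other `imm` values similarly recover grevi-style butterfly permutations combined with per-stage boolean functions, which is the point of the fused LUT.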
A variant may specify different LUT-pairs per row,