[[!tag standards]]

# SV Vector Operations.

The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors).

Notes:

* Some of these could actually be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU).
* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].

Links:

* conflictd example
* specialist vector ops out of scope for this document
* [[simple_v_extension/specification/bitmanip]] previous version, contains pseudocode for sof, sif, sbf

# Vector

## conflictd

This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion, so that bit `j` of `result[i]` is set when element `i` matches an earlier element `j`. The instruction may be used for histograms (computed in parallel).

    input = [100, 100,   3, 100,   5, 100, 100,   3]
    conflict result = [
         0b00000000,    // Note: first element always zero
         0b00000001,    // 100 is present on #0
         0b00000000,
         0b00000011,    // 100 is present on #0 and #1
         0b00000000,
         0b00001011,    // 100 is present on #0, #1, #3
         0b00101011,    // 100 is present on #0, #1, #3 and #5
         0b00000100     // 3 is present on #2
    ]

Pseudocode:

    for i in range(VL):
        for j in range(i):              # only earlier elements (triangular)
            if src1[i] == src2[j]:
                result[i] |= 1 << j     # flag element j as a duplicate of i

## carry-propagate

* `((P|G)+G)^P`

```
P = (A | B) & Ci
G = (A & B)
```

The Stack Overflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in, which already includes ripple, with one bit of carry per element.

```
At each id, compute C[id] = A[id]+B[id]+0
Get G[id] = C[id] > radix -1
Get P[id] = C[id] == radix-1
Join all P[id] together, likewise G[id]
Compute newC = ((P|G)+G)^P
result[id] = (C[id] + newC[id]) % radix
```

Two versions: scalar int version and CR based version.

The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. The end bits go into XER.CA and CR0.ge.

The vector version takes CR0.so as carry in, and stores the end bits in CR0.so and CR.ge. If zero (no propagation) then CR0.eq is zero.

CR based version: TODO.
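
As an illustration of the `((P|G)+G)^P` steps described in this section, the following is a minimal Python sketch, not a definition of the instruction. It assumes little-endian digit vectors (element 0 is least significant) and packs P and G into integers with one bit per element. The function name `carry_propagate_add` and the `carry_in` parameter are hypothetical: `carry_in` is added into the sum only to mimic a carry such as XER.CA entering element 0, which goes beyond the `+0` shown in the steps.

```python
def carry_propagate_add(a, b, radix, carry_in=0):
    """Add two little-endian digit vectors using the ((P|G)+G)^P trick.

    Hypothetical helper for illustration only. Returns (digits, carry_out).
    """
    assert len(a) == len(b)
    vl = len(a)

    # Per-element sums, with no carry passed between elements yet.
    c = [a[i] + b[i] for i in range(vl)]

    # Pack the per-element flags into integers, one bit per element:
    # G (generate): the element overflows on its own.
    # P (propagate): the element would overflow if a carry arrived.
    g = sum(1 << i for i in range(vl) if c[i] > radix - 1)
    p = sum(1 << i for i in range(vl) if c[i] == radix - 1)

    # A single integer addition ripples the carries across all elements.
    # Bit i of new_c is the carry into element i; bit vl is the carry out.
    # Adding carry_in here is an assumption beyond the document's steps.
    new_c = ((p | g) + g + carry_in) ^ p

    result = [(c[i] + ((new_c >> i) & 1)) % radix for i in range(vl)]
    carry_out = (new_c >> vl) & 1
    return result, carry_out
```

For example, `carry_propagate_add([9, 9, 9, 1], [1, 0, 0, 0], 10)` returns `([0, 0, 0, 2], 0)`, that is 1999 + 1 = 2000 in radix 10: element 0 generates a carry, elements 1 and 2 propagate it, and element 3 absorbs it.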