X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fvector_ops.mdwn;h=1a69d92070cc2980debbb5d6aa30e5ee0487c546;hb=88a823723277b6f79040042f9f2c1a00b33d7f8d;hp=6a80ee3b87d8519693790cf69a78577c550d70c1;hpb=2e37f5cdb9e1f0e128eeb7c50a976c51717b1f06;p=libreriscv.git

diff --git a/openpower/sv/vector_ops.mdwn b/openpower/sv/vector_ops.mdwn
index 6a80ee3b8..1a69d9207 100644
--- a/openpower/sv/vector_ops.mdwn
+++ b/openpower/sv/vector_ops.mdwn
@@ -1,11 +1,17 @@
+[[!tag standards]]
+
 # SV Vector Operations.
 
-The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVC512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
+TODO merge old standards page [[simple_v_extension/vector_ops/]]
+
+The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
+
+Notes:
 
-However some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
-Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
+* Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
+* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
 
-.
 Links:
 
 *
@@ -13,12 +19,14 @@ Links:
 *
 * specialist vector ops out of scope for this document
+* [[simple_v_extension/specification/bitmanip]] previous version,
+  contains pseudocode for sof, sif, sbf
 
 # Vector
 
 ## conflictd
 
-This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in LD/ST operations. Two arrays of indices are given.
+This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
 
     input = [100, 100, 3, 100, 5, 100, 100, 3]
     conflict result = [
@@ -32,15 +40,22 @@ This is based on the AVX512 conflict detection instruction. Internally the logi
         0b00000100 // 3 is present on #2
     ]
 
+Pseudocode:
+
+    for i in range(VL):
+        for j in range(1, i):
+            if src1[i] == src2[j]:
+                result[j] |= 1<
+*
+*
+*
+*
+  `((P|G)+G)^P`
+*
+
+```
+ P = (A | B) & Ci
+ G = (A & B)
+```
+
+Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
+
+```
+ At each id, compute C[id] = A[id]+B[id]+0
+ Get G[id] = C[id] > radix -1
+ Get P[id] = C[id] == radix-1
+ Join all P[id] together, likewise G[id]
+ Compute newC = ((P|G)+G)^P
+ result[id] = (C[id] + newC[id]) % radix
+```
+
+two versions: scalar int version and CR based version.
+
+scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
+
+vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
+
+if zero (no propagation) then CR0.eq is zero
+
+CR based version, TODO.
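
The carry-propagation steps listed in the added pseudocode can be tried out in plain Python. What follows is only a minimal sketch under stated assumptions, not the proposed instruction's semantics: the function name `lane_carry_add`, the least-significant-first digit lists, and the optional `carry_in` argument are all hypothetical and chosen for illustration. It relies on P and G being disjoint per lane, so that `(P|G)+G` equals `P + (G<<1)`: each generated carry is injected one lane up and ripples through any run of propagate bits.

```python
def lane_carry_add(a, b, radix, carry_in=0):
    """Add two equal-length digit lists (least-significant digit first).

    Hypothetical helper: mirrors the pseudocode steps, with an assumed
    optional external carry-in that the pseudocode does not mention.
    """
    if len(a) != len(b):
        raise ValueError("digit lists must have the same length")
    if carry_in not in (0, 1):
        raise ValueError("carry_in must be 0 or 1")

    # Per-lane sums, computed independently with no carry between lanes.
    c = [x + y for x, y in zip(a, b)]

    # Join the per-lane flags into integer bitmasks (bit id = lane id).
    # G: the lane overflows by itself. P: the lane overflows only if a
    # carry arrives. A lane can never set both flags.
    g = sum(1 << i for i, s in enumerate(c) if s > radix - 1)
    p = sum(1 << i for i, s in enumerate(c) if s == radix - 1)

    # ((P|G)+G)^P: bit id of new_c is the carry INTO lane id, ripple
    # included. The bit just above the top lane is the carry out.
    new_c = ((p | g) + g + carry_in) ^ p

    result = [(s + ((new_c >> i) & 1)) % radix for i, s in enumerate(c)]
    carry_out = (new_c >> len(c)) & 1
    return result, carry_out
```

For example, `lane_carry_add([9, 9, 9, 1], [1, 0, 0, 0], 10)` returns `([0, 0, 0, 2], 0)`, which is 1999 + 1 = 2000 with digits stored least-significant first: lane 0 generates a carry, lanes 1 and 2 propagate it, and lane 3 absorbs it.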