From 25523fc9996ba4f385ab407b6fcdfc0f504a282e Mon Sep 17 00:00:00 2001
From: lkcl
Date: Fri, 6 May 2022 09:27:21 +0100
Subject: [PATCH]

---
 openpower/sv/SimpleV_rationale.mdwn | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn
index 24c4d4eb6..8845d0505 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -377,4 +377,16 @@ stripmining setup and teardown is not required. However a 2-wide Packed
 SIMD instruction is nowhere near as high a bang-per-buck ratio as a 64-wide
 Vector Length.
 
-Realistically,
+Realistically, however, for general use cases it is extremely common
+to have the Packed SIMD setup and teardown. `strncpy` for VSX is an
+astounding 240 hand-coded assembler instructions, where it is around
+12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
+for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
+the case of the IBM POWER9 a little-known design flaw results in
+contention between the L1 D and I Caches at the L2 Bus, slowing down
+execution even further.
+
+Additional savings come in the form of `SVREMAP`. This is a hardware
+index transformation system where the normally sequentially-linear
+element access may be "Re-Mapped" to a limited but algorithmically
+tailored deterministic schedule, for example Matrix Multiply, DCT, or FFT.

-- 
2.30.2