(no commit message)

author lkcl <lkcl@web>

Fri, 6 May 2022 08:27:21 +0000 (09:27 +0100)

committer IkiWiki <ikiwiki.info>

Fri, 6 May 2022 08:27:21 +0000 (09:27 +0100)
diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn

index 24c4d4eb6401e5d9010cf1f51bd4f4e43f1a98c7..8845d0505c7d737c64aa72416f95def4114ba66b 100644 (file)
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -377,4 +377,16 @@ stripmining setup and teardown is not required.  However a
  2-wide Packed SIMD instruction is nowhere near as high a bang-per-buck
  ratio as a 64-wide Vector Length.
  
-Realistically, 
+Realistically, however, for general use cases it is extremely common
+to need the Packed SIMD setup and teardown. `strncpy` for VSX is an
+astounding 240 hand-coded assembler instructions, whereas it is around
+12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
+for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
+the case of the IBM POWER9, which has a little-known design flaw, this
+results in contention between the L1 D and I Caches at the L2 Bus,
+slowing down execution even further.
+
+Additional savings come in the form of `SVREMAP`. This is a hardware
+index transformation system where the normally sequentially-linear
+element access may be "Re-Mapped" to a limited but algorithmically tailored
+deterministic schedule, for example Matrix Multiply, DCT, or FFT.
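
The index transformation described for `SVREMAP` can be illustrated in software. The following is a minimal sketch only, not the SVP64 hardware schedule: the names `matmul_schedule` and `remapped_multiply_add` are hypothetical, and the loop ordering is an assumption chosen for readability. It shows a single multiply-accumulate operation applied over a deterministic sequence of element offsets that replaces the usual 0, 1, 2, ... linear order.

```python
def matmul_schedule(rows, inner, cols):
    """Yield (a_idx, b_idx, c_idx) flat element offsets for C = A x B.

    Hypothetical stand-in for a hardware remap schedule: each step names
    which element of each (row-major) operand the next operation touches.
    """
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                yield i * inner + k, k * cols + j, i * cols + j


def remapped_multiply_add(a, b, rows, inner, cols):
    """Apply one multiply-accumulate per schedule step.

    The loop body is the same single operation every time; only the
    element offsets change, which is the point of the illustration.
    """
    c = [0] * (rows * cols)
    for a_idx, b_idx, c_idx in matmul_schedule(rows, inner, cols):
        c[c_idx] += a[a_idx] * b[b_idx]
    return c


a = [1, 2, 3, 4, 5, 6]      # 2x3 matrix, row-major
b = [7, 8, 9, 10, 11, 12]   # 3x2 matrix, row-major
result = remapped_multiply_add(a, b, 2, 3, 2)
print(result)               # [58, 64, 139, 154]
```

The schedule is kept separate from the arithmetic so that swapping in a different generator (for instance one producing DCT or FFT butterfly offsets) would leave the operation loop unchanged, which mirrors the separation the text describes.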