From bdef73c0c3b8abac753758da5a50276ba36b0e98 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sun, 8 May 2022 07:02:21 +0100
Subject: [PATCH]

---
 openpower/sv/SimpleV_rationale.mdwn | 36 +++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn
index aecdc58ef..97954ed1f 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -836,6 +836,42 @@ expand SVP64's capability for Matrices, currently limited to
 around 5x7 to 6x6 Matrices and constrained by the size of
 the register files (128 64-bit entries), to arbitrary (massive) sizes.
 
+**Use-case: Matrix and Convolutions**
+
+Imagine a large Matrix scenario, with several values close to zero that
+could be skipped: no need to include zero-multiplications.
+SVP64 is able to do what is termed "Vertical-First" Vectorisation,
+combined with SVREMAP Matrix Schedules.  Imagine that SVREMAP has been
+extended, Snitch-style, to perform a deterministic memory-array walk of
+a large Matrix.
+
+Let us also imagine that the Matrices are stored in Memory with PEs
+attached, and that the PEs are fully functioning Power ISA with Draft
+SVP64, but that their Multiply capability is not as good as that of the
+main CPU. We therefore want the PEs to feed the sparse data to the main CPU.
+
+* The ZOLC SVREMAP System running on the main CPU generates a Matrix
+  Memory-Load Schedule.
+* The Schedule is sent to the PEs, next to the Memory, via OpenCAPI.
+* The PEs are also sent the Basic Block to be executed on each
+  Memory Load (each element of the Matrices to be multiplied).
+* The PEs execute the Basic Block and **exclude**, in a deterministic
+  fashion, any elements containing Zero values.
+* Non-zero elements are sent, via OpenCAPI, to the main CPU, which
+  queues sequences of Multiply-and-Accumulate, and feeds the results
+  back, again via OpenCAPI, to the PEs next to the Memory.
+* The PEs, which are tracking the Sparse Conditions, know where
+  to store the results received.
+
+In essence this is near-identical to the original Snitch concept
+except that there are, like Extra-V, PEs able to perform
+conditional testing of the data as it goes both to and from the
+main CPU.  In this way a large Sparse Matrix Multiply or Convolution
+may be achieved without having to pass unnecessary data through
+L1/L2/L3 Caches only to find, at the CPU, that it is zero.
+
+**Summary**
+
 Bottom line is that there is a clear roadmap towards solving a long
 standing problem facing Computer Science and doing so in a way that
 reduces power consumption reduces algorithm completion time and reduces
-- 
2.30.2