From cad60e729727017520dac103fc26a4c5873b949c Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Fri, 27 Apr 2018 05:30:03 +0100
Subject: [PATCH]

---
 harmonised_rvv_rvp/discussion.mdwn | 35 ++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
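
Note: the butterfly remap that the new section describes only in comments can be sketched as below. This is a minimal sketch, not the proposed hardware behaviour. The assignment of the 6 CSR bits to individual swaps, the `STAGES` table, and the helpers `apply_remap` and `reachable_permutations` are all assumptions made for illustration. The three stages (swap 0,1 and 2,3; swap 0,2 and 1,3; swap 0,1 and 2,3 again) follow the comments in the patch.

```python
# Hypothetical bit layout (an assumption, not fixed by the discussion):
#   bit 0: swap 0,1    bit 1: swap 2,3    (first stage)
#   bit 2: swap 0,2    bit 3: swap 1,3    (second stage)
#   bit 4: swap 0,1    bit 5: swap 2,3    (third stage)
STAGES = [(0, 1), (2, 3), (0, 2), (1, 3), (0, 1), (2, 3)]


def butterfly_remap(cross_bits):
    """Return, for each of the 4 destination positions, the source index."""
    idx = [0, 1, 2, 3]
    for bit, (a, b) in enumerate(STAGES):
        # Each set bit enables one conditional swap of two positions.
        if (cross_bits >> bit) & 1:
            idx[a], idx[b] = idx[b], idx[a]
    return tuple(idx)


def apply_remap(elements, cross_bits):
    """Permute a 4-element group according to the 6 cross bits."""
    return [elements[src] for src in butterfly_remap(cross_bits)]


def reachable_permutations():
    """Collect every distinct permutation over all 64 bit settings."""
    return {butterfly_remap(bits) for bits in range(64)}
```

Enumerating all 64 settings with `reachable_permutations()` is one way to check the claim that the three stages reach every ordering of a 4-element group.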

diff --git a/harmonised_rvv_rvp/discussion.mdwn b/harmonised_rvv_rvp/discussion.mdwn
index 128409895..1d86fcc9a 100644
--- a/harmonised_rvv_rvp/discussion.mdwn
+++ b/harmonised_rvv_rvp/discussion.mdwn
@@ -7,3 +7,38 @@
 * Likewise the last (and first) of 2-wide 16-bit operations?
 * What about predication within a 4-wide 8-bit group?
 * Likewise what about predication within a 2-wide 16-bit group?
+
+## Providing "cross-over" between elements in a group
+
+What do you think of the "CSR cross[32][6]" idea?  Sorry, the below may
+not be exactly clear: it is basically a way to generalise all
+cross-operations, so that even SUNPKD810 rt, ra and ZUNPKD810 rt, ra
+would reduce down to one instruction, as opposed to 8 right now.
+
+    def butterfly_remap(remap_me):
+        # hmmm, a little hazy on the details here....
+        # help, help! logic-dyslexia kicking in!
+        # erm, do some crossover using the 6 bits from
+        # the CSR cross map.  the first 2 bits swap
+        # elements in index positions 0,1 and 2,3;
+        # the second 2 bits swap elements in positions 0,2 and 1,3;
+        # the third 2 bits swap 0,1 and 2,3 a second time.
+        # that gives the full set of all permutations.
+        return something, something
+
+    def crossover(elidx, destreg): 
+        base = elidx & ~0x7 
+        return butterfly_remap(CSR_cross[destreg][elidx & 0x7]) 
+
+    def op(v1, v2, v3):
+        for i in range(vlen):
+            remap_src1, remap_src2 = crossover(i, v1)
+            # remap_srcN references byte offsets? erm.... :)
+            GPR[v1] = scalar_op(GPR[v2][remap_src1],
+                                GPR[v3][remap_src2])
+
+Otherwise, VSHUFFLE and so on (and possibly xBitManip) would
+need to be used.  xBitManip would not be a bad idea, except that
+VLIW-like DSP architectures (TI C67*) need to be considered:
+they do not do register-renaming, and have fixed pipeline
+phases with no stalling on register-dependencies.
-- 
2.30.2