From: cand@51b69dee28eeccfe0f04790433b843689895c6e3 <cand@web>
Date: Sat, 12 Dec 2020 16:57:02 +0000 (+0000)
Subject: 4-gather and 4-scatter pseudocode
X-Git-Tag: convert-csv-opcode-to-binary~1386
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=5c239af9da570f2c7d149386fab0ebe07b4c7d9b;p=libreriscv.git

4-gather and 4-scatter pseudocode
---

diff --git a/openpower/sv/av_opcodes.mdwn b/openpower/sv/av_opcodes.mdwn
index 60e6ad0ae..396b958c4 100644
--- a/openpower/sv/av_opcodes.mdwn
+++ b/openpower/sv/av_opcodes.mdwn
@@ -114,6 +114,27 @@ This should be separated to a horizontal multiply and a horizontal add. How a ho
 
 That has the issue that's it's a massive PITA to code, plus it's slow. Plus there's the "access to non-4-offset regs stalls". Even if there's no ready operation, it should be made easier and faster than a manual mapreduce loop.
 
+--
+
+As an interim solution, 4-element gathers were discussed. Using 4 elements would also make them useful for pixel packing, not just for general vector gathers, because OR and ADD give the same result when the operands' set bits don't overlap.
+
+    gather-add: d = a.x + a.y + a.z + a.w
+    gather-mul: d = a.x * a.y * a.z * a.w
+
+But can the SV loop increment the src reg # by 4? Hmm.
+
+The idea then leads to the opposite operation, a 1-to-4 bit scatter instruction. Like gather, it could be implemented with a normal loop, but a dedicated instruction would be faster for certain uses.
+
+    bit-scatter dest, src, bits
+
+    bit-scatter rd, rs, 8 # assuming source and dest are 32-bit wide
+    rd   = (rs >> (0 * 8)) & ((1 << 8) - 1)
+    rd+1 = (rs >> (1 * 8)) & ((1 << 8) - 1)
+    rd+2 = (rs >> (2 * 8)) & ((1 << 8) - 1)
+    rd+3 = (rs >> (3 * 8)) & ((1 << 8) - 1)
+
+So at the start you have an RGBA packed pixel in one 32-bit register, and at the end each channel sits in its own register, in the low bits, masked so that only the relevant bits remain.
+
 ## vec_mul*
 
 There should be both a same-width multiply and a widening multiply. Signed and unsigned versions. Optionally saturated.