From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Sat, 18 Jun 2022 11:18:21 +0000 (+0100)
Subject: move vwctor swizzle page to discussion
X-Git-Tag: opf_rfc_ls005_v1~1721
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=8adf194d8908cae6fea92621f1127bba480402d4;p=libreriscv.git

move vwctor swizzle page to discussion
---

diff --git a/openpower/sv/mv.swizzle/discussion.mdwn b/openpower/sv/mv.swizzle/discussion.mdwn
new file mode 100644
index 000000000..34322e8a4
--- /dev/null
+++ b/openpower/sv/mv.swizzle/discussion.mdwn
@@ -0,0 +1,49 @@
+[[!tag standards]]
+
+# SV Vector Prefix Swizzle
+
+* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
+* <https://libre-soc.org/simple_v_extension/specification/mv.x/>
+
+3D GPU operations on batches of vec2, vec3 and vec4 often require re-ordering of the elements in an "out of lane" fashion with respect to standard high performance non-GPU-centric Vector Processors.  Examples include:
+
+* Normalisation of Vectors of XYZ with respect to one dimension
+* Alteration of ARGB pixel vectors with respect to opacity (A)
+* Adjustment of YUV vectors with respect to luminosity
+
+and many more.  Lane-based Vector Processors not having the 2/3/4 inter-lane crossing have some difficulty processing such data and require it to be pushed into memory and retrieved, which is prohibitively costly in both instructions, time, and power consumption.
+
+The lane reordering cost is so great and the requirement so common that it easily justifies augmenting the ISA of a GPU to be able to specify the reordering of vec2/3/4 elements, often drastically increasing the instruction size in the process.
+
+The reason for the dramatic increase is that the reordering of each element in vec4 requires 2 bits per element, plus a predicate mask.  This means a minimum of 3 bits per element: 12 bits for a vec4, and if there are 2 src operands this is a whopping 24 bits of immediate data, per instruction.
+
+There is also benefit to encoding some useful immediates into src operands, on a per sub-element basis: being able to specify for example that the Z element of a vec4 is to be 1.0 saves a complex LD-immediate merging operation for that lane.
+
+# Options
+
+## Predication plus indices
+
+* 4 bits for predication
+* 2 bits per element
+
+## SUBVL plus indices
+
+* SUBVL specifies the length (vec2/3/4)
+* However index selection is 2 bits per element
+* Therefore the src SUBVL must be separate and distinct from the dest SUBVL
+
+## Predication mixed with immediates and indices
+
+* Three bits per element.
+* One encoding (0b000) indicates "mask"
+* Four encodings (0b1NN) indicate vec4 selection
+* Three remaining indices indicate constants
+  - 0 (or 0.0)
+  - 1 (or 1.0)
+  - -1 (or -1.0) or some other option?
+
+# mv.swizzle
+
+is definitely needed.  TBD encoding.  requires 1 src, 1 dest, and 12 bits immediate minimum.
+
+[[sv/mv.swizzle]]
diff --git a/openpower/sv/vector_swizzle.mdwn b/openpower/sv/vector_swizzle.mdwn
deleted file mode 100644
index 34322e8a4..000000000
--- a/openpower/sv/vector_swizzle.mdwn
+++ /dev/null
@@ -1,49 +0,0 @@
-[[!tag standards]]
-
-# SV Vector Prefix Swizzle
-
-* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
-* <https://libre-soc.org/simple_v_extension/specification/mv.x/>
-
-3D GPU operations on batches of vec2, vec3 and vec4 often require re-ordering of the elements in an "out of lane" fashion with respect to standard high performance non-GPU-centric Vector Processors.  Examples include:
-
-* Normalisation of Vectors of XYZ with respect to one dimension
-* Alteration of ARGB pixel vectors with respect to opacity (A)
-* Adjustment of YUV vectors with respect to luminosity
-
-and many more.  Lane-based Vector Processors not having the 2/3/4 inter-lane crossing have some difficulty processing such data and require it to be pushed into memory and retrieved, which is prohibitively costly in both instructions, time, and power consumption.
-
-The lane reordering cost is so great and the requirement so common that it easily justifies augmenting the ISA of a GPU to be able to specify the reordering of vec2/3/4 elements, often drastically increasing the instruction size in the process.
-
-The reason for the dramatic increase is that the reordering of each element in vec4 requires 2 bits per element, plus a predicate mask.  This means a minimum of 3 bits per element: 12 bits for a vec4, and if there are 2 src operands this is a whopping 24 bits of immediate data, per instruction.
-
-There is also benefit to encoding some useful immediates into src operands, on a per sub-element basis: being able to specify for example that the Z element of a vec4 is to be 1.0 saves a complex LD-immediate merging operation for that lane.
-
-# Options
-
-## Predication plus indices
-
-* 4 bits for predication
-* 2 bits per element
-
-## SUBVL plus indices
-
-* SUBVL specifies the length (vec2/3/4)
-* However index selection is 2 bits per element
-* Therefore the src SUBVL must be separate and distinct from the dest SUBVL
-
-## Predication mixed with immediates and indices
-
-* Three bits per element.
-* One encoding (0b000) indicates "mask"
-* Four encodings (0b1NN) indicate vec4 selection
-* Three remaining indices indicate constants
-  - 0 (or 0.0)
-  - 1 (or 1.0)
-  - -1 (or -1.0) or some other option?
-
-# mv.swizzle
-
-is definitely needed.  TBD encoding.  requires 1 src, 1 dest, and 12 bits immediate minimum.
-
-[[sv/mv.swizzle]]