From: Luke Kenneth Casson Leighton Date: Sat, 18 Jun 2022 11:18:21 +0000 (+0100) Subject: move vwctor swizzle page to discussion X-Git-Tag: opf_rfc_ls005_v1~1721 X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=8adf194d8908cae6fea92621f1127bba480402d4;p=libreriscv.git move vwctor swizzle page to discussion --- diff --git a/openpower/sv/mv.swizzle/discussion.mdwn b/openpower/sv/mv.swizzle/discussion.mdwn new file mode 100644 index 000000000..34322e8a4 --- /dev/null +++ b/openpower/sv/mv.swizzle/discussion.mdwn @@ -0,0 +1,49 @@ +[[!tag standards]] + +# SV Vector Prefix Swizzle + +* +* + +3D GPU operations on batches of vec2, vec3 and vec4 often require re-ordering of the elements in an "out of lane" fashion with respect to standard high performance non-GPU-centric Vector Processors. Examples include: + +* Normalisation of Vectors of XYZ with respect to one dimension +* Alteration of ARGB pixel vectors with respect to opacity (A) +* Adjustment of YUV vectors with respect to luminosity + +and many more. Lane-based Vector Processors not having the 2/3/4 inter-lane crossing have some difficulty processing such data and require it to be pushed into memory and retrieved, which is prohibitively costly in both instructions, time, and power consumption. + +The lane reordering cost is so great and the requirement so common that it easily justifies augmenting the ISA of a GPU to be able to specify the reordering of vec2/3/4 elements, often drastically increasing the instruction size in the process. + +The reason for the dramatic increase is that the reordering of each element in vec4 requires 2 bits per element, plus a predicate mask. This means a minimum of 3 bits per element: 12 bits for a vec4, and if there are 2 src operands this is a whopping 24 bits of immediate data, per instruction. + +There is also benefit to encoding some useful immediates into src operands, on a per sub-element basis: being able to specify for example that the Z element of a vec4 is to be 1.0 saves a complex LD-immediate merging operation for that lane. + +# Options + +## Predication plus indices + +* 4 bits for predication +* 2 bits per element + +## SUBVL plus indices + +* SUBVL specifies the length (vec2/3/4) +* However index selection is 2 bits per element +* Therefore the src SUBVL must be separate and distinct from the dest SUBVL + +## Predication mixed with immediates and indices + +* Three bits per element. +* One encoding (0b000) indicates "mask" +* Four encodings (0b1NN) indicate vec4 selection +* Three remaining indices indicate constants + - 0 (or 0.0) + - 1 (or 1.0) + - -1 (or -1.0) or some other option? + +# mv.swizzle + +is definitely needed. TBD encoding. requires 1 src, 1 dest, and 12 bits immediate minimum. + +[[sv/mv.swizzle]] diff --git a/openpower/sv/vector_swizzle.mdwn b/openpower/sv/vector_swizzle.mdwn deleted file mode 100644 index 34322e8a4..000000000 --- a/openpower/sv/vector_swizzle.mdwn +++ /dev/null @@ -1,49 +0,0 @@ -[[!tag standards]] - -# SV Vector Prefix Swizzle - -* -* - -3D GPU operations on batches of vec2, vec3 and vec4 often require re-ordering of the elements in an "out of lane" fashion with respect to standard high performance non-GPU-centric Vector Processors. Examples include: - -* Normalisation of Vectors of XYZ with respect to one dimension -* Alteration of ARGB pixel vectors with respect to opacity (A) -* Adjustment of YUV vectors with respect to luminosity - -and many more. Lane-based Vector Processors not having the 2/3/4 inter-lane crossing have some difficulty processing such data and require it to be pushed into memory and retrieved, which is prohibitively costly in both instructions, time, and power consumption. - -The lane reordering cost is so great and the requirement so common that it easily justifies augmenting the ISA of a GPU to be able to specify the reordering of vec2/3/4 elements, often drastically increasing the instruction size in the process. - -The reason for the dramatic increase is that the reordering of each element in vec4 requires 2 bits per element, plus a predicate mask. This means a minimum of 3 bits per element: 12 bits for a vec4, and if there are 2 src operands this is a whopping 24 bits of immediate data, per instruction. - -There is also benefit to encoding some useful immediates into src operands, on a per sub-element basis: being able to specify for example that the Z element of a vec4 is to be 1.0 saves a complex LD-immediate merging operation for that lane. - -# Options - -## Predication plus indices - -* 4 bits for predication -* 2 bits per element - -## SUBVL plus indices - -* SUBVL specifies the length (vec2/3/4) -* However index selection is 2 bits per element -* Therefore the src SUBVL must be separate and distinct from the dest SUBVL - -## Predication mixed with immediates and indices - -* Three bits per element. -* One encoding (0b000) indicates "mask" -* Four encodings (0b1NN) indicate vec4 selection -* Three remaining indices indicate constants - - 0 (or 0.0) - - 1 (or 1.0) - - -1 (or -1.0) or some other option? - -# mv.swizzle - -is definitely needed. TBD encoding. requires 1 src, 1 dest, and 12 bits immediate minimum. - -[[sv/mv.swizzle]]