[[!tag standards]]

# SV Vector Prefix Swizzle

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://libre-soc.org/simple_v_extension/specification/mv.x/>

3D GPU operations on batches of vec2, vec3 and vec4 often require re-ordering of the elements in an "out of lane" fashion with respect to standard high-performance, non-GPU-centric Vector Processors.  Examples include:

* Normalisation of Vectors of XYZ with respect to one dimension
* Alteration of ARGB pixel vectors with respect to opacity (A)
* Adjustment of YUV vectors with respect to luminosity

and many more.  Lane-based Vector Processors that lack 2/3/4-element inter-lane crossing have some difficulty processing such data: it must be pushed out to memory and retrieved again, which is prohibitively costly in instruction count, time, and power consumption.

The lane reordering cost is so great, and the requirement so common, that it easily justifies augmenting the ISA of a GPU to be able to specify the reordering of vec2/3/4 elements, even though doing so often drastically increases the instruction size.

The reason for the dramatic increase is that reordering a vec4 requires a 2-bit source index per element, plus a 1-bit predicate mask per element.  This means a minimum of 3 bits per element: 12 bits for a vec4, and with 2 src operands that becomes a whopping 24 bits of immediate data per instruction.
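
The arithmetic above can be checked with a short sketch.  The function name `swizzle_immediate_bits` and its parameters are purely illustrative and are not part of any specification.

```python
def swizzle_immediate_bits(elements=4, index_bits=2, mask_bits=1, src_operands=1):
    """Immediate bits needed to swizzle `src_operands` sub-vectors.

    Each element needs `index_bits` to select a source element plus
    `mask_bits` for its predicate bit.
    """
    return elements * (index_bits + mask_bits) * src_operands


print(swizzle_immediate_bits())                # one vec4 src operand: 12
print(swizzle_immediate_bits(src_operands=2))  # two vec4 src operands: 24
```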

There is also benefit to encoding some useful immediates into src operands, on a per sub-element basis: being able to specify, for example, that the Z element of a vec4 is to be 1.0 saves a complex LD-immediate merging operation for that lane.

# Options

## Predication plus indices

* 4 bits for predication (one per vec4 element)
* 2 bits of source index per element
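
A minimal Python model of this option might look as follows.  The name `swizzle_predicated`, the bit ordering of the mask (bit 0 controls element 0), and the choice to leave masked-out dest elements unchanged rather than zeroing them are all assumptions made for illustration only.

```python
def swizzle_predicated(src, dest, indices, mask):
    """Return a copy of `dest` with predicated elements taken from `src`.

    indices: one 2-bit source index (0-3) per dest element
    mask:    4-bit predicate; bit i set means dest element i is written
    """
    out = list(dest)
    for i, idx in enumerate(indices):
        if (mask >> i) & 1:
            out[i] = src[idx & 0b11]
    return out


xyzw = [1.0, 2.0, 3.0, 4.0]
old = [0.0, 0.0, 0.0, 0.0]
# Reverse the element order, but only write dest elements 0 and 2.
print(swizzle_predicated(xyzw, old, [3, 2, 1, 0], 0b0101))
# [4.0, 0.0, 2.0, 0.0]
```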

## SUBVL plus indices

* SUBVL specifies the length (vec2/3/4)
* However, index selection is 2 bits per element, so an index can select any of up to four src elements regardless of the dest length
* Therefore the src SUBVL must be separate and distinct from the dest SUBVL
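
To illustrate why the two lengths need to be independent, here is a small sketch in which a vec2 is built from a vec4 and vice versa.  The helper `swizzle_subvl` is hypothetical, and treating an index beyond the src length as an error is an assumption; the text above does not say how that case would be handled.

```python
def swizzle_subvl(src, dest_subvl, indices):
    """Build a dest sub-vector of length `dest_subvl` from `src`.

    The src SUBVL is simply len(src); it need not equal dest_subvl.
    Each index is a 2-bit selector into src.
    """
    src_subvl = len(src)
    if not 2 <= src_subvl <= 4 or not 2 <= dest_subvl <= 4:
        raise ValueError("SUBVL must be 2, 3 or 4")
    if len(indices) != dest_subvl:
        raise ValueError("need one index per dest element")
    for idx in indices:
        if not 0 <= idx < src_subvl:
            # Assumption: selecting beyond the src length is an error.
            raise ValueError("index selects beyond src SUBVL")
    return [src[idx] for idx in indices]


argb = [0.9, 0.1, 0.2, 0.3]
print(swizzle_subvl(argb, 2, [0, 3]))          # vec4 -> vec2: [0.9, 0.3]
print(swizzle_subvl([5, 7], 4, [0, 0, 1, 1]))  # vec2 -> vec4: [5, 5, 7, 7]
```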

## Predication mixed with immediates and indices

* Three bits per element.
* One encoding (0b000) indicates "mask"
* Four encodings (0b1NN) indicate vec4 selection, with NN as the element index
* The three remaining encodings (0b001 to 0b011) indicate constants
  - 0 (or 0.0)
  - 1 (or 1.0)
  - -1 (or -1.0) or some other option?
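
The 3-bit scheme above could be decoded roughly as in the sketch below.  Which of the three constant encodings maps to which value is not decided in the text, so the `CONSTANTS` table is a hypothetical assignment, as is the interpretation of "mask" as leaving the dest element untouched.

```python
# Hypothetical assignment of the three constant encodings.
CONSTANTS = {0b001: 0.0, 0b010: 1.0, 0b011: -1.0}


def swizzle3(src, dest, codes):
    """Apply one 3-bit swizzle code per dest element."""
    out = list(dest)
    for i, code in enumerate(codes):
        if code == 0b000:
            continue                     # "mask": element is not written
        if code & 0b100:
            out[i] = src[code & 0b011]   # 0b1NN: select src element NN
        else:
            out[i] = CONSTANTS[code]     # 0b001..0b011: constant
    return out


xyzw = [0.5, 0.25, 0.75, 2.0]
# X and Y copied from src, Z forced to 1.0, W masked out.
print(swizzle3(xyzw, [9.0] * 4, [0b100, 0b101, 0b010, 0b000]))
# [0.5, 0.25, 1.0, 9.0]
```

This shows the saving mentioned earlier: setting Z to 1.0 needs no separate load-immediate and merge step.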

# mv.swizzle

A `mv.swizzle` instruction is definitely needed.  Its encoding is TBD.  It requires 1 src, 1 dest, and a minimum of 12 bits of immediate.
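
Since the encoding is TBD, the following is only a sketch of how four 3-bit per-element codes could fill a 12-bit immediate.  The field order (element 0 in the least significant bits) and the names `pack_swizzle` and `unpack_swizzle` are assumptions, not a proposed layout.

```python
def pack_swizzle(codes):
    """Pack 3-bit per-element codes into one immediate, element 0 lowest."""
    imm = 0
    for i, code in enumerate(codes):
        if not 0 <= code <= 0b111:
            raise ValueError("each code must fit in 3 bits")
        imm |= code << (3 * i)
    return imm


def unpack_swizzle(imm, elements=4):
    """Inverse of pack_swizzle: recover the per-element 3-bit codes."""
    return [(imm >> (3 * i)) & 0b111 for i in range(elements)]


imm = pack_swizzle([0b100, 0b101, 0b010, 0b000])
print(f"{imm:012b}")        # 000010101100
print(unpack_swizzle(imm))  # [4, 5, 2, 0]
```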

[[sv/mv.swizzle]]