simple_v_extension/vector_ops.mdwn

[[!tag standards]]

# Vector Operations Extension to SV

This extension defines vector operations that would otherwise take several cycles to complete in software. Because the priority in 3D is to compute as many pixels per clock as possible, the normal RISC rules (reduce opcode count and make heavy use of macro-op fusion) do not necessarily apply.

This extension is usually dependent on SV SUBVL being implemented. When SUBVL is set to define the length of a subvector, the operations in this extension interpret the elements of that subvector as a single vector.

Normally in SV all element operations are scalar and independent, so they may inherently be parallelised, and the result is a vector of length exactly equal to that of the input vectors.

In this extension, the subvector itself is typically the unit, although some operations also work on scalars or standard vectors, and some produce a scalar result that depends on all elements within the vector arguments.

However, given that some of the parameters are vectors (with and without SUBVL set) and some are scalars (where SUBVL does not apply), clear rules need to be defined as to how the operations work.

Examples which can require SUBVL include cross product, and may in future include complex numbers.

## CORDIC

6 opcode options (fmt3):

* CORDIC.lin.rot vd, vs, beta
* CORDIC.cir.rot vd, vs, beta
* CORDIC.hyp.rot vd, vs, beta
* CORDIC.lin.vec vd, vs, beta
* CORDIC.cir.vec vd, vs, beta
* CORDIC.hyp.vec vd, vs, beta

| Instr                   | result | src1 | src2 | SUBVL | VL  | Notes              |
| ----------------------- | ------ | ---- | ---- | ----- | --- | ------------------ |
| CORDIC.x.t vd, vs1, rs2 | vec2   | vec2 | scal | 2     | any | src2 ignores SUBVL |

In the table, vs1 corresponds to vs and rs2 to beta; x is one of lin, cir or hyp, and t is rot or vec.

SUBVL must be set to 2 and applies to vd and vs. SUBVL is *ignored* on beta.  vd and vs must be marked as vectors.

VL may be applied.  beta may be a scalar, in which case it applies across all the subvectors in vd and vs. Single predication, sourced from vd, is also permitted, as is swizzle.

Encodings in which vd or vs is not marked as a vector are reserved.

CORDIC is an extremely general-purpose algorithm useful for a huge number
of diverse purposes.  In its full form it does however require quite a
few parameters, one of which is a vector, making it awkward to include in
a standard "scalar" ISA.  Additionally the coordinates can be set to circular,
linear or hyperbolic, producing three different modes, and the algorithm
may also be run in either "vector" mode or "rotation" mode, giving the six
opcode options listed above.  See [[discussion]].
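
The three coordinate systems and two modes can be sketched with one unified iteration (Walther's formulation). This is a floating-point model only, not the instruction's definition: the names `cordic`, `COORDS` and `shift_sequence` are hypothetical, and treating (x, y) as the vec2 in vs and z as beta is an assumption. Hardware would instead use fixed-point shifts and ROM tables.

```python
import math

# Hypothetical mapping of the opcode suffixes onto the unified CORDIC
# parameter m: +1 circular, 0 linear, -1 hyperbolic.
COORDS = {"cir": 1, "lin": 0, "hyp": -1}


def shift_sequence(m, n):
    """Shift amounts for about n iterations.

    Hyperbolic mode starts at 1 and repeats steps 4, 13, 40, ...
    which is needed for convergence.
    """
    if m != -1:
        return list(range(n))
    seq, i, repeat = [], 1, 4
    while len(seq) < n:
        seq.append(i)
        if i == repeat:
            seq.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return seq


def cordic(x, y, z, coords="cir", mode="rot", n=40):
    """Return (x, y, z) after CORDIC, with the accumulated gain divided out."""
    m = COORDS[coords]
    gain = 1.0
    for i in shift_sequence(m, n):
        t = 2.0 ** -i
        step = math.atan(t) if m == 1 else t if m == 0 else math.atanh(t)
        if mode == "rot":
            d = 1.0 if z >= 0.0 else -1.0   # rotation mode: drive z towards 0
        else:
            d = -1.0 if y >= 0.0 else 1.0   # vector mode: drive y towards 0 (x > 0 assumed)
        x, y, z = x - m * d * y * t, y + d * x * t, z - d * step
        gain *= math.sqrt(1.0 + m * t * t)
    return x / gain, y / gain, z
```

In this model, `cordic(1.0, 0.0, beta, "cir", "rot")` approximates (cos(beta), sin(beta)), `"lin"` rotation computes y + x*z, and `"cir"` vector mode returns the magnitude in x and the angle in z. Inputs must stay within the convergence range of each mode (roughly |z| < 1.74 for circular and |z| < 1.11 for hyperbolic).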

CORDIC can also be used for performing DCT.  See
<https://arxiv.org/abs/1606.02424>

Several papers describe RADIX-4 CORDIC for efficient pipelining.  Each stage requires its own ROM tables, which can get costly.  Two combinatorial blocks may be chained together to double the RADIX and halve the pipeline depth, at the cost of doubling the latency of each stage.

Also, getting good accuracy, particularly at the limits of the CORDIC input range, requires internal computations at double the bitwidth of the output. This is similar to how MUL requires double the bitwidth to compute.

Links:

* <http://www.myhdl.org/docs/examples/sinecomp/>
* <https://www.atlantis-press.com/proceedings/jcis2006/232>

## Vector cross product

* VCROSS vd, vs1, vs2

Result is the cross product of x and y, where x is vs1 and y is vs2.

SUBVL must be set to 3, and all regs must be vectors. A nonzero VL produces multiple results in vd (one per subvector).

The resulting components are, in order:

    x[1] * y[2] - y[1] * x[2]
    x[2] * y[0] - y[2] * x[0]
    x[0] * y[1] - y[0] * x[1]

All the operands must be vectors of 3 components of a floating-point type.
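
A minimal sketch of these rules, assuming the subvectors are laid out contiguously (VL groups of 3 elements) and ignoring predication. The function name `vcross` and the flat-list register model are hypothetical.

```python
def vcross(vs1, vs2, vl=1):
    """Model of VCROSS with SUBVL=3: vs1 and vs2 are flat lists of vl * 3 values."""
    vd = []
    for i in range(vl):
        x = vs1[3 * i:3 * i + 3]   # i-th subvector of the first source
        y = vs2[3 * i:3 * i + 3]   # i-th subvector of the second source
        vd += [
            x[1] * y[2] - y[1] * x[2],
            x[2] * y[0] - y[2] * x[0],
            x[0] * y[1] - y[0] * x[1],
        ]
    return vd
```

With `vl=2` the call consumes six elements from each source and writes six elements, i.e. two independent vec3 results.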

Pseudocode:

    vec3 a, b; // elements in order a.x, a.y, a.z
    // compute a cross b:
    vec3 t1 = a.yzx; // produce vector [a.y, a.z, a.x]
    vec3 t2 = b.zxy; // produce vector [b.z, b.x, b.y]
    vec3 t3 = a.zxy;
    vec3 t4 = b.yzx;
    vec3 p = t3 * t4; // elementwise multiply
    vec3 cross = t1 * t2 - p;

Assembler (F1 = a, F2 = b, result in F3):

    fswizzlei,2130 F4, F1      # t1 = a.yzx
    fswizzlei,1320 F5, F1      # t3 = a.zxy
    fswizzlei,2130 F6, F2      # t4 = b.yzx
    fswizzlei,1320 F7, F2      # t2 = b.zxy
    fmul F8, F5, F6            # p = t3 * t4
    fmulsub F3, F4, F7, F8     # cross = t1 * t2 - p
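
The swizzle formulation above can be checked against the component formula with a short model. The helper names `swizzle` and `cross_swizzled` are hypothetical, swizzles are written by component name rather than by the numeric immediates (whose encoding is not modelled here), and the fused rounding of `fmulsub` is ignored.

```python
def swizzle(v, order):
    """Reorder v by component names, e.g. order="yzx" gives [v.y, v.z, v.x]."""
    index = {"x": 0, "y": 1, "z": 2}
    return [v[index[c]] for c in order]


def cross_swizzled(a, b):
    """Cross product built from four swizzles, one multiply and one multiply-subtract."""
    t1 = swizzle(a, "yzx")
    t2 = swizzle(b, "zxy")
    t3 = swizzle(a, "zxy")
    t4 = swizzle(b, "yzx")
    p = [u * w for u, w in zip(t3, t4)]                  # models fmul
    return [u * w - q for u, w, q in zip(t1, t2, p)]     # models fmulsub
```

For a = [1, 2, 3] and b = [4, 5, 6] this gives [-3, 6, -3], matching the component formula.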

## Vector dot product

* VDOT rd, vs1, vs2

Computes the dot product of two vectors. Internal accuracy must be
greater than that of the input vectors and the result.

There are two possible argument options:

* SUBVL=2,3,4 with vs1 and vs2 set as vectors: each result is the dot product of one pair of subvectors. If rd is scalar, only the first (unpredicated) subvector is used to create a result when VL is set (standard behaviour for single predication). Otherwise, if rd is a vector, multiple scalar results are calculated, one per subvector (i.e. SUBVL is always ignored for rd). Swizzling may be applied.
* SUBVL=1 with rd=scalar, vs1=vec and vs2=vec: one scalar result is generated from the entire src vectors.  Predication is allowed on the src vectors.

| Instr             | result | src1 | src2 | SUBVL | VL  |
| ----------------- | ------ | ---- | ---- | ----- | --- |
| VDOT rd, vs1, vs2 | scal   | vec  | vec  | 2-4   | any |
| VDOT rd, vs1, vs2 | scal   | vec  | vec  | 1     | any |
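
The two argument options can be modelled as follows. This is a sketch under stated assumptions: registers are flat Python lists, masked-out elements are assumed to contribute nothing to the sum, and the names `vdot_whole` and `vdot_subvl` are hypothetical.

```python
def vdot_whole(vs1, vs2, pred=None):
    """SUBVL=1, scalar rd: one result from the entire (optionally predicated) vectors."""
    if pred is None:
        pred = [True] * len(vs1)
    # Assumption: elements whose predicate bit is clear are simply skipped.
    return sum(a * b for a, b, p in zip(vs1, vs2, pred) if p)


def vdot_subvl(vs1, vs2, subvl, vl):
    """SUBVL=2..4, vector rd: one scalar result per subvector."""
    return [
        sum(vs1[i * subvl + j] * vs2[i * subvl + j] for j in range(subvl))
        for i in range(vl)
    ]
```

With four elements per source, `vdot_whole` returns a single sum over all four products, whereas `vdot_subvl(..., subvl=2, vl=2)` returns two results, one per vec2.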

Pseudocode in Python:

    from operator import mul

    def dot_product(A, B):
        return sum(map(mul, A, B))

Pseudocode in C:

    double dot_product(float v[], float u[], int n)
    {
        double result = 0.0;
        for (int i = 0; i < n; i++)
            result += (double)v[i] * u[i]; /* multiply at higher precision */
        return result;
    }

## Vector Normalisation (not included)

Vector normalisation may be performed through dot product, reciprocal square root and multiplication:

    fdot F3, F1, F1     # vector dot with self
    rcpsqrta F3, F3     # reciprocal square root of the result
    fscale,0 F2, F3, F1 # multiply the vector by the scalar

Alternatively, it may be performed through VLEN (vector length) and division.
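
The three-step sequence corresponds to the following sketch. The function name `normalise` is hypothetical, the reciprocal square root is computed exactly here rather than approximately, and a zero-length input is not handled.

```python
import math


def normalise(v):
    """Normalise v via dot product, reciprocal square root and scaling."""
    d = sum(c * c for c in v)      # dot product of the vector with itself
    r = 1.0 / math.sqrt(d)         # reciprocal square root
    return [c * r for c in v]      # scale every element by the scalar
```

The alternative route computes the length first and divides each element by it, which trades the multiplication for a division.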

## Vector length

* VLEN rd, vs1

The scalar length of a vector:

    sqrt(x[0]^2 + x[1]^2 + ...)

Argument options:

* rd=scalar, vs1=vec (SUBVL=1)
* rd=scalar, vs1=vec (SUBVL=2,3,4): only 1 result (predication rules apply)
* rd=vec, SUBVL ignored; vs1=vec, SUBVL=2,3,4
* rd=vec, SUBVL ignored; vs1=vec, SUBVL=1: reserved encoding.

One option is for this to be a macro op fusion sequence, with inverse-sqrt also being a second macro op sequence suitable for normalisation.

## Vector distance

* VDIST rd, vs1, vs2

The scalar distance between two vectors v0 (vs1) and v1 (vs2). Subtracts one vector from the
other and returns the length of the difference:

    length(v0 - v1)
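
As a sketch, distance is length applied to the elementwise difference. The names `vlen` and `vdist` are hypothetical stand-ins for the two operations, and SUBVL and predication handling is omitted.

```python
import math


def vlen(x):
    """Scalar length: sqrt(x[0]^2 + x[1]^2 + ...)."""
    return math.sqrt(sum(c * c for c in x))


def vdist(v0, v1):
    """Scalar distance: length of the elementwise difference v0 - v1."""
    return vlen([a - b for a, b in zip(v0, v1)])
```

For example, the distance between [1, 2, 3] and [4, 6, 3] is the length of [-3, -4, 0], which is 5.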

## Vector LERP

* VLERP vd, vs1, rs2 # SUBVL=2: vs1.v0 vs1.v1

| Instr              | result | src1 | src2 | SUBVL | VL  |
| ------------------ | ------ | ---- | ---- | ----- | --- |
| VLERP vd, vs1, rs2 | vec2   | vec2 | scal | 2     | any |

Known as **mix** in GLSL (**FMix** in the SPIR-V GLSL.std.450 extended instruction set).

<https://en.m.wikipedia.org/wiki/Linear_interpolation>

Pseudocode:

    // Imprecise method, which does not guarantee v = v1 when t = 1,
    // due to floating-point arithmetic error.
    // This form may be used when the hardware has a native fused
    // multiply-add instruction.
    float lerp_imprecise(float v0, float v1, float t) {
      return v0 + t * (v1 - v0);
    }

    // Precise method, which guarantees v = v1 when t = 1.
    float lerp_precise(float v0, float v1, float t) {
      return (1 - t) * v0 + t * v1;
    }
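
The difference between the two forms shows up when v0 and v1 differ greatly in magnitude. A small sketch, using Python floats (doubles, rather than the single precision of the pseudocode) and hypothetical function names:

```python
def lerp_fast(v0, v1, t):
    """Single multiply-add form; may miss v1 exactly at t = 1."""
    return v0 + t * (v1 - v0)


def lerp_precise(v0, v1, t):
    """Two-product form; returns exactly v1 at t = 1."""
    return (1 - t) * v0 + t * v1
```

With v0 = 1e17 and v1 = 1.0, the subtraction `v1 - v0` rounds to -1e17, so `lerp_fast(1e17, 1.0, 1.0)` returns 0.0 instead of 1.0, while `lerp_precise` returns 1.0.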

## Vector SLERP

* VSLERP vd, vs1, vs2, rs3

Not recommended: it is not commonly used and requires several trigonometric
functions, although CORDIC in vector rotate circular mode is designed for this purpose. It is also a costly 4-argument operation.

<https://en.m.wikipedia.org/wiki/Slerp>

Pseudocode:

    Quaternion slerp(Quaternion v0, Quaternion v1, double t) {
        // Only unit quaternions are valid rotations.
        // Normalize to avoid undefined behavior.
        v0.normalize();
        v1.normalize();

        // Compute the cosine of the angle between the two vectors.
        double dot = dot_product(v0, v1);

        // If the dot product is negative, slerp won't take
        // the shorter path. Note that v1 and -v1 are equivalent when
        // the negation is applied to all four components. Fix by
        // reversing one quaternion.
        if (dot < 0.0) {
            v1 = -v1;
            dot = -dot;
        }

        const double DOT_THRESHOLD = 0.9995;
        if (dot > DOT_THRESHOLD) {
            // If the inputs are too close for comfort, linearly interpolate
            // and normalize the result.

            Quaternion result = v0 + t*(v1 - v0);
            result.normalize();
            return result;
        }

        // Since dot is in range [0, DOT_THRESHOLD], acos is safe
        double theta_0 = acos(dot);        // theta_0 = angle between input vectors
        double theta = theta_0*t;          // theta = angle between v0 and result
        double sin_theta = sin(theta);     // compute this value only once
        double sin_theta_0 = sin(theta_0); // compute this value only once

        double s0 = cos(theta) - dot * sin_theta / sin_theta_0;  // == sin(theta_0 - theta) / sin(theta_0)
        double s1 = sin_theta / sin_theta_0;

        return (s0 * v0) + (s1 * v1);
    }

However, the CORDIC algorithm itself does not involve transcendentals except in
the computation of its tables: <https://en.wikipedia.org/wiki/CORDIC#Rotation_mode>

    function v = cordic(beta,n)
        % This function computes v = [cos(beta), sin(beta)] (beta in radians)
        % using n iterations. Increasing n will increase the precision.

        if beta < -pi/2 || beta > pi/2
            if beta < 0
                v = cordic(beta + pi, n);
            else
                v = cordic(beta - pi, n);
            end
            v = -v; % flip the sign for second or third quadrant
            return
        end

        % Initialization of tables of constants used by CORDIC
        % need a table of arctangents of negative powers of two, in radians:
        % angles = atan(2.^-(0:27));
        % (every line is continued with "..." so that the table is one row)
        angles =  [  ...
            0.78539816339745   0.46364760900081 ...
            0.24497866312686   0.12435499454676 ...
            0.06241880999596   0.03123983343027 ...
            0.01562372862048   0.00781234106010 ...
            0.00390623013197   0.00195312251648 ...
            0.00097656218956   0.00048828121119 ...
            0.00024414062015   0.00012207031189 ...
            0.00006103515617   0.00003051757812 ...
            0.00001525878906   0.00000762939453 ...
            0.00000381469727   0.00000190734863 ...
            0.00000095367432   0.00000047683716 ...
            0.00000023841858   0.00000011920929 ...
            0.00000005960464   0.00000002980232 ...
            0.00000001490116   0.00000000745058 ];
        % and a table of products of reciprocal lengths of vectors [1, 2^-j]:
        % Kvalues = cumprod(1./abs(1 + 1j*2.^(-(0:23))))
        Kvalues = [ ...
            0.70710678118655   0.63245553203368 ...
            0.61357199107790   0.60883391251775 ...
            0.60764825625617   0.60735177014130 ...
            0.60727764409353   0.60725911229889 ...
            0.60725447933256   0.60725332108988 ...
            0.60725303152913   0.60725295913894 ...
            0.60725294104140   0.60725293651701 ...
            0.60725293538591   0.60725293510314 ...
            0.60725293503245   0.60725293501477 ...
            0.60725293501035   0.60725293500925 ...
            0.60725293500897   0.60725293500890 ...
            0.60725293500889   0.60725293500888 ];
        Kn = Kvalues(min(n, length(Kvalues)));

        % Initialize loop variables:
        v = [1;0]; % start with 2-vector cosine and sine of zero
        poweroftwo = 1;
        angle = angles(1);

        % Iterations
        for j = 0:n-1;
            if beta < 0
                sigma = -1;
            else
                sigma = 1;
            end
            factor = sigma * poweroftwo;
            % Note the matrix multiplication can be done using scaling by
            % powers of two and addition/subtraction
            R = [1, -factor; factor, 1];
            v = R * v; % 2-by-2 matrix multiply
            beta = beta - sigma * angle; % update the remaining angle
            poweroftwo = poweroftwo / 2;
            % update the angle from table, or eventually by just dividing by two
            if j+2 > length(angles)
                angle = angle / 2;
            else
                angle = angles(j+2);
            end
        end

        % Adjust length of output vector to be [cos(beta), sin(beta)]:
        v = v * Kn;
        return

    endfunction

The 2x2 matrix multiply can be done with shifts and adds, since multiplying by 2^(-j) is a right shift by j bits and sigma is +1 or -1:

    x = v[0] - sigma * (v[1] * 2^(-j));
    y = sigma * (v[0] * 2^(-j)) + v[1];
    v = [x; y];
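
A fixed-point sketch of the same rotation using only integer shifts, adds and a small table. The 30-bit fraction, the iteration count and the names `cordic_fixed`, `ANGLES` and `K_FIXED` are illustrative assumptions, not values taken from this proposal. The gain correction is applied by starting from K instead of multiplying at the end.

```python
import math

FRAC_BITS = 30                 # assumed fixed-point fraction width
ONE = 1 << FRAC_BITS
N = 30                         # assumed number of iterations

# Table of atan(2^-j) in fixed point: the only place transcendentals appear.
ANGLES = [round(math.atan(2.0 ** -j) * ONE) for j in range(N)]

# Product of 1/sqrt(1 + 2^-2j): pre-scales the start vector to cancel the gain.
K = 1.0
for j in range(N):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * j))
K_FIXED = round(K * ONE)


def cordic_fixed(beta):
    """Return (cos(beta), sin(beta)) for |beta| <= pi/2 using shifts and adds."""
    x, y, z = K_FIXED, 0, round(beta * ONE)
    for j in range(N):
        if z >= 0:   # sigma = +1
            x, y, z = x - (y >> j), y + (x >> j), z - ANGLES[j]
        else:        # sigma = -1
            x, y, z = x + (y >> j), y - (x >> j), z + ANGLES[j]
    return x / ONE, y / ONE
```

Python's `>>` on negative integers is an arithmetic shift, which is what a hardware implementation on two's complement values would also use.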

The technique is outlined in a paper as being applicable to 3D:
<https://www.atlantis-press.com/proceedings/jcis2006/232>

# Expensive 3-operand OP32 operations

3-operand operations are extremely expensive in terms of OP32 encoding space.  A potential idea is to embed 3 RVC register formats (3 bits each, 9 bits in total) across two out of the three 5-bit fields rs1/rs2/rd.

Another is to overwrite one of the src registers.
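
To illustrate the first idea only: three 3-bit RVC-style register numbers occupy 9 bits, which fit into two 5-bit fields (10 bits). The bit layout below is purely hypothetical, as are the names `pack_rvc3` and `unpack_rvc3`; no encoding has been defined here. It assumes the usual RVC convention that a 3-bit field names registers 8 to 15.

```python
def pack_rvc3(rd, rs1, rs2):
    """Pack three register numbers in the range 8..15 into two 5-bit fields."""
    if any(not 8 <= r <= 15 for r in (rd, rs1, rs2)):
        raise ValueError("3-bit RVC register fields can only name registers 8..15")
    bits = ((rd - 8) << 6) | ((rs1 - 8) << 3) | (rs2 - 8)   # 9 bits in total
    return bits >> 5, bits & 0x1F                           # (upper field, lower field)


def unpack_rvc3(field_hi, field_lo):
    """Recover the three register numbers from the two 5-bit fields."""
    bits = (field_hi << 5) | field_lo
    return 8 + ((bits >> 6) & 7), 8 + ((bits >> 3) & 7), 8 + (bits & 7)
```

The cost of such a scheme is the restricted register range, while the remaining 5-bit field stays free for a further full register operand.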

# Opcode Table

TODO

# Links

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002736.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-September/002733.html>
* <http://bugs.libre-riscv.org/show_bug.cgi?id=142>

Research Papers:

* <https://www.researchgate.net/publication/2938554_PLX_FP_An_Efficient_Floating-Point_Instruction_Set_for_3D_Graphics>