src/panfrost/bifrost/Notes.txt

   1 # Notes on opcodes
   2
   3 _Notes mainly by Connor Abbott extracted from the disassembler_
   4
   5 LOG_FREXPM:
   6
   7         // From the ARM patent US20160364209A1:
   8         // "Decompose v (the input) into numbers x1 and s such that v = x1 * 2^s,
   9         // and x1 is a floating point value in a predetermined range where the
  10         // value 1 is within the range and not at one extremity of the range (e.g.
  11         // choose a range where 1 is towards middle of range)."
  12         //
  13         // This computes x1.
  14
  15 FRCP_FREXPM:
  16
  17         // Given a floating point number m * 2^e, returns m * 2^{-1}. This is
  18         // exactly the same as the mantissa part of frexp().
  19
  20 FSQRT_FREXPM:
  21         // Given a floating point number m * 2^e, returns m * 2^{-2} if e is even,
  22         // and m * 2^{-1} if e is odd. In other words, scales by powers of 4 until
  23         // within the range [0.25, 1). Used for square-root and reciprocal
  24         // square-root.
  25
  26
  27
  28
  29 FRCP_FREXPE:
  30         // Given a floating point number m * 2^e, computes -e - 1 as an integer.
  31         // Zero and infinity/NaN return 0.
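        //
        // A minimal sketch of how FRCP_FREXPM and FRCP_FREXPE fit together,
        // assuming finite, nonzero, non-denormal inputs. Python doubles stand
        // in for the hardware's f32, and the plain division stands in for
        // whatever table lookup and refinement the hardware performs. The
        // function names and the rcp() helper are illustrative only.

```python
import math

def frcp_frexpm(x):
    # Mantissa scaled to m * 2^-1, i.e. the frexp() mantissa in [0.5, 1).
    mant, _ = math.frexp(x)
    return mant

def frcp_frexpe(x):
    # For x = m * 2^e with m in [1, 2), return -e - 1.
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return 0  # special cases listed in the note above
    _, exp = math.frexp(x)  # frexp() exponent is e + 1
    return -exp

def rcp(x):
    # 1/x = (1 / mantissa) * 2^(exponent of the reciprocal)
    return math.ldexp(1.0 / frcp_frexpm(x), frcp_frexpe(x))
```

        // For example, 10.0 = 1.25 * 2^3, so the mantissa op yields 0.625 and
        // the exponent op yields -4.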
  32
  33 FSQRT_FREXPE:
  34         // Given m * 2^e, computes floor(e/2) + 1 as an integer.
  35
  36 FRSQ_FREXPE:
  37         // Given a floating point number m * 2^e, computes -floor(e/2) - 1 as an
  38         // integer.
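        //
        // A sketch of how FSQRT_FREXPM, FSQRT_FREXPE and FRSQ_FREXPE combine,
        // assuming finite positive inputs and using Python doubles in place
        // of f32. The helper _split() and the use of math.sqrt() on the
        // reduced mantissa are assumptions for illustration; the hardware
        // presumably uses a table and refinement steps instead.

```python
import math

def _split(x):
    # Return (m, e) with x = m * 2^e and m in [1, 2).
    mant, exp = math.frexp(x)  # mant in [0.5, 1)
    return mant * 2.0, exp - 1

def fsqrt_frexpm(x):
    m, e = _split(x)
    # Scale by a power of 4 so the result lies in [0.25, 1).
    return m * 0.25 if e % 2 == 0 else m * 0.5

def fsqrt_frexpe(x):
    _, e = _split(x)
    return e // 2 + 1  # Python's // is floor division

def frsq_frexpe(x):
    _, e = _split(x)
    return -(e // 2) - 1

def sqrt_via_frexp(x):
    return math.ldexp(math.sqrt(fsqrt_frexpm(x)), fsqrt_frexpe(x))

def rsqrt_via_frexp(x):
    return math.ldexp(1.0 / math.sqrt(fsqrt_frexpm(x)), frsq_frexpe(x))
```

        // Because the mantissa is scaled by an even power of 2, the exponent
        // of the square root is always an integer.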
  39
  40 LSHIFT_ADD_LOW32:
  41         // These instructions in the FMA slot, together with LSHIFT_ADD_HIGH32.i32
  42         // in the ADD slot, allow one to do a 64-bit addition with an extra small
  43         // shift on one of the sources. There are three possible scenarios:
  44         //
  45         // 1) Full 64-bit addition. Do:
  46         // out.x = LSHIFT_ADD_LOW32.i64 src1.x, src2.x, shift
  47         // out.y = LSHIFT_ADD_HIGH32.i32 src1.y, src2.y
  48         //
  49         // The shift amount is applied to src2 before adding. The shift amount, and
  50         // any extra bits from src2 plus the overflow bit, are sent directly from
  51         // FMA to ADD instead of being passed explicitly. Hence, these two must be
  52         // bundled together into the same instruction.
  53         //
  54         // 2) Add a 64-bit value src1 to a zero-extended 32-bit value src2. Do:
  55         // out.x = LSHIFT_ADD_LOW32.u32 src1.x, src2, shift
  56         // out.y = LSHIFT_ADD_HIGH32.i32 src1.y, 0
  57         //
  58         // Note that in this case, the second argument to LSHIFT_ADD_HIGH32 is
  59         // ignored, so it can actually be anything. As before, the shift is applied
  60         // to src2 before adding.
  61         //
  62         // 3) Add a 64-bit value to a sign-extended 32-bit value src2. Do:
  63         // out.x = LSHIFT_ADD_LOW32.i32 src1.x, src2, shift
  64         // out.y = LSHIFT_ADD_HIGH32.i32 src1.y, 0
  65         //
  66         // The only difference is the .i32 instead of .u32. Otherwise, this is
  67         // exactly the same as before.
  68         //
  69         // In all these instructions, the shift amount is stored where the third
  70         // source would be, so the shift has to be a small immediate from 0 to 7.
  71         // This is fine for the expected use-case of these instructions, which is
  72         // manipulating 64-bit pointers.
  73         //
  74         // These instructions can also be combined with various load/store
  75         // instructions which normally take a 64-bit pointer in order to add a
  76         // 32-bit or 64-bit offset to the pointer before doing the operation,
  77         // optionally shifting the offset. The load/store op implicitly does
  78         // LSHIFT_ADD_HIGH32.i32 internally. Letting ptr be the pointer, and offset
  79         // the desired offset, the cases go as follows:
  80         //
  81         // 1) Add a 64-bit offset:
  82         // LSHIFT_ADD_LOW32.i64 ptr.x, offset.x, shift
  83         // ld_st_op ptr.y, offset.y, ...
  84         //
  85         // Note that the output of LSHIFT_ADD_LOW32.i64 is not used, instead being
  86         // implicitly sent to the load/store op to serve as the low 32 bits of the
  87         // pointer.
  88         //
  89         // 2) Add a 32-bit unsigned offset:
  90         // temp = LSHIFT_ADD_LOW32.u32 ptr.x, offset, shift
  91         // ld_st_op temp, ptr.y, ...
  92         //
  93         // Now, the low 32 bits of (offset << shift) + ptr are passed explicitly to
  94         // the ld_st_op, to match the case where there is no offset and ld_st_op is
  95         // called directly.
  96         //
  97         // 3) Add a 32-bit signed offset:
  98         // temp = LSHIFT_ADD_LOW32.i32 ptr.x, offset, shift
  99         // ld_st_op temp, ptr.y, ...
 100         //
 101         // Again, the same as the unsigned case, except the offset is sign-extended.
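        //
        // The three addition modes can be modelled roughly as follows. This is
        // a sketch, not a description of the hardware: the "carried" value
        // stands for whatever FMA passes implicitly to ADD (shifted-out bits
        // plus carry), and representing it as a plain integer is an
        // assumption. The add64() wrapper is a hypothetical helper.

```python
MASK32 = 0xFFFFFFFF

def sign_extend32(v):
    v &= MASK32
    return v - (1 << 32) if v & 0x80000000 else v

def lshift_add_low32(src1_lo, src2, shift, mode):
    # FMA half: returns (low 32 bits of the sum, value carried to ADD).
    # mode is "i64", "u32" or "i32"; shift is a small immediate, 0 to 7.
    assert 0 <= shift <= 7
    if mode == "i32":
        addend = sign_extend32(src2) << shift   # sign-extended 32-bit source
    else:
        addend = (src2 & MASK32) << shift       # low word or zero-extended
    total = (src1_lo & MASK32) + addend
    return total & MASK32, total >> 32

def lshift_add_high32(src1_hi, src2_hi, shift, carried, mode):
    # ADD half: the second source only matters for the full 64-bit case.
    if mode == "i64":
        hi = src1_hi + (src2_hi << shift) + carried
    else:
        hi = src1_hi + carried
    return hi & MASK32

def add64(src1, src2, shift, mode):
    # Hypothetical wrapper: computes src1 + (src2 << shift) in two halves.
    lo, carried = lshift_add_low32(src1 & MASK32, src2 & MASK32, shift, mode)
    hi = lshift_add_high32((src1 >> 32) & MASK32, (src2 >> 32) & MASK32,
                           shift, carried, mode)
    return (hi << 32) | lo
```

        // With shift = 3 and mode "u32", add64(ptr, index, 3, "u32") models
        // indexing an array of 8-byte elements from a 64-bit pointer.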
 102
 103 ---
 104
 105 ADD ops:
 106
 107 F16_TO_F32.X: // take the low  16 bits, and expand it to a 32-bit float
 108 F16_TO_F32.Y: // take the high 16 bits, and expand it to a 32-bit float
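
        // As an illustration of the .X/.Y selection, the sketch below unpacks
        // either half of a 32-bit word as an IEEE half-precision float using
        // Python's struct module. The function name and the `high` flag are
        // illustrative; they are not how the instruction is encoded.

```python
import struct

def f16_to_f32(word, high):
    # .X reads the low 16 bits, .Y reads the high 16 bits.
    half_bits = (word >> 16) & 0xFFFF if high else word & 0xFFFF
    # "<e" is the little-endian IEEE 754 binary16 format code.
    return struct.unpack("<e", struct.pack("<H", half_bits))[0]
```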
 109
 110 MOV:
 111         // Logically, this should be SWZ.XY, but that's equivalent to a move, and
 112         // this seems to be the canonical way the blob generates a MOV.
 113
 114
 115 FRCP_FREXPM:
 116         // Given a floating point number m * 2^e, returns m * 2^{-1}.
 117
 118 FLOG_FREXPE:
 119         // From the ARM patent US20160364209A1:
 120         // "Decompose v (the input) into numbers x1 and s such that v = x1 * 2^s,
 121         // and x1 is a floating point value in a predetermined range where the
 122         // value 1 is within the range and not at one extremity of the range (e.g.
 123         // choose a range where 1 is towards middle of range)."
 124         //
 125         // This computes s.
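        //
        // A sketch of the decomposition v = x1 * 2^s described in the quote.
        // The range [0.75, 1.5) used here is purely an assumption chosen to
        // satisfy "1 is within the range and not at one extremity"; the range
        // the hardware actually uses is not stated in these notes. Inputs are
        // assumed finite and positive, and the function names are
        // illustrative.

```python
import math

ASSUMED_LOW = 0.75  # assumed lower bound; range is [0.75, 1.5)

def log_frexpm(v):
    # Returns x1, the mantissa part of the decomposition.
    mant, _ = math.frexp(v)  # mant in [0.5, 1)
    return mant * 2.0 if mant < ASSUMED_LOW else mant

def flog_frexpe(v):
    # Returns s, the matching exponent part.
    mant, exp = math.frexp(v)
    return exp - 1 if mant < ASSUMED_LOW else exp

def log2_via_frexp(v):
    # log2(v) = s + log2(x1), with x1 close to 1.
    return flog_frexpe(v) + math.log2(log_frexpm(v))
```

        // Keeping x1 near 1 keeps log(x1) near 0, which is presumably why the
        // patent asks for 1 to sit away from the ends of the range.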
 126
 127 LD_UBO.v4i32:
 128         // src0 = offset, src1 = binding
 129
 130 FRCP_FAST.f32:
 131         // *_FAST does not exist on G71 (added to G51, G72, and everything after)
 132
 133 FRCP_TABLE:
 134         // Given a floating point number m * 2^e, produces a table-based
 135         // approximation of 2/m using the top 17 bits. Includes special cases for
 136         // infinity, NaN, and zero, and copies the sign bit.
 137
 138 FRCP_FAST.f16.X:
 139         // Exists on G71
 140
 141 FRSQ_TABLE:
 142         // A similar table for inverse square root, using the high 17 bits of the
 143         // mantissa as well as the low bit of the exponent.
 144
 145 FRCP_APPROX:
 146         // Used in the argument reduction for log. Given a floating-point number
 147         // m * 2^e, uses the top 4 bits of m to produce an approximation to 1/m
 148         // with the exponent forced to 0 and only the top 5 bits nonzero. 0,
 149         // infinity, and NaN all return 1.0.
 150         // See the ARM patent for more information.
 151
 152 MUX:
 153         // For each bit i, return src2[i] ? src0[i] : src1[i]. In other words, this
 154         // is the same as (src2 & src0) | (~src2 & src1).
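        //
        // The bitwise select can be written directly from the formula above.
        // The 32-bit mask is an assumption about operand width, needed only
        // because Python integers are unbounded.

```python
def mux(src0, src1, src2):
    # Each result bit comes from src0 where src2 is 1, else from src1.
    return ((src2 & src0) | (~src2 & src1)) & 0xFFFFFFFF
```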
 155
 156 ST_VAR:
 157         // Store a varying, given the address and datatype from LD_VAR_ADDR.
 158
 159 LD_VAR_ADDR:
 160         // Compute varying address and datatype (for storing in the vertex shader),
 161         // and store the vec3 result in the data register. The result is passed as
 162         // the 3 normal arguments to ST_VAR.
 163
 164 DISCARD:
 165         // Implements conditional discards (discard_if in NIR). Compares the
 166         // first two sources and discards if the comparison is true.
 167
 168 ATEST.f32:
 169         // Implements alpha-to-coverage, as well as possibly the late depth and
 170         // stencil tests. The first source is the existing sample mask in R60
 171         // (possibly modified by gl_SampleMask), and the second source is the alpha
 172         // value. The sample mask is written twice. It is first written right away,
 173         // based on the alpha-to-coverage result, using the normal register write
 174         // mechanism, since that doesn't need to read from any memory. It is then
 175         // written again later, based on the result of the stencil and depth tests,
 176         // using the special register.
 177
 178 BLEND:
 179         // This takes the sample coverage mask (computed by ATEST above) as a
 180         // regular argument, in addition to the vec4 color in the special register.