Adreno GPUs prior to 6xx use two micro-controllers to parse the command-stream,
set up the hardware for draws (or compute jobs), and do various GPU
housekeeping. They are relatively simple (basically glorified
register writers) and essentially all their state is in a collection
of registers. I.e. there is no stack, and no memory assigned to
them; any global state, like which bank of context registers is to
be used in the next draw, is stored in a register.

The setup is similar to radeon; in fact Adreno 2xx through 4xx used
basically the same instruction set as r600. There is a "PFP"
(Prefetch Parser) and "ME" (Micro Engine, also confusingly referred
to as "PM4"). These make up the "CP" ("Command Parser"). The
PFP runs ahead of the ME, with some PM4 packets handled entirely
in the PFP. Between the PFP and ME is a FIFO ("MEQ"). In the
generations prior to Adreno 5xx, the PFP and ME had different
instruction sets.

Starting with Adreno 5xx, a new microcontroller with a unified
instruction set was introduced, although the overall architecture
and purpose of the two microcontrollers remain the same.

For lack of a better name, this new instruction set is called
"Adreno Five MicroCode" or "afuc". (No idea what Qualcomm calls
it.)

With Adreno 6xx, the separate PFP and ME are replaced with a single
SQE microcontroller using the same instruction set as 5xx.


Instruction Set Overview
========================

The instruction set is 32-bit, with basic arithmetic ops that can take
either two source registers or one src and a 16b immediate.

There are 32 registers, although some are special purpose:

- ``$00`` - always reads zero, otherwise seems to be the PC
- ``$01`` - current PM4 packet header
- ``$1c`` - alias ``$rem``, remaining data in packet
- ``$1d`` - alias ``$addr``
- ``$1f`` - alias ``$data``

Branch instructions have a delay slot, so the following instruction
is always executed regardless of whether the branch is taken or not.

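
As an illustration of the delay slot, here is a toy interpreter written in
Python rather than afuc. It is only a sketch: the tuple-based instruction
format and the ``run`` helper are invented for this example, and the only
behaviour modelled is the rule stated above, that the instruction following
a branch executes whether or not the branch is taken.

.. code-block:: python

   def run(program):
       """Execute a toy program and return the labels of executed ops.

       Each instruction is a tuple (hypothetical format, not afuc encoding):
         ("op", label)          - ordinary instruction, records its label
         ("br", taken, target)  - branch to index `target` if `taken` is true
       """
       trace = []
       pc = 0
       pending = None  # branch target waiting for the delay slot to finish
       while pc < len(program):
           insn = program[pc]
           next_pc = pc + 1
           if pending is not None:
               # This instruction sits in a delay slot: it still executes,
               # and only afterwards does control move to the branch target.
               next_pc = pending
               pending = None
           if insn[0] == "op":
               trace.append(insn[1])
           elif insn[0] == "br" and insn[1]:
               pending = insn[2]
           pc = next_pc
       return trace

With a taken branch at index 0 that targets index 3, ``run`` still executes
the instruction at index 1 (the delay slot), skips index 2, and continues at
index 3.
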
The following instructions are available:

- ``addhi`` - add + carry (for upper 32b of 64b value)
- ``subhi`` - subtract + carry (for upper 32b of 64b value)
- ``and``   - bitwise AND
- ``xor``   - bitwise XOR
- ``not``   - bitwise NOT (no src1)
- ``shl``   - shift-left
- ``ushr``  - unsigned shift-right
- ``ishr``  - signed shift-right
- ``rot``   - rotate-left (like shift-left with wrap-around)
- ``mul8``  - multiply low 8b of two src
- ``cmp``   - compare two values

The ALU instructions can take either two src registers, or a src
plus 16b immediate as 2nd src, e.g.::

  add $dst, $src, 0x1234   ; src2 is immed
  add $dst, $src1, $src2   ; src2 is reg

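
The pairing of a low-word add with ``addhi`` for 64b arithmetic can be
sketched in Python. This is a toy model, not a description of the hardware:
the ``Alu`` class and ``add64`` helper are hypothetical names, and the model
assumes the carry-out of the low-word add is latched somewhere that the
following ``addhi`` consumes, which the description of ``addhi`` suggests but
does not spell out.

.. code-block:: python

   MASK32 = 0xFFFFFFFF

   class Alu:
       """Toy 32-bit adder that remembers its carry-out for addhi."""

       def __init__(self):
           self.carry = 0  # assumed carry latch, not a documented register

       def add(self, a, b):
           total = (a & MASK32) + (b & MASK32)
           self.carry = total >> 32
           return total & MASK32

       def addhi(self, a, b):
           # Same as add, but also folds in the carry from the previous add.
           total = (a & MASK32) + (b & MASK32) + self.carry
           self.carry = total >> 32
           return total & MASK32

   def add64(alu, a_lo, a_hi, b_lo, b_hi):
       """Add two 64b values held as (lo, hi) pairs of 32b words."""
       lo = alu.add(a_lo, b_lo)
       hi = alu.addhi(a_hi, b_hi)
       return lo, hi

For example, adding 1 to ``0x1_FFFFFFFF`` overflows the low word, and the
carry propagates into the high word, giving ``(0x0, 0x2)``.
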
The ``not`` instruction only takes a single source.

The ``cmp`` instruction returns:

- ``0x00`` if src1 > src2
- ``0x2b`` if src1 == src2
- ``0x1e`` if src1 < src2

See explanation in :ref:`afuc-branch`.

The following branch/jump instructions are available:

- ``brne`` - branch if not equal (or bit not set)
- ``breq`` - branch if equal (or bit set)
- ``jump`` - unconditional jump

Both ``brne`` and ``breq`` have two forms, comparing the src register
against either a small immediate (up to 5 bits) or a specific bit::

  breq $src, b3, #somelabel  ; branch if src & (1 << 3)
  breq $src, 0x3, #somelabel ; branch if src == 3

The branch instructions are encoded with a 16b relative offset.
Since ``$00`` always reads back zero, it can be used to construct
an unconditional relative jump.

The :ref:`cmp <afuc-alu-cmp>` instruction can be paired with the
bit-test variants of ``brne``/``breq`` to implement gt/ge/lt/le,
due to the bit pattern it returns, for example::

  cmp $04, $02, $03
  breq $04, b1, #somelabel

will branch if ``$02`` is less than or equal to ``$03``.

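
To see why the bit tests work, the three ``cmp`` results can be modelled in
Python. This is a sketch under assumptions: the helper names are made up,
the comparison is treated as unsigned, and only the use of ``b1`` for
less-or-equal comes from the text. The choice of ``b2`` for less-than and
``b0`` for equality is derived from the bit patterns of the three constants
and is not confirmed against firmware.

.. code-block:: python

   CMP_GT = 0x00  # src1 > src2
   CMP_EQ = 0x2B  # src1 == src2, binary 101011
   CMP_LT = 0x1E  # src1 < src2,  binary 011110

   def cmp(src1, src2):
       """Model of the afuc cmp result (unsigned compare is an assumption)."""
       if src1 > src2:
           return CMP_GT
       if src1 == src2:
           return CMP_EQ
       return CMP_LT

   def bit(value, n):
       """True if bit n of value is set, as tested by `breq $reg, bN`."""
       return bool(value & (1 << n))

   def is_le(a, b):
       return bit(cmp(a, b), 1)       # breq ..., b1 (from the text)

   def is_gt(a, b):
       return not bit(cmp(a, b), 1)   # brne ..., b1

   def is_lt(a, b):
       return bit(cmp(a, b), 2)       # b2 is set only in 0x1e (derived)

   def is_ge(a, b):
       return not bit(cmp(a, b), 2)

   def is_eq(a, b):
       return bit(cmp(a, b), 0)       # b0 is set only in 0x2b (derived)

Bit 1 is set in both ``0x2b`` and ``0x1e`` but not in ``0x00``, which is why
a single bit test on the ``cmp`` result is enough to express less-or-equal.
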
Simple subroutines can be implemented with ``call``/``ret``. The
jump instruction encodes a fixed offset.

  TODO not sure how many levels deep function calls can be nested.
  There isn't really a stack. Definitely seems to be multiple
  levels of function call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->

These seem to read/write config state in other parts of CP. In at
least some cases I expect these map to CP registers (but possibly

- ``cread $dst, [$off + addr], flags``
- ``cwrite $src, [$off + addr], flags``

In cases where no offset is needed, ``$00`` is frequently used as
the offset.

For example, the following sequence updates the IB base and size::

  ; load CP_INDIRECT_BUFFER parameters from cmdstream:
  mov $02, $data   ; low 32b of IB target address
  mov $03, $data   ; high 32b of IB target
  mov $04, $data   ; IB size in dwords

  ; sanity check # of dwords:
  breq $04, 0x0, #l23 (#69, 04a2)

  ; this seems to have something to do with figuring out whether
  ; we are going from RB->IB1 or IB1->IB2 (ie. so the
  ; below cwrite instructions update either
  ; CP_IB1_BASE_LO/HI/BUFSIZE or CP_IB2_BASE_LO/HI/BUFSIZE)

  ; update CP_IBn_BASE_LO/HI/BUFSIZE:
  cwrite $02, [$05 + 0x0b0], 0x8
  cwrite $03, [$05 + 0x0b1], 0x8
  cwrite $04, [$05 + 0x0b2], 0x8

The special registers ``$addr`` and ``$data`` can be used to write GPU
registers, for example, to write::

  mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
  mov $data, $03                 ; CP_SCRATCH_REG[0x2]
  mov $data, $04                 ; CP_SCRATCH_REG[0x3]

Subsequent writes to ``$data`` will increment the address of the register
to write, so a sequence of consecutive registers can be written.

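
The auto-increment behaviour of ``$addr``/``$data`` can be modelled with a
few lines of Python. This is a minimal sketch: the ``RegWriter`` class is
hypothetical, ``SCRATCH_BASE`` is a made-up placeholder rather than the real
offset of ``CP_SCRATCH_REG``, and the model assumes the address advances by
exactly one register per write.

.. code-block:: python

   # Placeholder base offset for illustration only (NOT the real value).
   SCRATCH_BASE = 0x100

   class RegWriter:
       """Toy model of writing GPU registers through $addr/$data."""

       def __init__(self):
           self.regs = {}  # register offset -> last value written
           self.addr = 0   # models $addr

       def set_addr(self, reg):
           """Model of `mov $addr, REG`."""
           self.addr = reg

       def write_data(self, value):
           """Model of `mov $data, value`: write, then advance the address."""
           self.regs[self.addr] = value
           self.addr += 1

   w = RegWriter()
   w.set_addr(SCRATCH_BASE + 2)  # mov $addr, CP_SCRATCH_REG[0x2]
   w.write_data(0x11)            # lands in CP_SCRATCH_REG[0x2]
   w.write_data(0x22)            # lands in CP_SCRATCH_REG[0x3]

After the two writes, ``w.regs`` holds one entry for each of the two
consecutive scratch registers.
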
::

  mov $addr, CP_SCRATCH_REG[0x2]

Many registers that are updated frequently have two banks, so they can be
updated without stalling for the previous draw to finish. These banks are
arranged so bit 11 of the register address is zero for bank 0 and 1 for
bank 1. The ME firmware (at least the version I'm looking at) stores the
current bank select in ``$17``, so to update these registers from ME::

  or $addr, $17, VFD_INDEX_OFFSET

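
The bank selection amounts to OR-ing a bank bit into the register address.
A small Python sketch, assuming the value kept in ``$17`` is simply bit 11
set or clear; ``EXAMPLE_REG`` is a hypothetical offset chosen for
illustration and is not the real offset of ``VFD_INDEX_OFFSET``.

.. code-block:: python

   BANK_BIT = 11  # per the text: bit 11 is 0 for bank 0 and 1 for bank 1

   # Hypothetical register offset with bit 11 clear (NOT a real offset).
   EXAMPLE_REG = 0x123

   def bank_select(bank):
       """Value the firmware is assumed to keep in $17 for a given bank."""
       if bank not in (0, 1):
           raise ValueError("bank must be 0 or 1")
       return bank << BANK_BIT

   def banked_addr(bank_sel, reg):
       """Model of `or $addr, $17, REG`."""
       return bank_sel | reg

With these placeholder values, bank 0 addresses ``0x123`` and bank 1
addresses ``0x923``.
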
Note that PFP doesn't seem to use this approach; instead it does something
like::

  mov $0c, CP_SCRATCH_REG[0x7]
  mov $02, 0x789a   ; value
  cwrite $0c, [$00 + 0x010], 0x8
  cwrite $02, [$00 + 0x011], 0x8

Like with the ``$addr``/``$data`` approach, the destination register address
increments on each write.

There are no load/store instructions, as such. The microcontrollers
have only indirect memory access via GPU registers. There are two
mechanisms.


Read/Write via CP_NRT Registers
-------------------------------

This seems to be only used by ME. If PFP were also using it, they would
race with each other. It seems to be primarily used for small reads.

- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``

The address register increments with successive reads or writes.

Memory Write example::

  ; store 64b value in $04+$05 to 64b address in $02+$03
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $addr, CP_ME_NRT_DATA

Memory Read example::

  ; load 64b value from address in $02+$03 into $04+$05
  mov $addr, CP_ME_NRT_ADDR_LO

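
The NRT mechanism can be pictured as an address register plus a data port
that advances the address on every access. The Python model below is a
sketch only: the ``NrtPort`` class is hypothetical, memory is a plain dict,
and the step of 4 bytes per access is an assumption (the text says the
address increments but not by how much).

.. code-block:: python

   class NrtPort:
       """Toy model of CP_ME_NRT_ADDR_LO/_HI and CP_ME_NRT_DATA."""

       STEP = 4  # assumed increment: one 32b dword per access

       def __init__(self, memory):
           self.memory = memory  # dict: byte address -> 32b value
           self.addr = 0

       def set_addr(self, lo, hi):
           """Model of writing CP_ME_NRT_ADDR_LO, then _HI."""
           self.addr = (hi << 32) | lo

       def write(self, value):
           """Model of a write to CP_ME_NRT_DATA."""
           self.memory[self.addr] = value & 0xFFFFFFFF
           self.addr += self.STEP

       def read(self):
           """Model of a read through the NRT data port."""
           value = self.memory.get(self.addr, 0)
           self.addr += self.STEP
           return value

   mem = {}
   port = NrtPort(mem)
   port.set_addr(0x1000, 0x1)  # 64b address 0x1_0000_1000
   port.write(0xAAAA)          # low dword of the 64b value
   port.write(0xBBBB)          # high dword, at the next address

Resetting the address and calling ``read`` twice returns the two dwords in
the same order.
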

Read via Control Instructions
-----------------------------

This is used by PFP whenever it needs to read memory. It also seems to be
used by ME for streaming reads (larger amounts of data). The DMA access
seems to be done by ROQ.

  TODO might also be possible for write access

  TODO some of the control commands might be synchronizing access

An example from the ``CP_DRAW_INDIRECT`` packet handler::

  mov $07, 0x0004  ; # of dwords to read from draw-indirect buffer
  ; load address of indirect buffer from cmdstream:
  cwrite $data, [$00 + 0x0b8], 0x8
  cwrite $data, [$00 + 0x0b9], 0x8
  ; set # of dwords to read:
  cwrite $07, [$00 + 0x0ba], 0x8

  ; read parameters from draw-indirect buffer:
  cread $12, [$00 + 0x040], 0x8
  ; the start parameter gets written into MEQ, which ME writes
  ; to VFD_INDEX_OFFSET register:

The ``$14`` register holds global flags set by::

  CP_SKIP_IB2_ENABLE_LOCAL - b8
  CP_SKIP_IB2_ENABLE_GLOBAL - b9
  MODE=BLIT2D - clears b15, b12, b7
  CP_SET_MODE - b29+b30
  CP_SET_VISIBILITY_OVERRIDE - b11, b21, b30?
  CP_SET_DRAW_STATE - checks b29+b30
  CP_COND_REG_EXEC - checks b10, which should be predicate flag?