# Register Files

A minimum of four register files is required for POWER:

* Floating-point
* Integer
* Control and Condition Code Registers (CR0-7, CTR, LR)
* SPRs (Special Purpose Registers)

The FP and Integer register files each need to be a massive 128 x 64-bit
(128 entries, each 64 bits wide).

# Connectivity between regfiles and Function Units

The target for the first ASICs is a minimum of four 32-bit FMACs per clock cycle.
If it is acceptable for these to operate on sequentially-adjacent-numbered
registers, the amount of regfile porting required may be reduced
significantly (down from 12R4W, i.e. three reads and one write for each of
the four FMACs).

This does, however, require that the register file be broken into four
completely separate and independent quadrants, each with its own
separate and independent 3R1W (or 4R1W) ports.
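
The port arithmetic above can be illustrated with a small Python sketch. It assumes, purely for illustration, that a register's quadrant is selected by the low two bits of its number (the hypothetical helper `quadrant()`); the actual arrangement shown in the diagram may differ. The sketch counts the per-quadrant read and write demand of four FMACs issued in the same cycle.

```python
from collections import Counter

NUM_QUADRANTS = 4  # the regfile is split into four independent quadrants


def quadrant(reg: int) -> int:
    # Hypothetical mapping (an assumption, not taken from the design):
    # the low two bits of the register number select the quadrant.
    return reg % NUM_QUADRANTS


def port_demand(fmacs):
    """Count per-quadrant (reads, writes) for FMACs issued in one cycle.

    Each FMAC is a (dest, src_a, src_b, src_c) tuple: dest = a * b + c,
    so it needs three reads and one write.
    """
    reads, writes = Counter(), Counter()
    for dest, *srcs in fmacs:
        writes[quadrant(dest)] += 1
        for src in srcs:
            reads[quadrant(src)] += 1
    return {q: (reads[q], writes[q]) for q in range(NUM_QUADRANTS)}


# Sequentially-adjacent registers: r0..r3 = r4..r7 * r8..r11 + r12..r15
adjacent = [(i, 4 + i, 8 + i, 12 + i) for i in range(4)]
print(port_demand(adjacent))  # every quadrant needs only 3R1W

# Worst case: every operand lands in the same quadrant
clustered = [(16 * i, 16 * i + 4, 16 * i + 8, 16 * i + 12) for i in range(4)]
print(port_demand(clustered))  # one quadrant would need 12R4W
```

With adjacent numbering each quadrant sees exactly three reads and one write. Arbitrary numbering can concentrate all twelve reads and four writes on a single quadrant, so supporting arbitrary numbering at full rate would require 12R4W porting.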

This in turn requires a bus architecture that connects the quadrants to the
Function Units and keeps the pipelines busy. The connectivity, shown in the
diagram at the end of this page, is as follows:

* A single Dynamic PartitionedSignal-capable 64-bit-wide pipeline is at the
top (a second Dynamic pipeline is off-page, with its own FUs).
* A **pair** of 32-bit Function Units connects to the (shared) pipeline.
* The number of **pairs** of Function Units **must** match (or preferably
exceed) the number of pipeline stages.
* Connected to each of the Operand and Result Ports on each Function Unit
is a cyclic buffer:
  * Read-operands may "cycle" to reach their destination.
  * Write-operands may be "cycled" so as to pick an appropriate destination.
* **Independent** Common Data Buses (CDBs), one for each Quadrant of the
Regfile, connect the Function Units' cyclic buffers to the **global**
cyclic buffers dedicated to that Quadrant.
* Within each Quadrant's global cyclic buffers, inter-buffer transfer ports
allow copies of regfile data to be transferred from the write side to the
read side. This constitutes the entirety of what is known as an
**Operand Forwarding Bus**.
* **Between** the Quadrants' global cyclic buffers there is a 4x4
Crossbar that allows data to move (slowly, and only if necessary) across
Quadrants.
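
The "cycling" of operands through a cyclic buffer can be modelled, in a deliberately simplified form, as a ring of slots that rotates by one position per clock, with each operand leaving the ring when it reaches the lane of its destination port. The function `simulate_ring()` and its lane numbering are hypothetical, chosen only to illustrate the idea; they do not describe the actual hardware implementation.

```python
def simulate_ring(operands, num_lanes=4):
    """Simplified model of a cyclic buffer (hypothetical, for illustration).

    operands: iterable of (start_lane, dest_lane, name), at most one per
    start lane. Each clock, any operand sitting in its destination lane is
    delivered; the ring then rotates so lane i's contents move to lane i+1.
    Returns {name: clock cycle on which it was delivered}.
    """
    ring = [None] * num_lanes
    for start, dest, name in operands:
        if not 0 <= dest < num_lanes:
            raise ValueError(f"destination lane {dest} out of range")
        if ring[start] is not None:
            raise ValueError(f"lane {start} already occupied")
        ring[start] = (dest, name)

    delivered = {}
    cycle = 0
    while any(slot is not None for slot in ring):
        # Deliver every operand that has reached its destination lane.
        for lane, slot in enumerate(ring):
            if slot is not None and slot[0] == lane:
                delivered[slot[1]] = cycle
                ring[lane] = None
        # Rotate by one position: the contents of lane i move to lane i+1.
        ring = [ring[(i - 1) % num_lanes] for i in range(num_lanes)]
        cycle += 1
    return delivered


# "a" is already at its destination; "b" and "c" must cycle round to reach theirs.
print(simulate_ring([(0, 0, "a"), (1, 3, "b"), (2, 1, "c")]))
```

In this model an operand is delivered after `(dest_lane - start_lane) % num_lanes` cycles, so operands that start in the right lane go straight through, while the others pay a latency cost for cycling.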

Notes:

* The **only** way for register results and operands to cross over between
quadrants of the regfile is the 4x4 crossbar. Because its bandwidth is
limited, poor placement of an operation adversely affects its completion
time. Given that read operands outnumber write operands, allocation of
operations to Function Units should therefore prioritise placing each
operation where its "reads" may go straight through (i.e. in the quadrant
that already holds its source registers).
* As outlined in this comment <https://bugs.libre-soc.org/show_bug.cgi?id=296#10>,
the infrastructure above can, by way of the cyclic buffers, cope with and
automatically adapt between *serial* and *parallel* delivery of operands.
Performance is not adversely affected by serial delivery, although the
latency of an FMAC is extended by 3 cycles, because only one CDB is
available to deliver operands.
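
The placement guideline in the first note can be expressed as a simple cost function. The sketch below assumes the same hypothetical low-two-bits quadrant mapping (`quadrant()` is an illustrative helper, not part of the design) and picks the quadrant that minimises crossbar transfers, ranking read-operand crossings ahead of the write crossing because reads are more numerous. It is one possible heuristic, offered as an illustration.

```python
NUM_QUADRANTS = 4


def quadrant(reg: int) -> int:
    # Hypothetical mapping (assumption): low two bits select the quadrant.
    return reg % NUM_QUADRANTS


def choose_quadrant(dest: int, srcs: tuple) -> int:
    """Pick the quadrant whose Function Units need the fewest 4x4 crossbar
    transfers. Read crossings are compared first, the write crossing
    second; ties go to the lowest-numbered quadrant."""

    def cost(q: int):
        read_crossings = sum(1 for s in srcs if quadrant(s) != q)
        write_crossing = int(quadrant(dest) != q)
        return (read_crossings, write_crossing)

    return min(range(NUM_QUADRANTS), key=cost)


# All three sources live in quadrant 2, the destination in quadrant 1:
# placing the FMAC in quadrant 2 lets the reads go straight through,
# at the price of one write crossing the crossbar.
print(choose_quadrant(5, (2, 6, 10)))
```

Comparing cost tuples lexicographically means a single avoided read crossing always outweighs the write crossing, which matches the "reads first" priority described in the note.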

[[!img regfile_hilo_32_odd_even.png size="500px"]]