3d_gpu/architecture/regfile.mdwn

# Register Files

Discussion:

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-June/008368.html>

A minimum of 4 register files are required for POWER:

* Floating-point
* Integer
* Control and Condition Code Registers (CR0-7, CTR, LR)
* SPRs (Special Purpose Registers)

Source code:

* <https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/regfile/regfiles.py;hb=HEAD>

For a GPU, the FP and Integer registers need to be a massive 128 x 64-bit.

# Regfile groups, Port Allocations and bit-widths

* INT regfile: 32x 64-bit with 4R1W
* SPR regfile: 1024x 64-bit (!) with 1R1W; needs a "map" on that
* CR regfile: 8x 4-bit with full 8R8W (for full 32-bit read/write)
  - CR0-7: 4-bit
* XER regfile: 2x 2-bit, 1x 1-bit with full 3R3W
  - CA(32) - 2-bit
  - OV(32) - 2-bit
  - SO     - 1-bit
* FAST regfile: 7x 64-bit, full 3R2W (possibly greater)
  - MSR: 64-bit
  - PC: 64-bit
  - LR: 64-bit
  - CTR: 64-bit
  - TAR: 64-bit
  - SRR0: 64-bit
  - SRR1: 64-bit
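
The groups above can be written down as plain data for quick sanity checks. The following is only a sketch: the `RegfileSummary` record and its field names are assumptions made for illustration and are not taken from `regfiles.py`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RegfileSummary:
    """Hypothetical summary record; not a class from the soc codebase."""
    name: str
    widths: Tuple[int, ...]   # bit-width of every entry in the file
    rd_ports: int
    wr_ports: int

    @property
    def total_bits(self) -> int:
        return sum(self.widths)


REGFILES = [
    RegfileSummary("INT", (64,) * 32, 4, 1),
    RegfileSummary("SPR", (64,) * 1024, 1, 1),
    RegfileSummary("CR", (4,) * 8, 8, 8),
    RegfileSummary("XER", (2, 2, 1), 3, 3),      # CA, OV, SO
    RegfileSummary("FAST", (64,) * 7, 3, 2),
]

# Index by name for convenient lookup.
SUMMARY = {rf.name: rf for rf in REGFILES}

for rf in REGFILES:
    print(f"{rf.name:4} {len(rf.widths):4} entries, "
          f"{rf.rd_ports}R{rf.wr_ports}W, {rf.total_bits} bits in total")
```

The CR total of 32 bits corresponds to the "full 32-bit read/write" case that the 8R8W allocation is there to support.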

# Connectivity between regfiles and Function Units

The target for the first ASICs is a minimum of 4 32-bit FMACs per clock cycle.
If it is acceptable that this be achieved on sequentially-adjacent-numbered
registers, a significant reduction in the amount of regfile porting may be
achieved (down from the 12R4W that four independent three-operand FMACs
would otherwise need).

It does however require that the register file be broken into four
completely separate and independent quadrants, each with their own
separate and independent 3R1W (or 4R1W) ports.
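
The port arithmetic can be checked with a small model. This is a sketch under a stated assumption: 32-bit elements are taken to be interleaved across the four quadrants by element index (`quadrant_of` is a hypothetical mapping, chosen only to make the counting concrete). Four FMACs on sequentially-numbered elements then ask each quadrant for exactly 3 reads and 1 write, whereas four FMACs whose operands all sit in one quadrant would ask that quadrant for 12 reads and 4 writes.

```python
NUM_QUADRANTS = 4


def quadrant_of(elem):
    """ASSUMPTION: 32-bit elements are interleaved across quadrants by index."""
    return elem % NUM_QUADRANTS


def sequential_fmacs(dest, a, b, c, count=4):
    """Element dest+i is computed from elements a+i, b+i and c+i."""
    return [(dest + i, a + i, b + i, c + i) for i in range(count)]


def port_demand(fmacs):
    """Return a (reads, writes) pair per quadrant for one cycle of FMACs."""
    reads = [0] * NUM_QUADRANTS
    writes = [0] * NUM_QUADRANTS
    for dest, *srcs in fmacs:
        writes[quadrant_of(dest)] += 1
        for src in srcs:
            reads[quadrant_of(src)] += 1
    return list(zip(reads, writes))


# Sequential elements: every quadrant sees exactly 3 reads and 1 write.
print(port_demand(sequential_fmacs(0, 8, 16, 24)))

# Clustered case: all twelve sources and all four results land in quadrant 0.
clustered = [(0, 4, 8, 12), (16, 20, 24, 28), (32, 36, 40, 44), (48, 52, 56, 60)]
print(port_demand(clustered))
```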

This then requires some Bus Architecture to connect and keep the pipelines
busy.  The connectivity diagram further down this page is laid out as follows:

* A single Dynamic-PartitionedSignal-capable 64-bit-wide pipeline is at the
  top left and top right.
* Multiple **pairs** of 32-bit Function Units (making up a 64-bit data
  path) connect, as "Concurrent Units", to each pipeline.
* The number of **pairs** of Function Units **must** match (or preferably
  exceed) the number of pipeline stages.
* Connected to each of the Operand and Result Ports on each Function Unit
  is a cyclic buffer.
* Read-operands may "cycle" to reach their destination.
* Write-operands may be "cycled" so as to pick an appropriate destination.
* **Independent** Common Data Buses, one for each Quadrant of the Regfile,
  connect between the Function Unit's cyclic buffers and the **global**
  cyclic buffers dedicated to that Quadrant.
* Within each Quadrant's global cyclic buffers, inter-buffer transfer ports
  allow for copies of regfile data to be transferred from write-side to
  read-side.  This constitutes the entirety of what is known as an
  **Operand Forwarding Bus**.
* **Between** each Quadrant's global cyclic buffers, there exists a 4x4
  Crossbar that allows data to move (slowly, and if necessary) across
  Quadrants.

Notes:

* There is only **one** 4x4 crossbar (or, one for reads, one for writes?)
  and thus only **one** inter-Quadrant 32-bit-wide data path (total
  bandwidth 4x32 bits).  These are to be shared by **five** groups of
  operand ports at each of the Quadrant Global Cyclic Buffers.
* The **only** way for register results and operands to cross over between
  quadrants of the regfile is that 4x4 crossbar.  Because data transfer
  bandwidth is limited, the placement of an operation can adversely affect
  its completion time.  Thus, given that read operands exceed the number
  of write operands, allocation of operations to Function Units should
  prioritise placing the operation where the "reads" may go straight
  through.
* As outlined in this comment <https://bugs.libre-soc.org/show_bug.cgi?id=296#10>,
  the infrastructure above can, by way of the cyclic buffers, cope with
  and automatically adapt between a *serial* delivery of operands and
  a *parallel* delivery of operands.  Performance is not adversely
  affected by the serial delivery, although the latency of an FMAC is
  extended by 3 cycles, because only one CDB is available to deliver
  operands.
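
The placement rule in the second note can be sketched as a cost function. This is an illustration only, not the allocation logic of any existing code: `crossings` and `best_quadrant` are hypothetical names, and the tie-break towards the lowest quadrant number is an arbitrary assumption. Crossing reads are compared first, so an operation goes where most of its reads pass straight through, even if its result then has to cross the crossbar.

```python
NUM_QUADRANTS = 4


def crossings(fu_quadrant, src_quadrants, dest_quadrant):
    """Count the (read, write) transfers that would need the crossbar."""
    reads = sum(1 for q in src_quadrants if q != fu_quadrant)
    writes = 0 if dest_quadrant == fu_quadrant else 1
    return reads, writes


def best_quadrant(src_quadrants, dest_quadrant):
    """Fewest crossing reads first, then fewest crossing writes.

    Ties go to the lowest quadrant number (an arbitrary choice).
    """
    return min(range(NUM_QUADRANTS),
               key=lambda q: crossings(q, src_quadrants, dest_quadrant) + (q,))


# Two of three sources live in quadrant 2, the result belongs to quadrant 1:
# placing the operation in quadrant 2 leaves one read and one write crossing.
print(best_quadrant([2, 2, 1], dest_quadrant=1))
```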

Click on the image to expand it full-screen:

[[!img regfile_hilo_32_odd_even.png size="500px"]]

# Regspecs

* Source: <https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/regspec.py;hb=HEAD>

"Regspecs" is a term used for describing the relationship between register files,
register file ports, register widths, and the Computation Units that they connect
to.

Regspecs are defined, in Python, as tuples of three fields:

| Regfile name | CompUnit Record name | bit range register mapping |
| ----         | ----------           | ------------               |
| INT          | ra                   | 0:3,5                      |

Description of each heading:

* Regfile name: INT corresponds to the INTEGER file, CR to Condition Register etc.
* CompUnit Record name: in the Input or Output Record there will be a signal by
  name.  This field refers to that record signal, thus providing a sequential
  ordering for the fields.
* Bit range: this is specified as an *inclusive* range of the form "start:end"
  or just a single bit, "N".  Multiple ranges may be specified, and are
  comma-separated.  For example, "0:3,5" selects bits 0 to 3 and bit 5.
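
The bit-range syntax described above can be expanded with a few lines of Python. This is a minimal sketch: `regspec_bits` and `regspec_width` are hypothetical helper names, not the functions found in `regspec.py`.

```python
def regspec_bits(bitspec):
    """Expand a spec such as "0:3,5" into the list of bit numbers it covers.

    Ranges are inclusive at both ends, as described above.
    """
    bits = []
    for part in bitspec.split(","):
        if ":" in part:
            start, end = (int(x) for x in part.split(":"))
            bits.extend(range(start, end + 1))
        else:
            bits.append(int(part))
    return bits


def regspec_width(bitspec):
    """Number of bits selected by the spec."""
    return len(regspec_bits(bitspec))


print(regspec_bits("0:3,5"))     # [0, 1, 2, 3, 5]
print(regspec_width("0:63"))     # 64
```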

Here is how they are used:

```
class CRInputData(IntegerData):
    regspec = [('INT', 'a', '0:63'),      # 64 bit range
               ('INT', 'b', '0:63'),      # 64 bit range
               ('CR', 'full_cr', '0:31'), # 32 bit range
               ('CR', 'cr_a', '0:3'),     # 4 bit range
               ('CR', 'cr_b', '0:3'),     # 4 bit range
               ('CR', 'cr_c', '0:3')]     # 4 bit range
```

This tells us, when used by MultiCompUnit, that:

* CompUnit src reg 0 is from the INT regfile, is linked to CRInputData.a, and is 64-bit
* CompUnit src reg 1 is from the INT regfile, is linked to CRInputData.b, and is 64-bit
* CompUnit src reg 2 is from the CR regfile, is linked to CRInputData.full\_cr, and is 32-bit
* CompUnit src reg 3 is from the CR regfile, is linked to CRInputData.cr\_a, and is 4-bit
* CompUnit src reg 4 is from the CR regfile, is linked to CRInputData.cr\_b, and is 4-bit
* CompUnit src reg 5 is from the CR regfile, is linked to CRInputData.cr\_c, and is 4-bit
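
The list above follows mechanically from the position of each tuple in the regspec. The sketch below reproduces it from the CRInputData regspec; `width` and `describe_sources` are hypothetical helpers written for this illustration and are not part of MultiCompUnit.

```python
CR_INPUT_REGSPEC = [('INT', 'a', '0:63'),
                    ('INT', 'b', '0:63'),
                    ('CR', 'full_cr', '0:31'),
                    ('CR', 'cr_a', '0:3'),
                    ('CR', 'cr_b', '0:3'),
                    ('CR', 'cr_c', '0:3')]


def width(bitspec):
    """Bit-width of a single inclusive "start:end" range or a single bit "N"."""
    if ":" in bitspec:
        start, end = (int(x) for x in bitspec.split(":"))
        return end - start + 1
    return 1


def describe_sources(regspec, record_name):
    """One line per source operand; the list index is the src reg number."""
    return [f"src reg {idx}: {regfile} regfile, {record_name}.{field}, "
            f"{width(bits)}-bit"
            for idx, (regfile, field, bits) in enumerate(regspec)]


for line in describe_sources(CR_INPUT_REGSPEC, "CRInputData"):
    print(line)
```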

Likewise there is a corresponding regspec for CROutputData.  The two are combined
and associated with the Pipeline:

```
class CRPipeSpec(CommonPipeSpec):
    regspec = (CRInputData.regspec, CROutputData.regspec)
    opsubsetkls = CompCROpSubset
```

In this way the pipeline can be connected up to a generic, general-purpose class
(MultiCompUnit), which would otherwise know nothing about the details of the ALU
(Pipeline) that it is being connected to.

In addition, on the other side of the MultiCompUnit, the regspecs contain enough
information to be able to wire up batches of MultiCompUnits (now known, because
of their association with an ALU, as FunctionUnits), associating the MultiCompUnits
correctly with their corresponding Register File.
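
To illustrate why the regspecs carry enough information for that wiring, the sketch below groups Function Unit ports by Register File. Everything here is an assumption made for illustration: `regfile_connections` is a hypothetical helper, the Function Unit name `cr0` is made up, and `CR_OUTPUT` is a placeholder because the CROutputData regspec is not listed on this page.

```python
from collections import defaultdict

CR_INPUT = [('INT', 'a', '0:63'), ('INT', 'b', '0:63'),
            ('CR', 'full_cr', '0:31'), ('CR', 'cr_a', '0:3'),
            ('CR', 'cr_b', '0:3'), ('CR', 'cr_c', '0:3')]

# PLACEHOLDER: stands in for CROutputData.regspec, which is not shown here.
CR_OUTPUT = [('INT', 'o', '0:63'), ('CR', 'cr_a', '0:3')]


def regfile_connections(fus):
    """fus maps a Function Unit name to its (input, output) regspec pair.

    Returns, per regfile, the (FU name, record field) pairs that need a
    read port and those that need a write port.
    """
    table = defaultdict(lambda: {"read": [], "write": []})
    for fu_name, (in_spec, out_spec) in fus.items():
        for regfile, field, _bits in in_spec:
            table[regfile]["read"].append((fu_name, field))
        for regfile, field, _bits in out_spec:
            table[regfile]["write"].append((fu_name, field))
    return dict(table)


conns = regfile_connections({"cr0": (CR_INPUT, CR_OUTPUT)})
for regfile, ports in conns.items():
    print(regfile, ports)
```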

Note: there are two exceptions to the "generic-ness and abstraction"
where MultiCompUnit "knows nothing":

1. When the Operand Subset has a member "zero_a".  This tells MultiCompUnit
   to create a multiplexer that, if operand.zero_a is set, will put **ZERO**
   into its first src operand (src_i[0]) and it will **NOT** put out a
   read request (**NOT** raise rd.req[0]) for that first register.
2. When the Operand Subset has a member "imm_data".  This tells
   MultiCompUnit to create a multiplexer that, if operand.imm_data.ok is
   set, will copy operand.imm_data into its *second* src operand (src_i[1]).
   Further: that it will **NOT** put out a read request (**NOT** raise
   rd.req[1]) for that second register.

These should only be activated for INTEGER and Logical pipelines, and
the regspecs for them must note and respect the requirements: input
regspec[0] may *only* be associated with operand.zero_a, and input
regspec[1] may *only* be associated with operand.imm_data.  The POWER9
Decoder and the actual INTEGER and Logical pipelines have these
expectations **specifically** hard-coded into them.
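
The effect of the two exceptions can be modelled in plain Python. This is a behavioural sketch only, not the hardware description used by MultiCompUnit: `select_operands` is a hypothetical function, and its arguments stand in for operand.zero_a, operand.imm_data.ok and operand.imm_data.

```python
def select_operands(reg_values, zero_a=False, imm_ok=False, imm=0):
    """Return (src operands, read-request flags) for one operation.

    reg_values holds what the regfile would deliver for each src operand.
    A False read-request flag means no regfile read is issued for that slot.
    """
    srcs = list(reg_values)
    rd_req = [True] * len(srcs)
    if zero_a:
        srcs[0] = 0          # first operand forced to zero
        rd_req[0] = False    # and its register is never requested
    if imm_ok:
        srcs[1] = imm        # second operand replaced by the immediate
        rd_req[1] = False    # and its register is never requested
    return srcs, rd_req


# An "add immediate to zero" style operation reads no registers at all.
print(select_operands([5, 7], zero_a=True, imm_ok=True, imm=42))
```

Because the substitutions are tied to slot 0 and slot 1, a regspec that listed its operands in a different order would silently receive the zero or the immediate in the wrong operand.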