3d_gpu/architecture/tomasulo_transformation.mdwn

   1 # Conversion from Tomasulo to Scoreboards
   2
   3 See [discussion (1)](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/006747.html) and
   4 [discussion (2)](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/006904.html)
   5
   6 This page aids and assists in understanding the full functional equivalence
   7 of a Scoreboard-based design when compared to a Tomasulo algorithm.  However
   8 it is extremely important to note that the Academic literature, by focussing
   9 exclusively on the published patent covering Q-Tables, is hopelessly inaccurate,
  10 factually incorrect and completely misleading.
  11
  12 By only comparing Q-Tables against the entirety of the Tomasulo algorithm,
  13 this is equivalent to narrowly focussing on the Reorder Buffer of
  14 Tomasulo, excluding all else, and concluding that a design that uses a
  15 ROB and therefore the entire Tomasulo algorithm itself is incapable of useful
  16 high-performance out-of-order execution.
  17
  18 This article helps readers to understand that Q-Tables != Scoreboards,
  19 by describing a series of functionally-equivalent transformations that,
  20 when followed, *turn* the Tomasulo algorithm *into* a Scoreboard-based
  21 design.  It also highlights that, following that transformation, multi-issue
  22 execution is near-trivial to add by comparison. Precise exception
  23 handling is also trivial to add (holding of write commits) and is
  24 described in the [[6600scoreboard]] page under "Shadowing"
  25
  26 On Saturday, May 16, 2020, Yehowshua <yimmanuel3@gatech.edu> wrote:
  27 > This is a very intricate and complicated subject matter for sure.
  28
  29 yes, except it doesn't have to be.  the actual
  30 <https://en.wikipedia.org/wiki/Levenshtein_distance> between Tomasulo and
  31 6600 really is not that great.
  32
  33 i thought it would be fun to use a new unpronounceable word i learned
  34 yesterday :)
  35
  36 > At some point, it be great to really break things down and make them
  37 > more accessible.
  38
  39 yes. it comes down to time.
  40
  41 start with this.
  42
  43 1. Begin from Tomasulo.
  44
  45 2. Make sure to add an Operand Forwarding Bus.  this is critical to
  46    providing the functionality provided by the Tomasulo Common Data Bus.
  47
  48    note (later) that multiple Op Fwd Buses may be conveniently added
  49    as parallel data paths without severe design penalties.
  50
  51 3. Start by only allowing one row per Reservation Station.
  52
  53 4. Expand the number of RSes so that if you were to count the total
  54    number of places operands are stored, they are the same.
  55
  56    (another way to put this is, "flatten all 2D RSes into 1D")
  57
  58 5. where pipelines were formerly connected exclusively to one RS,
  59    *preserve* those connections even though the rows are now 1D flattened.
  60
  61    (another way to put this is: we have a global 1D naming scheme to
  62    reference the *operand latches* rather than a 2D scheme involving RS
  63    number in 1 dimension and the row number in the 2nd)
  64
  65 6. give this 1D flattening an UNARY numbering scheme.
  66
  67 7. make the size of the Reorder Buffer EXACTLY equal to the number of
  68    1D flattened RSes.
  69
  70 8. rename RSes to "Function Units" (actually in Thornton's book the phrase
  71    "Computation Units" is used)
  72
  73    thus, at this point in the transformation, the ROB row number *IS*
  74    the Function Unit Number, the need to actually store the ROB # in the
  75    Reservation Station Row is REMOVED, and consequently the Reservation
  76    Stations are NO LONGER A CAM.
  77
  78 9. give all register file numbers (INT FP) an UNARY numbering.
  79
  80    this means that in the ROB, updating of register numbers in a multi-issue
  81    scenario is a matter of raising one of any number of single bits.
  82    contrast this in the Tomasulo to having to multi-port the SRAM in the
  83    ROB, setting multiple bits *even for single-issue* (5-bits for 32-bit reg
  84    numbering).
  85
  86 10. add "Shadowing" capability to each Function Unit
  87     and create a Shadow Matrix (appx 20 gates per Function Unit)
  88
  89    the "Shadow" capability hooks into the WRITE-COMMIT phase of every
  90    Function Unit, permitting it to EXECUTE but prohibiting it from WRITING
  91    the result of that execution until explicitly permitted to do so.
  92
  93 11. Upgrade the CDB from a multi-fan-in, multi-fan-out, single resource
  94     global choke-point to **separate** (multiple, if desired) read-fanout
  95     broadcast and write-fan-in register data broadcast buses.
  96
  97 # Post-transformation Analysis
  98
  99 with the ROB now having rows of bitvectors, it is now termed a "Matrix".
 100
 101 the left side of the ROB, which used to contain the RS Number in unary,
 102 now contains a *bitvector* Directed Acyclic Graph of the FU to FU
 103 dependencies, and is split out into its own Matrix.
 104
 105 this we call the FU-FU Dependency Matrix.
 106
 107 therefore, where previously, it was the ROB Row binary number that preserved
 108 instruction order (as an inherent DAG through sequential cyclically-incremented
 109 numbering), the 2-D bit-level FU-FU matrix preserves the same DAG by way of
 110 single-bit cells that express FU-to-FU dependencies, creating a hardware-form
 111 of a software "linked list".
 112
 113 the remainder of the "ROB" contains the register numbers in unary Matrix
 114 form, and with each row being directly associated with a Function Unit,
 115 we now have an association between FU and Regs which preserves the
 116 knowledge of what instruction required which registers, *and* who will
 117 produce the result.
 118
 119 this we call the FU-Regs Dependency Matrix.
 120
 121 that *really is it*.
 122
 123 take some time to absorb the transformation which not only preserves
 124 absolutely every functional aspect of the Tomasulo Algorithm, it
 125 drastically simplifies the implementation, reduces gate count, reduces
 126 power consumption *and* provides a strong foundation for doing arbitrary
 127 multi-issue execution with only an O(N) linear increase in gate count
 128 to do so.
 129
 130 further hilariously simple additional transformations occur to replace
 131 former massive resource constrained bottlenecks, due to the binary
 132 numbering on both ROB numbers and Reg numbers, with simple large unary
 133 NOR gates:
 134
 135 * the determination of when hazards are clear, on a per register basis,
 136   is a laughably trivial NOR gate across all columns of the FU-REGs matrix,
 137   producing a row bitvector for each read register and each write register.
 138
 139 * the determination of when a Function Unit may proceed is a laughably
 140   trivial NOR gate across all *rows* of the *FU-FU* Matrix, producing a
 141   row-based vector, determining that it is "readable" if there exists no
 142   write hazard and "writable" if there exists no read hazard.
 143
 144 * the Tomasulo Common Data Bus, formerly being a single chokepoint
 145   binary-addressing global Bus, may now be upgraded to *MULTIPLE* Common
 146   Data Buses that, because the addressing information about registers is now
 147   in unary, is likewise laughably trivial to use cascading Priority Pickers
 148   (a nmigen PriorityEncoder and Decoder, back-to-back) to determine which
 149   Function Unit shall be granted access to which CDB in order to receive
 150   (or send) its operand (or result).
 151
 152 * multi-issue as i mentioned a few times is an equally laughably trivial
 153   matter of transitively cascading the Register Dependency Hazards (both
 154   read and write) across future instructions in the same multi issue
 155   execution window. instr2 has instr1 AND instr2's hazards.  instr3 has
 156   instr1 AND instr2 AND instr3's hazards and so on.  this just leaves
 157   the necessity of increasing register port numbers, number of CDBs,
 158   and LD/ST memory bandwidth to compensate and cope with the additional
 159   resource demands that will now occur.
 160
 161 the latter is particularly why we have a design that, ultimately, we
 162 could take on ARM, Intel, and AMD.
 163
 164 there is no reason technically why we could not do a 4, 6 or 8 multi
 165 issue system, and with enough Function Units and the cyclic buffer system
 166 (so as not to require a full crossbar at the Common Data Buses), and
 167 proper stratification and design of the register files, massive Vector
 168 parallelism at the pipelines would be kept fully occupied without an
 169 overwhelming increase in gates or power consumption that would normally
 170 be expected, and scalar performance would be similarly high as well.
 171
 172 # Terminology notes
 173
 174 These terms help understand that conceptually there is no difference
 175 in the capabilities of Tomasulo and Scoreboards.
 176
 177 | Tomasulo name            | Scoreboard name                              |
 178 | -----                    | ----                                         |
 179 | Precise Exceptions       | Precise-capable ("Shadowed") Scoreboard      |
 180 | ROB index cycling order  | FU-FU DAG that preserves instruction order   |
 181 | Reorder Buffer           | hybrid of Shadow, FU-FU and FU-Regs Matrices |
 182 | Reservation Station CAMs | RS Row = "Computation Unit latches" (no CAM) |
 183 | "register renaming"      | "nameless" registers (Comp Unit latches)     |
 184 | part-ROB, part-RS        | Q-Tables                                     |
 185 | blocking Common Data Bus | fan-out Read Reg, fan-in Write, OpFwd Bus(es)|
 186 | Centralised regfile(s)   | Centralised regfile(s)                       |