3d_gpu/architecture/tomasulo_transformation.mdwn

   1 # Conversion from Tomasulo to Scoreboards
   2
   3 See [discussion](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-May/006747.html)
   4
   5 '''
   6 On Saturday, May 16, 2020, Yehowshua <yimmanuel3@gatech.edu> wrote:
   7 > This is a very intricate and complicated subject matter for sure.
   8
   9 yes, except it doesn't have to be.  the actual
  10 <https://en.wikipedia.org/wiki/Levenshtein_distance> between Tomasulo and
  11 6600 really is not that great.
  12
  13 i thought it would be fun to use a new unpronounceable word i learned
  14 yesterday :)
  15
  16 > At some point, it be great to really break things down and make them
  17 > more accessible.
  18
  19 yes. it comes down to time.
  20
  21 start with this.
  22
  23 1. Begin from Tomasulo.  neither TS nor original 6600 have precise
  24    exceptions so we leave that out for now.
  25
  26 2. Make sure to add an Operand Forwarding Bus.  this is critical to
  27    providing the functionality provided by the Tomasulo Common Data Bus.
  28
  29    note (later) that multiple Op Fwd Buses may be conveniently added
  30    as parallel data paths without severe design penalties.
  31
  32 3. Start by only allowing one row per Reservation Station.
  33
  34 4. Expand the number of RSes so that if you were to count the total
  35    number of places operands are stored, they are the same.
  36
  37 (another way to put this is, "flatten all 2D RSes into 1D")
  38
  39 5. where pipelines were formerly connected exclusively to one RS,
  40    *preserve* those connections even though the rows are now 1D flattened.
  41
  42 (another way to put this is: we have a global 1D naming scheme to
  43 reference the *operand latches* rather than a 2D scheme involving RS
  44 number in 1 dimension and the row number in the 2nd)
  45
  46 6. give this 1D flattening an UNARY numbering scheme.
  47
  48 7. make the size of the Reorder Buffer EXACTLY equal to the number of
  49    1D flattened RSes.
  50
  51 8. rename RSes to "Function Units" (actually in Thornton's book the phrase
  52    "Computation Units" is used)
  53
  54 thus, at this point in the transformation, the ROB row number *IS*
  55 the Function Unit Number, the need to actually store the ROB # in the
  56 Reservation Station Row is REMOVED, and consequently the Reservation
  57 Stations are NO LONGER A CAM.
  58
  59 9. give all register file numbers (INT FP) an UNARY numbering.
  60
  61 this means that in the ROB, updating of register numbers in a multi-issue
  62 scenario is a matter of raising one of any number of single bits.
  63 contrast this in the Tomasulo to having to multi-port the SRAM in the
  64 ROB, setting multiple bits *even for single-issue* (5-bits for 32-bit reg
  65 numbering).
  66
  67 with the ROB now having rows of bitvectors, it is now termed a "Matrix".
  68
  69 the left side of the ROB, which used to contain the RS Number in unary,
  70 now contains a *bitvector* Directed Acyclic Graph of the FU to FU
  71 dependencies, and is split out into its own Matrix.
  72
  73 this we call the FU-FU Dependency Matrix.
  74
  75 the remainder of the "ROB" contains the register numbers in unary Matrix
  76 form, and with each row being directly associated with a Function Unit,
  77 we now have an association between FU and Regs which preserves the
  78 knowledge of what instruction required which registers, *and* who will
  79 produce the result.
  80
  81 this we call the FU-Regs Dependency Matrix.
  82
  83 that *really is it*.
  84
  85 take some time to absorb the transformation which not only preserves
  86 absolutely every functional aspect of the Tomasulo Algorithm, it
  87 drastically simplifies the implementation, reduces gate count, reduces
  88 power consumption *and* provides a strong foundation for doing arbitrary
  89 multi-issue execution with only an O(N) linear increase in gate count
  90 to do so.
  91
  92 further hilariously simple additional transformations occur to replace
  93 former massive resource constrained bottlenecks, due to the binary
  94 numbering on both ROB numbers and Reg numbers, with simple large unary
  95 NOR gates:
  96
  97 * the determination of when hazards are clear, on a per register basis,
  98   is a laughably trivial NOR gate across all columns of the FU-REGs matrix,
  99   producing a row bitvector for each read register and each write register.
 100
 101 * the determination of when a Function Unit may proceed is a laughably
 102   trivial NOR gate across all *rows* of the *FU-FU* Matrix, producing a
 103   row-based vector, determining that it is "readable" if there exists no
 104   write hazard and "writable" if there exists no read hazard.
 105
 106 * the Tomasulo Common Data Bus, formerly being a single chokepoint
 107   binary-addressing global Bus, may now be upgraded to *MULTIPLE* Common
 108   Data Buses that, because the addressing information about registers is now
 109   in unary, is likewise laughably trivial to use cascading Priority Pickers
 110   (a nmigen PriorityEncoder and Decoder, back-to-back) to determine which
 111   Function Unit shall be granted access to which CDB in order to receive
 112   (or send) its operand (or result).
 113
 114 * multi-issue as i mentioned a few times is an equally laughably trivial
 115   matter of transitively cascading the Register Dependency Hazards (both
 116   read and write) across future instructions in the same multi issue
 117   execution window. instr2 has instr1 AND instr2's hazards.  instr3 has
 118   instr1 AND instr2 AND instr3's hazards and so on.  this just leaves
 119   the necessity of increasing register port numbers, number of CDBs,
 120   and LD/ST memory bandwidth to compensate and cope with the additional
 121   resource demands that will now occur.
 122
 123 the latter is particularly why we have a design that, ultimately, we
 124 could take on ARM, Intel, and AMD.
 125
 126 there is no reason technically why we could not do a 4, 6 or 8 multi
 127 issue system, and with enough Function Units and the cyclic buffer system
 128 (so as not to require a full crossbar at the Common Data Buses), and
 129 proper stratification and design of the register files, massive Vector
 130 parallelism at the pipelines would be kept fully occupied without an
 131 overwhelming increase in gates or power consumption that would normally
 132 be expected, and scalar performance would be similarly high as well.
 133 '''