3d_gpu/microarchitecture.mdwn

# High-level architectural Requirements

* SMP Cache coherency (TileLink?)
* Minimum 800MHz
* Minimum 2-core SMP, more likely 4-core uniform design,
  each core with full 4-wide SIMD-style predicated ALUs
* 6GFLOPS single-precision FP
* 128 64-bit FP and 128 64-bit INT register files
* RV64GC compliance for running full GNU/Linux-based OS
* SimpleV compliance
* xBitManip (required for VPU and ideal for predication)
* 4-lane 2Rx1W SRAMs for registers numbered 32 and above;
  Multi-R x Multi-W for registers 1-31.
  TODO: consider 2R for registers to be used as predication targets
  if >= 32.
* Idea: generic implementation of ports on register file so as to be able
  to experiment with different arrangements.
* Potentially: Lane-swapping / crossing / data-multiplexing
  bus on register data (particularly because of SHAPE-REMAP (1D/2D/3D))
* Potentially: Registers subdivided into 16-bit, to match
  elwidth down to 16-bit (for FP16).  8-bit elwidth only
  goes down as far as twin-SIMD (with predication).  This
  requires registers to have extra hidden bits: register
  x30 is now "x30.0+x30.1+x30.2+x30.3".  Have to discuss.
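
To make the 16-bit subdivision idea concrete, here is a minimal Python sketch of a 64-bit register viewed as four 16-bit elements. The helper names, and the choice that element 0 sits in the least-significant bits, are assumptions for illustration only; none of this has been decided.

```python
# Hypothetical sketch: a 64-bit register viewed as four 16-bit sub-registers
# (x30.0 .. x30.3).  Placing element 0 in the low bits is an assumption.
XLEN = 64
ELWIDTH = 16
ELEMS_PER_REG = XLEN // ELWIDTH          # 4 sub-registers per 64-bit register
ELEM_MASK = (1 << ELWIDTH) - 1
REG_MASK = (1 << XLEN) - 1

def get_subreg(reg_value, idx):
    """Return 16-bit element `idx` (0..3) of a 64-bit register value."""
    assert 0 <= idx < ELEMS_PER_REG
    return (reg_value >> (idx * ELWIDTH)) & ELEM_MASK

def set_subreg(reg_value, idx, elem):
    """Return reg_value with 16-bit element `idx` replaced by `elem`."""
    assert 0 <= idx < ELEMS_PER_REG
    shift = idx * ELWIDTH
    cleared = reg_value & ~(ELEM_MASK << shift) & REG_MASK
    return cleared | ((elem & ELEM_MASK) << shift)

# FP16 bit patterns for 4.0, 3.0, 2.0, 1.0 packed into one register.
x30 = 0x4400_4200_4000_3C00
```

With this layout, `get_subreg(x30, 0)` picks out the FP16 pattern for 1.0, and a predicated write to a single element only has to touch 16 of the 64 bits.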

# Conversation Notes

----

I'm thinking about using TileLink (or something similar) internally, as
having a cache-coherent protocol is required for implementing Vulkan
(unless you want to turn off the cache for the GPU memory, which I
don't think is a good idea). AXI is not a cache-coherent protocol,
and TileLink already has atomic RMW operations built into the protocol.
We can use an AXI-to-TileLink bridge to interface with the memory.

I'm thinking we will want to have a dual-core GPU, since a single
core with 4xSIMD is too slow to achieve 6GFLOPS with a reasonable
clock speed. Additionally, that allows us to use an 800MHz core clock
instead of the 1.6GHz we would otherwise need, allowing us to lower the
core voltage and save power, since the power used is proportional to
F\*V^2. (just guessing on clock speeds.)
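
The arithmetic behind those clock figures can be checked with a short Python sketch. It assumes each SIMD lane retires one FP operation per cycle (counting an FMA as two would double the result), and the voltage scaling factor is a made-up number used only to show the effect of the V^2 term.

```python
# Back-of-envelope check of the throughput and power reasoning above.
# Assumption: one FP operation per SIMD lane per cycle.
def peak_gflops(cores, lanes, clock_ghz, flops_per_lane_per_cycle=1):
    """Peak throughput in GFLOPS for a uniform multi-core SIMD design."""
    return cores * lanes * clock_ghz * flops_per_lane_per_cycle

def relative_dynamic_power(cores, freq_scale, volt_scale):
    """Dynamic power relative to one core at nominal F and V (P ~ F * V^2)."""
    return cores * freq_scale * volt_scale ** 2

single = peak_gflops(cores=1, lanes=4, clock_ghz=1.6)   # one core at 1.6GHz
dual = peak_gflops(cores=2, lanes=4, clock_ghz=0.8)     # two cores at 800MHz

# volt_scale=0.8 is hypothetical: it only illustrates that halving the clock
# and lowering the voltage can outweigh the cost of a second core.
power = relative_dynamic_power(cores=2, freq_scale=0.5, volt_scale=0.8)
print(single, dual, power)
```

Both configurations reach the same peak throughput, so the choice between them comes down to how far the voltage can really be lowered at the slower clock.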

----

I don't know about power, however I have done some research and a 4Kbyte
(or 16, I can't recall) SRAM (what I was thinking of for a tile buffer) takes in the
ballpark of 1000 um^2 in 28nm.
Using a 4xFMA with a banked register file where the bank is selected by the
lower order register number means we could probably get away with 1Rx1W
SRAM as the backing memory for the register file, similarly to Hwacha. I
would suggest 8 banks, allowing us to do more in parallel since we could run
other units in parallel with a 4xFMA. 8 banks would also allow us to clock
gate the SRAM banks that are not in use for the current clock cycle,
allowing us to save more power. Note that the 4xFMA could be 4 separately
allocated FMA units; it doesn't have to be SIMD style. If we have enough hw
parallelism, we can under-volt and under-clock the GPU cores, allowing for a
more efficient GPU. If we are using the GPU cores as CPU cores as well, I
think it would be important to be able to use a faster clock speed when not
using the extended registers (similar to how Intel processors use a lower
clock rate when AVX512 is in use) so that scalar code is not slowed down
too much.

> > Using a 4xFMA with a banked register file where the bank is selected by
> > the lower order register number means we could probably get away with
> > 1Rx1W SRAM as the backing memory for the register file, similarly to
> > Hwacha.
>
> okaaay.... sooo... we make an assumption that the top higher "banks"
> are pretty much always going to be "vectorised", such that, actually,
> they genuinely don't need to be 6R-4W (or whatever).

Yeah pretty much, though I had meant the bank number comes from the
least-significant bits of the 7-bit register number.
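
The bank selection described in this exchange can be sketched in a few lines of Python. The function names are hypothetical, and the sketch only models which bank an access lands in: it says nothing about timing or about how the lower-numbered scalar registers would be ported.

```python
# Hypothetical sketch: a 128-entry register file split into 8 banks, with the
# bank chosen by the least-significant bits of the 7-bit register number.
NUM_REGS = 128                            # 7-bit register number
NUM_BANKS = 8                             # bank count suggested in the notes
BANK_BITS = NUM_BANKS.bit_length() - 1    # 3 low-order bits select the bank

def bank_and_row(regnum):
    """Split a register number into (bank, row-within-bank)."""
    assert 0 <= regnum < NUM_REGS
    return regnum & (NUM_BANKS - 1), regnum >> BANK_BITS

def banks_in_use(regnums):
    """Banks touched in one cycle; the remaining banks could be clock-gated."""
    return {bank_and_row(r)[0] for r in regnums}

def has_conflict(regnums):
    """True if two accesses hit the same bank (a 1R bank can serve only one)."""
    banks = [bank_and_row(r)[0] for r in regnums]
    return len(banks) != len(set(banks))

vec = [32, 33, 34, 35]                    # 4 consecutive registers of a vector
print(banks_in_use(vec), has_conflict(vec), has_conflict([32, 40]))
```

Consecutive register numbers spread across different banks, which is what lets a 4-wide vector access proceed with single-ported banks. Registers whose numbers differ by a multiple of 8 share a bank and would have to be serialised.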

----

Assuming 64-bit operands:
if you can organize 2 SRAM macros and use the pair of them to
read/write 4 registers at a time (256 bits), the pipeline will allow you to
dedicate 3 cycles for reading and 1 cycle for writing (4 registers each).

    RS1 = Read of operand S1
    WRd = Write of result Dst
    FMx = Floating Point Multiplier, x = stage.

    |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                    |FWD|FM1|FM2|FM3|FM4|
                        |FWD|FM1|FM2|FM3|FM4|
                            |FWD|FM1|FM2|FM3|FM4|WRd|
                    |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                                    |FWD|FM1|FM2|FM3|FM4|
                                        |FWD|FM1|FM2|FM3|FM4|
                                            |FWD|FM1|FM2|FM3|FM4|WRd|
                                    |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                                                    |FWD|FM1|FM2|FM3|FM4|
                                                        |FWD|FM1|FM2|FM3|FM4|
                                                            |FWD|FM1|FM2|FM3|FM4|WRd|

The only trick is getting the read and write dedicated on different clocks.
When the RS3 operand is not needed (60% of the time) you can use
the time slot for reading or writing on behalf of memory refs: STs read,
LDs write.

You will find doing VRFs a lot more compact this way. In GPU land we
called the flip-flops orchestrating the timing "collectors".
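
As a rough illustration of the timing in the diagram, the following Python sketch lists which register-file access happens on each cycle and checks that reads and writes never share one. The constants are read off the diagram (a new group of 4 registers every 4 cycles, with WRd landing 11 cycles after that group's RS1), and treating the register file as a single shared access slot per cycle is an assumption of this model, not something stated in the discussion.

```python
# Hypothetical model of the register-file access schedule in the diagram.
READ_SLOTS = ("RS1", "RS2", "RS3")
GROUP_PERIOD = 4      # a new group of 4 registers starts every 4 cycles
WRITE_LATENCY = 11    # cycles from a group's RS1 to its WRd, per the diagram

def port_schedule(needs_rs3):
    """Map cycle -> access, given per-group flags saying whether RS3 is used."""
    schedule = {}
    for g, rs3 in enumerate(needs_rs3):
        base = g * GROUP_PERIOD
        for offset, slot in enumerate(READ_SLOTS):
            # An unused RS3 slot is free for memory refs (ST read / LD write).
            free = slot == "RS3" and not rs3
            cycle = base + offset
            assert cycle not in schedule, "two accesses in one cycle"
            schedule[cycle] = f"g{g}:free" if free else f"g{g}:{slot}"
        wr_cycle = base + WRITE_LATENCY
        assert wr_cycle not in schedule, "two accesses in one cycle"
        schedule[wr_cycle] = f"g{g}:WRd"
    return schedule

sched = port_schedule([True, False, True])
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

In this model the reads always fall on cycles 0 to 2 of each 4-cycle window and the writes on cycle 3, so the internal assertions never fire regardless of how many groups are issued.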

# References

* <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
* <https://en.wikipedia.org/wiki/Reservation_station>
* Register File Bank Caching <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
* Discussion <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-November/000157.html>
* <https://github.com/UCSBarchlab/PyRTL/blob/master/examples/example5-instrospection.py>