sync_up: Discussion page for tomorrow's meeting

diff --git a/3d_gpu/requirements_specification.mdwn b/3d_gpu/requirements_specification.mdwn

index 6c0b3e5c54d5c9fe4110d842fd211de7926e93bb..2956ce5dd587248ca8f3e5d63ed6a2506975b2f7 100644
--- a/3d_gpu/requirements_specification.mdwn
+++ b/3d_gpu/requirements_specification.mdwn
@@ -1,5 +1,14 @@
  # Requirements Specification
  
+This document contains the Requirements Specification for the Libre RISC-V
+micro-architectural design.  It shall meet the target of 5-6 32-bit GFLOPs,
+150 M-Pixels/sec, 30 Million Triangles/sec, and minimum video decode
+capability of 720p @ 30fps to a 1920x1080 framebuffer, in under 2.5 watts
+at an 800MHz clock rate.  Exceeding this target is acceptable if the
+power budget is not exceeded.  Exceeding this target "just because we can"
+is also acceptable, as long as it does not disrupt meeting the minimum
+performance and power requirements.
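As a rough sanity check, the headline targets above can be converted into per-clock figures. The sketch below is illustrative arithmetic only: the variable names are invented here, and 720p is assumed to mean the usual 1280x720 frame size.

```python
# Targets quoted in the introduction of the specification.
CLOCK_HZ = 800e6                 # 800MHz clock rate
GFLOPS_RANGE = (5e9, 6e9)        # 5-6 32-bit GFLOPs
PIXELS_PER_SEC = 150e6           # 150 M-Pixels/sec
TRIANGLES_PER_SEC = 30e6         # 30 Million Triangles/sec


def per_clock(rate_per_sec, clock_hz=CLOCK_HZ):
    """Convert a per-second rate into a per-clock-cycle rate."""
    return rate_per_sec / clock_hz


# FP32 operations that must complete per clock, on average.
flops_per_clock = tuple(per_clock(g) for g in GFLOPS_RANGE)

# Pixels per clock, and clock cycles available per triangle.
pixels_per_clock = per_clock(PIXELS_PER_SEC)
clocks_per_triangle = CLOCK_HZ / TRIANGLES_PER_SEC

# Decoded pixel rate for 720p @ 30fps (assuming 1280x720 frames).
decode_pixels_per_sec = 1280 * 720 * 30

# How many times per second 150 M-Pixels/sec could fill a 1920x1080 framebuffer.
framebuffer_fills_per_sec = PIXELS_PER_SEC / (1920 * 1080)
```

Under these assumptions the design needs to sustain between 6.25 and 7.5 FP32 operations per clock, and has roughly 26.7 clock cycles per triangle.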
+
  # General Architectural Design Principle
  
  The general design base is to utilise an augmented and enhanced variant
@@ -12,13 +21,45 @@ will all be added by overloading write hazards.
  
  An overview of the design is as follows:
  
+* 3D and Video primitives (operations) will only be added as strictly
+  necessary to achieve the minimum power and performance target.
+* Identified so far is a 4xFP32 ARGB Quad to 1xINT32 ARGB pixel
+  conversion opcode (part of the Vulkan API).  It will write directly
+  to a separate "tile buffer" (SRAM), not to the integer register
+  file.  The instruction will be scalar and will be inherently and
+  automatically parallelised by SV, just like all other scalar opcodes.
+* xBitManip opcodes will be required to deal with VPU workloads.
  * The register files will be stratified into 4-way 2R1W banks,
-  with byte-level write-enable on all banks.
+  with *separate* and distinct byte-level write-enable lines on all four
+  bytes of all four banks.
  * 6600-style scoreboards will be augmented with "shadow" wires
    and write hazard capability on exceptions, branch speculation,
    LD/ST and predication.
+* Each "shadow" capability of each type will be provided by a separate
+  Function Unit.  For example, if it is to be possible to roll
+  ahead through two speculative branches, then two **separate**
+  Branch-speculative Function Units will be required: each will
+  hold its own separate and distinct "shadow" (Go-Die wire) and
+  write-hazard over instructions that depend on the branch.
+* Likewise for predication, which shall place a "hold" on
+  the Function Units that depend on it until the register used
+  as a predicate mask has been read and decoded, there will be
+  separate Function Units waiting for each predication mask register.
+  Bits in the mask that are "zero" will result in "Go-Die" signals being
+  sent to the Function Units previously (speculatively) allocated for that
+  (now cancelled) element operation.  Bits that are "1" will cancel
+  their Write-Hazard and allow the Function Unit to proceed with that
+  element's operation.
+* The 6600 "Q-Table" that records, for each register, the last Function
+  Unit (in instruction issue order) that is to write its result to that
+  register, shall be augmented with "history" capability that aids and
+  assists in "rollback" of "nameless" registers, should an exception
+  or interrupt occur. "History" is simply a (short) queue (stack)
+  that preserves, in instruction-issue order, a record of the previous
+  Function Unit(s) that targeted each register as a destination.
  * Function Units will have both src and destination Reservation
-  Stations (latches) in order to buffer incoming and outgoing data
+  Stations (latches) in order to buffer incoming and outgoing data.
+  This is to make best use of (limited) inter-Function-Unit bus bandwidth.
  * Crossbar Routing from the Register File will be on the **source**
    registers **only**: Function Units will route **directly** to
    and be hard-wired associated with one of four register banks.
@@ -30,7 +71,10 @@ An overview of the design is as follows:
    latches associated with the Function Unit, and will put the
    result **back** into the destination latch associated with that
    **same** Function Unit.
-* **Pairs** of 32-bit Function Units will handle 64-bit operations.
+* **Pairs** of 32-bit Function Units will handle 64-bit operations,
+  with the 32-bit src Reservation Stations (latches) "teaming up"
+  to store 64-bit src register values, and likewise the 32-bit
+  destination latches for the same (paired) Function Units.
  * 32-bit Function Units will handle 8 and 16 bit operations in
    cases where batches of operations may be (easily, conveniently)
    allocated to a 32-bit-wide SIMD-style (predicated) ALU.
@@ -44,6 +88,15 @@ An overview of the design is as follows:
    corresponding 8/16-bit Function Unit(s) for that register, and vice-versa.
    8/16-bit operations will however **not** block the remaining
    (unallocated) bytes of the same register from being utilised.
+* Spectre timing attacks will be dealt with by ensuring that there
+  are no side-channels between cores in the usual ways (no shared
+  DIV unit, correct use of L1 cache).  In addition, a "Speculation
+  Fence" instruction (or hint) will be added that resets the internal
+  state to a known quiescent state.  This involves
+  cancellation of all speculation, cancellation of "nameless" registers,
+  committing outstanding register writes to the register file, and
+  cancelling all Function Units waiting for read hazards.  This is to
+  be done automatically on any exception or interrupt.
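The Q-Table "history" item in the overview above can be pictured as a small per-register queue of pending writers. The following is a behavioural sketch only, not the hardware design: the class name, method names, and the Function Unit labels are all hypothetical, and the rollback policy shown (cancel one writer and everything issued after it) is one possible reading of the text.

```python
from collections import defaultdict


class QTableWithHistory:
    """Per-register list of Function Units due to write it, oldest first."""

    def __init__(self):
        self._history = defaultdict(list)

    def issue(self, reg, fu):
        """Record that `fu`, issued now, will write its result to `reg`."""
        self._history[reg].append(fu)

    def last_writer(self, reg):
        """Classic 6600 Q-Table lookup: most recently issued writer, or None."""
        pending = self._history[reg]
        return pending[-1] if pending else None

    def commit(self, reg, fu):
        """The oldest pending writer has written the register file."""
        pending = self._history[reg]
        if not pending or pending[0] != fu:
            raise ValueError("commit out of instruction-issue order")
        pending.pop(0)

    def rollback(self, reg, fu):
        """Cancel `fu` and every later writer of `reg`.

        Returns the writer that becomes current again (None if there is none).
        """
        pending = self._history[reg]
        del pending[pending.index(fu):]
        return self.last_writer(reg)


# Three Function Units target register 5 in issue order.
q = QTableWithHistory()
for unit in ("FU_A", "FU_B", "FU_C"):
    q.issue(5, unit)
restored = q.rollback(5, "FU_B")  # exception at FU_B: FU_A is current again
```

Without the history, a plain Q-Table would only remember `FU_C` and could not tell which earlier Function Unit to fall back to.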
  
  # Register File
  
@@ -65,6 +118,7 @@ cycle, such that the register file may effectively be used as an
  
  # Function Units
  
+## Commit Phase (instruction order preservation)
  
  # 6600 Scoreboards
  
@@ -117,3 +171,85 @@ and may only occur if the Function Unit is entirely free of write hazards.
  
  The Function-Unit to Function-Unit Dependency Matrix expresses the
  read and write hazards - dependencies - between Function Units.
+
+## Branch Speculation
+
+Branch speculation is done by preventing instructions from becoming
+"writeable" until the Branch Unit has resolved the branch.
+This is done with the addition of "Shadow" lines, as shown below:
+
+This image reproduced with kind permission, Copyright (C) Mitch Alsup
+[[!img shadow_issue_flipflops.png]]
+
+Note that there are multiple "Shadow" signals, coming not just from Branch
+Speculation but also from predication and exception shadows.
+
+On a "Failed" signal, the instruction is told to "Go Die".  This is
+passed to the Computation Unit as well.  When all "Success" signals
+are raised the instruction is permitted to enter "Writeable".
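The Success / Failed behaviour described above can be modelled as a tiny state machine. This is a software sketch under stated assumptions, not the flip-flop circuit in the image: the class name, state strings, and shadow names are hypothetical.

```python
class ShadowedInstruction:
    """An issued instruction held back by zero or more "Shadow" lines."""

    def __init__(self, shadows):
        # Shadows may come from branches, predication or possible exceptions.
        self.pending = set(shadows)
        self.state = "shadowed" if self.pending else "writeable"

    def success(self, shadow):
        """The owner of `shadow` signals Success and releases its hold."""
        if self.state == "shadowed":
            self.pending.discard(shadow)
            if not self.pending:
                # All Success signals raised: may enter "Writeable".
                self.state = "writeable"
        return self.state

    def failed(self, shadow):
        """The owner of `shadow` signals Failed: the instruction must Go Die."""
        if self.state == "shadowed" and shadow in self.pending:
            self.state = "go_die"
        return self.state


# One instruction under both a branch shadow and a predication shadow.
insn = ShadowedInstruction({"branch0", "predicate1"})
after_first = insn.success("branch0")      # still held by predicate1
after_second = insn.success("predicate1")  # now free to write
```

A single Failed on any shadow is enough to cancel the instruction, whereas every shadow must report Success before it may write.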
+
+## Exceptions
+
+Exceptions shall be handled by each instruction that *may* throw an
+exception having and holding a "Shadow" wire over all dependent
+Function Units, in exactly the same way as Branch Speculation.
+Likewise, dependent instructions are prevented and prohibited from
+entering the "Writeable" state.
+
+Dependent downstream instructions, if the exception is thrown,
+shall have the "Failed" bit ASSERTED (by the Function Unit throwing
+the exception) such that the down-stream dependent instruction is told
+to "Go Die".
+
+If the point is reached at which the instruction knows that the
+Exception cannot possibly occur, the "Success" signal is raised
+instead, thus cancelling the "hold" over dependent downstream
+instructions - again in exactly the same way as Branch Speculation
+"Success".
+
+Exceptions may **only** be actually raised if they are at the front of
+the instruction queue, i.e. if they are free of write hazards.
+See the "Commit Phase" section under "Function Units", as the Function Units
+have a "link bit" that preserves the instruction issue order, which
+must also be respected.
+
+# Spectre-style timing mitigation
+
+Spectre-style timing attacks are defined by one instruction issue
+affecting the completion time of past **and future** instructions.
+The key insight to mitigation against such attacks is to note that
+arbitrary untrusted instructions must not be permitted to affect
+trusted instructions.  Consequently as long as there is a firebreak
+(a "Fence") between trusted and untrusted, timing attacks can be
+held off.
+
+Two instructions ("hints") shall therefore be added:
+
+* One that stops speculation, multi-issue and any out-of-order
+  resource allocation for a minimum of 16 instructions.
+* Another that **cancels** all speculation and reservations,
+  cancels "nameless" registers, waits for and ensures that all
+  outstanding instructions have completed and committed, before
+  permitting the processor to continue further.
+
+The latter shall occur unconditionally, without requiring a special
+instruction to be called, on ECALL as well as all exceptions and
+interrupts.
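The first of the two hints above amounts to a countdown during which the issue logic behaves as a simple in-order, single-issue machine. A minimal behavioural sketch follows; the class, its method names, and the default issue width of 4 are assumptions made for illustration and do not come from the specification.

```python
FENCE_MIN_INSTRUCTIONS = 16  # "a minimum of 16 instructions"


class IssuePolicy:
    """Tracks whether speculation and multi-issue are currently permitted."""

    def __init__(self, max_issue_width=4):  # width of 4 is hypothetical
        self.max_issue_width = max_issue_width
        self.fence_remaining = 0

    def fence_hint(self):
        """The first hint: restrict issue for at least the next 16 instructions."""
        self.fence_remaining = max(self.fence_remaining, FENCE_MIN_INSTRUCTIONS)

    def speculation_allowed(self):
        return self.fence_remaining == 0

    def issue_width(self):
        """Single-issue while the fence is active, full width otherwise."""
        return 1 if self.fence_remaining else self.max_issue_width

    def instruction_issued(self):
        """Call once per issued instruction to count the fence down."""
        if self.fence_remaining:
            self.fence_remaining -= 1


policy = IssuePolicy()
policy.fence_hint()
width_during_fence = policy.issue_width()
for _ in range(FENCE_MIN_INSTRUCTIONS):
    policy.instruction_issued()
width_after_fence = policy.issue_width()
```

The second hint (full cancellation and drain) is not modelled here, since it depends on the scoreboard and commit logic described elsewhere in this document.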
+
+# ALU design
+
+There is a separate pipelined ALU for fdiv/fsqrt/frsqrt/idiv/irem
+that is possibly shared between 2 or 4 cores.
+
+The main ALUs are each a unified ALU for i8-i64/f16-f64 where the
+ALU is split into lanes with separate instructions for each 32-bit half.
+So, the multiplier should be capable of 64-bit fmadd, 2x32-bit fmadd,
+4x16-bit fmadd, 1x32-bit fmadd + 2x16-bit fmadd (in either order), and all
+(8/16/32/64) sizes of integer mul/mulhsu/mulh/mulhu in 2 groups of 32-bits.
+We can implement fmul using fmadd with 0 (make sure that we get the right
+sign bit for 0 for all rounding modes).
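The sign-of-zero caveat for implementing fmul as fmadd can be demonstrated in software. The sketch below uses ordinary Python floats, which are IEEE 754 doubles in round-to-nearest mode, and models fmadd as an unfused multiply followed by an add; the function names are made up for illustration. It shows that adding -0.0 preserves the sign of a zero product, whereas adding +0.0 loses a negative zero.

```python
import math


def fmadd_model(x, y, addend):
    """Unfused stand-in for fmadd: x * y + addend (round-to-nearest only)."""
    return x * y + addend


def is_negative_zero(value):
    """True only for IEEE 754 negative zero."""
    return value == 0.0 and math.copysign(1.0, value) < 0


# In round-to-nearest, -0.0 is the addend that leaves every product unchanged.
neg_zero_kept = is_negative_zero(fmadd_model(-0.0, 1.0, -0.0))
pos_zero_kept = not is_negative_zero(fmadd_model(0.0, 1.0, -0.0))

# With +0.0 as the addend, a negative-zero product becomes positive zero.
neg_zero_lost = not is_negative_zero(fmadd_model(-0.0, 1.0, 0.0))

# Non-zero products are unaffected by either zero addend.
ordinary_product = fmadd_model(3.0, 2.0, -0.0)
```

Python cannot switch rounding modes for floats, so the directed modes are not exercised here. Under IEEE 754, an exact zero sum of operands with opposite signs is -0 when rounding toward negative infinity and +0 in the other modes, so the choice of zero addend has to take the rounding mode into account.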
+
+# Rowhammer Mitigation
+
+* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-March/000699.html>
+* <https://arxiv.org/pdf/1903.00446.pdf>