# Requirements Specification
+This document contains the Requirements Specification for the Libre RISC-V
+micro-architectural design. It shall meet the target of 5-6 32-bit GFLOPs,
+150 M-Pixels/sec, 30 Million Triangles/sec, and minimum video decode
+capability of 720p @ 30fps to a 1920x1080 framebuffer, in under 2.5 watts
+at an 800 MHz clock rate. Exceeding this target is acceptable if the
+power budget is not exceeded. Exceeding this target "just because we can"
+is also acceptable, as long as it does not disrupt meeting the minimum
+performance and power requirements.
+
# General Architectural Design Principle
The general design base is to utilise an augmented and enhanced variant
An overview of the design is as follows:
+* 3D and Video primitives (operations) will only be added as strictly
+ necessary to achieve the minimum power and performance target.
+* Identified so far is a 4xFP32 ARGB Quad to 1xINT32 ARGB pixel
+ conversion opcode (part of the Vulkan API). It will write directly
+ to a separate "tile buffer" (SRAM), not to the integer register
+ file. The instruction will be scalar and will be inherently and
+ automatically parallelised by SV, just like all other scalar opcodes.
+* xBitManip opcodes will be required to deal with VPU workloads.
* The register files will be stratified into 4-way 2R1W banks,
- with byte-level write-enable on all banks.
+ with *separate* and distinct byte-level write-enable lines on all four
+ bytes of all four banks.
* 6600-style scoreboards will be augmented with "shadow" wires
and write hazard capability on exceptions, branch speculation,
LD/ST and predication.
+* Each "shadow" capability of each type will be provided by a separate
+ Function Unit. For example if there is to exist the possibility of rolling
+ ahead through two speculative branches, then two **separate**
+ Branch-speculative Function Units will be required: each will
+ hold its own separate and distinct "shadow" (Go-Die wire) and
+ write-hazard over instructions on which the branch depends.
+* Likewise for predication: it shall place a "hold" on
+ the Function Units that depend on it until the register used
+ as a predicate mask has been read and decoded, so there will be
+ separate Function Units waiting for each predication mask register.
+ Bits in the mask that are "zero" will result in "Go-Die" signals being
+ sent to the Function Units previously (speculatively) allocated for that
+ (now cancelled) element operation. Bits that are "1" will cancel
+ their Write-Hazard and allow the Function Unit to proceed with that
+ element's operation.
+* The 6600 "Q-Table" that records, for each register, the last Function
+ Unit (in instruction issue order) that is to write its result to that
+ register, shall be augmented with "history" capability that aids and
+ assists in "rollback" of "nameless" registers, should an exception
+ or interrupt occur. "History" is simply a (short) queue (stack)
+ that preserves, in instruction-issue order, a record of the previous
+ Function Unit(s) that targeted each register as a destination.
* Function Units will have both src and destination Reservation
- Stations (latches) in order to buffer incoming and outgoing data
+ Stations (latches) in order to buffer incoming and outgoing data.
+ This is to make best use of (limited) inter-Function-Unit bus bandwidth.
* Crossbar Routing from the Register File will be on the **source**
registers **only**: Function Units will route **directly** to
and be hard-wired associated with one of four register banks.
latches associated with the Function Unit, and will put the
result **back** into the destination latch associated with that
**same** Function Unit.
-* **Pairs** of 32-bit Function Units will handle 64-bit operations.
+* **Pairs** of 32-bit Function Units will handle 64-bit operations,
+ with the 32-bit src Reservation Stations (latches) "teaming up"
+ to store 64-bit src register values, and likewise the 32-bit
+ destination latches for the same (paired) Function Units.
* 32-bit Function Units will handle 8 and 16 bit operations in
cases where batches of operations may be (easily, conveniently)
allocated to a 32-bit-wide SIMD-style (predicated) ALU.
corresponding 8/16-bit Function Unit(s) for that register, and vice-versa.
8/16-bit operations will however **not** block the remaining
(unallocated) bytes of the same register from being utilised.
+* Spectre timing attacks will be dealt with by ensuring that there
+ are no side-channels between cores in the usual ways (no shared
+ DIV unit, correct use of L1 cache). In addition, there will be a
+ "Speculation Fence" instruction (or hint) that will
+ reset the internal state to a known quiescent state. This involves
+ cancellation of all speculation, cancellation of "nameless" registers,
+ committing outstanding register writes to the register file, and
+ cancelling all Function Units waiting for read hazards. This is to
+ be done automatically on any exceptions or interrupts.
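The Q-Table "history" augmentation described in the list above can be pictured with a small software model. This is only an illustrative sketch: the class name `QTable`, its methods (`issue`, `last_writer`, `rollback`, `retire`) and the `depth` limit are assumptions made for the example, not part of this specification. Real hardware would hold a short per-register queue of Function Unit numbers rather than Python lists.

```python
class QTable:
    """Toy model: per-register history of Function Units that target it.

    All names here (QTable, issue, rollback, retire, depth) are hypothetical.
    """

    def __init__(self, num_regs, depth=4):
        self.depth = depth  # assumed maximum history length per register
        self.history = [[] for _ in range(num_regs)]

    def issue(self, reg, fu):
        """Record, in issue order, that Function Unit `fu` will write `reg`."""
        if len(self.history[reg]) >= self.depth:
            return False  # history full: in this model, issue must stall
        self.history[reg].append(fu)
        return True

    def last_writer(self, reg):
        """The classic 6600 Q-Table entry: most recent FU targeting `reg`."""
        return self.history[reg][-1] if self.history[reg] else None

    def rollback(self, reg):
        """Cancel the most recent writer and expose the previous one again."""
        if self.history[reg]:
            self.history[reg].pop()
        return self.last_writer(reg)

    def retire(self, reg, fu):
        """Drop the oldest entry once that FU has committed its result."""
        if self.history[reg] and self.history[reg][0] == fu:
            self.history[reg].pop(0)


qt = QTable(num_regs=32)
qt.issue(5, fu=1)          # FU 1 will write r5
qt.issue(5, fu=3)          # FU 3 (issued later) will also write r5
restored = qt.rollback(5)  # cancel FU 3: FU 1 is the pending writer again
```

The point of keeping the older entries is that cancelling the newest writer does not lose track of which earlier Function Unit still owes a result to the same register.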
# Register File
# Function Units
+## Commit Phase (instruction order preservation)
# 6600 Scoreboards
The Function-Unit to Function-Unit Dependency Matrix expresses the
read and write hazards - dependencies - between Function Units.
+
+## Branch Speculation
+
+Branch speculation is done by preventing instructions from becoming
+"writeable" until the Branch Unit has resolved the branch outcome.
+This is done with the addition of "Shadow" lines, as shown below:
+
+This image reproduced with kind permission, Copyright (C) Mitch Alsup
+[[!img shadow_issue_flipflops.png]]
+
+Note that there are multiple "Shadow" signals, coming not just from Branch
+Speculation but also from predication and exception shadows.
+
+On a "Failed" signal, the instruction is told to "Go Die". This is
+passed to the Computation Unit as well. When all "Success" signals
+are raised the instruction is permitted to enter "Writeable".
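
The "Success" / "Failed" behaviour described above can be summarised as a small state machine. The following is a minimal behavioural sketch, not a description of the actual gate-level logic: the names `State`, `ShadowedInstruction` and the shadow labels such as `"branch0"` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class State(Enum):
    SHADOWED = "shadowed"    # still held by at least one shadow
    WRITEABLE = "writeable"  # every shadow has signalled "Success"
    DEAD = "dead"            # some shadow signalled "Failed" ("Go Die")


class ShadowedInstruction:
    def __init__(self, shadows):
        # Labels of the shadow sources currently holding this instruction
        # (branch speculation, predication, exceptions, ...).
        self.pending = set(shadows)
        self.state = State.SHADOWED if self.pending else State.WRITEABLE

    def success(self, shadow):
        """A shadow source releases its hold on this instruction."""
        if self.state is State.SHADOWED:
            self.pending.discard(shadow)
            if not self.pending:
                self.state = State.WRITEABLE

    def failed(self, shadow):
        """A shadow source that still holds this instruction cancels it."""
        if self.state is State.SHADOWED and shadow in self.pending:
            self.state = State.DEAD


insn = ShadowedInstruction({"branch0", "predicate"})
insn.success("branch0")
state_after_one = insn.state  # still SHADOWED: the predicate is unresolved
insn.success("predicate")     # now WRITEABLE

other = ShadowedInstruction({"branch0", "exception"})
other.failed("branch0")       # mispredicted branch: "Go Die"
```

In this model a single "Failed" is enough to kill the instruction, whereas "Writeable" requires every outstanding shadow to report "Success".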
+
+## Exceptions
+
+Exceptions shall be handled by each instruction that *may* throw an
+exception having and holding a "Shadow" wire over all dependent
+Function Units, in exactly the same way as Branch Speculation.
+Likewise, dependent instructions are prevented and prohibited from
+entering the "Writeable" state.
+
+Dependent downstream instructions, if the exception is thrown,
+shall have the "Failed" bit ASSERTED (by the Function Unit throwing
+the exception) such that the down-stream dependent instruction is told
+to "Go Die".
+
+If the point is reached at which the instruction knows that the
+Exception cannot possibly occur, the "Success" signal is raised
+instead, thus cancelling the "hold" over dependent downstream
+instructions - again in exactly the same way as Branch Speculation
+"Success".
+
+Exceptions may **only** be actually raised if they are at the front of
+the instruction queue, i.e. if they are free of write hazards.
+See the "Commit Phase" section under Function Units, as the Function Units
+have a "link bit" that preserves the instruction issue order, which
+must also be respected.
+
+# Spectre-style timing mitigation
+
+Spectre-style timing attacks are defined by one instruction issue
+affecting the completion time of past **and future** instructions.
+The key insight to mitigation against such attacks is to note that
+arbitrary untrusted instructions must not be permitted to affect
+trusted instructions. Consequently as long as there is a firebreak
+(a "Fence") between trusted and untrusted, timing attacks can be
+held off.
+
+Two instructions ("hints") shall therefore be added:
+
+* One that stops speculation, multi-issue and any out-of-order
+ resource allocation for a minimum of 16 instructions.
+* Another that **cancels** all speculation and reservations,
+ cancels "nameless" registers, waits for and ensures that all
+ outstanding instructions have completed and committed, before
+ permitting the processor to continue further.
+
+The latter shall occur unconditionally, without requiring a special
+instruction to be called, on ECALL as well as all exceptions and
+interrupts.
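
As a rough illustration of the first hint only, the 16-instruction window can be modelled as a countdown that forces single, in-order, non-speculative issue until it expires. Everything in this sketch other than the figure of 16 is an assumption: the class `IssueControl`, its method names and the default issue width of 4 are hypothetical.

```python
class IssueControl:
    FENCE_WINDOW = 16  # minimum instruction count stated in the text

    def __init__(self, max_issue_width=4):  # the width is an assumption
        self.max_issue_width = max_issue_width
        self.fence_remaining = 0

    def fence_hint(self):
        """The first hint: open a window of restricted issue."""
        self.fence_remaining = self.FENCE_WINDOW

    def issue_width(self):
        """How many instructions may issue this cycle."""
        return 1 if self.fence_remaining > 0 else self.max_issue_width

    def speculation_allowed(self):
        return self.fence_remaining == 0

    def instruction_issued(self, count=1):
        self.fence_remaining = max(0, self.fence_remaining - count)


ctl = IssueControl()
ctl.fence_hint()
widths = []
for _ in range(17):
    widths.append(ctl.issue_width())
    ctl.instruction_issued()
# The first 16 entries of widths are 1; the 17th returns to the full width.
```

The second hint is not modelled here, since it is a full drain-and-cancel operation rather than a counted window.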
+
+# ALU design
+
+There is a separate pipelined ALU for fdiv/fsqrt/frsqrt/idiv/irem
+that is possibly shared between 2 or 4 cores.
+
+The main ALUs are each a unified ALU for i8-i64/f16-f64 where the
+ALU is split into lanes with separate instructions for each 32-bit half.
+So, the multiplier should be capable of 64-bit fmadd, 2x32-bit fmadd,
+4x16-bit fmadd, 1x32-bit fmadd + 2x16-bit fmadd (in either order), and all
+(8/16/32/64) sizes of integer mul/mulhsu/mulh/mulhu in 2 groups of 32-bits.
+We can implement fmul using fmadd with 0 (make sure that we get the right
+sign bit for 0 for all rounding modes).
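
The sign-of-zero caveat for fmul-via-fmadd follows from the IEEE 754 rule for adding zeros: like-signed zeros keep their sign, while zeros of opposite sign sum to +0 in every rounding mode except round-towards-negative, where the sum is -0. A non-zero product is unaffected by a zero addend, so only the case of a zero product matters. The sketch below models just that rule and checks which addend preserves the product's sign. The function names are hypothetical; the mode names are the RISC-V ones.

```python
import math

ROUNDING_MODES = ("RNE", "RTZ", "RDN", "RUP", "RMM")  # RISC-V mode names


def zero_sum_sign(a_neg, b_neg, mode):
    """Sign (True = negative) of the sum of two zeros, per IEEE 754."""
    if a_neg == b_neg:
        return a_neg      # like-signed zeros keep their sign
    return mode == "RDN"  # opposite signs: -0 only when rounding down


def fmul_addend_is_negative(mode):
    """Sign of the zero addend c so that fmadd(a, b, c) acts as a * b."""
    return mode != "RDN"


# Exhaustive check of the model: a zero product must keep its sign.
model_ok = all(
    zero_sum_sign(product_neg, fmul_addend_is_negative(mode), mode)
    == product_neg
    for mode in ROUNDING_MODES
    for product_neg in (False, True)
)

# Python floats round to nearest-even, so the RNE case is directly visible:
wrong = math.copysign(1.0, (-0.0) + 0.0)     # +0 addend loses the sign
right = math.copysign(1.0, (-0.0) + (-0.0))  # -0 addend keeps the sign
```

Under this model the addend should be -0 in all modes except RDN, where it should be +0.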
+
+# Rowhammer Mitigation
+
+* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-March/000699.html>
+* <https://arxiv.org/pdf/1903.00446.pdf>