From: Luke Kenneth Casson Leighton
Date: Wed, 23 May 2018 11:15:57 +0000 (+0100)
Subject: add discussion todo items
X-Git-Tag: convert-csv-opcode-to-binary~5337
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=fe12552bc84357e40ea53419e68428773e0b1e5b;p=libreriscv.git

add discussion todo items
---

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 51fe43d14..4af851578 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,5 +1,9 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
+* TODO 23may2018: CSR-CAM-ify regfile tables
+* TODO 23may2018: zero-mark predication CSR
+* TODO 23may2018: impl. detail on scalar-only ops (see appendix)
+
 Key insight: Simple-V is intended as an abstraction layer to provide
 a consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -1804,6 +1808,34 @@ discussion then led to the question of OoO architectures
 
 > there may be a way to implement DTM as well.
 
+------
+
+* implementation detail for scalar-only op detection
+
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
+>> AND gate: are src & dest scalar? Yep, ok send straight to ALU  (or instr
+>> FIFO).
+
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed.
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no". 3R because you need
+> to check src1, src2 and dest simultaneously. the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+> is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
+
 
 ## Implementation Paradigms
 
@@ -1893,3 +1925,4 @@ TBD: floating-point compare and other exception handling
 * Wavefront skipping using BRAMS
 * Streaming Pipelines
 * Barcelona SIMD Presentation
+*
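
Note on the scalar-only detection hunk: the reply quoted there describes a table holding one bit per register ("is register vectorised yes no"), read three times at once for src1, src2 and dest, with the results ANDed to decide whether an op is fully scalar. The following is a minimal behavioural sketch of that idea in Python, not the proposal's actual implementation. The names `build_vectorised_bitmap` and `is_fully_scalar` are hypothetical, and since the CSR layout is not given in this commit, the "generated from the CSRs" step is modelled simply as a set of register numbers assumed to be marked as vectorised.

```python
# Behavioural model only: all names and the CSR representation are assumptions
# made for illustration, not taken from the Simple-V proposal.

NUM_INT_REGS = 32  # one bit per integer register, i.e. the 32-bit table


def build_vectorised_bitmap(vectorised_regs, num_regs=NUM_INT_REGS):
    """Generate the 1-bit-per-register table.

    `vectorised_regs` stands in for whatever the CSRs encode: here it is
    just an iterable of register numbers currently marked as vectorised.
    """
    bitmap = 0
    for reg in vectorised_regs:
        if not 0 <= reg < num_regs:
            raise ValueError(f"register {reg} out of range")
        bitmap |= 1 << reg
    return bitmap


def is_fully_scalar(bitmap, rs1, rs2, rd):
    """Model the three simultaneous reads (3R) and the AND gate.

    Each read yields an 'is scalar' bit (the inverse of 'is vectorised');
    the op is fully scalar only if all three bits are set.
    """
    scalar_bits = [((bitmap >> reg) & 1) ^ 1 for reg in (rs1, rs2, rd)]
    return bool(scalar_bits[0] & scalar_bits[1] & scalar_bits[2])


if __name__ == "__main__":
    table = build_vectorised_bitmap({5, 12})
    print(is_fully_scalar(table, 1, 2, 3))   # no operand is vectorised
    print(is_fully_scalar(table, 1, 12, 3))  # src2 (x12) is vectorised
```

With registers 5 and 12 marked as vectorised, the first call reports a fully scalar op and the second does not. The sketch says nothing about timing, so it does not address the critical-path objection raised in the quoted discussion; it only illustrates the lookup and the AND.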