From 217c0699887ecbd40fc3fe22c4365a0c0be3e6ec Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Mon, 1 Aug 2022 20:05:03 +0100
Subject: [PATCH] cleanup, remove old pages

---
 bitmap_parallelism_extension.mdwn | 120 -----
 harmonised_rvv_rvp.mdwn | 92 ----
 harmonised_rvv_rvp/comparative_analysis.mdwn | 228 --------
 harmonised_rvv_rvp/discussion.mdwn | 44 --
 instruction_virtual_addressing.mdwn | 230 ---------
 interrupts.mdwn | 19 -
 interrupts/interrupt_handling.mdwn | 38 --
 overloadable_opcodes.mdwn | 486 ------------------
 pluggable_extensions.mdwn | 319 ------------
 rv_major_opcode_1010011.mdwn | 476 -----------------
 systemes_libre.mdwn | 3 -
 ...es_Amazon_Alexa_IOT_Pitch_10-JUN-2020.mdwn | 84 ---
 systemes_libre/index.mdwn | 2 -
 13 files changed, 2141 deletions(-)
 delete mode 100644 bitmap_parallelism_extension.mdwn
 delete mode 100644 harmonised_rvv_rvp.mdwn
 delete mode 100644 harmonised_rvv_rvp/comparative_analysis.mdwn
 delete mode 100644 harmonised_rvv_rvp/discussion.mdwn
 delete mode 100644 instruction_virtual_addressing.mdwn
 delete mode 100644 interrupts.mdwn
 delete mode 100644 interrupts/interrupt_handling.mdwn
 delete mode 100644 overloadable_opcodes.mdwn
 delete mode 100644 pluggable_extensions.mdwn
 delete mode 100644 rv_major_opcode_1010011.mdwn
 delete mode 100644 systemes_libre.mdwn
 delete mode 100644 systemes_libre/Systemes_Libres_Amazon_Alexa_IOT_Pitch_10-JUN-2020.mdwn
 delete mode 100644 systemes_libre/index.mdwn

diff --git a/bitmap_parallelism_extension.mdwn b/bitmap_parallelism_extension.mdwn
deleted file mode 100644
index 8593a82b0..000000000
--- a/bitmap_parallelism_extension.mdwn
+++ /dev/null
@@ -1,120 +0,0 @@
-# Parallelism using Bitmaps
-
-If you think about it this way you can combine setvl, and predication,
-and indeed vector length, by always working with bitmaps.
-
-So: you have 32 WARL CSRs , called X0, ... X31 (or perhaps 2 banks of
-32 CSR's and have a set of additional CSR's FX0,...
FX31)
-
-Each contains a bitmap of length 32 (assuming we only have the standard
-registers)
-
-By default, X0 contains 1<<0, X1 contains 1<<1, X2 contains 1 << 2, ...
-
-now an instruction like
-
-    add x1 x2 x3
-
-is reinterpreted as referring to the CSR's rather than individual
-registers. i.e. under simple V it means
-
-    add X1, X2, X3
-
-and it has the following semantics:
-
-    let rds = registers in bitmap X1
-    let rs1s = registers in bitmap X2 repeated periodically in order of register number to the length of X1
-    let rs2s = registers in bitmap X3 repeated periodically in order of register number to the length of X1
-
-
-    parallelfor (rd, rs1, rs2) in (rds[i],rs1s[i], rs2s[i]) where i = 0 to length(rds) - 1
-        add rd rs1 rs2
-
-
-example:
-
-    X1 <- 0b011111
-    X2 <- 0b1011
-    X3 <- 0b00010
-
-    > Anyways my point was, for me it would have been more intuitive
-    > and easier to grasp if it showed:
-    > X1 -> b011111 (meaning x4,x3,x2,x1,x0)
-    > X2 -> b001011 (meaning x3,x1,x0)
-    > X3 -> b000010 (meaning x1)
-
-then
-
-    rd1s = [x1, x2, x3, x4, x5]
-    rs1s = [x0, x2, x3, x0, x2]
-    rs2s = [x3, x3, x3, x3, x3]
-
-and
-
-    add X1, X2, X3
-
-is interpreted as
-
-    parallel{
-        add x1, x0, x3
-        add x2, x2, x3
-        add x3, x3, x3
-        add x4, x0, x3 # x2 and x3 have their original values!
-        add x5, x2, x3 # x2 and x3 have their original values!
-    }
-
-This means that the analogue of setvl is simply the "write any" of
-setting the bitmap, and the analogue of the return value of setvl,
-is the "read legal" of the CSR. Moreover popc would tell you how many
-operations are scheduled in parallel so you know how often you have to
-repeat a sequential loop.
-
-Notes:
-
-> > Thinking about it more, a bitset for X0 seems a bad idea, or equivalently X0
-> > should be
-> > the immutable  bitset {x0}. That suggests FX0, ... FX31 _is_ a good idea.
-
->  what would it mean, to do ops with x0?  it would mean "always add 0"
-> and so on.  it sounds kinda useful.  like MV being add r1, r2, x0.
-> it would completely pointless to *have* anything other than "all 1s"
-> in it though i think :)
-
-# pseudocode for decoding ops
-
-    uint32 XB[32]; // global, assume RV32 for now: CSRs for bitmapping
-    uint32 regs[32]; // global, actual (integer) register file
-
-    // gets current ACTUAL register to be used
-    // XB had better not be empty...
-    int regdecode(int rn, int *offs)
-    {
-        int bmap = XB[rn];
-        int _offs = *offs;
-        while (1)
-        {
-            int _newoffs = (_offs + 1) & 0x1f; // 32 regs, modulo
-            if (bmap & (1<<_offs))
-            {
-                *offs = _newoffs;
-                return _offs;
-            }
-            _offs = _newoffs;
-        }
-    }
-
-example usage (pseudo-implementation of add):
-
-    op_add(int rd, int rs1, int rs2)
-    {
-        int id=0, irs1=0, irs2=0;
-        int VL = pcnt(XB[rd];
-        for (int i = 0; i < VL; i++)
-        {
-            int actualrd = regdecode(rd , &id);
-            int actualrs1 = regdecode(rs1, &irs1);
-            int actualrs2 = regdecode(rs2, &irs2);
-            regs[actualrd] = regs[actualrs1] + regs[actualrs2];
-        }
-    }
-
diff --git a/harmonised_rvv_rvp.mdwn b/harmonised_rvv_rvp.mdwn
deleted file mode 100644
index 971bf03b4..000000000
--- a/harmonised_rvv_rvp.mdwn
+++ /dev/null
@@ -1,92 +0,0 @@
-# Proposal to harmonise RV Vector spec with Andes Packed SIMD ("Harmonised" RVP)
-
-[[Comparative analysis|harmonised_rvv_rvp/comparative_analysis]] of
-Harmonised RVP vs Andes Packed SIMD ISA proposal
-
-**MVL, setvl instruction & VL CSR work as per RV Vector spec.**
-
-**VLD and VST are supported**
-
-RVP implementations may choose to load/store to/from Integer register file
-(rather than from a dedicated Vector register file).
-
-* Thus, RVP implementations have a choice of providing a dedicated
-  Vector register file, or sharing the integer register file, but not
-  both simultaneously. (Supporting both would need a CSR mode switch bit).
-* Mapping of v0-31 <-> r0-31 **is fixed** at 1:1. (An exception may be
-  made to map v1 to r5, as otherwise may clash with procedure linkage).
-* VLD and VST in this case will have similar behaviour to LW/LD and SW/SD
-  respectively, but only operate on up to VL elements (see point #4 below).
-* If integer register file is used for vector operations, any callee saved
-  registers (r2-4, 8-9, 18-27) must be saved with RVI SW or SD instructions,
-  before being used as vector registers (this register saving behaviour is
-  harmless but redundant when RVP code is run on a machine with a dedicated
-  vector reg file).
-
-**VLDX, VSTX, VLDS, VSTS are not supported in hardware**
-To keep RVP implementations simple, these instructions will trap, and
-may be implemented as software emulation
-
-**Default register "banks" and types**
-
-In the absence of an explicit VCFG setup, the vector registers (when
-shared with Integer register file) are to default into two “banks”
-as follows:
-
-* v0-v15: vectors with INT8 elements, split into signed (v0-v7)
-  & unsigned (v8-v15)
-* v16-v29: vectors with INT16 elements, split into signed (v16-v23)
-  & unsigned (v24-v29)
-
-Having the above default vector type configuration harmonises most of
-the Andes SIMD instruction set (which explicitly encodes INT8 vs INT16
-vector types as separate instructions). The main change from the Andes
-SIMD proposal is that instructions are restricted to 14 registers of
-each vector element type (with element size explicitly encoded in the
-most significant bit of the 5 bit register specifier fields).
-
-Notes:
-
-* To preserve forward RVV compatibility, programmers should still
-  explicitly setup VDCFG to the above default vector types
-* Essentially the same register allocation algorithm used for RVV can be
-  used for RVP, except the algorithm should preferentially use temporary
-  registers first, before using saved registers
-* v30-v31 are reserved for 32 bit operations (see Section 2.3 of this
-  document), and hence not part of the register bank of INT16 vectors.
-* v0 is mapped to r1 (hardwired to zero), and v1 is used for predicate
-  masks. However, both can be considered INT8 vectors.
-
-**Default MVL**
-
-The default RVV MVL value (in absence of explicit VCFG setup) is to
-be MVL = 2 on RV32I machines and MVL = 4 on RV64I machines. However,
-note RV32I registers can fit 4x INT8 elements. To preserve Andes SIMD
-behaviour, all VOP instructions should still operate on all “unused”
-elements in the register, regardless of MVL. (This is still compliant
-with the RVV spec, provided elements from VL..MVL-1 are set to zero).
-VMEM instructions however will only operate on VL elements, and so
-where full Andes SIMD compliance is required (without RVV forward
-compatibility), LW/LD and SW/SD are to be used instead of VLD and VST.
-
-**Alternative register "banks" and alternative MVL**
-
-A programmer can configure VCFG with any mix of these alternative
-configurations:
-
-* v0-v31 are all INT 16, and MVL is same as for Default MVL above
-* v0-v31 are all INT 8 and MVL is 4 on RV32I and 8 on RV64I
-* A lesser number of registers (less than v31) could be supported,
-  eg. default is only v0-v29 defined. (Accessing registers beyond
-  maximum defined by VDCFG is to be legal, with a type of INT32 assumed.
-  However, this is not to affect the MVL, which is to be calculated based
-  on INT8/INT16 vectors only)
-* With the above alternative configs, there can be any split between
-signed & unsigned.
-
-The above are pure subsets of valid RVV VCFG configurations (and hence
-forward compatible between RVP and RVV, whilst also keeping RVP simple).
-Other useful element types are fixed point fraction types and small
-integer(4 bit to 7 bit) elements. However these are omitted for now
-as they aren’t currently part of RVV spec, and the intention of this
-proposal is to harmonise the Andes SIMD instructions into a subset of RVV.
diff --git a/harmonised_rvv_rvp/comparative_analysis.mdwn b/harmonised_rvv_rvp/comparative_analysis.mdwn
deleted file mode 100644
index 9b694a2f1..000000000
--- a/harmonised_rvv_rvp/comparative_analysis.mdwn
+++ /dev/null
@@ -1,228 +0,0 @@
-# Comparative analysis with Andes Packed ISA proposal
-
-Harmonised RVP is a proposal to provide SIMD functionality comparable to the Andes Packed SIMD ISA, but in a manner that is forwards compatible ("harmonised") with the RV Vector specification.
-
-An example use case is a string copy operation - using Harmonised RVP, code can use integer register SIMD instructions to copy a string. This code can then also execute (unchanged) on a full RV Vector processor and use the dedicated vector unit to copy the string. Harmonised RVP also upwards compatibility between RV32 and RV64 SIMD using this same approach.
-
-## Register file
-
-The Andes Packed SIMD ISA permits any GPR to be used for either INT8 or INT16 vector operations. In contrast, the default Harmonised RVP GPR register file is divided into a lower bank of Vector[INT8] and an upper banxk of Vector[INT16]. (Effectively, the vector element size is encoded by the most significant bit of the 5 bit register specifiers. However programmers can reconfigure the register file data types, if the default configuration is unsuitable.)
-
-(GPR = General Purpose Integer Register)
-
-| Register | Andes ISA | Harmonised RVP ISA |
-| ------------------ | ------------------------- | ------------------- |
-| v0 | Hardwired zero | Hardwired zero |
-| v1 | 32bit GPR or Vector[4xINT8 or 2xINT16] | Predicate mask |
-| | | |
-| v2 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
-| v3 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
-| v4 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
-| v5 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
-| v6 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
-| v7 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xSINT8] |
-| v8 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
-| v9 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
-| v10 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
-| v11 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
-| v12 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
-| v13 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
-| v14 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
-| v15 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[4xUINT8] |
-| | | |
-| v16 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
-| v17 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
-| v18 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
-| v19 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
-| v20 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
-| v21 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
-| v22 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
-| v23
| 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xSINT16] |
-| v24 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
-| v25 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
-| v26 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
-| v27 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
-| v28 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
-| v29 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[2xUINT16] |
-| | | |
-| v30 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[1xSINT32] |
-| v31 | 32bit GPR or Vector[4xINT8 or 2xINT16] | 32bit GPR or Vector[1xSINT32] |
-
-Both Andes Packed SIMD and Harmonised RVP are intended to be "low end" SIMD implementations (for processors without dedicated vector registers). Instead, the integer register file is used for SIMD operations. To maintain forwards compatibility with "high end" RV Vector implementations, programmer should use VLD and VST to load/store vectors. The implementation will then load/store a vector to/from the register file supported by the implementation.
-
-To keep implementations simple and focused on SIMD within-register only, there is a strict 1:1 mapping between vectors (v0-v31) and integer registers (r0-r31). Standard calling conventions apply and so callee saved integer registers should be saved before being used as vector registers. Strided (VLDS/VSTS) and indexed (VLDX/VSTX) load/stores are complex, and simple implementations will trap on these instructions, permitting emulation in software.
-
-## Proposed Harmonised RVP vector op instruction encoding
-
-Harmonised RVP re-uses the same RV Vector opcodes to encode RVP SIMD instructions on *integer* registers.
-This is a deliberate design, to provide a means for binary code to be forwards compatible between RVP and RV Vector.
-Such "forwards compatible" code will need to take care to respect normal calling conventions (ie: save callee saved GPR registers before loading vectors into register - this is harmless but redundant behaviour on RV Vector implementations with dedicated vector registers).
-
-Register x 2 --> register operations:
-
-| 31 30 29 28 27 26 | 25 | 24 23 22 21 20 | 19 18 17 16 15 | 14 | 13 12 | 11 10 9 8 7 | 6 5 4 3 2 1 0 |
-| ----------------- | -- | -------------- | -------------- | -- | ----- | ----------- | ------------- |
-| func6 | 0 | rs2 | rs1 | 0 | mm | rd1 | VOP opcode |
-
-Immediate + register --> register operations:
-
-| 31 30 29 28 27 26 | 25 | 24 23 22 21 20 | 19 18 17 16 15 | 14 | 13 12 | 11 10 9 8 7 | 6 5 4 3 2 1 0 |
-| ----------------- | -- | -------------- | -------------- | -- | ----- | ----------- | ------------- |
-| func3, imm[7:5] | 1 | imm[4:0] | rs1 | 0 | mm | rd1 | VOP opcode |
-
-Register x 3 --> register operations:
-
-| 31 30 29 28 27 | 26 25 | 24 23 22 21 20 | 19 18 17 16 15 | 14 | 13 12 | 11 10 9 8 7 | 6 5 4 3 2 1 0 |
-| -------------- | ----- | -------------- | -------------- | -- | ----- | ----------- | ------------- |
-| rs3 | func2 | rs2 | rs1 | 1 | mm | rd1 | VOP opcode |
-
-Values for mm field (bits 12:13 above):
-
-* mm = 00 -> no predicate mask, and use current global saturation / rounding settings
-* mm = 01 -> no predicate mask, and force saturation or rounding for this instruction only
-* mm = 10 -> use v1 as predicate mask, and use global saturation / rounding settings
-* mm = 11 -> use ~v1 as predicate mask, and use global saturation / rounding settings
-
-## 16-bit Arithmetic
-
-| Andes Mnemonic | 16-bit Instruction | Harmonised RVP Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| ADD16 rt, ra, rb | Add | VADD (v16 <= rt,ra,rb <= v29), mm=00|
-| RADD16 rt, ra, rb | Signed Halving add | RADD (v16 <= rt,ra,rb <= v23), mm=00|
-| URADD16 rt, ra, rb | Unsigned Halving add | RADD (v24 <=
rt,ra,rb <= v29), mm=00|
-| KADD16 rt, ra, rb | Signed Saturating add | VADD (v16 <= rt,ra,rb <= v23), mm=01|
-| UKADD16 rt, ra, rb | Unsigned Saturating add | VADD (v24 <= rt,ra,rb <= v29), mm=01|
-| SUB16 rt, ra, rb | Subtract | VSUB (v16 <= rt,ra,rb <= v29), mm=00|
-| RSUB16 rt, ra, rb | Signed Halving sub | RSUB (v16 <= rt,ra,rb <= v23), mm=00|
-| URSUB16 rt, ra, rb | Unsigned Halving sub | RSUB (v24 <= rt,ra,rb <= v29), mm=00|
-| KSUB16 rt, ra, rb | Signed Saturating sub | VSUB (v16 <= rt,ra,rb <= v23), mm=01|
-| UKSUB16 rt, ra, rb | Unsigned Saturating sub | VSUB (v24 <= rt,ra,rb <= v29), mm=01|
-| CRAS16 rt, ra, rb | Cross Add & Sub | |
-| RCRAS16 rt, ra, rb | Signed Halving Cross Add & Sub | |
-| URCRAS16 rt, ra, rb| Unsigned Halving Cross Add & Sub | |
-| KCRAS16 rt, ra, rb | Signed Saturating Cross Add & Sub | |
-| UKCRAS16 rt, ra, rb| Unsigned Saturating Cross Add & Sub | |
-| CRSA16 rt, ra, rb | Cross Sub & Add | |
-| RCRSA16 rt, ra, rb | Signed Halving Cross Sub & Add | |
-| URCRSA16 rt, ra, rb| Unsigned Halving Cross Sub & Add | |
-| KCRSA16 rt, ra, rb | Signed Saturating Cross Sub & Add | |
-| UKCRSA16 rt, ra, rb| Unsigned Saturating Cross Sub & Add | |
-
-## 8-bit Arithmetic
-
-| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| ADD8 rt, ra, rb | Add | VADD (v2 <= rt,ra,rb <= v15), mm=00 |
-| RADD8 rt, ra, rb | Signed Halving add | RADD (v2 <= rt,ra,rb <= v7), mm=00 |
-| URADD8 rt, ra, rb | Unsigned Halving add | RADD (v8 <= rt,ra,rb <= v15), mm=00 |
-| KADD8 rt, ra, rb | Signed Saturating add | VADD (v2 <= rt,ra,rb <= v7), mm=01 |
-| UKADD8 rt, ra, rb | Unsigned Saturating add | VADD (v8 <= rt,ra,rb <= v15), mm=01 |
-| SUB8 rt, ra, rb | Subtract | VSUB (v2 <= rt,ra,rb <= v15), mm=00 |
-| RSUB8 rt, ra, rb | Signed Halving sub | RSUB (v2 <= rt,ra,rb <= v7), mm=00 |
-| URSUB8 rt, ra, rb | Unsigned Halving sub | RSUB (v8 <= rt,ra,rb <= v15), mm=00 |
-| KSUB8 rt,
ra, rb | Signed Saturating sub | VSUB (v2 <= rt,ra,rb <= v7), mm=01 |
-| UKSUB8 rt, ra, rb | Unsigned Saturating sub | VSUB (v8 <= rt,ra,rb <= v15), mm=01 |
-
-## 16-bit Shifts
-
-SRA[I]16/SRL[I]16/SLL[I]16 to be mapped to VOP shift instructions in same manner as ADD16/SUB16
-
-The “K” (Saturation) and “u” (Rounding) variants could be encoded using VOP’s mm field (mm=01 is saturated or rounded shift, mm=00 is standard VOP shift)
-
-| Andes Mnemonic | 16-bit Instruction | Harmonised RVP Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| SRA16 rt, ra, rb | Shift right arithmetic | VSRA (v16 <= rt,ra,rb <= v29), mm=00|
-| SRAI16 rt, ra, im | Shift right arithmetic imm | VSRAI (v16 <= rt,ra <= v29), mm=00|
-| SRA16.u rt, ra, rb | Rounding Shift right arithmetic | VSRA (v16 <= rt,ra,rb <= v29), mm=01|
-| SRAI16.u rt, ra, im | Rounding Shift right arithmetic imm | VSRAI (v16 <= rt,ra <= v29), mm=01|
-| SRL16 rt, ra, rb | Shift right logical | VSRL (v16 <= rt,ra,rb <= v29), mm=00|
-| SRLI16 rt, ra, im | Shift right logical imm | VSRLI (v16 <= rt,ra <= v29), mm=00|
-| SRL16.u rt, ra, rb | Rounding Shift right logical | VSRL (v16 <= rt,ra,rb <= v29), mm=01|
-| SRLI16.u rt, ra, im | Rounding Shift right logical imm | VSLRI (v16 <= rt,ra <= v29), mm=01|
-| SLL16 rt, ra, rb | Shift left logical | VSLL (v16 <= rt,ra,rb <= v29), mm=00|
-| SLLI16 rt, ra, im | Shift left logical imm | VSLLI (v16 <= rt,ra <= v29), mm=00|
-| KSLL16 rt, ra, rb | Saturating Shift left logical | VSLL (v16 <= rt,ra,rb <= v29), mm=01|
-| KSLLI16 rt, ra, im | Saturating Shift left logical imm | VSLLI (v16 <= rt,ra <= v29), mm=01|
-| KSLRA16 rt, ra, rb | Saturating Shift left logical or Shift right arithmetic ||
-| KSLRA16.u rt, ra, rb | Saturating Shift left logical or Rounding Shift right arithmetic ||
-
-
-## 8-bit Shifts
-
-Andes SIMD Packed ISA omits 8 bit shifts, but these can be encoded in Harmonised RVP as follows:
-
-| Andes Mnemonic | 8-bit Instruction |
Harmonised RVP Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| n/a | Shift right arithmetic | VSRA (v2 <= rt,ra,rb <= v15), mm=00|
-| n/a | Shift right arithmetic imm | VSRAI (v2 <= rt,ra <= v15), mm=00|
-| n/a | Rounding Shift right arithmetic | VSRA (v2 <= rt,ra,rb <= v15), mm=01|
-| n/a | Rounding Shift right arithmetic imm | VSRAI (v2 <= rt,ra <= v15), mm=01|
-| n/a | Shift right logical | VSRL (v2 <= rt,ra,rb <= v15), mm=00|
-| n/a | Shift right logical imm | VSRLI (v2 <= rt,ra <= v15), mm=00|
-| n/a | Rounding Shift right logical | VSRL (v2 <= rt,ra,rb <= v15), mm=01|
-| n/a | Rounding Shift right logical imm | VSLRI (v2 <= rt,ra <= v15), mm=01|
-| n/a | Shift left logical | VSLL (v2 <= rt,ra,rb <= v15), mm=00|
-| n/a | Shift left logical imm | VSLLI (v2 <= rt,ra <= v15), mm=00|
-| n/a | Saturating Shift left logical | VSLL (v2 <= rt,ra,rb <= v15), mm=01|
-| n/a | Saturating Shift left logical imm | VSLLI (v2 <= rt,ra <= v15), mm=01|
-
-## 16-bit Comparison instructions
-
-| Andes Mnemonic | 16-bit Instruction | Harmonised RVP Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| CMPEQ16 rt, ra, rb | Compare equal | VSEQ (v16 <= rt,ra,rb <= v29), mm=00|
-| SCMPLT16 rt, ra, rb | Signed Compare less than | !VSGT (v16 <= rt,ra,rb <= v23), mm=00|
-| SCMPLE16 rt, ra, rb | Signed Compare less or equal | VSLE (v16 <= rt,ra,rb <= v23), mm=00|
-| UCMPLT16 rt, ra, rb | Unsigned Compare less than | !VSGT (v24 <= rt,ra,rb <= v29), mm=00|
-| UCMPLE16 rt, ra, rb | Unsigned Compare less or equal | VSLE (v24 <= rt,ra,rb <= v29), mm=00|
-
-## 8-bit Comparison instructions
-
-| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| CMPEQ8 rt, ra, rb | Compare equal | VSEQ (v2 <= rt,ra,rb <= v7), mm=00|
-| SCMPLT8 rt, ra, rb | Signed Compare less than | !VSGT (v2 <= rt,ra,rb <= v7), mm=00|
-| SCMPLE8 rt, ra, rb |
Signed Compare less or equal | VSLE (v2 <= rt,ra,rb <= v7), mm=00|
-| UCMPLT8 rt, ra, rb | Unsigned Compare less than | !VSGT (v8 <= rt,ra,rb <= v15), mm=00|
-| UCMPLE8 rt, ra, rb | Unsigned Compare less or equal | VSLE (v8 <= rt,ra,rb <= v15), mm=00|
-
-## 16-bit Miscellaneous instructions
-
-| Andes Mnemonic | 16-bit Instruction | Harmonised RVP Equivalent |
-| ------------------ | ------------------------ | ------------------- |
-| SMIN16 rt, ra, rb | Signed minimum | VMIN (v16 <= rt,ra,rb <= v23), mm=00|
-| UMIN16 rt, ra, rb | Unsigned minimum | VMIN (v24 <= rt,ra,rb <= v29), mm=00|
-| SMAX16 rt, ra, rb | Signed maximum | VMAX (v16 <= rt,ra,rb <= v23), mm=00|
-| UMAX16 rt, ra, rb | Unsigned maximum | VMAX (v24 <= rt,ra,rb <= v29), mm=00|
-| SCLIP16 rt, ra, im | Signed clip | ?VCLIP (v16 <= rt,ra,rb <= v23), mm=01|
-| UCLIP16 rt, ra, im | Unsigned clip | ?VCLIP (v24 <= rt,ra,rb <= v29), mm=01|
-| KMUL16 rt, ra, rb | Signed multiply 16x16->16 | VMUL (v16 <= rt,ra,rb <= v23), mm=01|
-| KMULX16 rt, ra, rb | Signed crossed multiply 16x16->16 | |
-| SMUL16 rt, ra, rb | Signed multiply 16x16->32 | VMUL (v30 <= rt <= v31, v16 <= ra,rb <= v23), mm=00|
-| SMULX16 rt, ra, rb | Signed crossed multiply 16x16->32 | |
-| UMUL16 rt, ra, rb | Signed multiply 16x16->32 | VMUL (v30 <= rt <= v31, v24 <= ra,rb <= r31), mm=00|
-| UMULX16 rt, ra, rb | Signed crossed multiply 16x16->32 | |
-| KABS16 rt, ra | Saturated absolute value | VSGNX (v16 <= rt <= v29, v16 <= ra,rb <= v23, mm=01) |
-
-## 8-bit Miscellaneous instructions
-
-| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| SMIN8 rt, ra, rb | Signed minimum | VMIN (v2 <= rt,ra,rb <= v7), mm=00|
-| UMIN8 rt, ra, rb | Unsigned minimum | VMIN (v8 <= rt,ra,rb <= v15), mm=00|
-| SMAX8 rt, ra, rb | Signed maximum | VMAX (v2 <= rt,ra,rb <= v7), mm=00|
-| UMAX8 rt, ra, rb | Unsigned maximum | VMAX (v8 <= rt,ra,rb <= v15), mm=00|
-| KABS8 rt, ra
| Saturated absolute value | VSGNX (v2 <= rt <= v15, v2 <= ra,rb <= v8, mm=01) |
-
-## 8-bit Unpacking instructions
-
-| Andes Mnemonic | 8-bit Instruction | Harmonised RVP Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| SUNPKD810 rt, ra | Signed unpack bytes 1 & 0 | VMV (v16<= rt <= 23, v2 <= ra <= v7), mm=00|
-| SUNPKD820 rt, ra | Signed unpack bytes 2 & 0 | |
-| SUNPKD830 rt, ra | Signed unpack bytes 3 & 0 | |
-| SUNPKD831 rt, ra | Signed unpack bytes 3 & 1 | |
-| ZUNPKD810 rt, ra | Unsigned unpack bytes 1 & 0 | VMV (v24<= rt <= 31, v8 <= ra <= v15), mm=00|
-| ZUNPKD820 rt, ra | Unsigned unpack bytes 2 & 0 | |
-| ZUNPKD830 rt, ra | Unsigned unpack bytes 3 & 0 | |
-| ZUNPKD831 rt, ra | Unsigned unpack bytes 3 & 1 | |
diff --git a/harmonised_rvv_rvp/discussion.mdwn b/harmonised_rvv_rvp/discussion.mdwn
deleted file mode 100644
index 1d86fcc9a..000000000
--- a/harmonised_rvv_rvp/discussion.mdwn
+++ /dev/null
@@ -1,44 +0,0 @@
-# Comments
-
-## enabling/disabling individual 8 and 16-bit operations in SIMD blocks
-
-* At the end of a loop, how are the three end operations of 4-wide 8-bit operations to be disabled (to avoid "SIMD considered harmful"?)
-* Likewise at the beginning of a loop, how are (up to) the first three operations to be disabled?
-* Likewise the last (and first) of 2-wide 16-bit operations?
-* What about predication within a 4-wide 8-bit group?
-* Likewise what about predication within a 2-wide 16-bit group?
-
-## Providing "cross-over" between elements in a group
-
-what do you think of the "CSR cross[32][6]" idea? sorry below may
-not be exactly clear, it's basically a way to generalise all
-cross-operations, even the SUNPKD810 rt, ra and ZUNPKD810 rt, ra would
-reduce down to one instruction as opposed to 8 right now.
-
-    def butterfly_remap(remap_me):
-        # hmmm a little hazy on the details here....
-        # help, help! logic-dyslexia kicking in!
-        # erm do some crossover using the 6 bits from
-        # the CSR cross map. first 2 bits swap
-        # elements in index positions 0,1 and 2,3
-        # second 2 bits swap elements in positions 0,2 and 1,3
-        # then swap 0,1 and 2,3 a second time.
-        # gives full set of all permutations.
-        return something, something
-
-    def crossover(elidx, destreg):
-        base = elidx & ~0x7
-        return butterfly_remap(CSR_cross[destreg][elidx & 0x7])
-
-    def op(v1, v2, v3):
-        for l in vlen:
-            remap_src1, remap_src2 = crossover(i, v1)
-            # remap_srcN references byte offsets? erm.... :)
-            GPR[v1] = scalar_op(GPR[v2][remap_src1],
-                                GPR[v3][remap_src2])
-
-Otherwise, VSHUFFLE and so on (and possibly xBitManip) would
-need to be used. xBitManip would not be a bad idea, except
-consideration of VLIW-like DSP (TI C67*) architectures needs
-to be given, which do not do register-renaming and have fixed
-pipeline phases with no stalling on register-dependencies.
diff --git a/instruction_virtual_addressing.mdwn b/instruction_virtual_addressing.mdwn
deleted file mode 100644
index 7d3a6c757..000000000
--- a/instruction_virtual_addressing.mdwn
+++ /dev/null
@@ -1,230 +0,0 @@
-# Beyond 39-bit instruction virtual address extension
-
-Peter says:
-
-I'd like to propose a spec change and don't know who to contact. My
-suggestion is that the instruction virtual address remain at 39-bits
-(or lower) while moving the data virtual address to 48-bits. These 2
-spaces do not need to be the same size, and the instruction space will
-naturally be a very small subset. The reason we expand is to access
-more data, but the HW cost is primarily from the instruction virtual
-address. I don't believe there are any applications that require nearly
-this much instruction space, so it's possible compilers already abide by
-this restriction. However we would need to formalize it to take advantage
-in HW.
-
-I've participated in many feasibilities to expand the virtual address
-through the years, and the costs (frequency, area, and power) are
-prohibitive and get worse with each process. The main reason it is so
-expensive is that the virtual address is used within the core to track
-each instruction, so it exists in almost every functional block. We try
-to implement address compression where possible, but it is still perhaps
-the costliest group of signals we have. This false dependency between
-instruction and data address space is the reason x86 processors have
-been stuck at 48 bits for more than a decade despite a strong demand
-for expansion from server customers.
-
-This seems like the type of HW/SW collaboration that RISC-V was meant
-to address. Any suggestions how to proceed?
-
-# Discussion with Peter and lkcl
-
->> i *believe* that would have implications that only a 32/36/39 bit
->> *total* application execution space could be fitted into the TLB at
->> any one time, i.e. that if there were two applications approaching
->> those limits, that the TLBs would need to be entirely swapped out to
->> make room for one (and only one) of those insanely-large programs to
->> execute at any one time.
->>
-> Yes, one solution would be to restrict the instruction TLB to one (or a few)
-> segments. Our interface to SW is on page misses and when reading from
-> registers (e.g. indirect branches), so we can translate to the different
-> address size at these points. It would be preferable if the corner cases
-> were disallowed by SW.
-
- ok so just to be clear:
-
-* application instruction space addressing is restricted to
-32/36/39-bit (whatever)
-* virtual address space for applications is restricted to 48-bit (on
-rv64: rv128 has higher?)
-* TLBs for application instruction space can then be restricted to
-32+N/36+N/39+N where 0 <= N <= a small number.
-* the smaller application space results in less virtual instruction
-address routing hardware (the primary goal)
-* an indirect branch, which will always be to an address within the
-32/36/39-bit range, will result in a virtual TLB table miss
-* the miss will be in:
-  -> the 32+N/36+N/39+N space that will be
-  -> redirected to a virtual 48-bit address that will be
-  -> redirected to real RAM through the TLB.
-
-assuming i have that right, in this way:
-
-* you still have up to 48-bit *actual* virtual addressing (and
-potentially even higher, even on RV64)
-* but any one application is limited in instruction addressing range
-to 32/36/39-bit
-* *BUT* you *CAN* actually have multiple such applications running
-simultaneously (depending on whether N is greater than zero or not).
-
-is that about right?
-
-if so, what are the disadvantages? what is lost (vs what is gained)?
-
---------
-
-reply:
-
- ok so just to be clear:
-
- * application instruction space addressing is restricted to
-32/36/39-bit (whatever)
-
-The address space of a process would ideally be restricted to a range
-such as this. If not, SW would preferably help with corner cases
-(e.g. instruction overlaps segment boundary).
-
- * virtual address space for applications is restricted to 48-bit (on
-rv64: rv128 has higher?)
-
-Anything 64-bits or less would be fine (more of an ISA issue).
-
- * TLBs for application instruction space can then be restricted to
-32+N/36+N/39+N where 0 <= N <= a small number.
-
-Yes
-
- * the smaller application space results in less virtual instruction
-address routing hardware (the primary goal)
-
-The primary goal is frequency, but routing in key areas is a major
-component of this (and is increasingly important on each new silicon
-process). Area and power are secondary goals.
- - * an indirect branch, which will always be to an address within the -32/36/39-bit range, will result in a virtual TLB table miss - -Indirect branches would ideally always map to the range, but HW would -always check. - - * the miss will be in: - -> the 32+N/36+N/39+N space that will be - -> redirected to a virtual 48-bit address that will be - -> redirected to real RAM through the TLB. - -Actually a page walk through the page miss handler, but the concept -is correct. - -> if so, what are the disadvantages? what is lost (vs what is gained)? - -I think the disadvantages are mainly SW implementation costs. The -advantages are frequency, power, and area. Also a mechanism for expanded -addressability and security. - -[hypothetically, the same scheme could equally be applied to 48-bit -executables (so 32/36/39/48).)] - -# Jacob and Albert discussion - -Albert Cahalan wrote: - -> The solution is likely to limit the addresses that can be living in the -> pipeline at any one moment. If that would be exceeded, you wait. -> -> For example, split a 64-bit address into a 40-bit upper part and a -> 24-bit lower part. Assign 3-bit codes in place of the 40-bit portion, -> first-come-first-served. Track just 27 bits (3+24) through the -> processor. You can do a reference count on the 3-bit codes or just wait -> for the whole pipeline to clear and then recycle all of the 3-bit codes. - -> Adjust all those numbers as determined by benchmarking. - -> I must say, this bears a strong resemblance to the TLB. Maybe you could -> use a TLB entry index for the tracking. - -I had thought of a similar solution. - -The key is that the pipeline can only care about some subset of the -virtual address space at any one time. All that is needed is some way -to distinguish the instructions that are currently in the pipeline, -rather than every instruction in the process, as virtual addresses do. - -I suggest using cache or TLB coordinates as instruction tags. 
This would -require that the L1 I-cache or ITLB "pin" each cacheline or slot that -holds a currently-pending instruction until that instruction is retired. -The L1 I-cache is probably an ideal reference, since the cache tag -array has the current base virtual address for each cacheline and the -rest of the pipeline would only need {cacheline number, offset} tuples. -Evicting the cacheline containing the most-recently-fetched instruction -would be insane in general, so this should have minimal impact on L1 -I-cache management. If the virtual address of the instruction is needed -for any reason, it can be read from the I-cache tag array. - -This approach can be trivially extended to multi-ASID or even multi-VMID -systems by simply adding VMID and ASID fields to the tag tuples. - -The L1 I-cache provides an easy solution for assigning "short codes" -to replace the upper portion of an instruction's virtual address. -As an example, consider an 8KiB L1 I-cache with 128-byte cachelines. -Such a cache has 64 cachelines (6 bits) and each cacheline has 64 or -32 possible instructions (depending on implementation of RVC or other -odd-alignment ISA extensions). For an RVC-capable system (the worst -case), each 128-byte cacheline has 64 possible instruction locations, for -another 6 bits. So now the rest of the pipeline need only track 12-bit -tags that reference the L1 I-cache. A similar approach could also use -the ITLB, but the ITLB variant results in larger tags, due both to the -need to track page offsets (11 bits) and the larger number of slots the -ITLB is likely to have. - -Conceivably, even the program counter could be internally implemented -in this way. - ------ - -Jacob replies - -The idea is that the internal encoding for (example) sepc could be the cache coordinates, and reading the CSR uses the actual value stored as an address to perform a read from the L1 I-cache tag array. 
In other words, cache coordinates do not need to be resolved back to virtual addresses until software does something that requires the virtual address. - -Branch target addresses get "interesting" since the implementation must either be able to carry a virtual address for a branch target into the pipeline (JALR needs the ability to transfer to a virtual address anyway) or prefetch all branch targets so the branch address can be written as a cache coordinate. An implementation could also simply have both "branch to VA" and "branch to CC" macro-ops and probe the cache when a branch is decoded: if the branch target is already in the cache, decode as "branch to CC", otherwise decode as "branch to VA". This requires tracking both forms of the program counter, however, and adds a performance-optimization rule: branch targets should be in the same or next cacheline when feasible. (I expect most implementations that implement I-cache prefetch at all to automatically prefetch the next cacheline of the instruction stream. That is very cheap to implement and the prefetch will hit whenever execution proceeds sequentially, which should be fairly common.) - -Limiting which instructions can take traps helps with this model, and interrupts (which can otherwise introduce interrupt traps anywhere) would need to be handled by inserting a "take interrupt trap" macro-op into the decoded instruction stream. - -Also, this approach can use coordinates into either the L1 I-cache or the ITLB. I have been describing the cache version because I find it more interesting and it can use smaller tags than the TLB version. You mention evaluating TLB pointers and finding them insufficient; do cache pointers reduce or solve those issues? What were the problems with using TLB coordinates instead of virtual addresses? - -More directly addressing lkcl's question, I expect that use of cache coordinates to be completely transparent to software, requiring no change to the ISA spec. 
As a purely microarchitectural solution, it also meets Dr. Waterman's goal. - -# Microarchitecture design preference - -andrew expressed a preference that the spec not require changes, instead that implementors design microarchitectures that solve the problem transparently. - -> so jacob (and peter, and albert, and others), what are your thoughts -> on whether these proposals would require a specification change. are -> they entirely transparent or are they guaranteed to have ramifications -> that propagate through the hardware and on to the toolchains and OSes, -> requiring formal platform-level specification and ratification? - -I had hoped for software proposals, but these HW proposals would not require a specification change. I found that TLB ptrs didn't address our primary design issues (about 10 years ago), but it does simplify areas of the design. At least a partial TLB would be needed at other points in the pipeline when reading the VA from registers or checking branch addresses. - -I still think the spec should recognize that the instruction space has very different requirements and costs. - ----- - -" sepc could be the cache coordinates [set,way?], and reading the CSR uses the actual value stored as an address to perform a read from the L1 I-cache tag array" -This makes no sense to me. First, reading the CSR move the CSR into a GPR, it doesn't look up anything in the cache. - -In an implementation using cache coordinates for *epc, reading *epc _does_ perform a cache tag lookup. - -In case you instead meant that it is then used to index into the cache, then either: - - Reading the CSR into a GPR resolves to a VA, or - -This is correct. - -[...] -Neither of those explanations makes sense- could you explain better? - -In this case, where sepc stores a (cache row, offset) tuple, reading sepc requires resolving that tuple into a virtual address, which is done by reading the high bits from the cache tag array and carrying over the offset within the cacheline. 
CPU-internal "magic cookie" cache coordinates are not software-visible. In this specific case, at entry to the trap handler, the relevant cacheline must be present -- it holds the most-recently executed instruction before the trap. - -In general, the cacheline can be guaranteed to remain present using interlock logic that prevents its eviction unless no part of the processor is "looking at" it. Reference counting is a solved problem and should be sufficient for this. This gets a bit more complex with speculative execution and multiple privilege levels, although a cache-per-privilege-level model (needed to avoid side channels) also solves the problem of the cacheline being evicted -- the user cache is frozen while the supervisor runs and vice versa. I have an outline for a solution to this problem involving shadow cachelines (enabling speculative prefetch/eviction in a VIPT cache) and a "trace scoreboard" (multi-column reference counter array -- each column tracks references from pending execution traces: issuing an instruction increments a cell, retiring an instruction decrements a cell, dropping a speculative trace (resolving predicate as false) zeros an entire column, and a cacheline may be selected for eviction iff its entire row is zero). - -CSR reads are allowed to have software-visible side effects in RISC-V, although none of the current standard CSRs have side-effects on read. Looking at it this way, resolving cache coordinates to a virtual address upon reading sepc is simply a side effect that is not visible to software. diff --git a/interrupts.mdwn b/interrupts.mdwn deleted file mode 100644 index 300ed7b18..000000000 --- a/interrupts.mdwn +++ /dev/null @@ -1,19 +0,0 @@ -# Interrupt Handling for RISC-V - -This page is a non-authoritative resource for information and documentation -about interrupt handling on RISC-V. An interim page for the discussion -of interrupt handling is here: [[interrupt_handling]]. 
- -# Open PLIC Implementations - -* - written in verilog, has an - AHB3-Lite / AMBA interface. Documentation is here: - - It has been taped out, it supports virtually unlimited (limited by - timing only) IRQ lines. All registers are dynamically generated. - Currently it only features an AHB3 slave interface, but the BIU is - separate. So other interfaces can be easily added. -* Shakti Peripherals, there is a tested (taped-out) version here - in src/peripherals/plic - and another version with up to 1024 IRQ lines and a 2-cycle - response time here diff --git a/interrupts/interrupt_handling.mdwn b/interrupts/interrupt_handling.mdwn deleted file mode 100644 index 1af87d41c..000000000 --- a/interrupts/interrupt_handling.mdwn +++ /dev/null @@ -1,38 +0,0 @@ -# Interrupt Handling in RISC-V - -This is a non-authoritative document for informally capturing the -requirements for interrupt handling across the spectrum of the entire -RISC-V ecosystem, with a view to finding common ground. Following on -from that will be seeing where collaboration is (and is not) feasible, -and, crucially, if the existing structures (such as the various PLIC -implementations that already exist) cover peoples' needs (or not). - -# Requirements Discussion - -This section is intended for capturing requirements from different sources -so that they can be viewed and compared in one place. If you are not -familiar with markdown or editing of wikis please contact -luke.leighton@gmail.com, sending the appropriate text, for inclusion here. - -* **Libre-RISCV Shakti M-Class**: a 300-400 pin SoC with almost a hundred - separate and distinct "slow" (below 160mhz) peripherals that need - nothing particularly special in the way of fast latency IRQs, just lots - of them. 
Five UARTs, each requiring one IRQ line; Four I2C peripherals, - each requiring two IRQ lines, Multiple Quad SPI interfaces requring - **six** IRQ lines (each!), and 32 "EINT" lines (general-purpose - external interrupt) which are intended for mundane purposes such as - "lid opened", or "volume key pressed" and "headphone jack inserted", - the number of IRQ lines required to cover such a significant number - of peripherals begins to add up quite rapidly. However despite this, - the PLIC as it stands (privspec-v-1.10 chapter 7) actually covers the - requirements quite nicely, as long as it can cope with large numbers - *of* IRQ lines (which it can). Thus the Shakti PLIC Peripheral code - has been modified from its original (which could handle up to XLEN - separate lines) to a hierarchical arrangement that can handle up to - 1024 separate and distinct IRQs - . A code-generator tool - will take care of the task - of auto-generating the #defines for the linux kernel, and presently - already takes care of the task of generating the PLIC fabric interconnect. - - diff --git a/overloadable_opcodes.mdwn b/overloadable_opcodes.mdwn deleted file mode 100644 index ed05873e2..000000000 --- a/overloadable_opcodes.mdwn +++ /dev/null @@ -1,486 +0,0 @@ -# Overloadable opcodes. - -The overloadable opcode (or xext) proposal allows a non standard extension to use a documented 20 + 3 bit (or 52 + 3 bit on RV64) UUID identifier for an instruction for _software_ to use. At runtime, a cpu translates the UUID to a small implementation defined 12 + 3 bit bit identifier for _hardware_ to use. It also defines a fallback mechanism for the UUID's of instructions the cpu does not recognise. - -Tl;DR see below for a C description of how this is supposed to work. - -It defines a small number N standardised R-type instructions -xcmd0, xcmd1, ...xcmd[N-1], preferably in the brownfield opcode space. We usually assume N = 8 (aka log2(8) = 3 in the + 3 above). 
-Each xcmd takes (in rs1) a 12 bit "logical unit" (lun) identifying a (sub)device on the cpu -that implements some "extension interface" (xintf) together with some additional data. -Extension devices may be implemented in any convenient form, e.g. non standard extensions -of the CPU iteself, IP tiles, or closely coupled external devices. - -An xintf is a set of up to N commands with 2 input and 1 output port (i.e. like an -R-type instruction), together with a description of the semantics of the commands. Calling -e.g. xcmd3 routes its two inputs and one output ports to command 3 on the device determined -by the lun bits in rs1. Thus, the N standard xcmd instructions are standard-designated -overloadable opcodes, with the non standard semantics of the opcode determined by the lun. - -Portable software, does not use luns directly. Instead, it goes through a level of -indirection using a further instruction xext. The xext instruction translates a 20 bit globally -unique identifier UUID of an xintf, to the lun of a device on the cpu that implements that xintf. -The cpu can do this, because it knows (at manufacturing or boot time) which devices it has, and -which xintfs they provide. This includes devices that would be described as non standard extension -of the cpu if the designers had used custom opcodes instead of xintf as an interface. If the -UUID of the xintf is not recognised at the current privilege level, the xext instruction returns -the special lun = 0, causing any xcmd to trap. Minor variations of this scheme (requiring two -more instructions xext0 and xextm1) cause xcmd instructions to fallback to always return 0 -or -1 instead of trapping. - -Remark1: the main difference with a previous "ioctl like proposal" is that UUID translation -is stateless and does not use resources. The xext instruction _neither_ initialises a -device _nor_ builds global state identified by a cookie. 
If a device needs initialisation -it can do this using xcmds as init and deinit instructions. Likewise, it can hand out -cookies (which can include the lun) as a return value . - -Remark2: Implementing devices can respond to an (essentially) arbitrary number of xintfs. -Hence, while an xintf is restricted to N commands, an implementing device can have an -arbitrary number of commands. Organising related commands in xintfs, helps avoid UUID space -pollution, and allows to amortise the (small) cost of UUID to lun translation if related -commands are used in combination. - - -== Description of the instructions == - - xcmd0 rd, rs1, rs2 - xcmd1 rd, rs1, rs2 - .... - xcmdN rd, rs1, rs2 - -* rs1 contains a 12 bit "logical unit" (lun) together with xlen - 12 bits of additional data. -* rs2 is arbitrary - -For e.g xmd3, route the inputs rs1, rs2 and output port rd to command 3 of the (sub)device on the cpu identified by the lun bits of rs1. - -after execution: -* rd contains the value that of the output port of the implementing device - --------- - xext rd, rs1, rs2 - xext0 rd, rs1, rs2 - xextm1 rd, rs1, rs2 - - -* rs1 contains ---a UUID of at least 20 bit in bit 12 .. XLEN of rs1 identifying an xintf. ---the sequence number of a device at the current privilege level on the cpu implementing the xintf in bit 0..11 . - In particular, if bit 0..11 is zero, the default implemententation is requested. -* rs2 is arbitrary (but bit XLEN-12 to XLEN -1 is discarded) - -after execution, - if the cpu recognises the UUID and device at the current privilege level, rd contains the lun of a device -implementing the xintf in bit 0..11, followed by bit 0.. XLEN - 13 of rs2. -if the cpu does not recognise the UUID and device it returns the numbers 0 (for xext), 1 (for xext0) or 2 (for xextm1), in particular bit 12.. XLEN are 0. 
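The register layouts described above can be sketched in C. This is a minimal illustration only, assuming XLEN = 64; the helper names (`pack_uuid_dev`, `pack_lun_data`, `lun_of`, `data_of`, `uuid_of`) are hypothetical and do not come from the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only: XLEN is assumed to be 64, and every name
   below is hypothetical rather than part of the proposal. */
#define LUN_BITS 12u
#define LOW12    ((UINT64_C(1) << LUN_BITS) - 1)

/* rs1 of xext: xintf UUID in bits 12..XLEN-1,
   device sequence number in bits 0..11. */
static uint64_t pack_uuid_dev(uint64_t uuid, unsigned dev_seq)
{
    return (uuid << LUN_BITS) | (dev_seq & LOW12);
}

/* rd of xext (and rs1 of xcmd): lun in bits 0..11, followed by the low
   XLEN-12 bits of rs2. The shift drops the top 12 bits of rs2. */
static uint64_t pack_lun_data(unsigned lun, uint64_t rs2)
{
    return (rs2 << LUN_BITS) | (lun & LOW12);
}

/* Field extractors for either packed format. */
static unsigned lun_of(uint64_t v)  { return (unsigned)(v & LOW12); }
static uint64_t data_of(uint64_t v) { return v >> LUN_BITS; }
static uint64_t uuid_of(uint64_t v) { return v >> LUN_BITS; }
```

With these helpers, `pack_uuid_dev(0xABCDE, 0)` gives `0xABCDE000`: the UUID sits in bits 12 and up, and a zero device sequence number requests the default implementation.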
- ---- -The net effect is that, when the CPU implements an xintf with UUID 0xABCDE a sequence like - - //fake UUID of an xintf - lui rd 0xABCDE - xext rd rd rs1 - xcmd0 rd rd rs2 - -acts like a single namespaced instruction cmd0_ABCDE rd rs1 rs2 (with the annoying caveat that the last 12 of rs1 are discarded) The sequence not indivisible but the crucial semantics that you might want to be indivisible is in xcmd0. - -Delegation and UUID is expected to come at a small performance price compared to a "native" instruction. This should, however, be an acceptable tradeoff in many cases. Moreover implementations may opcode-fuse the whole instruction sequence or the first or last two instructions. -If several instructions of the same interface are used, one can also use instruction sequences like - - lui t1 0xABCDE //org_tinker_tinker__RocknRoll_uuid - xext t1 t1 zero - xcmd0 a5, t1, a0 // org_tinker_tinker__RocknRoll__rock(a5, t1, a0) - xcmd1 t2, t1, a1 // org_tinker_tinker__RocknRoll__roll(t2, t1, a5) - xcmd0 a0, t1, t2 // org_tinker_tinker__RocknRoll__rock(a0, t1, t2) - -If 0xABCDE is an unknown UUID at the current privilege level, the sequence results in a trap just like cmd0_ABCDE rd rs1 rs2 would. The sequence - - //fake UUID of an xintf - lui rd 0xABCDE - xext0 rd rd rs1 - xcmd0 rd rd rs2 - -acts exactly like the sequence with xext, except that 0 is returned by xcmd0 if the UUID is unknown at the current privilege level. Likewise usage of xextm1 results in -1 being returned. This requires lun = 0 , 1 and 2 to be routed to three mandatory fallback -interfaces defined below. 
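The difference between xext, xext0 and xextm1 for an unknown UUID can be modelled with a toy C sketch. Everything here is an assumption made for illustration: the single known UUID, the device lun 32, the `toy_` names, and the `trapped` flag that stands in for a real trap to the next privilege level.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model of the three fallback luns; this is not the
   proposal's reference code. */
enum { LUN_TRAP = 0, LUN_RETURN_ZERO = 1, LUN_RETURN_MINUS_ONE = 2,
       LUN_FIRST_DEVICE = 32 };

enum xext_variant { XEXT, XEXT0, XEXTM1 };

/* Pretend this hart knows exactly one xintf, the fake UUID 0xABCDE. */
static int toy_lookup_lun(uint32_t uuid)
{
    return uuid == 0xABCDEu ? LUN_FIRST_DEVICE : -1;  /* -1: unknown UUID */
}

/* The three variants differ only in the lun returned for an unknown UUID. */
static unsigned toy_xext(enum xext_variant v, uint32_t uuid)
{
    int lun = toy_lookup_lun(uuid);
    if (lun >= 0)
        return (unsigned)lun;
    switch (v) {
    case XEXT0:  return LUN_RETURN_ZERO;
    case XEXTM1: return LUN_RETURN_MINUS_ONE;
    default:     return LUN_TRAP;
    }
}

/* xcmd routed by lun; *trapped is set instead of modelling a real trap. */
static long toy_xcmd(unsigned lun, long rs2, int *trapped)
{
    *trapped = 0;
    switch (lun) {
    case LUN_TRAP:             *trapped = 1; return 0;
    case LUN_RETURN_ZERO:      return 0;
    case LUN_RETURN_MINUS_ONE: return -1;
    default:                   return rs2 + 1;  /* stand-in for a device command */
    }
}
```

In this model, software that prefers a testable return value over a trap uses `toy_xext(XEXT0, ...)` or `toy_xext(XEXTM1, ...)`, and every later `toy_xcmd` call on the returned lun yields 0 or -1 without needing a separate check.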
- -On the software level, the xintf is just a set of glorified assembler macros - - org.tinker.tinker:RocknRoll{ - uuid : 0xABCDE - rock rd rs1 rs2 : xcmd0 rd rs1 rs2 - roll rd rs1 rs2 : xcmd1 rd rs1 rs2 - } - -so that the above sequence can be more clearly written as - - import(org.tinker.tinker:RocknRoll) - - lui rd org.tinker.tinker:RocknRoll:uuid - xext rd rd rs1 - org.tinker.tinker:RocknRoll:rock rd rd rs2 - - ------- -The following standard xintfs shall be implemented by the CPU. - -For lun == 0: - -At privilege level user mode, supervisor mode and hypervisor mode - - org.RiscV:Fallback:Trap{ - uuid: 0 - trap0 rd rs1 rs2: xcmd0 rd rs1 rs2 - ... - trap[N-1] rd rs1 rs2: xcmd[N-1] rd rs1 rs2 - } - -each of the xcmd instructions shall trap to one level higher. - -At privilege level machine mode each trap command has unspecified behaviour, but in debug mode -should cause an exception to a debug environment. - -For lun == 1, at all privilege levels - - org.RiscV:Fallback:ReturnZero{ - uuid: 1 - return_zero0 rd rs1 rs2: xcmd0 rd rs1 rs2 - ... - return_zero[N-1] rd rs1 rs2: xcmd[N-1] rd rs1 rs2 - } - -each return_zero command shall return 0 in rd. - -For lun == 2, at all privilege levels - - org.RiscV:Fallback:ReturnMinusOne{ - uuid: 2 - return_minusone0 rd rs1 rs2: xcmd0 rd rs1 rs2 - ... - return_minusone[N-1] rd rs1 rs2: xcmd[N-1] rd rs1 rs2 - } - -each return_minusone shall return -1. - ---- - -Remark: -Quite possibly even glorified standard assembler macros are overkill and it is -easier to just use defines or ordinary macro's with long names. E.g. 
writing - - #define org_tinker_tinker__RocknRoll__uuid 0xABCDE - #define org_tinker_tinker__RocknRoll__rock(rd, rs1, rs2) xcmd0 rd, rs1, rs2 - #define org_tinker_tinker__RocknRoll__roll(rd, rs1, rs2) xcmd1 rd, rs1, rs2 - -allows the same sequence to be written as - - lui rd org_tinker_tinker__RocknRoll__uuid - xext rd rs1 - org_tinker_tinker__RocknRoll__rock(rd, rd, rs2) - -Readability of assembler is no big deal for a compiler, but people are supposed -to _document_ the semantics of the interface. In particular specifying the semantics -of the xintf in same way as the semantics of the cpu should allow formal verification. - -==Implications for the RiscV ecosystem == - -The proposal allows independent groups to define one or more extension -interfaces of (slightly crippled) R-type instructions implemented by an -extension device. Such an extension device would be an native but non standard -extension of the CPU, an IP tile or a closely coupled external chip and would -be configured at manufacturing time or bootup of the CPU. - -The 20 bit provided by the UUID of an xintf is much more room than provided by -the 2 custom 32 bit, or even 4 custom 64/48 bit opcode spaces. Thus the overloadable -opcodes proposal avoids most of the need to put a claim on opcode space, -and the associated collisions when combining independent extensions. -In this respect it is similar to POSIX ioctls, which (almost) obviate the need for -defining new syscalls to control new or nonstandard hardware. - -The expanded flexibility comes at the cost: the standard can specify the -semantics of the delegation mechanism and the interfacing with the rest -of the cpu, but the actual semantics of the overloaded instructions can -only be defined by the designer of the interface. Likewise, a device -can be conforming as far as delegation and interaction with the CPU -is concerned, but whether the hardware is conforming to the semantics -of the interface is outside the scope of spec. 
Being able to specify -that semantics using the methods used for RV itself is clearly very -valuable. One impetus for doing that is using it for purposes of its own, -effectively freeing opcode space for other purposes. Also, some interfaces -may become de facto or de jure standards themselves, necessitating -hardware to implement competing interfaces. I.e., facilitating a free -for all, may lead to standards proliferation. C'est la vie. - -The only "ISA-collisions" that can still occur are in the 20 bit (~10^6) -interface identifier space, with 12 more bits to identify a device on -a hart that implements the interface. One suggestion is setting aside -2^19 id's that are handed out for a small fee by a central (automated) -registration (making sure the space is not just claimed), while the -remaining 2^19 are used as a good hash on a long, plausibly globally -unique human readable interface name. This gives implementors the choice -between a guaranteed private identifier paying a fee, or relying on low -probabilities. On RV64 the UUID can also be extended to 52 bits (> 10^15). - - -==== Description of the extension as C functions.== - - /* register format of rs1 for xext instructions */ - typedef struct uuid_device{ - long dev:12; - long uuid: 8*sizeof(long) - 12; - } uuid_device_t - - /* register format for rd of xext and rs1 of xcmd instructions, packs lun and data */ - typedef struct lun_data{ - long lun:12; - long data: 8*sizeof(long) - 12; - } lun_data_t - - /* proposed R-type instructions - xext rd rs1 rs2 - xcmd0 rd rs1 rs2 - xcmd1 rd rs1 rs2 - ... - xcmd7 rd rs1 rs2 - */ - - lun_data_t xext(uuid_dev_t rs1, long rs2); - long xcmd0(lun_data_t rs1, long rs2); - long xcmd1(lun_data_t rs1, long rs2); - ... - long xcmd(lun_data_t rs1, long rs2); - - /* hardware interface presented by an implementing device. 
*/ - typedef - long device_fn(unsigned short subdevice_xcmd, lun_data_t rs1, long rs2); - - /* cpu internal datatypes */ - - enum privilege = {user = 0b0001, super = 0b0010, hyper = 0b0100, mach = 0b1000}; - - /* cpu internal, does what is on the label */ - static - enum privilege cpu__current_privilege_level() - - typedef - struct lun{ - unsigned short id:12 - } lun_t; - - struct uuid_device_priv2lun{ - struct{ - uuid_dev_t uuid_dev; - enum privilege reqpriv; - }; - lun_t lun; - }; - - struct device_subdevice{ - device_fn* device_addr; - unsigned short subdeviceId:12; - }; - - struct lun_priv2device_subdevice{ - struct{ - lun_t lun; - enum privilege reqpriv - } - struct device_subdevice devAddr_subdevId; - } - - static - struct uuid_device_priv2lun cpu__lun_map[]; - - /* - map (UUID, device, privilege) to a 12 bit lun, - return -1 on unknown (at acces level) - - does associative memory lookup and tests privilege. - */ - static - short cpu__lookup_lun(const struct uuid_device_priv2lun* lun_map, uuid_dev_t uuid_dev, enum privilege priv, lun_t on_notfound); - - - #define org_RiscV__Trap__lun ((lun_t)0) - #define org_RiscV__Fallback__ReturnZero__lun ((lun_t)1) - #define org_RiscV__Fallback__ReturnMinusOne__lun ((lun_t)2) - - lun_data_t xext(uuid_dev_t rs1, long rs2) - { - short lun = cpu__lookup_lun(lun_map, rs1, current_privilege_level(), org_RiscV__Fallback__Trap__lun); - if(lun < 0) - return (lun_data_t){.lun = org_RiscV__Fallback__Trap__lun, .data = 0}; - - return (lun_data_t){.lun = lun, .data = rs2 % (1<< (8*sizeof(long) - 12))} - } - - lun_data_t xext0(uuid_dev_t rs1, long rs2) - { - short lun = cpu__lookup_lun(lun_map, rs1, current_privilege_level(), org_RiscV__Fallback__Trap__lun); - if(lun < 0) - return (lun_data_t){.lun = org_RiscV__Fallback__ReturnZero__lun, .data = 0}; - - return (lun_data_t){.lun = lun, .data = rs2 % (1<< (8*sizeof(long) - 12))} - } - - lun_data_t xextm1(uuid_dev_t rs1, long rs2) - { - short lun = cpu__lookup_lun(lun_map, rs1, 
current_privilege_level(), org_RiscV__Fallback__Trap__lun); - if(lun < 0) - return (lun_data_t){.lun = org_RiscV__Fallback__ReturnMinusOne__lun, .data = 0}; - - return (lun_data_t){.lun = lun, .data = rs2 % (1<< (8*sizeof(long) - 12))} - } - - - struct lun_priv2device_subdevice cpu__device_subdevice_map[]; - - /* map (lun, priv) to struct device_subdevice pair. - For lun = 0, or unknown (lun, priv) pair, returns (struct device_subdevice){NULL,0} - */ - static - device_subdevice_t cpu__lookup_device_subdevice(const struct lun_priv2device_subdevice_map* dev_subdev_map, - lun_t lun, enum privileges priv); - - - - /* functional description of the delegating xcmd0 .. xcmd7 instructions */ - template //pretend this is C - long xcmd(lun_data_t rs1, long rs2) - { - struct device_subdevice dev_subdev = cpu__lookup_device_subdevice(device_subdevice_map, rs1.lun, current_privilege()); - - if(dev_subdev.devAddr == NULL) - cpu__trap_to(next_privilege); - - return dev_subdev.devAddr(dev_subdev.subdevId | k >> 12 , rs1, rs2); - } - - /*Fallback interfaces*/ - #define org_RiscV__Fallback__ReturnZero__uuid 1 - #define org_RiscV__Fallback__ReturnMinusOne__uuid 2 - - /* fallback device */ - static - long cpu__falback(short subdevice_xcmd, lun_data_t rs1, long rs2) - { - switch(subdevice_xcmd % (1 << 12) ){ - case 0 /* org.RiscV:ReturnZero */: return 0; - case 1 /* org.RiscV:ReturnMinus1 */: return -1 - default: trap("hardware configuration error"); - } - -Example: - - // Fake UUID's - #define com_bigbucks__Frobate__uuid 0xABCDE - #define org_tinker_tinker__RocknRoll__uuid 0x12345 - #define org_tinker_tinker__Jazz__uuid 0xBEB0B - /* - com.bigbucks:Frobate{ - uuid: com_bigbucks__Frobate__uuid - frobate rd rs1 rs2 : cmd0 rd rs1 rs2 - foo rd rs1 rs2 : cmd1 rd rs1 rs2 - bar rd rs1 rs2 : cmd1 rd rs1 rs2 - } - */ - org.tinker.tinker:RocknRoll{ - uuid: org_tinker_tinker__RocknRoll__uuid - rock rd rs1 rs2: cmd0 rd rs1 rs2 - roll rd rs1 rs2: cmd1 rd rs1 rs2 - } - - /* - Device 1 implements 
com.bigbucks::Frobate and org.tinker.tinker interfaces, uses - a special command for the machine level implementation. - */ - - long com_bigbucks__device1(short subdevice_xcmd, lun_data_t rs1, long rs2) - { - switch(subdevice_xcmd) { - case 0 | 0 << 12 /* com.bigbucks:Frobate:frobate */ : return device1_frobate(rs1, rs2); - case 0 | 7 << 12 /* com.bigbucks:Frobate:frobate */ : return device1_frobate_machine_level(rs1, rs2); - case 0 | 1 << 12 /* com.bigbucks:Frobate:foo */ : return device1_foo(rs1, rs2); - case 0 | 2 << 12 /* com.bigbucks:Frobate:bar */ : return device1_bar(rs1, rs2); - case 1 | 0 << 12 /* org.tinker.tinker:RocknRoll:rock */ : return device1_rock(rs1, rs2); - case 1 | 1 << 12 /* org.tinker.tinker:RocknRoll:roll */ : return device1_roll(rs1, rs2); - default: trap(“hardware configuration error”); - } - } - - /* - org.tinker.tinker:Jazz{ - uuid: org_tinker_tinker__Jazz__uuid - boogy rd rs1 rs2: cmd0 rd rs1 rs2 - } - */ - - /* Device 2 implements Frobate and Jazz interfaces */ - long org_tinker_tinker__device2(short subdevice_xcmd, lun_data_t rs1, long rs2) - { - switch(dev_cmd.interfId){ - case 0 | 0 << 12 /* com.bigbucks:Frobate:frobate */: return device2_frobate(rs1, rs2); - case 0 | 1 << 12 /* com.bigbucks:Frobate:foo */ : return device2_foo(rs1, rs2); - case 0 | 2 << 12 /* com.bigbucks:Frobate:bar */ : return device2_foo(rs1, rs2); - case 1 | 0 << 12 /* org_tinker_tinker:Jazz:boogy */: return device2_boogy(rs1, rs2); - default: trap(“hardware configuration error”); - } - } - - /* cpu assigns luns to the interfaces at different privilege levels on device1 and 2 to luns at manufacturing or boot up time */ - #define cpu__Device1__Frobate__lun ((lun_t)32) - #define cpu__Device1__RocknRoll__lun ((lun_t)33) - #define cpu__Device2__Frobate__lun ((lun_t)34) - #define cpu__Device2__Jazz__lun ((lun_t)35) - - /* struct uuid_dev2lun_map[] */ - lun_map = { - {{.uuid_devId = {org_RiscV__Fallback__ReturnZero__uuid , 0}, .priv = user}, .lun = 
org_RiscV__Fallback__ReturnZero__lun}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnZero__uuid , 0}, .priv = super}, .lun = org_RiscV__Fallback__ReturnZero__lun}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnZero__uuid , 0}, .priv = hyper}, .lun = org_RiscV__Fallback__ReturnZero__lun}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnZero__uuid , 0}, .priv = mach} .lun = org_RiscV__Fallback__ReturnZero__lun}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnMinusOne__uuid, 0}, .priv = user}, .lun = org_RiscV__Fallback__ReturnMinusOne__lun}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnMinusOne__uuid, 0}, .priv = super}, .lun = org_RiscV__Fallback__ReturnMinusOne__lun}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnMinusOne__uuid, 0}, .priv = hyper}, .lun = org_RiscV__Fallback__ReturnMinusOne__lun}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnMinusOne__uuid, 0}, .priv = mach}, .lun = org_RiscV__Fallback__ReturnMinusOne__lun}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 0}, .priv = user} .lun = cpu__Device1__Frobate__lun}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 1}, .priv = super} .lun = cpu__Device1__Frobate__lun}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 1}, .priv = hyper} .lun = cpu__Device1__Frobate__lun}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 1}, .priv = mach} .lun = cpu__Device1__Frobate__lun}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 0}, .priv = super} .lun = cpu__Device2__Frobate__lun}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 0}, .priv = hyper} .lun = cpu__Device2__Frobate__lun}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 0}, .priv = mach} .lun = cpu__Device2__Frobate__lun}, - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = user} .lun = cpu__Device1__RocknRoll__lun}, - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = super} .lun = cpu__Device1__RocknRoll__lun}, - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = hyper} .lun = cpu__Device1__RocknRoll__lun}, - 
{{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = super}, .lun = cpu__Device2__Jazz__lun}, - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = hyper}, .lun = cpu__Device2__Jazz__lun}, - } - - /* cpu maps luns + privilege level to busaddress of device and particular subdevice according to spec of the device.*/ - /* struct lun2dev_subdevice_map[] */ - dev_subdevice_map = { - // .lun = 0, will trap - {{.lun = org_RiscV__Fallback__ReturnZero__lun, .priv = user}, .devAddr_interfId = {fallback, 1 /* ReturnZero */}}, - {{.lun = org_RiscV__Fallback__ReturnZero__lun, .priv = super}, .devAddr_interfId = {fallback, 1 /* ReturnZero */}}, - {{.lun = org_RiscV__Fallback__ReturnZero__lun, .priv = hyper}, .devAddr_interfId = {fallback, 1 /* ReturnZero */}}, - {{.lun = org_RiscV__Fallback__ReturnZero__lun, .priv = mach}, .devAddr_interfId = {fallback, 1 /* ReturnZero */}}, - {{.lun = org_RiscV__Fallback__ReturnMinusOne__lun, .priv = user}, .devAddr_interfId = {fallback, 2 /* ReturnMinusOne*/}}, - {{.lun = org_RiscV__Fallback__ReturnMinusOne__lun, .priv = super}, .devAddr_interfId = {fallback, 2 /* ReturnMinusOne*/}}, - {{.lun = org_RiscV__Fallback__ReturnMinusOne__lun, .priv = hyper}, .devAddr_interfId = {fallback, 2 /* ReturnMinusOne*/}}, - {{.lun = org_RiscV__Fallback__ReturnMinusOne__lun, .priv = mach}, .devAddr_interfId = {fallback, 2 /* ReturnMinusOne*/}}, - // .lun = 3 .. 7 reserved for other fallback RV interfaces - // .lun = 8 .. 
30 reserved as error numbers, c.li t1 31; bltu rd t1 L_fail tests errors - // .lun = 31 reserved out of caution - {{.lun = cpu__Device1__Frobate__lun, .priv = user}, .devAddr_interfId = {device1, 0 /* Frobate interface */}}, - {{.lun = cpu__Device1__Frobate__lun, .priv = super}, .devAddr_interfId = {device1, 0 /* Frobate interface */}}, - {{.lun = cpu__Device1__Frobate__lun, .priv = hyper}, .devAddr_interfId = {device1, 0 /* Frobate interface */}}, - {{.lun = cpu__Device1__Frobate__lun, .priv = mach}, .devAddr_interfId = {device1,64 /* Frobate machine level */}}, - {{.lun = cpu__Device1__RocknRoll__lun, .priv = user}, .devAddr_InterfId = {device1, 1 /* RocknRoll interface */}}, - {{.lun = cpu__Device1__RocknRoll__lun, .priv = super}, .devAddr_InterfId = {device1, 1 /* RocknRoll interface */}}, - {{.lun = cpu__Device1__RocknRoll__lun, .priv = hyper}, .devAddr_InterfId = {device1, 1 /* RocknRoll interface */}}, - {{.lun = cpu__Device1__RocknRoll__lun, .priv = super}, .devAddr_interfId = {device2, 1 /* Frobate interface */}}, - {{.lun = cpu__Device2__Frobate__lun, .priv = super}, .devAddr_interfId = {device2, 0 /* Frobate interface */}}, - {{.lun = cpu__Device2__Frobate__lun, .priv = hyper}, .devAddr_interfId = {device2, 0 /* Frobate interface */}}, - {{.lun = cpu__Device2__Frobate__lun, .priv = mach}, .devAddr_interfId = {device2, 0 /* Frobate interface */}}, - {{.lun = cpu__Device2__Jazz__lun, .priv = super}, .devAddr_interfId = {device2, 1 /* Jazz interface */}}, - {{.lun = cpu__Device2__Jazz__lun, .priv = hyper}, .devAddr_interfId = {device2, 1 /* Jazz interface */}}, - } diff --git a/pluggable_extensions.mdwn b/pluggable_extensions.mdwn deleted file mode 100644 index ed58c4486..000000000 --- a/pluggable_extensions.mdwn +++ /dev/null @@ -1,319 +0,0 @@ -# pluggable extensions - -This proposal adds a standardised extension instructions to the RV -instruction set by introducing a fixed small number N (e.g. N = 8) of -R-type opcodes xcmd0 rd, rs1, rs2, .. 
, xcmd rd, rs1, rs2, that are intended to be used as "overloadable" (slightly crippled) R-type instructions for independently developed extensions in the form of non standard CPU extensions, IP tiles, or closely coupled external devices. - -Tl;DR see below for a C description of how this is supposed to work. - -The input value of an xcmd instruction in rs2 is arbitrary. The content of the first input rs1, however, is divided in a 12bit "logical unit" (lun) together with xlen - 12 bits of additional data. -The lun bits in rs1, determines a specific (sub)device, and the CPU routes the command to this device with rs1 and rs2 as input, and rd as output. Effectively, the xcmd0, ... xcmd7 instructions are "virtual method" opcodes, overloaded for different extension (sub)devices. - -The specific value of the lun is supposed to be convenient for the cpu and is thus unstandardised. Portable software therefore constructs the lun, with a further R-type instruction xext. It takes a 20 bit universally unique identifier (UUID) that identifies an interface with upto N R-type instructions with the signature of xcmd. An optional sequence number identifies a specific enumerated device on the cpu that implements the interface as a subdevice. For convenience, xext also or's bits rs2[0..XLEN-12]. If the UUID is not recognised 0 is returned. , but implemented by the extension (sub)device. Note that this scheme gives an easy work around the restriction on N (e.g. 8 ) commands: an implementing device can simply implement several interfaces as routable subdevices, indeed is expected to do so. - -The net effect is that a sequence like - - //fake UUID - lui rd 0xEDCBA - xext rd rd rs1 - xcmd0 rd rd rs2 - -acts like a single namespaced instruction cmd0_EDCBA rd rs1 rs2 with the annoying caveat that rs1 can only use bits 0..XLEN-12 (the sequence is also not indivisible but the crucial semantics that you might want to be indivisible is in xcmd0). 
Delegation is expected to come at a small -additional performance price compared to a "native" instruction. This should, however, be an acceptable tradeoff in many cases. - - -Programatically the instructions in the interface are just a set of glorified assembler macros - - org.tinker.tinker:RocknRoll{ - uuid : 0xABCDE - rock rd rs1 rs2 : xcmd0 rd rs1 rs2 - roll rd rs1 rs2 : xcmd1 rd rs1 rs2 - } - -so that the above sequence is more clearly written as - - import(org.tinker.tinker:RocknRoll) - - lui rd org.tinker.tinker:RocknRoll:uuid - xext rd rd rs1 - org.tinker.tinker:RocknRoll:rock rd rd rs2 - -(Quite possibly even glorified standard assembler macros are overkill and it is easier to just use defines or ordinary macro's with long names. E.g. writing - - #define org_tinker_tinker__RocknRoll__uuid 0xABCDE - #define org_tinker_tinker__RocknRoll__rock(rd, rs1, rs2) xcmd0 rd, rs1, rs2 - #define org_tinker_tinker__RocknRoll__roll(rd, rs1, rs2) xcmd1 rd, rs1, rs2 - -allows the same sequence to be written as - - lui rd org_tinker_tinker__RocknRoll__uuid - xext rd rs1 - org_tinker_tinker__RocknRoll__rock(rd, rd, rs2) - -Readability of assembler is no big deal for a compiler, but people are supposed to _document_ the interface and its semantics. In particular a semantics specified like the semantics of the cpu would be most welcome.) - - -If several instructions of the same interface are used, one can also use instruction sequences like - - lui t1 org_tinker_tinker__RocknRoll_uuid - xext t1 zero - xcmd0 a5, t1, a0 // org_tinker_tinker__RocknRoll__rock(a5, t1, a0) - xcmd1 t2, t1, a1 // org_tinker_tinker__RocknRoll__roll(t2, t1, a5) - xcmd0 a0, t1, t2 // org_tinker_tinker__RocknRoll__rock(a0, t1, t2) - -This amortises the cost of the xext instruction. - -==Implications for the RiscV ecosystem == - - -The proposal allows independent groups to define one or more extension -interfaces of (slightly crippled) R-type instructions implemented by an -extension device. 
Such an extension device would be an native but non standard -extension of the CPU, an IP tile or a closely coupled external chip and would -be configured at manufacturing time or bootup of the CPU. - -Having a standardised overloadable interface simply avoids much of the -need for isa extensions for hardware with non standard interfaces and -semantics. This is analogous to the way that the standardised overloadable -ioctl interface of the kernel almost completely avoids the need for -extending the kernel with syscalls for the myriad of hardware devices -with their specific interfaces and semantics. - -The expanded flexibility comes at the cost: the standard can specify the -semantics of the delegation mechanism and the interfacing with the rest -of the cpu, but the actual semantics of the overloaded instructions can -only be defined by the designer of the interface. Likewise, a device -can be conforming as far as delegation and interaction with the CPU -is concerned, but whether the hardware is conforming to the semantics -of the interface is outside the scope of spec. Being able to specify -that semantics using the methods used for RV itself is clearly very -valuable. One impetus for doing that is using it for purposes of its own, -effectively freeing opcode space for other purposes. Also, some interfaces -may become de facto or de jure standards themselves, necessitating -hardware to implement competing interfaces. I.e., facilitating a free -for all, may lead to standards proliferation. C'est la vie. - -The only "ISA-collisions" that can still occur are in the 20 bit (~10^6) -interface identifier space, with 12 more bits to identify a device on -a hart that implements the interface. One suggestion is setting aside -2^19 id's that are handed out for a small fee by a central (automated) -registration (making sure the space is not just claimed), while the -remaining 2^19 are used as a good hash on a long, plausibly globally -unique human readable interface name. 
This gives implementors the choice -between a guaranteed private identifier paying a fee, or relying on low -probabilities. On RV64 the UUID can also be extended to 52 bits (> 10^15). - - -==== Description of the extension as C functions.== - - /* register format of rs1 for xext instructions */ - typedef struct uuid_device{ - long dev:12; - long uuid: 8*sizeof(long) - 12; - } uuid_device_t - - /* register format for rd of xext and rs1 for xcmd instructions, packs lun and data */ - typedef struct lun_data{ - long lun:12; - long data: 8*sizeof(long) - 12; - } lun_data_t - - /* proposed R-type instructions - xext rd rs1 rs2 - xcmd0 rd rs1 rs2 - xcmd1 rd rs1 rs2 - ... - xcmd7 rd rs1 rs2 - */ - - lun_data_t xext(uuid_dev_t rs1, long rs2); - long xcmd0(lun_data_t rs1, long rs2); - long xcmd1(lun_data_t rs1, long rs2); - ... - long xcmd(lun_data_t rs1, long rs2); - - /* hardware interface presented by an implementing device. */ - typedef - long device_fn(unsigned short subdevice_xcmd, lun_data_t rs1, long rs2); - - /* cpu internal datatypes */ - - enum privilege = {user = 0b0001, super = 0b0010, hyper = 0b0100, mach = 0b1000}; - - /* cpu internal, does what is on the label */ - static - enum privilege cpu__current_privilege_level() - - typedef - struct lun{ - unsigned short id:12 - } lun_t; - - struct uuid_device_priv2lun{ - struct{ - uuid_dev_t uuid_dev; - enum privilege reqpriv; - }; - lun_t lun; - }; - - struct device_subdevice{ - device_fn* device_addr; - unsigned short subdeviceId:12; - }; - - struct lun_priv2device_subdevice{ - struct{ - lun_t lun; - enum privilege reqpriv - } - struct device_subdevice devAddr_subdevId; - } - - static - struct uuid_device_priv2lun cpu__lun_map[]; - - /* - map (UUID, device, privilege) to a 12 bit lun, - return (lun_t){0} on unknown (at acces level) - - does associative memory lookup and tests privilege. 
- */ - static - lun_t cpu__lookup_lun(const struct uuid_device_priv2lun* lun_map, uuid_dev_t uuid_dev, enum privilege priv); - - - - lun_data_t xext(uuid_dev_t rs1, long rs2) - { - lun_t lun = cpu__lookup_lun(lun_map, rs1, current_privilege_level()); - - return (lun_data_t){.lun = lun.id, .data = rs2 % (1<< (8*sizeof(long) - 12))} - } - - - - - struct lun_priv2device_subdevice cpu__device_subdevice_map[]; - - /* map (lun, priv) to struct device_subdevice pair. - For lun = 0, or unknown (lun, priv) pair, returns (struct device_subdevice){NULL,0} - */ - static - device_subdevice_t cpu__lookup_device_subdevice(const struct lun_priv2device_subdevice_map* dev_subdev_map, - lun_t lun, enum privileges priv); - - /* functional description of the delegating xcmd0 .. xcmd7 instructions */ - template //pretend this is C - long xcmd(lun_data_t rs1, long rs2) - { - struct device_subdevice dev_subdev = cpu__lookup_device_subdevice(device_subdevice_map, rs1.lun, current_privilege()); - if(dev_subdev.devAddr == NULL) - trap(“Illegal instruction”); - - return dev_subdev.devAddr(dev_subdev.subdevId | k << 12, rs1, rs2); - } - - - -Example: - - #define com_bigbucks__Frobate__uuid 0xABCDE - #define org_tinker_tinker__RocknRoll__uuid 0x12345 - #define org_tinker_tinker__Jazz__uuid 0xD0B0D - /* - com.bigbucks:Frobate{ - uuid: com_bigbucks__Frobate__uuid - frobate rd rs1 rs2 : cmd0 rd rs1 rs2 - foo rd rs1 rs2 : cmd1 rd rs1 rs2 - bar rd rs1 rs2 : cmd1 rd rs1 rs2 - } - */ - org.tinker.tinker:RocknRoll{ - uuid: org_tinker_tinker__RocknRoll__uuid - rock rd rs1 rs2: cmd0 rd rs1 rs2 - roll rd rs1 rs2: cmd1 rd rs1 rs2 - } - - long com_bigbucks__device1(short subdevice_xcmd, lun_data_t rs1, long rs2) - { - switch(subdevice_xcmd) { - case 0 | 0 << 12 /* com.bigbucks:Frobate:frobate */ : return device1_frobate(rs1, rs2); - case 42| 0 << 12 /* com.bigbucks:FrobateMach:frobate : return device1_frobate_machine_level(rs1, rs2); - case 0 | 1 << 12 /* com.bigbucks:Frobate:foo */ : return 
device1_foo(rs1, rs2); - case 0 | 2 << 12 /* com.bigbucks:Frobate:bar */ : return device1_bar(rs1, rs2); - case 1 | 0 << 12 /* org.tinker.tinker:RocknRoll:rock */ : return device1_rock(rs1, rs2); - case 1 | 1 << 12 /* org.tinker.tinker:RocknRoll:roll */ : return device1_roll(rs1, rs2); - default: trap(“hardware configuration error”); - } - } - - /* - org.tinker.tinker:Jazz{ - uuid: org_tinker_tinker__Jazz__uuid - boogy rd rs1 rs2: cmd0 rd rs1 rs2 - } - */ - - long org_tinker_tinker__device2(short subdevice_xcmd, lun_data_t rs1, long rs2) - { - switch(dev_cmd.interfId){ - case 0 | 0 << 12 /* com.bigbucks:Frobate:frobate */: return device2_frobate(rs1, rs2); - case 0 | 1 << 12 /* com.bigbucks:Frobate:foo */ : return device2_foo(rs1, rs2); - case 0 | 2 << 12 /* com.bigbucks:Frobate:bar */ : return device2_foo(rs1, rs2); - case 1 | 0 << 12 /* org_tinker_tinker:Jazz:boogy */: return device2_boogy(rs1, rs2); - default: trap(“hardware configuration error”); - } - } - - /* struct uuid_dev2lun_map[] */ - lun_map = { - {{.uuid_devId = {org_RiscV__Fallback__ReturnZero__uuid , 0}, .priv = user}, .lun = 1}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnZero__uuid , 0}, .priv = super}, .lun = 1}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnZero__uuid , 0}, .priv = hyper}, .lun = 1}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnZero__uuid , 0}, .priv = mach} .lun = 1}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnMinusOne__uuid, 0}, .priv = user}, .lun = 2}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnMinusOne__uuid, 0}, .priv = super}, .lun = 2}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnMinusOne__uuid, 0}, .priv = hyper}, .lun = 2}, - {{.uuid_devId = {org_RiscV__Fallback__ReturnMinusOne__uuid, 0}, .priv = mach}, .lun = 2}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 0}, .priv = user} .lun = 32}, //32 sic! 
- {{.uuid_devId = {com_bigbucks__Frobate__uuid, 1}, .priv = super} .lun = 32}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 1}, .priv = hyper} .lun = 32}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 1}, .priv = mach} .lun = 32}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 0}, .priv = super} .lun = 34}, //34 sic! - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 0}, .priv = hyper} .lun = 34}, - {{.uuid_devId = {com_bigbucks__Frobate__uuid, 0}, .priv = mach} .lun = 34}, - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = user} .lun = 33}, //33 sic! - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = super} .lun = 33}, - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = hyper} .lun = 33}, - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = super}, .lun = 35}, - {{.uuid_devId = {org_tinker_tinker__RocknRoll__uuid, 0}, .priv = hyper}, .lun = 35}, - } - - /* struct lun2dev_subdevice_map[] */ - dev_subdevice_map = { - // {.lun = 0, error and falls back to trapping xcmd - {{.lun = 1, .priv = user}, .devAddr_interfId = {fallback, 0 /* ReturnZero */}}, - {{.lun = 1, .priv = super}, .devAddr_interfId = {fallback, 0 /* ReturnZero */}}, - {{.lun = 1, .priv = hyper}, .devAddr_interfId = {fallback, 0 /* ReturnZero */}}, - {{.lun = 1, .priv = mach}, .devAddr_interfId = {fallback, 0 /* ReturnZero */}}, - {{.lun = 2, .priv = user}, .devAddr_interfId = {fallback, 1 /* ReturnMinusOne*/}}, - {{.lun = 2, .priv = super}, .devAddr_interfId = {fallback, 1 /* ReturnMinusOne*/}}, - {{.lun = 2, .priv = hyper}, .devAddr_interfId = {fallback, 1 /* ReturnMinusOne*/}}, - {{.lun = 2, .priv = mach}, .devAddr_interfId = {fallback, 1 /* ReturnMinusOne*/}}, - // .lun = 3 .. 7 reserved for other fallback RV interfaces - // .lun = 8 .. 
30 reserved as error numbers, c.li t1 31; bltu rd t1 L_fail tests errors - // .lun = 31 reserved out of caution - {{.lun = 32, .priv = user}, .devAddr_interfId = {device1, 0 /* Frobate interface */}}, - {{.lun = 32, .priv = super}, .devAddr_interfId = {device1, 0 /* Frobate interface */}}, - {{.lun = 32, .priv = hyper}, .devAddr_interfId = {device1, 0 /* Frobate interface */}}, - {{.lun = 32, .priv = mach}, .devAddr_interfId = {device1,64 /* Frobate machine level interface */}}, - {{.lun = 33, .priv = user}, .devAddr_InterfId = {device1, 1 /* RocknRoll interface */}}, - {{.lun = 33, .priv = super}, .devAddr_InterfId = {device1, 1 /* RocknRoll interface */}}, - {{.lun = 33, .priv = hyper}, .devAddr_InterfId = {device1, 1 /* RocknRoll interface */}}, - {{.lun = 34, .priv = super}, .devAddr_interfId = {device2, 0 /* Frobate interface */}}, - {{.lun = 34, .priv = hyper}, .devAddr_interfId = {device2, 0 /* Frobate interface */}}, - {{.lun = 34, .priv = mach}, .devAddr_interfId = {device2, 0 /* Frobate interface */}}, - {{.lun = 35, .priv = super}, .devAddr_interfId = {device2, 1 /* Jazz interface */}}, - {{.lun = 35, .priv = hyper}, .devAddr_interfId = {device2, 1 /* Jazz interface */}}, - } diff --git a/rv_major_opcode_1010011.mdwn b/rv_major_opcode_1010011.mdwn deleted file mode 100644 index 6383cf0c4..000000000 --- a/rv_major_opcode_1010011.mdwn +++ /dev/null @@ -1,476 +0,0 @@ -**OBSOLETE**, superceded by [[openpower/transcendentals]] - -# Summary FP Opcodes - -This page aids and assists in the development of FP proposals, -by identifying and listing in full both publicly-known proposals -and the full brownfield encoding space available in the 0b010011 -major opcode. - -A primary critical use-case for extending FP is for 3D and supercomputing. 
- -Publicly-known FP proposals: - -* Zfrsqrt - Reciprocal SQRT -* Zftrans - see [[ztrans_proposal]]: Transcendentals - (FPOW, FEXP, FLOG, FCBRT) - -* Ztrig\* - see [[ztrans_proposal]]: Trigonometriics - (FSIN, FCOS, FTAN, arc-variants, hypotenuse-variants) -* Extension of formats to cover FP16 (RISC-V ISA Manual Table 11.3 "fmt field") - -* HI-half FP MV - - -* (Add new entries here: Zextname - Description and URL) - -[[!toc levels=2]] - -# Main FP opcode 1010011 table - -Notes: - -* Proposed new encodings in **bold**. -* *Use funct5 sparingly!* - 2-operand functions only. -* Single-argument FP operations should go under one of the funct5 tables -* Both dual and single argument FP operations that do not require - "rounding mode" should go in one of the funct5 tables that already use - "funct3". -* The rs2 field can be best used to sub-select a considerable number - of 1-op operations, with "rounding" in funct3 -* The funct3 field can be best used to sub-select a considerable number - of 2-op operations -* 1-op operations that do not need "rounding" have the best brownfield - availability: 8 bit sub-selection (rs2=5 + funct3=3). This however is - rare as most FP operations need "rounding" selection. -* Be careful not to use encoding space for which FP16 has already been - reserved (mostly FP conversion opcodes) - -[[!table data=""" -31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function | -funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name | - 5 | 2 | 5 | 5 | 3 | 5 | 7 | | -00000 | xx | rs2 | rs1 | rm | rd | 1010011 | FADD.xx | -00001 | xx | rs2 | rs1 | rm | rd | 1010011 | FSUB.xx | -00010 | xx | rs2 | rs1 | rm | rd | 1010011 | FMUL.xx | -00011 | xx | rs2 | rs1 | rm | rd | 1010011 | FDIV.xx | -00100 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -00101 | xx | rs2 | rs1 | yyy | rd | 1010011 | tb=00101 | -00110 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -00111 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? 
| -01000 | xx | rs2 | rs1 | yyy | rd | 1010011 | tb=01000 | -01001 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -01010 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -01011 | xx | xxxxx | rs1 | yyy | rd | 1010011 | tb=01011 | -01100 | xx | rs2 | rs1 | yyy | rd | 1010011 | **FHYPOT.xx** | -01101 | xx | rs2 | rs1 | rm | rd | 1010011 | **FATAN2.xx** | -01110 | xx | rs2 | rs1 | rm | rd | 1010011 | **FATAN2PI.xx**| -01111 | xx | rs2 | rs1 | rm | rd | 1010011 | **FPOW.xx** | -10000 | xx | rs2 | rs1 | yyy | rd | 1010011 | **FROOTN.xx** | -10001 | xx | rs2 | rs1 | yyy | rd | 1010011 | **FPOWN.xx** | -10010 | xx | rs2 | rs1 | yyy | rd | 1010011 | **FPOWR.xx** | -10011 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -10100 | xx | rs2 | rs1 | yyy | rd | 1010011 | tb=10100 | -10101 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -10110 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -10111 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -11000 | xx | xxxxx | rs1 | yyy | rd | 1010011 | tb=11000 | -11001 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -11010 | xx | xxxxx | rs1 | yyy | rd | 1010011 | tb=11010 | -11100 | xx | xxxxx | rs1 | yyy | rd | 1010011 | tb=11100 | -11101 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -11110 | xx | xxxxx | rs1 | yyy | rd | 1010011 | tb=11110 | -11111 | xx | rs2 | rs1 | yyy | rd | 1010011 | ? | -"""]] - -Code: - -* xx: Opcode format field "fmt" - Table 11.3 -* xxxxx: 5-bit selection field (usually 1-op selection) -* yyy: funct3 selection field (usually 2-op selection) -* rm: "rounding mode" - -## funct5 = 00000 - FADD - -No brownfield encodings available. - -## funct5 = 00001 - FSUB - -No brownfield encodings available. - -## funct5 = 00010 - FMUL - -No brownfield encodings available. - -## funct5 = 00011 - FDIV - -No brownfield encodings available. - -## funct5 = 00100 - unused - -Brownfield encodings available. 
- -## funct5 = 00100 - FSGN - -This table uses funct3 for encoding 2-operand FP operations - -[[!table data=""" -31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function | -funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name | - 5 | 2 | 5 | 5 | 3 | 5 | 7 | | -00100 | xx | rs2 | rs1 | 000 | rd | 1010011 | FSGNJ.xx | -00100 | xx | rs2 | rs1 | 001 | rd | 1010011 | FSGNJN.xx | -00100 | xx | rs2 | rs1 | 010 | rd | 1010011 | FSGNJX.xx | -00100 | xx | rs2 | rs1 | 011 | rd | 1010011 | ?f3=011 | -00100 | xx | rs2 | rs1 | 100 | rd | 1010011 | ?f3=100 | -00100 | xx | rs2 | rs1 | 101 | rd | 1010011 | ?f3=101 | -00100 | xx | rs2 | rs1 | 110 | rd | 1010011 | ?f3=110 | -00100 | xx | rs2 | rs1 | 111 | rd | 1010011 | ?f3=111 | -"""]] - -## funct5 = 00101 - FMIN/MAX - -This table uses funct3 for encoding 2-operand FP operations where the result -register is a **floating-point** value. - -[[!table data=""" -31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function | -funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name | - 5 | 2 | 5 | 5 | 3 | 5 | 7 | | -00101 | xx | rs2 | rs1 | 000 | rd | 1010011 | FMIN.S | -00101 | xx | rs2 | rs1 | 001 | rd | 1010011 | FMAX.S | -00101 | xx | rs2 | rs1 | 010 | rd | 1010011 | ?f3=010 | -00101 | xx | rs2 | rs1 | 011 | rd | 1010011 | ?f3=011 | -00101 | xx | rs2 | rs1 | 100 | rd | 1010011 | ?f3=100 | -00101 | xx | rs2 | rs1 | 101 | rd | 1010011 | ?f3=101 | -00101 | xx | rs2 | rs1 | 110 | rd | 1010011 | ?f3=110 | -00101 | xx | rs2 | rs1 | 111 | rd | 1010011 | ?f3=111 | -"""]] - -## funct5 = 00110 - unused - -Brownfield encodings available. - -## funct5 = 00111 - unused - -Brownfield encodings available. 
- -## funct5 = 01000 - FCVT - -This table uses rs2 for encoding 1-operand FP operations, using -funct3 to specify the "rounding" mode - -Notes: - -* FP16 logically deduced from fmt field encoding (bits 25-26) - -[[!table data=""" -31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function | -funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name | - 5 | 2 | 5 | 5 | 3 | 5 | 7 | | -01000 | 00 | 00000 | rs1 | rm | rd | 1010011 | ???????? | -01000 | 00 | 00001 | rs1 | rm | rd | 1010011 | FCVT.S.D | -01000 | 00 | 00010 | rs1 | rm | rd | 1010011 | **FCVT.S.H**| -01000 | 00 | 00011 | rs1 | rm | rd | 1010011 | FCVT.S.Q | -01000 | 00 | xxxxx | rs1 | rm | rd | 1010011 | rs2? | ------- | ----- | ----- | -----| ----- | ----- | ------- | -------- | -01000 | 01 | 00000 | rs1 | rm | rd | 1010011 | FCVT.D.S | -01000 | 01 | 00001 | rs1 | rm | rd | 1010011 | ???????? | -01000 | 01 | 00010 | rs1 | rm | rd | 1010011 | **FCVT.D.H**| -01000 | 01 | 00011 | rs1 | rm | rd | 1010011 | FCVT.D.Q | -01000 | 01 | xxxxx | rs1 | rm | rd | 1010011 | rs2? | ------- | ----- | ----- | -----| ----- | ----- | ------- | -------- | -01000 | 10 | 00000 | rs1 | rm | rd | 1010011 | FCVT.H.S | -01000 | 10 | 00001 | rs1 | rm | rd | 1010011 | **FCVT.H.D**| -01000 | 10 | 00010 | rs1 | rm | rd | 1010011 | ???????? | -01000 | 10 | 00011 | rs1 | rm | rd | 1010011 | FCVT.H.Q | -01000 | 10 | xxxxx | rs1 | rm | rd | 1010011 | rs2? | ------- | ----- | ----- | -----| ----- | ----- | ------- | -------- | -01000 | 11 | 00000 | rs1 | rm | rd | 1010011 | FCVT.Q.S | -01000 | 11 | 00001 | rs1 | rm | rd | 1010011 | FCVT.Q.D | -01000 | 11 | 00010 | rs1 | rm | rd | 1010011 | **FCVT.Q.H**| -01000 | 11 | 00011 | rs1 | rm | rd | 1010011 | ???????? | -01000 | 11 | xxxxx | rs1 | rm | rd | 1010011 | rs2? | -"""]] - -## funct5 = 01001 - unused - -Brownfield encodings available. - -## funct5 = 01010 - unused - -Brownfield encodings available. 
- -## funct5 = 01011 - 1-op Transcendentals - -This table uses rs2 for encoding 1-operand FP operations, using -funct3 to specify the "rounding" mode - -[[!table data=""" -31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function | -funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name | - 5 | 2 | 5 | 5 | 3 | 5 | 7 | | -01011 | xx | 00000 | rs1 | rm | rd | 1010011 | FSQRT.xx | -01011 | xx | 00001 | rs1 | rm | rd | 1010011 | **FRSQRT.xx** | -01011 | xx | 00010 | rs1 | rm | rd | 1010011 | **FRECIP.xx** | -01011 | xx | 00011 | rs1 | rm | rd | 1010011 | **FCBRT.xx** | -01011 | xx | 00100 | rs1 | rm | rd | 1010011 | **FEXP2.xx** | -01011 | xx | 00101 | rs1 | rm | rd | 1010011 | **FLOG2.xx** | -01011 | xx | 00110 | rs1 | rm | rd | 1010011 | **FEXPM1.xx** | -01011 | xx | 00111 | rs1 | rm | rd | 1010011 | **FLOGP1.xx** | -01011 | xx | 01000 | rs1 | rm | rd | 1010011 | **FEXP.xx** | -01011 | xx | 01001 | rs1 | rm | rd | 1010011 | **FLOG.xx** | -01011 | xx | 01010 | rs1 | rm | rd | 1010011 | **FEXP10.xx** | -01011 | xx | 01011 | rs1 | rm | rd | 1010011 | **FLOG10.xx** | -01011 | xx | 01100 | rs1 | rm | rd | 1010011 | **FASINH.xx** | -01011 | xx | 01101 | rs1 | rm | rd | 1010011 | **FACOSH.xx** | -01011 | xx | 01110 | rs1 | rm | rd | 1010011 | **FATANH.xx** | -01011 | xx | 01111 | rs1 | rm | rd | 1010011 | ? 
| -01011 | xx | 10000 | rs1 | rm | rd | 1010011 | **FSIN.xx** | -01011 | xx | 10001 | rs1 | rm | rd | 1010011 | **FSINPI.xx** | -01011 | xx | 10010 | rs1 | rm | rd | 1010011 | **FASIN.xx** | -01011 | xx | 10011 | rs1 | rm | rd | 1010011 | **FASINPI.xx**| -01011 | xx | 10100 | rs1 | rm | rd | 1010011 | **FCOS.xx** | -01011 | xx | 10101 | rs1 | rm | rd | 1010011 | **FCOSPI.xx** | -01011 | xx | 10110 | rs1 | rm | rd | 1010011 | **FACOS.xx** | -01011 | xx | 10111 | rs1 | rm | rd | 1010011 | **FACOSPI.xx**| -01011 | xx | 11000 | rs1 | rm | rd | 1010011 | **FTAN.xx** | -01011 | xx | 11001 | rs1 | rm | rd | 1010011 | **FTANPI.xx** | -01011 | xx | 11010 | rs1 | rm | rd | 1010011 | **FATAN.xx** | -01011 | xx | 11011 | rs1 | rm | rd | 1010011 | **FATANPI.xx**| -01011 | xx | 11100 | rs1 | rm | rd | 1010011 | **FSINH.xx** | -01011 | xx | 11101 | rs1 | rm | rd | 1010011 | **FCOSH.xx** | -01011 | xx | 11110 | rs1 | rm | rd | 1010011 | **FTANH.xx** | -01011 | xx | 11111 | rs1 | rm | rd | 1010011 | ? | -"""]] - -## funct5 = 01100 - **FHYPOT** - -Proposed for Zftrans - FHYPOT: "sqrt(rs1 * rs1 + rs2 * rs2)" - -## funct5 = 01101 - **FATAN2** - -Proposed for Zftrans - FATAN: "atan(rs1, rs2)" - -## funct5 = 01110 - **FATAN2PI** - -Proposed for ZftransExt - FATAN2PI: "atan2(rs1, rs2) * PI". -Rationale: Gives better accuracy than if using FMUL with the constant, PI. - -## funct5 = 01111 - **FPOW** - -Proposed for ZftransAdv - FPOW: "FP rs1 to the power of rs2" - -## funct5 = 10000 - **FROOTN** - -Proposed for ZftransAdv - FPROOTN: "FP rs1 to the power of (1/rs2)". -rs1 is FP, rs2 is **integer**. - -## funct5 = 10000 - **FPOWN** - -Proposed for ZftransAdv - FPOW: "FP rs1 to the power of rs2" -rs1 is FP, rs2 is **integer**. - -## funct5 = 10001 - **FPOW** - -Proposed for ZftransAdv - FPOWN: "FP rs1 to the power of rs2, rs1 +ve" -rs1 and rs2 are FP, rs1 must be +ve. Equivalent to "exp(rs2 * log(rs1))" - -## funct5 = 10010 - unused - -Brownfield encodings available. 
-
-## funct5 = 10011 - unused
-
-Brownfield encodings available.
-
-## funct5 = 10100 - FP comparisons
-
-This table uses funct3 for encoding 2-operand FP "comparison" operations
-where the result register is an **integer**
-
-Notes:
-
-* FNE missing?
-
-[[!table data="""
-31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function |
-funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name |
- 5 | 2 | 5 | 5 | 3 | 5 | 7 | |
-10100 | xx | rs2 | rs1 | 000 | rd | 1010011 | FLE.xx |
-10100 | xx | rs2 | rs1 | 001 | rd | 1010011 | FLT.xx |
-10100 | xx | rs2 | rs1 | 010 | rd | 1010011 | FEQ.xx |
-10100 | xx | rs2 | rs1 | 011 | rd | 1010011 | ?f3=011 |
-10100 | xx | rs2 | rs1 | 100 | rd | 1010011 | ?f3=100 |
-10100 | xx | rs2 | rs1 | 101 | rd | 1010011 | ?f3=101 |
-10100 | xx | rs2 | rs1 | 110 | rd | 1010011 | ?f3=110 |
-10100 | xx | rs2 | rs1 | 111 | rd | 1010011 | ?f3=111 |
-"""]]
-
-## funct5 = 10101 - unused
-
-Brownfield encodings available.
-
-## funct5 = 10110 - unused
-
-Brownfield encodings available.
-
-## funct5 = 10111 - unused
-
-Brownfield encodings available.
-
-## funct5 = 11000 - FCVT
-
-This table uses rs2 for encoding 1-operand FP operations, using
-funct3 to specify the "rounding" mode
-
-Notes:
-
-* FP16 logically deduced from fmt field (bits 25-26)
-
-[[!table data="""
-31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function |
-funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name |
- 5 | 2 | 5 | 5 | 3 | 5 | 7 | |
-11000 | 00 | 00000 | rs1 | rm | rd | 1010011 | FCVT.W.S |
-11000 | 00 | 00001 | rs1 | rm | rd | 1010011 | FCVT.WU.S |
-11000 | 00 | 00010 | rs1 | rm | rd | 1010011 | FCVT.L.S |
-11000 | 00 | 00011 | rs1 | rm | rd | 1010011 | FCVT.LU.S |
-11000 | 00 | xxxxx | rs1 | rm | rd | 1010011 | rs2? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11000 | 01 | 00000 | rs1 | rm | rd | 1010011 | FCVT.W.D |
-11000 | 01 | 00001 | rs1 | rm | rd | 1010011 | FCVT.WU.D |
-11000 | 01 | 00010 | rs1 | rm | rd | 1010011 | FCVT.L.D |
-11000 | 01 | 00011 | rs1 | rm | rd | 1010011 | FCVT.LU.D |
-11000 | 01 | xxxxx | rs1 | rm | rd | 1010011 | rs2? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11000 | 10 | 00000 | rs1 | rm | rd | 1010011 |**FCVT.W.H** |
-11000 | 10 | 00001 | rs1 | rm | rd | 1010011 |**FCVT.WU.H**|
-11000 | 10 | 00010 | rs1 | rm | rd | 1010011 |**FCVT.L.H** |
-11000 | 10 | 00011 | rs1 | rm | rd | 1010011 |**FCVT.LU.H**|
-11000 | 10 | xxxxx | rs1 | rm | rd | 1010011 | rs2? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11000 | 11 | 00000 | rs1 | rm | rd | 1010011 | FCVT.W.Q |
-11000 | 11 | 00001 | rs1 | rm | rd | 1010011 | FCVT.WU.Q |
-11000 | 11 | 00010 | rs1 | rm | rd | 1010011 | FCVT.L.Q |
-11000 | 11 | 00011 | rs1 | rm | rd | 1010011 | FCVT.LU.Q |
-11000 | 11 | xxxxx | rs1 | rm | rd | 1010011 | rs2? |
-"""]]
-
-## funct5 = 11001 - unused
-
-Brownfield encodings available.
-
-## funct5 = 11010 - FCVT
-
-This table uses rs2 for encoding 1-operand FP operations, using
-funct3 to specify the "rounding" mode
-
-* FP16 logically deduced from fmt field (bits 25-26)
-
-[[!table data="""
-31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function |
-funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name |
- 5 | 2 | 5 | 5 | 3 | 5 | 7 | |
-11010 | 00 | 00000 | rs1 | rm | rd | 1010011 | FCVT.S.W |
-11010 | 00 | 00001 | rs1 | rm | rd | 1010011 | FCVT.S.WU |
-11010 | 00 | 00010 | rs1 | rm | rd | 1010011 | FCVT.S.L |
-11010 | 00 | 00011 | rs1 | rm | rd | 1010011 | FCVT.S.LU |
-11010 | 00 | xxxxx | rs1 | rm | rd | 1010011 | rs2? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11010 | 01 | 00000 | rs1 | rm | rd | 1010011 | FCVT.D.W |
-11010 | 01 | 00001 | rs1 | rm | rd | 1010011 | FCVT.D.WU |
-11010 | 01 | 00010 | rs1 | rm | rd | 1010011 | FCVT.D.L |
-11010 | 01 | 00011 | rs1 | rm | rd | 1010011 | FCVT.D.LU |
-11010 | 01 | xxxxx | rs1 | rm | rd | 1010011 | rs2? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11010 | 10 | 00000 | rs1 | rm | rd | 1010011 |**FCVT.H.W** |
-11010 | 10 | 00001 | rs1 | rm | rd | 1010011 |**FCVT.H.WU**|
-11010 | 10 | 00010 | rs1 | rm | rd | 1010011 |**FCVT.H.L** |
-11010 | 10 | 00011 | rs1 | rm | rd | 1010011 |**FCVT.H.LU**|
-11010 | 10 | xxxxx | rs1 | rm | rd | 1010011 | rs2? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11010 | 11 | 00000 | rs1 | rm | rd | 1010011 | FCVT.Q.W |
-11010 | 11 | 00001 | rs1 | rm | rd | 1010011 | FCVT.Q.WU |
-11010 | 11 | 00010 | rs1 | rm | rd | 1010011 | FCVT.Q.L |
-11010 | 11 | 00011 | rs1 | rm | rd | 1010011 | FCVT.Q.LU |
-11010 | 11 | xxxxx | rs1 | rm | rd | 1010011 | rs2? |
-"""]]
-
-## funct5 = 11100 - FMV, FCLASS
-
-This table uses *both* rs2 *and* funct3 for encoding 1-operand FP operations.
-
-Notes:
-
-* FMV.X.Q is missing (alias of FMVH.X.D if it existed)
-* FP16 logically deduced from fmt field (bits 25-26)
-* FMVH.X.HW (half-word) missing?
-
-[[!table data="""
-31..27| 26..25| 24..20 |19..15|14...12| 11..7 | 6....0 | function |
-funct5| SDHQ | rs2 | rs1 |funct3 | rd | opcode | name |
- 5 | 2 | 5 | 5 | 3 | 5 | 7 | |
-11100 | 00 | 00000 | rs1 | 000 | rd | 1010011 | FMV.X.W |
-11100 | 00 | 00000 | rs1 | 001 | rd | 1010011 | FCLASS.S |
-11100 | 00 | xxxxx | rs1 | yyy | rd | 1010011 | rs2? f3? |
-------| ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11100 | 01 | 00000 | rs1 | 000 | rd | 1010011 | FMV.X.D **FMVH.X.W** |
-11100 | 01 | 00000 | rs1 | 001 | rd | 1010011 | FCLASS.D |
-11100 | 01 | xxxxx | rs1 | yyy | rd | 1010011 | rs2? f3? |
-------| ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11100 | 10 | 00000 | rs1 | 000 | rd | 1010011 |**FMV.X.H** |
-11100 | 10 | 00000 | rs1 | 001 | rd | 1010011 |**FCLASS.H** |
-11100 | 10 | xxxxx | rs1 | yyy | rd | 1010011 | rs2? f3? |
-------| ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11100 | 11 | 00000 | rs1 | 000 | rd | 1010011 | **FMVH.X.D** |
-11100 | 11 | 00000 | rs1 | 001 | rd | 1010011 | FCLASS.Q |
-11100 | xx | xxxxx | rs1 | yyy | rd | 1010011 | rs2? f3? |
-"""]]
-
-## funct5 = 11101 - unused
-
-Brownfield encodings available.
-
-## funct5 = 11110 - FMV
-
-This table uses *both* rs2 *and* funct3 for encoding 1-operand FP operations.
-
-Notes:
-
-* FMV.Q.X is missing (as is FMVH.D.X)
-* FMVH.W.X is missing (alias of FMV.D.X)
-* FP16 logically deduced from fmt field (bits 25-26)
-* FMVH.HW.X (half-word) missing?
-
-[[!table data="""
-31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function |
-funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name |
- 5 | 2 | 5 | 5 | 3 | 5 | 7 | |
-11110 | 00 | 00000 | rs1 | 000 | rd | 1010011 | FMV.W.X |
-11110 | 00 | xxxxx | rs1 | yyy | rd | 1010011 | rs2? f3? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11110 | 01 | 00000 | rs1 | 000 | rd | 1010011 | FMV.D.X |
-11110 | 01 | xxxxx | rs1 | yyy | rd | 1010011 | rs2? f3? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11110 | 10 | 00000 | rs1 | 000 | rd | 1010011 |**FMV.H.X** |
-11110 | 10 | xxxxx | rs1 | yyy | rd | 1010011 | rs2? f3? |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-11110 | 11 | 00000 | rs1 | 000 | rd | 1010011 | ? |
-11110 | 11 | xxxxx | rs1 | yyy | rd | 1010011 | rs2? f3? |
-"""]]
-
-## funct5 = 11111 - unused
-
-Brownfield encodings available.
-
-## funct5 = ????? (table template)
-
-This table acts as a cut/paste template for creating brownfield encodings
-
-[[!table data="""
-31..27 | 26..25 | 24..20 |19..15| 14...12| 11..7 | 6....0 | function |
-funct5 | SDHQ | rs2 | rs1 | funct3 | rd | opcode | name |
- 5 | 2 | 5 | 5 | 3 | 5 | 7 | |
------- | ----- | ----- | -----| ----- | ----- | ------- | -------- |
-"""]]
-
-
diff --git a/systemes_libre.mdwn b/systemes_libre.mdwn
deleted file mode 100644
index 58bf2a6a7..000000000
--- a/systemes_libre.mdwn
+++ /dev/null
@@ -1,3 +0,0 @@
-# Systèmes Libres
-
-* [[Systemes Libres Amazon Alexa IOT Pitch 10-JUN-2020]]
diff --git a/systemes_libre/Systemes_Libres_Amazon_Alexa_IOT_Pitch_10-JUN-2020.mdwn b/systemes_libre/Systemes_Libres_Amazon_Alexa_IOT_Pitch_10-JUN-2020.mdwn
deleted file mode 100644
index 7e72c0c94..000000000
--- a/systemes_libre/Systemes_Libres_Amazon_Alexa_IOT_Pitch_10-JUN-2020.mdwn
+++ /dev/null
@@ -1,84 +0,0 @@
-# Systèmes Libres Amazon Alexa IOT Pitch 10-JUN-2020
-
-This is Cole's first draft of the script for Systèmes Libres' Wednesday, 10 June 2020 pitch meeting with Amazon Alexa's IOT division, which will be presented by Yehowshua. Please edit, rearrange, and add relevant points as you see fit. I will be taking this script and turning it into a pretty 'business-deck' on Monday, 8 June 2020, so please help today (Sunday 7 June 2020) if you can.
-
-Questions that I think need to be answered by people with the relevant technical expertise are bolded. Thanks everyone for helping if you can!
-
-# Title Page
-
-Systèmes Libres
-
-Delivering robust low power, high-performance hardware for IOT and Edge Compute
-
-
-# Part 1 - Current GPU IOT Device Shortcomings
-
-1. GPU is in ADDITION to CPU (2 processors, 2 ISAs, 2 compilers)
-
- a. Hardware backdoors: In the industrial IOT and RTOS market, significant harm can be done by malevolent actors and competitors by hacking hardware and causing it to do damage, or hacking the hardware to steal proprietary company secrets
-
- b. Power: 2 separate cores (CPU, GPU) leads to much higher
- power consumption
-
- c. Capability: In RTOS devices, can't make effective use of the GPU
-
- d. the drivers involve an inter-core RPC mechanism: which is unacceptably high latency and complexity
-
- e. Furthermore, current RTOS microcontrollers have much lower mathematical numerically intensive computational performance at the same power and silicon area compared to our chip.
-
-2. Time/Ease of Use/Development: Proprietary development tools and documentation result in an often difficult and long development cycle, especially when rebuilding and optimizing arithmetically intensive algorithms for embedded systems.
-
-3. Amazon's Sagemaker and Intel's ngraph are steps in the right direction, but ultimately will never be able to provide comparable ease-of-use and insight into every level of the product.
-
-4. Proprietary GPU inner workings are not available for inspection, neither during active development nor during a critical evaluation phase for their suitability.
-
-
-# Part 2 – How is Libre-SOC different? Or What Makes Libre-SOC better?
-
-* (*Addresses #1*) Systèmes Libres is developing an SOC with a fused CPU-GPU architecture
-
-* This hybrid CPU-GPU will have a lower power budget (**what and how?**) (*addresses a.*), and higher computational performance than to competing SOCs (*addresses c.*) like the Broadcom BCM2836 (**what are their power, and performance specs? Why are using specifically BCM2836 as a comparison?**)
-
-* Systèmes Libres is developing RTOS drivers using Systèmes Libres' Simple-V dynamically partitionable vector algorithm (**what is it? what makes it different? what are its relative benefits and short comings?**) that automatically handles algorithmic optimization and reconfiguration.
-
-* Systèmes Libres is developing graphics and compute drivers in conformance with the open standards Vulkan and OpenCL (**should we remove opencl until we have had a proper discussion of ROCM on bugzilla?**). Using open standards makes rebuilding or using existing algorithms a simple and easy process.
-
-* Systèmes Libres’ completely libre hardware-software stack enables an unprecedented level of insight into the entire system.
-
- a. Developers can begin their investigations at the top analyzing high-level software, then down into firmware.
-
- b. at the lowest level, they can examine detailed schematic diagrams.
-
- c. The developer can easily see the function of individual components as well as all of the relationships in the system.
-
-# Part 3 - Security and Privacy in RTOS and Industrial IOT
-
-* Companies try to security-harden their software by writing it in special languages like Ada, or using c++ with static code analyzers and special 'Safety-critical' c++ coding guidelines. However, all of this time and money is wasted if the hardware running underneath this software is hopelessly insecure (**picture intel meltdown, spectre ahhhh!!**)
-
-(jacob: note that our processor is most likely still vulnerable to some variants of spectre unless we make a special effort in the instruction scheduling HW, load/store unit, caches, branch predictor, etc. see
-additional good example bugs would include intel's ME vulnerabilities)
-
-* For example, in the self-driving car domain, the concern about GPU-capable RTOS devices being insecure at the hardware level causes significant barriers to self-driving car adoption because the public is scared that someone will 'hack' their car and crash it
-
-* Almost every credit card and banking transaction is dependent on transaction processing servers, if these are hacked $M to $B of economic damage can be done
-
-Financial hardware such as cryptocurrency hardware wallets, and traditonal banking hardware from ATMs to stock terminals and transaction processing servers containing can be trusted with a much greater degree of confidence if the hardware crypto chips which they rely on are formally verified, with the entire HDL available for a full independent audit.
-
-
-# Part 4 - Low Power Requirements of RTOS and Industrial IOT Devices
-
-* At the level of individual sensors, power draw must be minimized as these are often running off very small batteries
-
-* At the level of millions of sensors, power draw must be minimized to lower the overall power bill
-
-* Our hybrid CPU-GPU chip uses less energy that existing microcontrollers that use a cpu with a tacked-on gpu (**numbers??**)
-
-
-# Part 5 – Conclusion
-
-1. Better security
-2. Lower cost
-3. Lower power
-4. Higher computation at same power
-5. Ease of use and development
-6. Flexibility for customisation
diff --git a/systemes_libre/index.mdwn b/systemes_libre/index.mdwn
deleted file mode 100644
index 71e8208db..000000000
--- a/systemes_libre/index.mdwn
+++ /dev/null
@@ -1,2 +0,0 @@
-# Systemes Libre
-
-- 
2.30.2