From: Luke Kenneth Casson Leighton
Date: Wed, 16 Jan 2019 09:12:21 +0000 (+0000)
Subject: add 011 plan
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=f915f3ef4f0a48e69643ba4c99d63e5066baf874;p=crowdsupply.git

add 011 plan
---

diff --git a/updates/011_2019jan16_spectre_plan.mdwn b/updates/011_2019jan16_spectre_plan.mdwn
new file mode 100644
index 0000000..ff495a7
--- /dev/null
+++ b/updates/011_2019jan16_spectre_plan.mdwn
@@ -0,0 +1,97 @@
+# Spectre Plan
+
+So from the previous update, we had a massive spanner in the works,
+which is hitting not just this design but absolutely every single
+out-of-order processor, as the problems associated with timing attacks
+that probe resource congestion are related to the out-of-order paradigm,
+not just a particular vendor or one particular processor: it's **all**
+out-of-order processors, period.
+
+To illustrate: if a vendor decides to have a single divide ALU shared
+across multiple cores, arbitrary untrusted processes can issue divide
+operations to find out if **other** cores are trying to use the (shared)
+divide ALU resource.
+
+If there is limited bandwidth on operand forwarding, for example, then
+an arbitrary untrusted process may issue a series of instructions that
+are specifically designed to be chained together so as to trigger
+operand forwarding, use up all the available bandwidth of the Operand
+Forwarding Bus, and, if the completion time is not as expected, the
+attacker knows that another process tried to use the same Bus.
+
+We think we have a solution to this: a "Speculation Fence" instruction
+(or "hint", as they are known). The idea is, before an arbitrary
+untrusted process is permitted to run, to call a special instruction
+that *clears the decks*, resetting the Out-of-Order execution engine
+back to a known, quiescent state. Thus, there *is* no information
+to leak to the attacker.
+
+We will also need all system calls, traps and interrupts to automatically
+be a speculation fence point. We can also look at doing a "graded"
+shutdown of speculation and resource allocation, on the basis that
+if it is known in advance that a system call is coming up, there is
+no point issuing speculative instructions or using out-of-order resources
+if they are about to be cancelled within 5-10 instructions!
+
+The alternatives... well, they don't work. A software-only solution
+("fixing" Spectre in the Linux kernel) has got so complicated and has
+so badly affected performance that Linus Torvalds recently put his foot
+down and refused to allow "yet another Spectre patch". A hardware-only
+solution *also* isn't good enough, as it basically involves degrading
+performance back to that of a **single-issue in-order** machine.
+
+We feel the "cooperative" approach is a reasonable compromise that is
+also simple and straightforward to implement in both hardware and software.
+It will be a lot of work; however, at least we can put the underpinnings
+in place (in the hardware).
+
+# 48-bit Instruction Extension
+
+Jacob raised an idea to do
+[extension prefixes](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000316.html) on Simple-V.
+It's a really good idea, one that I was hoping would not be necessary. It
+comes down to the fact that it takes a bit more than was anticipated to
+do the setup and teardown of the Vectorisation Engine.
+
+So the plan is to have a couple of prefixes: one 16-bit, one 32-bit, that
+"extend" both Compressed (16-bit) and standard (32-bit) instructions, turning
+them into "one-off" Vector Instructions. There are two problems: firstly,
+that extends the instruction encoding, which in turn complicates the
+instruction decode phase. Secondly: we may have to use the 48-bit encoding
+space, which in turn takes up a whopping six of the available 16 bits,
+which in turn puts a huge amount of pressure on what can actually be
+extended.
+
+For example: if 2 bits are allocated to extend 5-bit register numbers
+out to 7 bits, that allows us to access the full 128 integer and FP
+range needed for a GPU and VPU. Unfortunately, we need 2 bits for
+rs1, 2 bits for rs2, 2 bits for rs3 and 2 bits for rd. That's 8 bits
+already, and we haven't gotten to VL (Vector Length), the element
+width (setting 8/16/32/64 bit), or predication.
+
+If doing a 32-bit prefix, that actually needs to either be a 48-bit
+encoding or a 64-bit encoding, depending on whether a 16-bit "Compressed"
+instruction or a 32-bit standard instruction is to be prefixed.
+
+There is an alternative: for the 16-bit prefix, there happens to be
+a Compressed major opcode that is not being used (bits 13-15 equal to 100,
+bits 0-1 equal to 00). This gives 11 bits spare (where a 48-bit encoding
+can only squeeze out 8, maybe 9). It also has one significant advantage:
+as it is actually a standard "C" opcode, it can be done as macro-op fusion.
+That in turn means that modifications to the compiler toolchain are a lot
+less significant.
+
+12 available bits, things start to look a lot better. For 32-bit opcodes,
+2 bits can be prepended to a 5-bit destination, 2 more bits for all source
+registers, 2 bits for Vector Length (VL=1/2/3/4), and 2 bits for the
+element width (8/16/32/64). That leaves 4 spare bits for specifying
+predication, *or*, if prefixing 16-bit "Compressed" instructions, they
+could be used to extend some of the operations that only have 3-bit
+registers, by another 2 bits.
+
+It's quite complex and is going to need a lot of thought. Some compromises
+need to be made, the issue being that we won't know what the best choices
+are until we have a better handle on things, through simulations and
+comprehensive analysis.
+
+Designing processors is tricky!
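
As a rough illustration of the encoding arithmetic in the "48-bit Instruction Extension" section of the update above, the following Python sketch classifies an instruction by its first 16-bit parcel and extracts the spare bits from the unused Compressed major opcode the text describes (bits 13-15 equal to 100, bits 0-1 equal to 00). The length rules follow the RISC-V variable-length encoding convention as I understand it. The function names (`insn_length_bits`, `is_prefix_candidate`, `prefix_payload`) and the choice to treat the 11 spare bits as one opaque payload field are assumptions made purely for illustration; they are not part of Simple-V or of any finalised encoding.

```python
# Hypothetical helper names throughout; this is a sketch, not a decoder
# for any agreed Simple-V prefix format.

def insn_length_bits(parcel: int) -> int:
    """Instruction length in bits, judged from the first 16-bit parcel.

    Returns 0 for longer or reserved encodings, which this sketch ignores.
    """
    if parcel & 0b11 != 0b11:
        return 16                      # Compressed: low two bits are not 11
    if parcel & 0b11100 != 0b11100:
        return 32                      # standard: bits 2-4 are not 111
    if parcel & 0b111111 == 0b011111:
        return 48                      # six low bits spent marking 48-bit
    if parcel & 0b1111111 == 0b0111111:
        return 64                      # seven low bits spent marking 64-bit
    return 0


def is_prefix_candidate(parcel: int) -> bool:
    """True if the parcel uses the Compressed major opcode described in the
    text: bits 13-15 equal to 100 and bits 0-1 equal to 00."""
    return (parcel >> 13) & 0b111 == 0b100 and parcel & 0b11 == 0b00


# 16 bits, minus 3 bits (13-15) and 2 bits (0-1) used to select the opcode.
SPARE_BITS = 16 - 3 - 2


def prefix_payload(parcel: int) -> int:
    """Return bits 2-12 of the parcel as one opaque spare-bit field."""
    return (parcel >> 2) & ((1 << SPARE_BITS) - 1)


# Extending a 5-bit register number by 2 bits gives a 7-bit number.
EXTENDED_REGISTERS = 1 << (5 + 2)

# Extending rs1, rs2, rs3 and rd individually costs 2 bits each.
BITS_FOR_FOUR_REGISTERS = 4 * 2
```

Here `insn_length_bits(0x0013)` classifies the low parcel of a standard 32-bit instruction, and `prefix_payload` only makes sense on a parcel for which `is_prefix_candidate` is true. How the payload would actually be divided between register extension, VL, element width and predication is exactly the open question the update discusses, so the sketch leaves it undivided.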