We are delighted to be able to announce additional sponsorship by
[Purism](http://puri.sm), through [NLnet](http://nlnet.nl).

### Purism Sponsorship

As a social purpose corporation, Purism is empowered to balance
ethics, social enterprise, and profitable business. I am delighted that
they chose to fund the Libre RISC-V hybrid CPU/GPU through the
[NLnet Foundation](https://nlnet.nl/PET). Their donation provides us with
some extra flexibility in how we reach the goal of bringing to market a
hybrid CPU, VPU, and GPU that is libre to the bedrock.

Purism started with a [Crowd Supply
campaign](https://www.crowdsupply.com/purism/librem-15) to deliver a
modern laptop with full software support and a [coreboot
BIOS](https://puri.sm/coreboot/). I know that, after this initial
success, they worked hard to try to solve the "NSA backdoor
co-processor" issue, known as the ["Management
Engine"](https://libreboot.org/faq.html#intelme). Ironically, those
internal efforts became moot when a third party reverse engineered an
Intel BIOS and discovered the
[`nsa_me_off_switch`](https://it.slashdot.org/story/17/08/29/2239231/researchers-find-a-way-to-disable-intel-me-component-courtesy-of-the-nsa)
parameter, designed to be used by the NSA when Intel equipment is
deployed within NSA premises.

Purism then moved quickly to provide a BIOS update to disable this
"feature," eliminating the last and most important barrier to being
able to declare a fully privacy-respecting software stack.

Purism deserves our respect and gratitude for this kind of brave and
strategic decision-making, bucking the trend towards privacy-invading
hardware "by default."

However, just as NLnet recognises, Purism also appreciates that we
cannot stop at just the software. Profit-maximising corporations
simply do not take the brave decisions that can compromise profits,
particularly when faced with competition: the risk is considered too
great. This is why being a Social Purpose Corporation is so critically
important: socially-responsible decisions do not get undermined by
profit-maximisation.

So, we are extremely grateful for their donation, managed through
NLnet.

### Progress

So much has happened since the last update that it is hard to know
where to begin.

* The IEEE754 FPU has a simulation-proven FADD pipeline, and FMUL,
  FDIV, FSQRT and FCVT are on the way.
* A RISC-V Reciprocal Square Root FP Opcode has been proposed, which is
  needed for 3D operations, particularly normalisation of vectors (see
  the sketch after this list). With other RISC-V implementors needing
  this opcode, it makes sense for it to be a standard extension.
* The SimpleV extension has had a major overhaul, with the addition of a
  single-instruction prefix (P32C, P48 and P64), and a "VBLOCK" format that
  adds vectorisation context to a batch of instructions.
* Implementation of the precise-augmented 6600-style scoreboard system has
  begun, with ALU register hazards and shadowing already completed, and
  memory hazards underway.
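
To give a feel for why a dedicated reciprocal square root operation
matters for 3D: normalising a vector divides every component by the
vector's length, and with an rsqrt result the three divides become
three multiplies. A minimal Python sketch of the idea (purely
illustrative, not the proposed opcode's exact semantics):

    import math

    def normalise(v):
        """Normalise a 3D vector via reciprocal square root.

        One rsqrt plus three multiplies replaces one sqrt plus three
        divides, which is why an FRSQRT-style opcode helps 3D
        workloads such as vector normalisation.
        """
        x, y, z = v
        rlen = 1.0 / math.sqrt(x*x + y*y + z*z)  # the rsqrt step
        return (x * rlen, y * rlen, z * rlen)

    print(normalise((3.0, 0.0, 4.0)))  # (0.6, 0.0, 0.8)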

### Multi-issue

Multi-issue is absolutely critical for this CPU/VPU/GPU because the
[SimpleV](https://libre-riscv.org/simple_v_extension/specification)
engine relies on being able to turn one "vector" operation into
multiple "scalar element" instructions, in every cycle. The simplest
way to do this is to throw the equivalent scalar opcodes into a
multi-issue execution engine and let the engine sort it out.

Regarding the dependency matrices: thanks to Mitch Alsup's absolutely
invaluable input, we now know how to do multi-issue. On top of a precise
6600-style dependency matrix it is almost comically trivial.

The key insight Mitch gave us was that instruction dependencies are
transitive. In other words, if there are four instructions to be
issued, the second instruction may have the dependencies of the first
added to it, the third may accumulate the dependencies of the first and
second, and so on.

Where this trick does not work well (or takes significant hardware to
implement) is when, for example with the Tomasulo Algorithm (or the
original 6600 Q-Table), the register dependency hazards are expressed
in *binary* (r5 = 0b00101, r3 = 0b00011). If instead the registers are
expressed in *unary* (r5 = 0b00010000, r3 = 0b00000100) then it should
be pretty obvious that, in a multi-issue design, all that is needed in
each clock cycle is to OR the cumulative register dependencies in a
cascading fashion. Aside from now also needing to increase the number
of register ports and other resources to cope with the increased
workload, amazingly that's all it takes!
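
As a rough illustration (a simplified software model, not the actual
hardware), here each register is held as a unary bitmask and each
issue slot in a group simply inherits the OR of the masks of the slots
before it, which is exactly the transitive accumulation described
above:

    def unary(reg):
        """One-hot encoding of a register number: r5 -> bit 5 set."""
        return 1 << reg

    def accumulate_deps(group):
        """Cascade register dependencies across a multi-issue group.

        Each instruction is (dest, src1, src2).  Issue slot N inherits
        the OR of the register masks of slots 0..N-1, making the
        dependencies transitive across the whole group.
        """
        cumulative = 0
        inherited = []
        for dest, src1, src2 in group:
            inherited.append(cumulative)   # deps carried over from earlier slots
            cumulative |= unary(dest) | unary(src1) | unary(src2)
        return inherited

    # four instructions issued in the same cycle: (rd, rs1, rs2)
    group = [(1, 2, 3), (4, 1, 5), (6, 4, 2), (7, 6, 1)]
    for slot, mask in enumerate(accumulate_deps(group)):
        print(f"slot {slot}: inherited register mask {mask:#012b}")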

To achieve the same trick with a Tomasulo reorder buffer (ROB)
requires the addition of an entire extra CAM for each additional
instruction issued per cycle: four-way multi-issue would require four
ROB CAMs! The power consumption and gate count would be prohibitively
expensive, and resolving the commits of multiple parallel operations
is also fraught with difficulty.

### SimpleV

What began ironically as "simple" still bears some vestige of its
original name, in that the ISA needs no new opcodes: any scalar RISC-V
implementation may be turned parallel through the addition of SV at
the instruction issue phase.

However, one of the major drawbacks of the initial draft spec was that
the use of CSRs took a huge number of instructions just to set up and
then tear down the vectorisation context. This had to be fixed.

The idea that came to mind was to embed RISC-V opcodes within a
longer, variable-length encoding, which we've called the [VBLOCK
format](https://libre-riscv.org/simple_v_extension/vblock_format/).
At the beginning of this new format, the vectorisation and predication
context can be embedded, "changing" the standard *scalar* opcodes into
"parallel" (multi-issue) operations.

The advantage of this approach is that, firstly, the context is much
smaller: the actual CSR opcodes are gone, leaving only the "data,"
which is now batched together. Secondly, there is no need to "reset"
(tear down) the vectorisation context, because it automatically
expires when the VBLOCK ends.
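
Very roughly, and glossing over the actual bit-level encoding (the
VBLOCK spec linked above is the authoritative reference), a block can
be pictured as a small header carrying the vector and predication
context, followed by ordinary scalar instructions. A conceptual
Python sketch, with hypothetical field names chosen purely for
illustration:

    from dataclasses import dataclass, field

    @dataclass
    class VBlock:
        """Conceptual (not bit-accurate) view of a VBLOCK.

        The vectorisation/predication context is stated once, up
        front, and applies to every scalar instruction in the batch;
        when the block ends the context simply ceases to exist, so no
        teardown instructions are needed.
        """
        vector_regs: dict = field(default_factory=dict)     # reg name -> element width
        predicate_regs: dict = field(default_factory=dict)  # vector reg -> predicate reg
        instructions: list = field(default_factory=list)    # plain scalar RV opcodes

    # mark x10/x11 as 32-bit-element vectors, then issue one "scalar"
    # add which the hardware expands into VL element-level adds
    blk = VBlock(vector_regs={"x10": 32, "x11": 32},
                 instructions=["add x10, x10, x11"])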

The other issue that needed to be fixed is that we really need a
[SETVL](https://libre-riscv.org/simple_v_extension/specification/sv.setvl/)
instruction. This is unfortunate, as it breaks the "no new opcodes"
paradigm. There are, however, two mitigating factors. Firstly, we are
simply going to reuse the RVV SETVL opcode, now that RVV has reached
its last anticipated draft before ratification. Secondly, it is not an
*actual* element-related instruction (it does not perform a parallel
add, for example): it is more of an "infrastructure support"
instruction.

The reason for needing SETVL is complex. It comes down to the fact
that, unlike in RVV, the maximum vector length (MVL) is **not** a hard
architectural design parameter: it is a runtime dynamic one. Thus it
is absolutely crucial that not only VL but also MVL is set on every
loop (or SV prefix instruction).

This means SV needs two additional instructions for any algorithm when
compared to RVV, and that kind of penalty is just not acceptable. The
solution therefore was to create a special SV.SETVL opcode that always
takes MVL as an *additional* parameter over and above those provided
to the RVV equivalent opcode. That basically puts SV on par with RVV
as far as instruction count is concerned.
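
In rough pseudocode (a simplification for illustration, not a
definitive statement of the spec linked above), the combined
instruction behaves along these lines:

    def sv_setvl(requested_vl, mvl):
        """Rough model of SV.SETVL.

        MVL is supplied as an extra parameter because, unlike in RVV,
        it is a runtime value rather than a fixed architectural
        constant; VL is then capped at MVL, just as with the RVV
        equivalent.
        """
        MVL = mvl                    # set the maximum vector length
        VL = min(requested_vl, MVL)  # set the dynamic vector length
        return VL                    # written back for the loop to use

    remaining = 100                  # elements still left to process
    vl = sv_setvl(remaining, 8)      # one instruction sets both MVL and VL
    print(vl)                        # 8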

### Fail on First

The other really nice addition, which came with a small reorganisation
of the vector and predicate contexts, is data-dependent "[fail on
first](https://libre-riscv.org/simple_v_extension/appendix/#ffirst)."

ARM's SVE, RVV, and the Mill architecture all have an incredibly neat
feature where, if data is being loaded from memory in parallel and the
LD operations run off the end of a page boundary, this may be detected
and the *legal* parallel operations may complete, all without needing
to drop into "scalar" mode.

In the case of the Mill architecture, this is achieved through the
extremely innovative feature of simply marking the result of the
operation as "invalid," and that "tag" cascades through all subsequent
operations. Thus, any attempts to ADD or STORE the data will result in
the invalid data being simply ignored.

RVV instead detects the point at which the LD became invalid, "fails"
at the "first" such illegal memory access, and truncates all subsequent
vector operations to within that limit, by *changing VL*. This is an
extremely effective and very simple idea, well worth adding to SV.
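
A toy software model of the load version (illustrative only, and not
the SV or RVV encoding): the vector of loads stops at the first
element that would fault, and VL is truncated so that subsequent
vector instructions only touch the elements that were successfully
loaded:

    def vector_load_ffirst(mem, base, vl):
        """Toy model of a fail-on-first vector load.

        `mem` stands in for accessible memory; stepping outside it
        plays the role of crossing into an unmapped page.  The first
        element is required to succeed (a real fault is raised there);
        for any later element the fault is suppressed and VL is
        truncated instead.
        """
        results = []
        for i in range(vl):
            addr = base + i
            if addr not in mem:           # this element would fault
                if i == 0:
                    raise MemoryError(f"fault at address {addr}")
                return results, i         # truncate VL to i
            results.append(mem[addr])
        return results, vl

    mem = {a: a * 10 for a in range(6)}   # only addresses 0..5 are mapped
    data, new_vl = vector_load_ffirst(mem, 3, 8)
    print(data, new_vl)                   # [30, 40, 50] 3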

However, when doing so, an idea sprang to mind: why not extend the
"fail on first" concept to cover not just LD/ST operations but actual
ALU operations as well? Why not, if any of the results from a sequence
of parallel operations is zero ("fail"), similarly truncate VL?

This idea was tested out on strncpy (the canonical function used to
test out data-dependent ISA concepts), and it worked! So, that is
going into SV as well. It does mean that after every ALU operation a
comparator against zero will optionally be activated: given that it is
optional, and only enabled when the ffirst mode is set, it is not a
power penalty on every single instruction.
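
A toy model of the data-dependent variant applied to strncpy (again
purely illustrative; the real semantics, including whether the zero
element itself is kept, are defined in the SV appendix linked above):
VL is truncated at the first zero result, so the copy naturally stops
at the string terminator without dropping into a scalar tail loop:

    def strncpy_ffirst(src, n, mvl=8):
        """Toy model of strncpy using data-dependent fail-on-first.

        Each pass copies up to MVL bytes; a comparator against zero
        truncates VL at the first NUL, so nothing past the terminator
        is processed and the loop exits early.
        """
        dst = []
        i = 0
        while i < n:
            vl = min(mvl, n - i)             # sv.setvl for this batch
            for byte in src[i:i + vl]:       # element-level operations
                dst.append(byte)
                if byte == 0:                # ffirst: truncate VL here
                    return bytes(dst)
            i += vl
        return bytes(dst)

    print(strncpy_ffirst(b"hello\x00padding", 16))  # b'hello\x00'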

### Summary

There is so much to do, and so much that has already been achieved,
that it is almost overwhelming. We cannot lose sight of the fact that
there is an enormous amount we do not yet know, yet at the same time
we never let that stop us from moving forward. A journey starts with a
first step, and continues with each step.

With help from NLnet and companies like Purism, we can look forward to
actually paying people to contribute to solving what was formerly
considered an impossible task.

It is worth emphasising: for any individual or corporation wishing to
see this project succeed (so that you can use it as the basis for one
of your products, for example), donations made through NLnet, a
registered charitable foundation, are tax deductible.

Likewise, for anyone who would like to help with the project's
milestones, payments from NLnet are *donations* and, depending on
jurisdiction, may also be tax deductible (i.e., not classed as
"earnings"). If you are interested in learning more, do get in touch.