add purism draft update

author Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Tue, 16 Jul 2019 10:22:13 +0000 (11:22 +0100)

committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Tue, 16 Jul 2019 10:22:13 +0000 (11:22 +0100)
diff --git a/updates/019_2019jul16_purism_donation.mdwn b/updates/019_2019jul16_purism_donation.mdwn

new file mode 100644 (file)

index 0000000..45bb7d9
--- /dev/null
+++ b/updates/019_2019jul16_purism_donation.mdwn
@@ -0,0 +1,196 @@
+**DRAFT STATUS. last edit 16jul2019**
+
+We are delighted to be able to announce additional sponsorship by
+[Purism](http://puri.sm), through [NLnet](http://nlnet.nl).
+
+# Purism Sponsorship
+
+As a Benefit Corporation, Purism is empowered to balance ethics, social
+enterprise and profitable business. I am delighted that they chose to
+fund the Libre RISC-V hybrid CPU/GPU through the NLnet Foundation. Their
+donation gives us some extra flexibility in how we reach the goal of
+bringing to market a hybrid CPU, VPU and GPU that is libre to the bedrock.
+
+Purism started with a Crowd Supply campaign to deliver a modern laptop
+with full software support and a coreboot BIOS.  I know that, after
+this initial success, they worked hard to try to solve the "NSA backdoor
+coprocessor" issue, known as the "Management Engine". Ironically, inspired
+by Purism, Intel's internal efforts became moot, as a 3rd party reverse
+engineered an Intel BIOS and discovered the "nsa\_me\_off\_switch" parameter,
+designed to be used by the NSA when Intel equipment is deployed within
+NSA premises.
+
+Purism then moved quickly to provide a BIOS update to disable this
+"feature", eliminating the last and most important barrier to being able
+to declare a full privacy software stack.
+
+It is these kinds of brave strategic decisions to buck the trend towards
+privacy-invading hardware "by default" for which Purism deserves our
+respect and gratitude.
+
+However, just as NLnet recognise, Purism also appreciate that we cannot
+stop at just the software. Profit-maximising Corporations just do not
+take the brave decisions that can compromise profits, particularly when
+faced with competition: the risk is too great.
+
+So we are extremely grateful for their donation, managed through NLnet,
+the Charitable Foundation.
+
+# Progress
+
+So much has happened since the last update that it is hard to know
+where to begin.
+
+* The IEEE754 FPU has a simulation-proven FADD pipeline, and FMUL,
+  FDIV, FSQRT and FCVT are on the way.
+* A RISC-V Reciprocal Square Root FP Opcode has been proposed, which is
+  needed for 3D operations, particularly normalisation of vectors.  With
+  other RISC-V implementors needing this opcode it makes sense for it
+  to be a Standard Extension.
+* The SimpleV extension has had a major overhaul, with the addition of a
+  single-instruction prefix (P32C, P48 and P64), and a "VBLOCK" format that
+  adds Vectorisation Context to a batch of instructions.
+* Implementation of the precise-augmented 6600 style scoreboard system has
+  begun, with ALU register hazards and shadowing already completed, and 
+  memory hazards underway.
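To illustrate why a Reciprocal Square Root opcode matters for 3D work, here is a minimal Python sketch of vector normalisation. The `rsqrt` and `normalise` helpers are hypothetical names used only for illustration; `rsqrt` simply stands in for what the proposed opcode would compute in hardware.

```python
import math
from typing import List


def rsqrt(x: float) -> float:
    """Reciprocal square root, 1/sqrt(x): stand-in for the proposed opcode."""
    return 1.0 / math.sqrt(x)


def normalise(v: List[float]) -> List[float]:
    """Scale v to unit length using one rsqrt and one multiply per element."""
    scale = rsqrt(sum(c * c for c in v))
    return [c * scale for c in v]


unit = normalise([3.0, 4.0, 0.0])
```

With a single rsqrt, normalisation needs only multiplies per element, rather than a square root followed by a division.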
+
+# Multi Issue
+
+Multi Issue is absolutely critical for this CPU/VPU/GPU because the
+[SimpleV](https://libre-riscv.org/simple_v_extension/specification)
+engine critically relies on being able to turn one "vector"
+operation into multiple "scalar element" instructions, in every cycle. The
+simplest way to do this is to throw equivalent scalar opcodes into a
+multi issue execution engine, and let the engine sort it out.
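The expansion described above can be sketched in Python. This is a sketch under assumptions rather than the actual design: the `ScalarOp` tuple, the `expand_vector_op` and `issue` helpers, and the layout where element i lives in register (base + i) are all hypothetical.

```python
from typing import List, NamedTuple


class ScalarOp(NamedTuple):
    """One scalar instruction: opcode plus register numbers (hypothetical)."""
    opcode: str
    rd: int
    rs1: int
    rs2: int


def expand_vector_op(op: ScalarOp, vl: int) -> List[ScalarOp]:
    """Turn one "vector" op into vl scalar element ops.

    Assumption: element i of each operand lives in register (base + i).
    """
    return [ScalarOp(op.opcode, op.rd + i, op.rs1 + i, op.rs2 + i)
            for i in range(vl)]


def issue(ops: List[ScalarOp], width: int) -> List[List[ScalarOp]]:
    """Group the element ops into batches of at most `width` per cycle."""
    return [ops[i:i + width] for i in range(0, len(ops), width)]


element_ops = expand_vector_op(ScalarOp("add", rd=8, rs1=16, rs2=24), vl=4)
cycles = issue(element_ops, width=2)
```

Here a vector length of 4 on a 2-way issue engine takes two cycles, each handing two ordinary scalar adds to the execution engine.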
+
+So, regarding the Dependency Matrices: thanks to Mitch Alsup's absolutely
+invaluable input we now know how to do multi-issue. On top of a precise
+6600 style Dependency Matrix it is almost comically trivial.
+
+The key insight that Mitch gave us was that instruction dependencies are
+transitive. In other words: if there are 4 instructions to be issued,
+the second instruction may have the dependencies of the first added to it;
+the 3rd may accumulate the dependencies of the first and second and so on.
+
+Where this trick does not work well (or takes significant hardware to
+implement) is when, for example with the Tomasulo Algorithm (or the
+original 6600 Q-Table), the Register Dependency Hazards are expressed
+in *binary* (r5 = 0b00101, r3 = 0b00011). If instead the registers are
+expressed in *unary* (r5 = 0b00100000, r3 = 0b00001000) then it should
+be pretty obvious that in a multi issue design, all that is needed in
+each clock cycle is to OR the cumulative register dependencies in a
+cascading fashion. Aside from now also needing to increase the number of
+register ports and other resources to cope with the increased workload,
+amazingly that's all it takes!
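The cascading OR can be sketched in a few lines of Python. This models only the accumulation step, not the Dependency Matrix itself; the `unary` and `cascade` helpers are hypothetical, as is the choice of bit n for register n.

```python
from functools import reduce
from itertools import accumulate
from operator import or_
from typing import Iterable, List


def unary(regs: Iterable[int]) -> int:
    """One bit per register: register n sets bit n (numbering is an assumption)."""
    return reduce(or_, (1 << r for r in regs), 0)


def cascade(own_deps: List[int]) -> List[int]:
    """Instruction i receives its own bits ORed with those of every
    earlier instruction in the same issue batch."""
    return list(accumulate(own_deps, or_))


# Four instructions issued together, each listed as the registers it touches.
batch = [unary([3, 1, 2]), unary([5, 3, 4]), unary([7, 6, 6]), unary([9, 8, 5])]
issued = cascade(batch)
```

In binary form the same accumulation would need a decoder and comparators per register number, whereas in unary it is one OR gate per bit per issue slot.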
+
+To achieve the same trick with a Tomasulo Reorder Buffer (ROB) requires
+the addition of an entire extra CAM for every extra issue to be added to
+the architecture: four way multi issue would require four ROB CAMs! The
+power consumption and gate count would be prohibitively expensive,
+and resolving the commits of multiple parallel operations is also fraught.
+
+# SimpleV
+
+What began ironically as "simple" still bears some vestige of its
+original name, in that the ISA needs no new opcodes: any scalar RISC-V
+implementation may be turned parallel through the addition of SV at the
+instruction issue phase.
+
+However, one of the major drawbacks of the initial draft spec was that
+the use of CSRs took a huge number of instructions just to set up and
+then tear down the vectorisation context.
+
+This had to be solved.
+
+The idea which came to mind was to embed RISC-V opcodes within
+a longer, variable-length encoding, which we've called the
+[VBLOCK Format](https://libre-riscv.org/simple_v_extension/vblock_format/).
+At the beginning of this new format, the vectorisation and predication
+context could be embedded, which "changes" the standard *scalar* opcodes
+to become "parallel" (multi-issue) operations.
+
+The advantage of this approach is that, firstly, the context is much
+smaller: the actual CSR opcodes are gone, leaving only the "data",
+which is now batched together. Secondly, there is no need to "reset"
+(tear down) the vectorisation context, because that automatically goes
+when the long-format ends.
+
+The other issue that needed to be fixed is that we really need a
+[SETVL](https://libre-riscv.org/simple_v_extension/specification/sv.setvl/)
+instruction. This is really unfortunate as it breaks the "no new opcodes"
+paradigm.  However, what we are going to do is simply to reuse the RVV
+SETVL opcode, now that RVV has reached its last anticipated draft before
+ratification.  In addition, it's not an *actual* instruction related to
+elements (it doesn't perform a parallel add, for example).  It's more an
+"infrastructure support" instruction.
+
+The reason for needing SETVL is complex. It is down to the fact that,
+unlike in RVV, the Maximum Vector Length is **not** an architectural hard
+design parameter: it is a runtime dynamic one. Thus, it is absolutely
+crucial that not only is VL set on every loop (or SV Prefix instruction),
+but that MVL is set as well.
+
+This means that SV has two additional instructions for any algorithm,
+when compared to RVV, and this kind of penalty is just not acceptable. The
+solution therefore was to create a special SV.SETVL opcode that always
+takes the MVL as an extra parameter over and above those
+provided to the RVV equivalent opcode. That basically puts SV on par with
+RVV as far as instruction count is concerned.
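A rough Python model of the idea, under stated assumptions: `sv_setvl` and `vector_add` are hypothetical helpers, and clamping VL to MVL is assumed here rather than taken from the SV.SETVL specification.

```python
from typing import List, Tuple


def sv_setvl(requested: int, mvl: int) -> Tuple[int, int]:
    """Hypothetical model: set MVL and VL together in one operation.

    Assumption: VL is clamped to MVL.
    """
    vl = min(requested, mvl)
    return mvl, vl


def vector_add(a: List[int], b: List[int], mvl: int) -> Tuple[List[int], int]:
    """Strip-mined add; returns the result and the number of SETVL calls."""
    out: List[int] = []
    setvl_calls = 0
    remaining = len(a)
    while remaining > 0:
        _, vl = sv_setvl(remaining, mvl)  # one call sets both MVL and VL
        setvl_calls += 1
        start = len(out)
        out.extend(x + y for x, y in zip(a[start:start + vl],
                                         b[start:start + vl]))
        remaining -= vl
    return out, setvl_calls


result, calls = vector_add(list(range(10)), [1] * 10, mvl=4)
```

Because MVL travels with the same call that sets VL, the loop issues one setup operation per iteration rather than separate ones for MVL and VL.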
+
+# Fail on First
+
+The other really nice addition, which came with a small reorganisation
+of the Vector and Predicate Contexts, is data dependent
+["fail on first"](https://libre-riscv.org/simple_v_extension/appendix/#ffirst).
+
+ARM's SVE, RVV, and the Mill Architecture all have an incredibly neat
+feature where if data is being loaded from memory in parallel, and the
+LD operations run off the end of a page boundary, this may be detected
+and the *legal* parallel operations may complete, all without needing
+to drop into "scalar" mode.
+
+In the case of the Mill Architecture, this is achieved through the
+extremely innovative feature of simply marking the result of the
+operation as "invalid", and that "tag" cascades through all subsequent
+operations. Thus, any attempts to ADD or STORE the data will result in
+the invalid data being simply ignored.
+
+RVV instead detects the point at which the LD became invalid, "fails"
+at the "first" such illegal memory access, and truncates all subsequent
+vector operations to within that limit, by *changing VL*. This is an
+extremely effective and very simple idea, and it was well worth adding to SV.
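The VL-truncating behaviour can be sketched as follows. This is an illustrative model only: `ld_ffirst` is a hypothetical helper, memory is a plain dictionary, and an address missing from it stands in for an illegal access.

```python
from typing import Dict, List, Tuple


def ld_ffirst(memory: Dict[int, int], base: int,
              vl: int) -> Tuple[List[int], int]:
    """Load up to vl elements, stopping at the first illegal address.

    Returns the loaded elements and the (possibly truncated) VL.
    Assumption: a fault on the very first element is a genuine trap.
    """
    loaded: List[int] = []
    for i in range(vl):
        addr = base + i
        if addr not in memory:
            if i == 0:
                raise MemoryError(f"fault on first element at {addr:#x}")
            return loaded, i  # truncate VL to the legal elements only
        loaded.append(memory[addr])
    return loaded, vl


mapped = {100 + i: i * i for i in range(6)}  # only six addresses are legal
data, new_vl = ld_ffirst(mapped, base=100, vl=8)
```

Subsequent operations then run with the reduced VL, so the loop never has to fall back to element-by-element scalar code.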
+
+However, when doing so, the idea sprang to mind: why not extend the
+"fail on first" concept to not just cover LD/ST operations, but to cover
+actual ALU operations as well? Why not, if any of the results from
+a sequence of parallel operations is zero ("fail"), similarly truncate VL?
+
+This idea was tested out on strncpy (the canonical function
+used to test out data-dependent ISA concepts), and it worked! So, that
+is going into SV as well. It does mean that after every ALU operation,
+a comparator against zero will be optionally activated: given that it
+is optional and under the control of the ffirst bit, it is not a power
+penalty on every single instruction.
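A Python sketch of how data-dependent fail-on-first might drive a strncpy-style loop. Everything here is an assumption made for illustration: the `alu_ffirst` and `strncpy_sketch` helpers are hypothetical, VL is taken to be truncated to the elements *before* the first zero result, and the zero padding that C's strncpy performs is omitted.

```python
from typing import List, Tuple


def alu_ffirst(values: List[int]) -> Tuple[List[int], int]:
    """Model of an element-wise move with data-dependent fail-on-first.

    Returns the results and the new VL. Assumption: VL is truncated to
    the number of elements before the first zero result.
    """
    for i, v in enumerate(values):
        if v == 0:
            return values[:i], i
    return values, len(values)


def strncpy_sketch(src: bytes, n: int, mvl: int = 4) -> bytes:
    """Copy at most n bytes from src, stopping after the first NUL byte."""
    dst = bytearray()
    pos = 0
    while pos < n and pos < len(src):
        vl = min(mvl, n - pos, len(src) - pos)
        chunk = list(src[pos:pos + vl])      # parallel load of vl bytes
        results, new_vl = alu_ffirst(chunk)  # ffirst truncates VL at a zero
        dst.extend(results)                  # parallel store of new_vl bytes
        pos += new_vl
        if new_vl < vl:                      # a zero was hit: write it, stop
            dst.append(0)
            break
    return bytes(dst)


copied = strncpy_sketch(b"hello\0world", 16)
```

The loop body contains no per-byte comparison: the zero test rides along with the ALU operation and simply shortens VL.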
+
+# Summary
+
+There is so much to do, and so much that has already been achieved,
+it is almost overwhelming. We cannot lose sight of the fact that
+there is an enormous amount that we do not yet know, yet at the same
+time we must never let that stop us from moving forward. A journey starts with
+a first step, and continues with each step.
+
+With help from NLnet and companies like Purism we can look forward
+to actually paying people to contribute to solving what was formerly
+considered an impossible task.
+
+It is worthwhile emphasising: for any individual or Corporation wishing to
+see this project succeed (so that you can use it as the basis for one
+of your products, for example), donations through NLnet, as a Registered
+Charitable Foundation, are tax deductible.
+
+Likewise, for anyone who would like to help with the project's Milestones,
+payments from NLnet are *donations*, and, depending on jurisdiction,
+may also be tax deductible.  If you are interested in learning more, do
+get in touch.
+