* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
plus the HI half of A0 times B0.
+* R2 contains C2 plus the LO half of A2 times B0
+ plus the HI half of A1 times B0.
This would on the face of it be a 4-in operation:
the upper half of a previous multiply, two new operands
to multiply, and an additional accumulator (C). However, if
C is left out (and added afterwards with a Vector-Add),
things become more manageable.
+
+We therefore propose a 3-in, 2-out operation. Because
+successive mul-adds are linked by the UPPER half of the
+previous product, the operation writes the UPPER half of
+the current product to a second output register, where
+the next mul-add picks it up as an input.
+
+ product = RA*RB+RC
+ RT = lowerhalf(product)
+ RC = upperhalf(product)
+
+Successive iterations effectively use RC as a 64-bit carry.
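
The carry chain above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: registers are taken to be 64 bits wide, the vector A is held as little-endian 64-bit limbs, and the names `muladd_3in_2out`, `bigmul_by_scalar` and `limbs_to_int` are hypothetical helpers invented for illustration, not part of any proposed instruction set. As in the text, C is left out and would be added afterwards with a separate Vector-Add.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit registers


def muladd_3in_2out(ra, rb, rc):
    """Model of the proposed operation: product = RA*RB+RC.

    Returns (RT, RC) = (lower half, upper half) of the product.
    """
    # The largest possible value is (2**64-1)**2 + (2**64-1) = 2**128 - 2**64,
    # so the product always fits in 128 bits and no carry is lost.
    product = ra * rb + rc
    return product & MASK64, product >> 64


def bigmul_by_scalar(a_limbs, b0):
    """Multiply a little-endian list of 64-bit limbs by the 64-bit scalar b0."""
    rc = 0  # RC starts at zero and acts as the 64-bit carry between iterations
    result = []
    for ra in a_limbs:
        rt, rc = muladd_3in_2out(ra, b0, rc)
        result.append(rt)
    result.append(rc)  # the final upper half becomes the top limb
    return result


def limbs_to_int(limbs):
    """Reassemble little-endian 64-bit limbs into one Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

Comparing `limbs_to_int(bigmul_by_scalar(a, b0))` against `limbs_to_int(a) * b0` for a few inputs is a quick way to check that passing the upper half forward in RC reproduces the full big-integer product.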