bringing parallelised opcodes down to 32-bit and having the benefit of
being explicit.*
+## Branch Instruction:
+
+[[!table data="""
+31 | 30 .. 25 | 24 .. 20 | 19 .. 15 | 14 .. 12 | 11 .. 8 | 7 | 6 .. 0 |
+imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+I/F | reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
+0 | reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
+0 | reserved | src2 | src1 | 001 | predicate rs3 || BNE |
+0 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+0 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+0 | reserved | src2 | src1 | 100 | predicate rs3 || BLT |
+0 | reserved | src2 | src1 | 101 | predicate rs3 || BGE |
+0 | reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
+0 | reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
+1 | reserved | src2 | src1 | 000 | predicate rs3 || FEQ |
+1 | reserved | src2 | src1 | 001 | predicate rs3 || FNE |
+1 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 100 | predicate rs3 || FLT |
+1 | reserved | src2 | src1 | 101 | predicate rs3 || FLE |
+1 | reserved | src2 | src1 | 110 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 111 | predicate rs3 || rsvd |
+"""]]
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
+                          s2 ? vreg[rs2][i] : sreg[rs2]);
+
+With associated predication, vector-length adjustments and so on,
+and temporarily ignoring bitwidth (which makes the comparisons more
+complex), this becomes:
+
+    if I/F == INT: # integer type cmp
+        pred_enabled = int_pred_enabled # TODO: exception if not set!
+        preg = int_pred_reg[rd]
+    else:
+        pred_enabled = fp_pred_enabled # TODO: exception if not set!
+        preg = fp_pred_reg[rd]
+
+    s1 = CSRvectorlen[src1] > 1;
+    s2 = CSRvectorlen[src2] > 1;
+    for (int i=0; i<vl; ++i)
+        preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
+                           s2 ? reg[src2+i] : reg[src2]);
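
The loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: the register file is modelled as a flat list, the predicate register as an integer bitfield, and the names `vector_compare`, `regfile` and `vector_len` are hypothetical (`vector_len` stands in for `CSRvectorlen`). Bitwidth and the predicate-enable checks are ignored, as in the pseudocode.

```python
import operator

def vector_compare(cmp, regfile, vector_len, src1, src2, vl):
    """Return a predicate bitfield: bit i holds cmp() for element i.

    regfile    -- flat list standing in for the int or FP register file
    vector_len -- dict mapping register number to its vector length
                  (hypothetical stand-in for CSRvectorlen)
    """
    # A source is treated as a vector only if its length exceeds 1;
    # otherwise the same scalar register is reused for every element.
    s1 = vector_len.get(src1, 1) > 1
    s2 = vector_len.get(src2, 1) > 1
    pred = 0
    for i in range(vl):
        a = regfile[src1 + i] if s1 else regfile[src1]
        b = regfile[src2 + i] if s2 else regfile[src2]
        if cmp(a, b):
            pred |= 1 << i
    return pred

regs = [0] * 32
regs[8:12] = [1, 5, 3, 7]   # four-element vector starting at register 8
regs[4] = 4                  # scalar in register 4

# "less than": vector (register 8 onwards) against scalar (register 4)
lt_mask = vector_compare(operator.lt, regs, {8: 4}, src1=8, src2=4, vl=4)
# same comparator with the operands swapped, which yields "greater than"
gt_mask = vector_compare(operator.lt, regs, {8: 4}, src1=4, src2=8, vl=4)
print(bin(lt_mask), bin(gt_mask))
```

Elements 0 and 2 (values 1 and 3) are below 4, so `lt_mask` has bits 0 and 2 set; swapping the operands sets bits 1 and 3 instead.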
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+ into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+ Reordering"), setting Vector-Length * (number of SIMD elements) bits
+ in Predicate Register rs3 as opposed to just Vector-Length bits.
+* Predicated Branches do not actually adjust the Program Counter,
+ so bits 25 through 30 are unused in every case.
+* There are plenty of reserved opcodes for which bits 25 through 30 could
+ be put to good use if there is a suitable use-case.
+* FEQ and FNE (and BEQ and BNE) are included in order to save one
+ instruction having to invert the resultant predicate bitfield.
+ FLT and FLE may be inverted to FGT and FGE if needed by swapping
+ src1 and src2 (likewise the integer counterparts).
+
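To make the field layout in the table above concrete, here is a minimal Python sketch that packs and unpacks a 32-bit predicated branch word. The function names are hypothetical. It assumes the standard RISC-V BRANCH major opcode (`1100011`) is reused, and it leaves the reserved bits 30..25 and bit 7 at zero, which the table does not specify.

```python
BRANCH_OPCODE = 0b1100011  # standard RISC-V BRANCH major opcode (assumed reused)

def encode_pred_branch(is_fp, src2, src1, funct3, pred, opcode=BRANCH_OPCODE):
    """Pack the table's fields into a 32-bit word (reserved bits left at 0)."""
    return ((is_fp & 0x1) << 31      # bit 31: I/F selector
            | (src2 & 0x1F) << 20    # bits 24..20
            | (src1 & 0x1F) << 15    # bits 19..15
            | (funct3 & 0x7) << 12   # bits 14..12: comparator
            | (pred & 0xF) << 8      # bits 11..8: predicate rs3
            | (opcode & 0x7F))       # bits 6..0

def decode_pred_branch(insn):
    """Split a 32-bit word back into the same fields."""
    return {
        "is_fp":  (insn >> 31) & 0x1,
        "src2":   (insn >> 20) & 0x1F,
        "src1":   (insn >> 15) & 0x1F,
        "funct3": (insn >> 12) & 0x7,
        "pred":   (insn >> 8) & 0xF,
        "opcode": insn & 0x7F,
    }

# Floating-point compare with funct3 = 100, src1 = 9, src2 = 3, predicate 5
word = encode_pred_branch(is_fp=1, src2=3, src1=9, funct3=0b100, pred=5)
fields = decode_pred_branch(word)
print(fields)
```

Swapping the `src1` and `src2` arguments to the encoder is all that is needed to obtain the inverted comparison described in the last note.
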
+## Compressed Branch Instruction:
+
+[[!table data="""
+15..13 | 12..10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
+funct3 | imm | rs1' | imm | | op | |
+3 | 3 | 3 | 2 | 3 | 2 | |
+C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
+110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
+111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
+110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
+111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
+"""]]
+
+
+Notes:
+
+* Bits 5, 13, 14 and 15 make up the comparator type.
+* In both floating-point and integer cases there are four predication
+ comparators: EQ/NE/LT/LE (with GT and GE being synthesised by swapping
+ src1 and src2).
+
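The compressed encoding can be sketched the same way. The Python below is an illustration only: `decode_c_pred_branch` and `COMPARATORS` are hypothetical names, and the 3-bit register fields are returned raw, since the table does not say whether they map onto a register subset as in standard RVC.

```python
# (funct3, B) -> comparator name, taken from the table above
COMPARATORS = {
    (0b110, 0): "P.EQ",
    (0b111, 0): "P.NE",
    (0b110, 1): "P.LT",
    (0b111, 1): "P.LE",
}

def decode_c_pred_branch(insn):
    """Split a 16-bit compressed predicated branch into its fields."""
    funct3 = (insn >> 13) & 0x7          # bits 15..13
    b = (insn >> 5) & 0x1                # bit 5: B
    return {
        "name":  COMPARATORS.get((funct3, b)),  # None if funct3 is not 110/111
        "pred":  (insn >> 10) & 0x7,     # bits 12..10: predicate rs3
        "src1":  (insn >> 7) & 0x7,      # bits 9..7
        "is_fp": (insn >> 6) & 0x1,      # bit 6: I/F selector
        "src2":  (insn >> 2) & 0x7,      # bits 4..2
        "op":    insn & 0x3,             # bits 1..0: quadrant (C1 = 01)
    }

# funct3=111, pred=2, src1=5, I/F=1, B=1, src2=3, op=01 (C1)
word = (0b111 << 13) | (2 << 10) | (5 << 7) | (1 << 6) | (1 << 5) | (3 << 2) | 0b01
fields = decode_c_pred_branch(word)
print(fields["name"], fields["is_fp"])
```

Bit 6 selects integer or floating-point, while bit 5 together with funct3 picks one of the four comparators.
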
# Note on implementation of parallelism
One extremely important aspect of this proposal is to respect and support