type of special virtual register port or datapath that masks out the
required predicate bits closer to the regfile.
-## One scalar int per predicate element.
-
-Similar to RVV and similar to the one-CR-per-element concept above, the idea here is to use the LSB of any given element in a vector of predicates. This idea has quite a lot of merit to it.
-
-Implementation-wise just like in the CR-based case a special regfile port could be added that gets the LSB of each scalar integer register and routes them through to the broadcast bus.
-
-The disadvantages appear on closer analysis:
-
-* Unlike the "full" CR port (which reads 8x CRs CR0-7 in one hit) trying the same trick on the scalar integer regfile, to obtain 8 predicate bits, would require a whopping 8x64bit set of reads to the INT regfile instead of a scant 1x32bit read. Resource-wise, then, this idea is expensive.
-* With predicate bits being distributed out amongst 64 bit scalar registers, scalar bitmanipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorised-mfcr) are more challenging and costly. Rather than use vectorised mfcr, complex transfers of the LSBs into a single scalar int are required.
-
-On balance this is a less favourable option than vectorising CRs
-
### Predicated SIMD HI32-LO32 FUs
an analysis of changing the element widths (for SIMD) gives the following
(bit 0 thru bit 3 of each CR) closer to the regfile i.e. before hitting
the broadcast bus.
+## One scalar int per predicate element
+
+Similar to RVV, and to the one-CR-per-element concept above, the idea here is to use the LSB of each element in a vector of predicates, where each element is a scalar integer register. This idea has quite a lot of merit to it.
+
+Implementation-wise, just as in the CR-based case, a special regfile port could be added that gets the LSB of each scalar integer register and routes those bits through to the broadcast bus.
+
+The disadvantages appear on closer analysis:
+
+* Unlike the "full" CR port (which reads 8x CRs, CR0-7, in one hit), trying the same trick on the scalar integer regfile to obtain 8 predicate bits would require a whopping 8x 64-bit set of reads to the INT regfile, instead of a scant 1x 32-bit read. Resource-wise, then, this idea is expensive.
+* With predicate bits distributed amongst 64-bit scalar registers, the scalar bit-manipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorised mfcr) are more challenging and costly. Rather than using vectorised mfcr, complex transfers of the LSBs into a single scalar int are required.
+
+On balance, this is a less favourable option than vectorising CRs.
+
## Scalar (single) integer as predicate, with one DM row
This idea has several disadvantages.