just stall straight away.
Less extreme examples include instructions that take only a few cycles
-to complete, but if used in tight loops with Conditional Branches, an
+to complete, but if commonly used in tight loops with Conditional Branches, an
Out-of-Order system with Speculative capability may need significantly
more Reservation Stations to hold in-flight data for instructions which
-take longer than those which do not.
+take longer than those which do not, so even a single clock cycle reduction
+could become important.
+
+A rule of thumb is that in Hardware, at 4.8 GHz the budget for what is called
+"gate propagation delay" is only around 16 to 19 gates chained one after
+the other. Any logic deeper than that must have its intermediate result
+stored in DFFs (D flip-flops), with another 16-19 gates continuing on the next clock
+cycle. Thus for example with `grevlut` above it is almost certainly the
+case that high-performance high-clock-rate systems would need at least
+two clock cycles (two pipeline stages) to produce a valid result.
+This in turn brings us to the next question, since it is common to consider
+subdividing complex instructions into smaller parts.
**Can one instruction do the job of many?**
anything this strategy makes the ISA Working Group's evaluation task
easier, as well as reducing the work of writing a Compliance Test Suite.
+In the case of the MIPS 3D ASE Extension, a Reciprocal-Square-Root
+instruction was proposed that was split into two halves: a first
+instruction giving 12-14 bit accuracy in 7 cycles, and a second
+"Carry On And Get Better Accuracy" instruction! With 3D only needing
+reduced accuracy, the saving in power consumption and time was
+definitely worthwhile, and it neatly illustrates a counter-example to
+trying to make one instruction do too much.
+
**Summary**
There are many tradeoffs here, it is a huge list of considerations: any