insights into instruction design

author Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Thu, 13 Apr 2023 09:37:52 +0000 (10:37 +0100)

committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Thu, 13 Apr 2023 09:37:52 +0000 (10:37 +0100)
diff --git a/openpower/sv/rfc/ls012.mdwn b/openpower/sv/rfc/ls012.mdwn
index e2e19ae9f8310e3855af3fff5e1004c7056bc2f1..3c76bfc1fb6e04e7997fb4b1583c62c8d9b8ee82 100644
--- a/openpower/sv/rfc/ls012.mdwn
+++ b/openpower/sv/rfc/ls012.mdwn
@@ -573,10 +573,21 @@ will back up and eventually stall, where in-order systems pretty much
  just stall straight away.
  
  Less extreme examples include instructions that take only a few cycles
-to complete, but if used in tight loops with Conditional Branches, an
+to complete, but if commonly used in tight loops with Conditional Branches, an
  Out-of-Order system with Speculative capability may need significantly
  more Reservation Stations to hold in-flight data for instructions which
-take longer than those which do not.
+take longer than those which do not, so even a single clock cycle reduction
+could become important.
+
+A rule of thumb is that in Hardware, at 4.8 GHz the budget for what is called
+"gate propagation delay" is only around 16 to 19 gates chained one after
+the other.  Anything beyond that budget must be latched into DFFs
+(D-type Flip-flops), and a further chain of 16-19 gates continues on the
+next clock cycle. Thus, for example, with `grevlut` above it is almost
+certainly the case that high-performance, high-clock-rate systems would
+need at least two clock cycles (two pipeline stages) to produce a valid
+result. This in turn brings us to the next question, because a common
+response is to consider subdividing complex instructions into smaller parts.
  
  **Can one instruction do the job of many?**
  
@@ -596,6 +607,14 @@ as a set of four, instead of over 30 separate instructions.  Aside from
  anything this strategy makes the ISA Working Group's evaluation task
  easier, as well as reducing the work of writing a Compliance Test Suite.
  
+In the case of the MIPS-3D ASE (Application-Specific Extension), a
+Reciprocal-Square-Root instruction was proposed that was split into two
+halves: the first giving 12-14 bits of accuracy in 7 cycles, the second
+a "Carry On And Get Better Accuracy" instruction! With 3D only needing
+reduced accuracy, the saving in power consumption and time was definitely
+worthwhile, and it neatly illustrates a counter-example to trying to make
+one instruction do too much.
+
  **Summary**
  
  There are many tradeoffs here, it is a huge list of considerations: any
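The clock-budget rule of thumb in the first hunk can be turned into a small worked calculation. This is a minimal sketch assuming the 4.8 GHz and 16-19 gate figures quoted there, and it deliberately ignores flip-flop setup and clock-to-output overhead. The function names are hypothetical, as is the 30-gate depth used in the example: no gate depth for `grevlut` is given in the text.

```python
import math


def clock_period_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def per_gate_delay_ps(freq_ghz: float, gates_per_cycle: int) -> float:
    """Average delay per gate implied if `gates_per_cycle` chained gates
    exactly fill one clock period (DFF overhead ignored for simplicity)."""
    return clock_period_ps(freq_ghz) / gates_per_cycle


def pipeline_stages(gate_depth: int, gates_per_cycle: int) -> int:
    """Minimum number of pipeline stages for a logic chain `gate_depth`
    gates deep, when each stage can hold at most `gates_per_cycle` gates
    before its result must be latched into DFFs."""
    if gate_depth <= 0 or gates_per_cycle <= 0:
        raise ValueError("gate counts must be positive")
    return math.ceil(gate_depth / gates_per_cycle)


if __name__ == "__main__":
    freq = 4.8  # GHz, the figure quoted in the text
    print(f"clock period: {clock_period_ps(freq):.1f} ps")
    for budget in (16, 19):
        print(f"{budget} gates/cycle: {per_gate_delay_ps(freq, budget):.1f} ps per gate")
    # Hypothetical 30-gate-deep instruction, not a measured grevlut figure.
    print("stages needed:", pipeline_stages(30, 16))
```

At 4.8 GHz the period is about 208 ps, so a 16-19 gate budget implies roughly 11-13 ps per gate; any chain deeper than the budget spills into a second pipeline stage, which is the situation described for `grevlut`.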
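The two-halves split described in the MIPS paragraph can be modelled in software as a reduced-accuracy first step followed by an optional refinement step. This is a generic illustration, not the MIPS-3D instruction definition: it assumes the first half delivers about 12 bits (simulated here by rounding, where hardware would typically use a table) and that the second half behaves like one Newton-Raphson iteration. The names `rsqrt_estimate` and `rsqrt_refine` are hypothetical.

```python
import math


def rsqrt_estimate(a: float, bits: int = 12) -> float:
    """First half (hypothetical model): a reduced-accuracy 1/sqrt(a).

    Hardware would typically use a small lookup table; here the reduced
    accuracy is simulated by rounding the exact result's mantissa to
    `bits` bits, giving a relative error of at most 2**-bits.
    """
    if a <= 0.0:
        raise ValueError("a must be positive")
    exact = 1.0 / math.sqrt(a)
    m, e = math.frexp(exact)  # exact == m * 2**e with 0.5 <= m < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)


def rsqrt_refine(a: float, x: float) -> float:
    """Second half: one Newton-Raphson step for f(x) = 1/x**2 - a.

    If x has relative error eps, the result has relative error of
    roughly 1.5 * eps**2, so the number of correct bits about doubles.
    """
    return x * (1.5 - 0.5 * a * x * x)


if __name__ == "__main__":
    a = 2.0
    exact = 1.0 / math.sqrt(a)
    x0 = rsqrt_estimate(a)    # "good enough for 3D" result
    x1 = rsqrt_refine(a, x0)  # "carry on and get better accuracy"
    print(f"estimate rel. error: {abs(x0 - exact) / exact:.2e}")
    print(f"refined  rel. error: {abs(x1 - exact) / exact:.2e}")
```

A caller that only needs 3D-grade accuracy stops after the first call, which is where the latency and power saving comes from; the refinement step is paid for only when higher precision is actually required.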