(no commit message)

diff --git a/openpower/sv/overview.mdwn b/openpower/sv/overview.mdwn

index 5e479be31f50e57fd88a0c1556e26ac9ad8fecb2..9f37c0198a2187f20dfd4eccf9afc034714c135b 100644 (file)
--- a/openpower/sv/overview.mdwn
+++ b/openpower/sv/overview.mdwn
@@ -11,13 +11,19 @@ Links:
  
  * This page: [http://libre-soc.org/openpower/sv/overview](http://libre-soc.org/openpower/sv/overview)
  * [FOSDEM2021 SimpleV for OpenPOWER](https://fosdem.org/2021/schedule/event/the_libresoc_project_simple_v_vectorisation/)
+* FOSDEM2021 presentation <https://www.youtube.com/watch?v=FS6tbfyb2VA>
  * [[discussion]] and
    [bugreport](https://bugs.libre-soc.org/show_bug.cgi?id=556)
    feel free to add comments, questions.
  * [[SV|sv]]
  * [[sv/svp64]]
+* [x86 REP instruction](https://c9x.me/x86/html/file_module_x86_id_279.html):
+  a useful way to quickly understand that the core of the SV concept
+  is not new.
+* [Article about register tagging](http://science.lpnu.ua/sites/default/files/journal-paper/2019/jul/17084/volum3number1text-9-16_1.pdf) showing
+  that tagging is not a new idea either. Register tags
+  are also used in the Mill Architecture.
  
-Contents:
  
  [[!toc]]
  
@@ -63,13 +69,14 @@ Unlike in SIMD, powers of two limitations are not involved in the ISA
  or in the assembly code.
  
  SimpleV takes the Cray style Vector principle and applies it in the
-abstract to a Scalar ISA, in the process allowing register file size
-increases using "tagging" (similar to how x86 originally extended
+abstract to a Scalar ISA, in the same way that x86 did with its "REP" instruction.  In the process, "context" is applied, allowing amongst other things
+a register file size
+increase using "tagging" (similar to how x86 originally extended
  registers from 32 to 64 bit).
  
  ## SV
  
-The fundamentals are:
+The fundamentals are (just like x86 "REP"):
  
  * The Program Counter (PC) gains a "Sub Counter" context (Sub-PC)
  * Vectorisation pauses the PC and runs a Sub-PC loop from 0 to VL-1
@@ -466,7 +473,7 @@ element width.  Our first simple loop thus becomes:
         src1 = get_polymorphed_reg(RA, srcwid, i)
         src2 = get_polymorphed_reg(RB, srcwid, i)
         result = src1 + src2 # actual add here
-       set_polymorphed_reg(rd, destwid, i, result)
+       set_polymorphed_reg(RT, destwid, i, result)
  
  With this loop, if elwidth=16 and VL=3 the first 48 bits of the target
  register will contain three 16 bit addition results, and the upper 16
@@ -607,7 +614,7 @@ truncated.  Only then can the arithmetic saturation condition be detected:
         # unsigned add
         result = op_add(src1, src2, opwidth) # at max width
         # now saturate (unsigned)
-       sat = max(result, (1<<destwid)-1)
+       sat = min(result, (1<<destwid)-1)
         set_polymorphed_reg(rd, destwid, i, sat)
         # set sat overflow
         if Rc=1:
@@ -640,8 +647,8 @@ truncating down to 8 bit for example.
         # logical op, signed has no meaning
         result = op_xor(src1, src2, opwidth)
         # now saturate (signed)
-       sat = max(result, (1<<destwid-1)-1)
-       sat = min(result, -(1<<destwid-1))
+       sat = min(result, (1<<destwid-1)-1)
+       sat = max(sat, -(1<<destwid-1)) # clamp the already upper-clamped value
         set_polymorphed_reg(rd, destwid, i, sat)
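
The two clamping steps can be sketched as standalone Python. This is a minimal illustration, not the specification's reference code: the helper names `saturate_unsigned` and `saturate_signed` are hypothetical, and the unsigned helper assumes a non-negative input, as in the unsigned add above.

```python
def saturate_unsigned(result, destwid):
    """Clamp a non-negative result to [0, 2**destwid - 1] (hypothetical helper)."""
    return min(result, (1 << destwid) - 1)


def saturate_signed(result, destwid):
    """Clamp result to [-2**(destwid-1), 2**(destwid-1) - 1] (hypothetical helper)."""
    hi = (1 << (destwid - 1)) - 1   # largest representable signed value
    lo = -(1 << (destwid - 1))      # smallest representable signed value
    # Apply the upper clamp first, then the lower clamp to that result.
    return max(lo, min(result, hi))


print(saturate_unsigned(300, 8))   # 255
print(saturate_signed(200, 8))     # 127
print(saturate_signed(-200, 8))    # -128
```

The signed case feeds the result of the upper clamp into the lower clamp, so both bounds take effect.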
  
  Overall here the rule is: apply common sense then document the behaviour
@@ -902,6 +909,42 @@ implementations may cause pipeline stalls.  This was one of the reasons
  why CR-based pred-result analysis was added, because that at least is
  entirely paralleliseable.
  
+# Vertical-First Mode
+
+This is a relatively new addition to SVP64 under development as of
+July 2021.  Where Horizontal-First is the standard Cray-style for-loop,
+Vertical-First typically executes just the **one** scalar element
+in each Vectorised operation. That element is selected by srcstep
+and dststep *neither of which are changed as a side-effect of execution*.
+This is illustrated in the pseudocode below, with a branch/loop.
+To create loops, a new instruction `svstep` must be called,
+explicitly, with Rc=1:
+
+```
+loop:
+  sv.addi r0.v, r8.v, 5 # GPR(0+dststep) = GPR(8+srcstep) + 5
+  sv.addi r0.v, r8, 5   # GPR(0+dststep) = GPR(8        ) + 5
+  sv.addi r0, r8.v, 5   # GPR(0        ) = GPR(8+srcstep) + 5
+  svstep.               # srcstep++, dststep++, CR0.eq = srcstep==VL
+  bne loop              # loop back until CR0.eq is set (srcstep==VL)
+```
+
+Three examples of different types of Scalar-Vector operations are
+illustrated. Note that in its simplest form **only one** element is
+executed per instruction, **not** multiple elements per instruction.
+(The more advanced version of Vertical-First mode may execute multiple
+elements per instruction, however the number executed **must** remain
+a fixed quantity.)
+
+Now that such explicit loops can increment inexorably towards VL,
+a way is needed to test whether srcstep or dststep has reached
+VL. This is achieved in one of two ways: [[sv/svstep]] has an Rc=1 mode
+where CR0 will be updated if VL is reached. A standard v3.0B Branch
+Conditional may rely on that.  Alternatively, the number of elements
+may be transferred into CTR, as is standard practice in Power ISA.
+Here, SVP64 [[sv/branches]] have a mode which allows CTR to be decremented
+by the number of vertical elements executed.
+
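The Vector-Vector case from the loop above can be modelled in a few lines of Python. This is a rough sketch under stated assumptions, not the SVP64 reference pseudocode: the function name `vertical_first_addi`, the list-based register file `gpr`, and the requirement that VL is at least 1 are all illustrative choices.

```python
def vertical_first_addi(gpr, vl, rt=0, ra=8, imm=5):
    """Model sv.addi rt.v, ra.v, imm in Vertical-First mode (assumes vl >= 1)."""
    srcstep = dststep = 0
    while True:
        # sv.addi: only the element selected by srcstep/dststep is executed,
        # and the steps are not changed as a side-effect.
        gpr[rt + dststep] = gpr[ra + srcstep] + imm
        # svstep. (Rc=1): the only place where the steps advance.
        srcstep += 1
        dststep += 1
        cr0_eq = (srcstep == vl)
        if cr0_eq:
            # VL reached: the conditional branch no longer loops back.
            break
    return gpr


regs = list(range(16))          # GPR(n) initially holds n
vertical_first_addi(regs, vl=4)
print(regs[0:4])                # [13, 14, 15, 16]
```

Each pass through the loop body touches exactly one element, which is the difference from a Horizontal-First loop, where a single instruction would sweep all VL elements.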
  # Instruction format
  
  Whilst this overview shows the internals, it does not go into detail