# The Rules

[[!toc]]

SVP64 is designed around these fundamental and inviolate principles:

1. There are no actual Vector instructions: Scalar instructions
   are the sole exclusive bedrock.
2. No scalar instruction ever deviates in its encoding or meaning
   just because it is prefixed (caveats below).
3. A hardware-level for-loop makes vector elements 100% synonymous
   with scalar instructions (the suffix).

That said, there are a few exceptional places where these rules get
bent, and others where the rules take some explaining;
this page tracks them.

The caveat on modification obviously covers element width overrides,
which still do not actually modify the meaning of the instruction:
an add remains an add, even if it is only an 8-bit add rather than
a 64-bit add. elwidth overrides *definitely* do not alter the v3.0 encoding.
Other "modifications" such as saturation or Data-dependent Fail-First
are likewise post-augmentation or post-analysis, and do not
fundamentally change an add operation into a subtract, for example.

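As a rough illustration of the three principles, the sketch below applies one unmodified scalar operation per element inside a loop. It is a simplified model and not the SVP64 specification: the register file is a flat Python list holding one element per entry (real elwidth overrides pack elements within registers), and the names `scalar_add` and `sv_loop` are hypothetical.

```python
def scalar_add(a, b, width=64):
    """A plain scalar add, wrapping at `width` bits."""
    return (a + b) & ((1 << width) - 1)


def sv_loop(scalar_op, regs, rt, ra, rb, vl, width=64):
    """Model of the hardware for-loop: the scalar operation itself is
    never changed, it is simply issued once per element.
    Assumption: one element per list entry (no sub-register packing)."""
    for i in range(vl):
        regs[rt + i] = scalar_op(regs[ra + i], regs[rb + i], width)
    return regs


regs = list(range(16))          # regs[n] initially holds the value n
sv_loop(scalar_add, regs, rt=8, ra=0, rb=4, vl=4)
```

Passing `width=8` narrows the wrap-around point but the operation is still the same add, which is the sense in which an element width override leaves the meaning of the instruction intact.
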
*(An experiment was attempted to modify LD-immediate instructions
to include a third RC register, i.e. to reinterpret the normal
v3.0 32-bit instruction as a different encoding if SVP64-prefixed:
it did not go well. The complexity that resulted
in the decode phase was too great.)*

# Instruction Groups

The basic principle of SVP64 is the prefix, which contains mode
as well as register augmentation and predicates. When thinking of
instructions and Vectorising them, it is natural for arithmetic
operations (ADD, OR) to be the first to spring to mind.
Arithmetic instructions have registers, therefore augmentation
applies, end of story, right?

Except that Load and Store also deal with Memory, not just registers.
Power ISA has Condition Register Fields: how can element widths
apply there? And branches: how can you have Saturation on something
that does not return an arithmetic result? In short: there are actually
four different categories (five including those for which Vectorisation
makes no sense at all, such as `sc` or `mtmsr`).

# CR weird instructions

[[sv/int_cr_predication]] is by far the biggest violator of the SVP64
rules, for good reasons. Transfers between Vectors of CR Fields and Integers
for use as predicates are very awkward without them.

Normally, element width overrides allow the element width to be specified
as 8, 16, 32 or default (64) bit. With CR weird instructions producing or
consuming either 1-bit or 4-bit elements (in effect), some adaptation was
required. When this perspective is taken (that results or sources are
1 or 4 bits) the weirdness starts to make sense, because the "elements",
such as they are, are still packed sequentially.

From a hardware implementation perspective, however, they will need special
handling as far as Hazard Dependencies are concerned, due to nonconformance
(bit-level management).

# mv.x

[[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
terms of Register Hazard Management that its addition to any Scalar
ISA is anathema. In a Traditional Vector ISA however, where the
indices are isolated behind a single Vector Hazard, there is no
problem at all. `sv.mv.x` is also fraught, precisely because it
sits on top of a Standard Scalar register paradigm, not a Vector
ISA with separate and distinct Vector registers.

To help partly solve this, `sv.mv.x` has to be made relative:

```
for i in range(VL):
    GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
```

The reason for doing so is that MAXVL or VL may be used to limit
the number of Register Hazards that need to be raised to a fixed
quantity, at Issue time.

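A minimal runnable sketch of the relative form follows, assuming the register file is modelled as a flat Python list and that elements are processed strictly in order. The function name `sv_mv_x` and the returned list of source indices are illustrative additions, not part of any specification. The point it tries to show is that, whatever the index registers contain, every source register falls inside a window starting at RT whose size is fixed by VL.

```python
def sv_mv_x(gpr, rt, ra, vl):
    """Relative indexed move, following the pseudocode above.
    Returns the source register numbers that were read, so the
    bounded hazard window can be inspected (illustrative only)."""
    reads = []
    for i in range(vl):
        # Clamp the index so the source stays within RT .. RT+VL.
        offset = min(gpr[ra + i], vl)
        reads.append(rt + offset)
        gpr[rt + i] = gpr[rt + offset]
    return reads


gpr = [0] * 32
gpr[8:13] = [10, 11, 12, 13, 14]    # data window starting at RT=8
gpr[16:20] = [3, 0, 99, 1]          # indices at RA=16; 99 gets clamped
reads = sv_mv_x(gpr, rt=8, ra=16, vl=4)
```

Because the move is in place and sequential in this model, a later element can pick up a value written by an earlier one; whether hardware may reorder such accesses is outside the scope of this sketch.
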
`mv.x` itself will still have to be added as a Scalar instruction,
but the behaviour of `sv.mv.x` will have to be different from that
Scalar version.

Normally, Scalar Instructions have a good justification for being
added as Scalar instructions on their own merit. `mv.x` is the
polar opposite, and as such qualifies for a special mention in
this section.

# Branch-Conditional

[[sv/branches]] are a very special exception to the rule that there
shall be no deviation from the corresponding
Scalar instruction. This is because of the tight
integration with looping and the application of Boolean Logic
manipulation needed for Parallel operations (predicate mask usage).
This results in an extremely important observation:
`scalar identity behaviour` is violated. The SV Prefixed variant of
branch is **not** the same
operation as the unprefixed 32-bit scalar version.

One key difference is that LR is only updated if certain additional
conditions are met, whereas Scalar `bclrl` for example unconditionally
overwrites LR.

Well over 500 Vectorised branch instructions exist in SVP64 due to the
number of options available: close integration and interaction with
the base Scalar Branch was unavoidable in order to create Conditional
Branching suitable for parallel 3D / CUDA GPU workloads.

# Saturation

The application of Saturation as a retro-fit to a Scalar ISA is challenging.
It does help that within the SFFS Compliancy subset there are no Saturated
operations at all: they are only added in VSX.

Saturation does not inherently change the instruction itself: it does however
come with some fundamental implications when applied. For example,
a Floating-Point operation that would normally raise an exception will
no longer do so, instead setting the CR1.SO Flag. Another quirky
example: signed operations which produce a negative result will be
clamped to zero if Unsigned Saturation is requested.

One very important aspect for implementors is that the operation in
effect has to be considered to be performed at infinite precision,
followed by saturation detection. In practice this does not actually
require infinite precision hardware! Two 8-bit integers being
added can only ever overflow into a 9-bit result.

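The "infinite precision, then detect" model can be sketched in a few lines, since Python integers never overflow. This is an illustration under stated assumptions rather than the specified behaviour: the helper names `saturate` and `sat_add` are hypothetical, and the boolean returned alongside the result merely stands in for whatever status flag an implementation records.

```python
def saturate(value, width, signed):
    """Clamp an exact result to the representable range.
    Returns (clamped_value, saturated_flag)."""
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value


def sat_add(a, b, width=8, signed=False):
    """Add at (conceptually) infinite precision, then saturate."""
    exact = a + b               # exact: Python ints are unbounded
    return saturate(exact, width, signed)


unsigned_overflow = sat_add(200, 100)            # exceeds 8-bit range
negative_to_zero = sat_add(-5, 3, signed=False)  # negative, unsigned mode
```

The largest unsigned 8-bit sum is 255 + 255 = 510, which fits in 9 bits, so one extra bit of intermediate precision is enough for this case.
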
Overall some care and consideration needs to be applied.

# Fail-First

Fail-First (both the Load/Store and Data-Dependent variants)
is worthy of a special mention in its own right. Where VL is
normally forward-looking and may be part of a pre-decode phase
in a (simplified) pipelined architecture with no Read-after-Write Hazards,
Fail-First changes that, because at any point during the execution
of the element-level instructions, one of those elements may not only
terminate further continuation of the hardware for-loop but also
effect a change of VL:

```
for i in range(VL):
    result = element_operation(GPR(RA+i), GPR(RB+i))
    if test(result):
        VL = i
        break
```

This is not exactly a violation of SVP64 Rules, more of a breakage
of user expectations, particularly for LD/ST: where exceptions
would normally be expected to be raised, Fail-First provides for
avoidance of those exceptions.
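
The truncation of VL in the pseudocode above can be exercised with a small runnable sketch. Several details here are assumptions made for illustration only: the register file is a flat Python list, the function name `ddffirst` is hypothetical, results of passing elements are written to `RT+i`, and the failing element's result is not written (the pseudocode does not say either way).

```python
import operator


def ddffirst(gpr, rt, ra, rb, vl, element_operation, test):
    """Data-dependent fail-first sketch: returns the new VL."""
    for i in range(vl):
        result = element_operation(gpr[ra + i], gpr[rb + i])
        if test(result):
            return i            # VL is truncated at the failing element
        # Assumption: only elements that pass the test are written back.
        gpr[rt + i] = result
    return vl                   # no element failed: VL is unchanged


gpr = [0] * 32
gpr[0:4] = [5, 6, 7, 8]         # RA vector
gpr[4:8] = [1, 2, 7, 3]         # RB vector
# Subtract element-wise and stop at the first zero result.
new_vl = ddffirst(gpr, rt=8, ra=0, rb=4, vl=4,
                  element_operation=operator.sub,
                  test=lambda r: r == 0)
```

The third pair (7 and 7) produces zero, so the loop stops there and later code sees a shorter VL than the one it started with, which is the breakage of expectations described above.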