From 3703ea746b59e0a300f98f8774494395517986d7 Mon Sep 17 00:00:00 2001 From: Jacob Lifshay Date: Mon, 14 Dec 2020 16:52:37 -0800 Subject: [PATCH] working on svp64 --- openpower/sv/svp_rewrite/svp64.mdwn | 68 +++++++++++++++---- .../sv/svp_rewrite/svp64/discussion.mdwn | 34 ++++++++-- 2 files changed, 84 insertions(+), 18 deletions(-) diff --git a/openpower/sv/svp_rewrite/svp64.mdwn b/openpower/sv/svp_rewrite/svp64.mdwn index 5d7059d4f..30e56b8b3 100644 --- a/openpower/sv/svp_rewrite/svp64.mdwn +++ b/openpower/sv/svp_rewrite/svp64.mdwn @@ -39,18 +39,13 @@ Integer based predication. Twin predication uses the same encoding thus allowin | Value | Mnemonic | Description | |-------|-------------------|--------------------------------------------------------| | 0000 | - | Reserved (causes an illegal instruction trap) | -| 0001 | ALWAYS (implicit) | Operation is not masked see [[discussion]] | +| 0001 | ALWAYS (implicit) | Operation is not masked see [[discussion]] | | 0010 | R3 | Element `i` is enabled if `R3 & (1 << i)` is non-zero | | 0011 | ~R3 | Element `i` is enabled if `R3 & (1 << i)` is zero | | 0100 | R10 | Element `i` is enabled if `R10 & (1 << i)` is non-zero | | 0101 | ~R10 | Element `i` is enabled if `R10 & (1 << i)` is zero | | 0110 | R30 | Element `i` is enabled if `R30 & (1 << i)` is non-zero | | 0111 | ~R30 | Element `i` is enabled if `R30 & (1 << i)` is zero | - -CR based predication. TODO: select alternate CR for twin predication? 
see [[discussion]] Overlap of the two CR based predicates must be taken into account, so the starting point for one of them must be suitably high, or accept that for twin predication VL must not exceed the range where overlap will occur, *or* that they use the same starting point but select different *bits* of the same CRs - -| Value | Mnemonic | Description | -|-------|-------------------|--------------------------------------------------------| | 1000 | lt | Element `i` is enabled if `CR[6+i].LT` is set | | 1001 | nl/ge | Element `i` is enabled if `CR[6+i].LT` is clear | | 1010 | gt | Element `i` is enabled if `CR[6+i].GT` is set | @@ -60,6 +55,9 @@ CR based predication. TODO: select alternate CR for twin predication? see [[dis | 1110 | so/un | Element `i` is enabled if `CR[6+i].FU` is set | | 1111 | ns/nu | Element `i` is enabled if `CR[6+i].FU` is clear | +CR based predication. TODO: select alternate CR for twin predication? see [[discussion]] Overlap of the two CR based predicates must be taken into account, so the starting point for one of them must be suitably high, or accept that for twin predication VL must not exceed the range where overlap will occur, *or* that they use the same starting point but select different *bits* of the same CRs + + ## Prefix Opcode Map (64-bit instruction encoding) (prefix bits 6:11) (shows both PowerISA v3.1 instructions as well as new SVP instructions; empty spaces are yet-to-be-allocated Illegal Instructions) @@ -92,22 +90,25 @@ SV Registers are numbered using the notation `SV[F]R_` where `` is a de ## Integer Registers +``` setvli ..., VL=7 add r20, r25, r30, elwidth=64, subvl=1 +``` -where r20, r25, and r30 are standard OpenPower register names. -Those names correspond to SVR20_00, SVR25_00, and SVR30_00. +where `r20`, `r25`, and `r30` are standard OpenPower register names. +Those names correspond to `SVR20_00`, `SVR25_00`, and `SVR30_00`. 
pseudocode: -const STD_TO_SV_SHIFT = 2; // gets bigger as reg files expand to 256, 512, -... registers +```C++ +const size_t STD_TO_SV_SHIFT = 2; // gets bigger as reg files expand to 256, 512, ... registers -VL=7 // setvli (omitting maxvl here) +VL = 7; // setvli (omitting maxvl here) -for(i=0;i
+| CR Register | SPR Field | SV CR Register | CR Register | SPR Field | SV CR Register |
+|-----------------|----------------|--------------------|-----------------|----------------|--------------------|
+| CR[0] | CR[32:35] | SVCR0_000 | CR[4] | CR[48:51] | SVCR4_000 |
+| | CR_EXT1[32:35] | SVCR0_001 | | CR_EXT1[48:51] | SVCR4_001 |
+| | CR_EXT2[32:35] | SVCR0_010 | | CR_EXT2[48:51] | SVCR4_010 |
+| | CR_EXT3[32:35] | SVCR0_011 | | CR_EXT3[48:51] | SVCR4_011 |
+| *CR[-8]* | CR[0:3] | SVCR0_100 | *CR[-4]* | CR[16:19] | SVCR4_100 |
+| | CR_EXT1[0:3] | SVCR0_101 | | CR_EXT1[16:19] | SVCR4_101 |
+| | CR_EXT2[0:3] | SVCR0_110 | | CR_EXT2[16:19] | SVCR4_110 |
+| | CR_EXT3[0:3] | SVCR0_111 | | CR_EXT3[16:19] | SVCR4_111 |
+| CR[1] | CR[36:39] | SVCR1_000 | CR[5] | CR[52:55] | SVCR5_000 |
+| | CR_EXT1[36:39] | SVCR1_001 | | CR_EXT1[52:55] | SVCR5_001 |
+| | CR_EXT2[36:39] | SVCR1_010 | | CR_EXT2[52:55] | SVCR5_010 |
+| | CR_EXT3[36:39] | SVCR1_011 | | CR_EXT3[52:55] | SVCR5_011 |
+| *CR[-7]* | CR[4:7] | SVCR1_100 | *CR[-3]* | CR[20:23] | SVCR5_100 |
+| | CR_EXT1[4:7] | SVCR1_101 | | CR_EXT1[20:23] | SVCR5_101 |
+| | CR_EXT2[4:7] | SVCR1_110 | | CR_EXT2[20:23] | SVCR5_110 |
+| | CR_EXT3[4:7] | SVCR1_111 | | CR_EXT3[20:23] | SVCR5_111 |
+| CR[2] | CR[40:43] | SVCR2_000 | CR[6] | CR[56:59] | SVCR6_000 |
+| | CR_EXT1[40:43] | SVCR2_001 | | CR_EXT1[56:59] | SVCR6_001 |
+| | CR_EXT2[40:43] | SVCR2_010 | | CR_EXT2[56:59] | SVCR6_010 |
+| | CR_EXT3[40:43] | SVCR2_011 | | CR_EXT3[56:59] | SVCR6_011 |
+| *CR[-6]* | CR[8:11] | SVCR2_100 | *CR[-2]* | CR[24:27] | SVCR6_100 |
+| | CR_EXT1[8:11] | SVCR2_101 | | CR_EXT1[24:27] | SVCR6_101 |
+| | CR_EXT2[8:11] | SVCR2_110 | | CR_EXT2[24:27] | SVCR6_110 |
+| | CR_EXT3[8:11] | SVCR2_111 | | CR_EXT3[24:27] | SVCR6_111 |
+| CR[3] | CR[44:47] | SVCR3_000 | CR[7] | CR[60:63] | SVCR7_000 |
+| | CR_EXT1[44:47] | SVCR3_001 | | CR_EXT1[60:63] | SVCR7_001 |
+| | CR_EXT2[44:47] | SVCR3_010 | | CR_EXT2[60:63] | SVCR7_010 |
+| | CR_EXT3[44:47] | SVCR3_011 | | CR_EXT3[60:63] | SVCR7_011 |
+| *CR[-5]* | CR[12:15] | SVCR3_100 | 
*CR[-1]* | CR[28:31] | SVCR7_100 | +| | CR_EXT1[12:15] | SVCR3_101 | | CR_EXT1[28:31] | SVCR7_101 | +| | CR_EXT2[12:15] | SVCR3_110 | | CR_EXT2[28:31] | SVCR7_110 | +| | CR_EXT3[12:15] | SVCR3_111 | | CR_EXT3[28:31] | SVCR7_111 | + +Note: CR[-8] through CR[-1] are not part of OpenPower v3.1, they are the MSB half of the 64-bit CR SPR. + # Register Profiles Instructions are broken down by Register Profiles as listed in the following auto-generated page: diff --git a/openpower/sv/svp_rewrite/svp64/discussion.mdwn b/openpower/sv/svp_rewrite/svp64/discussion.mdwn index 8824846f7..768b4d570 100644 --- a/openpower/sv/svp_rewrite/svp64/discussion.mdwn +++ b/openpower/sv/svp_rewrite/svp64/discussion.mdwn @@ -25,7 +25,7 @@ twin predication and twin elwidth overrides is extremely important to have to be something like: | 0 1 | 2 3 | 4 5 | 6 | 7 9 | 10 12 | 13 18 | 19 23 | -| ----- | --- | --- | ---- | ---- | ----- | ----- | ----- | +|-------|-----|-----|------|------|-------|-------|-------| | subvl | sew | dew | ptyp | psrc | pdst | vspec | mode | * subvl - 1 to 4 scalar / vec2 / vec3 / vec4 @@ -51,7 +51,7 @@ With different bits being selectable (CR[0..3]) starting from the same CR makes these are of the form res = op(src1, src2, ...) | 0 1 | 2 3 | 4 5 | 6 | 7 9 | 10 18 | 19 23 | -| ----- | --- | --- | ---- | ---- | ----- | ------ | +|-------|-----|-----|------|------|-------|--------| | subvl | sew | dew | ptyp | pred | vspec | mode | * subvl - 1 to 4 scalar / vec2 / vec3 / vec4 @@ -129,7 +129,13 @@ therefore the strategy proposed is: with 2x12 this would mean no need to have complex encoding of swizzle. -if we really do need 2 bits spare then the complex encoder of swizzle could be deployed. (*an analysis shows this to be very unlikely. 7^4 is around 2400 which still requires 12 bits to encode*) +if we really do need 2 bits spare then the complex encoder of swizzle could be deployed. (*an analysis shows this to be very unlikely. 
7^4 is around 2400 which still requires 12 bits to encode* (that's miscalculated, see Single Swizzle section below.)) + +## Single Swizzle + +I expect swizzle to not be common enough to warrant 2 swizzles in a single instruction, therefore the above swizzle strategy is probably unnecessary. + +Also, if a swizzle supports up to subvl=4, then 11 bits is sufficient since each swizzle element needs to be able to select 1 of 6 different values: 0, 1, x, y, z, w. 6^4 = 1296 which easily fits in 11 bits. # note about INT predicate @@ -139,11 +145,26 @@ this means by default that 001 will always be in nonpredicated ops, which seems 000 would indicate "the predicate is an immediate of all 1s" i.e. "no operation is masked out" +programmerjake: +I picked 0001 to indicate ALWAYS since that matches with the other semantics: the LSB bit is invert-the-mask, and you can think about the table as-if it is really: + +| Value | Mnemonic | +|-------|-------------| +| 0000 | R0 (zero) | +| 0001 | ~R0 (~zero) | +| 0010 | R3 | +| 0011 | ~R3 | +| 0100 | R10 | +| 0101 | ~R10 | +| 0110 | R30 | +| 0111 | ~R30 | + + # CR Vectorisation -Some thoughts on this: the sensible (sane) number of CRs to have is 64. A case could be made for having 128 but it is an awful lot. 64 CRs also has the advantage that it is only 4x 64 bit registers on a context-switch. +Some thoughts on this: the sensible (sane) number of CRs to have is 64. A case could be made for having 128 but it is an awful lot. 64 CRs also has the advantage that it is only 4x 64 bit registers on a context-switch (programmerjake: yeah, but we already have 256 64-bit registers, a few more won't change much). -A practical issue stems from the fact that accessing the CR regfile on a non-aligned 8-CR boundary during Vector operations would significantly increase internal routing. By aligning Vector Reads/Writes to 8 CRs this requires only 32 bit aligned read/writes. 
+A practical issue stems from the fact that accessing the CR regfile on a non-aligned 8-CR boundary during Vector operations would significantly increase internal routing. By aligning Vector Reads/Writes to 8 CRs this requires only 32 bit aligned read/writes. (programmerjake: simple solution -- rename them internally such that CR6 is the first one) How to number them as vectors gets particularly interesting. A case could be made for treating the 64 CRs as a square, and using CR numbering (CR0-7) to begin VL for-loop incrementing first by row and when rolling over to increment the column. CR6 CR14 ... CR62 then CR7 CR15 ... @@ -151,4 +172,5 @@ When the SV prefix marks them with 2 bits, one of those could be used to indicat When there are 3 bits it would be possible to indicate whether to begin from a position offset by 4 (middle of matrix, edge of matrix). -Note: considerable care needs to be taken when putting these horiz/verticsl CRsvthrough the Dependency Matrices +Note: considerable care needs to be taken when putting these horiz/vertical CRs through the Dependency Matrices + -- 2.30.2