From 6ddc02ad774afcfbe9c028c46a31810790d1a8c6 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Fri, 6 Apr 2018 19:30:56 +0100
Subject: [PATCH] partial update

---
 simple_v_extension.mdwn | 46 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 652212ffe..f8a1d1b7b 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -27,8 +27,12 @@ Therefore it makes a huge amount of sense to have a means and method
 of introducing instruction parallelism in a flexible way that provides
 implementors with the option to choose exactly where they wish to offer
 performance improvements and where they wish to optimise for power
-and/or area. If that can be offered even on a per-operation basis that
-would provide even more flexibility.
+and/or area (and if that can be offered even on a per-operation basis that
+would provide even more flexibility).
+
+Additionally it makes sense to *split out* the parallelism inherent within
+each of P and V, and to see if each of P and V then, in *combination* with
+a "best-of-both" parallelism extension, would work well.
 
 **TODO**: reword this to better suit this document:
 
@@ -180,6 +184,42 @@ way to span 8-bit up to 64-bit groups of data, where BGS as it stands
 and described by Clifford does **bits** of up to 16 width. Lots to
 look at and investigate!*
 
+# Note on implementation of parallelism
+
+One extremely important aspect of this proposal is to respect and support
+implementors' desire to focus on power, area or performance. In that regard,
+it is proposed that implementors be free to choose whether to implement
+the Vector (or variable-width SIMD) parallelism as sequential operations
+with a single ALU, fully parallel (if practical) with multiple ALUs, or
+a hybrid combination of both.
+
+In Broadcom's VideoCore-IV, they chose hybrid, and called it "Virtual
+Parallelism". They achieve a 16-way SIMD at an **instruction** level
+by providing a combination of a 4-way parallel ALU *and* an externally
+transparent loop that feeds 4 sequential sets of data into each of the
+4 ALUs.
+
+Also in the same core, it is worth noting that particularly uncommon
+but essential operations (Reciprocal-Square-Root for example) are
+*not* part of the 4-way parallel ALU but instead have a *single* ALU.
+Under the proposed Vector (variable-width SIMD), implementors would
+be free to do precisely that: i.e. free to choose *on a per-operation
+basis* whether and how much "Virtual Parallelism" to deploy.
+
+It is absolutely critical to note that it is proposed that such choices MUST
+be **entirely transparent** to the end-user and the compiler. Whilst
+a Vector (variable-width SIMD) may not precisely match the width of the
+parallelism within the implementation, the end-user **should not care**
+and in this way the performance benefits are gained but the ISA remains
+simple. All that happens at the end of an instruction run is: some
+parallel units (if there are any) would remain offline, completely
+transparently to the ISA, the program, and the compiler.
+
+The "SIMD considered harmful" trap of having huge complexity and extra
+instructions to deal with corner-cases is thus avoided, and implementors
+get to choose precisely where to focus and target the benefits of their
+implementation efforts.
+
 # References
 
 * SIMD considered harmful
@@ -189,3 +229,5 @@ look at and investigate!*
 * Re-continuing P-Extension proposal
 * First Draft P-SIMD (DSP) proposal
 * B-Extension discussion
+* Broadcom VideoCore-IV
+ Figure 2 P17 and Section 3 on P16.
-- 
2.30.2
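
Note on the "Virtual Parallelism" paragraph added by this patch: the idea that lane count is an implementation detail invisible to the program can be sketched in a few lines of Python. This is only an illustrative model, not part of the patch and not how any real core is built. The function name `virtual_parallel_op`, the `lanes` parameter and the returned pass and idle counts are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple


def virtual_parallel_op(
    op: Callable[[int, int], int],
    a: List[int],
    b: List[int],
    lanes: int,
) -> Tuple[List[int], int, int]:
    """Apply `op` elementwise using `lanes` hypothetical ALUs.

    Returns (result, passes, idle_lane_slots). Only `result` would be
    visible to a program; the other two values model hidden hardware
    behaviour and are returned purely for illustration.
    """
    if lanes < 1:
        raise ValueError("lanes must be >= 1")
    if len(a) != len(b):
        raise ValueError("operand vectors must have equal length")

    vl = len(a)            # vector length requested by the program
    result = [0] * vl
    passes = 0             # sequential trips through the ALU group
    idle = 0               # lane slots left unused on the final pass

    # Outer loop: the "externally transparent" sequential loop.
    for base in range(0, vl, lanes):
        passes += 1
        # Inner loop: stands in for `lanes` ALUs working side by side.
        for lane in range(lanes):
            i = base + lane
            if i < vl:
                result[i] = op(a[i], b[i])
            else:
                idle += 1  # this ALU stays offline for the pass
    return result, passes, idle
```

Under this model, a 16-element add with `lanes=4` takes 4 passes, which mirrors the 4-way ALU fed 4 sequential sets of data described for VideoCore-IV, while `lanes=1` and `lanes=16` give the same result in 16 passes and 1 pass respectively. A 10-element vector on 4 lanes takes 3 passes and leaves 2 lane slots idle on the last one, without changing the result.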