Fixed push-push to push-pull

[libreriscv.git] / openpower / sv.mdwn
diff --git a/openpower/sv.mdwn b/openpower/sv.mdwn

index 60c4a05dc2f3de48874eec07014c2df1abfbf5c3..89fbcfe1acdfe7ca66a49ae67c24449846866016 100644 (file)
--- a/openpower/sv.mdwn
+++ b/openpower/sv.mdwn
@@ -1,27 +1,168 @@
+[[!tag standards]]
+
  # Simple-V Vectorisation for the OpenPOWER ISA
  
+**SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.
+
+<https://bugs.libre-soc.org/show_bug.cgi?id=213>
+
  Fundamental design principles:
  
  * Simplicity of introduction and implementation on the existing OpenPOWER ISA
  * Effectively a hardware for-loop, pausing PC, issuing multiple scalar operations
-* Preserving the underlying scalar execution dependencies as if the for-loop had been expanded as actual scalar instructions.
+* Preserving the underlying scalar execution dependencies as if the for-loop had been expanded as actual scalar instructions
+  (termed "preserving Program Order")
  * Augments ("tags") existing instructions, providing Vectorisation "context" rather than adding new ones.
-* Does not modify or deviate from the underly scalar OpenPOWER ISA unless it provides significant performance or other advantage to do so in the Vector space (dropping XER.SO and OE=1 for example)
+* Does not modify or deviate from the underlying scalar OpenPOWER ISA unless it provides significant performance or other advantage to do so in the Vector space (dropping XER.SO and OE=1 for example)
+* Designed for Supercomputing: avoids creating significant sequential
+dependency hazards, allowing high performance superscalar microarchitectures to be deployed.
  
  Advantages of these design principles:
  
  * It is therefore easy to create a first (and sometimes only) implementation as literally a for-loop in hardware, simulators, and compilers.
  * More complex HDL can be done by repeating existing scalar ALUs and pipelines as blocks.
  * As (mostly) a high-level "context" that does not (significantly) deviate from scalar OpenPOWER ISA and, in its purest form being "a for loop around scalar instructions", it is minimally-disruptive and consequently stands a reasonable chance of broad community adoption and acceptance
+* Completely wipes not just SIMD opcode proliferation off the
+  map (SIMD is O(N^6) opcode proliferation)
+  but off of Vectorisation ISAs as well.  No more separate Vector
+  instructions.
  
  Pages being developed and examples
  
-* [[sv/predication]]
+* [[sv/overview]] explaining the basics.
+* [[sv/implementation]] implementation planning and coordination
+* [[sv/svp64]] contains the packet-format *only*
+* [[sv/setvl]] the Cray-style "Vector Length" instruction
+* [[sv/predication]] discussion on predication concepts
+* [[sv/cr_int_predication]] instructions needed for effective predication
  * [[sv/masked_vector_chaining]]
  * [[sv/discussion]]
  * [[sv/example_dep_matrices]]
-* [[sv/prefix]]
  * [[sv/major_opcode_allocation]]
  * [[opcode_regs_deduped]]
  * [[sv/vector_swizzle]]
+* [[sv/register_type_tags]]
+* [[sv/mv.swizzle]]
+* [[sv/mv.x]]
+* [[sv/branches]] - SVP64 Conditional Branch behaviour: All/Some Vector CRs
+* [[sv/cr_ops]] - SVP64 Condition Register ops: Guidelines
+ on Vectorisation of any v3.0B base operations which return
+ or modify a Condition Register bit or field.
+* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
+* [[sv/fclass]] detect class of FP numbers
+* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
+* [[sv/mv.vec]] move to and from vec2/3/4
+* [[sv/16_bit_compressed]] experimental
+* [[sv/toc_data_pointer]] experimental
+* [[sv/ldst]] Load and Store
+* [[sv/sprs]] SPRs
+* [[sv/bitmanip]]
+* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
+* [[sv/propagation]] Context propagation including svp64, swizzle and remap
+* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
+* [[sv/av_opcodes]] scalar opcodes for Audio/Video
+* [[sv/byteswap]]
+* TODO: OpenPOWER [[openpower/transcendentals]]
+
+Additional links:
+
+* <https://www.sigarch.org/simd-instructions-considered-harmful/>
+* [[simple_v_extension]] old (deprecated) version
+* [[openpower/sv/llvm]]
+
+===
+
+Required Background Reading:
+============================
+
+These are all, deep breath, basically... required reading, *as well as and in addition* to a full and comprehensive deep technical understanding of the Power ISA, in order to understand the depth and background on SVP64 as a 3D GPU and VPU Extension.
+
+I am keenly aware that each of them is 300 to 1,000 pages (just like the Power ISA itself).
+
+This is just how it is.
+
+Given the sheer overwhelming size and scope of SVP64 we have gone to CONSIDERABLE LENGTHS to provide justification and rationalisation for adding the various sub-extensions to the Base Scalar Power ISA.
+
+* Scalar bitmanipulation is justifiable for the exact same reasons the extensions are justifiable for other ISAs.  The additional justification for their inclusion where some instructions are already (sort-of) present in VSX is that VSX is not mandatory, and the complexity of implementation of VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
+
+* Scalar FP-to-INT conversions, likewise.  ARM has a javascript conversion instruction, Power ISA does not (and it costs a ridiculous 45 instructions to implement, including 6 branches!)
+
+* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable for High-Performance Compute workloads.
+
+It also has to be pointed out that normally this work would be covered by multiple separate full-time Workgroups with multiple Members contributing their time and resources!
+
+Overall the contributions that we are developing take the Power ISA out of the specialist highly-focussed market it is presently best known for, and expands it into areas with much wider general adoption and broader uses.
+
+
+---
+
+OpenCL specifications are linked here, these are relevant when we get to a 3D GPU / High Performance Compute ISA WG RFC:
+[[openpower/transcendentals]]
+
+(Failure to add Transcendentals to a 3D GPU is directly equivalent to *willfully* designing a product that is 100% destined for commercial failure.)
+
+I mention these because they will be encountered in every single commercial GPU ISA, but they're not part of the "Base" (core design) of a Vector Processor. Transcendentals can be added as a sub-RFC.
+
+---
+
+Actual 3D GPU Architectures and ISAs:
+-------------------------------------
+
+* Broadcom Videocore
+  <https://github.com/hermanhermitage/videocoreiv>
+
+* Etnaviv
+  <https://github.com/etnaviv/etna_viv/tree/master/doc>
+
+* Nyuzi
+  <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>
+
+* MALI
+  <https://github.com/cwabbott0/mali-isa-docs>
+
+* AMD
+  <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>  
+  <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>
+
+* MIAOW which is *NOT* a 3D GPU, it is a processor which happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
+  <https://miaowgpu.org/>
+
+
+Actual Vector Processor Architectures and ISAs:
+-----------------------------------------------
+
+* NEC SX Aurora
+  <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>
+
+* Cray ISA
+  <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>
+
+* RISC-V RVV
+  <https://github.com/riscv/riscv-v-spec>
+
+* MRISC32 ISA Manual (under active development)
+  <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>
+
+* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from Mitch on direct contact with him.  It is a different approach from the others, which may be termed "Cray-Style Horizontal-First" Vectorisation.  66000 is a *Vertical-First* Vector ISA.
+
+The term Horizontal or Vertical alludes to the Matrix "Row-First" or "Column-First" technique, where:
+
+* Horizontal-First processes all elements in a Vector before moving on to the next instruction
+* Vertical-First processes *ONE* element per instruction, and requires loop constructs to explicitly step to the next element.
+
+Vector-type Support by Architecture
+[[!table  data="""
+Architecture | Horizontal | Vertical
+MyISA        |            | X
+Cray         | X          |
+SX Aurora    | X          |
+RVV          | X          |
+SVP64        | X          | X
+"""]]
+
+===
+
+Obligatory Dilbert:
+
+<img src="https://assets.amuniversal.com/7fada35026ca01393d3d005056a9545d" width="600" />