From ff9d3c148e80b4a038a356caf5f6e90d0fb516a3 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Tue, 4 Dec 2018 00:07:05 +0000
Subject: [PATCH] add discussion

---
 3d_gpu/microarchitecture.mdwn | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index bd5a0ac87..e244c3ab7 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -107,6 +107,29 @@ LDs write.
 You will find doing VRFs a lot more compact this way. In GPU land we
 called the flip-flops orchestrating the timing "collectors".
 
+----
+
+For GPU workloads FP64 is not common, so I think having 1 FP64 ALU would
+be sufficient. Since indexed loads and stores are not supported, it will
+be important to support 4x64 integer operations to generate addresses
+for loads/stores.
+
+I was thinking we would use scoreboarding to keep track of operations
+and dependencies since it doesn't need a CAM per ALU. We should be able
+to design it to forward past the register file to allow for 0-latency
+forwarding. If we combined that with register renaming it should prevent
+most WAR and WAW data hazards.
+
+I think branch prediction will be essential, if only to fetch and decode
+operations, since it will reduce the branch penalty substantially.
+
+Note that even if we have a zero-overhead loop extension, branch
+prediction will still be useful as we will want to be able to run code
+such as compilers and standard RV code with decent performance. Additionally,
+quite a few shaders have branching in their inner loops so
+zero-overhead loops won't be able to fix all the branching problems.
+
+
 # References
 
 * <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
-- 
2.30.2