From 2ef04934b28cfeb61a532cd3e2976088ade05f22 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Wed, 12 Dec 2018 08:33:27 +0000
Subject: [PATCH] add conversation notes

---
 3d_gpu/microarchitecture.mdwn | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index ac3ba3590..72025497c 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -433,6 +433,12 @@ ok,so continuing some thoughts-in-order notes:
   - FUs therefore only really express the register, memory, and execution
     dependencies: they don't actually do the execution.
 
+## Recommendations
+
+* Include a merged address-generator in the INT ALU
+* Have simple ALU units duplicated and allow more than one FU to
+  receive (and process) the src operands.
+
 ## Register file workloads
 
 Note: Vectorisation also includes predication, which is one extra integer read
@@ -459,6 +465,34 @@ FP workloads:
 * 17% Addition
 * 5% branch
 
+----
+
+> in particular i found it fascinating that analysis of INT
+> instructions found a 50% LD, 25% ST and 25% branch, and that
+> 70% were 2-src ops. therefore you made sure that the number
+> of read and write ports matched these, to ensure no bottlenecks,
+> bearing in mind that ST requires reading an address *and*
+> a data register.
+
+I never had a problem in "reading the write slot" in any of my pipelines.
+That is, take a pipeline where LD (cache hit) has a latency of 3 cycles
+(AGEN, Cache, Align). Align would be in the cycle where the data was being
+forwarded, and in the subsequent cycle, data could be written into the RF.
+
+|dec|AGN|$$$|ALN|LDW|
+
+For stores I would read the LD's write slot, align the store data, and merge
+into the cache as:
+
+|dec|AGEN|tag|---|STR|ALN|$$$|
+
+You know 4 cycles in advance that a store is coming, 2 cycles after hit,
+so there is easy logic to decide to read the write slot (or not), and it
+costs 2 address comparators to disambiguate this short shadow in the pipeline.
+
+This is a lower expense than building another read port into the RF, in
+both area and power, and uses the pipeline efficiently.
+
 # References
 
-- 
2.30.2
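
The "reading the write slot" timing in the conversation notes above can be sketched as a small Python model. This is only an illustration, not part of the patch or of any real design. The stage names are copied from the two pipeline diagrams. Everything else is an assumption made for this sketch: the helper names `slot_cycle` and `slot_schedule` are hypothetical, the "slot" is read as one shared register-file time slot per cycle, and at most one memory op is assumed to issue per cycle.

```python
# Stage names copied from the two pipeline diagrams in the notes.
LD_STAGES = ["dec", "AGN", "$$$", "ALN", "LDW"]
ST_STAGES = ["dec", "AGEN", "tag", "---", "STR", "ALN", "$$$"]


def slot_cycle(op, issue_cycle):
    """Cycle in which `op` uses the shared register-file slot.

    Assumption: a load uses the slot in its LDW stage (RF write) and a
    store uses the same slot in its STR stage (RF read of store data).
    """
    stages = LD_STAGES if op == "LD" else ST_STAGES
    stage = "LDW" if op == "LD" else "STR"
    return issue_cycle + stages.index(stage)


def slot_schedule(ops):
    """Map cycle -> (op, action), assuming one memory op issues per cycle."""
    schedule = {}
    for issue_cycle, op in enumerate(ops):
        cycle = slot_cycle(op, issue_cycle)
        # With one issue per cycle the slot can never be claimed twice.
        assert cycle not in schedule, "two ops claimed the same slot"
        schedule[cycle] = (op, "write" if op == "LD" else "read")
    return schedule


if __name__ == "__main__":
    for cycle, (op, action) in sorted(slot_schedule(["LD", "ST", "LD", "ST"]).items()):
        print(cycle, op, action)
```

In this model LDW and STR both sit four stages after decode, so a store issued in a given cycle claims exactly the slot that a load issued in that cycle would have used for its write. That is one way to read the claim that no extra RF read port is needed. The two address comparators mentioned in the notes are not modelled here.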