dcache: Trim one cycle from the load hit path
Currently we don't get the result from a load that hits in the dcache
until the fourth cycle after the instruction was presented to
loadstore1. This trims this back to 3 cycles by taking the low order
bits of the address generated in loadstore1 into dcache directly (not
via the output register of loadstore1) and using them to address the
read port of the dcache data RAM. We use the lower 12 address bits
here in the expectation that any reasonable data cache design will
have a set size of 4kB or less in order to avoid the aliasing problems
that can arise with a virtually-indexed physically-tagged cache if
the set size is greater than the smallest page size provided by the
MMU.
With this we can get rid of r2 and drive the signals going to
writeback from r1, since the load hit data is now available one
cycle earlier. We need a multiplexer on the read address of the
data cache RAM in order to handle the second doubleword of an
unaligned access.
One small complication is that we now need an extra cycle in the case
of an unaligned load which misses in the data cache and which reads
the 2nd-last and last doublewords of a cache line. This is the reason
for the PRE_NEXT_DWORD state; if we just go straight to NEXT_DWORD
then we end up having the write of the last doubleword of the cache
line and the read of that same doubleword occurring in the same
cycle, which means we read stale data rather than the just-fetched
data.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>