dcache: Fix bug with forwarding of stores
We have two stages of forwarding to cover the two cycles of latency
between when something is written to BRAM and when that new data can
be read from BRAM. When the writes to BRAM result from store
instructions, the write may write only some bytes of a row (8 bytes)
and not others, so we have a mask to enable only the written bytes to
be forwarded. However, we only forward written data from either the
first stage of forwarding or the second, not both. So if we have
two stores in succession that write different bytes of the same row,
and then a load from the row, we will only forward the data from the
second store, and miss the data from the first store; thus the load
will get the wrong value.
To fix this, we make the decision on which forward stage to use for
each byte individually. This results in a 4-input multiplexer feeding
r1.data_out, with its inputs being the BRAM, the wishbone, the current
write data, and the 2nd-stage forwarding register. Each byte of the
multiplexer is separately controlled. The code for this multiplexer
is moved to the dcache_fast_hit process since it is used for cache
hits as well as cache misses.
This also simplifies the BRAM code by ensuring that we can use the
same source for the BRAM address and way selection for writes, whether
we are writing store data or cache line refill data from memory.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>