dcache: Reduce back-to-back store latency from 3 cycles to 2

This reuses the existing machinery for comparing the real address of a
new request with the tag of a previous request (r1.reload_tag) to get
better timing on the comparison between the address of a second store
and that of the store in progress.  The comparison is now made at
set-size rather than page-size granularity, but since the set size
can't be larger than the page size (and will usually equal it), that
is OK.
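
As a rough illustration of why the set-size comparison is safe, here
is a minimal Python sketch (not the VHDL from this change).  It assumes
the tag is simply the real address divided by the set size, with both
sizes powers of two; the helper names and the example sizes are
hypothetical and not taken from the dcache source.

```python
# Illustrative model only: sizes and helper names are assumptions.
PAGE_SIZE = 4096  # assumed page size in bytes


def get_tag(real_addr, set_size):
    """Address bits above the set size (assumed tag definition)."""
    return real_addr // set_size


def same_tag(addr, reload_addr, set_size):
    """Model of comparing a new request against a saved reload tag."""
    reload_tag = get_tag(reload_addr, set_size)
    return get_tag(addr, set_size) == reload_tag


def same_page(addr_a, addr_b):
    """Page-granularity comparison, for reference."""
    return addr_a // PAGE_SIZE == addr_b // PAGE_SIZE


first_store = 0x12345040
second_store = 0x12345FC0  # same 4 KiB page as first_store

# Set size equal to the page size: both comparisons agree.
agree = same_tag(second_store, first_store, 4096)

# Set size smaller than the page size: the tag comparison is stricter,
# so it can miss a same-page match but never reports a false one.
stricter = same_tag(second_store, first_store, 2048)
```

With a set size no larger than the page size, a tag match always
implies a page match, so the tag comparison can only be more
conservative than the page comparison, never less.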

The same comparison can also be used to tell when we can satisfy
a load miss during a cache line refill.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>