Make the core go faster
Several major improvements in here:
- Simple branch predictor
- Reduced latency for mispredicted branches and interrupts by removing fetch2 stage
- Cache improvements
o Request critical dword first on refill
o Handle hits while refilling, including on line being refilled
o Sizes doubled for both D and I
- Loadstore improvements: can now do one load or store every two cycles in most cases
- Optimized 2-cycle multiplier for Xilinx 7-series parts using DSP slices
- Timing improvements, including:
o Stash buffer in decode1
o Reduced width of execute1 result mux
o Improved SPR decode in decode1
o Some non-critical operation take a cycle longer so we can break some long combinatorial chains
- Core logging: logs 256 bits of info every cycle into a ring buffer, to help with debugging and performance analysis
This increases the LUT usage for the "synth" + A35 target from 9182 to 10297 = 12%.