# Single-Issue, In-Order Processor Core

* First steps for a newbie developer [[docs/firststeps]]
* bugreport <http://bugs.libre-riscv.org/show_bug.cgi?id=1039>

The Libre-SOC TestIssuer core
utilises a Finite-State Machine (FSM) to control the fetch/decode/issue/execute
pipelines, with only one pipeline being active at any given time. This is good
for debugging the HDL, but severely restricts performance, as a single
instruction will take tens of clock cycles to complete. An in-order core is in
development (Andrey to research and link to the relevant bugreport), and an
out-of-order core will follow on from that.

A Single-Issue In-Order control unit will allow every pipeline to be active,
and raises the ideal maximum throughput to 1 instruction per clock cycle,
barring any register hazards.

A first version of this control unit has already been written in HDL: it was
written 18 months ago, is in `soc/`, and there are options in the Makefile to
enable it. There is also currently a task to develop a model for the
simulator. The model will be used to determine performance.

Diagram that Luke drew comparing pipelines and FSMs:

[[!img /3d_gpu/pipeline_vs_fsms.jpg size="600x"]]

# The Model
## Brief

* [Bug description](https://bugs.libre-soc.org/show_bug.cgi?id=1039)

The model for the Single-Issue In-Order core needs to be added to the in-house
Python simulator (`ISACaller`, called by `pypowersim`), which will allow basic
*performance estimates*.

For now, this model resides outside the simulator, and
is *completely standalone*.

Eventually, the Cavatools code will be studied in order to extract its power
consumption estimation and re-implement it in Python.

## Task given

* [Bug comment #1](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c1)
* [IRC log](https://libre-soc.org/irclog/%23libre-soc.2023-05-02.log.html#t2023-05-02T10:51:45)

An offline instruction ordering analyser needs to be written that models a
(simple, initially V3.0-only) **in-order core** and gives an estimate of
instructions per clock (IPC).

Hazard protection should be straightforward, using a simple bit vector:

- Take the write result register number: set the corresponding bit.
- For all read registers, check the corresponding bit. If the bit is set, STALL
  (fake/model-stall).
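
The bit-vector check described above can be sketched in a few lines of Python.
This is only an illustration, not the code in the repository: the class name
`HazardVector` and its methods are hypothetical, only one register file (the
GPRs) is covered, and clearing the bit once the write has completed is an
assumption, since the task description only mentions setting it.

```python
class HazardVector:
    """Bit vector of registers with a write in flight (hypothetical helper)."""

    def __init__(self):
        self.pending = 0  # bit N set => register N is still being written

    def set_write(self, reg):
        """Mark the destination register of an issued instruction."""
        self.pending |= 1 << reg

    def clear_write(self, reg):
        """Assumption: the bit is cleared once the result has been written."""
        self.pending &= ~(1 << reg)

    def must_stall(self, read_regs):
        """Return True if any register to be read has its bit set."""
        return any((self.pending >> reg) & 1 for reg in read_regs)


hazards = HazardVector()
hazards.set_write(3)                   # e.g. addi 3, 4, 5 will write r3
stall_before = hazards.must_stall([3])  # an instruction reading r3 must stall
hazards.clear_write(3)                 # the write to r3 has completed
stall_after = hazards.must_stall([3])   # r3 may now be read
```

A plain Python integer is used as the bit vector, so the same sketch works
for any register count without fixing a width up front.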

A stall is defined as a delay in the execution of an instruction in order to
resolve a hazard (e.g. trying to read a register while it is being written to).
See the [Wikipedia article on Pipeline Stall](https://en.wikipedia.org/wiki/Pipeline_stall).

Input should be:

- Instruction with its operands (as assembler listing)
- plus an optional memory-address and whether it is read or written.

The input will come as trace output from the ISACaller simulator
([see bug comments #7-#16](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c7)).
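
As a rough illustration of consuming such input, the sketch below parses lines
in the style of the example listing on this page (`mnemonic operands #PC=...
EA=...`). That layout is an assumption made purely for illustration: the real
ISACaller trace format is the one discussed in the bug comments and may
differ. `TraceEntry` and `parse_trace_line` are hypothetical names, and the
read/write direction of the memory access is not modelled here.

```python
import re
from typing import NamedTuple, Optional


class TraceEntry(NamedTuple):
    mnemonic: str
    operands: list          # operand strings, e.g. ["1", "2(3)"]
    pc: Optional[int]       # program counter, if present
    ea: Optional[int]       # effective memory address, if present


def parse_trace_line(line):
    """Split one assumed-format trace line into its fields."""
    asm, _, comment = line.partition("#")
    mnemonic, _, rest = asm.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",") if op.strip()]
    pc = re.search(r"PC=(\w+)", comment)
    ea = re.search(r"EA=(\w+)", comment)
    return TraceEntry(
        mnemonic,
        operands,
        int(pc.group(1), 0) if pc else None,  # base 0 accepts "16" and "0x10"
        int(ea.group(1), 0) if ea else None,
    )


load = parse_trace_line("ld   1, 2(3)    #PC=16 EA=0x12345678")
add = parse_trace_line("addi 3, 4, 5    #PC=8")
```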

Some classes are needed which "model" the pipeline stages: fetch, decode,
issue, execute.

One global "STALL" flag will cause all buses to stop:

- Tells fetch to stop fetching
- Decode stops (either because it is empty, or because it has an instruction
  whose read registers are being written to).
- Issue stops.
- Execute (pipelines) run as an empty slot (except for the initial instruction
 causing the stall)

Example (PC chosen arbitrarily):

    addi 3, 4, 5    #PC=8
    cmpi 1, 0, 3, 4 #PC=12
    ld   1, 2(3)    #PC=16 EA=0x12345678

The third operand of `cmpi` is the register to use in the comparison, so
register 3 needs to be read. However, `addi` will be writing to this register,
and thus a STALL will occur when `cmpi` is in the decode phase.

The output diagram will look like this:

| clk # |    fetch     |    decode    |   issue      |   execute    |
|:-----:|:------------:|:------------:|:------------:|:------------:|
|   1   | addi 3,4,5   |              |              |              |
|   2   | cmpi 1,0,3,4 | addi 3,4,5   |              |              |
|   3   | STALL        | cmpi 1,0,3,4 | addi 3,4,5   |              |
|   4   | STALL        | cmpi 1,0,3,4 |              | addi 3,4,5   |
|   5   | ld 1,2(3)    |              | cmpi 1,0,3,4 |              |
|   6   |              | ld 1,2(3)    |              | cmpi 1,0,3,4 |
|   7   |              |              | ld 1,2(3)    |              |
|   8   |              |              |              | ld 1,2(3)    |

Explanation:

    1: Fetched addi.
    2: Decoded addi, fetched cmpi.
    3: Issued addi, decoded cmpi, must stall decode phase, stop fetching.
    4: Executed addi, everything else stalled.
    5: Issued cmpi, fetched ld.
    6: Executed cmpi, decoded ld.
    7: Issued ld.
    8: Executed ld.
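
The timing in the table above can be reproduced with a small standalone
model. The sketch below is not the code from the repository: `Insn`, `mask`
and `simulate` are hypothetical names, the read/write register sets are
written by hand rather than derived from a decoder, only GPRs are tracked
(the CR field written by `cmpi` is ignored), and the ordering at the clock
edge (a completed write frees its register before decode decides whether to
advance) is an assumption chosen to match the table.

```python
from typing import NamedTuple


class Insn(NamedTuple):
    text: str
    reads: frozenset    # GPR numbers read
    writes: frozenset   # GPR numbers written


def mask(regs):
    """Turn a set of register numbers into a bit vector."""
    bits = 0
    for reg in regs:
        bits |= 1 << reg
    return bits


def simulate(program):
    """Return one [fetch, decode, issue, execute] row per clock cycle."""
    pending = list(program)                      # not yet fetched
    fetch = decode = issue = execute = None
    rows = []
    while True:
        # Clock edge: the old execute stage completes, issue moves on.
        execute = issue
        busy = mask(execute.writes) if execute is not None else 0
        if decode is not None and mask(decode.reads) & busy:
            issue = None                         # bubble, decode holds
        else:
            issue, decode = decode, fetch        # fetch is None after a stall
        # During the cycle: bits are set by both issue and execute.
        busy = 0
        for stage in (issue, execute):
            if stage is not None:
                busy |= mask(stage.writes)
        stall = decode is not None and bool(mask(decode.reads) & busy)
        fetch = None if (stall or not pending) else pending.pop(0)
        if fetch is None and decode is None and issue is None and execute is None:
            return rows
        cells = ["STALL" if stall else (fetch.text if fetch else "")]
        cells += [s.text if s else "" for s in (decode, issue, execute)]
        rows.append(cells)


PROGRAM = [
    Insn("addi 3,4,5", frozenset({4}), frozenset({3})),
    Insn("cmpi 1,0,3,4", frozenset({3}), frozenset()),   # CR write not modelled
    Insn("ld 1,2(3)", frozenset({3}), frozenset({1})),
]

rows = simulate(PROGRAM)
ipc = len(PROGRAM) / len(rows)   # instructions per clock, including pipeline fill
for clk, row in enumerate(rows, start=1):
    print(clk, row)
```

For this three-instruction example the model needs 8 clock cycles, giving
3/8 = 0.375 instructions per clock. On such a short snippet that figure is
dominated by pipeline fill and the single stall.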

For this initial model, it is assumed that all instructions take one cycle to
execute (not the case for mul/div etc., but this will be dealt with later).

**In-progress TODO**

# Code Explanation

Source code: <https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/cyclemodel>