# Single-Issue, In-Order Processor Core

Note: as of the time of writing, this task is 95-98% completed and requires
approximately 10-15 lines of Python code to get it running a first unit test.

* First steps for a newbie developer [[docs/firststeps]]
* bugreport <http://bugs.libre-riscv.org/show_bug.cgi?id=1039>

The Libre-SOC TestIssuer core
utilises a Finite-State Machine (FSM) to control the fetch/decode/issue/execute
Computational Units, with only one such CompUnit (an FSM or a pipeline) being
active at any given time. This is good
for debugging the HDL, but severely restricts performance, as a single
instruction will take tens of clock cycles to complete. In development
(Andrey to research and link to the relevant bugreport) is an in-order
core, and following on from that will be an out-of-order core.

A Single-Issue In-Order control unit (written 12+ months ago) will allow every pipeline to be active,
and raises the ideal maximum throughput to 1 instruction per clock cycle,
barring any register hazards.

A first version of this control unit was written in HDL 12+ months ago: it is
in `soc/`, and there are options in the Makefile to enable it. There is,
however, currently a task to develop the model first. The model will be used to
determine performance.

The diagram below, drawn by Luke, compares pipelines and FSMs. It allows for a transition from FSM to in-order to out-of-order, and also allows "Micro-Coding".

[[!img /3d_gpu/pipeline_vs_fsms.jpg size="600x"]]

# The Model
## Brief

* [Bug description](https://bugs.libre-soc.org/show_bug.cgi?id=1039)

The model for the Single-Issue In-Order core is not part of the in-house
Python simulator (`ISACaller`, called by `pypowersim`). Instead, `pypowersim`
*outputs an execution trace log* which **after the fact** may be passed to
**any** model, of which the in-order model is **just the very first**.
The model uses that log to produce basic *performance estimates*.

The model resides outside the simulator, and
is *completely standalone* **and will ALWAYS remain standalone**.

A subtask, to be carried out **as incremental development**,
is that the avatools source code will need to be studied to extract
power consumption estimation and add that into the in-order model.


## Task given

* [Bug comment #1](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c1)
* [IRC log](https://libre-soc.org/irclog/%23libre-soc.2023-05-02.log.html#t2023-05-02T10:51:45)

The offline instruction ordering analyser needs to be **COMPLETED**
(it is currently 98% complete). It models a
(simple, initially V3.0-only) **in-order core** and gives an estimate of
instructions per clock (IPC).

Hazard Protection, **which is already completed**, is not a simple bit vector.
It is a "length of pipeline countdown until result is ready", which models the
clock cycles needed in the actual pipeline(s). The only "bit" involved is
whether there is an entry in the Python `set()` for a given register.

- Take the write result register number: add num-cycles-until-ready to the `set()`.
- For all read registers, call the function that checks if there is an entry in
the Python `set()` of expected outstanding results to be written. If there is,
STALL (fake/model-stall).
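
The countdown described above can be sketched as follows. This is a minimal
illustration only, not the code in `cyclemodel`: the `PendingWrites` class, its
method names and the use of a `dict` keyed by register are all assumptions made
for this example.

```python
class PendingWrites:
    """Hypothetical tracker: register -> clock cycles until result is ready."""

    def __init__(self):
        self.pending = {}

    def expect_write(self, reg, cycles_until_ready):
        # record that `reg` will be written in `cycles_until_ready` cycles
        self.pending[reg] = cycles_until_ready

    def must_stall(self, read_regs):
        # stall if any register to be read still has an outstanding write
        return any(reg in self.pending for reg in read_regs)

    def tick(self):
        # one clock cycle passes: count down, drop entries that reach zero
        self.pending = {reg: n - 1 for reg, n in self.pending.items() if n > 1}


hazards = PendingWrites()
hazards.expect_write(("GPR", 3), 2)              # addi 3, 4, 5 writes GPR 3
stall_now = hazards.must_stall([("GPR", 3)])     # cmpi wants to read GPR 3
hazards.tick()
hazards.tick()
stall_later = hazards.must_stall([("GPR", 3)])   # result has been written
print(stall_now, stall_later)
```

The 2-cycle latency used here is arbitrary; in a fuller model it would be
chosen per pipeline.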

A stall is defined as a delay in execution of an instruction in order to
resolve a hazard (i.e. trying to read a register while it is being written to).
See the [wikipedia article on Pipeline Stall](https://en.wikipedia.org/wiki/Pipeline_stall)

Input (98% completed):

- Instruction with its operands (as assembler listing)
- plus an optional memory-address and whether it is read or written.

The input will come as a trace output from the ISACaller simulator,
[see bug comments #7-#16](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c7)

The classes which "model" the pipeline stages (fetch, decode, issue,
execute) have already been written.

One global "STALL" flag will cause all buses to stop:

- Tells fetch to stop fetching
- Decode stops (either because it is empty, or because it has an instruction
whose read registers are being written to).
- Issue stops.
- Execute (pipelines) run as an empty slot (except for the initial instruction
 causing the stall)

Example (PC chosen arbitrarily):

    addi 3, 4, 5    #PC=8
    cmpi 1, 0, 3, 4 #PC=12
    ld   1, 2(3)    #PC=16 EA=0x12345678

The third operand of `cmpi` is the register to use in the comparison, so
register 3 needs to be read. However, `addi` will be writing to this register,
and thus a STALL will occur when `cmpi` is in the decode phase.

The output diagram will look like this:

TODO, move this to a separate file then *include it twice*, once with triple-quotes
and once without.  grep "inline raw=yes" for examples on how to include in mdwn

```
| clk # |    fetch     |    decode    |   issue      |   execute    |
|:-----:|:------------:|:------------:|:------------:|:------------:|
|   1   | addi 3,4,5   |              |              |              |
|   2   | cmpi 1,0,3,4 | addi 3,4,5   |              |              |
|   3   | STALL        | cmpi 1,0,3,4 | addi 3,4,5   |              |
|   4   | STALL        | cmpi 1,0,3,4 |              | addi 3,4,5   |
|   5   | ld 1,2(3)    |              | cmpi 1,0,3,4 |              |
|   6   |              | ld 1,2(3)    |              | cmpi 1,0,3,4 |
|   7   |              |              | ld 1,2(3)    |              |
|   8   |              |              |              | ld 1,2(3)    |
```

Explanation:

    1: Fetched addi.
    2: Decoded addi, fetched cmpi.
    3: Issued addi, decoded cmpi, must stall decode phase, stop fetching.
    4: Executed addi, everything else stalled.
    5: Issued cmpi, fetched ld.
    6: Executed cmpi, decoded ld.
    7: Issued ld.
    8: Executed ld.

For this initial model, it is assumed that all instructions take one cycle to
execute (not the case for mul/div etc., but this will be dealt with later).
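
As a worked example of the IPC estimate, the figures from the table above can
be plugged in directly. The `execute_cycle` dictionary below is just a
hand-copied illustration of the clock cycle in which each instruction reaches
the execute stage; it is not output from the model.

```python
# clock cycle in which each instruction is in the execute stage (from the table)
execute_cycle = {
    "addi 3,4,5": 4,
    "cmpi 1,0,3,4": 6,
    "ld 1,2(3)": 8,
}

total_cycles = max(execute_cycle.values())   # last instruction executes at clk 8
ipc = len(execute_cycle) / total_cycles      # 3 instructions / 8 cycles
print(f"{len(execute_cycle)} instructions in {total_cycles} cycles: IPC = {ipc}")
```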

**In-progress TODO**

# Code Explanation - *IN PROGRESS*

*(Not all of the code has been explained, just the general classes.)*

Source code: <https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/cyclemodel>

## `Hazard` namedtuple data structure

A `namedtuple` object stores the attributes of a register access. The
Python `namedtuple` is immutable (like a normal tuple), while also allowing
elements to be accessed by predefined names. Immutability is useful because the
register access attributes won't change between the fetch and execute stages,
which is why a normal `list` or `dict` wouldn't be appropriate.

A `namedtuple` also has a fixed field order (the initially
defined order is preserved). See the
[Python documentation on `namedtuple`](https://docs.python.org/3.7/library/collections.html#collections.namedtuple),
[online namedtuple tutorial](https://realpython.com/python-namedtuple/),
[sta].

`namedtuple` instances can also be stored in sets, which is exactly how it is
used with the `RegisterWrite` class. One instruction trace may contain zero or
more `Hazard` register access objects (depending on whether registers are
needed for the instruction).
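
A short sketch of these properties, using the field names that appear in the
example trace on this page (`action`, `target`, `ident`, `offs`, `elwid`). The
values are made up for illustration.

```python
from collections import namedtuple

# field names as shown in the example trace on this page
Hazard = namedtuple("Hazard", ["action", "target", "ident", "offs", "elwid"])

h = Hazard._make("w:GPR:1:0:64".split(":"))
print(h.action, h.target, h.ident)   # fields are accessed by name

# immutable: assigning to a field raises AttributeError
immutable = False
try:
    h.ident = "2"
except AttributeError:
    immutable = True

# hashable, so instances can be stored in a set; equal entries collapse
pending = {h, Hazard("w", "GPR", "1", "0", "64")}
print(immutable, len(pending))
```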

## `HazardProfiles`

A dictionary of currently supported register file types. Each entry (register
file type) defines the number of read and write ports, written as a tuple, with
the first entry being the number of read ports, and second entry being the
number of write ports.

Having multiple read and/or write ports means that multiple **different**
entries in the same register file can be read from and/or written to in the
same clock cycle.
This doesn't prevent a stall if the same register entry is used
by a consecutive instruction, even if a spare port is available
(Read-after-Write hazard).
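
A sketch of the `(read ports, write ports)` layout described above. The port
counts and the `ports_exceeded` helper are hypothetical, chosen only to show
the shape of the data; the real values are in the `cyclemodel` source.

```python
# Hypothetical port counts, for illustration only: (read ports, write ports)
example_profiles = {
    "GPR": (4, 1),
    "FPR": (3, 1),
}


def ports_exceeded(profiles, regfile, n_reads, n_writes):
    """True if more accesses are requested in one cycle than ports exist."""
    read_ports, write_ports = profiles[regfile]
    return n_reads > read_ports or n_writes > write_ports


print(ports_exceeded(example_profiles, "GPR", 2, 1))   # within the port limits
print(ports_exceeded(example_profiles, "GPR", 2, 2))   # one write too many
```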

## Parsing trace file dump using `read_file` function

The `CPU` model class takes as input a single instruction trace `list` object.

This trace `list` object is produced by the function
`read_file`, which itself reads an instruction trace file from the modified
`ISACaller` ([link to code needed](LINK)).
From now on, the trace `list` object will simply be referred to as `trace`.

Each line of the trace dump is of the form
`[{rw}:FILE:regnum:offset:width]* # insn` where:

- `rw` is the access type: `r` if the register is read (operands), or `w` if it
is written (to store result, condition codes, etc.).
- `FILE` is the register file type (GPR/integer, FPR/floating-point, etc. see
Additional Information section at the end of this page).
*(TODO: use section reference link instead)*.
- `regnum` is the register number
- `offset` *TODO: Perhaps the offset of data in bytes??? no idea (right now not
important, as examples all show 0 offset)*
- `width` is the length of the data in bits to be accessed from the register.
- `insn` is the full instruction written in PowerISA assembler.

The block `[{rw}:FILE:regnum:offset:width]` is used zero or more times,
based on the total number of read and write registers used for the instruction.

Example trace file with three instructions:

    r:GPR:0:0:64 w:GPR:1:0:64              # addi 1, 0, 0x0010
    r:GPR:0:0:64 w:GPR:2:0:64              # addi 2, 0, 0x1234
    r:GPR:1:0:64 r:GPR:2:0:64              # stw 2, 0(1)

The instruction trace file is processed line by line, where each line is split
into the register access attributes (from which a new namedtuple is created using
`_make()` and the `Hazard` definition; see the
[Python documentation on the _make() method](https://docs.python.org/3.7/library/collections.html#collections.somenamedtuple._make)).

Each line is converted to a `trace` object of the form:
`[insn, Hazard(...), Hazard(...), ...]`. An example trace looks like this:

    ['addi 1, 0, 0x0010',
     Hazard(action='r', target='GPR', ident='0', offs='0',elwid='64'),
     Hazard(action='w', target='GPR', ident='1', offs='0', elwid='64')]

The function `read_file` yields (see [python wiki on yield]()) a single `trace`
for each line of the trace file. To produce a full list of
traces, call `read_file` with the filename of the
`ISACaller` instruction trace dump and pass the result to `list()`. The
resulting list of `trace` objects is then ready to be iterated over for the CPU
model.
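
The parsing steps above can be sketched as below. This is not the actual
`read_file`: the helpers `parse_line` and `read_lines` are hypothetical
stand-ins that work on a list of strings instead of a file, so that the example
is self-contained.

```python
from collections import namedtuple

Hazard = namedtuple("Hazard", ["action", "target", "ident", "offs", "elwid"])


def parse_line(line):
    """Hypothetical helper: turn one trace line into [insn, Hazard, ...]."""
    accesses, _, insn = line.partition("#")
    trace = [insn.strip()]
    for access in accesses.split():
        # each access is rw:FILE:regnum:offset:width
        trace.append(Hazard._make(access.split(":")))
    return trace


def read_lines(lines):
    """Generator yielding one trace per non-empty line."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


dump = [
    "r:GPR:0:0:64 w:GPR:1:0:64              # addi 1, 0, 0x0010",
    "r:GPR:1:0:64 r:GPR:2:0:64              # stw 2, 0(1)",
]
traces = list(read_lines(dump))   # list() collects everything the generator yields
for trace in traces:
    print(trace)
```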

## RegisterWrite

A class which is based on a Python set, and is used to keep track of current
registers used for writing (for detecting Read-after-Write Hazards).

A Python set (see the [Python tutorial on sets](https://docs.python.org/3.7/tutorial/datastructures.html#sets))
is an unordered collection with **no duplicate elements**.

By checking if the next instruction's read registers match any of the write
registers in the `RegisterWrite` set, the model can raise a STALL.

Anything in the set **MUST STALL** at the Decode phase because the
currently issued/executed instruction's result has not been written to the
register/s needed for the consecutive instruction.

### Methods

    def __init__(self):
        self.storage = set()

Initialise `RegisterWrite` set.

    def expect_write(self, regs):
        return self.storage.update(regs)

If there are new registers to be written to, add them to the current
`RegisterWrite` set.

    def write_expected(self, regs):
        return (len(self.storage.intersection(regs)) != 0)

Returns `True` if any of the given (read) registers is still waiting to be
written by a previous instruction, i.e. if a stall is required.

    def retire_write(self, regs):
        return self.storage.difference_update(regs)

Remove the given registers from the `RegisterWrite` set once their writes have
been completed.
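
Putting the methods together, a usage sketch might look like this. The class
body follows the methods listed above; identifying registers by
`(file, number)` tuples is an assumption made for the example.

```python
class RegisterWrite:
    def __init__(self):
        self.storage = set()

    def expect_write(self, regs):
        # registers whose results are still outstanding
        self.storage.update(regs)

    def write_expected(self, regs):
        # True if any of `regs` is still waiting to be written
        return len(self.storage.intersection(regs)) != 0

    def retire_write(self, regs):
        # the writes have happened: forget about them
        self.storage.difference_update(regs)


regs = RegisterWrite()
regs.expect_write({("GPR", "3")})                   # addi 3, 4, 5 will write GPR 3
stall_cmpi = regs.write_expected({("GPR", "3")})    # cmpi 1, 0, 3, 4 reads GPR 3
regs.retire_write({("GPR", "3")})                   # addi result written back
stall_after = regs.write_expected({("GPR", "3")})
print(stall_cmpi, stall_after)
```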

## `get_input_regs` and `get_output_regs` functions


## CPU class

The `CPU` class models the in-order, single-issue core. It contains the
`RegisterWrite` set for tracking Read-after-Write Hazards, the fetch, decode,
issue, and execute stages, and a `stall` flag indicating whether the CPU is
currently stalled.

The input to the model is a trace `list` object.

The main method used during the running of the model is
`process_instructions()`, which is called every time an instruction trace
`list` object is read from a trace file.

### Methods

    def __init__(self):
        self.regs = RegisterWrite()
        self.fetch = Fetch(self)
        self.decode = Decode(self)
        self.issue = Issue(self)
        self.exe = Execute(self)
        self.stall = False

    def reads_possible(self, regs):
        # TODO: subdivide this down by GPR FPR CR-field.
        # currently assumes total of 3 regs are readable at one time
        possible = set()
        r = regs.copy()
        while len(possible) < 3 and len(r) > 0:
            possible.add(r.pop())
        return possible

    def writes_possible(self, regs):
        # TODO: subdivide this down by GPR FPR CR-field.
        # currently assumes total of 1 reg is possible regardless of what it is
        possible = set()
        r = regs.copy()
        while len(possible) < 1 and len(r) > 0:
            possible.add(r.pop())
        return possible

    def process_instructions(self):
        stall = self.stall
        stall = self.fetch.process_instructions(stall)
        stall = self.decode.process_instructions(stall)
        stall = self.issue.process_instructions(stall)
        stall = self.exe.process_instructions(stall)
        self.stall = stall
        if not stall:
            self.fetch.tick()
            self.decode.tick()
            self.issue.tick()
            self.exe.tick()
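
The port-limit logic in `reads_possible` and `writes_possible` follows the same
pattern with a different limit. The sketch below generalises it into one
hypothetical function, `ports_possible`, to show how a stall arises when more
writes are requested than there are ports. Register names are plain strings
here purely for illustration.

```python
def ports_possible(regs, limit):
    """Pick at most `limit` registers from `regs` (hypothetical helper)."""
    possible = set()
    remaining = set(regs)          # copy, so the caller's set is untouched
    while len(possible) < limit and remaining:
        possible.add(remaining.pop())
    return possible


to_write = {"GPR:1", "CR:0"}
writable = ports_possible(to_write, 1)   # assume a single write port
must_stall = writable != to_write        # one write has to wait for a later cycle
print(len(writable), must_stall)
```

Which register is picked first is arbitrary, because sets are unordered.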

## Execute class

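The `Execute` class models the execute phase of the processor.
It contains a list, `stages`, indexed by the number of clock cycles remaining
until completion. Each entry is itself a list of instructions (together with
their outstanding write registers) due to complete in that cycle.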

### Methods

    def __init__(self, cpu):
        self.stages = []
        self.cpu = cpu

    def add_stage(self, cycles_away, stage):
        # extend the list until index `cycles_away` exists
        while cycles_away >= len(self.stages):
            self.stages.append([])
        self.stages[cycles_away].append(stage)

    def add_instruction(self, insn, writeregs):
        self.add_stage(2, {'insn': insn, 'writes': writeregs})

    def tick(self):
        self.stages.pop(0) # tick drops anything at time "zero"

    def process_instructions(self, stall):
        instructions = self.stages[0] # get list of instructions
        to_write = set()              # need to know total writes
        for instruction in instructions:
            to_write.update(instruction['writes'])
        # see if all writes can be done, otherwise stall
        writes_possible = self.cpu.writes_possible(to_write)
        if writes_possible != to_write:
            stall = True
        # retire the writes that are possible in this cycle (regfile writes)
        self.cpu.regs.retire_write(writes_possible)
        # and now go through the instructions, removing those regs written
        for instruction in instructions:
            instruction['writes'].difference_update(writes_possible)
        return stall
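
To see the countdown in action, the sketch below runs a cut-down `Execute`
against a stub CPU. `StubRegs` and `StubCPU` are hypothetical stand-ins with a
single write port, and the guards against an empty `stages` list are additions
made so the example runs on its own.

```python
class StubRegs:
    def __init__(self):
        self.storage = set()

    def retire_write(self, regs):
        self.storage.difference_update(regs)


class StubCPU:
    """Hypothetical stand-in for CPU: one write port."""
    def __init__(self):
        self.regs = StubRegs()

    def writes_possible(self, regs):
        r = set(regs)
        return {r.pop()} if r else set()


class Execute:
    def __init__(self, cpu):
        self.stages = []
        self.cpu = cpu

    def add_stage(self, cycles_away, stage):
        while cycles_away >= len(self.stages):
            self.stages.append([])
        self.stages[cycles_away].append(stage)

    def tick(self):
        if self.stages:                 # guard added for this sketch
            self.stages.pop(0)

    def process_instructions(self, stall):
        instructions = self.stages[0] if self.stages else []
        to_write = set()
        for instruction in instructions:
            to_write.update(instruction['writes'])
        writes_possible = self.cpu.writes_possible(to_write)
        if writes_possible != to_write:
            stall = True
        self.cpu.regs.retire_write(writes_possible)
        for instruction in instructions:
            instruction['writes'].difference_update(writes_possible)
        return stall


cpu = StubCPU()
exe = Execute(cpu)
cpu.regs.storage.add("GPR:3")           # write to GPR 3 is outstanding
exe.add_stage(2, {'insn': 'addi 3, 4, 5', 'writes': {"GPR:3"}})

history = []                            # is GPR 3 still outstanding each cycle?
for cycle in range(3):
    stall = exe.process_instructions(False)
    history.append("GPR:3" in cpu.regs.storage)
    if not stall:
        exe.tick()
print(history)
```

The write is retired only when the instruction reaches the front of `stages`,
two ticks after it was added.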

# Additional Information

## On register file types

Currently (20th Aug 2023), the following register files are included in the CPU
model:

- General Purpose Registers (GPR) - stores integers (0-31 in default PowerISA,
0-127 for Libre-SOC with SVP64)
- Floating Point Registers (FPR) - stores floating-point numbers
- Condition Register (CR) - broken up into 4-bit fields
- Condition Register Fields (CRf) - stores arithmetic condition of an operation
(less than, greater than, equal to zero, overflow)
- Fixed-Point Exception Register (XER)
- Machine State Register (MSR)
- Floating-Point Status and Control Register (FPSCR)
- Program Counter (PC); the PowerISA spec primarily calls this *Current
Instruction Address (CIA)*. See PowerISA v3.1, section 1.3.4 Description of
Instruction Operation
- Slow Special Purpose Registers (SPRs)
- Fast SPR (SPRf)

*TODO: Special Purpose Registers and fields need better explanation. The initial
writer of this page (Andrey) has very little understanding of whether SPR is
actually a register, or if it's just a category of registers (XER, etc.)*

See the [PowerISA 3.1 spec](LINK) for detailed information on register files
(Book I, Chapters 1.3.4, 2.3, 3.2, 4.2, 5.2, 5.3).