# Single-Issue, In-Order Processor Core

* First steps for a newbie developer [[docs/firststeps]]
* bugreport

The Libre-SOC TestIssuer core uses a Finite-State Machine (FSM) to control the fetch/decode/issue/execute pipelines, with only one pipeline active at any given time. This is good for debugging the HDL, but it severely restricts performance, because a single instruction takes tens of clock cycles to complete.

In development (Andrey to research and link to the relevant bugreport) is an in-order core, and following on from that will be an out-of-order core.

A Single-Issue In-Order control unit will allow every pipeline to be active at once, raising the ideal maximum throughput to 1 instruction per clock cycle, barring any register hazards.

An HDL version of this control unit already exists: the first version was written 18 months ago, is in soc/, and there are options in the Makefile to enable it. However, there is currently a task to develop a model for the simulator first. The model will be used to estimate performance.

Diagram that Luke drew comparing pipelines and FSMs:

[[!img /3d_gpu/pipeline_vs_fsms.jpg size="600x"]]

# The Model

## Brief

* [Bug description](https://bugs.libre-soc.org/show_bug.cgi?id=1039)

The model for the Single-Issue In-Order core needs to be added to the in-house Python simulator (`ISACaller`, called by `pypowersim`), which will allow basic *performance estimates*. For now, this model resides outside the simulator and is *completely standalone*.

Eventually, the Cavatools code will be studied in order to extract its power-consumption estimation and re-implement it in Python.

## Task given

* [Bug comment #1](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c1)
* [IRC log](https://libre-soc.org/irclog/%23libre-soc.2023-05-02.log.html#t2023-05-02T10:51:45)

An offline instruction-ordering analyser needs to be written that models a (simple, initially V3.0-only) **in-order core** and gives an estimate of instructions per clock (IPC).
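The throughput comparison made in the introduction (tens of cycles per instruction for the FSM, versus an ideal of 1 instruction per clock for the pipelined core) can be made concrete with a little arithmetic. This sketch assumes four stages (fetch, decode, issue, execute) of one clock cycle each and no hazards, so N instructions drain through in N + 3 cycles and IPC = N / (N + 3), which approaches 1 as N grows. The figure of 20 cycles per instruction for the FSM is a hypothetical stand-in for "tens of clock cycles", and the function names are made up for illustration.

```python
def pipelined_cycles(n_insns: int, n_stages: int = 4) -> int:
    """Cycles for a hazard-free in-order pipeline, one cycle per stage.

    The first instruction needs n_stages cycles to pass through; every
    later instruction completes one cycle after the previous one.
    """
    if n_insns == 0:
        return 0
    return n_stages + (n_insns - 1)


def fsm_cycles(n_insns: int, cycles_per_insn: int) -> int:
    """Cycles for an FSM core that finishes one instruction before starting the next."""
    return n_insns * cycles_per_insn


def ipc(n_insns: int, cycles: int) -> float:
    """Instructions per clock."""
    return n_insns / cycles


if __name__ == "__main__":
    # Hypothetical FSM cost of 20 cycles per instruction.
    for n in (3, 100, 10_000):
        print(n, ipc(n, pipelined_cycles(n)), ipc(n, fsm_cycles(n, 20)))
```

Under these assumptions the FSM core is stuck at IPC = 1/20 regardless of program length, while the pipelined core's IPC is limited only by the initial fill (and, in reality, by hazards and multi-cycle instructions).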
Hazard protection should be straightforward, using a simple bit vector:

- For the register written by the instruction's result: set its bit.
- For all registers read: check the corresponding bit. If the bit is set, STALL (a fake/model stall).

A stall is a delay in the execution of an instruction in order to resolve a hazard (e.g. trying to read a register while it is still being written to). See the [Wikipedia article on Pipeline Stall](https://en.wikipedia.org/wiki/Pipeline_stall).

Input should be:

- an instruction with its operands (as an assembler listing),
- plus an optional memory address and whether it is read or written.

The input will come as trace output from the ISACaller simulator; [see bug comments #7-#16](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c7).

Some classes are needed which "model" the pipeline stages: fetch, decode, issue, execute.

One global "STALL" flag will cause all buses to stop:

- Fetch is told to stop fetching.
- Decode stops (either because it is empty, or because it holds an instruction whose read registers are being written to).
- Issue stops.
- Execute (pipelines) runs empty slots (except for the initial instruction that caused the stall).

Example (PC chosen arbitrarily):

    addi 3, 4, 5     #PC=8
    cmpi 1, 0, 3, 4  #PC=12
    ld 1, 2(3)       #PC=16 EA=0x12345678

The third operand of `cmpi` is the register to be used in the comparison, so register 3 needs to be read. However, `addi` will be writing to this register, and thus a STALL will occur when `cmpi` is in the decode phase. The output diagram will look like this:

| clk # | fetch        | decode       | issue        | execute      |
|:-----:|:------------:|:------------:|:------------:|:------------:|
| 1     | addi 3,4,5   |              |              |              |
| 2     | cmpi 1,0,3,4 | addi 3,4,5   |              |              |
| 3     | STALL        | cmpi 1,0,3,4 | addi 3,4,5   |              |
| 4     | STALL        | cmpi 1,0,3,4 |              | addi 3,4,5   |
| 5     | ld 1,2(3)    |              | cmpi 1,0,3,4 |              |
| 6     |              | ld 1,2(3)    |              | cmpi 1,0,3,4 |
| 7     |              |              | ld 1,2(3)    |              |
| 8     |              |              |              | ld 1,2(3)    |

Explanation:

- 1: Fetched addi.
- 2: Decoded addi, fetched cmpi.
- 3: Issued addi, decoded cmpi; decode must stall, so fetching stops.
- 4: Executed addi, everything else stalled.
- 5: Issued cmpi, fetched ld.
- 6: Executed cmpi, decoded ld.
- 7: Issued ld.
- 8: Executed ld.

For this initial model, it is assumed that all instructions take one cycle to execute (this is not the case for mul/div etc., but that will be dealt with later).

**In-progress TODO**

# Code Explanation

Source code:
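As an illustration only (this is not the project's source code), the model described above can be sketched as a standalone Python script. Assumptions made for the sketch: the mini-decoder `parse` understands only the three example mnemonics; scoreboard bits 0-31 stand for GPRs and bits 32-39 for CR fields (the `CR_BASE` layout is hypothetical); a result register's bit is set when the instruction enters issue and cleared at the end of its single execute cycle; and the STALL flag that gates fetch is evaluated before that clearing, while the decision to let a held decode instruction advance is made after it. These timing choices were picked so that the sketch reproduces the example diagram above.

```python
from dataclasses import dataclass
from typing import List, Tuple

CR_BASE = 32  # hypothetical layout: bits 0-31 = GPRs, bits 32-39 = CR fields


@dataclass
class Insn:
    text: str
    reads: List[int]   # scoreboard bit numbers read
    writes: List[int]  # scoreboard bit numbers written


def parse(line: str) -> Insn:
    """Hypothetical mini-decoder covering only the three example mnemonics."""
    op, args = line.split(None, 1)
    args = args.replace(" ", "")
    f = [int(x) for x in args.replace("(", ",").rstrip(")").split(",")]
    text = f"{op} {args}"
    if op == "addi":   # addi RT,RA,SI (RA=0 meaning "literal 0" is ignored here)
        return Insn(text, [f[1]], [f[0]])
    if op == "cmpi":   # cmpi BF,L,RA,SI: reads RA, writes CR field BF
        return Insn(text, [f[2]], [CR_BASE + f[0]])
    if op == "ld":     # ld RT,DS(RA): reads RA, writes RT
        return Insn(text, [f[2]], [f[0]])
    raise ValueError(f"unsupported mnemonic: {op}")


def mask(regs: List[int]) -> int:
    """Turn a list of register bit numbers into a bit vector."""
    m = 0
    for r in regs:
        m |= 1 << r
    return m


def simulate(program: List[str]) -> List[Tuple[str, str, str, str]]:
    """Return one (fetch, decode, issue, execute) row per clock cycle."""
    insns = [parse(line) for line in program]
    pending = 0          # bit vector: bit set = register write outstanding
    f = d = i = e = None  # contents of fetch, decode, issue, execute
    pc = 0               # index of the next instruction to fetch
    d_ready = True       # may the decode slot advance into issue?
    rows = []
    while True:
        # Shift the pipeline back to front, using last cycle's contents.
        e = i
        if d is None or d_ready:
            i, d, f = d, f, None   # decode advances, fetch slot frees up
        else:
            i = None               # decode held: issue runs an empty slot
        if i is not None:
            pending |= mask(i.writes)      # write result register: set bit
        # Global STALL: decode holds an instruction reading a pending register.
        stall = d is not None and (pending & mask(d.reads)) != 0
        if f is None and not stall and pc < len(insns):
            f, pc = insns[pc], pc + 1
        fetch_txt = f.text if f else ("STALL" if stall else "")
        rows.append((fetch_txt, d.text if d else "",
                     i.text if i else "", e.text if e else ""))
        # End of cycle: execute completes (one cycle per insn), clearing bits.
        if e is not None:
            pending &= ~mask(e.writes)
        # Re-check after clearing: the held instruction may issue next cycle.
        d_ready = d is None or (pending & mask(d.reads)) == 0
        if f is None and d is None and i is None and pc == len(insns):
            return rows


if __name__ == "__main__":
    prog = ["addi 3, 4, 5", "cmpi 1, 0, 3, 4", "ld 1, 2(3)"]
    rows = simulate(prog)
    for clk, row in enumerate(rows, 1):
        print(clk, *row, sep=" | ")
    print(f"IPC = {len(prog)}/{len(rows)} = {len(prog) / len(rows):.3f}")
```

Running it prints the eight rows of the diagram followed by `IPC = 3/8 = 0.375`. Under the same one-cycle-per-stage assumptions, a hazard-free run of three instructions would take 6 cycles (IPC 0.5), so the two stall cycles account for the difference. Feeding real ISACaller traces would need a full decoder in place of `parse`.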