[gem5.git] / src / doc / inside-minor.doxygen
# Copyright (c) 2014 ARM Limited
# All rights reserved
#
# The license below extends only to copyright in the software and shall
# not be construed as granting a license to any other intellectual
# property including but not limited to intellectual property relating
# to a hardware implementation of the functionality of the software
# licensed hereunder. You may use the software subject to the license
# terms below provided that you ensure that this notice is replicated
# unmodified and in its entirety in all distributions of the software,
# modified or unmodified, in source code or in binary form.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met: redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer;
# redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution;
# neither the name of the copyright holders nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Authors: Andrew Bardsley

namespace Minor
{

/*!

\page minor Inside the Minor CPU model

\tableofcontents

This document contains a description of the structure and function of the
Minor gem5 in-order processor model. It is recommended reading for anyone who
wants to understand Minor's internal organisation, design decisions, C++
implementation and Python configuration. A familiarity with gem5 and some of
its internal structures is assumed. This document is meant to be read
alongside the Minor source code and to explain its general structure without
being too slavish about naming every function and data type.

\section whatis What is Minor?

Minor is an in-order processor model with a fixed pipeline but configurable
data structures and execute behaviour. It is intended to be used to model
processors with strict in-order execution behaviour and allows visualisation
of an instruction's position in the pipeline through the
MinorTrace/minorview.py format/tool. The intention is to provide a framework
for micro-architecturally correlating the model with a particular, chosen
processor with similar capabilities.

\section philo Design philosophy

\subsection mt Multithreading

The model isn't currently capable of multithreading but there are THREAD
comments in key places where stage data needs to be arrayed to support
multithreading.

\subsection structs Data structures

Decorating data structures with large amounts of life-cycle information is
avoided. Only instructions (MinorDynInst) contain a significant proportion of
their data content whose values are not set at construction.

All internal structures have fixed sizes on construction. Data held in queues
and FIFOs (MinorBuffer, FUPipeline) should have a BubbleIF interface to
allow a distinct 'bubble'/no data value option for each type.

Inter-stage 'struct' data is packaged in structures which are passed by value.
Only MinorDynInst, the line data in ForwardLineData and the memory-interfacing
objects Fetch1::FetchRequest and LSQ::LSQRequest are '::new' allocated while
running the model.

\section model Model structure

Objects of class MinorCPU are provided by the model to gem5. MinorCPU
implements the interfaces of BaseCPU (cpu.hh) and can provide data and
instruction interfaces for connection to a cache system. The model is
configured in a similar way to other gem5 models through Python. That
configuration is passed on to MinorCPU::pipeline (of class Pipeline) which
actually implements the processor pipeline.

The hierarchy of major unit ownership from MinorCPU down looks like this:

<ul>
<li>MinorCPU</li>
<ul>
<li>Pipeline - container for the pipeline, owns the cyclic 'tick'
event mechanism and the idling (cycle skipping) mechanism.</li>
<ul>
<li>Fetch1 - instruction fetch unit responsible for fetching cache
lines (or parts of lines) from the I-cache interface</li>
<ul>
<li>Fetch1::IcachePort - interface to the I-cache from
Fetch1</li>
</ul>
<li>Fetch2 - line to instruction decomposition</li>
<li>Decode - instruction to micro-op decomposition</li>
<li>Execute - instruction execution and data memory
interface</li>
<ul>
<li>LSQ - load store queue for memory ref. instructions</li>
<li>LSQ::DcachePort - interface to the D-cache from
Execute</li>
</ul>
</ul>
</ul>
</ul>

\section keystruct Key data structures

\subsection ids Instruction and line identity: InstId (dyn_inst.hh)

An InstId contains the sequence numbers and thread numbers that describe the
life cycle and instruction stream affiliations of individual fetched cache
lines and instructions.

An InstId is printed in one of the following forms:

- T/S.P/L - for fetched cache lines
- T/S.P/L/F - for instructions before Decode
- T/S.P/L/F.E - for instructions from Decode onwards

for example:

- 0/10.12/5/6.7

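The printed forms above can be sketched as a small formatting routine. The
class below is an illustrative stand-in, not the real InstId from
dyn_inst.hh; it assumes a sequence number of 0 means "not yet assigned":

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Illustrative stand-in for Minor's InstId (dyn_inst.hh).
// A sequence number of 0 is taken to mean "not yet assigned".
struct InstId
{
    uint16_t threadId = 0;         // T
    uint64_t streamSeqNum = 0;     // S
    uint64_t predictionSeqNum = 0; // P
    uint64_t lineSeqNum = 0;       // L
    uint64_t fetchSeqNum = 0;      // F
    uint64_t execSeqNum = 0;       // E

    // Print T/S.P/L for lines, .../F before Decode, .../F.E afterwards.
    std::string str() const
    {
        std::ostringstream os;
        os << threadId << '/' << streamSeqNum << '.'
           << predictionSeqNum << '/' << lineSeqNum;
        if (fetchSeqNum) {
            os << '/' << fetchSeqNum;
            if (execSeqNum)
                os << '.' << execSeqNum;
        }
        return os.str();
    }
};
```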
InstId's fields are:

<table>
<tr>
<td><b>Field</b></td>
<td><b>Symbol</b></td>
<td><b>Generated by</b></td>
<td><b>Checked by</b></td>
<td><b>Function</b></td>
</tr>

<tr>
<td>InstId::threadId</td>
<td>T</td>
<td>Fetch1</td>
<td>Everywhere the thread number is needed</td>
<td>Thread number (currently always 0).</td>
</tr>

<tr>
<td>InstId::streamSeqNum</td>
<td>S</td>
<td>Execute</td>
<td>Fetch1, Fetch2, Execute (to discard lines/insts)</td>
<td>Stream sequence number as chosen by Execute. Stream
sequence numbers change after changes of PC (branches, exceptions) in
Execute and are used to separate pre and post branch instruction
streams.</td>
</tr>

<tr>
<td>InstId::predictionSeqNum</td>
<td>P</td>
<td>Fetch2</td>
<td>Fetch2 (while discarding lines after prediction)</td>
<td>Prediction sequence numbers represent branch prediction decisions.
This is used by Fetch2 to mark lines/instructions according to the last
followed branch prediction made by Fetch2. Fetch2 can signal to Fetch1
that it should change its fetch address and mark lines with a new
prediction sequence number (which it will only do if the stream sequence
number Fetch1 expects matches that of the request).</td>
</tr>

<tr>
<td>InstId::lineSeqNum</td>
<td>L</td>
<td>Fetch1</td>
<td>(Just for debugging)</td>
<td>Line fetch sequence number of this cache line or the line
this instruction was extracted from.
</td>
</tr>

<tr>
<td>InstId::fetchSeqNum</td>
<td>F</td>
<td>Fetch2</td>
<td>Fetch2 (as the inst. sequence number for branches)</td>
<td>Instruction fetch order assigned by Fetch2 when lines
are decomposed into instructions.
</td>
</tr>

<tr>
<td>InstId::execSeqNum</td>
<td>E</td>
<td>Decode</td>
<td>Execute (to check instruction identity in queues/FUs/LSQ)</td>
<td>Instruction order after micro-op decomposition.</td>
</tr>

</table>

The sequence number fields are all independent of each other and although, for
instance, InstId::execSeqNum for an instruction will always be >=
InstId::fetchSeqNum, the comparison is not useful.

The originating stage of each sequence number field keeps a counter for that
field which can be incremented in order to generate new, unique numbers.

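A minimal sketch of such a per-stage counter (the class name here is
hypothetical, not a real Minor type):

```cpp
#include <cstdint>

// Hypothetical sketch: each originating stage owns a counter like this
// and stamps newly created lines/instructions with the next unique
// sequence number. 0 is reserved for "not yet assigned".
class SeqNumSource
{
    uint64_t nextNum = 1;
  public:
    // Hand out a new, unique sequence number.
    uint64_t issue() { return nextNum++; }
    // The most recently issued number.
    uint64_t lastIssued() const { return nextNum - 1; }
};
```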
\subsection insts Instructions: MinorDynInst (dyn_inst.hh)

MinorDynInst represents an instruction's progression through the pipeline. An
instruction can be three things:

<table>
<tr>
<td><b>Thing</b></td>
<td><b>Predicate</b></td>
<td><b>Explanation</b></td>
</tr>
<tr>
<td>A bubble</td>
<td>MinorDynInst::isBubble()</td>
<td>no instruction at all, just a space-filler</td>
</tr>
<tr>
<td>A fault</td>
<td>MinorDynInst::isFault()</td>
<td>a fault to pass down the pipeline in an instruction's clothing</td>
</tr>
<tr>
<td>A decoded instruction</td>
<td>MinorDynInst::isInst()</td>
<td>instructions are actually passed to the gem5 decoder in Fetch2 and so
are created fully decoded. MinorDynInst::staticInst is the decoded
instruction form.</td>
</tr>
</table>

Instructions are reference counted using the gem5 RefCountingPtr
(base/refcnt.hh) wrapper. They therefore usually appear as MinorDynInstPtr in
code. Note that as RefCountingPtr initialises as nullptr rather than an
object that supports BubbleIF::isBubble, passing raw MinorDynInstPtrs to
Queue%s and other similar structures from stage.hh without boxing is
dangerous.

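The hazard can be sketched with a boxing wrapper that answers isBubble even
for a null pointer. This is an illustrative sketch using std::shared_ptr as a
stand-in for RefCountingPtr; the type names are hypothetical:

```cpp
#include <memory>

// Illustrative stand-in for a MinorDynInst-like payload.
struct Inst
{
    bool bubble = false;
    bool isBubble() const { return bubble; }
};

// Stand-in for MinorDynInstPtr (really a RefCountingPtr).
using InstPtr = std::shared_ptr<Inst>;

// Boxing wrapper: a null pointer is treated as a bubble instead of
// being dereferenced, so queue code can always call isBubble() safely.
struct BoxedInst
{
    InstPtr inst;
    bool isBubble() const { return !inst || inst->isBubble(); }
    static BoxedInst bubble() { return BoxedInst{nullptr}; }
};
```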
\subsection fld ForwardLineData (pipe_data.hh)

ForwardLineData is used to pass cache lines from Fetch1 to Fetch2. Like
MinorDynInst%s, they can be bubbles (ForwardLineData::isBubble()),
fault-carrying or can contain a line (partial line) fetched by Fetch1. The
data carried by ForwardLineData is owned by a Packet object returned from
memory, is explicitly memory managed, and so must be deleted once processed
(by Fetch2 deleting the Packet).

\subsection fid ForwardInstData (pipe_data.hh)

ForwardInstData can contain up to ForwardInstData::width() instructions in its
ForwardInstData::insts vector. This structure is used to carry instructions
between Fetch2, Decode and Execute and to store input buffer vectors in Decode
and Execute.

\subsection fr Fetch1::FetchRequest (fetch1.hh)

FetchRequests represent I-cache line fetch requests. They are used in the
memory queues of Fetch1 and are pushed into/popped from Packet::senderState
while traversing the memory system.

FetchRequests contain a memory system Request (mem/request.hh) for that fetch
access, a packet (Packet, mem/packet.hh), if the request gets to memory, and a
fault field that can be populated with a TLB-sourced prefetch fault (if any).

\subsection lsqr LSQ::LSQRequest (execute.hh)

LSQRequests are similar to FetchRequests but for D-cache accesses. They carry
the instruction associated with a memory access.

\section pipeline The pipeline

\verbatim
------------------------------------------------------------------------------
Key:

[] : inter-stage MinorBuffer
,--.
|  | : pipeline stage
`--'
---> : forward communication
<--- : backward communication

rv : reservation information for input buffers

              ,------.     ,------.     ,------.     ,-------.
(from --[]-v->|Fetch1|-[]->|Fetch2|-[]->|Decode|-[]->|Execute|--> (to Fetch1
Execute)   |  |      |<-[]-|      |<-rv-|      |<-rv-|       |    & Fetch2)
           |  `------'<-rv-|      |     |      |     |       |
           `-------------->|      |     |      |     |       |
                           `------'     `------'     `-------'
------------------------------------------------------------------------------
\endverbatim

The four pipeline stages are connected together by MinorBuffer FIFO
(stage.hh, derived ultimately from TimeBuffer) structures which allow
inter-stage delays to be modelled. There is a MinorBuffer between adjacent
stages in the forward direction (for example: passing lines from Fetch1 to
Fetch2) and, between Fetch2 and Fetch1, a buffer in the backwards direction
carrying branch predictions.

Stages Fetch2, Decode and Execute have input buffers which, each cycle, can
accept input data from the previous stage and can hold that data if the stage
is not ready to process it. Input buffers store data in the same form as it
is received and so Decode and Execute's input buffers contain the output
instruction vector (ForwardInstData (pipe_data.hh)) from their previous stages
with the instructions and bubbles in the same positions as a single buffer
entry.

Stage input buffers provide a Reservable (stage.hh) interface to their
previous stages, to allow slots to be reserved in their input buffers, and
communicate their input buffer occupancy backwards to allow the previous stage
to plan whether it should make an output in a given cycle.

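The Reservable idea can be sketched as follows (a hypothetical simplification
of stage.hh, not the real interface): the sender reserves a slot before
producing output, so the buffer can never overflow.

```cpp
#include <cstddef>

// Hypothetical sketch of a Reservable input buffer: the previous stage
// reserves a slot before sending, and occupancy counts both filled and
// reserved slots so the sender can plan its output.
class ReservableBuffer
{
    std::size_t capacity;
    std::size_t occupied = 0;
    std::size_t reserved = 0;
  public:
    explicit ReservableBuffer(std::size_t cap) : capacity(cap) {}

    bool canReserve() const { return occupied + reserved < capacity; }
    void reserve() { if (canReserve()) ++reserved; }

    // The sender later fills the slot it reserved.
    void push() { --reserved; ++occupied; }
    // The owning stage consumes an entry.
    void pop() { --occupied; }

    // Occupancy communicated backwards to the previous stage.
    std::size_t occupancy() const { return occupied + reserved; }
};
```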
\subsection events Event handling: MinorActivityRecorder (activity.hh,
pipeline.hh)

Minor is essentially a cycle-callable model with some ability to skip cycles
based on pipeline activity. External events are mostly received by callbacks
(e.g. Fetch1::IcachePort::recvTimingResp) and cause the pipeline to be woken
up to service advancing request queues.

Ticked (sim/ticked.hh) is a base class bringing together an evaluate
member function and a provided SimObject. It provides a Ticked::start/stop
interface to start and pause clock events from being periodically issued.
Pipeline is a derived class of Ticked.

During evaluate calls, stages can signal that they still have work to do in
the next cycle by calling either MinorCPU::activityRecorder->activity() (for
non-callable related activity) or MinorCPU::wakeupOnEvent(<stageId>) (for
stage callback-related 'wakeup' activity).

Pipeline::evaluate contains calls to evaluate for each unit and a test for
pipeline idling which can turn off the clock tick if no unit has signalled
that it may become active next cycle.

Within Pipeline (pipeline.hh), the stages are evaluated in reverse order (and
so will ::evaluate in reverse order) and their backwards data can be
read immediately after being written in each cycle allowing output decisions
to be 'perfect' (allowing synchronous stalling of the whole pipeline). Branch
predictions from Fetch2 to Fetch1 can also be transported in 0 cycles making
fetch1ToFetch2BackwardDelay the only configurable delay which can be set as
low as 0 cycles.

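The reverse-order evaluation can be sketched like this (hypothetical names,
not real Minor types): because later stages run first, a stall written to a
backward wire is visible to earlier stages in the same cycle.

```cpp
#include <vector>

// Hypothetical sketch: backward data written by a later stage is read
// by an earlier stage in the same cycle because stages are evaluated
// back-to-front, so stalls propagate 'perfectly' in one tick.
struct BackwardWire { bool stall = false; };

struct Stage
{
    BackwardWire *toPrev = nullptr;   // written by this stage
    BackwardWire *fromNext = nullptr; // read from the following stage
    bool busy = false;    // this stage cannot accept input this cycle
    bool stalled = false;

    void evaluate()
    {
        // A stage stalls if it is busy or the stage after it stalls.
        stalled = busy || (fromNext && fromNext->stall);
        if (toPrev)
            toPrev->stall = stalled;
    }
};

// Evaluate back-to-front, as Pipeline::evaluate does.
inline void evaluatePipeline(std::vector<Stage> &stages)
{
    for (auto it = stages.rbegin(); it != stages.rend(); ++it)
        it->evaluate();
}
```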
The MinorCPU::activateContext and MinorCPU::suspendContext interfaces can be
called to start and pause threads (threads in the MT sense) and to start and
pause the pipeline. Executing instructions can call this interface
(indirectly through the ThreadContext) to idle the CPU/their threads.

\subsection stages Each pipeline stage

In general, the behaviour of a stage (each cycle) is:

\verbatim
evaluate:
    push input to inputBuffer
    setup references to input/output data slots

    do 'every cycle' 'step' tasks

    if there is input and there is space in the next stage:
        process and generate a new output
        maybe re-activate the stage

    send backwards data

    if the stage generated output to the following FIFO:
        signal pipe activity

    if the stage has more processable input and space in the next stage:
        re-activate the stage for the next cycle

    commit the push to the inputBuffer if that data hasn't all been used
\endverbatim

The Execute stage differs from this model as its forward output (branch) data
is unconditionally sent to Fetch1 and Fetch2. To allow this behaviour, Fetch1
and Fetch2 must be unconditionally receptive to that data.

\subsection fetch1 Fetch1 stage

Fetch1 is responsible for fetching cache lines or partial cache lines from the
I-cache and passing them on to Fetch2 to be decomposed into instructions. It
can receive 'change of stream' indications from both Execute and Fetch2 to
signal that it should change its internal fetch address and tag newly fetched
lines with new stream or prediction sequence numbers. When both Execute and
Fetch2 signal changes of stream at the same time, Fetch1 takes Execute's
change.

Every line issued by Fetch1 will bear a unique line sequence number which can
be used for debugging stream changes.

When fetching from the I-cache, Fetch1 will ask for data from the current
fetch address (Fetch1::pc) up to the end of the 'data snap' size set in the
parameter fetch1LineSnapWidth. Subsequent autonomous line fetches will fetch
whole lines at a snap boundary and of size fetch1LineWidth.

Fetch1 will only initiate a memory fetch if it can reserve space in Fetch2's
input buffer. That input buffer serves as the fetch queue/LFL for the system.

Fetch1 contains two queues, requests and transfers, to handle the stages of
translating the address of a line fetch (via the TLB) and accommodating the
request/response of fetches to/from memory.

Fetch requests from Fetch1 are pushed into the requests queue as newly
allocated FetchRequest objects once they have been sent to the ITLB with a
call to itb->translateTiming.

A response from the TLB moves the request from the requests queue to the
transfers queue. If there is more than one entry in each queue, it is
possible to get a TLB response for a request which is not at the head of the
requests queue. In that case, the TLB response is marked up as a state change
to Translated in the request object, and advancing the request to transfers
(and the memory system) is left to calls to Fetch1::stepQueues which is called
in the cycle following the receipt of any event.

Fetch1::tryToSendToTransfers is responsible for moving requests between the
two queues and issuing requests to memory. Failed TLB lookups (prefetch
aborts) continue to occupy space in the queues until they are recovered at the
head of transfers.

Responses from memory change the request object state to Complete and
Fetch1::evaluate can pick up response data, package it in the ForwardLineData
object, and forward it to Fetch2%'s input buffer.

As space is always reserved in Fetch2::inputBuffer, setting the input buffer's
size to 1 results in non-prefetching behaviour.

When a change of stream occurs, translated requests queue members and
completed transfers queue members can be unconditionally discarded to make way
for new transfers.

\subsection fetch2 Fetch2 stage

Fetch2 receives a line from Fetch1 into its input buffer. The data in the
head line in that buffer is iterated over and separated into individual
instructions which are packed into a vector of instructions which can be
passed to Decode. Packing instructions can be aborted early if a fault is
found in either the input line as a whole or a decomposed instruction.

\subsubsection bp Branch prediction

Fetch2 contains the branch prediction mechanism. This is a wrapper around the
branch predictor interface provided by gem5 (cpu/pred/...).

Branches are predicted for any control instructions found. If prediction is
attempted for an instruction, the MinorDynInst::triedToPredict flag is set on
that instruction.

When a branch is predicted taken, the MinorDynInst::predictedTaken flag is
set and MinorDynInst::predictedTarget is set to the predicted target PC value.
The predicted branch instruction is then packed into Fetch2%'s output vector,
the prediction sequence number is incremented, and the branch is communicated
to Fetch1.

After signalling a prediction, Fetch2 will discard its input buffer contents
and will reject any new lines which have the same stream sequence number as
that branch but have a different prediction sequence number. This allows
following sequentially fetched lines to be rejected without ignoring new lines
generated by a change of stream indicated from a 'real' branch from Execute
(which will have a new stream sequence number).

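That discard rule can be sketched as a small predicate (hypothetical names,
not the real Fetch2 code): a line is dropped only when its stream matches the
current stream but its prediction sequence number is stale.

```cpp
#include <cstdint>

// Hypothetical sketch of the Fetch2 line-discard rule described above.
struct LineTag
{
    uint64_t streamSeqNum;
    uint64_t predictionSeqNum;
};

// Discard lines from the same stream that carry a stale prediction
// sequence number; lines from a new stream are judged elsewhere.
inline bool
shouldDiscard(const LineTag &line, uint64_t expectedStream,
              uint64_t expectedPrediction)
{
    return line.streamSeqNum == expectedStream &&
        line.predictionSeqNum != expectedPrediction;
}
```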
The program counter value provided to Fetch2 by Fetch1 packets is only updated
when there is a change of stream. Fetch2::havePC indicates whether the PC
will be picked up from the next processed input line. Fetch2::havePC is
necessary to allow line-wrapping instructions to be tracked through decode.

Branches (and instructions predicted to branch) which are processed by Execute
will generate BranchData (pipe_data.hh) data explaining the outcome of the
branch which is sent forwards to Fetch1 and Fetch2. Fetch1 uses this data to
change stream (and update its stream sequence number and address for new
lines). Fetch2 uses it to update the branch predictor. Minor does not
communicate branch data to the branch predictor for instructions which are
discarded on the way to commit.

BranchData::BranchReason (pipe_data.hh) encodes the possible branch scenarios:

<table>
<tr>
<td>Branch enum val.</td>
<td>In Execute</td>
<td>Fetch1 reaction</td>
<td>Fetch2 reaction</td>
</tr>
<tr>
<td>NoBranch</td>
<td>(output bubble data)</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>CorrectlyPredictedBranch</td>
<td>Predicted taken, and taken</td>
<td>-</td>
<td>Update BP as taken branch</td>
</tr>
<tr>
<td>UnpredictedBranch</td>
<td>Not predicted, but taken</td>
<td>New stream</td>
<td>Update BP as taken branch</td>
</tr>
<tr>
<td>BadlyPredictedBranch</td>
<td>Predicted taken, but not taken</td>
<td>New stream to restore to old inst. source</td>
<td>Update BP as not taken branch</td>
</tr>
<tr>
<td>BadlyPredictedBranchTarget</td>
<td>Predicted taken, but to a different target than the predicted one</td>
<td>New stream</td>
<td>Update BTB to new target</td>
</tr>
<tr>
<td>SuspendThread</td>
<td>Hint to suspend fetching</td>
<td>Suspend fetch for this thread (branch to next inst. as wakeup
fetch addr)</td>
<td>-</td>
</tr>
<tr>
<td>Interrupt</td>
<td>Interrupt detected</td>
<td>New stream</td>
<td>-</td>
</tr>
</table>

The parameter decodeInputWidth sets the number of instructions which can be
packed into the output per cycle. If the parameter fetch2CycleInput is true,
Fetch2 can try to take instructions from more than one entry in its input
buffer per cycle.

\subsection decode Decode stage

Decode takes a vector of instructions from Fetch2 (via its input buffer) and
decomposes those instructions into micro-ops (if necessary) and packs them
into its output instruction vector.

The parameter executeInputWidth sets the number of instructions which can be
packed into the output per cycle. If the parameter decodeCycleInput is true,
Decode can try to take instructions from more than one entry in its input
buffer per cycle.

\subsection execute Execute stage

Execute provides all the instruction execution and memory access mechanisms.
An instruction's passage through Execute can take multiple cycles with its
precise timing modelled by a functional unit pipeline FIFO.

A vector of instructions (possibly including fault 'instructions') is provided
to Execute by Decode and can be queued in the Execute input buffer before
being issued. Setting the parameter executeCycleInput allows Execute to
examine more than one input buffer entry (more than one instruction vector).
The number of instructions in the input vector can be set with
executeInputWidth and the depth of the input buffer can be set with parameter
executeInputBufferSize.

\subsubsection fus Functional units

The Execute stage contains pipelines for each functional unit comprising the
computational core of the CPU. Functional units are configured via the
executeFuncUnits parameter. Each functional unit has a number of instruction
classes it supports, a stated delay between instruction issues, a delay
from instruction issue to (possible) commit, and an optional timing annotation
capable of more complicated timing.

Each active cycle, Execute::evaluate performs this action:

\verbatim
Execute::evaluate:
    push input to inputBuffer
    setup references to input/output data slots and branch output slot

    step D-cache interface queues (similar to Fetch1)

    if interrupt posted:
        take interrupt (signalling branch to Fetch1/Fetch2)
    else
        commit instructions
        issue new instructions

    advance functional unit pipelines

    reactivate Execute if the unit is still active

    commit the push to the inputBuffer if that data hasn't all been used
\endverbatim

\subsubsection fifos Functional unit FIFOs

Functional units are implemented as SelfStallingPipelines (stage.hh). These
are TimeBuffer FIFOs with two distinct 'push' and 'pop' wires. They respond
to SelfStallingPipeline::advance in the same way as TimeBuffers <b>unless</b>
there is data at the far, 'pop', end of the FIFO. A 'stalled' flag is
provided for signalling stalling and to allow a stall to be cleared. The
intention is to provide a pipeline for each functional unit which will never
advance an instruction out of that pipeline until it has been processed and
the pipeline is explicitly unstalled.

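A minimal sketch of the self-stalling behaviour (an illustrative
simplification, not the real SelfStallingPipeline from stage.hh):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Illustrative sketch of a self-stalling pipeline: a fixed-depth FIFO
// whose advance() shifts entries towards the pop end unless an
// un-popped entry is already waiting there.
template <typename T>
class SelfStallingFifo
{
    std::vector<std::optional<T>> slots; // slots[0] is the pop end
  public:
    bool stalled = false;

    explicit SelfStallingFifo(std::size_t depth) : slots(depth) {}

    // Push into the far (input) end.
    void push(const T &item) { slots.back() = item; }

    // Shift everything one slot unless the pop end is occupied.
    void advance()
    {
        stalled = slots.front().has_value();
        if (stalled)
            return;
        for (std::size_t i = 1; i < slots.size(); ++i) {
            slots[i - 1] = slots[i];
            slots[i].reset();
        }
    }

    bool canPop() const { return slots.front().has_value(); }

    // Popping clears the stall so the pipeline can advance again.
    T pop()
    {
        T item = *slots.front();
        slots.front().reset();
        stalled = false;
        return item;
    }
};
```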
The actions 'issue', 'commit', and 'advance' act on the functional units.

\subsubsection issue Issue

Issuing instructions involves iterating over both the input buffer
instructions and the heads of the functional units to try and issue
instructions in order. The number of instructions which can be issued each
cycle is limited by the parameter executeIssueLimit, how executeCycleInput is
set, the availability of pipeline space and the policy used to choose a
pipeline in which the instruction can be issued.

At present, the only issue policy is strict round-robin visiting of each
pipeline with the given instructions in sequence. For greater flexibility,
better (and more specific) policies will need to be made possible.

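The round-robin visiting can be sketched like this (hypothetical names, not
the real Execute issue code): FUs are visited in order starting after the
last one chosen, and the first that can accept the instruction wins.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a strict round-robin issue policy.
class RoundRobinIssue
{
    std::size_t numFUs;
    std::size_t lastChoice = 0;
  public:
    explicit RoundRobinIssue(std::size_t n) : numFUs(n) {}

    // canAccept[i] says whether FU i can take this instruction's class.
    // Returns the chosen FU, or -1 if none can accept.
    int choose(const std::vector<bool> &canAccept)
    {
        for (std::size_t i = 0; i < numFUs; ++i) {
            std::size_t fu = (lastChoice + 1 + i) % numFUs;
            if (canAccept[fu]) {
                lastChoice = fu;
                return static_cast<int>(fu);
            }
        }
        return -1;
    }
};
```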
Memory operation instructions traverse their functional units to perform their
EA calculations. On 'commit', the ExecContext::initiateAcc execution phase is
performed and any memory access is issued (via ExecContext::{read,write}Mem
calling LSQ::pushRequest) to the LSQ.

Note that faults are issued as if they are instructions and can (currently) be
issued to *any* functional unit.

Every issued instruction is also pushed into the Execute::inFlightInsts queue.
Memory ref. instructions are pushed into the Execute::inFUMemInsts queue.

\subsubsection commit Commit

Instructions are committed by examining the head of the Execute::inFlightInsts
queue (which is decorated with the functional unit number to which the
instruction was issued). Instructions which can then be found in their
functional units are executed and popped from Execute::inFlightInsts.

Memory operation instructions are committed into the memory queues (as
described above) and exit their functional unit pipeline but are not popped
from the Execute::inFlightInsts queue. The Execute::inFUMemInsts queue
provides ordering to memory operations as they pass through the functional
units (maintaining issue order). On entering the LSQ, instructions are popped
from Execute::inFUMemInsts.

If the parameter executeAllowEarlyMemoryIssue is set, memory operations can be
sent from their FU to the LSQ before reaching the head of
Execute::inFlightInsts but after their dependencies are met.
MinorDynInst::instToWaitFor is marked up with the latest dependent instruction
execSeqNum required to be committed for a memory operation to progress to the
LSQ.

Once a memory response is available (by testing the head of
Execute::inFlightInsts against LSQ::findResponse), commit will process that
response (ExecContext::completeAcc) and pop the instruction from
Execute::inFlightInsts.

Any branch, fault or interrupt will cause a stream sequence number change and
signal a branch to Fetch1/Fetch2. Only instructions with the current stream
sequence number will be issued and/or committed.

\subsubsection advance Advance

All non-stalled pipelines are advanced and may, thereafter, become stalled.
Potential activity in the next cycle is signalled if there are any
instructions remaining in any pipeline.

\subsubsection sb Scoreboard

The scoreboard (Scoreboard) is used to control instruction issue. It contains
a count of the number of in-flight instructions which will write to each
general purpose CPU integer or float register. Instructions will only be
issued when the scoreboard's count is 0 for every one of the instruction's
source registers.

Once an instruction is issued, the scoreboard count for each of the
instruction's destination registers is incremented.

The estimated delivery time of the instruction's result is marked up in the
scoreboard by adding the length of the issued-to FU to the current time. The
timings parameter on each FU provides a list of additional rules for
calculating the delivery time. These are documented in the parameter comments
in MinorCPU.py.

On commit (for memory operations, on memory response commit), the scoreboard
counters for the instruction's destination registers are decremented.

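The counting scheme can be sketched as follows (an illustrative
simplification of the real Scoreboard; the method names here are
hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of the scoreboard: per-register counts of
// in-flight writers; an instruction may issue only when every one of
// its source registers has a count of zero.
class Scoreboard
{
    std::vector<int> writers; // in-flight writers per register
  public:
    explicit Scoreboard(std::size_t numRegs) : writers(numRegs, 0) {}

    bool canIssue(const std::vector<int> &srcRegs) const
    {
        for (int reg : srcRegs)
            if (writers[reg] != 0)
                return false;
        return true;
    }

    // On issue, count each destination register's pending write.
    void markUp(const std::vector<int> &destRegs)
    {
        for (int reg : destRegs)
            ++writers[reg];
    }

    // On commit (or memory response commit), release them.
    void clear(const std::vector<int> &destRegs)
    {
        for (int reg : destRegs)
            --writers[reg];
    }
};
```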
\subsubsection ifi Execute::inFlightInsts

The Execute::inFlightInsts queue will always contain all instructions in
flight in Execute in the correct issue order. Execute::issue is the only
process which will push an instruction into the queue. Execute::commit is the
only process that can pop an instruction.

\subsubsection lsq LSQ

The LSQ can support multiple outstanding transactions to memory in a number of
conservative cases.

There are three queues to contain requests: requests, transfers and the store
buffer. The requests and transfers queues operate in a similar manner to the
queues in Fetch1. The store buffer is used to decouple the delay of
completing store operations from following loads.

Requests are issued to the DTLB as their instructions leave their functional
unit. At the head of requests, cacheable load requests can be sent to memory
and on to the transfers queue. Cacheable stores will be passed to transfers
unprocessed and progress through that queue maintaining order with other
transactions.

The conditions in LSQ::tryToSendToTransfers dictate when requests can
be sent to memory.

All uncacheable transactions, split transactions and locked transactions are
processed in order at the head of requests. Additionally, store results
residing in the store buffer can have their data forwarded to cacheable loads
(removing the need to perform a read from memory) but no cacheable load can be
issued to the transfers queue until that queue's stores have drained into the
store buffer.

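The store-to-load forwarding idea can be sketched as follows. This is an
illustrative fragment only: the structures and the function name are invented
here, and the real checks in LSQ::tryToSendToTransfers are considerably more
involved. A load may take its data from a buffered store only when the store
fully covers the load's address range:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// A completed store sitting in the store buffer (hypothetical layout).
struct BufferedStore
{
    std::uint64_t addr;
    std::vector<std::uint8_t> data;
};

// Try to satisfy a cacheable load from the store buffer instead of
// performing a read from memory.
std::optional<std::vector<std::uint8_t>>
tryForward(const std::vector<BufferedStore> &storeBuffer,
           std::uint64_t loadAddr, std::size_t loadSize)
{
    // Search newest-first so the most recent overlapping store wins.
    for (auto it = storeBuffer.rbegin(); it != storeBuffer.rend(); ++it) {
        if (it->addr <= loadAddr &&
            loadAddr + loadSize <= it->addr + it->data.size()) {
            std::size_t offset = loadAddr - it->addr;
            return std::vector<std::uint8_t>(
                it->data.begin() + offset,
                it->data.begin() + offset + loadSize);
        }
    }
    return std::nullopt; // no full cover: the load must go to memory
}
```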
At the end of transfers, requests which are LSQ::LSQRequest::Complete (are
faulting, are cacheable stores, or have been sent to memory and received a
response) can be picked off by Execute and committed
(ExecContext::completeAcc) and, for stores, sent to the store buffer.

Barrier instructions do not prevent cacheable loads from progressing to memory
but do cause a stream change which will discard that load. Stores will not be
committed to the store buffer if they are in the shadow of the barrier but
before the new instruction stream has arrived at Execute. As all other memory
transactions are delayed at the end of the requests queue until they are at
the head of Execute::inFlightInsts, they will be discarded by any barrier
stream change.

After commit, LSQ::BarrierDataRequest requests are inserted into the
store buffer to track each barrier until all preceding memory transactions
have drained from the store buffer. No further memory transactions will be
issued from the ends of FUs until after the barrier has drained.

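The barrier-draining rule can be sketched as follows (hypothetical structures
for illustration; the real store buffer holds LSQ request objects and issues
its stores to memory rather than simply popping them):

```cpp
#include <cassert>
#include <deque>

// Sketch of barrier tracking in the store buffer: a barrier entry inserted
// after commit must wait until every store queued before it has drained to
// memory before later memory transactions may issue.
struct StoreBufferSketch
{
    struct Entry { bool isBarrier; };
    std::deque<Entry> entries; // oldest at the front

    void pushStore()   { entries.push_back({false}); }
    void pushBarrier() { entries.push_back({true}); }

    // A barrier at the head has no preceding stores left and has drained.
    bool barrierComplete() const
    {
        return !entries.empty() && entries.front().isBarrier;
    }

    // Model one store at the head of the buffer draining to memory.
    void drainOneStore()
    {
        if (!entries.empty() && !entries.front().isBarrier)
            entries.pop_front();
    }
};
```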
\subsubsection drain Draining

Draining is mostly handled by the Execute stage. When initiated by calling
MinorCPU::drain, Pipeline::evaluate checks the draining status of each unit
each cycle and keeps the pipeline active until draining is complete. It is
Pipeline that signals the completion of draining. Execute is triggered by
MinorCPU::drain and starts stepping through its Execute::DrainState state
machine, starting from state Execute::NotDraining, in this order:

<table>
<tr>
<td><b>State</b></td>
<td><b>Meaning</b></td>
</tr>
<tr>
<td>Execute::NotDraining</td>
<td>Not trying to drain, normal execution</td>
</tr>
<tr>
<td>Execute::DrainCurrentInst</td>
<td>Draining micro-ops to complete inst.</td>
</tr>
<tr>
<td>Execute::DrainHaltFetch</td>
<td>Halt fetching instructions</td>
</tr>
<tr>
<td>Execute::DrainAllInsts</td>
<td>Discarding all instructions presented</td>
</tr>
</table>

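As a sketch, the sequence in the table corresponds to stepping through an
enum like this (illustrative only; the names mirror the table but the real
transitions inside Execute also depend on instruction and fetch state):

```cpp
#include <cassert>

// Drain states in the order Execute steps through them.
enum class DrainState
{
    NotDraining,        // normal execution
    DrainCurrentInst,   // finish the current instruction's micro-ops
    DrainHaltFetch,     // stop fetching new instructions
    DrainAllInsts       // discard all instructions presented
};

// Advance one step towards the fully drained state; DrainAllInsts is
// terminal until the whole model has drained.
DrainState stepDrain(DrainState state)
{
    switch (state) {
      case DrainState::NotDraining:      return DrainState::DrainCurrentInst;
      case DrainState::DrainCurrentInst: return DrainState::DrainHaltFetch;
      case DrainState::DrainHaltFetch:   return DrainState::DrainAllInsts;
      case DrainState::DrainAllInsts:    return DrainState::DrainAllInsts;
    }
    return state;
}
```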
When complete, a drained Execute unit will be in the Execute::DrainAllInsts
state where it will continue to discard instructions but has no knowledge of
the drained state of the rest of the model.

\section debug Debug options

The model provides a number of debug flags which can be passed to gem5 with
the --debug-flags option.

The available flags are:

<table>
<tr>
<td><b>Debug flag</b></td>
<td><b>Unit which will generate debugging output</b></td>
</tr>
<tr>
<td>Activity</td>
<td>Debug ActivityMonitor actions</td>
</tr>
<tr>
<td>Branch</td>
<td>Fetch2 and Execute branch prediction decisions</td>
</tr>
<tr>
<td>MinorCPU</td>
<td>CPU global actions such as wakeup/thread suspension</td>
</tr>
<tr>
<td>Decode</td>
<td>Decode</td>
</tr>
<tr>
<td>MinorExec</td>
<td>Execute behaviour</td>
</tr>
<tr>
<td>Fetch</td>
<td>Fetch1 and Fetch2</td>
</tr>
<tr>
<td>MinorInterrupt</td>
<td>Execute interrupt handling</td>
</tr>
<tr>
<td>MinorMem</td>
<td>Execute memory interactions</td>
</tr>
<tr>
<td>MinorScoreboard</td>
<td>Execute scoreboard activity</td>
</tr>
<tr>
<td>MinorTrace</td>
<td>Generate MinorTrace cyclic state trace output (see below)</td>
</tr>
<tr>
<td>MinorTiming</td>
<td>MinorTiming instruction timing modification operations</td>
</tr>
</table>

The group flag Minor enables all of the flags beginning with Minor.

\section trace MinorTrace and minorview.py

The debug flag MinorTrace causes cycle-by-cycle state data to be printed which
can then be processed and viewed by the minorview.py tool. This output is
very verbose and so it is recommended that it only be used for small examples.

\subsection traceformat MinorTrace format

There are three types of line output by MinorTrace:

\subsubsection state MinorTrace - Ticked unit cycle state

For example:

\verbatim
110000: system.cpu.dcachePort: MinorTrace: state=MemoryRunning in_tlb_mem=0/0
\endverbatim

For each time step, the MinorTrace flag will cause one MinorTrace line to be
printed for every named element in the model.

\subsubsection traceunit MinorInst - summaries of instructions issued by \
Decode

For example:

\verbatim
140000: system.cpu.execute: MinorInst: id=0/1.1/1/1.1 addr=0x5c \
inst=" mov r0, #0" class=IntAlu
\endverbatim

MinorInst lines are currently only generated for instructions which are
committed.

\subsubsection tracefetch1 MinorLine - summaries of line fetches issued by \
Fetch1

For example:

\verbatim
92000: system.cpu.icachePort: MinorLine: id=0/1.1/1 size=36 \
vaddr=0x5c paddr=0x5c
\endverbatim

\subsection minorview minorview.py

Minorview (util/minorview.py) can be used to visualise the data created by
MinorTrace.

\verbatim
usage: minorview.py [-h] [--picture picture-file] [--prefix name]
                    [--start-time time] [--end-time time] [--mini-views]
                    event-file

Minor visualiser

positional arguments:
  event-file

optional arguments:
  -h, --help            show this help message and exit
  --picture picture-file
                        markup file containing blob information (default:
                        <minorview-path>/minor.pic)
  --prefix name         name prefix in trace for CPU to be visualised
                        (default: system.cpu)
  --start-time time     time of first event to load from file
  --end-time time       time of last event to load from file
  --mini-views          show tiny views of the next 10 time steps
\endverbatim

Raw debugging output can be passed to minorview.py as the event-file. It will
pick out the MinorTrace lines; other lines in which units in the simulation
are named (such as system.cpu.dcachePort in the above example) will appear as
'comments' when units are clicked on in the visualiser.

Clicking on a unit which contains instructions or lines will bring up a speech
bubble giving extra information derived from the MinorInst/MinorLine lines.

--start-time and --end-time allow only sections of debug files to be loaded.

--prefix allows the name prefix of the CPU to be inspected to be supplied.
This defaults to 'system.cpu'.

In the visualiser, the buttons Start, End, Back, Forward, Play and Stop can be
used to control the displayed simulation time.

The diagonally striped coloured blocks show the InstId of the
instruction or line they represent. Note that lines in Fetch1 and f1ToF2.F
only show the id fields of a line and that instructions in Fetch2, f2ToD, and
decode.inputBuffer do not yet have execute sequence numbers. The T/S.P/L/F.E
buttons can be used to toggle parts of InstId on and off to make it easier to
understand the display. Useful combinations are:

<table>
<tr>
<td><b>Combination</b></td>
<td><b>Reason</b></td>
</tr>
<tr>
<td>E</td>
<td>just show the final execute sequence number</td>
</tr>
<tr>
<td>F/E</td>
<td>show the instruction-related numbers</td>
</tr>
<tr>
<td>S/P</td>
<td>show just the stream-related numbers (watch the stream sequence
change with branches and not change with predicted branches)</td>
</tr>
<tr>
<td>S/E</td>
<td>show instructions and their stream</td>
</tr>
</table>

The key to the right shows all the displayable colours (some of the colour
choices are quite bad!):

<table>
<tr>
<td><b>Symbol</b></td>
<td><b>Meaning</b></td>
</tr>
<tr>
<td>U</td>
<td>Unknown data</td>
</tr>
<tr>
<td>B</td>
<td>Blocked stage</td>
</tr>
<tr>
<td>-</td>
<td>Bubble</td>
</tr>
<tr>
<td>E</td>
<td>Empty queue slot</td>
</tr>
<tr>
<td>R</td>
<td>Reserved queue slot</td>
</tr>
<tr>
<td>F</td>
<td>Fault</td>
</tr>
<tr>
<td>r</td>
<td>Read (used as the leftmost stripe on data in the dcachePort)</td>
</tr>
<tr>
<td>w</td>
<td>Write (used as the leftmost stripe on data in the dcachePort)</td>
</tr>
<tr>
<td>0 to 9</td>
<td>last decimal digit of the corresponding data</td>
</tr>
</table>

\verbatim

,---------------. .--------------. *U
| |=|->|=|->|=| | ||=|||->||->|| | *- <- Fetch queues/LSQ
`---------------' `--------------' *R
=== ====== *w <- Activity/Stage activity
,--------------. *1
,--. ,. ,. | ============ | *3 <- Scoreboard
| |-\[]-\||-\[]-\||-\[]-\| ============ | *5 <- Execute::inFlightInsts
| | :[] :||-/[]-/||-/[]-/| -. -------- | *7
| |-/[]-/|| ^ || | | --------- | *9
| | || | || | | ------ |
[]->| | ->|| | || | | ---- |
| |<-[]<-||<-+-<-||<-[]<-| | ------ |->[] <- Execute to Fetch1,
'--` `' ^ `' | -' ------ | Fetch2 branch data
---. | ---. `--------------'
---' | ---' ^ ^
| ^ | `------------ Execute
MinorBuffer ----' input `-------------------- Execute input buffer
buffer
\endverbatim

Stages show the colours of the instructions currently being
generated/processed.

Forward FIFOs between stages show the data being pushed into them at the
current tick (to the left), the data in transit, and the data available at
their outputs (to the right).

The backwards FIFO between Fetch2 and Fetch1 shows branch prediction data.

In general, all displayed data is correct at the end of a cycle's activity at
the time indicated but before the inter-stage FIFOs are ticked. Each FIFO
has, therefore, an extra slot to show the asserted new input data, and all the
data currently within the FIFO.

Input buffers for each stage are shown below the corresponding stage and show
the contents of those buffers as horizontal strips. Strips marked as reserved
(cyan by default) are reserved to be filled by the previous stage. An input
buffer with all reserved or occupied slots will, therefore, block the previous
stage from generating output.

Fetch queues and the LSQ show the lines/instructions in the queues of each
interface and show the number of lines/instructions in TLB and memory in the
two striped colours of the top of their frames.

Inside Execute, the horizontal bars represent the individual FU pipelines.
The vertical bar to the left is the input buffer and the bar to the right, the
instructions committed this cycle. The background of Execute shows
instructions which are being committed this cycle in their original FU
pipeline positions.

The strip at the top of the Execute block shows the current streamSeqNum that
Execute is committing. A similar stripe at the top of Fetch1 shows that
stage's expected streamSeqNum and the stripe at the top of Fetch2 shows its
issuing predictionSeqNum.

The scoreboard shows the number of instructions in flight which will commit a
result to the register in the position shown. The scoreboard contains slots
for each integer and floating point register.

The Execute::inFlightInsts queue shows all the instructions in flight in
Execute with the oldest instruction (the next instruction to be committed) to
the right.

'Stage activity' shows the signalled activity (as E/1) for each stage (with
CPU miscellaneous activity to the left).

'Activity' shows a count of stage and pipe activity.

\subsection picformat minor.pic format

The minor.pic file (src/minor/minor.pic) describes the layout of the
model's blocks on the visualiser. Its format is described in the supplied
minor.pic file.

*/

}