Courses/Computer Science/CPSC 355.W2014/Lecture Notes/Pipeline

From wiki.ucalgary.ca

Announcements

  • Piazza posts worth noting
  • Final project posted
  • No office hours Feb 24

Topics

We are not going to dwell too heavily on the SPARC instruction format because (1) you should read about it in the readings for this week and (2) we will examine SPARC in greater detail in week 11.

We will examine the general topic of pipelining and how it relates to instruction formats, both in general and for SPARC in particular.

Terms and Concepts

  • sequential execution
  • pipeline, pipelining
  • instruction level parallelism (ILP)
  • data conflicts, register conflicts, resource conflicts
  • branch prediction
  • stall
  • delay slot

Class Notes

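Simple, non-pipelined execution of the fetch-decode-execute (FDX) cycle. Each instruction passes through five stages, FI (fetch instruction), D (decode), FO (fetch operands), X (execute), and WB (write back), and the next instruction cannot start until the previous one has finished: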

                T0   T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11  T12  T13  T14
=========================================================================================
PUSH EAX        FI   D    FO   X    WB
MOV EAX, 0x0                             FI   D    FO   X    WB
XOR EBX, EBX                                                      FI   D    FO   X    WB
INC EBX
ADD EAX, EBX
NOP

Note that the 8088 we're looking at is not pipelined; it only prefetches instructions into a small queue while the current instruction executes. We use x86 assembly simply for illustration.
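The timing in the first chart can be reproduced with a few lines of Python. This is a minimal sketch, assuming each of the five stages takes exactly one clock cycle and that an instruction cannot begin until the previous one has written back; `STAGES`, `PROGRAM`, and `sequential_schedule` are illustrative names, not part of any real tool.

```python
"""Sketch: cycle timing of strictly sequential (non-pipelined) FDX execution.

Assumption: every stage takes one clock cycle and an instruction starts only
after the previous instruction's WB stage has finished.
"""

STAGES = ["FI", "D", "FO", "X", "WB"]

# The instruction sequence from the first chart (x86 used only for illustration).
PROGRAM = [
    "PUSH EAX",
    "MOV EAX, 0x0",
    "XOR EBX, EBX",
    "INC EBX",
    "ADD EAX, EBX",
    "NOP",
]


def sequential_schedule(n_instructions, stages=STAGES):
    """Return, for each instruction, a dict mapping stage name -> clock cycle."""
    k = len(stages)
    schedule = []
    for i in range(n_instructions):
        start = i * k  # instruction i waits for i complete k-cycle FDX passes
        schedule.append({stage: start + s for s, stage in enumerate(stages)})
    return schedule


if __name__ == "__main__":
    sched = sequential_schedule(len(PROGRAM))
    for text, stages in zip(PROGRAM, sched):
        print(f"{text:<14} FI at T{stages['FI']:<3} WB at T{stages['WB']}")
    print("total cycles:", len(PROGRAM) * len(STAGES))
```

For these six instructions the model gives 30 cycles in total, with XOR EBX, EBX starting at T10 as in the chart: every instruction pays the full five cycles.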

Ideal pipeline: on every clock cycle, a new instruction enters the pipeline. During T4 through T6 the pipeline is full, and it graduates (completes) one instruction per cycle rather than one instruction every 5 cycles.
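As a rough worked check of that claim, assume a pipeline of k stages in which every stage takes exactly one cycle and nothing ever stalls (so this describes only the idealized case). Then n instructions need kn cycles sequentially, but only k cycles for the first instruction plus one more cycle for each of the remaining n - 1:

$$
C_{\text{seq}} = k\,n, \qquad
C_{\text{pipe}} = k + (n - 1), \qquad
S = \frac{C_{\text{seq}}}{C_{\text{pipe}}} = \frac{k\,n}{k + n - 1} \;\longrightarrow\; k \quad (n \to \infty)
$$

With k = 5 stages (FI, D, FO, X, WB) and the n = 7 instructions in the chart below, C_seq = 35 cycles and C_pipe = 11 cycles (T0 through T10), so S = 35/11, roughly 3.2; the speedup approaches 5 only for long instruction streams.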

The chart below is "idealized": it depicts a situation that does not occur in practice, but it demonstrates the potential benefits of a pipeline and full ILP.

                T0   T1   T2   T3   T4   T5   T6   T7   T8   T9   T10
=====================================================================
PUSH EAX        FI   D    FO   X    WB
MOV EAX, 0x0         FI   D    FO   X    WB
XOR EBX, EBX              FI   D    FO   X    WB
INC EBX                        FI   D    FO   X    WB
ADD EAX, EBX                        FI   D    FO   X    WB
NOP                                      FI   D    FO   X    WB
INC EAX                                       FI   D    FO   X    WB
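The idealized chart follows a simple rule: instruction i is in stage s during cycle i + s. The Python sketch below applies that rule, assuming one-cycle stages and no conflicts at all, to find the cycles in which all five stages are busy and the cycle in which each instruction graduates. `ideal_schedule` and `stage_occupancy` are illustrative names.

```python
"""Sketch: idealized 5-stage pipeline timing with no conflicts or stalls."""

STAGES = ["FI", "D", "FO", "X", "WB"]

# The instruction sequence from the idealized chart.
PROGRAM = [
    "PUSH EAX",
    "MOV EAX, 0x0",
    "XOR EBX, EBX",
    "INC EBX",
    "ADD EAX, EBX",
    "NOP",
    "INC EAX",
]


def ideal_schedule(n_instructions, n_stages):
    """Instruction i occupies stage s during cycle i + s (one new instruction per cycle)."""
    return [[i + s for s in range(n_stages)] for i in range(n_instructions)]


def stage_occupancy(schedule):
    """Count how many stages are busy in each clock cycle."""
    total_cycles = max(row[-1] for row in schedule) + 1
    busy = [0] * total_cycles
    for row in schedule:
        for cycle in row:
            busy[cycle] += 1
    return busy


if __name__ == "__main__":
    k, n = len(STAGES), len(PROGRAM)
    sched = ideal_schedule(n, k)
    busy = stage_occupancy(sched)
    full = [t for t, count in enumerate(busy) if count == k]
    print("pipeline full during cycles:", full)
    print("graduation cycles:", [row[-1] for row in sched])
    print("pipelined cycles:", len(busy), "vs sequential:", n * k)
```

For these seven instructions the model reports a full pipeline during T4, T5, and T6, one graduation per cycle from T4 to T10, and 11 cycles in total versus 35 without pipelining.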

Dilemma: we want high ILP because (1) it makes full use of the available circuitry and (2) it greatly increases execution throughput, but we NEED to preserve the semantics of the ISA (i.e., the public, documented behavior of each instruction as an "atomic" unit of execution).

  1. Issue 1: An instruction may have data dependencies on preceding or subsequent instructions (e.g., ADD EAX, EBX reads the value of EBX produced by INC EBX).
  2. Issue 2: Instructions have functional (resource) dependencies on one another, because the processor has only a limited number of functional units and scratch registers.
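Issue 1 can be made concrete by listing which registers each instruction reads and writes. The sketch below is a hypothetical helper (`raw_dependencies` is an illustrative name) that finds read-after-write dependencies in the sample sequence from the idealized chart. As a simplifying assumption it ignores flags and memory, and it counts XOR EBX, EBX as reading EBX even though its result is always zero.

```python
"""Sketch: find read-after-write (RAW) register dependencies in a short sequence.

Assumption: only general-purpose registers are tracked; flags and memory
(including the stack slot written by PUSH) are ignored for simplicity.
"""

# (instruction text, registers read, registers written)
PROGRAM = [
    ("PUSH EAX", {"EAX", "ESP"}, {"ESP"}),
    ("MOV EAX, 0x0", set(), {"EAX"}),
    ("XOR EBX, EBX", {"EBX"}, {"EBX"}),
    ("INC EBX", {"EBX"}, {"EBX"}),
    ("ADD EAX, EBX", {"EAX", "EBX"}, {"EAX"}),
    ("NOP", set(), set()),
    ("INC EAX", {"EAX"}, {"EAX"}),
]


def raw_dependencies(program):
    """Return (producer index, consumer index, register) for every RAW dependency.

    Each register read is matched with the most recent earlier instruction
    that wrote that register.
    """
    last_writer = {}
    deps = []
    for j, (_text, reads, writes) in enumerate(program):
        for reg in sorted(reads):  # sorted for a deterministic output order
            if reg in last_writer:
                deps.append((last_writer[reg], j, reg))
        for reg in writes:
            last_writer[reg] = j
    return deps


if __name__ == "__main__":
    for producer, consumer, reg in raw_dependencies(PROGRAM):
        print(f"{PROGRAM[consumer][0]!r} needs {reg} from {PROGRAM[producer][0]!r}")
```

Under these assumptions it finds four dependencies: INC EBX on XOR EBX, EBX; ADD EAX, EBX on both MOV EAX, 0x0 and INC EBX; and INC EAX on ADD EAX, EBX.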

These data and resource conflicts interfere with ideal ILP, for example:

1) an operand is still being written back by one instruction while a subsequent instruction is fetching it (the later instruction must stall);
2) all available functional units are already in use by instructions in the X stage (e.g., no more adders are available).
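Conflict 1) can be checked mechanically against the idealized chart. The sketch below assumes the chart's timing (instruction i in stage s during cycle i + s), no forwarding, and that a register written back in some cycle can only be fetched as an operand in a later cycle. The dependency list holds the read-after-write pairs of the sample sequence, written out by hand, and `data_conflicts` is an illustrative name.

```python
"""Sketch: find data conflicts in the idealized (conflict-free) pipeline chart.

Assumptions: 5 one-cycle stages FI, D, FO, X, WB; instruction i is in stage s
during cycle i + s; no forwarding; a value written back in cycle t can only be
fetched as an operand in cycle t + 1 or later.
"""

FO, WB = 2, 4  # offsets of the FO and WB stages from an instruction's FI cycle

NAMES = ["PUSH EAX", "MOV EAX, 0x0", "XOR EBX, EBX", "INC EBX",
         "ADD EAX, EBX", "NOP", "INC EAX"]

# Read-after-write dependencies of the sample sequence: (producer, consumer, register)
DEPS = [(2, 3, "EBX"), (1, 4, "EAX"), (3, 4, "EBX"), (4, 6, "EAX")]


def data_conflicts(deps):
    """Return (producer, consumer, register, FO cycle, WB cycle) for each conflict."""
    conflicts = []
    for producer, consumer, reg in deps:
        wb_cycle = producer + WB  # when the producer writes the register back
        fo_cycle = consumer + FO  # when the consumer tries to fetch it
        if fo_cycle <= wb_cycle:  # fetched before (or while) it is written back
            conflicts.append((producer, consumer, reg, fo_cycle, wb_cycle))
    return conflicts


if __name__ == "__main__":
    for p, c, reg, fo, wb in data_conflicts(DEPS):
        print(f"{NAMES[c]} fetches {reg} at T{fo}, but {NAMES[p]} writes it back at T{wb}")
```

Under these assumptions three of the four dependencies conflict: INC EBX, ADD EAX, EBX, and INC EAX would each read a stale register, so without stalls or forwarding the idealized chart could not run as drawn. Only the MOV EAX, 0x0 to ADD EAX, EBX dependency is far enough apart.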

RISC ISAs and approaches sometimes make this work easier: under a load/store discipline only loads and stores access memory, so most conflicts are between registers, which are easier to keep track of than arbitrary, dynamically computed memory addresses.

An ISA designer can take different approaches, placing the burden of handling these conflicts on:

  • the programmer
  • the compiler
  • the microarchitecture

One solution is to stall the pipeline for one cycle (i.e., introduce a bubble). Compilers that know this can "schedule" (reorder) instructions to fill the resulting delay slots with useful work. Assembly programmers can do the same thing, but they have to keep track of what the pipeline will be doing at execution time. Alternatively, the microarchitecture can deal with conflicts itself: by simply stalling, by forwarding results directly between stages, by launching another instruction in parallel (superscalar), or by trying these and falling back to stalling.
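The trade-off between simply stalling and forwarding can be estimated with a small model. The Python sketch below is hypothetical: it assumes the five-stage FI/D/FO/X/WB pipeline from the charts, in-order issue, and the read-after-write dependencies of the seven-instruction sample sequence written out by hand. It does not model superscalar issue or compiler reordering, and `start_cycles` and `total_cycles` are illustrative names.

```python
"""Sketch: total cycles for an in-order 5-stage pipeline that stalls on data
conflicts, with and without result forwarding.

Assumptions (a simplified model, not any particular processor):
  * stages FI, D, FO, X, WB, one clock cycle each; at most one new
    instruction enters per cycle, and instructions stay in program order;
  * a stall is modelled by delaying the consumer's entry into the pipeline;
  * without forwarding, a value is readable in FO only after the producer's WB;
  * with forwarding, the producer's X result feeds the consumer's X directly.
"""

K = 5                # number of pipeline stages
FO, X, WB = 2, 3, 4  # stage offsets from an instruction's FI cycle

N = 7  # the 7-instruction sample sequence from the idealized chart
# Read-after-write dependencies of that sequence: (producer, consumer)
DEPS = [(2, 3), (1, 4), (3, 4), (4, 6)]


def start_cycles(n, deps, forwarding):
    """Return the FI cycle of each instruction after inserting any needed stalls."""
    start = []
    for j in range(n):
        earliest = start[j - 1] + 1 if j > 0 else 0  # in order, one per cycle
        for producer, consumer in deps:
            if consumer != j:
                continue
            if forwarding:
                ready, needed_at = start[producer] + X, X  # value exists after producer's X
            else:
                ready, needed_at = start[producer] + WB, FO  # value in register after WB
            # the consumer's needing stage must fall strictly after `ready`
            earliest = max(earliest, ready + 1 - needed_at)
        start.append(earliest)
    return start


def total_cycles(start):
    """Cycles from T0 through the last instruction's WB."""
    return start[-1] + K


if __name__ == "__main__":
    ideal = N + K - 1
    for fwd in (False, True):
        cycles = total_cycles(start_cycles(N, DEPS, fwd))
        label = "with forwarding" if fwd else "stall only"
        print(f"{label:>15}: {cycles} cycles ({cycles - ideal} bubbles)")
```

Under these assumptions, stalling alone stretches the seven instructions from 11 to 16 cycles (5 bubbles), while forwarding removes every bubble, because each consumer here only needs a result produced by an earlier instruction's X stage.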