31,884 views

Consider an instruction pipeline with five stages without any branch prediction:

Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI) and Write Operand (WO). The stage delays for FI, DI, FO, EI and WO are $\text{5 ns, 7 ns, 10 ns, 8 ns and 6 ns},$ respectively. There are intermediate storage buffers after each stage and the delay of each buffer is $1\ \text{ns}.$ A program consisting of $12$ instructions $\text{I1, I2, I3,}\ldots,\text{ I12}$ is executed in this pipelined processor. Instruction $\text{I4}$ is the only branch instruction and its branch target is $\text{I9}.$ If the branch is taken during the execution of this program, the time (in ns) needed to complete the program is

1.  $132$
2.  $165$
3.  $176$
4.  $328$

There are intermediate storage buffers after each stage and the delay of each buffer is 1 ns.

What is the significance of this statement in given problem?

Where are we using this information?

edited by

Hi @Arjun Sir,

First of all thanks for sharing following example -->

JZ LOOP

Here we can know whether we have to do a Jump or not only after the Execute stage of the JZ (Jump on Zero) instruction. Similar is the case for any conditional branch instruction.

Good point but in case of unconditional branch i think till ID phase decision could be taken. Please correct me if i am wrong.

Chhotu yes in case of unconditional branch by decoding itself we can take decision IFF the operand fetch phase is merged with the decode phase then we can directly get the location (target ) by decode phase itself hence once this decode phase is done we can flush the pipleline and restart from I9

I think following is the generalized formula. Anyone confirm.

No of cycles = No. of instructions actually appeared in the datapath + No of stalls + No. of stages - 1

$N = 8 (I_{1,2,3,4,9,10,11,12}) \ +\ 3 \ +\ 5 -1 = 15$

$=> T_n = 15*(10+1) = 165$
@Venkat Sai

if in question it is given that it is conditional branch instruction then no of instructions=9

if in question it is given that it is unconditional branch instruction then no of instruction =11

but since nothing given therefore we assume target address will be available in last stage only thefore no. of instructions=12

if i m wrong then please correct me
Why we assume branch decision on ex stage???

Why Instruction I8 is not considered in Instruction cycle ? The question should not mention so Please clear in topics Very deep.  @Arjun Sir

@ because it is the best case, that we come to know about branch decision in EX stage

edited
Always count for branch instruction separately. first instruction=5 Clock, now remaining (I2 I3 I10 I11 I12 I4)=6 clock and at last branch instruction will take 4 clock.

Total=(5+6+4)*11=165 ns
Because during execution only we came to know where we have to branch. During execution  effective address for the next instruction will be calculated and loaded into PC

After pipelining we have to adjust the stage delays such that no stage will be waiting for another to ensure smooth pipelining (continuous flow). Since we can not easily decrease the stage delay, we can increase all the stage delays to the maximum delay possible. So, here maximum delay is $10$ ns. Buffer delay given is $1$ ns. So, each stage takes $11$ ns in total.

FI of $\text{I9}$ can start only after the EI of $\text{I4}.$ So, the total execution time will be
$$15 \times 11 = 165$$
$$\small \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline &\bf{t_1}&\bf{t_2}&\bf{t_3}&\bf{t_4}&\bf{t_5}&\bf{t_6}&\bf{t_7}&\bf{t_8}&\bf{t_9}&\bf{t_{10}}&\bf{t_{11}}&\bf{t_{12}}&\bf{t_{13}}&\bf{t_{14}}&\bf{t_{15}}\\ \hline \textbf{I1}&\text{FI}&\text{DI}&\text{FO}&\text{EI}&\text{WO}\\ \textbf{I2}&&\text{FI}&\text{DI}&\text{FO}&\text{EI}&\text{WO}\\ \textbf{I3}&&&\text{FI}&\text{DI}&\text{FO} &\text{EI}&\text{WO}\\ \textbf{I4}&&&&\text{FI}&\text{DI}&\text{FO}&\text{EI}&\text{WO}\\ &&&&&\color{red}{\text{stall}}\\ &&&&&&\color{red}{\text{stall}}\\ &&&&&&&\color{red}{\text{stall}}\\ \textbf{I9}&&&&&&&&\text{FI}&\text{DI}&\text{FO}&\text{EI}&\text{WO}\\ \textbf{I10}&&&&&&&&&\text{FI}&\text{DI}&\text{FO}&\text{EI}&\text{WO}\\ \textbf{I11}&&&&&&&&&&\text{FI}&\text{DI}&\text{FO}&\text{EI}&\text{WO}\\ \textbf{I12}&&&&&&&&&&&\text{FI}&\text{DI}&\text{FO}&\text{EI}&\text{WO}\\ \hline\end{array}$$

Correct Answer: $B$
by

@arjun sir how you saying branch address availiable after EI stage..?

@Nils

Fetch will happen in T8. Even if there is a RAW hazard between I4 and I9, I9 will get the updated data at decode phase of the instruction.

Why we are not using the stage delays for FI, DI, FO, EI and WO are 5 ns, 7 ns, 10 ns, 8 ns and 6 ns, respectively and taking only highest one?

At least for first Instruction we suppose to use it if its given.

first instruction will take 5+7+10+8+6=36 and 36+5 (buffer delay)= 41 ns and remaining instruction will take (upto 4 only) 11 ns * 3= 33. thus total time after 4th instruction execution 41+33 =74 ns.

similary, form 9 to 12 th instruction it will take 41 ns (for 9th) + 33 ns (from 10 to 12th)=74ns.

and thus total time required for execution 74 + 74 =148 ns (No option is given)

What I am saying here is that if they are giving time for each stage then why we are not using it and just solve it by assuming only highest delay.
It is because the highest delay in the pipeline is 10ns. What we are doing is we increase the delay of all other stages to 10ns to ensure continuous flow. now, every stage has a delay of 10ns. Refer the best answer above. He has explained it nicely.

Can anyone guide me why can't we do like this ?

edited
Its the clock period which determines when the instruction will move to next hardware in the datapath - Not the execution time of datapath hardwares. I hope you now understand why it is 10+1 = 11 ns.
why FI of I9 is not taken  at  T9?

we are not doing it like this becoz pipeline's time is decided by its slowest unit!!

if it is still not clear to u, then tell me ill give u an example;

HOW TO GET 165  I8 IS NOT COUNTED IN SOLUTION
TOTAL CYCLE ARE 16 ->16 *11=176 ANS PLS ELABORATE 

after completing the branch instruction you have to come back so then why the answer is b
hahah

There will be 4 stalls ,so answer is C

In conditional branch instructions, condition and target address are resolved at EX' stage and bubbles are fed during its execution to the pipeline. (By default)

But we can also resolve it at ID -stage using more complex pipeline architecture( add simple ALU and computation unit at ID stage) but don't use this architecture unless otherwise stated.

without any branch prediction

then we should consider I5 also ????

If we don't use any method to remove control dependency, then I5 will be executed.

in the question, it has been mentioned that " If the branch is taken during the execution of this program" then why are we not fetching the I9 instruction in the execution stage of I4?

why are we waiting till the completion of the execution phase to start fetching?
What would be the solution had the question been asked for with branch prediction?
why we can’t perform fetch operation of I9 along with execution phase of I4?

how we are fetching I9 in the 8th cycle, it is not mentioned. Please explain this point

So, we can definitely move the branch comparison to the OF(ID stage in MIPS) stage and reduce the stall cycles by $1$. But since none of the options matches that, we are not considering this.

I guess for NAT question both answers would be true, because moving the branch comparison to earlier stages of the pipeline require complex circuitry to deal with forwarding and other data hazards.

cycles in pink are stall cycles, at EI-4 it was notified to the system that instruction 9 has to be loaded next.
We have completed execution in a total of 15 cycles where each cycle was (10+1)ns long,

Hence, answer = $15 \times 11 = 165$ns

In the given ans there would be stall for I8 too,plz clarify why only 3 stalls,.Plz explain someone.

Updated.

can somebody tell me pls differnce in flushing pipeline and stalling?here in this example we stall pipeline or flush pipeline .

### Could you explain why structural hazard has not been considered?

Instruction number 8 is omitted,  any particular reason?
Yes. Not able to understand . Could someone explain in detail ?
Usually when we drew timing diagram if fetch stage is 5 ns then we will place decode staage in 6th clock cycle . Here every stage is taking 1 clock cycle only. Why it is so ?
Here we are stalling the pipeline not flushing. Because it is specified in question that no prediction has to be made. So we cant predict that "branch will not be taken and fetch I5". Hence to stall is the only way possible.

Flushing is done once we fetched the instruction and started executing it and later realize that it is not the next instruction to be executed. Howerver if fetched instruction is the next instruction to be executed no flushing is done and performance improves.

To stall means "not performing any operations in those cycles." It will always degrade the performance.
why you dont consider the I8 i am totally confused for this
How do you know that
at EI-4 it was notified to the system that instruction 9 has to be loaded next.
Plz explain.
Clock Time = max stage delay + Buffer Delay

= 10+1= 11ns

I1 - Finish at 5th clock

I2 - Finish at 6th clock

I3 - Finish at 7th clock

I4- Finish at 8th clock

Due to branching at I4 pipelining halts and starts after  EI stage of I4 and performs FI of I9 at 8th clock.

I9 - Finish at 12th clock

I10 - Finish at 13th clock

I11- Finish at 14th clock

I12 - Finish at 15 clock

Total time to complete program = 11*15= 165 ns

Nice logic.No confusion.

Short Trick :

Branching descion will be taken in Execute Instruction (EI) phase (4th phase) so there will be 3 stalls

first I1 will complete in 5 cycles + (I2,I3,I4,3 STALLS,I9,I10,I11,I12) WILL TAKE ONE-ONE CYCLE=15 CYCLE

cycle time=(largest cycle time + buffer delay)=10+1=11

Execution time =11*15=165

clock cycle time= max. stage delay+ buffer delay

= 10+1

= 11 ns

clock cycle =15

execution time= 11*15=165
by

.......

How you identified the branch instruction is  conditional branch (like if else).
if it was a branch instruction like a function call  ?
In question it clearlly mention I4 is a branch instruction.
Does the question need more clarity about the branch instruction ?
Which instruction is not executed out of 12 instructions?

No of instructions = 12

each clock cycle takes = 11ns (max(5,7,10,8,6) + 1ns)

I4 is the branch instruction so till I4 ,

I1-I3 will take total time of 11ns to execute so 3 * 11ns = 33ns

I4

 IF ID OF EX WB

So we know the target address of I9 only after completing the execution phase of I4 , but till that point I5 - I8 will already start

their execution , so there will be stalls or wastage of cycle till that point

therefore time taken for I4-I8 to execute is = 5 * 11ns = 55ns (For all I1-I8 , CPI = 1 )

Then at I9 , CPI will be 4 ( As that will execute after execution phase so CPI = 4 for I9)

So time taken for I9 to execute = 4 * 11 = 44ns (CPI * time taken for each cycle)

Then again from I10 - I12 again they take CPI = 1 , so time taken by them is = 3 * 11ns = 33ns

Total time taken to execute this program is = 33ns + 55ns + 44ns + 33ns = 165ns

After pipelining we have to adjust the stage delays such that no stage will be waiting for another to ensure smooth pipelining (continuous flow). Since we can not easily decrease the stage delay, we can increase all the stage delays to the maximum delay possible. So, here maximum delay is 1010 ns. Buffer delay given is 1ns. So, each stage takes 1111 ns in total.

FI of I9 can start only after the EI of I4. So, the total execution time will be

15×11=165ns

### 1 comment

how you got 15
We can determine  if the branch is taken or not after the execution stage  of the instruction I4.So number of stall cycles will be (4-1)=3.

Before the branch is taken total number of instructions is  4 and after branch is taken total number of instructions is 4 .So total number of instructions to be executed is 8.So we can assume the situation as we have total 8 instructions to be executed in pipelined manner without any stall cycles.

There are  5 stages in the pipeline.So the number of cycles needed to execute first instruction is 5.After that in each clock cycle one instruction will be completed.So total number of clock cycles needed is

5+(8-1)=5+7=12.

Due to branching number of stall cycles overhead is 3.

So total number of clock cycles needed=12+3=15.

time required to complete one clock cycle is =max stage delay+buffer overhead=10+1=11

So total time required will be = 15*11=165.

Given stage 1 = FI, stage 2=DI, stage 3= FO, stage 4=EI and stage 5=WO

Cycle time =  max(stage delays)+buffer delay = max(5,7,10,8,6) + 1 = 11ns

Question also mentions I4 is the only Branch Instruction out of I1,I2..I12.

So Branch % fraction = Number of Branch Inst / Total Inst = 1/12

Question also mentions Branch is taken only after EI stage of I4

Branch Penalty = Stage n – 1 = 4 – 1 = 3 (Here n =4)

#stalls/inst = Branch% fraction * Branch Penalty = 1/12 * 3 = 0.25

Avg_inst_exec_time = (1 + #stalls/inst)*cycle time = (1+ 0.25) * 11 = 13.75 ns

Total_exec_time =  Total  instructions * Avg_inst_exec_time = 12 * 13.75 = 165 ns = Option B

Note: Better to follow this formula approach instead of drawing phase time table in GATE so that answer can be obtained under 2 mins without any mistake. These formulas are easy to remember and comes under branched instruction pipelining concept. But using phase time table you can also solve as