# GATE2006-42

8.3k views

A CPU has a five-stage pipeline and runs at $1$ GHz frequency. Instruction fetch happens in the first stage of the pipeline. A conditional branch instruction computes the target address and evaluates the condition in the third stage of the pipeline. The processor stops fetching new instructions following a conditional branch until the branch outcome is known. A program executes $10^9$ instructions out of which $20\%$ are conditional branches. If each instruction takes one cycle to complete on average, the total execution time of the program is:

1. $\text{1.0 second}$
2. $\text{1.2 seconds}$
3. $\text{1.4 seconds}$
4. $\text{1.6 seconds}$

edited
3
program execute at 10^9

frequency =1/10^9

instruction fetch in first stage and conditional branch at third stage,so 2 stall cycle.

20% conditional branch and remaining 80%.non conditional branch

CPI of conditional branch=3*1/10^9

CPI of non conditional branch=1*1/10^9

conditional branch(10^9*20/100)(3*1/10^9)+non conditional branch(10^9*80/100)(1*1/10^9).

0.6+0.8=1.4ns
0

@Arjun Sir, In GO 2020 edition there is typing mistake , instead of $10^9$ there it is written $109$

0

A conditional branch instruction computes the target address and evaluates the condition in the third stage of the pipeline.

Branch penalty$:3-1=2$

$\# of\ stalls/inst:0.2\times 2=0.4$

average inst. $ET_{pipe}=CPI_{pipe}\times Cycle\ time_{pipe}$

Ideal $CPI_{pipe}$ is always 1. But due to control dependency problem stall cycles are created. So,

avg. inst exec. time$: (1+\#\ of\ stalls/inst)\times cycle\ time=(1+0.4)\times \dfrac{1}{1GHz}=1.4ns$

Total exec. time$: 10^9(instruction)\times 1.4ns=1.4\ sec$

Delay slots in the pipeline caused due to a branch instruction is $2$ as after the $3^{rd}$
stage of current instruction (during $4^{th}$ stage) IF of next begins.
Ideally, this should be during $2\text{nd}$ stage.

So, for total no. of instructions = $10^9$ and $20\%$ branch,
we have $0.2 \times 2 \times 10^9 = 4 \times 10^8$ cycle penalty.

Since clock speed is $1\text{ GHz}$ and each instruction on average takes $1$ cycle,
total execution time in seconds will be

$=\dfrac{10^9}{10^9}+4 \times \dfrac{10^8}{10^9}$

$= 1.4$

Correct Answer: $C$

edited
0
thank you sir :)
5
Sir,

number of stalls per branch instruction=2

we have 20% branch instructions=$2\times 10^{8}$

therefore stall cycles caused by 20% of instructions=$2\times 2\times 10^{8}$

Execution time=$10^{9}/10^{9}+4\times 10^{8}/10^{9}$=1.4

https://gateoverflow.in/1818/gate2006_42
0
Thanks for the correction :)
3

@Arjun sir I have a silly doubt here

If each instruction takes one cycle to complete on average

Does this not imply that we can simply do this ---   No. of instructions * clock cycle/instruction * Tclock    as the number of cycles on an average required for the instruction is given.

In this case the answer could be 1.0 sec.

1
Which formula is being used to calculate the execution time??
0
Execution time in Seconds or Nano-Seconds?

I think options must be in Nano-seconds for value 1.4 to be correct.
0

@Arjun sir

Here are we considering that branch is always NOT TAKEN.. hence we have to execute all $10^{9}$ instructions

Else if branch would have been taken instructions to execute will be less hence time also

correct me if wrong .

0

@jatin khachane 1-Here the concern is not the branch is taken or not.The concern is the stalls caused when the conditional instructions are fetched.

0
Yes but anyway both cases branch taken or not we stalls happen..but if branch taken then we may skip some instructions hence time will be less..

0
In this question we have control hazard.right?
because we are having stalls
5

In the GOPDF for 2020, the value is given as $109$ instead of $10^9$. Please fix when you can. 0
Same error for me
0
@arjun sir here one instruction takes takes one cycle so one stage should take 0.2 cycles so why are you considering stall cycles to be 2 it should be 0.4 stall cycles please help me here
No. of non branch instructions $= 8 \times 10^8$

No. of branch instructions $= 2 \times 10^8$

When the branch condition is evaluated in third stage- first and second stage of pipeline are empty. So,
no. of stalls created by control hazard(branch instruction) $= 2 \times 2 \times 10^8 = 4 \times 10^8$

Then,   total no of cycle for execution $= 8 \times 10^8 + 2 \times 10^8 + 4 \times 10^8 ns \\= 14 \times 10^8ns = 1.4s$
30

(1+stall frequency* stall cycle)* tc

NOW , stall freq is 20% and stall cycle is 2 bcz we have to wait  upto 2 stage and in 3rd stage we are getiing the evaluation.

So taccesstime=(1+0.20*2)*1/10^9)  for 1 instuction then for 10^9 instruction will be (1+0.20*2)*1/10^9* 10^9= 1.4 sec

1
I branch instruction should take 2 cycles why have u taken 3
1
bcz in the question itsself given i.e conditional instructions evaluate in the  3rd stage , so there will be 2 stall cycle.
0
Why you have added 4*10^8 ?
1
total no of cycles for execution=[ [80%(0+1)+20%(2+1)] *10^-9] 10^9

=1.4ns
Total excution time is :{(3*2*10^8)+(8*10^1)}*10^-9=1.4 sec

where first term in addition is no of cycle taken by branch instruction to excute ( 2 stall + 1 for its fetch )

and next every instruction is complete in 1

It it correct way of thinking ?
0
Yes, thats also correct :)
0

0
a conditional branch instruction  compute target address and evaluate in the third stage it means no of stalls =2

and clock per instruction(cpi)=1+no of stalls      =1+2=3

20% are branch instruction and 80% are non branch

There are 2 methods to solve this type of question. So for branch instructions CPI will be 3 , as no of stalls are 2 , and there are 2*10^8 branch instructions , therefore total machine cycles used =2*10^8*3 and for remaining 8*10^8 instructions CPI will be 1 as they are non branch instructions , and total machine cycles taken =8*10^8*1,now total machine cycles used = 8*10^8*1+3*2*10^8 = 14*10^8 , now cpu frequency is 1Ghz Therefore memory cycle time =1/(10^9), finally total execution time = total no of machine cycles used *cycle time=(14*10^8)/10^9= 1.4 sec
1 vote

Following are my two approach -

## First Approach:

$10^{9}$ * 20% = 2 *$10^{8}$ will be executed in non-pipelined mode.

$10^{9}$ * 80% = 8 * $10^{8}$ will be executed in pipelined mode.

so according to formula, Total execution time = pipeline mode + non-pipeline mode

$(2 * 10^{8})$ * 3 * 1ns  +  $[5 + (8 * 10^{8}) - 1]$ * 1ns = 1.4 sec + 4ns

so I can neglect 4ns & hence ans is  1.4 sec

## 2nd Approach:

Total time = New CPI * cycle time * Instruction count

New CPI = ( Old CPI + stall-rate*stall-cycle)

so, total time = (1+ 0.2*2) * 1ns * 10^9 = 1.4 sec

0

@MRINMOY_HALDER

New CPI = ( Old CPI + stall-rate*stall-cycle)

Stall rate means stall frequency for conditional branch instructions only which is 20%=0.2

Is it correct what I thinking

0
Can you please explain the second last line in first approach.
0
Why did you multiplied $2\times 10^8$ with $3$. Why its not 5 which is equal to the number of stages.
0
20% is the branch instruction & these branch instruction will be executed in non-pipeline mode, but for only upto 3 cycles, after 3 cycles we can overlap those branch instructions.

i.e, for branch instruction CPI = 3.

## Related questions

1
4.3k views
Consider two cache organizations. First one is $32$ $kB$ $2$-way set associative with $32$ $byte$ block size, the second is of same size but direct mapped. The size of an address is $32$ $bits$ in both cases . A $2$-to-$1$ multiplexer has latency of $0.6 ns$ while a $k-$bit comparator has ... while that of direct mapped is $h_2$. The value of $h_2$ is: $2.4$ $ns$ $2.3$ $ns$ $1.8$ $ns$ $1.7$ $ns$
A CPU has a $32$ $KB$ direct mapped cache with $128$ byte-block size. Suppose $A$ is two dimensional array of size $512 \times512$ with elements that occupy $8-bytes$ each. Consider the following two $C$ code segments, $P1$ and $P2$. $P1$: for (i=0; i<512; i++) { for (j=0; j<512; j++) ... and that for $P2$ be $M2$. The value of the ratio $\frac{M_{1}}{M_{2}}$: $0$ $\frac{1}{16}$ $\frac{1}{8}$ $16$
A CPU has a $32 KB$ direct mapped cache with $128$ byte-block size. Suppose A is two dimensional array of size $512 \times512$ with elements that occupy $8$-bytes each. Consider the following two C code segments, $P1$ and $P2$. P1: for (i=0; i<512; i++) { for (j=0; j<512; j++ ... misses experienced by $P1$ be $M_{1}$and that for $P2$ be $M_{2}$. The value of $M_{1}$ is: $0$ $2048$ $16384$ $262144$
Consider two cache organizations. First one is $32 \hspace{0.2cm} KB$ $2-way$ set associative with $32 \hspace{0.2cm} byte$ block size, the second is of same size but direct mapped. The size of an address is $32 \hspace{0.2cm} bits$ in both cases . A $2-to-1$ ... $h_1$ is: $2.4 \text{ ns}$ $2.3 \text{ ns}$ $1.8 \text{ ns}$ $1.7 \text{ ns}$