24,399 views

Consider a $4$ stage pipeline processor. The number of cycles needed by the four instructions $I1, I2, I3, I4$ in stages $S1, S2, S3, S4$ is shown below:

$$\begin{array}{|c|c|c|c|c|} \hline \textbf{} & \textbf {S1} &\textbf {S2} & \textbf {S3} & \textbf{S4 } \\\hline \textbf{I1}& 2 & 1 & 1 & 1 \\\hline \textbf{I2} & 1 & 3 & 2 & 2\\\hline \textbf{I3}& 2 & 1 & 1 & 3 \\\hline \textbf{I4} & 1 & 2 & 2 & 2 \\\hline \end{array}$$

What is the number of cycles needed to execute the following loop?

For (i=1 to 2) {I1; I2; I3; I4;}

1. $16$
2. $23$
3. $28$
4. $30$

edited
It will be considered

For people wondering why the debate between 23 cycles and 25 cycles, here is some visual help:

The stages in red are the ones leading to a different answer. Reason:

Take for example the 1st red in I4 at cycle 7. The 1’ could’ve been scheduled in cycle 6 itself, under the assumption as given in the selected answer :

“buffers between the pipeline stages can store multiple results in the form of a queue”

But, some people are pointing out that it shouldn’t work like that, because I3 is already being stalled in stage 1 in cycle 6, and thus I4 can’t be scheduled there.

This is what @bikram points out:

“yes , answer would be 25 if multistage buffer not use and it is nearest to option B , so this option tells we need to use multistage buffer”

Hope this clears everything.

nice explanation @rish-18

Here bound of the loop are constants, therefore compiler will do the loop unrolling(If compiler won't then prefetcher will do) to increase the instruction level parallelism. And after loop unrolling $23$ cycles are required for execution. Therefore, correct answer would be (B).

PS: We assume the buffers between the pipeline stages can store multiple results in the form of a queue.

$$\tiny \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline &C_1&C_2&C_3&C_4&C_5&C_6&C_7&C_8&C_9&C_{10}&C_{11}&C_{12}&C_{13}&C_{14}&C_{15}&C_{16}&C_{17}&C_{18}&C_{19}&C_{20} &C_{21}&C_{22}&C_{23}\\\hline \bf{I_1}&S_1&S_1&S_2&S_3&S_4\\\hline \bf{I_2}&&&S_1&S_2&S_2&S_2&S_3& S_3&S_4&S_4\\\hline \bf{I_3}&&&&S_1&S_1&\color{red}{-}&S_2&\color{red}{-}&S_3&\color{red}{-}&S_4& S_4&S_4\\\hline \bf{I_4}&&&&&&S_1&\color{red}{-}&S_2&S_2&S_3&S_3&\color{red}{-}&\color{red}{-}&S_4&S_4\\\hline \bf{I_1}&&&&&&&S_1&S_1&\color{red}{-}&S_2&\color{red}{-}&S_3&\color{red}{-}&\color{red}{-}&\color{red}{-}&S_4\\\hline \bf{I_2}&&&&&&&&&S_1&\color{red}{-}&S_2&S_2&S_2&S_3& S_3&\color{red}{-}&S_4&S_4\\\hline \bf{I_3}&&&&&&&&&&S_1&S_1&\color{red}{-}&\color{red}{-}&S_2&\color{red}{-}&S_3&\color{red}{-}&\color{red}{-}&S_4& S_4&S_4\\\hline \bf{I_4} &&&&&&&&&&&&S_1&\color{red}{-}&\color{red}{-}&S_2&S_2&S_3&S_3&\color{red}{-}&\color{red}{-}&\color{red}{-}&S_4&S_4\\\hline \end{array}$$

by

How to know if we have to use loop parallelism or no???

@Vickyrix

in the second iteration ..in I2  the s2 can get started at c13 itself because the s2 of I1 is completed by then. So this way the answer we are getting is 23 cycles only and not 25.

anything wrong?

The answer in case of single stage buffer will be 25

and in case of multi stage buffer will be 23?

correct me if i am wrong
and by default which type of buffer should we consider
this is the loop level level paralellism question but when we apply the loop level parallelism then we get 25 cycles so dis is not in option so we have to do it without loop level parallelism and the frst time loop output the result at 15 cc so total 30 cc for 2 iterations

### 1 comment

how is this a loop level parallelism question?

When using the following timeline we get 23 cycles:

In the 6th cycle, instruction 4 is entering stage 1 while instruction 3 has not started stage 2. This requires additional output buffer for stage 1 and we can assume this is available. (This additional buffer is not required if all stage delays are the same).

Hence , option (b) is true.

Ok,  in above diag,  in 6th cycle we are doing it introducing i4 while i3 hasn't  gone to next stage

In this ques if we take that approach,  we will get diff ans i think i4 will be introduced in 7th clock.

but why cant we use that stage if its free?cant we just use intermediate buffers to store prev instruction output. We will require those buffers only at end of that stage for new instruction.

Ok so i hv read discussion regarding this doubt in below ans.. So in which cases we can use that particular stage.. As we are using here

Process-2 can't get into slot-9 untill it completes first time!

@Arjun sir,

Why loop enrolling is not used in https://gateoverflow.in/3690/gate2004-it-47

Concept for solving this question:

Stage Si is allocated to instruction Ii iff

1. Eligibility: Ii must have completed Si-1 ... X time

2. Availability: Ii-1 must have released Si ... Y time

Time of allocation =  Max(X,Y)

In first iteration:

I1 completes after 5th cycle

I2 completes after 10th cycle

I3 completes after 13th cycle

I4 completes after 15th cycle

In second iteration

I1 completes after 16th cycle

I2 completes after 18th cycle

I3 completes after 21th cycle

I4 completes after 23th cycle

In second iteration I2 completes in 20th cycle and not 18th cycle. So we have to go with 30 only as 25 is not present in options.
23 is not correct u will get 25 but as this is not in the options loop level parallelism wont work here and u have to go drctly by 2*15=30