# GATE2006-81

5.1k views

A CPU has a $32$ $KB$ direct mapped cache with $128$ byte-block size. Suppose $A$ is two dimensional array of size $512 \times512$ with elements that occupy $8-bytes$ each. Consider the following two $C$ code segments, $P1$ and $P2$.

$P1$:

for (i=0; i<512; i++)
{
for (j=0; j<512; j++)
{
x +=A[i] [j];
}
}

P2:

for (i=0; i<512; i++)
{
for (j=0; j<512; j++)
{
x +=A[j] [i];
}
}

$P1$ and $P2$ are executed independently with the same initial state, namely, the array $A$ is not in the cache and $i$, $j$, $x$ are in registers. Let the number of cache misses experienced by $P1$ be $M1$ and that for $P2$ be $M2$.

The value of the ratio $\frac{M_{1}}{M_{2}}$:

1. $0$
2. $\frac{1}{16}$
3. $\frac{1}{8}$
4. $16$

edited
11

another good source-

1
Nice:)

$\text{Number of Cache Lines}= \dfrac{2^{15}B}{128B}= 256$

$\text{In 1 Cache Line} =\dfrac{128B}{8B} = 16\ elements$

$P_1=\dfrac{\text{total elements in array}}{\text{elements in a cache line}}$

$\quad=\dfrac{512 \times 512}{16}= 2^{14}= 16384.$

$P_2= 512 \times 512=2^{18}$

$\dfrac{P_1}{P_2}=\dfrac{16384}{512 \times 512}$

$\quad = 2^{14-18}= 2^{-4}=\dfrac{1}{16}$

It is so, because for $P_1$ for every line there is a miss, and once a miss is processed we get $16$ elements in memory.
So, another miss happens after $16$ elements.
For $P_2$ for every element there is a miss because storage is row major order(by default) and we are accessing column wise.

Hence, answer is option B.

edited
0
Please explain no. of cache lines=2^15/128B . How 2^15 ?
0
it means whenever storage will be in  row major order(by default) and we will try to  access in  column wise. then for every element there will be miss

Code being C implies array layout is row-major.

http://en.wikipedia.org/wiki/Row-major_order

When A[0][0] is fetched, 128 consecutive bytes are moved to cache. So, for the next 128/8 -1= 15 memory references there won't be a cache miss. For the next iteration of i loop also the same thing happens as there is no temporal locality in the code. So, number of cache misses for P1

$= \frac{512}{16} \times 512$

$= 32 \times 512$

$=2^{14} = 16384$

In the case of P2, the memory references are not consecutive. After A[0][0], the next access is A[1][0] which is after 512 * 8 memory locations. Since our cache block can hold only 128 contiguous memory locations, A[1][0] won't be in cache after a[0][0] is accessed. Now, the next location after A[0][0] is A[0][1] which will be accessed only after 512 iterations of the inner loop- after 512 distinct memory  block accesses. In our cache we have only space for 32 KB/128 B = 256 memory blocks. So, by the time A[0][1] is accessed, its cache block would be replaced. So, each of the memory access in P2 results in a cache miss. Total number of cache miss

$= 512\times 512$

So, $\frac{M_1}{M_2} = \frac{32 \times 512}{512 \times 512} = \frac{1}{16}$

2
Superb Explanation :)
0
I am not able to understand it. Can u plzzz make ot more elaborate?? Why is every memory reference is a cache miss?
2
@sushmita,

for P2 access is via coloumn major order, but C uses row major for storage(by default) so here the blocks will be bring is Row major order only to cache (as it is order of default storage)i.e. storage order = retrieval order.

but it not the the requirement of P2, so every access will be a Miss in according to P2s requirement.. i.e. 2^18 misses.
2
Nice explanation sir ...thnks :)
0
excellent explanation ... :-)
0
<p>They should also mention the block replacement policy for P2, In case of LRU or FIFO, we will get every memory reference as a miss. But in case of random block replacement, we might not get every memory access as miss.</p>
1
@hemant

its a direct mapped cache, it require no page replacement policy because frame no. 'f' will always be replaced by block no. (f mod n), where n is total number of lines(blocks) in cache.
0
Yes, you are right, I missed that.
0

In our cache we have only space for 32 KB/128 B = 256 memory blocks. So, by the time A[0][1] is accessed, its cache block would be replaced.

The quote above is not entirely obvious to me.

Please elaborate about this; how can we be sure that 512 iterations of the inner loop will definitely remove the block containing A[0][0]. Even if there are 512 accesses to distinct bocks for every iteration of the outer loop, maybe not all blocks are replaced and instead some cache lines are replaced multiple times?

For example, it could be the case that in each of the outer loop, every odd cache line is replaced twice and the even lines are left untouched after the initial compulsory miss. This means that when time comes to visit A[0][1], it will in fact be in the first cache block.

0
@Arjun sir, What would be the miss ratio in P2 if we were using Optimal Block replacement policy instead of LRU?

No. of elements/block = 128/8=16

one row contains 512 elements

No. of blocks/row= 512/16 = 32

No. of cache lines = 32KB/128B= 256

FOR M1

Now to access one row 32 miss operations will be occurred

to access whole array it requires = 512 * 32 (miss operations)

= 16384

FOR M2

Now to access one column 512 miss operations will occur(as it is column major)

to access whole array it requires = 512 * 512 (miss operations)

so  M1/M2=1/16

1 vote

## Check previous question..in that question i calculated cache lines and MM block..Hence here Option B is correct.

1 vote

(1) Cache size $=32 KB=2^{15}$

(2) Virtual address size > 15 bits (lets denote it as {TagBits + 15 bits} )

(3) Block size $=128 B=2^7$

(4) $512=2^9$

(5) Array element size $=8B=2^3$

(6) $\frac{(3)}{(5)}=\frac{128B}{8B}=2^4=16$ elements per block

(7) Word offset = 7 bits …from (3)

(8) Bits required to index word in cache =15   …from (1)

(9) Bits to index cache line $=(8)-(7)=15-7=9$

(10) Assume A[0][0] is stored at 0th memory location = {TagBits + 15 zeroes}

15 zeroes mod $2^7$ = most significant 8 zeroes, which are used to index cache lines:

So A[0][0] will go in line 0.

(11) Bytes/row = (4) × (5) = $2^{12}$

(12) Address of last byte in first row, i.e. of A[0][511] = {00..00 + twelve 1’s}

(13) Address of first byte in 2nd row = next address of (12) = {00..00 + 1 + twelve 0’s}

(14) Address of first byte in 3rd row = next address of (12) = {00..00 + 10 + twelve 0’s}

(15) So, CACHE_LINE occupied by row is incremented $32=2^5$ for each row.

There are $2^9=512$ rows  … from (9)

$\frac{2^9}{2^5}=2^4=16$

Every 16th row will occupy same line. Cache line of block containing A[0][0] will be replaced by block of A[16][0], far before accessing A[0][1]. Same for other elements. Thus there will be no hits at all while accessing any array element.

edited
0
I think bits to index cache line is 8 bits. Correct me if I'm wrong.

## Related questions

1
5.8k views
Consider two cache organizations. First one is $32$ $kB$ $2$-way set associative with $32$ $byte$ block size, the second is of same size but direct mapped. The size of an address is $32$ $bits$ in both cases . A $2$-to-$1$ multiplexer has latency of $0.6 ns$ while a $k-$bit comparator has ... while that of direct mapped is $h_2$. The value of $h_2$ is: $2.4$ $ns$ $2.3$ $ns$ $1.8$ $ns$ $1.7$ $ns$
A CPU has a $32 KB$ direct mapped cache with $128$ byte-block size. Suppose A is two dimensional array of size $512 \times512$ with elements that occupy $8$-bytes each. Consider the following two C code segments, $P1$ and $P2$. P1: for (i=0; i<512; i++) { for (j=0; j<512; j++ ... misses experienced by $P1$ be $M_{1}$and that for $P2$ be $M_{2}$. The value of $M_{1}$ is: $0$ $2048$ $16384$ $262144$
Consider two cache organizations. First one is $32 \hspace{0.2cm} KB$ $2-way$ set associative with $32 \hspace{0.2cm} byte$ block size, the second is of same size but direct mapped. The size of an address is $32 \hspace{0.2cm} bits$ in both cases . A $2-to-1$ ... $h_1$ is: $2.4 \text{ ns}$ $2.3 \text{ ns}$ $1.8 \text{ ns}$ $1.7 \text{ ns}$
A CPU has a cache with block size 64 bytes. The main memory has $k$ banks, each bank being $c$ bytes wide. Consecutive $c$ − byte chunks are mapped on consecutive banks with wrap-around. All the $k$ banks can be accessed in parallel, but two accesses to the same bank ... and $k = 24$, the latency of retrieving a cache block starting at address zero from main memory is: 92 ns 104 ns 172 ns 184 ns