337 views

1024x1024 array of 32-bit numbers is to be normalized as follows. For each column the largest element is found and all elements of the column are divided by this maximum value. Assume that each page in the virtual memory consists of  4Kbytes and that 1Mbytes of the main memory are allocated for storing data during this computation. Suppose that it takes 40  ms to load a page from the disk to the main memory when a page fault occurs (assume that when we start, the main memory is empty ).

a.   How many page faults would occur if the elements of the array are stored in column order in the virtual memory?

b.   How many page faults would occur if the elements are stored in row order?

c.   Estimate the total time needed to perform this normalization for both arrangements a & b. Assume that it takes 2 ns to do a comparison, 20 ns to do a divide and 100 ns to do a load/store to memory.

retagged | 337 views

Page fault : We search main memory for some data and that data not present in it so we need to bring that in memory from secondry memory.

1024 x 1024 array of 32-bit numbers  means $1024\times 1024 \times 4$ Bytes $= 2^{20} = 2^{20}$ element each of (32 bits= 32/8= 4 Bytes).

So memory needed $=2^{22}$ Bytes

Pagesize in memory =  4 Kbytes $= 2^{12}$ Bytes

No. of elements stored in a page $= 2^{22}$ Bytes $/2^{12}$ Bytes$= 2^{10} = 1024$ elements

Total main memory allocated  = 1 Mbytes $= 2^{20}$ Bytes

So Number of pages in memory $= 2^{20}$ Bytes $/ 2^{12}$ Bytes $= 2^8$ pages in main memory maximum..

C language and many others force elements are stored in Row major order i.e. 1,2,3,4,5, ... in one row tilll row ends. When Row ends the start with new Row. Example:

A. Here elements are stored in column major order - transporse of the above figure. (Addresses in a program is always virtual on a system with virtual memory). Our program fetches all elements in a column and find the maximum (as in the innermost loop of selection sort). Then it again iterates through all the elements and divide each with this maximum value.

In each column we have 1024 elements, and all of them fit in a page. So, we have one page fault per column and then dividing each element by the maximum does not cause any page fault as all 1024 elements are in the cache. So, in total we have $1024$ page faults. (Assumes LRU page replacement).

B. Here we have row major order memory allocation. So, for first element access we have a page fault. For the next element access we again have a page fault. So, after seeing 1024 pages corresponding to 1024 columns, we have to iterate through all of them once more and each causes a new page fault -(this depends on the page replacement algorithm being used and we assume LRU as this is not given in question). i.e., for each memory access we are having a page fault.

#Page faults = 2 per matrix entry

$= 1024 \times 1024 \times 2 = 2M$

This is extremely high. But we can do a better implementationn by processing only 256 elements of a column at a time (then doing the same for next column and so on) so that while processing the next column we hit in the page table. i.e. we find the maximum of first 256 elements of a column at a time then do the same for next column and so on each causing a page fault. After 256 accesses, we move to next column and their we have first 256 elements already in memory and this continues for the first 256 columns. After this we proceed with the second set of 256 elements from each column, then 3rd and finally 4th- the max element of each quartet is propagated to next (assume temporary memory avalable).  This way (tiling to 4x4x256x256 or 4x1024x256), we can reduce the page fault to just 2 per entry- one for comparison and 1 for division. So, number of page faults will be

$= 2 \times 1024 = 2048.$

C. For each element we have to do a comparison except the first element of each column. So, in total we have $1023\times 1024$ comparisons excluding the bound checking in the loop. So, time for comparison is nearly

$1M \times 2ns = 2ms$.

Each element of the matrix is divided by the max column element. So, time for division

$= 1M \times 20ns = 20ms$

Each element of the matrix must be loaded to memory twice- first for comparison and second for division and then stored back. So, total time for load/store

$= 1M \times (2\times 100ns + 100ns) \\= 300ms$

So, total time excluding page fault time

$= 2 + 20 + 300 = 322 ms$.

Now for arrangement A, total time

= Page fault time for A + 322ms

$= 1024 \times 40ms + 322 ms \\ \approx 41 s$

For arrangement B, total time

= Page fault time for B + 322 ms

$= 2M \times 40ms + 322ms \\ \approx 82,000 s$

For optimized version, we get, total time

$= 2048 \times 40ms + 322 ms \\ \approx 82s$

1000 times better - that's what architecture specific optimization can do :)

selected by
0
I edited the A part- do you agree?
0
Sir after finding the max element in a column when we will divide the elements of column by the max element all the elements will result in hit as they will already be in memory, so why 1024*1024*2 faults. I am not getting it? plzz explain arjun sir.

Also for part B where elements are stored row major order and accessed column wise according to me every element will result in page fault. so total page faults will be 1024*1024. Please clear it.
0

when we will divide the elements of column by the max element all the elements will result in hit as they will already be in memory

Won't they be replaced before next access?

0
plzz explain arjun sir. I am not getting it. totally lost. Answer part A and B completely. dunno why i am totally confused.

In part A, suppose we are accessing first column and when we will access the first element of the column the entire page which contains the first column will be brought from virtual memory to main memory. We will find the max elemnt and we will then divide the elements of the column by that max elemnt without going to second column. Then how will it get replaced?? cant we write such code.
0
you are correct. I switched the row major order and column major order. See now..
+2
wonderful explanation. Everything is clear now. Thanku so much arjun sir. Thanx a tons.

Answer was given wrong everywhere in textbook solutions that caused a lot of trouble.
0
Added an optimized one for B part. LRU replacement is not mentioned- so I guess they assume the best replacement policy.
+1
Understood the optimized version completely. That's why everywhere answer was 2048. I didn't know about such optimization before. Thanx for explaining it beautifully. Learned lot from this question. Can such optimization also be applied between main memory and cache?
0

Sorry sir, But could not understand optimized one of part B completely.

This is extremely high. But we can do a better implementation by processing only 256 elements of a column at a time (then doing the same for next column and so on) so that while processing the next column we hit in the page table. In this way, we can ensure each page table gets hit 255 times- or we reduce the page fault to just 2 per entry- one for comparison and 1 for a division.

sir, here after processing first 256 elements of a column, we have to first complete processing of rest (1024-256= 768 elements ) elements of the first column( because we have to find max among 1024 elements of a column ) then only we will be processing the elements of next column right ??

please clear my doubt sir :p

+1
@sushmita yes, infact this is used mainly for the cache. Technique is called loop tiling and often done by compilers.

@vijaycs yes, we have to. But we just postpones it. See the answer now.
+1

Now fine. Thank you, sir. : )

(assume temporary memory available), which is 4kB ( array of size 1024).

+2
Arjun sir when we process the last 256 elements of each column we will get the max element that time and if we divide the elements of the last set in that pass only can't we reduce further page faults? I mean 2048-256=1782 page faults?
+1
@Susmita, I think yes, you are correct. (y)
0
yes :)

+1 vote