retagged by
2,568 views
4 votes
4 votes

1024x1024 array of 32-bit numbers is to be normalized as follows. For each column the largest element is found and all elements of the column are divided by this maximum value. Assume that each page in the virtual memory consists of  4Kbytes and that 1Mbytes of the main memory are allocated for storing data during this computation. Suppose that it takes 40  ms to load a page from the disk to the main memory when a page fault occurs (assume that when we start, the main memory is empty ).

a.   How many page faults would occur if the elements of the array are stored in column order in the virtual memory?

b.   How many page faults would occur if the elements are stored in row order?

c.   Estimate the total time needed to perform this normalization for both arrangements a & b. Assume that it takes 2 ns to do a comparison, 20 ns to do a divide and 100 ns to do a load/store to memory.

retagged by

1 Answer

Best answer
6 votes
6 votes

We start with what is given in question:

Page fault : We search main memory for some data and that data not present in it so we need to bring that in memory from secondry memory.

1024 x 1024 array of 32-bit numbers  means $1024\times 1024 \times 4$ Bytes $= 2^{20}  = 2^{20}$ element each of (32 bits= 32/8= 4 Bytes).

So memory needed $=2^{22}$ Bytes 

Pagesize in memory =  4 Kbytes $= 2^{12}$ Bytes

No. of elements stored in a page $= 2^{22}$ Bytes $/2^{12}$ Bytes$ = 2^{10} = 1024$ elements

Total main memory allocated  = 1 Mbytes $= 2^{20}$ Bytes

So Number of pages in memory $= 2^{20}$ Bytes $/ 2^{12}$ Bytes $= 2^8$ pages in main memory maximum..

C language and many others force elements are stored in Row major order i.e. 1,2,3,4,5, ... in one row tilll row ends. When Row ends the start with new Row. Example:

A. Here elements are stored in column major order - transporse of the above figure. (Addresses in a program is always virtual on a system with virtual memory). Our program fetches all elements in a column and find the maximum (as in the innermost loop of selection sort). Then it again iterates through all the elements and divide each with this maximum value.

In each column we have 1024 elements, and all of them fit in a page. So, we have one page fault per column and then dividing each element by the maximum does not cause any page fault as all 1024 elements are in the cache. So, in total we have $1024$ page faults. (Assumes LRU page replacement).

B. Here we have row major order memory allocation. So, for first element access we have a page fault. For the next element access we again have a page fault. So, after seeing 1024 pages corresponding to 1024 columns, we have to iterate through all of them once more and each causes a new page fault -(this depends on the page replacement algorithm being used and we assume LRU as this is not given in question). i.e., for each memory access we are having a page fault.

#Page faults = 2 per matrix entry

$= 1024 \times 1024 \times 2 = 2M$

This is extremely high. But we can do a better implementationn by processing only 256 elements of a column at a time (then doing the same for next column and so on) so that while processing the next column we hit in the page table. i.e. we find the maximum of first 256 elements of a column at a time then do the same for next column and so on each causing a page fault. After 256 accesses, we move to next column and their we have first 256 elements already in memory and this continues for the first 256 columns. After this we proceed with the second set of 256 elements from each column, then 3rd and finally 4th- the max element of each quartet is propagated to next (assume temporary memory avalable).  This way (tiling to 4x4x256x256 or 4x1024x256), we can reduce the page fault to just 2 per entry- one for comparison and 1 for division. So, number of page faults will be

$= 2 \times 1024 = 2048.$

C. For each element we have to do a comparison except the first element of each column. So, in total we have $1023\times 1024$ comparisons excluding the bound checking in the loop. So, time for comparison is nearly

$1M \times 2ns = 2ms$.

Each element of the matrix is divided by the max column element. So, time for division

$= 1M \times 20ns = 20ms$

Each element of the matrix must be loaded to memory twice- first for comparison and second for division and then stored back. So, total time for load/store

$= 1M \times (2\times 100ns + 100ns) \\= 300ms$

So, total time excluding page fault time

$= 2 + 20 + 300 = 322 ms$.

Now for arrangement A, total time

= Page fault time for A + 322ms

$= 1024 \times 40ms + 322 ms \\ \approx 41 s$

For arrangement B, total time

= Page fault time for B + 322 ms

$= 2M \times 40ms + 322ms \\ \approx 82,000 s$

For optimized version, we get, total time

$= 2048 \times 40ms + 322 ms \\ \approx 82s$

1000 times better - that's what architecture specific optimization can do :)

selected by

Related questions

1 votes
1 votes
2 answers
1
admin asked Oct 26, 2019
1,629 views
If an instruction takes $1\: nsec$ and a page fault takes an additional $n\: nsec,$ give a formula for the effective instruction time if page faults occur every $k$ instr...
0 votes
0 votes
2 answers
4
prashant9 asked Jun 15, 2016
6,396 views
(A) processes tend to the I/O-bound(B) size of pages is reduced(C) processes tend to be CPU-bound(D) locality of reference is applicable to the process