retagged by
1,055 views
2 votes
2 votes

Here is pseudo code for a multiprocessing purpose:

set_num_threads(NUM_THREADS);
double sum=0.0;
sum_local[NUM_THREADS];
parallel region 
{

 int this_thread_id = get_thread_number(); // returns 0 to (no_of_threads-1)
 sum_local[this_thread_id] = 0.0;

 for (i = this_thread_id; i < N; i += NUM_THREADS)
  sum_local[this_thread_id] += foo(i); // foo(i) is just some other work

} // join

// main thread
for(int i=0;i<NUM_THREADS;i++) 
  sum += sum_local[i];

Instead of using a serial program on a multicore core system if I use this parallel version the running time should reduce compared to serial version. I have dual core system. But it does not reduce time to half even if I use NUM_THREADS = 2

I found this happens because of false sharing.

I googled and found an Intel page on false sharing. They have explained as follows:

There is a potential for false sharing on the array sum_local. This array is dimensioned according to the number of threads and is small enough to fit in a single cache line. When executed in parallel, the threads modify different, but adjacent, elements of sum_local (the source line shown in red), which invalidates the cache line for all processors.

Figure 1. False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.

In Figure 1, threads $0$ and $1$ require variables that are adjacent in memory and reside on the same cache line. The cache line is loaded into the caches of CPU $0$ and CPU $1$ (gray arrows). Even though the threads modify different variables (red and blue arrows), the cache line is invalidated, forcing a memory update to maintain cache coherency.

Later they explained more:

To ensure data consistency across multiple caches, multiprocessor-capable Intel® processors follow the MESI (Modified/Exclusive/Shared/Invalid) protocol. On the first load of a cache line, the processor will mark the cache line as ‘Exclusive’ access. As long as the cache line is marked exclusive, subsequent loads are free to use the existing data in the cache. If the processor sees the same cache line loaded by another processor on the bus, it marks the cache line with ‘Shared’ access. If the processor stores a cache line marked as ‘S’, the cache line is marked as ‘Modified’ and all other processors are sent an ‘Invalid’ cache line message. If the processor sees the same cache line which is now marked ‘M’ being accessed by another processor, the processor stores the cache line back to memory and marks its cache line as ‘Shared’. The other processor that is accessing the same cache line incurs a cache miss.

The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded. False sharing increases this coordination and can significantly degrade application performance.

Please explain this cache line miss due to this MESI protocol. I found this interesting but I could not understand clearly what they have explained. I think frequent DRAM write back causing the problem, but not very clear, though. 

please explain a bit. @Arjun Sir

retagged by

1 Answer

1 votes
1 votes
One thread writes to a "cache line" - usually 64 bytes. So, this 64 bytes get invalidated in the next core also due to cache coherence protocol. This causes all accesses to memory to go to the highest shared cache/memory across the cores. In short, we reduce the instructions being executed to about half in each core, but also decreases the hit rate in L1 cache to almost 0. May be next year GATE question will be to calculate the time difference in this scenario :)

Related questions

0 votes
0 votes
0 answers
2
Anshul kumar singh asked Jul 18, 2022
644 views
Quantify the effect on performance that results from the use of a cache in the case of a program that has a total of 500 instructions, including a 100-instruction loop th...
0 votes
0 votes
1 answer
4
Tulsidhar M asked Nov 30, 2018
805 views
What is the correct formula for tag directory size in associative mapped cache concept.?