
Here is pseudocode for a multithreaded sum:

set_num_threads(NUM_THREADS);
double sum = 0.0;
double sum_local[NUM_THREADS];   // adjacent elements sit on the same cache line

parallel region
{
    sum_local[this_thread_id] = 0.0;
    for each i assigned to this_thread:
        sum_local[this_thread_id] += foo(i);   // foo(i) is just some other work
} // join

for (i = 0; i < NUM_THREADS; i++)
    sum += sum_local[i];


If I use this parallel version instead of a serial program on a multicore system, the running time should drop compared to the serial version. I have a dual-core system, but the time does not reduce to half even when I use NUM_THREADS = 2.

I found that this happens because of false sharing.

I googled and found an Intel page on false sharing. They explain it as follows:

There is a potential for false sharing on the array sum_local. This array is dimensioned according to the number of threads and is small enough to fit in a single cache line. When executed in parallel, the threads modify different, but adjacent, elements of sum_local (the source line shown in red), which invalidates the cache line for all processors.

Figure 1. False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.

In Figure 1, threads $0$ and $1$ require variables that are adjacent in memory and reside on the same cache line. The cache line is loaded into the caches of CPU $0$ and CPU $1$ (gray arrows). Even though the threads modify different variables (red and blue arrows), the cache line is invalidated, forcing a memory update to maintain cache coherency.

Later they explained more:

To ensure data consistency across multiple caches, multiprocessor-capable Intel® processors follow the MESI (Modified/Exclusive/Shared/Invalid) protocol. On the first load of a cache line, the processor will mark the cache line as ‘Exclusive’ access. As long as the cache line is marked exclusive, subsequent loads are free to use the existing data in the cache. If the processor sees the same cache line loaded by another processor on the bus, it marks the cache line with ‘Shared’ access. If the processor stores a cache line marked as ‘S’, the cache line is marked as ‘Modified’ and all other processors are sent an ‘Invalid’ cache line message. If the processor sees the same cache line which is now marked ‘M’ being accessed by another processor, the processor stores the cache line back to memory and marks its cache line as ‘Shared’. The other processor that is accessing the same cache line incurs a cache miss.

The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded. False sharing increases this coordination and can significantly degrade application performance.

Please explain this cache miss caused by the MESI protocol. I found it interesting but could not clearly understand their explanation. I think frequent DRAM write-backs are causing the problem, but I am not sure.

Please explain a bit. @Arjun Sir

0

I removed all OpenMP constructs to make the pseudocode simpler. Here is the link to the full Intel article.

+1 vote
One thread writes to a "cache line" - usually 64 bytes. So these 64 bytes get invalidated in the other core too, due to the cache coherence protocol. This forces accesses to that data to go through the highest shared cache/memory across the cores. In short, we reduce the instructions executed in each core to about half, but we also reduce the hit rate in the L1 cache to almost 0. Maybe next year's GATE question will ask you to calculate the time difference in this scenario :)
0
Regarding this line: "So these 64 bytes get invalidated in the other core too, due to the cache coherence protocol."

So when thread 2 in the second core tries to write to an adjacent location, the complete cache line needs to be loaded again from the lower level into core 2's cache. Is that correct, sir?
+1
Yes, exactly. It depends on how the cache is implemented, but something similar must happen in any implementation.