Here is pseudocode for summing with multiple threads:

    // each thread runs:
    int this_thread_id = get_thread_number(); // returns 0 to (NUM_THREADS - 1)
    sum_local[this_thread_id] = 0.0;
    for (i = this_thread_id; i < N; i += NUM_THREADS) {
        sum_local[this_thread_id] += foo(i); // foo(i) is just some other work
    }
    // join

    // main thread, after the join:
    for (i = 0; i < NUM_THREADS; i++)
        sum += sum_local[i];
On a multicore system this parallel version should run faster than the serial one. I have a dual-core system, but the running time does not drop to half even with NUM_THREADS = 2. I found this happens because of false sharing.
I googled and found an Intel page on false sharing. They explain it as follows:
There is a potential for false sharing on the array sum_local. This array is dimensioned according to the number of threads and is small enough to fit in a single cache line. When executed in parallel, the threads modify different, but adjacent, elements of sum_local (the source line shown in red), which invalidates the cache line for all processors.
Figure 1. False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.
In Figure 1, threads 0 and 1 require variables that are adjacent in memory and reside on the same cache line. The cache line is loaded into the caches of CPU 0 and CPU 1 (gray arrows). Even though the threads modify different variables (red and blue arrows), the cache line is invalidated, forcing a memory update to maintain cache coherency.
Later they explained more:
To ensure data consistency across multiple caches, multiprocessor-capable Intel® processors follow the MESI (Modified/Exclusive/Shared/Invalid) protocol. On the first load of a cache line, the processor will mark the cache line as ‘Exclusive’ access. As long as the cache line is marked exclusive, subsequent loads are free to use the existing data in the cache. If the processor sees the same cache line loaded by another processor on the bus, it marks the cache line with ‘Shared’ access. If the processor stores a cache line marked as ‘S’, the cache line is marked as ‘Modified’ and all other processors are sent an ‘Invalid’ cache line message. If the processor sees the same cache line which is now marked ‘M’ being accessed by another processor, the processor stores the cache line back to memory and marks its cache line as ‘Shared’. The other processor that is accessing the same cache line incurs a cache miss.
The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded. False sharing increases this coordination and can significantly degrade application performance.
Please explain this cache-line miss due to the MESI protocol. I find this interesting but could not clearly understand their explanation. I think the frequent write-backs to DRAM are causing the problem, but I am not sure. Please explain a bit. @Arjun Sir