Since the cache size is $8\;KB$ and the block size is $64$ bytes, the number of cache blocks (lines) $ = \dfrac{8\;KB}{64 \;B} = 128;$ the cache being direct mapped, each memory block maps to exactly one of these lines.
In the original loop, consecutive inner-loop accesses go to separate cache lines, being $10000\times8 = 80K$ bytes apart in memory, so each of the $1000$ inner-loop iterations causes a cache miss.
We also have to check whether any cache line fetched during the inner loop can give a hit in the next iteration of the outer loop. This is not possible: the accesses are $80K$ bytes apart, which is $10$ times the cache size, and addresses that differ by an exact multiple of the cache size map to the same cache line, so each access evicts the previously fetched line. Thus, in the original loop, the number of cache misses $=1000\times 1000 = 1M.$
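To make the access pattern concrete, here is a minimal C sketch of the original traversal. The array name, its dimensions (a row stride of $10000$ doubles, i.e. $80K$ bytes) and the loop body are assumptions chosen only to reproduce the strides used above, not the exact code from the question.

```c
#define N      1000    /* iterations of each loop                          */
#define STRIDE 10000   /* assumed row length: 10000 * 8 B = 80K-byte rows  */

double a[N][STRIDE];   /* assumed 8-byte (double) elements                 */

/* Original version: the inner loop walks down a column, so consecutive
 * accesses are STRIDE * 8 = 80K bytes apart and hit a new memory block
 * (mapping to the same direct-mapped line) on every iteration.           */
void original(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[j][i] += 1.0;
}
```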
Now, let's look at the loop-interchanged version. Here, all inner-loop memory accesses are to adjacent memory locations. Since a cache line can hold $64/8=8$ elements of $a,$ we get a cache miss only once every $8$ iterations. So, the number of cache misses $=1000 \times \left\lceil \dfrac{1000}{8} \right\rceil = 1000 \times 125 = 125K.$
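A matching sketch of the interchanged version, under the same assumed declarations as the sketch above: swapping the loops makes the inner loop walk along a row, so consecutive accesses are only $8$ bytes apart.

```c
/* Interchanged version (uses N and a[] as declared in the previous sketch):
 * the inner loop now walks along a row, so 8 consecutive doubles share one
 * 64-byte cache line and only every 8th access misses.                     */
void interchanged(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[j][i] += 1.0;
}
```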
So, the reduction in cache misses $ = 1M - 125K = 875K.$
Percentage reduction in cache misses $ = \dfrac{1M-125K}{1M} \times 100\%= 87.5\%.$