valgrind cache profiler
by Peter Wind, last modified Nov 07, 2011 02:29 PM
Simple tool to quantify cache misses

Valgrind is much more than a cache profiler (see also valgrind). Only the cache profiler, cachegrind, is presented here. Valgrind emulates the CPU, so it is slow and its counts are not necessarily exact.

Example:

    valgrind --tool=cachegrind MyProg.x

    ==24983== D refs: 44,747,225 (33,333,323 rd + 11,413,902 wr)

This means that data was accessed 44,747,225 times. Of these accesses, 31,041,844 were not found in the D1 cache (the highest-level cache) and 83,793 were not found in the LL cache either (the last-level cache, referred to below as the low-level cache). These two counts come from the miss lines of the same summary, which are not reproduced here. Roughly, a miss in the high-level cache increases the access time by a factor of 10, and a miss in the low-level cache increases it by a factor of 100. In this example most of the data is available in the low-level cache, but not in the high-level cache.
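The rule of thumb above can be turned into a rough number. The sketch below is only an illustration: the function name cache_cost is made up here, and the cost model (1 for a D1 hit, 10 for a D1 miss served by the low-level cache, 100 for a low-level miss) is an assumption taken directly from the factors quoted in the text, not something valgrind reports.

```shell
#!/bin/sh
# Rough cost estimate from cachegrind data-access counts.
# Assumed cost model (rule of thumb only): D1 hit = 1,
# D1 miss served by the low-level cache = 10, low-level miss = 100.
cache_cost() {
    # $1 = D refs, $2 = D1 misses, $3 = LL misses (commas are allowed)
    printf '%s %s %s\n' "$1" "$2" "$3" | tr -d ',' | awk '{
        refs = $1; d1 = $2; ll = $3
        cost = (refs - d1) * 1 + (d1 - ll) * 10 + ll * 100
        printf "D1 miss rate: %.1f%%\n", 100 * d1 / refs
        printf "LL miss rate: %.2f%%\n", 100 * ll / refs
        printf "average cost per access: %.2f\n", cost / refs
    }'
}

# Numbers from the example run above.
cache_cost 44,747,225 31,041,844 83,793
```

For the example run this prints a D1 miss rate of 69.4%, an LL miss rate of 0.19%, and an average cost per access of 7.41. Under this crude model the program spends most of its data-access time on D1 misses, which matches the conclusion in the text.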
For parallel runs:

    mpirun valgrind --tool=cachegrind MyProg.x

Each process writes its own output file, by default named cachegrind.out.<pid>. Analyse the output files with:

    cg_annotate cachegrind.out.2705
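A parallel run leaves one output file per process, so it can be convenient to annotate them all in one go. The following is a minimal sketch, not part of valgrind: the function name annotate_all and the ANNOTATE override variable are hypothetical, and the reports are simply written next to the inputs with a .txt suffix.

```shell
#!/bin/sh
# Annotate every cachegrind output file in a directory, one report per process.
# ANNOTATE is a hypothetical override so the loop can be tried without valgrind.
annotate_all() {
    dir=${1:-.}
    for f in "$dir"/cachegrind.out.*; do
        [ -e "$f" ] || continue               # nothing matched: skip the literal pattern
        case $f in *.txt) continue ;; esac    # skip reports left by a previous call
        "${ANNOTATE:-cg_annotate}" "$f" > "$f.txt"
        echo "wrote $f.txt"
    done
}

# Usage after an MPI run, from the directory holding the output files:
#   annotate_all .
```

Comparing the per-process reports can show whether all ranks suffer from the same cache misses or only some of them.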
Overview of cache sizes

You can get an exact description of the cache and processor hierarchy using cpuinfo from intel-mpi:

    module unload openmpi
    module load intel-mpi
    cpuinfo
For example, on c0-13 on Stallo:

    c0-13 ~]$ cpuinfo
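If cpuinfo from intel-mpi is not available, a Linux machine usually exposes the same cache hierarchy through sysfs. The sketch below assumes the standard /sys/devices/system/cpu/cpu0/cache layout; the function name list_caches and its optional directory argument are made up for illustration, and the output format differs from that of cpuinfo.

```shell
#!/bin/sh
# Print level, type and size of each cache visible to CPU 0 via Linux sysfs.
# The optional argument (another base directory) exists only for illustration.
list_caches() {
    base=${1:-/sys/devices/system/cpu/cpu0/cache}
    for idx in "$base"/index*; do
        [ -r "$idx/size" ] || continue    # no cache entries here: print nothing
        printf 'L%s %s %s\n' \
            "$(cat "$idx/level")" "$(cat "$idx/type")" "$(cat "$idx/size")"
    done
}

list_caches
```

Each output line gives the cache level, whether it holds data, instructions or both, and its size. These sizes help to judge whether the working set of a program can fit in the D1 or low-level cache.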
