
valgrind cache profiler

by Peter Wind last modified Nov 07, 2011 02:29 PM

Simple tool to quantify cache misses

Valgrind is much more than a cache profiler (see also valgrind). Only its cache profiler, cachegrind, is presented here.

Valgrind emulates the CPU. It is therefore slow, and the results are not necessarily exact.

Example:

valgrind --tool=cachegrind MyProg.x
==24983== D   refs:          44,747,225  (33,333,323 rd   + 11,413,902 wr)
==24983== D1  misses:    31,041,844  (20,966,798 rd   + 10,075,046 wr)
==24983== LLd misses:           83,793  (    24,382 rd   +     59,411 wr)
==24983== D1  miss rate:         69.3% (      62.9%     +       88.2%  )
==24983== LLd miss rate:           0.1% (       0.0%     +        0.5%  )

This means that data was accessed 44,747,225 times. Of these accesses, 31,041,844 missed the D1 cache (the first-level data cache, i.e. the highest-level cache) and 83,793 missed the LL cache (the last-level, i.e. lowest-level, cache).
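The miss rates in the last two lines of the output follow directly from the counts above them. A minimal sketch that recomputes them, with the counts copied from the example; miss_rate is a hypothetical helper name, not a valgrind command:

```shell
# miss_rate MISSES REFS: print MISSES as a percentage of REFS, two decimals.
# (hypothetical helper for illustration; not part of valgrind)
miss_rate() {
    awk -v m="$1" -v r="$2" 'BEGIN { printf "%.2f%%\n", 100 * m / r }'
}

miss_rate 31041844 44747225   # D1 misses / D refs   -> 69.37%
miss_rate 83793 44747225      # LLd misses / D refs  -> 0.19%
```

The example output shows 69.3% and 0.1% for the same ratios, which suggests that this valgrind version cut off the extra digits instead of rounding.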

As a rough rule of thumb, a miss in the high-level (D1) cache increases the access time by a factor of 10, and a miss in the low-level (LL) cache increases it by a factor of 100.

In this example most of the data is found in the low-level cache, but not in the high-level cache.
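The rough factors of 10 and 100 can be combined with the counts into an average relative cost per data access. A minimal sketch, assuming a cost of 1 for a D1 hit, 10 for a D1 miss served by the LL cache and 100 for an LL miss; avg_cost is a hypothetical helper name, and the result is only as good as those rule-of-thumb factors:

```shell
# avg_cost REFS D1_MISSES LL_MISSES: average relative cost per data access.
# Assumed costs: 1 (D1 hit), 10 (D1 miss, LL hit), 100 (LL miss).
avg_cost() {
    awk -v refs="$1" -v d1m="$2" -v llm="$3" 'BEGIN {
        d1_hits = refs - d1m      # accesses served by the D1 cache
        ll_hits = d1m - llm       # missed D1 but found in the LL cache
        printf "%.2f\n", (d1_hits * 1 + ll_hits * 10 + llm * 100) / refs
    }'
}

avg_cost 44747225 31041844 83793   # -> 7.41
```

Under these assumptions the example program spends about 7.4 times longer on data accesses than it would if every access hit the D1 cache, and almost all of that comes from the D1 misses, not from the LL misses.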

 

For parallel runs:

mpirun  valgrind --tool=cachegrind MyProg.x

Each process writes its own output file, named cachegrind.out.<pid> by default. Analyse the output files with

cg_annotate cachegrind.out.2705
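A parallel run usually leaves several cachegrind.out.* files behind, one per process. A minimal sketch of a loop that annotates all of them; the function name annotate_all, the CG_ANNOTATE override variable and the .txt report suffix are assumptions made for illustration:

```shell
# annotate_all [DIR]: run cg_annotate on every cachegrind.out.* file in DIR
# (default: current directory) and save each report next to its input file.
# CG_ANNOTATE is a hypothetical override, e.g. for a dry run with another command.
annotate_all() {
    dir=${1:-.}
    for f in "$dir"/cachegrind.out.*; do
        case $f in *.txt) continue ;; esac   # skip reports from earlier runs
        [ -f "$f" ] || continue              # pattern matched nothing
        "${CG_ANNOTATE:-cg_annotate}" "$f" > "$f.txt"
        echo "wrote $f.txt"
    done
}
```

Calling annotate_all in the directory of the run writes one report per process, for example cachegrind.out.2705.txt.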

 

 

Overview of cache sizes

You can get an exact description of the cache and processor hierarchy using cpuinfo from intel-mpi:

module unload openmpi
module load intel-mpi
cpuinfo
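On Linux nodes where intel-mpi is not available, similar information can usually be read from sysfs. A minimal sketch, assuming the standard /sys/devices/system/cpu/cpu0/cache layout; show_caches is a hypothetical helper name, and the optional directory argument exists only so the function can be tried on a copy of that tree:

```shell
# show_caches [DIR]: print level, type and size of each cache seen by one CPU.
# DIR defaults to the sysfs cache directory of cpu0.
show_caches() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cache}
    for idx in "$dir"/index*; do
        [ -r "$idx/size" ] || continue       # entry missing or no sysfs here
        printf 'L%s %s %s\n' "$(cat "$idx/level")" \
            "$(cat "$idx/type")" "$(cat "$idx/size")"
    done
}

show_caches
```

This prints one line per cache, in the form level, type (Data, Instruction or Unified) and size, and prints nothing if the directory does not exist.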

 

For example on c0-13 on Stallo:

c0-13 ~]$ cpuinfo
Intel(R) Xeon(R)  CPU E5640 
=====  Processor composition  =====
Processors(CPUs)  : 16
Packages(sockets) : 2
Cores per package : 4
Threads per core  : 2
=====  Processor identification  =====
Processor    Thread Id.    Core Id.    Package Id.
0           0           0           0  
1           0           0           1  
2           0           10          0  
3           0           10          1  
4           0           1           0  
5           0           1           1  
6           0           9           0  
7           0           9           1  
8           1           0           0  
9           1           0           1  
10          1           10          0  
11          1           10          1  
12          1           1           0  
13          1           1           1  
14          1           9           0  
15          1           9           1  
=====  Placement on packages  =====
Package Id.    Core Id.    Processors
0           0,10,1,9        (0,8)(2,10)(4,12)(6,14)
1           0,10,1,9        (1,9)(3,11)(5,13)(7,15)
=====  Cache sharing  =====
Cache    Size        Processors
L1    32  KB        (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L2    256 KB        (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L3    12  MB        (0,2,4,6,8,10,12,14)(1,3,5,7,9,11,13,15)
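To relate these cache sizes to a program's working set, it can help to express them as a number of array elements. A minimal sketch, assuming 8-byte (double precision) elements unless another size is given; elems_in_cache is a hypothetical helper name:

```shell
# elems_in_cache SIZE_KB [ELEM_BYTES]: number of elements that fit in a cache
# of SIZE_KB kilobytes. ELEM_BYTES defaults to 8 (double precision, assumed).
elems_in_cache() {
    echo $(( $1 * 1024 / ${2:-8} ))
}

elems_in_cache 32               # L1,  32 KB -> 4096
elems_in_cache 256              # L2, 256 KB -> 32768
elems_in_cache $((12 * 1024))   # L3,  12 MB -> 1572864
```

According to the cache sharing table, each L1 and L2 cache is shared by the two hardware threads of one core and each L3 cache by the eight logical processors of one package, so the share available to a single process is smaller when all cores are busy.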

 

 
