4. Application optimization
In general, to reach performance close to the theoretical peak, you need to write your algorithms in a form that allows the use of scientific library routines, such as BLAS/LAPACK. See General software and libraries for available and recommended libraries.
Performance monitoring
Some simple performance monitoring tools:
- Find the use of memory, CPU time and other information about all running tasks
- time spent, memory and page faults
- Monitor file access
- trace system calls and signals
- Simple tool to quantify cache misses
- Compiler option for optimization report
Performance tuning by Compiler flags.
Quick n'dirty.
--------------
Use ``ifort/icc -O3``.
We usually recommend the ``ifort/icc`` compilers, as they give superior performance on Stallo. Using ``-O3`` is a quick way to get reasonable performance for most applications. Unfortunately, the compiler sometimes breaks the code with ``-O3``, making it crash or give incorrect results. If that happens, try a lower optimization level, ``-O2`` or ``-O1``. If this doesn't help, let us know and we will try to solve the problem or report a compiler bug to Intel. If you need to use ``-O2`` or ``-O1`` instead of ``-O3``, please remember to add the ``-ftz`` flag too; it flushes denormal (very small) values to zero, which can have a huge impact on the performance of your application.
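As a rough sketch of the fallback described above, the small script below prints the compile line for each optimization level, adding ``-ftz`` for ``-O2`` and ``-O1``. The file name ``myprog.f90`` and the ``FC`` variable are placeholders and not part of this guide. The script only prints the commands, so you can check them before running them yourself.

```shell
#!/bin/sh
# Sketch: print the compile line for a given optimization level.
# Assumptions (not from the guide): the source file name and the FC variable.
FC="${FC:-ifort}"

compile_line() {
    level="$1"        # one of O3, O2, O1
    src="$2"          # hypothetical source file, e.g. myprog.f90
    out="${src%.*}"   # executable name: source name without its extension
    case "$level" in
        O3)    flags="-O3" ;;
        O2|O1) flags="-$level -ftz" ;;   # lower levels: add -ftz as advised above
        *)     echo "unknown level: $level" >&2; return 1 ;;
    esac
    echo "$FC $flags -o $out $src"
}

# Print the candidates from most to least aggressive optimization.
for level in O3 O2 O1; do
    compile_line "$level" myprog.f90
done
```

Start with the first line and move down the list only if the resulting program crashes or gives incorrect results.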
Profile based optimization
------------------------------------
The Intel compilers can do something called *profile based optimization*. This uses information from an actual execution of the application to generate more effective code. It is important that you run the application with a typical input set; otherwise the compiler will tune the application for a different usage profile than the one you are interested in. A typical input set means, for instance, a full spatial input set but only a few iterations of the time stepping.
#. Compile with ``-prof_gen``.
#. Run the app (might take a long time as optimization is turned off in this stage).
#. Recompile with ``-prof_use``.
The simplest case is to compile, run and recompile in the same directory; otherwise you need to use the ``-prof_dir`` flag, see the manual for details.
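The compile/run/recompile cycle could be scripted roughly as follows. This is only a sketch: the source file, executable and input file names are invented for the example, and passing the input file as a command-line argument is an assumption about your application. The flags are spelled as in the steps above; newer Intel compiler releases document them as ``-prof-gen`` and ``-prof-use``, so check the manual for the version you load. By default the script only prints the commands (``DRY_RUN=1``).

```shell
#!/bin/sh
# Sketch of the profile based optimization cycle, all in one directory.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
FC="${FC:-ifort}"
SRC="myprog.f90"       # hypothetical source file
EXE="./myprog"         # hypothetical executable
INPUT="typical.inp"    # hypothetical input: full problem, few time steps

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pgo_cycle() {
    run "$FC" -prof_gen -o "$EXE" "$SRC"   # 1. instrumented build
    run "$EXE" "$INPUT"                    # 2. training run (slow)
    run "$FC" -prof_use -o "$EXE" "$SRC"   # 3. rebuild using the profile
}

pgo_cycle
```

Set ``DRY_RUN=0`` once the printed commands look right for your application.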
To get started with more advanced optimization you can take a look at this pdf file from Intel.
intel-qref-222300_222300.pdf
IPM: MPI performance profiling
IPM is a tool that quickly gives an overview of the time spent in the different MPI calls. It is very simple to use and can also produce HTML output with a graphical representation of the results.
In your script or on the command line just write
module load ipm
and run your application as usual.
You can stop IPM with ipm_stop and restart it with ipm_start.
At the end of the run you will get an overview (standard output):
##IPMv0.982####################################################################
#
# command : a.out (completed)
# host    : stallo-1/x86_64_Linux          mpi_tasks : 1 on 1 nodes
# start   : 04/30/10/13:09:19              wallclock : 0.002069 sec
# stop    : 04/30/10/13:09:19              %comm     : 0.02
# gbytes  : 8.27026e-02 total              gflop/sec : 0.00000e+00 total
#
##############################################################################
# region  : *       [ntasks] =      1
#
#                  [total]        <avg>          min            max
# entries          1              1              1              1
# wallclock        0.002069       0.002069       0.002069       0.002069
# user             0.011998       0.011998       0.011998       0.011998
# system           0.024996       0.024996       0.024996       0.024996
# mpi              3.96278e-07    3.96278e-07    3.96278e-07    3.96278e-07
# %comm                           0.0191531      0.0191531      0.0191531
# gflop/sec        0              0              0              0
# gbytes           0.0827026      0.0827026      0.0827026      0.0827026
#
#
#                  [time]         [calls]        <%mpi>         <%wall>
# MPI_Comm_size    2.31201e-07    1              58.34          0.01
# MPI_Comm_rank    1.65077e-07    1              41.66          0.01
###############################################################################
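If you only want the headline numbers, the banner can be filtered with standard tools. The sketch below assumes the banner layout of the example above (IPM 0.982), where ``wallclock`` sits on the ``start`` line and ``%comm`` on the ``stop`` line. The function name ``ipm_summary`` is made up for this example, and the demo input is copied from the overview above.

```shell
#!/bin/sh
# Sketch: pull wallclock time and %comm out of an IPM banner read from stdin.
# Assumes the header layout of the IPM 0.982 example shown in this guide.
ipm_summary() {
    awk '
        # Header line: "# start : <date> wallclock : <seconds> sec"
        $1 == "#" && $2 == "start" {
            for (i = 3; i < NF; i++) if ($i == "wallclock") wall = $(i + 2)
        }
        # Header line: "# stop : <date> %comm : <percent>"
        $1 == "#" && $2 == "stop" {
            for (i = 3; i < NF; i++) if ($i == "%comm") comm = $(i + 2)
        }
        END { printf "wallclock=%s comm=%s\n", wall, comm }
    '
}

# Demo on the two relevant header lines.
ipm_summary <<'EOF'
# start : 04/30/10/13:09:19 wallclock : 0.002069 sec
# stop : 04/30/10/13:09:19 %comm : 0.02
EOF
# prints: wallclock=0.002069 comm=0.02
```

With the standard output of a job saved to a file (for example a hypothetical ``myjob.out``), the call would be ``ipm_summary < myjob.out``.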
This will also produce a file with a name like:
MyName.1272624855.201741.0
You can then run the command (on the front-end, stallo-1 or stallo-2, not a compute-node):
ipm_parse -html MyName.1272624855.201741.0
This produces a new directory with HTML files that you can view in your browser:
firefox a.out_1_MyName.1272624855.201741.0_ipm_unknown/index.html
Note that the use of hardware performance counters is not implemented yet on Stallo. Therefore IPM cannot give information about floating point operations, cache use etc.
For more details refer to:
http://ipm-hpc.sourceforge.net/userguide.html
VTune
Basic use of VTune
module unload openmpi
module unload intel-compiler
module load intel-compiler/12.0.4
module load intel-mpi
module load intel-tools
amplxe-gui
In the GUI, choose <new project>, then <new analysis> (for example Hotspots), then <get command line>, and edit the resulting command. For a parallel run you will have something like:
mpirun -np 32 amplxe-cl -collect hotspots -follow-child -mrte-mode=auto -target-duration-type=short -no-allow-multiple-runs -no-analyze-system -data-limit=100 -slow-frames-threshold=40 -fast-frames-threshold=100 -r res -- /My/Path/MyProg.x

