A quick introduction to performance testing
Today I would like to show how one can get a quick estimate of the performance impact of a specific code change using the Linux tool perf
(see the perf tutorial for an introduction).
The Setup
I am considering the ASPECT pull request #5044 that removes an unnecessary copy of a vector inside the linear solver. I was curious how much of a difference this makes in practice.
First, we need to pick a suitable example prm file to run. The change is inside the geometric multigrid solver, so we need to run a test that uses it. We also want it to be large enough that we can easily time it without too much noise. For this, we are going to pick the nsinker benchmark and slightly modify the file (disable graphical output, disable adaptive refinement, choose 6 global refinements). See here for the file.
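The relevant modifications look roughly like this (a sketch using ASPECT's standard parameter syntax; the rest of the nsinker file stays unchanged):

# no adaptive refinement, 6 global refinements
subsection Mesh refinement
  set Initial global refinement          = 6
  set Initial adaptive refinement        = 0
  set Time steps between mesh refinement = 0
end

# graphical output is disabled by leaving 'visualization' out of the
# postprocessor list ('memory statistics' matches the output shown below)
subsection Postprocess
  set List of postprocessors = memory statistics
end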
We can only get a good estimate of the performance difference if we compare optimized versions. That is why we need to compile both versions of ASPECT (with and without the change) in optimized mode. We also use native optimizations, as recommended for the geometric multigrid solver.
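One way to set this up is sketched below (the directory names are hypothetical, and your configure line may need more options, such as the location of deal.II; -march=native is one way to request native optimizations):

# configure and build one version in optimized mode (sketch)
mkdir build-new && cd build-new
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS="-march=native" \
      /path/to/aspect-source-with-patch
make -j 8
# repeat in a second build directory for the source without the patch,
# so that we end up with two executables, ../aspect-new and ../aspect-old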
A first test
With perf correctly configured, we can get a first impression of the program by running
perf stat ./aspect test.prm
and get something like the following output:
perf stat ../aspect-new test.prm
-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
-- . version 2.5.0-pre
-- . using deal.II 9.4.1 (dealii-9.4, 6a1115bbf6)
-- . with 32 bit indices and vectorization level 2 (256 bits)
-- . using Trilinos 13.2.0
-- . using p4est 2.3.2
-- . running in OPTIMIZED mode
-- . running with 1 MPI process
-----------------------------------------------------------------------------
Loading shared library <./libnsinker.so>
Vectorization over 4 doubles = 256 bits (AVX), VECTORIZATION_LEVEL=2
-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
-- https://aspect.geodynamics.org/citing.html?ver=2.5.0-pre&mf=1&sha=&src=code
-----------------------------------------------------------------------------
Number of active cells: 262,144 (on 7 levels)
Number of degrees of freedom: 8,861,381 (6,440,067+274,625+2,146,689)
*** Timestep 0: t=0 seconds, dt=0 seconds
Solving Stokes system...
GMG coarse size A: 81, coarse size S: 8
GMG n_levels: 7
Viscosity range: 0.01 - 100
GMG workload imbalance: 1
Stokes solver: 28+0 iterations.
Schur complement preconditioner: 29+0 iterations.
A block preconditioner: 29+0 iterations.
Relative nonlinear residual (Stokes system) after nonlinear iteration 1: 0.999967
Postprocessing:
System matrix memory consumption: 101.42 MB
Termination requested by criterion: end time
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |       117s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system rhs      |         1 |      15.9s |        14% |
| Build Stokes preconditioner     |         1 |      5.96s |       5.1% |
| Initialization                  |         1 |     0.106s |         0% |
| Postprocessing                  |         1 |   0.00504s |         0% |
| Setup dof systems               |         1 |        10s |       8.5% |
| Setup initial conditions        |         1 |      8.58s |       7.3% |
| Setup matrices                  |         1 |      3.37s |       2.9% |
| Solve Stokes system             |         1 |      68.1s |        58% |
+---------------------------------+-----------+------------+------------+
-- Total wallclock time elapsed including restarts: 117s
-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
-- https://aspect.geodynamics.org/citing.html?ver=2.5.0-pre&mf=1&sha=&src=code
-----------------------------------------------------------------------------
Performance counter stats for '../aspect-new test.prm':
117,495.66 msec task-clock # 0.987 CPUs utilized
1,134 context-switches # 9.651 /sec
44 cpu-migrations # 0.374 /sec
4,838,162 page-faults # 41.177 K/sec
412,324,642,261 cycles # 3.509 GHz
1,021,213,808,438 instructions # 2.48 insn per cycle
122,571,132,407 branches # 1.043 G/sec
338,189,797 branch-misses # 0.28% of all branches
119.093952900 seconds time elapsed
112.405895000 seconds user
5.087723000 seconds sys
As you can see, we are indeed running in optimized mode, with vectorization enabled, and we are solving a 3d problem with 8.8 million degrees of freedom. It takes about 68 seconds to solve the Stokes system with a single MPI rank.
The real setup
For a more realistic test, we will run the same program with 4 MPI ranks (this way, at least some of the small costs of possible changes in communication are accounted for) by using mpirun -n 4 ./aspect. Finally, perf supports running the program several times and averaging the statistics. This turns out to be necessary, as the change is otherwise too small to detect. Our final command line is thus
perf stat -r 10 mpirun -n 4 ../aspect test.prm
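Since we want to compare the two builds under identical conditions, we run this measurement once per executable; these are the two binaries whose outputs are shown below (-r 10 repeats each measurement ten times and reports means and spreads):

perf stat -r 10 mpirun -n 4 ../aspect-old test.prm
perf stat -r 10 mpirun -n 4 ../aspect-new test.prm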
The result
The output without the patch is
Performance counter stats for 'mpirun -n 4 ../aspect-old test.prm' (10 runs):
182,419.23 msec task-clock # 4.010 CPUs utilized ( +- 0.44% )
1,042 context-switches # 5.934 /sec ( +- 7.12% )
137 cpu-migrations # 0.780 /sec ( +- 4.94% )
2,016,941 page-faults # 11.485 K/sec ( +- 0.36% )
536,241,394,539 cycles # 3.054 GHz ( +- 0.25% )
1,180,113,849,900 instructions # 2.25 insn per cycle ( +- 0.12% )
159,889,552,768 branches # 910.491 M/sec ( +- 0.24% )
446,788,836 branch-misses # 0.28% of all branches ( +- 0.30% )
45.494 +- 0.200 seconds time elapsed ( +- 0.44% )
while the new version gives
Performance counter stats for 'mpirun -n 4 ../aspect-new test.prm' (10 runs):
174,309.09 msec task-clock # 3.880 CPUs utilized ( +- 0.21% )
1,350 context-switches # 7.787 /sec ( +- 4.85% )
102 cpu-migrations # 0.588 /sec ( +- 5.06% )
1,993,599 page-faults # 11.499 K/sec ( +- 0.38% )
522,637,629,676 cycles # 3.015 GHz ( +- 0.11% )
1,170,583,946,847 instructions # 2.24 insn per cycle ( +- 0.09% )
157,504,211,601 branches # 908.506 M/sec ( +- 0.17% )
448,626,534 branch-misses # 0.29% of all branches ( +- 0.23% )
44.9306 +- 0.0919 seconds time elapsed ( +- 0.20% )
Conclusion
The new code executes about 1% fewer instructions, and the total runtime decreases from 45.5 to 44.9 seconds (with some uncertainty, see the error estimates above). The Stokes solve takes around 26 seconds (not shown), which means the patch improves the Stokes solve by about 2%.
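To spell out that estimate (attributing the whole saving to the solver, which is plausible because the patch only touches solver code):

45.494 s - 44.931 s ≈ 0.56 s saved per run
0.56 s / 26 s ≈ 2% of the Stokes solve time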
What is not taken into account here is that the construction and use of the vector also causes some MPI communication, which is potentially more expensive when running large simulations on more than a single node.
(written by Timo Heister)