Parallel Sculpt

Hi All

Coreform 2021.5 Linux - RHEL8 - Cluster

Just wondering if anyone has any tips for debugging parallel Sculpt runs. I have a case which runs slowly but correctly in serial, but in parallel it produces lots of SIGSEGV (signal 11) errors.

Does anyone have rules of thumb regarding memory usage for parallel Sculpt problems? I don’t think I’m running out of memory; I’ve got access to 196 GB per node on my cluster. Are there any details that would help me understand how the algorithm decomposes the problem into MPI tasks? For example, if I run a functioning serial problem on 56 MPI tasks, will each MPI task occupy 1/56th of the memory, or is there significant memory overhead, such that 56 MPI tasks will need 56x the memory?
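To frame the memory question, here is a minimal sketch of the two limiting cases: perfect partitioning, where each task holds 1/N of the serial footprint, and full replication, where each task holds a complete copy. Where Sculpt actually falls between these bounds is not something this sketch can answer; the function name, the 10 GB serial footprint, and the one-node layout are all illustrative assumptions.

```python
def per_node_memory_bounds(serial_gb, n_tasks, tasks_per_node):
    """Return (best_case_gb, worst_case_gb) of memory needed on one node.

    Best case: the problem is split evenly, so each task holds
    serial_gb / n_tasks.
    Worst case: every task holds a full copy of the serial problem.
    A real decomposition lands somewhere in between (ghost layers,
    communication buffers, replicated input data).
    """
    if n_tasks < 1 or tasks_per_node < 1:
        raise ValueError("task counts must be positive")
    best = serial_gb / n_tasks * tasks_per_node
    worst = serial_gb * tasks_per_node
    return best, worst


# Hypothetical numbers: a 10 GB serial run on 56 tasks, all on one node.
best, worst = per_node_memory_bounds(serial_gb=10.0, n_tasks=56, tasks_per_node=56)
print(f"best case:  {best:.1f} GB on the node")
print(f"worst case: {worst:.1f} GB on the node")
```

Comparing the worst-case figure with the memory available per node shows whether a run would still fit even if nothing were partitioned; watching the resident memory of each rank during a small parallel run would show which bound the real behaviour is closer to.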

I couldn’t use the MPI that came packaged with Cubit, so I built OpenMPI 3.1.6 for this system, which passes all the tests. The problem could be that I’m using that MPI rather than the bundled one.

All suggestions appreciated.

Thanks

So the memory usage in serial for this problem is 28 GB, which is fine. I’m working towards what will be a (hyper?) detailed problem, so going 100x bigger isn’t really an issue. However, even running this on 4 nodes (with npernode=1) I get issues:

  Guaranteed Quality Threshold = 0.200000
Laplacian Iter: 1
Laplacian Iter: 2
Smoothing 2180663 hexes on 4 processors
Jacobi Opt Iter: 1,  Num bad: 220234, Num poor: 269523, Min SJ: -0.999999
WARNING: Unconstrained curve optimization was used for at least one node.
Some nodes may not lie on owning curves.
(Use curve_opt_thresh = -1 to turn off behavior)
Jacobi Opt Iter: 2,  Num bad: 184721, Num poor: 284976, Min SJ: -0.999999
Jacobi Opt Iter: 3,  Num bad: 164867, Num poor: 289469, Min SJ: -0.999999
Jacobi Opt Iter: 4,  Num bad: 149903, Num poor: 296794, Min SJ: -0.999999
Jacobi Opt Iter: 5,  Num bad: 139408, Num poor: 294261, Min SJ: -0.999998
[cpu-p-388:50138:0:50138] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x51c60000c3da)
[cpu-p-390:61674:0:61674] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x51c60000f0ea)
[cpu-p-389:189458:0:189458] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x51c60002e412)
[cpu-p-391:253961:0:253961] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x51c60003e009)
==== backtrace (tid: 253961) ====
 0 0x000000000004f6b5 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000743feb CubitString::number<int>()  ???:0
 2 0x000000000063f6f6 CubitString::number<int>()  ???:0
 3 0x000000000065707a CubitString::number<int>()  ???:0
 4 0x000000000065c886 CubitString::number<int>()  ???:0
 5 0x000000000061d809 CubitString::number<int>()  ???:0
 6 0x000000000061ec24 CubitString::number<int>()  ???:0
 7 0x000000000048ef7e std::vector<CubitString, std::allocator<CubitString> >::~vector()  ???:0
 8 0x000000000047ddcb std::_Rb_tree<int, std::pair<int const, double>, std::_Select1st<std::pair<int const, double> >, std::less<int>, std::allocator<std::pair<int const, double> > >::_M_erase()  ???:0
 9 0x0000000000477db3 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
10 0x0000000000462061 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
11 0x0000000000469c5d ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
12 0x0000000000472862 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
13 0x000000000043d368 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
14 0x0000000000022555 __libc_start_main()  ???:0
15 0x0000000000443bd7 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
=================================
[cpu-p-391:253961] *** Process received signal ***
[cpu-p-391:253961] Signal: Segmentation fault (11)
[cpu-p-391:253961] Signal code:  (-6)
[cpu-p-391:253961] Failing at address: 0x51c60003e009
[cpu-p-391:253961] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b02a21c5630]
[cpu-p-391:253961] [ 1] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x743feb]
[cpu-p-391:253961] [ 2] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x63f6f6]
[cpu-p-391:253961] [ 3] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x65707a]
[cpu-p-391:253961] [ 4] ==== backtrace (tid:  50138) ====
 0 0x000000000004f6b5 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000743feb CubitString::number<int>()  ???:0
 2 0x000000000063f6f6 CubitString::number<int>()  ???:0
 3 0x000000000065707a CubitString::number<int>()  ???:0
 4 0x000000000065c886 CubitString::number<int>()  ???:0
 5 0x000000000061d809 CubitString::number<int>()  ???:0
 6 0x000000000061ec24 CubitString::number<int>()  ???:0
 7 0x000000000048ef7e std::vector<CubitString, std::allocator<CubitString> >::~vector()  ???:0
 8 0x000000000047ddcb std::_Rb_tree<int, std::pair<int const, double>, std::_Select1st<std::pair<int const, double> >, std::less<int>, std::allocator<std::pair<int const, double> > >::_M_erase()  ???:0
 9 0x0000000000477db3 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
10 0x0000000000462061 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
11 0x0000000000469c5d ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
12 0x0000000000472862 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
13 0x000000000043d368 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
14 0x0000000000022555 __libc_start_main()  ???:0
15 0x0000000000443bd7 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
=================================
[cpu-p-388:50138] *** Process received signal ***
[cpu-p-388:50138] Signal: Segmentation fault (11)
[cpu-p-388:50138] Signal code:  (-6)
[cpu-p-388:50138] Failing at address: 0x51c60000c3da
==== backtrace (tid:  61674) ====
 0 0x000000000004f6b5 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000743feb CubitString::number<int>()  ???:0
 2 0x000000000063f6f6 CubitString::number<int>()  ???:0
 3 0x000000000065707a CubitString::number<int>()  ???:0
 4 0x000000000065c886 CubitString::number<int>()  ???:0
 5 0x000000000061d809 CubitString::number<int>()  ???:0
 6 0x000000000061ec24 CubitString::number<int>()  ???:0
 7 0x000000000048ef7e std::vector<CubitString, std::allocator<CubitString> >::~vector()  ???:0
 8 0x000000000047ddcb std::_Rb_tree<int, std::pair<int const, double>, std::_Select1st<std::pair<int const, double> >, std::less<int>, std::allocator<std::pair<int const, double> > >::_M_erase()  ???:0
 9 0x0000000000477db3 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
10 0x0000000000462061 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
11 0x0000000000469c5d ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
12 0x0000000000472862 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
13 0x000000000043d368 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
14 0x0000000000022555 __libc_start_main()  ???:0
15 0x0000000000443bd7 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
=================================
[cpu-p-390:61674] *** Process received signal ***
[cpu-p-390:61674] Signal: Segmentation fault (11)
[cpu-p-390:61674] Signal code:  (-6)
[cpu-p-390:61674] Failing at address: 0x51c60000f0ea
==== backtrace (tid: 189458) ====
 0 0x000000000004f6b5 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000743feb CubitString::number<int>()  ???:0
 2 0x000000000063f6f6 CubitString::number<int>()  ???:0
 3 0x000000000065707a CubitString::number<int>()  ???:0
 4 0x000000000065c886 CubitString::number<int>()  ???:0
 5 0x000000000061d809 CubitString::number<int>()  ???:0
 6 0x000000000061ec24 CubitString::number<int>()  ???:0
 7 0x000000000048ef7e std::vector<CubitString, std::allocator<CubitString> >::~vector()  ???:0
 8 0x000000000047ddcb std::_Rb_tree<int, std::pair<int const, double>, std::_Select1st<std::pair<int const, double> >, std::less<int>, std::allocator<std::pair<int const, double> > >::_M_erase()  ???:0
 9 0x0000000000477db3 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
10 0x0000000000462061 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
11 0x0000000000469c5d ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
12 0x0000000000472862 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
13 0x000000000043d368 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
14 0x0000000000022555 __libc_start_main()  ???:0
15 0x0000000000443bd7 ???()  /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt:0
=================================
[cpu-p-389:189458] *** Process received signal ***
/home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x65c886]
[cpu-p-391:253961] [ 5] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x61d809]
[cpu-p-391:253961] [ 6] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x61ec24]
[cpu-p-391:253961] [ 7] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x48ef7e]
[cpu-p-391:253961] [ 8] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x47ddcb]
[cpu-p-391:253961] [ 9] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x477db3]
[cpu-p-391:253961] [10] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x462061]
[cpu-p-391:253961] [11] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x469c5d]
[cpu-p-391:253961] [12] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x472862]
[cpu-p-391:253961] [13] [cpu-p-390:61674] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b83e9933630]
[cpu-p-390:61674] [ 1] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x743feb]
[cpu-p-390:61674] [ 2] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x63f6f6]
[cpu-p-390:61674] [ 3] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x65707a]
[cpu-p-390:61674] [ 4] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x65c886]
[cpu-p-389:189458] Signal: Segmentation fault (11)
[cpu-p-389:189458] Signal code:  (-6)
[cpu-p-389:189458] Failing at address: 0x51c60002e412
/home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x43d368]
[cpu-p-391:253961] [14] [cpu-p-388:50138] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2ac2043b2630]
[cpu-p-388:50138] [ 1] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x743feb]
[cpu-p-388:50138] [ 2] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x63f6f6]
[cpu-p-388:50138] [ 3] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x65707a]
[cpu-p-388:50138] [ 4] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x65c886]
[cpu-p-388:50138] [ 5] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x61d809]
[cpu-p-388:50138] [ 6] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x61ec24]
[cpu-p-388:50138] [ 7] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x48ef7e]
[cpu-p-388:50138] [ 8] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x47ddcb]
[cpu-p-388:50138] [ 9] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x477db3]
[cpu-p-388:50138] [10] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x462061]
[cpu-p-388:50138] [11] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x469c5d]
[cpu-p-388:50138] [12] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x472862]
[cpu-p-388:50138] [13] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x43d368]
[cpu-p-388:50138] [14] [cpu-p-390:61674] [ 5] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x61d809]
[cpu-p-390:61674] [ 6] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x61ec24]
[cpu-p-390:61674] [ 7] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x48ef7e]
[cpu-p-390:61674] [ 8] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x47ddcb]
[cpu-p-390:61674] [ 9] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x477db3]
[cpu-p-390:61674] [10] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x462061]
[cpu-p-390:61674] [11] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x469c5d]
[cpu-p-390:61674] [12] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x472862]
[cpu-p-390:61674] [13] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x43d368]
[cpu-p-390:61674] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b83eaf19555]
[cpu-p-390:61674] [cpu-p-389:189458] [ 0] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b02a37ab555]
[cpu-p-391:253961] [15] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x443bd7]
[cpu-p-391:253961] *** End of error message ***
[15] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x443bd7]
[cpu-p-390:61674] *** End of error message ***
/lib64/libpthread.so.0(+0xf630)[0x2b4e108a7630]
[cpu-p-389:189458] [ 1] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x743feb]
[cpu-p-389:189458] [ 2] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x63f6f6]
[cpu-p-389:189458] [ 3] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x65707a]
[cpu-p-389:189458] [ 4] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x65c886]
[cpu-p-389:189458] [ 5] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x61d809]
[cpu-p-389:189458] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac205998555]
[cpu-p-388:50138] [15] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x443bd7]
[cpu-p-388:50138] *** End of error message ***
[ 6] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x61ec24]
[cpu-p-389:189458] [ 7] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x48ef7e]
[cpu-p-389:189458] [ 8] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x47ddcb]
[cpu-p-389:189458] [ 9] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x477db3]
[cpu-p-389:189458] [10] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x462061]
[cpu-p-389:189458] [11] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x469c5d]
[cpu-p-389:189458] [12] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x472862]
[cpu-p-389:189458] [13] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x43d368]
[cpu-p-389:189458] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b4e11e8d555]
[cpu-p-389:189458] [15] /home/dc-davi4/Coreform-Cubit-2021.5/bin/psculpt[0x443bd7]
[cpu-p-389:189458] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 253961 on node cpu-p-391 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
3 total processes killed (some possibly by mpiexec during cleanup)

I should note here that the problem runs fine (slowly) in serial, but in parallel it fails consistently with this message. Valgrind didn’t show anything very useful. I’ll try gdb next; maybe it will show something helpful.
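Because all four ranks write to the same stream, the backtraces above are interleaved and hard to read. As a rough aid, here is a small sketch that regroups such a log by the `[host:pid]` tags Open MPI prints. The function name and the regular expression are my own assumptions, based only on the tag format visible in this log; fragments that carry no tag (for example a continuation that was split across lines) cannot be attributed and are filed under `None`.

```python
import re
from collections import defaultdict

# Matches tags such as "[cpu-p-391:253961]" or "[cpu-p-388:50138:0:50138]".
# Frame markers like "[13]" or "[0x61d809]" contain no colon and are ignored.
PREFIX = re.compile(r"\[([\w.-]+):(\d+)(?::\d+:\d+)?\]")


def group_by_process(lines):
    """Group log lines by the (host, pid) tags found in them.

    One physical line can hold fragments from two processes, so each
    line is split at every tag. Untagged text is stored under None.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        matches = list(PREFIX.finditer(line))
        if not matches:
            groups[None].append(line)
            continue
        if matches[0].start() > 0:
            # Text before the first tag has no identifiable owner.
            groups[None].append(line[: matches[0].start()].rstrip())
        for i, m in enumerate(matches):
            end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
            key = (m.group(1), int(m.group(2)))
            groups[key].append(line[m.start():end].rstrip())
    return groups


sample = [
    "[cpu-p-391:253961] [13] [cpu-p-390:61674] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b83e9933630]",
    "[cpu-p-388:50138] Signal: Segmentation fault (11)",
]
for key, fragments in group_by_process(sample).items():
    print(key, fragments)
```

Running this over the full log file (for example `group_by_process(open("job.out"))`, where `job.out` is a placeholder name) gives one list of fragments per process, which makes it easier to confirm that every rank fails with the same stack.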

Can I ping anyone on this? @gvernon or @scot, any ideas?

Unfortunately I’m not an MPI expert, and there are just too many variables involved in setting up your cluster for me to give you much in the way of advice. Have you looked at the Open MPI debugging FAQ (https://www.open-mpi.org/faq/?category=debugging)?

I am interested to know if the crash is happening as part of the Jacobi optimization or afterwards. What happens if you reduce the Max Jacobi Iterations? What happens if you turn off Jacobi iteration?
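One way to get a first indication from the existing output is to list the Jacobi iterations that were printed before the first segfault report. The sketch below does that for the line format shown in the log above; the function name and the regular expression are assumptions of mine, not part of Sculpt. Note that output buffering can delay or drop the last lines a crashing process wrote, so the result only says which iterations were reported, not exactly where the crash occurred.

```python
import re

# Matches lines such as:
# "Jacobi Opt Iter: 5,  Num bad: 139408, Num poor: 294261, Min SJ: -0.999998"
JACOBI = re.compile(
    r"Jacobi Opt Iter:\s*(\d+),\s*Num bad:\s*(\d+),"
    r"\s*Num poor:\s*(\d+),\s*Min SJ:\s*(-?\d+(?:\.\d+)?)"
)


def jacobi_progress(lines):
    """Return the Jacobi iterations reported before the first segfault line.

    Each entry is a tuple (iteration, num_bad, num_poor, min_sj).
    Scanning stops at the first line that reports a segmentation fault.
    """
    progress = []
    for line in lines:
        if "Caught signal" in line or "Segmentation fault" in line:
            break
        m = JACOBI.search(line)
        if m:
            progress.append(
                (int(m.group(1)), int(m.group(2)), int(m.group(3)), float(m.group(4)))
            )
    return progress


log = [
    "Jacobi Opt Iter: 4,  Num bad: 149903, Num poor: 296794, Min SJ: -0.999999",
    "Jacobi Opt Iter: 5,  Num bad: 139408, Num poor: 294261, Min SJ: -0.999998",
    "[cpu-p-388:50138:0:50138] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x51c60000c3da)",
]
for entry in jacobi_progress(log):
    print(entry)
```

Repeating the run with a different maximum number of Jacobi iterations and comparing the last reported iteration against that limit would show whether the crash follows the end of the optimization or happens partway through it.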

Thanks,
Karl