Fix VirtPidTbl initialization to not rely on getpid.#1200
Conversation
There was a problem hiding this comment.
I'll look at this more deeply soon. But right now, I'm seeing that for the original libconsttructor.c file that I sent before, this works:
bin/dmtcp_launch env LD_PRELOAD=$PWD/libconstructor.so ./test/dmtcp1
But this hits a segfault:
gdb -ex 'b constructor' -ex 'r' -ex 'cont' --args bin/dmtcp_launch env LD_PRELOAD=$PWD/libconstructor.so ./test/dmtcp1
The stack shows:
(gdb) where
#0 0x00007ffff7d8d062 in dmtcp::SharedData::setPidMap (virt=40000, real=208860) at shareddata.cpp:548
#1 0x00007ffff7cd6d7f in pid_event_hook (event=DMTCP_EVENT_INIT, data=0x0) at pid/pid.cpp:315
#2 0x00007ffff7d5fb38 in dmtcp::PluginManager::eventHook (event=DMTCP_EVENT_INIT, data=0x0) at pluginmanager.cpp:107
#3 0x00007ffff7d52134 in dmtcp_initialize_entry_point () at dmtcpworker.cpp:254
#4 0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffddd8,
env=env@entry=0x7fffffffdde8) at ./elf/dl-init.c:70
When I analyze this, I see:
I can track this, by setting a breakpoint:
(gdb) b src/plugin/pid/pid.cpp:315
When I do this, then I reach this breakpoint, and `sharedDataHeader -> numPidMaps` at `shareddata.cpp:548` is 0.
It succeeds.
Then it reaches the breakpoint for `consturctor()` and it succeeds.
Then it again reaches `sharedDataHeader -> numPidMaps` at `shareddata.cpp:548`. And the value is still 0. But somehow the integer index `i`, now has the value 6.
I'm using gcc-11 on WSL Ubuntu. Could it be that some union of types has a bug in it, and we are overwriting the value in the integer `i`, when in GDB?
@karya0 , Could you look at this issue? Thanks.
|
@gc00 : Can you try the latest updates and see if you are still seeing this issue? I tried to reproduce it locally, but couldn't. |
|
@karya0 , (In the previous updated email that I sent to you today, Apr. 8, I had investigated further with the old PR #1200. I recommend that you still briefly read that email also. There was something there I didn't understand: an obviously correct assembly isntruction was doing a segfault. I was guessing that it had to do with us mmap'ing a text segment during dmtcp_launch, but I don't understand how that would happen. Don't spend too much time, but it's worth a quick read, to see if it gives you insights.) @xuyao0127 Please let us know if DMTCP now properly supports MANA without mods to DMTCP. |
|
@karya0 I tested this MANA with Perlmutter and VASP, and it works now. |
gc00
left a comment
There was a problem hiding this comment.
Thanks for this important PR. This will help us in the extreme cases about constructor initialization that we've seen in the past.
I've added some small inline suggested changes to the comments. But otherwise:
LGTM
Using getpid() in VirtPidTbl initialization causes recursion in certain cases where a foreign constructor may be called before DMTCP's initializer kicks in.
This ensures that struct pthread contains a valid virtual tid.
6da87b1 to
9f9aef3
Compare
Using getpid() in VirtPidTbl initialization causes recursion in certain cases where a foreign constructor may be called before DMTCP's initializer kicks in.