Skip to content

Fix VirtPidTbl initialization to not rely on getpid.#1200

Merged
karya0 merged 3 commits intomainfrom
dev/karya0/init_affinity
Apr 11, 2025
Merged

Fix VirtPidTbl initialization to not rely on getpid.#1200
karya0 merged 3 commits intomainfrom
dev/karya0/init_affinity

Conversation

@karya0
Copy link
Copy Markdown
Member

@karya0 karya0 commented Mar 31, 2025

Using getpid() in VirtPidTbl initialization causes recursion in certain cases where a foreign constructor may be called before DMTCP's initializer kicks in.

Copy link
Copy Markdown
Contributor

@gc00 gc00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look at this more deeply soon. But right now, I'm seeing that for the original libconsttructor.c file that I sent before, this works:

bin/dmtcp_launch env LD_PRELOAD=$PWD/libconstructor.so ./test/dmtcp1

But this hits a segfault:

gdb -ex 'b constructor' -ex 'r' -ex 'cont' --args bin/dmtcp_launch env LD_PRELOAD=$PWD/libconstructor.so ./test/dmtcp1

The stack shows:

(gdb) where
#0  0x00007ffff7d8d062 in dmtcp::SharedData::setPidMap (virt=40000, real=208860) at shareddata.cpp:548
#1  0x00007ffff7cd6d7f in pid_event_hook (event=DMTCP_EVENT_INIT, data=0x0) at pid/pid.cpp:315
#2  0x00007ffff7d5fb38 in dmtcp::PluginManager::eventHook (event=DMTCP_EVENT_INIT, data=0x0) at pluginmanager.cpp:107
#3  0x00007ffff7d52134 in dmtcp_initialize_entry_point () at dmtcpworker.cpp:254
#4  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffddd8,
      env=env@entry=0x7fffffffdde8) at ./elf/dl-init.c:70 

When I analyze this, I see:

I can track this, by setting a breakpoint:
(gdb) b src/plugin/pid/pid.cpp:315

When I do this, then I reach this breakpoint, and `sharedDataHeader -> numPidMaps` at `shareddata.cpp:548` is 0.
It succeeds.
Then it reaches the breakpoint for `consturctor()` and it succeeds.
Then it again reaches `sharedDataHeader -> numPidMaps` at `shareddata.cpp:548`.  And the value is still 0.  But somehow the integer index `i`, now has the value 6.

I'm using gcc-11 on WSL Ubuntu.  Could it be that some union of types has a bug in it, and we are overwriting the value in the integer `i`, when in GDB?

@karya0 , Could you look at this issue?  Thanks.

@karya0
Copy link
Copy Markdown
Member Author

karya0 commented Apr 8, 2025

@gc00 : Can you try the latest updates and see if you are still seeing this issue? I tried to reproduce it locally, but couldn't.

@gc00
Copy link
Copy Markdown
Contributor

gc00 commented Apr 8, 2025

@karya0 ,
I've now tested with the latest update to PR #1200
Good news! It works now under GDB!!!

(In the previous updated email that I sent to you today, Apr. 8, I had investigated further with the old PR #1200. I recommend that you still briefly read that email also. There was something there I didn't understand: an obviously correct assembly isntruction was doing a segfault. I was guessing that it had to do with us mmap'ing a text segment during dmtcp_launch, but I don't understand how that would happen. Don't spend too much time, but it's worth a quick read, to see if it gives you insights.)

@xuyao0127
Please test MANA with Perlmutter and VASP on the latest DMTCP at PR #1200. Probably your own commits to DMTCP main are already there, but if not, also cherry-pick your latest commits to DMTCP.

Please let us know if DMTCP now properly supports MANA without mods to DMTCP.
Thanks.

@xuyao0127
Copy link
Copy Markdown
Collaborator

@karya0 I tested this MANA with Perlmutter and VASP, and it works now.

Copy link
Copy Markdown
Contributor

@gc00 gc00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this important PR. This will help us in the extreme cases about constructor initialization that we've seen in the past.

I've added some small inline suggested changes to the comments. But otherwise:
LGTM

karya0 added 3 commits April 11, 2025 11:56
Using getpid() in VirtPidTbl initialization causes recursion in
certain cases where a foreign constructor may be called before
DMTCP's initializer kicks in.
This ensures that struct pthread contains a valid virtual tid.
@karya0 karya0 force-pushed the dev/karya0/init_affinity branch from 6da87b1 to 9f9aef3 Compare April 11, 2025 18:58
@karya0 karya0 merged commit 9370f9a into main Apr 11, 2025
1 check passed
@karya0 karya0 deleted the dev/karya0/init_affinity branch April 11, 2025 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants