Skip to content

Conversation

@Asher8118
Copy link
Contributor

There are cases in which getEntryIDForSymbol is called, where the given Symbol is in a constant island, and so BOLT can not find its function. This causes BOLT to reach llvm_unreachable("symbol not found") and crash. This patch adds a check that avoids this crash and gives a warning to the user.

@llvmbot
Copy link
Member

llvmbot commented Sep 22, 2025

@llvm/pr-subscribers-bolt

Author: Asher Dobrescu (Asher8118)

Changes

There are cases in which getEntryIDForSymbol is called, where the given Symbol is in a constant island, and so BOLT can not find its function. This causes BOLT to reach llvm_unreachable("symbol not found") and crash. This patch adds a check that avoids this crash and gives a warning to the user.


Full diff: https://github.com/llvm/llvm-project/pull/160143.diff

2 Files Affected:

  • (modified) bolt/lib/Core/BinaryContext.cpp (+9-1)
  • (added) bolt/test/AArch64/check-symbol-area.s (+49)
diff --git a/bolt/lib/Core/BinaryContext.cpp b/bolt/lib/Core/BinaryContext.cpp
index 72c72bbaf4a65..5f3b3c0e57152 100644
--- a/bolt/lib/Core/BinaryContext.cpp
+++ b/bolt/lib/Core/BinaryContext.cpp
@@ -2374,8 +2374,16 @@ BinaryFunction *BinaryContext::getFunctionForSymbol(const MCSymbol *Symbol,
     return nullptr;
 
   BinaryFunction *BF = BFI->second;
-  if (EntryDesc)
+  if (EntryDesc) {
+    const uint64_t Address = BF->getAddress() + Symbol->getOffset();
+    if (BF->isInConstantIsland(Address)) {
+      this->outs() << "BOLT-WARNING: symbol " << Symbol->getName()
+                   << " is in data region of function 0x"
+                   << Twine::utohexstr(Address) << ".\n";
+      return nullptr;
+    }
     *EntryDesc = BF->getEntryIDForSymbol(Symbol);
+  }
 
   return BF;
 }
diff --git a/bolt/test/AArch64/check-symbol-area.s b/bolt/test/AArch64/check-symbol-area.s
new file mode 100644
index 0000000000000..43ce25e16181f
--- /dev/null
+++ b/bolt/test/AArch64/check-symbol-area.s
@@ -0,0 +1,49 @@
+// This test checks that when looking for a function
+// correspnding to a symbol, BOLT is not looking 
+// through a data area (constant island).
+
+# RUN: llvm-mc -filetype=obj -triple aarch64-unknown-unknown %s -o %t.o
+# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q
+# RUN: llvm-bolt %t.exe -o %t.bolt 2>&1 | FileCheck %s
+
+// Before adding a check for constant islands, BOLT would exit with an error
+// of the form: "symbol not found" and throw an LLVM UNREACHABLE error.
+# CHECK-NOT: symbol not found
+# CHECK-NOT: UNREACHABLE
+
+// Now BOLT throws a warning and does not crash.
+# CHECK: BOLT-WARNING: symbol [[SYM:.*]]  is in data region of function 0x{{.*}}.
+
+.text
+.global main
+main:
+        stp     x14, x15, [sp, -8]!
+        mov     x14, sp
+        adrp    x1, .test
+        add     x0, x1, :lo12:.test
+        bl      first_block
+        ret
+
+.global first_block
+$d:
+first_block:
+        stp     x14, x15, [sp, -8]!
+        mov     x14, sp
+        bl      second_block
+        ret
+second_block:
+        stp     x14, x15, [sp, -8]!
+        mov     x14, sp
+        bl      third_block
+        ret
+$x:
+third_block:
+        stp     x14, x15, [sp, -8]!
+        mov     x14, sp
+        adrp    x1, .data
+        add     x0, x1, :lo12:.test
+        ret
+
+.data
+.test:
+        .string "test"

BinaryFunction *BF = BFI->second;
if (EntryDesc)
if (EntryDesc) {
const uint64_t Address = BF->getAddress() + Symbol->getOffset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. I'm not sure Symbol->getOffset() will always produce the expected output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look. Could you please elaborate on what might be the problem here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, sorry. Symbol->getOffset() is not guaranteed to return the offset from the start of the function at arbitrary point of execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, if Symbol->getOffset() does not return the symbol offset, then it returns 0. In which case const uint64_t Address = BF->getAddress() + Symbol->getOffset(); becomes the same as BF->getAddress(). For the purpose of this patch, I think that is fine. Whether we're checking the function address or the symbol address, the point is to check if we're in a data area (constant island) before proceeding. This check is only reached in getFunctionForSymbol as a way to prevent calling getEntryIDForSymbol, and avoid hitting the llvm_unreachable, if we're in a data area.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I'd prefer to make getEntryIDForSymbol() return an optional (i.e. remove the unreachable) and not rely on the implicit behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll look into making it an optional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now returning an optional. I did keep the Symbol->getOffset() check in order to give a warning to the user, but we are no longer relying on it to avoid thellvm_unreachable.

Copy link
Member

@paschalis-mpeis paschalis-mpeis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Ash,

Thanks for the patch! Trying to understand the cases we hit this.

Is this the case where we fail on a symbol in a code-marked region
that follows a data marked region, like L2 below?

.text
.global foo
$d.foo:
foo:
        nop
$x.foo:
L2:
        ret

Maybe the below is a more realistic example than the above?
Note that replacing BL with a B does not trigger it.

.text
.global foo
foo:
        adrp    x0, .Lp
        add     x0, x0, :lo12:.Lp
        ldr     w0, [x0]
        bl L2
        ret
$d.L1:
.Lp:
        .word   0x123FFF
$x.L2:
L2:
        ret

@Asher8118
Copy link
Contributor Author

Thanks for your comments, @paschalis-mpeis. I've tried to address them.

Trying to understand the cases we hit this. Is this the case where we fail on a symbol in a code-marked region

We hit this when we ask BOLT to read code from a data area. So references to a symbol from a data area (marked with $d) that are called from a code area (marked with $x) will hit this. The two examples you offered hit this issue.

Note that replacing BL with a B does not trigger it.

That's because there's no link to a data area without BL, as with BL we save the return address to the link register.

@maksfb
Copy link
Contributor

maksfb commented Oct 7, 2025

.text
.global foo
foo:
adrp x0, .Lp
add x0, x0, :lo12:.Lp
ldr w0, [x0]
bl L2
ret
$d.L1:
.Lp:
.word 0x123FFF
$x.L2:
L2:
ret

I tried this example with the patch, and I'm getting:

BOLT-WARNING: symbol L2/1 is in data region of function foo(0x10284).

which is incorrect since L2 is not in the data area. Without looking further into it, I suspect it's the consequence of using Symbol->getOffset() to establish constant island and symbol association.

@maksfb
Copy link
Contributor

maksfb commented Oct 7, 2025

Note that replacing BL with a B does not trigger it.

That's because there's no link to a data area without BL, as with BL we save the return address to the link register.

With B we will generate different code eliminating the branch and making Symbol->getOffset() return a different value resulting in a different code path.

@maksfb
Copy link
Contributor

maksfb commented Oct 7, 2025

Hi Ash,

Overall, it's a good change. Besides the elimination of the unreachable/failure, we should also check that we generate correct code.

The warning, as it is, will not be helpful to the user without any further context, i.e. why it is a problem that there's a symbol in the data area of a function.

We should probably warn the user whenever the data is being accessed as code, e.g. via branch/call, and this should be done while building CFG (or with post-CFG analysis). It's possible, we will have to ignore/skip functions containing such patterns.

@Asher8118
Copy link
Contributor Author

The warning, as it is, will not be helpful to the user without any further context, i.e. why it is a problem that there's a symbol in the data area of a function.

Hi Maks, thank you for your comments. I see what you mean, the symbol being in a data area is more of a symptom than the actual underlying problem. Instead we should check when control flow goes into data and skip those functions. I'll have to look into that.

@Asher8118
Copy link
Contributor Author

Asher8118 commented Nov 10, 2025

Hello, I looked some more into this issue and wanted to write my findings. I have tried looking into data being accessed as code at the postCFG route, which wasn't successful. I briefly looked into data being accessed as code at the buildCFG phase, which looks more promising, but I am struggling to find any branches or calls that hit this issue from buildCFG.

I have also tried to find a better way to get the symbol address, hoping to make the warning more useful to the user, but this was unsuccessful.

I've spent some more time looking over this issue with @paschalis-mpeis . We tried to understand in what circumstances we hit this issue. Below are some test cases we have looked over:

Example1.s:

.text
.global main
main:
nop
$d.L1:
.Lp:
bl L2
$x.L2:
L2:
ret

Example2.s:

.text
.global main
main: # text
        add     x0, x1, x1
        bl      main_d1
        ret
$d.main1:
main_d1:
        add     x0, x1, x1
        ret

Example3.s:

.text
.global main
main:# text
        add     x0, x1, x1
        bl      foo_d1
        ret
$d.main:
foo_d1:
        bl      foo_x1
        ret
$x.main:
foo_x1:
        add     x0, x1, x1
        ret

Example4.s:

.text
.global main
main:
$d.main:
        bl      bar2_x1
        ret
$x.bar2x1:
bar2_x1:
        add     x0, x1, x1
        ret

Example5.s:

.text
.global main
main:
$d.main:
        b      bar2_x1
        ret
$x.bar2x1:
bar2_x1:
        add     x0, x1, x1
        ret  

Example6.s:

.text
.global main
main:
$d.main:
        .word 0x1241234
        .word 0x1241234
$x.bar3x1:
bar3_x1:
        add     x0, x1, x1
        ret

Example7.s:

.text
.global main
$d.main:
main:
        bl      bar_x1
        ret
$x.barx1:
bar_x1:
        add     x0, x1, x1
        ret

edge_case1: (As per Paschalis' comment above )

.text
.global main
$d.main:
main:
        nop
$x.main:
L2:
        ret

edge_case2:

.text
.global main
$d.main:
main:
$x.main:
L2:
        ret

From this list, we can break the tests into 2 categories: those that hit the issue, and those that do not hit the issue.
Tests that HIT the issue: examples 3,4,5,6,7 and edge_case1.s.
Examples 1 and 2 as well as edge_case2.s do not hit this issue.

Based on the tests that do hit this issue, what they seem to have in common is that they break (b/bl) from a data area into a code area. The only exception seems to be edge_case1.s, as there is nothing connecting the data area to the code area. This case is a bit of a mystery and I have been unable to ascertain what exactly makes it crash. However, I have only looked at that test through objdump and nm. Perhaps if I were to look into bolt it would make more sense. So we have a few test cases that consistently hit the issue, while others do so for reasons that aren't clear. This problem might be more complex than it initially appeared.

@maksfb
Copy link
Contributor

maksfb commented Nov 13, 2025

Thanks for sharing your thoughts and test cases. #165406 is related to this area.

@Asher8118
Copy link
Contributor Author

Asher8118 commented Dec 1, 2025

Hi Maks, I looked at the PR you shared, I apologise for taking so long to do this and thank you for bringing it to my attention.

It looks like validateInternalBranch() is very useful for some of the cases that hit the issue in this PR. Adding this:

@@ -2376,10 +2379,9 @@ BinaryFunction *BinaryContext::getFunctionForSymbol(const MCSymbol *Symbol,
   BinaryFunction *BF = BFI->second;
   if (EntryDesc) {
     const uint64_t Address = BF->getAddress() + Symbol->getOffset();
-    if (BF->isInConstantIsland(Address)) {
-      this->outs() << "BOLT-WARNING: symbol " << Symbol->getName()
-                   << " is in data region of function " << BF->getOneName()
-                   << "(0x" << Twine::utohexstr(BF->getAddress()) << ").\n";
+   if (!BF->validateInternalBranch()){
+      BF->setIgnored();
+      return nullptr;
     }
     *EntryDesc = BF->getEntryIDForSymbol(Symbol).value_or(0);
   }

and removing the llvm_unreachable in MCInst *BinaryFunction::getInstructionAtOffset(uint64_t Offset) fixes the main cases in this patch(eg: the test in this PR, Example3.s). Some of the test cases I mention above still fail with a different unreachable(invalid CFG state to use getInstructionAtOffset()), but that may be a different issue altogether.

@paschalis-mpeis
Copy link
Member

In case we have failures outside the scope of this PR (and #165406), should we open an issue listing all test cases, indicating which PRs solves what, so we can track this wider effort?

@Asher8118
Copy link
Contributor Author

In case we have failures outside the scope of this PR (and #165406), should we open an issue listing all test cases, indicating which PRs solves what, so we can track this wider effort?

That sounds reasonable to me. I would wait until this PR (and #165406) are merged. Last I checked #165406 was close to getting merged. I will make the appropriate changes to this PR once that happens and we can check what cases are left then to open an issue.

Ash Dobrescu added 4 commits December 11, 2025 11:20
There are cases in which `getEntryIDForSymbol` is called,
where the given Symbol is in a constant island, and so BOLT
can not find its function. This causes BOLT to reach
`llvm_unreachable("symbol not found")` and crash. This patch adds
a check that avoid this crash and gives a warning to the user.
@Asher8118
Copy link
Contributor Author

Hi Maks, I looked at the PR you shared, I apologise for taking so long to do this and thank you for bringing it to my attention.

It looks like validateInternalBranch() is very useful for some of the cases that hit the issue in this PR. Adding this:

@@ -2376,10 +2379,9 @@ BinaryFunction *BinaryContext::getFunctionForSymbol(const MCSymbol *Symbol,
   BinaryFunction *BF = BFI->second;
   if (EntryDesc) {
     const uint64_t Address = BF->getAddress() + Symbol->getOffset();
-    if (BF->isInConstantIsland(Address)) {
-      this->outs() << "BOLT-WARNING: symbol " << Symbol->getName()
-                   << " is in data region of function " << BF->getOneName()
-                   << "(0x" << Twine::utohexstr(BF->getAddress()) << ").\n";
+   if (!BF->validateInternalBranch()){
+      BF->setIgnored();
+      return nullptr;
     }
     *EntryDesc = BF->getEntryIDForSymbol(Symbol).value_or(0);
   }

and removing the llvm_unreachable in MCInst *BinaryFunction::getInstructionAtOffset(uint64_t Offset) fixes the main cases in this patch(eg: the test in this PR, Example3.s). Some of the test cases I mention above still fail with a different unreachable(invalid CFG state to use getInstructionAtOffset()), but that may be a different issue altogether.

I've added this change locally and took a look at the tests affected:

Failed Tests (11):
  BOLT :: AArch64/check-symbol-area.s
  BOLT :: AArch64/double_jump.cpp
  BOLT :: AArch64/pacret-cfi-reorder.s
  BOLT :: X86/bolt-address-translation-yaml.test
  BOLT :: X86/pre-aggregated-perf.test
  BOLT :: eh-frame-hdr.test
  BOLT :: eh-frame-overwrite.test
  BOLT :: runtime/AArch64/pacret-eh-function-split.cpp
  BOLT :: runtime/exceptions-instrumentation.test
  BOLT :: runtime/user-func-reorder.c
  BOLT :: safe-icf-relative-vtable.cpp

Unfortunately it seems like this change would ignore too many functions, eg: foo in pacret-cfi-reorder.s.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants