
Add support for DWARF-5 (resubmit, in attempt to get core dump) #40772

Closed
azat wants to merge 12 commits into ClickHouse:master from azat:DWARF-5-v2

Conversation

@azat
Member

@azat azat commented Aug 29, 2022

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add support for DWARF-5

This is a resubmit of #40710, since I cannot reproduce the SIGTRAP issue locally.

azat added 3 commits August 29, 2022 20:30
Signed-off-by: Azat Khuzhin <[email protected]>
(cherry picked from commit 444acb9)
I have to do this, since I have libraries compiled with DWARF5 (i.e.
glibc).

ClickHouse changes:
- use camel_case
- add NOLINT
- avoid using folly:: (use std:: instead)
- avoid using boost:: (use std:: instead)

Refs: facebook/folly@490b287
Signed-off-by: Azat Khuzhin <[email protected]>
(cherry picked from commit ee5696b)
(cherry picked from commit e03870b)
@azat azat changed the title Add support for DWARF-5 Add support for DWARF-5 (resubmit) Aug 29, 2022
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 29, 2022
azat added 3 commits August 30, 2022 12:37
gcore is a gdb command, that internally uses gdb to dump the core.

However, with proper configuration of limits (core_dump.size_limit) it
should not be required, although some issues are possible:
- non standard kernel.core_pattern
- sanitizers

So yes, gcore is more "universal", but it is ad hoc; let's try to switch
to a more native way.

Signed-off-by: Azat Khuzhin <[email protected]>
@azat azat changed the title Add support for DWARF-5 (resubmit) Add support for DWARF-5 (resubmit, in attempt to get core dump) Sep 4, 2022
@azat
Member Author

azat commented Sep 4, 2022

Actually even gdb does not work reliably with DWARF-5:

https://s3.amazonaws.com/clickhouse-test-reports/40772/29169a4bbe318cd08df370ad2405fd86fd6c2da6/stress_test__tsan_.html

And that's why there are no core files:

2022-09-04 19:14:43,695 Finished 0 from 16 processes
script.gdb:14: Error in sourced command file:
Dwarf Error: DW_FORM_strx1 found in non-DWO CU [in module /usr/bin/clickhouse]
2022-09-04 19:14:48,699 Finished 0 from 16 processes
2022-09-04 19:14:46 [Inferior 1 (process 673) detached]
/var/log/clickhouse-server/clickhouse-server.err.log:2022.08.31 22:55:42.371804 [ 7076 ] {} <Fatal> BaseDaemon: ########################################
/var/log/clickhouse-server/clickhouse-server.err.log:2022.08.31 22:55:42.381703 [ 7076 ] {} <Fatal> BaseDaemon: (version 22.9.1.1, build id: A08A087EAA3D1C9D796DFC0B5DA3CA64F4D74EFC) (from thread 3437) (query_id: 7bd3794b-cd0e-4764-9594-a8c04795f6b1) (query: select * from file('test_02402/overflow.capnp', 'CapnProto', 'val1 char') settings format_schema='nonexist:Message') Received signal Trace/breakpoint trap (5)
/var/log/clickhouse-server/clickhouse-server.err.log:2022.08.31 22:55:42.474923 [ 7076 ] {} <Fatal> BaseDaemon: 
/var/log/clickhouse-server/clickhouse-server.err.log:2022.08.31 22:55:42.476538 [ 7076 ] {} <Fatal> BaseDaemon: Stack trace: 0xd9b6046
/var/log/clickhouse-server/clickhouse-server.err.log:2022.08.31 22:56:03.093649 [ 739 ] {} <Fatal> Application: Child process was terminated by signal 5.
/var/log/clickhouse-server/clickhouse-server.stress.log:2022.08.31 22:55:42.371804 [ 7076 ] {} <Fatal> BaseDaemon: ########################################
/var/log/clickhouse-server/clickhouse-server.stress.log:2022.08.31 22:55:42.381703 [ 7076 ] {} <Fatal> BaseDaemon: (version 22.9.1.1, build id: A08A087EAA3D1C9D796DFC0B5DA3CA64F4D74EFC) (from thread 3437) (query_id: 7bd3794b-cd0e-4764-9594-a8c04795f6b1) (query: select * from file('test_02402/overflow.capnp', 'CapnProto', 'val1 char') settings format_schema='nonexist:Message') Received signal Trace/breakpoint trap (5)
/var/log/clickhouse-server/clickhouse-server.stress.log:2022.08.31 22:55:42.474923 [ 7076 ] {} <Fatal> BaseDaemon: 
/var/log/clickhouse-server/clickhouse-server.stress.log:2022.08.31 22:55:42.476538 [ 7076 ] {} <Fatal> BaseDaemon: Stack trace: 0xd9b6046
/var/log/clickhouse-server/clickhouse-server.stress.log:2022.08.31 22:56:03.093649 [ 739 ] {} <Fatal> Application: Child process was terminated by signal 5.
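
The fatal entries above carry the signal number in parentheses. As a minimal sketch (assuming Linux signal numbering), the number can be pulled out of such a line and mapped to its symbolic name. `fatal_signal` is a hypothetical helper, and the regular expression is an assumption based only on the lines quoted in this comment:

```python
import re
import signal

# Assumed message shape, inferred from the log lines quoted above.
_SIGNAL_RE = re.compile(r"Received signal (?P<desc>.+?) \((?P<num>\d+)\)")


def fatal_signal(line: str):
    """Return (number, symbolic name) for a 'Received signal' line, else None."""
    match = _SIGNAL_RE.search(line)
    if match is None:
        return None
    number = int(match.group("num"))
    # signal.Signals maps the number to the platform's name for it.
    return number, signal.Signals(number).name


# Shortened copy of one of the <Fatal> lines (query text omitted).
line = (
    "2022.08.31 22:55:42.381703 [ 7076 ] {} <Fatal> BaseDaemon: "
    "Received signal Trace/breakpoint trap (5)"
)
print(fatal_signal(line))
```

On Linux, signal 5 is SIGTRAP, which matches the "Trace/breakpoint trap" description in the log.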

azat added 4 commits September 4, 2022 14:41
Merge this patch to preserve coredump w/o gdb, since that version of gdb
does not work with DWARF 5

* ci/core-dumps-rework:
  Rework core collecting on CI (eliminate gcore usage)
azat added a commit to azat/ClickHouse that referenced this pull request Sep 4, 2022
gcore is a gdb command, that internally uses gdb to dump the core.

However, with proper configuration of limits (core_dump.size_limit) it
should not be required, although some issues are possible:
- non standard kernel.core_pattern
- sanitizers

So yes, gcore is more "universal" (you don't need to configure any
`kernel.core_pattern`), but it is ad hoc, and it has a drawback:
**it does not work when gdb fails**. For example, gdb may fail with
`Dwarf Error: DW_FORM_strx1 found in non-DWO CU` in case of DWARF-5 [1].

  [1]: ClickHouse#40772 (comment).

Let's try to switch to a more native way.

Signed-off-by: Azat Khuzhin <[email protected]>
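
For the native path, the kernel writes a core file only when the process's core size limit is non-zero, and where the dump goes depends on `kernel.core_pattern`. A rough pre-flight check could look like the sketch below, assuming Linux. `describe_core_pattern` and `core_dumps_enabled` are hypothetical helper names for illustration, not part of the ClickHouse CI scripts:

```python
import resource
from pathlib import Path


def describe_core_pattern(pattern: str) -> str:
    """Classify a kernel.core_pattern value."""
    pattern = pattern.strip()
    if not pattern:
        return "empty"
    if pattern.startswith("|"):
        # A leading pipe hands the dump to a user-space helper program.
        return "piped"
    if pattern.startswith("/"):
        return "absolute"
    # Otherwise the file is created relative to the crashing process's cwd.
    return "relative"


def core_dumps_enabled() -> bool:
    """True when the soft RLIMIT_CORE of this process is non-zero."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft != 0


proc_file = Path("/proc/sys/kernel/core_pattern")
if proc_file.exists():
    print("core_pattern:", describe_core_pattern(proc_file.read_text()))
print("core dumps enabled:", core_dumps_enabled())
```

A "piped" result is the non-standard case mentioned in the commit message: the core never lands as a plain file unless the helper writes one.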
@azat azat mentioned this pull request Sep 5, 2022
@azat
Member Author

azat commented Sep 6, 2022

Core dumps did not help either, but:

# llvm-dwarfdump-14 --verify clickhouse |& fgrep error: | head
error: DIE has overlapping ranges in DW_AT_ranges attribute: [0x0000000000000000, 0x0000000000000013) and [0x0000000000000000, 0x0000000000000064)
error: DIE has overlapping ranges in DW_AT_ranges attribute: [0x0000000000000000, 0x0000000000000064) and [0x0000000000000000, 0x0000000000000034)
error: DIE has overlapping ranges in DW_AT_ranges attribute: [0x0000000000000000, 0x0000000000000064) and [0x0000000000000000, 0x0000000000000505)
error: DIE has overlapping ranges in DW_AT_ranges attribute: [0x0000000000000000, 0x0000000000000505) and [0x0000000000000000, 0x0000000000000612)
error: DIE address ranges are not contained in its parent's ranges:
error: DIE address ranges are not contained in its parent's ranges:
error: DIE address ranges are not contained in its parent's ranges:
error: DIEs have overlapping address ranges:
error: DIEs have overlapping address ranges:
error: DIE address ranges are not contained in its parent's ranges:
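
Since `head` only shows the first ten lines, it can help to group the verifier output by kind of error. The following is a sketch only: `summarize_errors` is a hypothetical helper, and the parsing assumes nothing beyond the `error:` lines shown above.

```python
import re
from collections import Counter

# Address ranges differ per DIE; mask them so equal kinds of error collapse.
_RANGE_RE = re.compile(r"\[0x[0-9a-fA-F]+, 0x[0-9a-fA-F]+\)")


def summarize_errors(lines):
    """Count 'error:' lines by message, ignoring the concrete address ranges."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("error:"):
            continue
        message = line[len("error:"):].strip().rstrip(":")
        counts[_RANGE_RE.sub("<range>", message)] += 1
    return counts


# A few lines copied from the output above.
sample = [
    "error: DIE has overlapping ranges in DW_AT_ranges attribute: "
    "[0x0000000000000000, 0x0000000000000013) and "
    "[0x0000000000000000, 0x0000000000000064)",
    "error: DIE has overlapping ranges in DW_AT_ranges attribute: "
    "[0x0000000000000000, 0x0000000000000064) and "
    "[0x0000000000000000, 0x0000000000000034)",
    "error: DIE address ranges are not contained in its parent's ranges:",
    "error: DIEs have overlapping address ranges:",
]
for message, count in summarize_errors(sample).most_common():
    print(count, message)
```

In practice the input would be the full output of `llvm-dwarfdump-14 --verify`, read from a file or a pipe.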

@robot-ch-test-poll robot-ch-test-poll added the submodule changed At least one submodule changed in this PR. label Sep 11, 2022
@azat azat mentioned this pull request Sep 11, 2022
azat added a commit to azat/ClickHouse that referenced this pull request Sep 11, 2022
ClickHouse changes to the folly parser:
- use camel_case
- add NOLINT
- avoid using folly:: (use std:: instead)
- avoid using boost:: (use std:: instead)

But note that it is now not enabled by default (as it was
initially), because you may need a recent debugger to support DWARF-5
correctly; to keep debugging easy, let's enable it later.

A good example is gdb 10: even though it looks like it should support
DWARF-5, it still produces some errors, like here [1]:

    Dwarf Error: DW_FORM_strx1 found in non-DWO CU [in module /usr/bin/clickhouse]

  [1]: ClickHouse#40772 (comment)

And not only does it complain, apparently it can also "activate" SDT
probes (replace "nop" with "int3"), and I believe this is what happens
here [2].

  [2]: ClickHouse#41063 (comment)

Here is the int3 in the case where ClickHouse received SIGTRAP:

<details>

```
    0x7f494705e093 <+1139>: jne    0x7f494705e450            ; <+2096> [inlined] update_tls_slotinfo at dl-open.c:732
    0x7f494705e099 <+1145>: testl  %r13d, %r13d
    0x7f494705e09c <+1148>: je     0x7f494705e09f            ; <+1151> at dl-open.c:744:6
    0x7f494705e09e <+1150>: int3
->  0x7f494705e09f <+1151>: movl   -0x54(%rbp), %eax
    0x7f494705e0a2 <+1154>: testl  %eax, %eax
    0x7f494705e0a4 <+1156>: jne    0x7f494705e410            ; <+2032> at dl-open.c:745:5
```

But if I repeat the query it does not:

```
    0x7ffff7fe5093 <+1139>: jne    0x7ffff7fe5450            ; <+2096> [inlined] update_tls_slotinfo at dl-open.c:732
    0x7ffff7fe5099 <+1145>: testl  %r13d, %r13d
    0x7ffff7fe509c <+1148>: je     0x7ffff7fe509f            ; <+1151> at dl-open.c:744:6
    0x7ffff7fe509e <+1150>: nop
->  0x7ffff7fe509f <+1151>: movl   -0x54(%rbp), %eax
    0x7ffff7fe50a2 <+1154>: testl  %eax, %eax
    0x7ffff7fe50a4 <+1156>: jne    0x7ffff7fe5410            ; <+2032> at dl-open.c:745:5
```

</details>
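
The only difference between the two listings is the single byte at `<+1150>`: on x86, `nop` is encoded as `0x90` and `int3` as `0xCC`. A small sketch of how such a patched byte could be spotted by comparing the on-disk code with what is in memory is shown below. `armed_probe_offsets` is a hypothetical helper, and the example bytes are hand-assembled from the listing, so treat them as illustrative rather than as dumped data:

```python
NOP = 0x90   # single-byte x86 no-op
INT3 = 0xCC  # single-byte x86 breakpoint; executing it raises SIGTRAP


def armed_probe_offsets(on_disk: bytes, in_memory: bytes):
    """Offsets where a nop on disk shows up as int3 in memory.

    This is a naive byte-wise comparison; it does not decode instructions.
    """
    if len(on_disk) != len(in_memory):
        raise ValueError("byte ranges must have equal length")
    return [
        offset
        for offset, (before, after) in enumerate(zip(on_disk, in_memory))
        if before == NOP and after == INT3
    ]


# Assumed encoding of: testl %r13d,%r13d; je +1; nop; movl -0x54(%rbp),%eax
on_disk = bytes.fromhex("4585ed7401908b45ac")
in_memory = bytes.fromhex("4585ed7401cc8b45ac")
print(armed_probe_offsets(on_disk, in_memory))
```

An offset reported here corresponds to a probe site that something (a debugger or tracer) has armed, which is consistent with the SIGTRAP seen in the first listing.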

Test command was:

    clickhouse local --stacktrace -q "select * from file('data.capnp', 'CapnProto', 'val1 char') settings format_schema='nonexist:Message'"

*P.S. I did this, because I have libraries compiled with DWARF5 (i.e. glibc), and dwarf parser simply fails on my dev env.*

Refs: facebook/folly@490b287
(cherry picked from commit ee5696b)
(cherry picked from commit e03870b)
Signed-off-by: Azat Khuzhin <[email protected]>
@azat
Member Author

azat commented Sep 11, 2022

The last attempt did not help. Anyway, I've submitted a patch that only adds DWARF-5 support to the parser, without enabling it: #41193

@azat azat closed this Sep 11, 2022
@azat azat deleted the DWARF-5-v2 branch September 11, 2022 19:19