Skip to content

Stricter checks when attaching/detaching threads to groups#75474

Merged
nikitamikhaylov merged 11 commits intomasterfrom
tach
Feb 18, 2025
Merged

Stricter checks when attaching/detaching threads to groups#75474
nikitamikhaylov merged 11 commits intomasterfrom
tach

Conversation

@al13n321
Copy link
Copy Markdown
Member

@al13n321 al13n321 commented Feb 4, 2025

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Closes #59114

In https://s3.amazonaws.com/clickhouse-test-reports/74014/dcdccd54f13e74c5018f6d40e5a4caa77858dc22/stress_test__debug_.html CurrentThread::getGroup() returned nullptr when called from TCPHandler::processOrdinaryQuery. This means that calls to CurrentThread::attachToGroup/detachFromGroupIfNotDetached are not properly nested: someone detached a thread that they didn't attach. (The thread is expected to be attached to a group because the caller of processOrdinaryQuery holds a QueryScope for the duration of the call.)

Almost all calls to CurrentThread::attachToGroupIfDetached/detachFromGroupIfNotDetached follow this pattern that looks incorrect:

    auto thread_func = [thread_group = CurrentThread::getGroup()]
    {
        SCOPE_EXIT_SAFE(
            if (thread_group)
                CurrentThread::detachFromGroupIfNotDetached();
        );

        if (thread_group)
            CurrentThread::attachToGroupIfDetached(thread_group);

        ...
    };

If the thread is already attached to a group, it'll end up detached after this function returns. Normally such thread_func is immediately scheduled on a thread pool; then the thread shouldn't be already attached to a group (... or so I'm assuming; there were no asserts about this, so CI on this PR may prove this analysis wrong). But some code sites sometimes call such function directly, e.g. Aggregator::prepareBlocksAndFillTwoLevelImpl. AFAICT, this would incorrectly detach the thread from the thread group. I'm guessing this is what caused these test failures (but I didn't try to reproduce this and don't have an explanation for why the crash happens so rarely.)

This PR replaces this pattern with a more correct one, with a bunch of asserts, wrapped in a RAII thing.

Most changes are mechanical. Most substantial ones are in ThreadStatus.h, ThreadStatusExt.cpp, and ProfileEventsExt.cpp.

@robot-ch-test-poll robot-ch-test-poll added the pr-not-for-changelog This PR should not be mentioned in the changelog label Feb 4, 2025
@robot-clickhouse
Copy link
Copy Markdown
Member

robot-clickhouse commented Feb 4, 2025

This is an automated comment for commit f0eae2c with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests❌ failure
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Successful checks
Check nameDescriptionStatus
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success

@antonio2368 antonio2368 self-assigned this Feb 4, 2025
@al13n321
Copy link
Copy Markdown
Member Author

al13n321 commented Feb 4, 2025

(The new checks fail in CI, investigating.)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Feb 8, 2025

Workflow [PR], commit [4544010]

Copy link
Copy Markdown
Member

@azat azat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

But, AFAICS it should not affect incorrect tracking of total memory tracker?

@al13n321
Copy link
Copy Markdown
Member Author

But, AFAICS it should not affect incorrect tracking of total memory tracker?

The check that current_thread is not nullptr seems relevant. The only site it caught in CI was PolygonDictionaryUtils.h (it used GlobalThreadPool directly), but at least one leaky instance didn't use polygon dictionary (if I checked correctly), so it can't be the whole story.

@azat
Copy link
Copy Markdown
Member

azat commented Feb 12, 2025

but at least one leaky instance didn't use polygon dictionary (if I checked correctly), so it can't be the whole story.

I mean that even if the current_thread was wrong, it should fallback to total_memory_tracker and account total memory correctly

@al13n321
Copy link
Copy Markdown
Member Author

Ah, you're right, I didn't notice the fallback in getMemoryTracker(), so this is fully unrelated to the memory tracking problem we're seeing.

@al13n321
Copy link
Copy Markdown
Member Author

Flaky tests: #75961 , #74703 , #75520

@nikitamikhaylov nikitamikhaylov added this pull request to the merge queue Feb 18, 2025
Merged via the queue into master with commit 3671cc3 Feb 18, 2025
117 of 118 checks passed
@nikitamikhaylov nikitamikhaylov deleted the tach branch February 18, 2025 14:52
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 18, 2025
baibaichen added a commit to Kyligence/gluten that referenced this pull request Feb 19, 2025
baibaichen added a commit to Kyligence/gluten that referenced this pull request Feb 20, 2025
baibaichen added a commit to apache/gluten that referenced this pull request Feb 20, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250220)

* Fix build due to ClickHouse/ClickHouse#75474

* Fix build due to ClickHouse/ClickHouse#75452

* Rework "gluten_cache.local.enabled" config due to ClickHouse/ClickHouse#75452

Now, we still use "gluten_cache.local.enabled" in spark, and we will convert it to "enable.gluten_cache.local" when initlize native backend.

* Fix memory access error exposed by ClickHouse/ClickHouse#75452

* Fix  "<Error>... Possibly incorrect column size subtraction" due to  ClickHouse/ClickHouse#75938

* Unified temporary directory

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang Chen <[email protected]>
kevinw66 pushed a commit to kevinw66/incubator-gluten that referenced this pull request Mar 3, 2025
)

* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250220)

* Fix build due to ClickHouse/ClickHouse#75474

* Fix build due to ClickHouse/ClickHouse#75452

* Rework "gluten_cache.local.enabled" config due to ClickHouse/ClickHouse#75452

Now, we still use "gluten_cache.local.enabled" in spark, and we will convert it to "enable.gluten_cache.local" when initlize native backend.

* Fix memory access error exposed by ClickHouse/ClickHouse#75452

* Fix  "<Error>... Possibly incorrect column size subtraction" due to  ClickHouse/ClickHouse#75938

* Unified temporary directory

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang Chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Segmentation fault in ProfileEvents::getProfileEvents

7 participants