Skip to content

Try fix flaky tests with transactions#37822

Merged
tavplubix merged 14 commits intomasterfrom
fix_flaky_tests_with_transactions
Jun 21, 2022
Merged

Try fix flaky tests with transactions#37822
tavplubix merged 14 commits intomasterfrom
fix_flaky_tests_with_transactions

Conversation

@tavplubix
Copy link
Copy Markdown
Member

@tavplubix tavplubix commented Jun 3, 2022

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Previous PR #36644 fixed tests, but it triggers a bug in bash...
Let's try to simply send SIGTERM without any tricks
Also fixes #37920

@tavplubix tavplubix mentioned this pull request Jun 3, 2022
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 3, 2022
@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Jun 3, 2022

update

✅ Branch has been successfully updated

@tavplubix tavplubix force-pushed the fix_flaky_tests_with_transactions branch from 4868b96 to 1d5f43a Compare June 15, 2022 14:08
@tavplubix tavplubix force-pushed the fix_flaky_tests_with_transactions branch from 1d5f43a to e4d4d0a Compare June 15, 2022 14:21
@tavplubix
Copy link
Copy Markdown
Member Author

Tests with tsan - timeouts - broken after upgrading to llvm-14
Stateless tests flaky check (address, actions) - OK
Stress tests - #35156, suppressed in 299f102

@tavplubix
Copy link
Copy Markdown
Member Author

Integration tests - No space left on device
Stateless tests with tsan are broken
Stateless tests flaky check - OK

But let's run tests again just in case

@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Jun 20, 2022

update

✅ Branch has been successfully updated

@tavplubix
Copy link
Copy Markdown
Member Author

Integration tests (asan, actions) [1/3] - No space left on device
Integration tests (release, actions) [2/2] - no space left on device
Integration tests (thread, actions) [4/4] - no space left on device

Stateless tests (aarch64, actions) - 02232_dist_insert_send_logs_level_hung - #38219
Stateless tests (address, actions) [1/2] - 02232_dist_insert_send_logs_level_hung
Stateless tests (debug, actions) [3/3] - 02232_dist_insert_send_logs_level_hung
Stateless tests (memory, actions) [3/3] - 02232_dist_insert_send_logs_level_hung
Stateless tests (release, DatabaseReplicated, actions) [1/2] - 02232_dist_insert_send_logs_level_hung
Stateless tests (release, actions) - 02232_dist_insert_send_logs_level_hung and 01281_group_by_limit_memory_tracking
Stateless tests (release, s3 storage, actions) - 02232_dist_insert_send_logs_level_hung
Stateless tests (release, wide parts enabled, actions) - 02232_dist_insert_send_logs_level_hung
Stateless tests (thread, actions) [3/3] - 02232_dist_insert_send_logs_level_hung
Stateless tests (ubsan, actions) - 02232_dist_insert_send_logs_level_hung

Stateless tests flaky check (address, actions) - OK

Stress test (debug, actions) and Stress test (thread, actions):

2022-06-20 16:54:46,552 Checking if some queries hung
usage: clickhouse-test [-h] [-q QUERIES] [--tmp TMP] [-b BINARY] [-c CLIENT]
...
clickhouse-test: error: unrecognized arguments: --stress

Some inconsistency in docker images (race condition in CI, cc: @Felixoid)

@tavplubix
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Jun 20, 2022

update

✅ Branch has been successfully updated

@tavplubix tavplubix merged commit 75b8800 into master Jun 21, 2022
@tavplubix tavplubix deleted the fix_flaky_tests_with_transactions branch June 21, 2022 06:38
assert((expected) == uncaught || (expected) + 1 == uncaught); \
if ((expected) < uncaught) \
{ \
tryLogCurrentException("NOEXCEPT_SCOPE"); \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work, you cannot rethrow exception outside the catch block, this will simply terminate (via std::terminate) the program. So this NOEXCEPT_SCOPE macro will simply call std::terminate on exception and will lost original exception.

I was thinking about this a little, and the only thing that pops up in my mind is setting some flag here and check it in Exception ctor and if it set, call abort/std::terminate, that way you will not loose it, however it looks not great.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it does not work as expected and seems like it's there's no way to get exception that is currently unwinding stack (even using __cxa_* functions or something from libunwind...). Not sure what options is better: just remove this like or add thread-local flag and check it when throwing an exception. Maybe it's ok to simply remove it for now and think about logging if we will face some issue that we will not be able to debug without information about unexpected exception...

Copy link
Copy Markdown
Member

@azat azat Jul 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have another idea on how NOEXCEPT_SCOPE can be implemented, by accepting the whole code that should be executed inside, like SCOPE_EXIT, and this will also make it more explicit then simply defining in some scope.

What do you think about this? In the meantime I will prepare a patch and see how it will looks like.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an RFC - #39229

}
}

NOEXCEPT_SCOPE;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain why it is required? And looks like it is better to add a comment for users of NOEXCEPT_SCOPE, thoughts?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment:

/// It can be used in critical places to exit on unexpected exceptions.
/// SIGABRT is usually better that broken in-memory state with unpredictable consequences.
/// It also temporarily disables exception from memory tracker in current thread.
/// Strict version does not take into account nested exception (i.e. it aborts even when we're in catch block).

It's required in MergeTreeData::Transaction::commit because changes to precommitted parts and MergeTreeData state (like set of parts with their states, indexes and column sizes) are not undone on exceptions. Normally this code should not throw any exceptions (unless maybe MEMORY_LIMIT_EXCEEDED, but exceptions from MemoryTracker are blocked), but in case it does (due to some bug or whatever) I would prefer to terminate immediately.

You can think of this macro as kind of paranoid defensive programming :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I read that comment, and also came to conclusion that it is for counters accounting.

However I wasn't came to this PR randomly, I got some exception here, but due to NOEXCEPT_SCOPE the exception was not printed so I cannot know the exact reason, and zero information in logs. I've tried to get something from coredump but failed.

I've added a patch on top my build that will not loose the exception, however no exceptions occurred so far. I have one guess about exception there, that time zookeeper was under pressure, but this has been fixed before deploying the patch, and I'm not talking about some exception for zookeeper commands but instead some issues with parts maybe (some LOGICAL_ERROR).


NOEXCEPT_SCOPE;
LockMemoryExceptionInThread lock_memory_tracker(VariableContext::Global);
NOEXCEPT_SCOPE_STRICT;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the difference between them, but can you explain why strict variant is used here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because initially we had only strict variant and it was the only place where this macro was used. But strict variant is not generic enough to use it in other places. I decided to leave strict variant here because something is probably going wrong if some code tries to initialize transaction log during stack unwinding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logical error: 'Cannot unlock removal_tid, it's a bug'

3 participants