Skip to content

less sys calls #2: make vdso work again#27492

Merged
alexey-milovidov merged 5 commits intoClickHouse:masterfrom
filimonov:less_clock_gettime2
Aug 13, 2021
Merged

less sys calls #2: make vdso work again#27492
alexey-milovidov merged 5 commits intoClickHouse:masterfrom
filimonov:less_clock_gettime2

Conversation

@filimonov
Copy link
Copy Markdown
Contributor

@filimonov filimonov commented Aug 9, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Less number of clock_gettime syscalls that may lead to performance improvement for some types of fast queries.

Detailed description / Documentation draft:
cgt_init has used getauxval BEFORE __auxv_init was fired, and because of that vdso was not working for clock_gettime (and maybe some other API).

Again have very significant improvement locally, but let's see what performance tests will say.

laptop-5591 :) SELECT count() FROM zeros(100000000000);

SELECT count()
FROM zeros(100000000000)

Query id: 7d1018a8-50ad-4659-9f97-9fe8248d0248

Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 21.9.1 revision 54449.

┌──────count()─┐
│ 100000000000 │
└──────────────┘

1 rows in set. Elapsed: 4.106 sec. Processed 100.00 billion rows, 100.00 GB (24.35 billion rows/s., 24.35 GB/s.)

laptop-5591 :) SELECT count() FROM zeros(100000000000);

SELECT count()
FROM zeros(100000000000)

Query id: 1f4f29db-f05f-44bc-b0c5-5446e7f12ee0

Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 21.9.1 revision 54449.

┌──────count()─┐
│ 100000000000 │
└──────────────┘

1 rows in set. Elapsed: 2.448 sec. Processed 100.00 billion rows, 100.00 GB (40.86 billion rows/s., 40.86 GB/s.)

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Aug 9, 2021
@alexey-milovidov alexey-milovidov marked this pull request as draft August 9, 2021 21:54
@robot-clickhouse robot-clickhouse added pr-improvement Pull request with some product improvements pr-performance Pull request with some performance improvements and removed pr-bugfix Pull request with bugfix, not backported by default pr-improvement Pull request with some product improvements labels Aug 9, 2021
@vitlibar
Copy link
Copy Markdown
Member

2021-08-09 19:38:25 ==24256==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000010333c8 bp 0x000000000000 sp 0x7fff01868190 T0)
2021-08-09 19:38:25 ==24256==The signal is caused by a READ memory access.
2021-08-09 19:38:25 ==24256==Hint: address points to the zero page.
2021-08-09 19:38:25     #0 0x10333c8 in __auxv_init obj-x86_64-linux-gnu/../base/glibc-compatibility/musl/getauxval.c:24:5
2021-08-09 19:38:25     #1 0x10333c8 in getauxval obj-x86_64-linux-gnu/../base/glibc-compatibility/musl/getauxval.c:37:9
2021-08-09 19:38:25     #2 0x5654cb in __lsan::InitializePlatformSpecificModules() (/build/obj-x86_64-linux-gnu/contrib/llvm/llvm/bin/llvm-tblgen+0x5654cb)
2021-08-09 19:38:25     #3 0x549f77 in __asan::AsanInitInternal() (/build/obj-x86_64-linux-gnu/contrib/llvm/llvm/bin/llvm-tblgen+0x549f77)
2021-08-09 19:38:25     #4 0x7ff9248a2ce5  (/lib64/ld-linux-x86-64.so.2+0x11ce5)

@vitlibar vitlibar self-assigned this Aug 10, 2021
@alexey-milovidov alexey-milovidov marked this pull request as ready for review August 10, 2021 23:35
@filimonov filimonov force-pushed the less_clock_gettime2 branch 2 times, most recently from 03f18a7 to 5584fe5 Compare August 11, 2021 12:46
@filimonov
Copy link
Copy Markdown
Contributor Author

wondering why performance tests don't show visible diff. It's quite a big improvement in my local env.

@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Aug 11, 2021

Always on profiling in production (aggregated stack traces from 360-node cluster) does not show any significant time spent in clock_gettime. Maybe it affects only one synthetic query.

@alexey-milovidov
Copy link
Copy Markdown
Member

Make vDSO kernel mechanism work again to reduce the number of syscalls and to improve performance.

This is not a good changelog entry for users.

@filimonov
Copy link
Copy Markdown
Contributor Author

Always on profiling in production (aggregated stack traces from 360-node cluster) does not show any significant time spent in clock_gettime.

It also may be something system-specific (kernel version / hw / some settings / etc.)

Maybe it affects only one synthetic query.

Yes, I expect that it should affect queries when nr of blocks is very high, and each block is processed very fast. So mostly every select from zeros/numbers etc. And we have a lot of those in perf tests.

@filimonov
Copy link
Copy Markdown
Contributor Author

Always on profiling in production

BTW: system.trace_log will tend to catch long-running functions. I.e. shooting a very 'fast target' like system call is much harder than shooting the slow one. We have a very fast system call with almost zero cost, but we have a very high number of those calls.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like a good solution, __attribute__((constructor(101))) is definitely going to be initialized before just __attribute__((constructor)), but there are still no guaranties that any of __attribute__((constructor*)) will be called before the first call of getauxval(). So I believe we need lazy initialization here.

@vitlibar
Copy link
Copy Markdown
Member

Lazy initialization in C is often implemented via replacing a pointer to function. I mean we prepare two functions looking like getauxval() - the first with initialization, and the second without. Initially the first function is called, and it replaces the pointer to function to make sure other times the second function will be called. Something like this:

unsigned long __getauxval(unsigned long type)
{
    if (type == AT_SECURE)
        return __auxv_secure;

    if (__auxv)
    {
        size_t index = __find_auxv(type);
        if (index != ((size_t) -1))
            return __auxv[index];
    }

    errno = ENOENT;
    return 0;
}

static void * volatile getauxval_func;

static unsigned long  __auxv_init(unsigned long type)
{
    if (!__environ)
        return 0;

    size_t i;
    for (i = 0; __environ[i]; i++);
    __auxv = (unsigned long *) (__environ + i + 1);

    size_t secure_idx = __find_auxv(AT_SECURE);
    if (secure_idx != ((size_t) -1))
        __auxv_secure = __auxv[secure_idx];

    a_cas_p(&getauxval_func, (void *)__auxv_init, (void *)__getauxval);
    return __getauxval(type);
}

static void * volatile getauxval_func = (void *)__auxv_init;

unsigned long getauxval(unsigned long type)
{
    return ((unsigned long (*)(unsigned long))getauxval_func)(type);
}

@filimonov filimonov force-pushed the less_clock_gettime2 branch from cb63cfe to 242d8e1 Compare August 12, 2021 09:08
@alexey-milovidov
Copy link
Copy Markdown
Member

Someone breaks test_odbc_interaction/test.py::test_sqlite_odbc_hashed_dictionary in master.
Need to figure out why and where it has been broken.

@alexey-milovidov
Copy link
Copy Markdown
Member

@filimonov

BTW: system.trace_log will tend to catch long-running functions.

Every function is traced with the probability proportional to the CPU time or real time.

@alexey-milovidov
Copy link
Copy Markdown
Member

Unfortunately, libunwind does not work correctly with vdso.

@alexey-milovidov
Copy link
Copy Markdown
Member

This PR is Ok but it interferes with other details in code.

@alexey-milovidov
Copy link
Copy Markdown
Member

Finally I made it to work.

@filimonov
Copy link
Copy Markdown
Contributor Author

#28132

@filimonov
Copy link
Copy Markdown
Contributor Author

This PR is Ok but it interferes with other details in code.

That get us back to the question: can we ensure somehow in tests that vdso path is used? (since it looks like performance tests in CI don't catch that difference),

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants