
Conversation

@vcaputo
Member

@vcaputo vcaputo commented Jan 6, 2016

Currently systemd-journald interrogates the log-producing process on a
per-dispatched-message basis, which ultimately results in significant
work for every line logged, even within a single received buffer where
this information should be identical across all of the messages
dispatched from that buffer.

This commit rearranges things so we cache and reuse the metadata across
multiple messages when possible, refreshing on buffer boundaries, and
falling back to the cached information should refreshing fail (common
when processing a message produced by a now-exited process, such as
when a daemon logs an error and exits).

In addition to enabling caching and reuse of the interrogated state,
this also changes the way the appended entry is being assembled to
require fewer allocations and copies per message.

The common OBJECT_PID == _PID case has also been optimized to simply
reuse the cached metadata retrieved for _PID.

The current implementation of interrogating the process on every
message is also arguably less correct than this more efficient version.
By the time we interrogate the log-producing process, the messages are
already from the past; the best we can do is interrogate once at the
time we process the buffer, on behalf of the entire buffer, and reuse
that metadata as if it were current for all messages contained therein.
Repeatedly querying the instantaneous process state for data associated
with a message from the past is less likely to be accurate.

In order to make this all work correctly, journal_file_append_entry()
needed modification to support scatter-gather IO within the individual
entry items, introducing a new JournalEntryItem type for describing the
entry.

Previously journal_file_append_entry() used a single struct iovec to
represent each "KEY=VALUE" item of the entry being appended, preventing
scattered buffers from representing the individual items.

The new interface allows fewer allocations and copies in constructing
the appended journal entries: a per-item iovec can simply reference the
constant "KEY=" string from a static array, and the subsequent iovec
for the item can reference the discrete "VALUE" right-hand side
directly from the metadata cache. This is also what enables the
OBJECT_PID == _PID metadata reuse optimization mentioned above; the
VALUE side of multiple items may simply reference the same value within
the metadata cache.
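
For illustration only, here is a minimal standalone sketch of the
scattered-item idea, using stand-in types and names rather than the
actual patch (the real JournalEntryItem, key table, and cache look
different):

    /* Hypothetical sketch: each entry item is described by a small vector
     * of iovecs, so a static "KEY=" string and a cached VALUE string can be
     * referenced in place instead of being concatenated into a freshly
     * allocated "KEY=VALUE" buffer. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>

    typedef struct JournalEntryItem {        /* illustrative layout only */
            const struct iovec *iovec;       /* scattered pieces of one item */
            unsigned n_iovec;
    } JournalEntryItem;

    static const char PID_KEY[] = "_PID=";   /* would come from a static key table */

    int main(void) {
            const char *cached_pid = "19979";        /* value cached per stream */

            struct iovec pieces[2] = {
                    { (void *) PID_KEY,    sizeof(PID_KEY) - 1 },
                    { (void *) cached_pid, strlen(cached_pid)  },
            };
            JournalEntryItem item = { pieces, 2 };

            /* journal-file.c would only stitch the pieces together when
             * writing into the mmap'd file; here we just print the item. */
            for (unsigned i = 0; i < item.n_iovec; i++)
                    fwrite(item.iovec[i].iov_base, 1, item.iovec[i].iov_len, stdout);
            putchar('\n');
            return 0;
    }

The point is that neither the "KEY=" prefix nor the cached value is
copied until journal-file.c actually writes the entry out.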

Some data:

 $ # journald-native (a-la libsystemd), ephemeral processes
 $ time {
     for ((i=0; i < 1000; i++)); do
       dd if=/dev/urandom bs=8k count=1 | od -x | systemd-cat
     done
   } >& /dev/null

  Before:
   real    0m11.938s
   user    0m1.788s
   sys     0m2.153s

   real    0m11.989s
   user    0m1.846s
   sys     0m2.130s

   real    0m12.012s
   user    0m1.802s
   sys     0m2.176s

  Now:
   real    0m7.937s
   user    0m1.801s
   sys     0m2.062s

   real    0m8.094s
   user    0m1.685s
   sys     0m2.277s

   real    0m8.114s
   user    0m1.751s
   sys     0m2.152s

---

 $ # journald-native (a-la libsystemd), persistent process
 $ dd if=/dev/urandom bs=8k | od -x | systemd-cat

  Before:
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 7.28229 s, 1.1 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 9.78265 s, 837 kB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 8.17331 s, 1.0 MB/s

  Now:
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.3507 s, 6.1 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.13135 s, 7.2 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.17473 s, 7.0 MB/s

---

 $ # journald-stream (a-la systemd services)
 $ cat >> noise.sh <<EOF
 #!/bin/bash
 dd if=/dev/urandom bs=8k count=1000 2>/tmp/foo | od -x
 EOF
 $ chmod +x noise.sh
 $ tail -F /tmp/foo &
 $ systemd-run ./noise.sh

  Before:
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 4.35375 s, 1.9 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 4.56279 s, 1.8 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 9.26128 s, 885 kB/s

  Now:
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.38334 s, 5.9 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.14587 s, 7.1 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.40992 s, 5.8 MB/s

---

 $ # service's farewell messages lossy metadata
 $ gcc -x c -pipe -O2 -s -o sigh - <<EOF
 #include <unistd.h>
 int main(int argc, const char *argv[]) {
         write(1, "sleeping 5s\n", 12);
         sleep(5);
         write(1, "farewell\n", 9);
         return 0;
 }
 EOF
 $ journalctl --follow --all --output=verbose --identifier=sigh &
 $ systemd-run ./sigh
 $ dd if=/dev/urandom bs=8k | od -x | systemd-cat
 $ # sigh starts before dd, dd spams journald before farewell occurs, delaying
 $ # journald's servicing of farewell to race with sigh's exit.
 $ # Now we'll at least have as much metadata after exit as we ever did for any
 $ # previous log message on the stream.

  Before:
   Wed 2016-01-06 16:55:45.649644 UTC [...
       _TRANSPORT=stdout
       PRIORITY=6
       _UID=0
       _GID=0
       _SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
       _BOOT_ID=9fd04fe784a340d5956d8c1db921c31b
       _MACHINE_ID=7e967d2a1b3740f18722f903e0ccecfd
       _HOSTNAME=coreos_developer_qemu-884-0-0-20
       SYSLOG_FACILITY=3
       SYSLOG_IDENTIFIER=sigh
       MESSAGE=sleeping 5s
       _CAP_EFFECTIVE=3fffffffff
       _PID=19979
       _COMM=sigh
       _EXE=/tmp/sigh
       _CMDLINE=/tmp/./sigh
       _SYSTEMD_CGROUP=/system.slice/run-19978.service
       _SYSTEMD_UNIT=run-19978.service
       _SYSTEMD_SLICE=system.slice
   Wed 2016-01-06 16:55:50.649851 UTC [...
       _TRANSPORT=stdout
       PRIORITY=6
       _UID=0
       _GID=0
       _SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
       _BOOT_ID=9fd04fe784a340d5956d8c1db921c31b
       _MACHINE_ID=7e967d2a1b3740f18722f903e0ccecfd
       _HOSTNAME=coreos_developer_qemu-884-0-0-20
       SYSLOG_FACILITY=3
       SYSLOG_IDENTIFIER=sigh
       MESSAGE=farewell
       _CAP_EFFECTIVE=3fffffffff
       _PID=19979
       _COMM=sigh
       _SYSTEMD_UNIT=run-19978.service

  Now:
   Wed 2016-01-06 16:57:19.857615 UTC [...
       _TRANSPORT=stdout
       PRIORITY=6
       _BOOT_ID=9fd04fe784a340d5956d8c1db921c31b
       _MACHINE_ID=7e967d2a1b3740f18722f903e0ccecfd
       _HOSTNAME=coreos_developer_qemu-884-0-0-20
       SYSLOG_FACILITY=3
       _UID=0
       _GID=0
       _SYSTEMD_SLICE=system.slice
       _SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
       _CAP_EFFECTIVE=3fffffffff
       SYSLOG_IDENTIFIER=sigh
       MESSAGE=sleeping 5s
       _PID=19995
       _COMM=sigh
       _EXE=/tmp/sigh
       _CMDLINE=/tmp/./sigh
       _SYSTEMD_CGROUP=/system.slice/run-19994.service
       _SYSTEMD_UNIT=run-19994.service
   Wed 2016-01-06 16:57:24.860611 UTC [...
       _TRANSPORT=stdout
       PRIORITY=6
       _BOOT_ID=9fd04fe784a340d5956d8c1db921c31b
       _MACHINE_ID=7e967d2a1b3740f18722f903e0ccecfd
       _HOSTNAME=coreos_developer_qemu-884-0-0-20
       SYSLOG_FACILITY=3
       _UID=0
       _GID=0
       _SYSTEMD_SLICE=system.slice
       _SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
       _CAP_EFFECTIVE=3fffffffff
       SYSLOG_IDENTIFIER=sigh
       _PID=19995
       _COMM=sigh
       _EXE=/tmp/sigh
       _CMDLINE=/tmp/./sigh
       _SYSTEMD_CGROUP=/system.slice/run-19994.service
       _SYSTEMD_UNIT=run-19994.service
       MESSAGE=farewell

---

 $ # no journald, ideal for reference
 $ dd if=/dev/urandom bs=8k count=1000 | od -x > /dev/null
 1000+0 records in
 1000+0 records out
 8192000 bytes (8.2 MB) copied, 1.2545 s, 6.5 MB/s
 1000+0 records in
 1000+0 records out
 8192000 bytes (8.2 MB) copied, 1.00709 s, 8.1 MB/s
 1000+0 records in
 1000+0 records out
 8192000 bytes (8.2 MB) copied, 1.05836 s, 7.7 MB/s

@sztanpet

sztanpet commented Jan 6, 2016

What happens with a few thousand services? How does the memory usage change?

@vcaputo
Member Author

vcaputo commented Jan 6, 2016

Just putting this up here; there are some remaining things to clean up, and presumably JournalMeta, its functions, and the tables should go into journal-meta.[ch] (or whatever name you guys prefer, if we agree on the fundamentals of the change). The vectorized helpers are a bit unfortunate, but... vectorizing the entry items is an important part of the picture.

I was also using free_and_replace() as I developed this against v225 (CoreOS) and was a little surprised to find it gone in master; it's just inlined here. I think my use of free_and_replace() demonstrates a reasonable application in favor of its existence... that code would be significantly more verbose and repetitive without it.
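
(For readers unfamiliar with the helper: free_and_replace() is roughly
the following macro; this is a from-memory sketch and the exact
definition in older trees may differ slightly.)

    /* Rough sketch of free_and_replace(): free the old value, steal the new
     * one, and leave the source pointer NULL. Uses a GNU statement
     * expression, as systemd's helpers commonly do. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define free_and_replace(a, b)          \
            ({                              \
                    free(a);                \
                    (a) = (b);              \
                    (b) = NULL;             \
                    0;                      \
            })

    int main(void) {
            char *cached = strdup("old comm");
            char *fresh  = strdup("new comm");

            /* refresh succeeded: replace the cached string, drop the temporary */
            free_and_replace(cached, fresh);

            printf("%s\n", cached);          /* prints "new comm" */
            free(cached);
            return 0;
    }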

@vcaputo
Member Author

vcaputo commented Jan 6, 2016

@sztanpet The amount of memory needed is negligible; all we're doing is caching the string form of the value side of this metadata. Additionally, I don't think the current approach of not caching this information per-stream is tenable at all: the whole 'lost metadata at service exit, because by the time journald interrogates the process it's gone' situation is a total disaster.

It's one thing to miss the metadata because the process was so short-lived that journald never had the opportunity to interrogate it even once, but when the process is a service which has been getting logged with full metadata, it's very unexpected and confusing when the last (read: interesting and valuable) messages emitted before the service exited appear to be missing from the journal, because the user is keying their search on metadata those last messages lack.

This is why I've gone and cached the information for the lifetime of the stream in journald-stream: it fixes a usability issue (we've had numerous etcd users complain that etcd is losing logs before exit, which turns out to be journald losing the association of those messages with etcd for the fields the metadata provides).

The performance improvements could be had by caching the information in journald-stream on a per-buffer/per-scan basis, in the interest of saving a little memory, but then we'd still have users confused as all hell when the most important log messages appear to be lost from the journal because they're filtering on the lost metadata.

@sztanpet

sztanpet commented Jan 6, 2016

I agree that it's certainly needed and thanks for the detailed response @vcaputo!

@evverx
Contributor

evverx commented Jan 7, 2016

Looks like @vcaputo fixed #1456 too :)

@poettering
Member

Hmm, so let me get this right: this implements a cache for /proc accesses, so that we don't need to read the metadata each time? What's the strategy for dealing with changing credentials?

So far we always assumed that one day somebody would resurrect the SCM_CGROUPS stuff in the kernel, so that much of the credentials stuff would become fully reliable, accurate and cheap. Even more so if we could get SCM_AUDIT and suchlike into place. Wouldn't it make sense to work on resurrecting this? I think the only missing bit to get SCM_CGROUPS into the kernel really was to ensure this was opt-in via some SO_xyz option on the recv side.

@vcaputo
Member Author

vcaputo commented Jan 7, 2016

@evverx There will always be the potential for the process to have exited before journald has processed any of the queued log messages; in that case, my changes can't help. The only way to protect against that edge case is to have the kernel bundle all the metadata with every log message, but that's not present at the moment.

However, if journald has managed to interrogate the process at least once before it exits, then journald will at least be able to fall back on the cached metadata from that interrogation, should the process be gone before its later log messages are consumed.

So my change does improve the situation in #1456, but not for processes so ephemeral journald never sampled their metadata.

@vcaputo
Member Author

vcaputo commented Jan 7, 2016

@poettering The cache is being updated for every buffer scanned. This change simply eliminates the need to interrogate /proc for every message (newline) within a buffer, which is fairly pointless and arguably less correct for subsequent lines within the same atomic buffer. You don't want to update the metadata to an instantaneous process state that is in the future relative to when the message was sent. Just update it once per buffer and apply that to all messages within.

@vcaputo
Member Author

vcaputo commented Jan 7, 2016

@poettering The SCM_CGROUPS proposal sounds good, but I don't see how it precludes the caching of all the process metadata we're getting out of /proc for messages within the same buffer.

They're complementary, not mutually exclusive. SCM_CGROUPS would just improve the reliability of the cgroup metadata, which is great and important. There's more than the cgroup being read from /proc, however, and it's costly and inappropriate to interrogate that stuff repeatedly for multiple lines/messages arriving in the same buffer. Interrogate /proc once for the buffer, and supply the metadata to the dispatcher for reuse across all the lines/messages therein.
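
A minimal sketch of that flow, with hypothetical stand-in names (Meta,
refresh_meta(), dispatch_buffer()) rather than the actual journald code:

    /* Refresh the cached metadata once per received buffer, keep the previous
     * values if the process is already gone, and reuse the same cache for
     * every newline-delimited message dispatched from that buffer. The real
     * cache carries many more fields (_COMM, _EXE, _CMDLINE, units, ...). */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    typedef struct Meta {
            char comm[64];
            bool valid;
    } Meta;

    static bool refresh_meta(Meta *m, pid_t pid) {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/comm", (int) pid);
            f = fopen(path, "re");
            if (!f)
                    return false;            /* process exited: keep cached values */
            if (fgets(m->comm, sizeof(m->comm), f)) {
                    m->comm[strcspn(m->comm, "\n")] = 0;
                    m->valid = true;
            }
            fclose(f);
            return true;
    }

    static void dispatch_buffer(Meta *cache, pid_t pid, char *buf) {
            /* interrogate /proc once per buffer, not once per line... */
            (void) refresh_meta(cache, pid);

            /* ...then reuse the (possibly stale, but best available) metadata
             * for every message contained in the buffer */
            for (char *line = strtok(buf, "\n"); line; line = strtok(NULL, "\n"))
                    printf("MESSAGE=%s _COMM=%s\n", line,
                           cache->valid ? cache->comm : "(unknown)");
    }

    int main(void) {
            Meta cache = { "", false };
            char buf[] = "sleeping 5s\nfarewell";

            dispatch_buffer(&cache, getpid(), buf);
            return 0;
    }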

When s->length is zero this function doesn't do anything.  Simply
short-circuit the function when this is the case.  This is especially
helpful now as it suppresses an unnecessary journal_meta_refresh() when
the length is zero, a common occurrence in force_flush == true (eof)
invocations.

It also clarifies the code, since you must study the implementation to
verify this fact without the explicit short-circuit being present.
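
In code terms the described change amounts to a guard like the
following (hypothetical stand-in types; the real scan function lives in
journald-stream.c):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct StdoutStream {        /* stand-in, not the real layout */
            char *buffer;
            size_t length;
    } StdoutStream;

    static int stdout_stream_scan(StdoutStream *s, bool force_flush) {
            if (s->length == 0)
                    return 0;    /* nothing buffered: skip the scan, and with
                                  * it the pointless journal_meta_refresh() */

            printf("scanning %zu bytes (force_flush=%d)\n", s->length, force_flush);
            return 0;
    }

    int main(void) {
            StdoutStream s = { NULL, 0 };
            return stdout_stream_scan(&s, true);     /* eof-style invocation */
    }
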
@poettering
Member

OK, if the creds cache is used only within the same buffer we read, and when the data is otherwise gone, then I think I am on board with the concept.

@poettering
Member

Hmm, any chance you can maybe rework this to use the sd_bus_creds object instead? It is pretty much exactly the same as the new JournalMeta structure you added, but generic, and exposed as part of sd-bus... It can do a lot of the stuff we need here anyway (such as augmenting ucred data from /proc), and I think it would be really good to just add a few new calls to it to make it support this new use case, too, rather than adding a fully new object...

For example, we could add sd_bus_creds_refresh() as public API which tries to refresh the data it already knows. And we could add a new (internal?) bus_creds_from_ucred_and_label() or so, which allows allocating an sd_bus_creds object from the socket metadata we have.
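
Neither call exists; sketched purely as hypothetical prototypes
(assuming the libsystemd sd-bus headers), the proposal is roughly:

    #define _GNU_SOURCE                      /* struct ucred in <sys/socket.h> */
    #include <sys/socket.h>
    #include <systemd/sd-bus.h>

    /* public: re-read from /proc whatever fields the object already carries,
     * keeping the old values for anything that can no longer be read */
    int sd_bus_creds_refresh(sd_bus_creds *c);

    /* internal: build an sd_bus_creds object from the ucred and label
     * journald already receives on its sockets */
    int bus_creds_from_ucred_and_label(const struct ucred *ucred,
                                       const char *label,
                                       sd_bus_creds **ret);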

I think this would be much nicer, and would actually result in the patch reducing footprint while adding new features, instead of increasing it!

Hope that makes sense!

@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Jan 12, 2016
@vcaputo
Member Author

vcaputo commented Jan 14, 2016

@poettering I'll poke at the sd_bus_creds idea; do you have any thoughts re: JournalEntryItem and the vectorized stuff?

@poettering
Member

What's the precise reason for the vectorization stuff? It's well-known that kernel-based scatter/gather IO for smaller stuff is actually very slow, much slower than copying things together in userspace and submitting things as one. There are good reasons for using kernel-based scatter/gather, but speed isn't one of them. What's your rationale here, precisely?

@vcaputo
Member Author

vcaputo commented Jan 15, 2016

@poettering The vectorizing of individual "KEY=VALUE" items is to permit the constant key names to be referenced from a static array of strings and the values to be referenced in-place in the JournalMeta structure (or sd_bus_creds). We currently do all sorts of allocating and freeing to accommodate the combined "KEY=VALUE" form just for the sake of fulfilling the non-vectorized journal_file_append_entry() interface, not to mention we have to copy the strings into these tiny ephemeral buffers.

Look at the current body of dispatch_message_real():
https://github.com/systemd/systemd/blob/master/src/journal/journald-server.c#L586
it's a mess of snprintf(), strjoina(), and free() calls just so we can feed journal_file_append_entry() contiguous "KEY=VALUE" strings. I'm not vectorizing under the impression that I'll make copying faster; I'm doing this to elide a bunch of pointless allocator activity and string concatenation, deferring turning this stuff into anything contiguous until journal-file.c sticks it into the mmap'd file.
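
To illustrate the contrast in simplified, standalone form (not the
actual journald code; asprintf() stands in for the snprintf/strjoina
pattern, and the names here are invented):

    /* The contiguous approach formats and allocates a "KEY=VALUE" string per
     * field, per message; the scattered approach just points at existing
     * memory. */
    #define _GNU_SOURCE                      /* asprintf() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    /* today-ish: one allocation and one copy for every field of every message */
    static struct iovec contiguous_item(int pid) {
            char *buf = NULL;

            if (asprintf(&buf, "_PID=%d", pid) < 0)
                    buf = NULL;
            return (struct iovec) { buf, buf ? strlen(buf) : 0 };
    }

    /* vectorized: the key comes from a static table, the value straight from
     * the metadata cache; nothing is allocated or copied here */
    static void scattered_item(const char *cached_pid, struct iovec out[2]) {
            static const char key[] = "_PID=";

            out[0] = (struct iovec) { (void *) key,        sizeof(key) - 1 };
            out[1] = (struct iovec) { (void *) cached_pid, strlen(cached_pid) };
    }

    int main(void) {
            struct iovec a = contiguous_item(1234), b[2];

            scattered_item("1234", b);
            printf("contiguous: %.*s\n", (int) a.iov_len, (char *) a.iov_base);
            printf("scattered:  %.*s%.*s\n",
                   (int) b[0].iov_len, (char *) b[0].iov_base,
                   (int) b[1].iov_len, (char *) b[1].iov_base);
            free(a.iov_base);
            return 0;
    }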

@vcaputo
Member Author

vcaputo commented Jan 15, 2016

@poettering At a glance I'm not really liking the sd_bus_creds idea; it includes a bunch of stuff we're not using that would nonetheless need to be refreshed by a generic implementation, and it doesn't cache everything in the string form ready for the iovec to reference.

@philips
Contributor

philips commented Jan 28, 2016

@vcaputo Can we separate out the vectorization and the caching for the purpose of getting part of this merged? Do they have separate performance impact?

@vcaputo
Member Author

vcaputo commented Feb 23, 2016

@poettering What are your current feelings about this PR? I had responded to your questions regarding the vectorized items, and would like us to come to some kind of agreement here before going and dropping the vectorized items in favor of allocating space for and converting types into concatenated key=value string items for every message.

@centos-ci

Can one of the admins verify this patch?

@gdamjan
Contributor

gdamjan commented Jun 2, 2016

Why doesn't systemd inform the journal about the service that is going to use the logging socket before even starting the service?
Then the journal would know exactly what to tag the message with and never have to interrogate the system about it.

@markstos
Contributor

Today I ran into exactly the problem case described: just before exiting, I logged to STDOUT the key details about what failed, but those details are confusingly missing when I do journalctl -u my-unit-name.

I then spent hours tonight trying to figure out how to make sure /all/ my logs from a service end up tagged with the service name. I have a write-up about it on the StackExchange site here:

http://unix.stackexchange.com/questions/291483/how-to-fix-unit-attribution-of-last-lines-of-systemd-logging-before-exit

The issue turned out to be easy to reproduce. Start with a simple service file named stdout-before-exit.service

[Unit]
Description=Stdout-before-exit

[Service]
StandardOutput=journal
ExecStart=/home/mark/bin/stdout-before-exit.sh

Then in the referenced bash script, just put:

#!/bin/bash
echo "Logging just before exit!"
exit;

Then run the service with systemctl start.

Finally, check if you can find the log entry by looking up the logs for that service:

journalctl -u stdout-before-exit
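
For what it's worth, one way to confirm that the line did reach the
journal but lost its unit attribution (assuming the message text is
unique enough to match on) is to compare the unit filter with a direct
MESSAGE= field match:

 $ journalctl -u stdout-before-exit
 $ # the final line(s) may be missing from this view...
 $ journalctl -o verbose MESSAGE='Logging just before exit!'
 $ # ...while a direct field match still finds the entry, just without
 $ # the _SYSTEMD_UNIT= field attached.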

@markstos
Contributor

To amend my earlier comment, I found the situation is much worse for user units, which is how I was testing. I had trouble getting the unit attribution attached to a user unit even when I did "sleep 10 && echo 'some log' && sleep 10". For a system unit, the attribution gap is much smaller, but still present.

@vcaputo
Member Author

vcaputo commented Jul 14, 2016

Well, I'm back from a prolonged and unexpected break, and curious what everyone feels about this stuff now that it's been steeping for months.

There was an outstanding question regarding the vectorizing code, which is an important part of the optimizations as it enables arbitrarily scattering the "key=value" strings. By scattering them we're able to cache the values alone and stitch together arbitrary combinations of static keys and potentially cached values. In some cases the values will be the same across multiple fields, and in those cases it's better to simply reference the same shared value. The splitting also elides the need for allocating, constructing, and freeing contiguous regions for the combined "key=value" fields.

This PR probably needs cleanups and at least a rebase; coming to some agreement on what the mergeable form should approximately look like would be helpful.

/cc @poettering

@phemmer

phemmer commented Aug 31, 2016

I'm also interested in this, but would favor an even more aggressive caching scheme.
Right now, we have boxes where journald is consuming ~50% CPU. These are log heavy machines, such as web servers. Polling the same /proc/$pid data thousands of times a second seems like overkill.
I haven't tested this PR specifically, but I don't think once-per-buffer would be enough. That would mainly help if the rate of incoming events is faster than the rate at which journald can process them, such as when journald is CPU-bound and the buffer starts accumulating. As long as journald is able to keep up and can process one event at a time, it's still going to chew up the CPU.

I can certainly understand the benefit of checking the /proc/$pid data per-event, so while I personally would favor more aggressive caching, I don't think it should be mandatory, or default. So perhaps an option we could set. And even better would be a per-unit option so that we can enable it specifically for log heavy apps.

@vcaputo
Member Author

vcaputo commented Aug 31, 2016

@phemmer Could you try this as-is with your workload to see how things perform? (assuming I rebased against master again...)

@phemmer

phemmer commented Aug 31, 2016

Assuming I can get it working on an EL7 box, yes. If I can just do a quick and dirty swap of the journald binary on the system, then it should be easy to test.

@vcaputo
Member Author

vcaputo commented Sep 7, 2016

@phemmer Hm, EL7 is a relatively old systemd anyway; if you're just going to swap the binary, then this PR as-is should be fine for an experiment, just apply it to an older branch like v229.

@towolf

towolf commented Sep 26, 2016

Is this still on the radar of the systemd reviewers? I've hit the race problem that this PR promises to ameliorate. I can't with good confidence suggest making use of the journal for log-heavy stuff while these issues remain.

@poettering are we really gonna stick to waiting for a kernel fix?

@martinpitt
Contributor

This PR is almost a year old and thus stale, and got rejected in its present form. Hence I am closing it now. If this is still relevant, please fix and rebase to current master, and either reopen this PR or open a new one. Thank you for understanding!

@martinpitt martinpitt closed this Oct 12, 2016
@phemmer phemmer mentioned this pull request Jan 18, 2017
poettering added a commit to poettering/systemd that referenced this pull request Jul 17, 2017
Cache client metadata, in order to improve runtime behaviour under
pressure.

This is inspired by @vcaputo's work, specifically:

systemd#2280

That code implements related but different semantics.

For a longer explanation of what this change implements, please have a
look at the long source comment this patch adds to journald-context.c.

After this commit:

        # time bash -c 'dd bs=$((1024*1024)) count=$((1*1024)) if=/dev/urandom | systemd-cat'
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.2783 s, 95.2 MB/s

        real	0m11.283s
        user	0m0.007s
        sys	0m6.216s

Before this commit:

        # time bash -c 'dd bs=$((1024*1024)) count=$((1*1024)) if=/dev/urandom | systemd-cat'
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 52.0788 s, 20.6 MB/s

        real	0m52.099s
        user	0m0.014s
        sys	0m7.170s

As a side effect, this corrects the journal's rate limiter feature: we
now always use the unit name as the key for the rate limiter.