
Conversation

@vcaputo
Member

@vcaputo vcaputo commented Jan 6, 2016

Currently systemd-journald interrogates the log-producing process on a
per-dispatched-message basis, which ultimately results in significant
work for every line logged, even within a single received buffer where
this information should be identical across all of the messages
dispatched from that buffer.

This commit rearranges things so we cache and reuse the metadata across
multiple messages when possible, refreshing on buffer boundaries, and
falling back to the cached information should refreshing fail (common
when processing a message produced by a now-exited process, such as
when a daemon logs an error and exits).

In addition to enabling caching and reuse of the interrogated state,
this also changes the way the appended entry is being assembled to
require fewer allocations and copies per message.

The common OBJECT_PID == _PID case has also been optimized to simply
reuse the cached metadata retrieved for _PID.

The current implementation of interrogating the process on every
message is also arguably less correct than this more efficient version.
By the time we interrogate the log-producing process, the messages are
already from the past; the best we can do is interrogate once at the
time we process the buffer, on behalf of the entire buffer, and reuse
that metadata as if it were current for all messages contained therein.
Repeatedly querying the instantaneous process state for data associated
with a message from the past is less likely to be accurate.

In order to make this all work correctly, journal_file_append_entry()
needed modification to support scatter-gather IO within the individual
entry items, introducing a new JournalEntryItem type for describing the
entry.

Previously journal_file_append_entry() used a single struct iovec to
represent each "KEY=VALUE" item of the entry being appended, preventing
scattered buffers from representing the individual items.

The new interface allows fewer allocations and copies in constructing
the appended journal entries: a per-item iovec can simply reference the
constant "KEY=" string from a static array, and the subsequent iovec
for the item can reference the discrete "VALUE" right-hand side
directly from the metadata cache. This is also what enables the
OBJECT_PID == _PID metadata reuse optimization mentioned above; the
VALUE side of multiple items may simply reference the same value within
the metadata cache.
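
For illustration only, here is a minimal standalone sketch of the
scattered-item idea, using stand-in types and names rather than the
actual patch (the real JournalEntryItem, key table, and cache look
different):

    /* Hypothetical sketch: each entry item is described by a small vector
     * of iovecs, so a static "KEY=" string and a cached VALUE string can be
     * referenced in place instead of being concatenated into a freshly
     * allocated "KEY=VALUE" buffer. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>

    typedef struct JournalEntryItem {        /* illustrative layout only */
            const struct iovec *iovec;       /* scattered pieces of one item */
            unsigned n_iovec;
    } JournalEntryItem;

    static const char PID_KEY[] = "_PID=";   /* would come from a static key table */

    int main(void) {
            const char *cached_pid = "19979";        /* value cached per stream */

            struct iovec pieces[2] = {
                    { (void *) PID_KEY,    sizeof(PID_KEY) - 1 },
                    { (void *) cached_pid, strlen(cached_pid)  },
            };
            JournalEntryItem item = { pieces, 2 };

            /* journal-file.c would only stitch the pieces together when
             * writing into the mmap'd file; here we just print the item. */
            for (unsigned i = 0; i < item.n_iovec; i++)
                    fwrite(item.iovec[i].iov_base, 1, item.iovec[i].iov_len, stdout);
            putchar('\n');
            return 0;
    }

The point is that neither the "KEY=" prefix nor the cached value is
copied until journal-file.c actually writes the entry out.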

Some data:

 $ # journald-native (a-la libsystemd), ephemeral processes
 $ time {
     for ((i=0; i < 1000; i++)); do
       dd if=/dev/urandom bs=8k count=1 | od -x | systemd-cat
     done
   } >& /dev/null

  Before:
   real    0m11.938s
   user    0m1.788s
   sys     0m2.153s

   real    0m11.989s
   user    0m1.846s
   sys     0m2.130s

   real    0m12.012s
   user    0m1.802s
   sys     0m2.176s

  Now:
   real    0m7.937s
   user    0m1.801s
   sys     0m2.062s

   real    0m8.094s
   user    0m1.685s
   sys     0m2.277s

   real    0m8.114s
   user    0m1.751s
   sys     0m2.152s

---

 $ # journald-native (a-la libsystemd), persistent process
 $ dd if=/dev/urandom bs=8k | od -x | systemd-cat

  Before:
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 7.28229 s, 1.1 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 9.78265 s, 837 kB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 8.17331 s, 1.0 MB/s

  Now:
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.3507 s, 6.1 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.13135 s, 7.2 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.17473 s, 7.0 MB/s

---

 $ # journald-stream (a-la systemd services)
 $ cat >> noise.sh <<EOF
 #!/bin/bash
 dd if=/dev/urandom bs=8k count=1000 2>/tmp/foo | od -x
 EOF
 $ chmod +x noise.sh
 $ tail -F /tmp/foo &
 $ systemd-run ./noise.sh

  Before:
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 4.35375 s, 1.9 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 4.56279 s, 1.8 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 9.26128 s, 885 kB/s

  Now:
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.38334 s, 5.9 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.14587 s, 7.1 MB/s
   1000+0 records in
   1000+0 records out
   8192000 bytes (8.2 MB) copied, 1.40992 s, 5.8 MB/s

---

 $ # service's farewell messages lossy metadata
 $ gcc -x c -pipe -O2 -s -o sigh - <<EOF
 #include <unistd.h>
 int main(int argc, const char *argv[]) {
         write(1, "sleeping 5s\n", 12);
         sleep(5);
         write(1, "farewell\n", 9);
         return 0;
 }
 EOF
 $ journalctl --follow --all --output=verbose --identifier=sigh &
 $ systemd-run ./sigh
 $ dd if=/dev/urandom bs=8k | od -x | systemd-cat
 $ # sigh starts before dd, dd spams journald before farewell occurs, delaying
 $ # journald's servicing of farewell to race with sigh's exit.
 $ # Now we'll at least have as much metadata after exit as we ever did for any
 $ # previous log message on the stream.

  Before:
   Wed 2016-01-06 16:55:45.649644 UTC [...
       _TRANSPORT=stdout
       PRIORITY=6
       _UID=0
       _GID=0
       _SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
       _BOOT_ID=9fd04fe784a340d5956d8c1db921c31b
       _MACHINE_ID=7e967d2a1b3740f18722f903e0ccecfd
       _HOSTNAME=coreos_developer_qemu-884-0-0-20
       SYSLOG_FACILITY=3
       SYSLOG_IDENTIFIER=sigh
       MESSAGE=sleeping 5s
       _CAP_EFFECTIVE=3fffffffff
       _PID=19979
       _COMM=sigh
       _EXE=/tmp/sigh
       _CMDLINE=/tmp/./sigh
       _SYSTEMD_CGROUP=/system.slice/run-19978.service
       _SYSTEMD_UNIT=run-19978.service
       _SYSTEMD_SLICE=system.slice
   Wed 2016-01-06 16:55:50.649851 UTC [...
       _TRANSPORT=stdout
       PRIORITY=6
       _UID=0
       _GID=0
       _SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
       _BOOT_ID=9fd04fe784a340d5956d8c1db921c31b
       _MACHINE_ID=7e967d2a1b3740f18722f903e0ccecfd
       _HOSTNAME=coreos_developer_qemu-884-0-0-20
       SYSLOG_FACILITY=3
       SYSLOG_IDENTIFIER=sigh
       MESSAGE=farewell
       _CAP_EFFECTIVE=3fffffffff
       _PID=19979
       _COMM=sigh
       _SYSTEMD_UNIT=run-19978.service

  Now:
   Wed 2016-01-06 16:57:19.857615 UTC [...
       _TRANSPORT=stdout
       PRIORITY=6
       _BOOT_ID=9fd04fe784a340d5956d8c1db921c31b
       _MACHINE_ID=7e967d2a1b3740f18722f903e0ccecfd
       _HOSTNAME=coreos_developer_qemu-884-0-0-20
       SYSLOG_FACILITY=3
       _UID=0
       _GID=0
       _SYSTEMD_SLICE=system.slice
       _SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
       _CAP_EFFECTIVE=3fffffffff
       SYSLOG_IDENTIFIER=sigh
       MESSAGE=sleeping 5s
       _PID=19995
       _COMM=sigh
       _EXE=/tmp/sigh
       _CMDLINE=/tmp/./sigh
       _SYSTEMD_CGROUP=/system.slice/run-19994.service
       _SYSTEMD_UNIT=run-19994.service
   Wed 2016-01-06 16:57:24.860611 UTC [...
       _TRANSPORT=stdout
       PRIORITY=6
       _BOOT_ID=9fd04fe784a340d5956d8c1db921c31b
       _MACHINE_ID=7e967d2a1b3740f18722f903e0ccecfd
       _HOSTNAME=coreos_developer_qemu-884-0-0-20
       SYSLOG_FACILITY=3
       _UID=0
       _GID=0
       _SYSTEMD_SLICE=system.slice
       _SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
       _CAP_EFFECTIVE=3fffffffff
       SYSLOG_IDENTIFIER=sigh
       _PID=19995
       _COMM=sigh
       _EXE=/tmp/sigh
       _CMDLINE=/tmp/./sigh
       _SYSTEMD_CGROUP=/system.slice/run-19994.service
       _SYSTEMD_UNIT=run-19994.service
       MESSAGE=farewell

---

 $ # no journald, ideal for reference
 $ dd if=/dev/urandom bs=8k count=1000 | od -x > /dev/null
 1000+0 records in
 1000+0 records out
 8192000 bytes (8.2 MB) copied, 1.2545 s, 6.5 MB/s
 1000+0 records in
 1000+0 records out
 8192000 bytes (8.2 MB) copied, 1.00709 s, 8.1 MB/s
 1000+0 records in
 1000+0 records out
 8192000 bytes (8.2 MB) copied, 1.05836 s, 7.7 MB/s

@sztanpet

sztanpet commented Jan 6, 2016

What happens with a few thousand services? How does the memory usage change?

@vcaputo
Member Author

vcaputo commented Jan 6, 2016

Just putting this up here; there are some remaining things to clean up, and presumably JournalMeta, its functions, and the tables should go into journal-meta.[ch] (or whatever name you guys prefer, if we agree on the fundamentals of the change). The vectorized helpers are a bit unfortunate, but... vectorizing the entry items is an important part of the picture.

I was also using free_and_replace() as I developed this against v225 (CoreOS) and was a little surprised to find it gone in master; it's just inlined here. I think my use of free_and_replace() demonstrates a reasonable application in favor of its existence... that code would be significantly more verbose and repetitive without it.
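
(For readers unfamiliar with the helper: free_and_replace() is roughly
the following macro; this is a from-memory sketch and the exact
definition in older trees may differ slightly.)

    /* Rough sketch of free_and_replace(): free the old value, steal the new
     * one, and leave the source pointer NULL. Uses a GNU statement
     * expression, as systemd's helpers commonly do. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define free_and_replace(a, b)          \
            ({                              \
                    free(a);                \
                    (a) = (b);              \
                    (b) = NULL;             \
                    0;                      \
            })

    int main(void) {
            char *cached = strdup("old comm");
            char *fresh  = strdup("new comm");

            /* refresh succeeded: replace the cached string, drop the temporary */
            free_and_replace(cached, fresh);

            printf("%s\n", cached);          /* prints "new comm" */
            free(cached);
            return 0;
    }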

@vcaputo
Member Author

vcaputo commented Jan 6, 2016

@sztanpet The amount of memory needed is negligible; all we're doing is caching the string form of the value side of this metadata. Additionally, I don't think the current approach of not caching this information per-stream is tenable at all: the whole 'lost metadata at service exit, because by the time journald interrogates the process it's gone' situation is a total disaster.

It's one thing to miss the metadata because the process was so short-lived that journald never had the opportunity to interrogate it even once, but when the process is a service which has been getting logged with full metadata, it's very unexpected and confusing when the last (read: interesting and valuable) messages emitted before the service exited appear to be missing from the journal, because the user is keying their search on metadata those last messages lack.

This is why I've gone and cached the information for the lifetime of the stream in journald-stream: it fixes a usability issue (we've had numerous etcd users complain that etcd is losing logs before exit, which turns out to be journald losing the association of those messages with etcd for the fields the metadata provides).

The performance improvements could be had by caching the information in journald-stream on a per-buffer/per-scan basis, in the interest of saving a little memory, but then we'd still have users confused as all hell when the most important log messages appear to be lost from the journal because they're filtering on the lost metadata.

@sztanpet

sztanpet commented Jan 6, 2016

I agree that it's certainly needed and thanks for the detailed response @vcaputo!

@evverx
Contributor

evverx commented Jan 7, 2016

Looks like @vcaputo fixed #1456 too :)

@poettering
Member

Hmm, so let me get this right: this implements a cache for /proc accesses, so that we don't need to read the metadata each time? What's the strategy for dealing with changing credentials?

So far we always assumed that one day somebody would resurrect the SCM_CGROUPS stuff in the kernel, so that much of the credentials stuff would become fully reliable, accurate and cheap. Even more so if we could get SCM_AUDIT and suchlike into place. Wouldn't it make sense to work on resurrecting this? I think the only missing bit to get SCM_CGROUPS into the kernel really was to ensure this was opt-in via some SO_xyz option on the recv side.

@vcaputo
Member Author

vcaputo commented Jan 7, 2016

@evverx There will always be the potential for the process to have exited before journald has processed any of the queued log messages; in that case, my changes can't help. The only way to protect against that edge case is to have the kernel bundle all the metadata with every log message, but that's not present at the moment.

However, if journald has managed to interrogate the process at least once before it exits, then journald will at least be able to fall back on the cached metadata from that interrogation, should the process be gone before its later log messages are consumed.

So my change does improve the situation in #1456, but not for processes so ephemeral journald never sampled their metadata.

@vcaputo
Member Author

vcaputo commented Jan 7, 2016

@poettering The cache is being updated for every buffer scanned. This change simply eliminates the need to interrogate /proc for every message (newline) within a buffer, which is fairly pointless and arguably less correct for subsequent lines within the same atomic buffer. You don't want to update the metadata to an instantaneous process state that is in the future relative to when the message was sent. Just update it once per buffer and apply that to all messages within.

@vcaputo
Member Author

vcaputo commented Jan 7, 2016

@poettering The SCM_CGROUPS proposal sounds good, but I don't see how it precludes the caching of all the process metadata we're getting out of /proc for messages within the same buffer.

They're complementary, not mutually exclusive. SCM_CGROUPS would just improve the reliability of the cgroup metadata, which is great and important. There's more than the cgroup being read from /proc, however, and it's costly and inappropriate to interrogate that stuff repeatedly for multiple lines/messages arriving in the same buffer. Interrogate /proc once for the buffer, and supply the metadata to the dispatcher for reuse across all the lines/messages therein.
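
A minimal sketch of that flow, with hypothetical stand-in names (Meta,
refresh_meta(), dispatch_buffer()) rather than the actual journald code:

    /* Refresh the cached metadata once per received buffer, keep the previous
     * values if the process is already gone, and reuse the same cache for
     * every newline-delimited message dispatched from that buffer. The real
     * cache carries many more fields (_COMM, _EXE, _CMDLINE, units, ...). */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    typedef struct Meta {
            char comm[64];
            bool valid;
    } Meta;

    static bool refresh_meta(Meta *m, pid_t pid) {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/comm", (int) pid);
            f = fopen(path, "re");
            if (!f)
                    return false;            /* process exited: keep cached values */
            if (fgets(m->comm, sizeof(m->comm), f)) {
                    m->comm[strcspn(m->comm, "\n")] = 0;
                    m->valid = true;
            }
            fclose(f);
            return true;
    }

    static void dispatch_buffer(Meta *cache, pid_t pid, char *buf) {
            /* interrogate /proc once per buffer, not once per line... */
            (void) refresh_meta(cache, pid);

            /* ...then reuse the (possibly stale, but best available) metadata
             * for every message contained in the buffer */
            for (char *line = strtok(buf, "\n"); line; line = strtok(NULL, "\n"))
                    printf("MESSAGE=%s _COMM=%s\n", line,
                           cache->valid ? cache->comm : "(unknown)");
    }

    int main(void) {
            Meta cache = { "", false };
            char buf[] = "sleeping 5s\nfarewell";

            dispatch_buffer(&cache, getpid(), buf);
            return 0;
    }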

When s->length is zero this function doesn't do anything.  Simply
short-circuit the function when this is the case.  This is especially
helpful now as it suppresses an unnecessary journal_meta_refresh() when
the length is zero, a common occurrence in force_flush == true (eof)
invocations.

It also clarifies the code, since you must study the implementation to
verify this fact without the explicit short-circuit being present.
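
In code terms the described change amounts to a guard like the
following (hypothetical stand-in types; the real scan function lives in
journald-stream.c):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct StdoutStream {        /* stand-in, not the real layout */
            char *buffer;
            size_t length;
    } StdoutStream;

    static int stdout_stream_scan(StdoutStream *s, bool force_flush) {
            if (s->length == 0)
                    return 0;    /* nothing buffered: skip the scan, and with
                                  * it the pointless journal_meta_refresh() */

            printf("scanning %zu bytes (force_flush=%d)\n", s->length, force_flush);
            return 0;
    }

    int main(void) {
            StdoutStream s = { NULL, 0 };
            return stdout_stream_scan(&s, true);     /* eof-style invocation */
    }
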
@poettering
Member

OK, if the creds cache is used only within the same buffer we read, and when the data is otherwise gone, then I think I am on board with the concept.

@poettering
Member

Hmm, any chance you can maybe rework this to use the sd_bus_creds object instead? It is pretty much exactly the same as the new JournalMeta structure you added, but generic, and exposed as part of sd-bus... It can do a lot of the stuff we need here anyway (such as augmenting ucred data from /proc), and I think it would be really good to just add a few new calls to it to make it support this new use case, too, rather than adding a fully new object...

For example, we could add sd_bus_creds_refresh() as public API which tries to refresh the data it already knows. And we could add a new (internal?) bus_creds_from_ucred_and_label() or so, which allows allocating an sd_bus_creds object from the socket metadata we have.
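
Neither call exists; sketched purely as hypothetical prototypes
(assuming the libsystemd sd-bus headers), the proposal is roughly:

    #define _GNU_SOURCE                      /* struct ucred in <sys/socket.h> */
    #include <sys/socket.h>
    #include <systemd/sd-bus.h>

    /* public: re-read from /proc whatever fields the object already carries,
     * keeping the old values for anything that can no longer be read */
    int sd_bus_creds_refresh(sd_bus_creds *c);

    /* internal: build an sd_bus_creds object from the ucred and label
     * journald already receives on its sockets */
    int bus_creds_from_ucred_and_label(const struct ucred *ucred,
                                       const char *label,
                                       sd_bus_creds **ret);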

I think this would be much nicer, and would actually result in the patch reducing footprint while adding new features, instead of increasing it!

Hope that makes sense!

@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Jan 12, 2016
@vcaputo
Member Author

vcaputo commented Jan 14, 2016

@poettering I'll poke at the sd_bus_creds idea; do you have any thoughts re: JournalEntryItem and the vectorized stuff?

@poettering
Member

What's the precise reason for the vectorization stuff? It's well-known that kernel-based scatter/gather IO for smaller stuff is actually very slow, much slower than copying things together in userspace and submitting things as one. There are good reasons for using kernel-based scatter/gather, but speed isn't one of them. What's your rationale here, precisely?

@vcaputo
Member Author

vcaputo commented Jan 15, 2016

@poettering The vectorizing of individual "KEY=VALUE" items is to permit the constant key names to be referenced from a static array of strings and the values to be referenced in-place in the JournalMeta structure (or sd_bus_creds). We currently do all sorts of allocating and freeing to accommodate the combined "KEY=VALUE" form just for the sake of fulfilling the non-vectorized journal_file_append_entry() interface, not to mention we have to copy the strings into these tiny ephemeral buffers.

Look at the current body of dispatch_message_real():
https://github.com/systemd/systemd/blob/master/src/journal/journald-server.c#L586
it's a mess of snprintf(), strjoina(), and free() calls just so we can feed journal_file_append_entry() contiguous "KEY=VALUE" strings. I'm not vectorizing under the impression that I'll make copying faster; I'm doing this to elide a bunch of pointless allocator activity and string concatenation, deferring turning this stuff into anything contiguous until journal-file.c sticks it into the mmap'd file.
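
To illustrate the contrast in simplified, standalone form (not the
actual journald code; asprintf() stands in for the snprintf/strjoina
pattern, and the names here are invented):

    /* The contiguous approach formats and allocates a "KEY=VALUE" string per
     * field, per message; the scattered approach just points at existing
     * memory. */
    #define _GNU_SOURCE                      /* asprintf() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    /* today-ish: one allocation and one copy for every field of every message */
    static struct iovec contiguous_item(int pid) {
            char *buf = NULL;

            if (asprintf(&buf, "_PID=%d", pid) < 0)
                    buf = NULL;
            return (struct iovec) { buf, buf ? strlen(buf) : 0 };
    }

    /* vectorized: the key comes from a static table, the value straight from
     * the metadata cache; nothing is allocated or copied here */
    static void scattered_item(const char *cached_pid, struct iovec out[2]) {
            static const char key[] = "_PID=";

            out[0] = (struct iovec) { (void *) key,        sizeof(key) - 1 };
            out[1] = (struct iovec) { (void *) cached_pid, strlen(cached_pid) };
    }

    int main(void) {
            struct iovec a = contiguous_item(1234), b[2];

            scattered_item("1234", b);
            printf("contiguous: %.*s\n", (int) a.iov_len, (char *) a.iov_base);
            printf("scattered:  %.*s%.*s\n",
                   (int) b[0].iov_len, (char *) b[0].iov_base,
                   (int) b[1].iov_len, (char *) b[1].iov_base);
            free(a.iov_base);
            return 0;
    }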

@vcaputo
Member Author

vcaputo commented Jan 15, 2016

@poettering At a glance I'm not really liking the sd_bus_creds idea; it includes a bunch of stuff we're not using that would nonetheless need to be refreshed by a generic implementation, and it doesn't cache everything in the string form ready for the iovec to reference.

@philips
Contributor

philips commented Jan 28, 2016

@vcaputo Can we separate out the vectorization and the caching for the purpose of getting part of this merged? Do they have separate performance impact?

@vcaputo
Member Author

vcaputo commented Feb 23, 2016

@poettering What are your current feelings about this PR? I had responded to your questions regarding the vectorized items, and would like us to come to some kind of agreement here before going and dropping the vectorized items in favor of allocating space for and converting types into concatenated key=value string items for every message.

@centos-ci

Can one of the admins verify this patch?

@gdamjan
Contributor

gdamjan commented Jun 2, 2016

Why doesn't systemd inform the journal about the service that is going to use the logging socket before even starting the service?
Then the journal would know exactly what to tag the message with and never have to interrogate the system about it.

@markstos
Contributor

Today I ran into exactly the problem case described: just before exiting, I logged to STDOUT the key details about what failed, but those details are confusingly missing when I do journalctl -u my-unit-name.

I then spent hours tonight trying to figure out how to make sure /all/ my logs from a service end up tagged with the service name. I have a write-up about it on the StackExchange site here:

http://unix.stackexchange.com/questions/291483/how-to-fix-unit-attribution-of-last-lines-of-systemd-logging-before-exit

The issue turned out to be easy to reproduce. Start with a simple service file named stdout-before-exit.service

[Unit]
Description=Stdout-before-exit

[Service]
StandardOutput=journal
ExecStart=/home/mark/bin/stdout-before-exit.sh

Then in the referenced bash script, just put:

#!/bin/bash
echo "Logging just before exit!"
exit;

Then run the service with systemctl start.

Finally, check if you can find the log entry by looking up the logs for that service:

journalctl -u stdout-before-exit
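
For what it's worth, one way to confirm that the line did reach the
journal but lost its unit attribution (assuming the message text is
unique enough to match on) is to compare the unit filter with a direct
MESSAGE= field match:

 $ journalctl -u stdout-before-exit
 $ # the final line(s) may be missing from this view...
 $ journalctl -o verbose MESSAGE='Logging just before exit!'
 $ # ...while a direct field match still finds the entry, just without
 $ # the _SYSTEMD_UNIT= field attached.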

@markstos
Contributor

To amend my earlier comment, I found the situation is much worse for user units, which is how I was testing. I had trouble getting the unit attribution attached to a user unit even when I did "sleep 10 && echo 'some log' && sleep 10". For a system unit, the attribution gap is much smaller, but still present.

@vcaputo
Member Author

vcaputo commented Jul 14, 2016

Well, I'm back from a prolonged and unexpected break, and curious what everyone feels about this stuff now that it's been steeping for months.

There was an outstanding question regarding the vectorizing code, which is an important part of the optimizations as it enables arbitrarily scattering the "key=value" strings. By scattering them we're able to cache the values alone and stitch together arbitrary combinations of static keys and potentially cached values. In some cases the values will be the same across multiple fields, and in those cases it's better to simply reference the same shared value. The splitting also elides the need for allocating, constructing, and freeing contiguous regions for the combined "key=value" fields.

This PR probably needs cleanups and at least a rebase; coming to some agreement on what the mergeable form should approximately look like would be helpful.

/cc @poettering

@phemmer

phemmer commented Aug 31, 2016

I'm also interested in this, but would favor an even more aggressive caching scheme.
Right now, we have boxes where journald is consuming ~50% CPU. These are log heavy machines, such as web servers. Polling the same /proc/$pid data thousands of times a second seems like overkill.
I haven't tested this PR specifically, but I don't think once-per-buffer would be enough. That would mainly help if the rate of incoming events is faster than the rate at which journald can process them, such as when journald is CPU-bound and the buffer starts accumulating. As long as journald is able to keep up and can process one event at a time, it's still going to chew up the CPU.

I can certainly understand the benefit of checking the /proc/$pid data per-event, so while I personally would favor more aggressive caching, I don't think it should be mandatory, or default. So perhaps an option we could set. And even better would be a per-unit option so that we can enable it specifically for log heavy apps.

@vcaputo
Member Author

vcaputo commented Aug 31, 2016

@phemmer Could you try this as-is with your workload to see how things perform? (assuming I rebased against master again...)

@phemmer

phemmer commented Aug 31, 2016

Assuming I can get it working on an EL7 box, yes. If I can just do a quick and dirty swap of the journald binary on the system, then it should be easy to test.

@vcaputo
Member Author

vcaputo commented Sep 7, 2016

@phemmer Hm, EL7 is a relatively old systemd anyway; if you're just going to swap the binary, then this PR as-is should be fine for an experiment, just apply it to an older branch like v229.

@towolf

towolf commented Sep 26, 2016

Is this still on the radar of the systemd reviewers? I've hit the race problem that this PR promises to ameliorate. I can't with good confidence suggest making use of the journal for log-heavy stuff while these issues remain.

@poettering are we really gonna stick to waiting for a kernel fix?

@martinpitt
Contributor

This PR is almost a year old and thus stale, and got rejected in its present form. Hence I am closing it now. If this is still relevant, please fix and rebase to current master, and either reopen this PR or open a new one. Thank you for understanding!

@martinpitt martinpitt closed this Oct 12, 2016
@phemmer phemmer mentioned this pull request Jan 18, 2017
poettering added a commit to poettering/systemd that referenced this pull request Jul 17, 2017
Cache client metadata, in order to improve runtime behaviour under
pressure.

This is inspired by @vcaputo's work, specifically:

systemd#2280

That code implements related but different semantics.

For a longer explanation of what this change implements, please have a
look at the long source comment this patch adds to journald-context.c.

After this commit:

        # time bash -c 'dd bs=$((1024*1024)) count=$((1*1024)) if=/dev/urandom | systemd-cat'
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.2783 s, 95.2 MB/s

        real	0m11.283s
        user	0m0.007s
        sys	0m6.216s

Before this commit:

        # time bash -c 'dd bs=$((1024*1024)) count=$((1*1024)) if=/dev/urandom | systemd-cat'
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 52.0788 s, 20.6 MB/s

        real	0m52.099s
        user	0m0.014s
        sys	0m7.170s

As a side effect, this corrects the journal's rate limiter feature: we
now always use the unit name as the key for the rate limiter.