Kernel analysis using eBPF
Daniel Thompson (and Leo Yan)
Extending the Berkeley Packet Filter
● Historically the Berkeley Packet Filter provided a means to filter 'raw' network packets
○ If you have ever used tcpdump you have probably already used it
○ tcpdump host helios and \( hot or ace \)
● eBPF has extended BPF hugely:
○ Re-encoded and more expressive opcodes
○ Multiple new hook points within the kernel to attach eBPF programs to
○ Rich data structures (maps) to pass information to/from the kernel
○ C function call interface (an eBPF program can call kernel functions)
[Diagram: framework of eBPF — tools such as ply and BCC build eBPF programs; the eBPF core (verifier, program loading, eBPF maps) accepts them; the JIT translates them into arm/aarch64 bpf_func code; maps handle data transfer]
Using eBPF for debugging
[Diagram: in userspace, foo_kern.c is compiled by LLVM/clang into foo_kern.o (eBPF bytecode); foo_user.c loads the program into the kernel, where it is JIT-compiled into an eBPF function and attached via kprobes/ftrace; the eBPF program updates eBPF maps and userspace reads them back (program working flow vs. data transferring flow)]
Using eBPF for debugging - cont.
● An eBPF program is written in C and compiled to eBPF bytecode (see the sketch below)
○ LLVM/clang provides an eBPF compiler (no support in gcc)
○ Direct code generation is also possible (or LLVM without clang)
● The eBPF program is loaded into an in-kernel virtual machine, with sanity checking at load time
● The eBPF program is "attached" to a designated code path in the kernel
○ eBPF in its traditional use case is attached to networking hooks, allowing it to filter and classify network traffic using (almost) arbitrarily complex programs
○ Furthermore, we can attach eBPF programs to tracepoints, kprobes and perf events for debugging the kernel and carrying out performance analysis
● Kernel and user space typically communicate using an eBPF map: a generic data structure well suited to transferring data from the kernel to userspace
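As a minimal sketch of what such a kernel-side program can look like, assuming the conventions of the kernel's samples/bpf build environment (bpf_helpers.h, the SEC() macros); the probe point and the map name are only illustrative:

#include <linux/version.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* one-element array map shared with user space */
struct bpf_map_def SEC("maps") count_map = {
	.type        = BPF_MAP_TYPE_ARRAY,
	.key_size    = sizeof(u32),
	.value_size  = sizeof(u64),
	.max_entries = 1,
};

/* attached to a kprobe; bumps the counter on every call */
SEC("kprobe/ksys_read")
int count_reads(struct pt_regs *ctx)
{
	u32 key = 0;
	u64 *val = bpf_map_lookup_elem(&count_map, &key);

	if (val)
		__sync_fetch_and_add(val, 1);	/* compiled to an atomic add */
	return 0;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;

A program like this would be compiled with something like clang -O2 -target bpf and loaded by a user space counterpart, as shown in the following slides.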
Debugging with eBPF versus tracing
Tracing is very powerful but it can also be cumbersome for whole-system analysis due to the volume of trace information generated. Most developers end up writing programs to summarize the trace.
eBPF allows us to write programs that summarize trace information without tracing.
[Diagram, without eBPF: kernel trace events stream into large buffers (sized to avoid losing trace data) which trace-cmd and event-processing tools consume, with frequent kernel/user space context switching. With eBPF: kernel trace events feed an eBPF program inside the kernel and only the resulting statistics cross to a user space program, so kernel/user space context switches are rare]
Introducing the VM
● Instruction set architecture (ISA)
● Verifier
● Maps
● Just-in-time compilation
eBPF bytecode instruction set architecture (ISA)
eBPF uses a simple RISC-like instruction set. It is intentionally easy to map an eBPF program to native instructions (especially on RISC machines).
There are 10 general purpose 64-bit registers plus one frame-pointer register; these map 1:1 to registers on many 64-bit architectures.
Every instruction is 64 bits wide and an eBPF program can contain a maximum of 4096 instructions.

eBPF Register   Description
R0              Return value from in-kernel function, and exit value for the eBPF program
R1 ~ R5         Arguments from the eBPF program to an in-kernel function
R6 ~ R9         Callee-saved registers that an in-kernel function will preserve
R10             Read-only frame pointer to access the stack
Instruction encoding
Three instruction types:
● ALU instructions
● memory instructions
● branch instructions

Each 64-bit instruction is laid out (MSB to LSB) as:
immediate (32 bits) | offset (16 bits) | src (4 bits) | dst (4 bits) | opcode (8 bits)

For arithmetic and jump instructions the 8-bit opcode is split into:
operation code (4 bits) | source (1 bit) | instruction class (3 bits)

For memory instructions the 8-bit opcode is split into:
mode (3 bits) | size (2 bits) | instruction class (3 bits)

The new BPF_CALL instruction made it possible to call in-kernel functions cheaply.
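In userspace this same encoding is manipulated as struct bpf_insn; for reference, the layout as defined in the uapi header <linux/bpf.h> (comments as in the header):

struct bpf_insn {
	__u8	code;		/* opcode */
	__u8	dst_reg:4;	/* dest register */
	__u8	src_reg:4;	/* source register */
	__s16	off;		/* signed offset */
	__s32	imm;		/* signed immediate constant */
};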
Just-in-time compilation (JIT)
The just-in-time (JIT) compiler translates eBPF bytecode into the host system's machine code and speeds up program execution. For most opcodes there is a 1:1 mapping between eBPF and AArch64 instructions.
The ARM/ARM64 JIT is enabled by the kernel config option CONFIG_BPF_JIT.
The ARM/ARM64 JIT complies with the Procedure Call Standard for the ARM Architecture (AAPCS) to map eBPF registers to machine registers and to build the prologue/epilogue for function entry and exit.

eBPF Register   AArch64 Register   Description
R0              X7                 Return value from in-kernel function, and exit value for the eBPF program
R1 ~ R5         X0 ~ X4            Arguments from the eBPF program to an in-kernel function
R6 ~ R9         X19 ~ X22          Callee-saved registers that an in-kernel function will preserve
R10             X25                Read-only frame pointer to access the stack
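As a small aside (an illustrative helper, not from the slides): even with CONFIG_BPF_JIT built in, the JIT is switched on and off at runtime through the net.core.bpf_jit_enable sysctl (newer kernels may force it on via CONFIG_BPF_JIT_ALWAYS_ON):

#include <stdio.h>

/* Turn on the eBPF JIT at runtime (needs root).
 * 0 = interpreter only, 1 = JIT on, 2 = JIT on + debug dump to the kernel log. */
static int enable_bpf_jit(void)
{
	FILE *f = fopen("/proc/sys/net/core/bpf_jit_enable", "w");

	if (!f)
		return -1;
	fputs("1", f);
	return fclose(f);
}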
Verifier
● eBPF programs are loaded from user space but run in kernel space; the eBPF verifier checks that the program is safe to run before it can be invoked
● Checks that the program license is GNU GPL and, for kprobes, also the kernel version
● Function call verification
○ Allows function calls from one bpf function to another
○ Only calls to known functions are allowed
○ Unresolved function calls and dynamic linking are not permitted
● Checks that the control flow graph of the eBPF program is a directed acyclic graph
○ Used to disallow loops, ensuring the program cannot cause the kernel to lock up
○ Detects unreachable instructions
○ The program must terminate with a BPF_EXIT instruction
○ All branch instructions other than BPF_EXIT and BPF_CALL must target instructions within the program boundary
● Simulates execution of every instruction and observes the state changes of registers and the stack
Control flow graph (CFG) to detect loops

Example 1: detect a back edge for a loop.
BPF_MOV64_REG(BPF_REG_1, BPF_REG_0)
BPF_MOV64_REG(BPF_REG_2, BPF_REG_0)
BPF_MOV64_REG(BPF_REG_3, BPF_REG_0)
BPF_JMP_IMM(BPF_JA, 0, 0, -4)
BPF_EXIT_INSN()

Example 2: detect a back edge for a conditional loop.
BPF_MOV64_REG(BPF_REG_1, BPF_REG_0)
BPF_MOV64_REG(BPF_REG_2, BPF_REG_0)
BPF_MOV64_REG(BPF_REG_3, BPF_REG_0)
BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, -3)
BPF_EXIT_INSN()
The state change of registers and stack
The verifier tracks register state and monitors stack usage:
● Registers with uninitialized contents cannot be read.
● After a kernel function call, R1-R5 are reset to unreadable and R0 holds the return type of the function.
● Since R6-R9 are callee saved, their state is preserved across the call.
● Load/store instructions are allowed only with registers of valid types (PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK), and accesses are checked for being out of bounds.
● An eBPF program may read data from the stack only after it has written to it.

Example 1: registers with uninitialized contents cannot be read.
BPF_MOV64_REG(BPF_REG_0, BPF_REG_2)
BPF_EXIT_INSN()

Example 2: the program may read data from the stack only after it has written to it.
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10)
BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_2, -8)
BPF_EXIT_INSN()
Maps
● eBPF uses maps as a generic key/value data structure for data transfer between the kernel and user space
● Maps are managed using file descriptors and are accessed from user space via the bpf() syscall (see the sketch after this list):
○ bpf(BPF_MAP_CREATE, attr, size): create a map with the given type and attributes
○ bpf(BPF_MAP_LOOKUP_ELEM, attr, size): look up a key in a given map
○ bpf(BPF_MAP_UPDATE_ELEM, attr, size): create or update a key/value pair in a given map
○ bpf(BPF_MAP_DELETE_ELEM, attr, size): find and delete an element by key in a given map
○ close(fd): delete the map
● eBPF programs can use the map file descriptors of the process that loaded the program
● When userspace generates an eBPF program, the file descriptors are embedded into the immediate values of the appropriate opcodes
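A minimal sketch of the first of these calls, using the raw syscall rather than a library wrapper (the attribute names follow the uapi union bpf_attr; the helper and map shape here are illustrative):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

/* create an array map with one u64 counter per slot; returns a map fd */
static int create_array_map(unsigned int nr_entries)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type    = BPF_MAP_TYPE_ARRAY;
	attr.key_size    = sizeof(uint32_t);
	attr.value_size  = sizeof(uint64_t);
	attr.max_entries = nr_entries;

	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}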
Access map in Kernel space
The map-access instruction opcode is 'BPF_LD | BPF_DW | BPF_IMM', which means "load 64-bit (Double Word) immediate"; the instruction combines the 'imm' fields of this instruction and the subsequent one into the 'DST' register.

As emitted by userspace (the BPF_LD_MAP_FD() macro is used to assemble it):
imm = fd                off = 0   src = BPF_PSEUDO_MAP_FD   dst = DST   opcode = BPF_LD | BPF_DW | BPF_IMM
imm = 0                 off = 0   src = 0                   dst = 0     opcode = 0

The 'imm' field is set to the map file descriptor and the 'src' field to BPF_PSEUDO_MAP_FD to indicate that this is a pseudo instruction for loading map data. Because 'src' is non-zero, the opcode is invalid at this stage.

The invalid opcode is fixed up during program loading by bpf_prog_load():
imm = map (low 32 bits) off = 0   src = 0                   dst = DST   opcode = BPF_LD | BPF_DW | BPF_IMM
imm = map >> 32         off = 0   src = 0                   dst = 0     opcode = 0

At this stage the 'fd' has been replaced with a map pointer that can be used as an argument during a BPF_CALL.
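For reference, this is roughly how the userspace helper macros assemble that two-instruction sequence (following the definitions used by the kernel's samples/bpf; exact definitions vary between kernel versions):

#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
	((struct bpf_insn) {					\
		.code    = BPF_LD | BPF_DW | BPF_IMM,		\
		.dst_reg = DST,					\
		.src_reg = SRC,					\
		.off     = 0,					\
		.imm     = (__u32) (IMM) }),			\
	((struct bpf_insn) {					\
		.code    = 0,	/* zero is reserved opcode */	\
		.dst_reg = 0,					\
		.src_reg = 0,					\
		.off     = 0,					\
		.imm     = ((__u64) (IMM)) >> 32 })

/* src_reg = BPF_PSEUDO_MAP_FD marks the immediate as a map fd that
 * bpf_prog_load() will fix up into a map pointer */
#define BPF_LD_MAP_FD(DST, MAP_FD)				\
	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)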
Coding for eBPF in assembler
● Before introducing the high level tools let’s look at a
simple userspace program (in C) that runs an eBPF
program
● It is not very common to write eBPF programs in
assembler
○ Writing in assembler allows us to explore the syscalls that hold
everything together
○ We’ll look at the higher level tools in a moment
libbpf: helper functions for eBPF
The libbpf library makes it easier to write eBPF programs; it includes helper functions for loading eBPF programs into the kernel and for creating and manipulating eBPF maps:
● The user program reads the eBPF bytecode into a buffer and passes it to bpf_load_program() for program loading and verification.
● The eBPF program includes the libbpf header for the helper definitions it needs to build; when run by the kernel it calls bpf_map_lookup_elem() to find an element in a map and store a new value in it.
● The user application calls bpf_map_lookup_elem() to read out the value stored by the eBPF program in the kernel.

int bpf_map_lookup_elem(int fd, const void *key, void *value)
{
	union bpf_attr attr;

	bzero(&attr, sizeof(attr));
	attr.map_fd = fd;
	attr.key = ptr_to_u64(key);
	attr.value = ptr_to_u64(value);

	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
Coding for eBPF in assembler
The example is ~50 lines of code; it demonstrates the components of a hand-assembled eBPF program: eBPF bytecode, syscalls and maps. attach_kprobe() is used to enable the kprobe event and attach the eBPF program to it.

int main(void)
{
	int map_fd, i, key;
	long long value = 0, cnt;

	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 5000, 0);

	struct bpf_insn prog[] = {
		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
		BPF_MOV64_IMM(BPF_REG_0, 0),                   /* r0 = 0 */
		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),         /* r2 = fp - 4 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		BPF_MOV64_IMM(BPF_REG_1, 1),                   /* r1 = 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
		BPF_MOV64_IMM(BPF_REG_0, 0),                   /* r0 = 0 */
		BPF_EXIT_INSN(),
	};
	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);

	pfd = bpf_load_program(BPF_PROG_TYPE_KPROBE, prog, insns_cnt, "GPL",
			       LINUX_VERSION_CODE, bpf_log_buf, BPF_LOG_BUF_SIZE);

	attach_kprobe();
	sleep(1);

	key = 0;
	assert(bpf_map_lookup_elem(map_fd, &key, &cnt) == 0);
	printf("sys_read counts %lld\n", cnt);
	return 0;
}

void attach_kprobe(void)
{
	system("echo 'p:sys_read sys_read' >> /sys/kernel/debug/tracing/kprobe_events");

	efd = open("/sys/kernel/debug/tracing/events/kprobes/sys_read/id", O_RDONLY, 0);
	err = read(efd, buf, sizeof(buf));
	close(efd);
	buf[err] = 0;
	id = atoi(buf);
	attr.config = id;

	efd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1, 0);
	ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
	ioctl(efd, PERF_EVENT_IOC_SET_BPF, pfd);
}
eBPF tooling
● Kernel examples
● Ply
● bcc
● SystemTap (stapbpf)
● ...
Kernel samples
It is good to start from the eBPF kernel samples. The Linux kernel tree provides eBPF system call wrapper functions in its libbpf library; the samples use bpf_load.c to create maps, load the kernel program and attach trace points.

Kernel and user space programs use the naming convention xxx_kern.c and xxx_user.c, and the user space program uses the file name xxx_kern.o to find the kernel program.

The user space program is compiled into an executable by GCC and honours 'CROSS_COMPILE=aarch64-linux-gnu-' for cross compiling. The kernel program is compiled by LLVM/Clang; by default the distro's LLVM/Clang is used, but a path to a newly built LLVM/Clang can be specified. Build commands:

make headers_install   # creates the "usr/include" directory in the build top directory
make samples/bpf/ LLC=xxx/llc CLANG=xxx/clang

[Diagram: sample_user.o and bpf_load.o are linked against libbpf into the sample executable, which loads sample_kern.o into the kernel and transfers data via maps]
Sample code: trace kmem_cache_alloc_node

tracex4_kern.c — step 2: the kernel program updates the map data.

struct bpf_map_def SEC("maps") my_map = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(long),
	.value_size = sizeof(struct pair),
	.max_entries = 1000000,
};

SEC("kretprobe/kmem_cache_alloc_node")
int bpf_prog2(struct pt_regs *ctx)
{
	long ptr = PT_REGS_RC(ctx);
	long ip = 0;

	/* get ip address of kmem_cache_alloc_node() caller */
	BPF_KRETPROBE_READ_RET_IP(ip, ctx);

	struct pair v = {
		.val = bpf_ktime_get_ns(),
		.ip = ip,
	};

	bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
	return 0;
}
char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;

tracex4_user.c — step 1: load the kernel program and enable the kretprobe trace point; step 3: the user space program reads the map data.

static void print_old_objects(int fd)
{
	long long val = time_get_ns();
	__u64 key, next_key;
	struct pair v;

	/* Based on the current 'key' value, we can get the next key value
	 * and iterate over all bpf map elements. */
	key = -1;
	while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0) {
		bpf_map_lookup_elem(map_fd[0], &next_key, &v);
		key = next_key;
		printf("obj 0x%llx is %2lldsec old was allocated at ip %llx\n",
		       next_key, (val - v.val) / 1000000000ll, v.ip);
	}
}

int main(int ac, char **argv)
{
	char filename[256];
	int i;

	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

	if (load_bpf_file(filename)) {
		printf("%s", bpf_log_buf);
		return 1;
	}

	for (i = 0; ; i++) {
		print_old_objects(map_fd[1]);
		sleep(1);
	}
	return 0;
}
Ply: light dynamic tracer for eBPF          https://wkz.github.io/ply/
Ply uses an awk-like mini language describing how to attach eBPF programs to tracepoints and kprobes. It has a built-in compiler and can perform compilation and execution with a single command.
Ply can extract arbitrary data, i.e. register values, function arguments, stack/heap data and stack traces.
Ply keeps dependencies to a minimum, leaving libc as the only runtime dependency. Thus, ply is well suited to embedded targets.

trace:raw_syscalls/sys_exit / (ret() < 0) /
{
	@[comm()].count()
}

^Cde-activating probes
@:
dbus-daemon    2
ply            3
irqbalance     4
System call (sys_exit) failure statistics in ply

trace:raw_syscalls/sys_exit / (ret() < 0) /
{
	@[comm()].count()
}

● provider: selects which probe interface to use
● probe definition: the point(s) of instrumentation
● predicate: filters events to match the criteria
● method: the common way is to aggregate data using methods; two functions are available, .count() and .quantize()
● @: the sign of a map (key/value store)

Tracing result (task name + counts):
^Cde-activating probes
@:
dbus-daemon    2
ply            3
irqbalance     4
Build ply          https://github.com/iovisor/ply
If applicable, please check that "build: Fix kernel header installation on ARM64" is in your repository before building.

Method 1: native compilation
./autogen.sh
./configure --with-kerneldir=/path/to/linux
make
make install

Method 2: cross-compilation for arm64
./autogen.sh
./configure --host=aarch64 --with-kerneldir=/path/to/linux
make CC=aarch64-linux-gnu-gcc
# copy src/ply to target board

libc is the only runtime dependency:
$ ldd src/ply
	linux-vdso.so.1 (0x0000ffff9320d000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff93028000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffff931e2000)
BPF Compiler Collection (BCC)
The BPF Compiler Collection (BCC) project is a toolchain which reduces the difficulty of writing, compiling (it invokes LLVM/Clang) and loading eBPF programs. BCC reports errors for mistakes made when compiling and loading programs, etc.; this reduces the difficulty of eBPF programming.

For writing short and expressive programs, high-level language front-ends are available in BCC (Python, Lua, Go, etc.).

BCC provides scripts that use User Statically-Defined Tracing (USDT) probes to place tracepoints in user-space code; these are probes that are inserted into user applications statically at compile time.

BCC includes an impressive collection of examples and ready-to-use tracing tools.

[Diagram: a bcc tool written against a front-end binding (Python, Lua, Go, C/C++) drives the libbcc.so back-end, which compiles the embedded C into eBPF bytecode with LLVM/Clang and loads the program plus reads data via libbpf.so; in the kernel the eBPF program attaches to kprobes/ftrace and exchanges data through eBPF maps]
BCC example code          https://github.com/iovisor/bcc/blob/master/examples/tracing/task_switch.py

# Kernel program (embedded C), compiled and loaded by BCC
b = BPF(text="""
struct key_t {
    u32 prev_pid, curr_pid;
};
BPF_HASH(stats, struct key_t, u64, 1024);
int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
    struct key_t key = {};
    u64 zero = 0, *val;
    key.curr_pid = bpf_get_current_pid_tgid();
    key.prev_pid = prev->pid;
    val = stats.lookup_or_init(&key, &zero);
    (*val)++;
    return 0;
}
""")

# Enable the kprobe event
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")

# generate many schedule events
for i in range(0, 100): sleep(0.01)

# Read map data "stats"
for k, v in b["stats"].items():
    print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value))
Build BCC          https://github.com/iovisor/bcc/blob/master/INSTALL.md
BCC runs on the target but cannot easily be cross-compiled. These instructions show how to perform a native build (and work on an AArch64 platform).

Install build dependencies:
sudo apt-get install debhelper cmake libelf-dev bison flex libedit-dev python python-netaddr \
    python-pyroute2 arping iperf netperf ethtool devscripts zlib1g-dev libfl-dev

Build the luajit lib:
git clone http://luajit.org/git/luajit-2.0.git
cd luajit-2.0
git checkout -b v2.1 origin/v2.1
make
sudo make install

Build LLVM/Clang:
cd where-llvm-live
svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
cd llvm/tools
svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
cd where-llvm-live
mkdir build    # in-tree build is not supported
cd build
cmake -G "Unix Makefiles" -DCMAKE_INSTALL_PREFIX=$PWD/install ../llvm
make; make install

Build BCC:
# Use the self-built LLVM/clang binaries
export PATH=where-llvm-live/build/install/bin:$PATH
git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr
make
sudo make install
BCC and embedded systems
● The BCC native build has many dependencies
○ Dependencies on libs and binaries, e.g. cmake, the luajit lib, etc.
○ Most dependencies can be resolved on Debian/Ubuntu by using the 'apt-get' command
○ BCC depends on LLVM/Clang to compile to eBPF bytecode, but LLVM/Clang itself also introduces many dependencies
● Building BCC and LLVM requires powerful hardware
○ Puts big pressure on both memory and filesystem space
○ Building is impossible or, with swap, extremely slow on systems without sufficient memory
○ Consumes lots of disk space. For AArch64: BCC needs 12GB, and LLVM needs a further 42GB
○ Even with strong hardware, the compilation process takes a long time
○ Save LLVM and BCC binaries on a PC and use them by mounting an NFS node :)
● Difficult to deploy BCC on Android systems
○ No package manager means almost all library dependencies must be compiled from scratch
○ Android uses the bionic C library, which makes it difficult to build libraries that use GNU extensions
○ androdeb: https://github.com/joelagnel/adeb
SystemTap - eBPF backend
● SystemTap introduced stapbpf, an eBPF backend, in October 2017
○ It joins the existing backends: kernel module and Dyninst
● SystemTap is both the tool and the scripting language
○ The language is inspired by awk, and by predecessor tracers such as DTrace…
○ It uses the familiar awk-like structure: probe.point { action(s) }
○ Symbolic information is extracted based on DWARF parsing

# stap --runtime=bpf -v - <<EOF
> probe kernel.function("ksys_read") {
>   printf("ksys_read(%d): %d, %d\n",
>          pid(), $fd, $count);
>   exit();
> }
> EOF
Pass 1: parsed user script and 61 library scripts using 410728virt/101984res/8796shr/93148data kb, in 260usr/20sys/272real ms.
Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 468796virt/161004res/9684shr/151216data kb, in 820usr/10sys/843real ms.
Pass 4: compiled BPF into "stap_10960.bo" in 10usr/0sys/33real ms.
Pass 5: starting run.
ksys_read(18719): 0, 8191
Pass 5: run completed in 0usr/0sys/30real ms.
SystemTap - Revenge of the verifier
● The eBPF verifier is more aggressive than the SystemTap language
○ The language permits looping but the verifier prohibits loops (SystemTap 3.2 did not implement loop unrolling to compensate)
○ The 4096 opcode limit also looms
○ $$vars and $$locals cause verification failure if used (likely depends on the traced function)
○ "This runtime is in an early stage of development and it currently lacks support for a number of features available in the default runtime." -- STAPBPF(8)
● SystemTap has a rich library of useful tested examples and war stories
○ Almost all are tested and developed using the kernel module backend
○ Thus it is common to find canned examples that only work with the kernel module backend
○ This quickly grows frustrating… so one tends to end up using the default backend
BPFtrace - high level tracing language for eBPF
HOLD THE PRESS… HOLD THE PRESS...
The BPFtrace language is inspired by awk and C, and by predecessor tracers such as DTrace and SystemTap. Brendan Gregg blogged about it: "bpftrace (DTrace 2.0) for Linux 2018" (and most of this slide comes from that blog post). I picked it up from lwn.net (many thanks) three days before my slides were due in ;-)

"Created by Alastair Robertson, bpftrace is an open source high-level tracing front-end that lets you analyze systems in custom ways. It's shaping up to be a DTrace version 2.0: more capable, and built from the ground up for the modern era of the eBPF virtual machine."
-- Brendan Gregg

# cat > path.bt <<EOF
#include <linux/path.h>
#include <linux/dcache.h>

kprobe:vfs_open
{
	printf("open path: %s\n",
	       str(((path *)arg0)->dentry->d_name.name));
}
EOF
# bpftrace path.bt
Attaching 1 probe...
open path: dev
open path: if_inet6
open path: retrans_time_ms
BPFtrace - Internals
Good news: bpftrace has superpowers.
Bad news: dependencies are inconsistently packaged.
Examples
● Report CPU power state
● Who is hammering a library function?
● Hunting leaks
● Debug kernel functions at runtime
The story - Report CPU power state
When I run a test case I want to quickly gather CPU frequency statistics so I can tell whether the CPU frequency meets the performance requirement.
We can do this in 'offline' mode with a tool like idlestat, but is there any method that can display live info?

● The target is to use a highly efficient method to count the time spent at each CPU frequency.
● The kernel has existing trace points that record CPU frequency; an eBPF kernel program can do the simple computation of per-frequency-state duration based on these trace points.
● CPU idle duration needs to be excluded from the CPU frequency time.
● In this example we use tools from the kernel samples/bpf/ directory.
CPU power state statistics with eBPF
[Diagram: the kernel program accumulates the time spent in each state, e.g. T(pstate-0) += T1 and T(pstate-1) += T3 for OPP0/OPP1 residency, T(cstate-0) += T2 and T(cstate-1) += T4 for WFI/CPU_OFF residency; the user space program prints the statistics]

CPU states statistics:
state(ms)  cstate-0  cstate-1  cstate-2  pstate-0  pstate-1  pstate-2  pstate-3  pstate-4
CPU-0           767      6111    111863       561        31       756       853       190
CPU-1           241     10606    107956       484       125       646       990        85
CPU-2           413     19721     98735       636        84       696       757        89
CPU-3            84     11711     79989     17516       909      4811      5773       341
CPU-4           152     19610     98229       444        53       649       708      1283
CPU-5           185      8781    108697       666        91       671       677      1365
CPU-6           157     21964     95825       581        67       566       684      1284
CPU-7           125     15238    102704       398        20       665       786      1197
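A simplified sketch of the kernel-side accounting (not the actual sample: it omits the idle-time handling described above, and the tracepoint argument layout is assumed; check the event's format file and see the kernel samples for a fuller implementation along these lines):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define MAX_CPU 8

/* Layout of the power:cpu_frequency trace event as seen by a tracepoint
 * program: 8 bytes of common fields, then the event-specific fields
 * (see /sys/kernel/debug/tracing/events/power/cpu_frequency/format). */
struct cpu_freq_args {
	u64 pad;
	u32 state;	/* new frequency in kHz */
	u32 cpu_id;
};

struct bpf_map_def SEC("maps") last_ts = {	/* per-CPU: timestamp of last event */
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = MAX_CPU,
};

struct bpf_map_def SEC("maps") last_freq = {	/* per-CPU: frequency in effect */
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = MAX_CPU,
};

struct bpf_map_def SEC("maps") freq_time = {	/* (cpu << 32 | kHz) -> nanoseconds */
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(u64),
	.value_size = sizeof(u64),
	.max_entries = 1024,
};

SEC("tracepoint/power/cpu_frequency")
int on_cpu_frequency(struct cpu_freq_args *args)
{
	u32 cpu = args->cpu_id;
	u64 now = bpf_ktime_get_ns();
	u64 *ts = bpf_map_lookup_elem(&last_ts, &cpu);
	u64 *freq = bpf_map_lookup_elem(&last_freq, &cpu);

	/* charge the elapsed time to the frequency that was in effect */
	if (ts && freq && *ts) {
		u64 key = ((u64)cpu << 32) | *freq;
		u64 delta = now - *ts;
		u64 *sum = bpf_map_lookup_elem(&freq_time, &key);

		if (sum)
			*sum += delta;
		else
			bpf_map_update_elem(&freq_time, &key, &delta, BPF_ANY);
	}

	if (ts)
		*ts = now;
	if (freq)
		*freq = args->state;
	return 0;
}

char _license[] SEC("license") = "GPL";

The user space side would periodically iterate the freq_time map (bpf_map_get_next_key()/bpf_map_lookup_elem()) to print a table like the one above.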
The story - Who is hammering a library function?
I did a quick profile and it shows a library function dominating one of the cores. What now?

# ply -t 5 -c 'kprobe:kmem_cache_alloc_node { @[stack()].count() }'
…
kmem_cache_alloc_node
_do_fork+0xd0
__se_sys_clone+0x4c
el0_svc_naked+0x30                 31

kmem_cache_alloc_node
alloc_skb_with_frags+0x70
sock_alloc_send_pskb+0x220
unix_stream_sendmsg+0x1f4
sock_sendmsg+0x60
__sys_sendto+0xd4
__se_sys_sendto+0x50
__sys_trace_return                232
The story - Hunting leaks
I know I'm leaking memory (or some other precious resource) from a particular pool whenever I run a particular workload. Unfortunately my system is almost ready to ship and we've started disabling all the resource tracking. Is there anything I can do to get a clue about what is going on?

# cat track.ply
kprobe:kmem_cache_alloc_node {
	# Can't read stack from a retprobe :-(
	@[0] = stack();
}
kretprobe:kmem_cache_alloc_node {
	@[retval()] = @[0];
	@[0] = nil;
}
kprobe:kmem_cache_free {
	@[arg(1)] = nil;
}

# ply -t 1 track.ply
3 probes active
de-activating probes
@:
<leaks show up here>
The story - Debug kernel functions at runtime
Inspired by: BPF: Tracing and More (Brendan Gregg)

When I debug the CPU frequency change flow in the kernel, several kernel components work together to change the frequency, including the clock driver, the mailbox driver, etc.
I want to confirm that these functions have been properly called and, furthermore, to check that the function arguments have the expected values. How can I dynamically debug kernel functions at runtime with an efficient and safe method?

● SystemTap and Kprobes can be used to debug kernel functions, but eBPF is safer to deploy because the verifier ensures kernel integrity.
● For kernel function tracing, eBPF avoids changing kernel code and saves compilation time.
● If it is safe enough, we can even use it in production for customer support.
● In this example we use tools from the BCC distribution.
Debug kernel functions
BCC's tools/trace.py can be used to debug kernel functions; this tool can trace a function together with the kernel or user space stack, timestamp, CPU ID and PID/TID.
We can use trace.py to confirm that hi3660_stub_clk_set_rate() has been invoked and to print out the target frequency.

static int hi3660_stub_clk_set_rate(struct clk_hw *hw, unsigned long rate,
				    unsigned long parent_rate)
{
	struct hi3660_stub_clk *stub_clk = to_stub_clk(hw);

	stub_clk->msg[0] = stub_clk->cmd;
	stub_clk->msg[1] = rate / MHZ;
	mbox_send_message(stub_clk_chan.mbox, stub_clk->msg);
	mbox_client_txdone(stub_clk_chan.mbox, 0);
	stub_clk->rate = rate;
	return 0;
}

$ ./tools/trace.py 'hi3660_stub_clk_set_rate "rate: %d" arg2'
PID     TID     COMM            FUNC                            -
2002    2002    kworker/3:2     hi3660_stub_clk_set_rate        rate: 1421000000
2469    2469    kworker/3:1     hi3660_stub_clk_set_rate        rate: 1421000000
2469    2469    kworker/3:1     hi3660_stub_clk_set_rate        rate: 1421000000
84      84      kworker/0:1     hi3660_stub_clk_set_rate        rate: 903000000
2469    2469    kworker/3:1     hi3660_stub_clk_set_rate        rate: 903000000
84      84      kworker/0:1     hi3660_stub_clk_set_rate        rate: 903000000
84      84      kworker/0:1     hi3660_stub_clk_set_rate        rate: 903000000
2469    2469    kworker/3:1     hi3660_stub_clk_set_rate        rate: 903000000
Debug kernel functions - cont.
We can continue to follow the program flow from high-level functions down to low-level functions and their arguments, and BCC supports C-style expressions to print out more complex data structures.
These data "watch points" can easily help us locate which component an issue occurs in.
In the example below, we can observe the msg_id value to check whether the correct message ID is passed to the MCU firmware.

static int hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)
{
	[...]

	/* Fill message data */
	for (i = 0; i < MBOX_MSG_LEN; i++)
		writel_relaxed(buf[i], base + MBOX_DATA_REG + i * 4);

	/* Trigger data transferring */
	writel(BIT(mchan->ack_irq), base + MBOX_SEND_REG);
	return 0;
}

$ ./tools/trace.py 'hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)
    "msg_id: 0x%x rate: %d", *((unsigned int *)msg), *((unsigned int *)msg + 1)'

PID     TID     COMM            FUNC                    -
84      84      kworker/0:1     hi3660_mbox_send_data   msg_id: 0x2030a rate: 903
2413    2413    kworker/1:0     hi3660_mbox_send_data   msg_id: 0x2030a rate: 903
2413    2413    kworker/1:0     hi3660_mbox_send_data   msg_id: 0x2030a rate: 903
Statistics based on function arguments
After the kernel functionality has been validated, we can continue with simple profiling based on kernel function argument statistics.
Using the argdist.py invocation below, we can observe that the CPU frequency mostly changes to 533MHz and 1844MHz.

static int hi3660_stub_clk_set_rate(struct clk_hw *hw, unsigned long rate,
				    unsigned long parent_rate)
{
	struct hi3660_stub_clk *stub_clk = to_stub_clk(hw);

	stub_clk->msg[0] = stub_clk->cmd;
	stub_clk->msg[1] = rate / MHZ;
	mbox_send_message(stub_clk_chan.mbox, stub_clk->msg);
	mbox_client_txdone(stub_clk_chan.mbox, 0);
	stub_clk->rate = rate;
	return 0;
}

$ tools/argdist.py -I 'linux-mainline/include/linux/clk-provider.h' \
    -c -C 'p::hi3660_stub_clk_set_rate(struct clk_hw *hw, unsigned long rate,
    unsigned long parent_rate):u64:rate'
COUNT  EVENT
1      rate = 903000000
1      rate = 2362000000
1      rate = 999000000
27     rate = 1844000000
31     rate = 533000000
Summary (and thank you)
Hand-rolled Asm   Hack value?
Pure C            No "magic", great examples in kernel
Ply               Awk-like; easy to deploy, esp. on embedded systems
SystemTap         DWARF parsing (and wait a bit?)
BPFtrace          #include <linux/dentry.h>
BCC               Great tool for tool makers (and running tools from tool makers)

Everything is awesome… and many, many thanks to all the people who have worked to make it so!

[email protected]