ply - dynamically instrument the kernel
ply program-file
ply program-text
ply dynamically instruments the running kernel to aggregate and extract user-defined data. It compiles an input program to one or more Linux bpf(2) binaries and attaches them to arbitrary points in the kernel using kprobes and tracepoints.
-A, --ascii
Restrict output to ASCII, no Unicode runes.
-c command, --command=command
When all probes are running, run command. When the command exits, stop all probes and exit. The command is run as if invoked with sh -c <command>.
-d, --debug
Enable debugging output.
-e, --dry-run
Exit after compilation, without actually instrumenting the system. Typically used in conjunction with --dump.
-h, --help
Print usage message.
-S, --dump
After compilation, dump the internal AST, generated BPF instructions and other internal information. This is very useful to include when reporting a bug.
-v, --version
Print version information.
The syntax is C-like in general, taking its inspiration from dtrace(1) and, by extension, from awk(1).
A program consists of one or more probes, which are analogous to awk's pattern-action statements. The syntax for a probe is as follows:
provider:probe-definition ['/' predicate '/']
{
statement ';'
[statement ';' ... ]
}
The provider selects which probe interface to use. See the PROVIDERS section for more information about each provider. It is then up to the provider to parse the probe-definition to determine the point(s) of instrumentation.
When tracing, it is often desirable to filter events to match some criteria. Because of this, ply allows you to provide a predicate, i.e. an expression that must evaluate to a non-zero value in order for the probe to be executed.
Then follows a block of statements that perform the actual information gathering.
A provider may define a default probe clause to be used if the user does not supply one.
Probes support basic conditional control of flow via an if-statement, which conforms to the same rules as C's equivalent:
'if' '(' expr ')'
statement ';' | block
[else
statement ';' | block]
To ensure that a probe has a finite run-time, the kernel does not allow backwards branching. As a result, ply does not have any loop constructs such as for or while. A simple for statement with an invariant that is known at compile-time could be added later, in which case the loop could be unrolled when generating BPF.
The type system is modeled after C. As such, ply understands the difference between signed and unsigned integers, the difference between a short and a long long, what separates an integer from a pointer, how a struct is laid out in memory and so on. It is not complete, though: notably, floating point numbers and unions are missing.
Programs are statically typed, but all types are inferred automatically. Thus, the type system is mostly hidden from the user. Plans are to expose more of it in the future by allowing casts, type declarations and so on.
Numbers and string literals are specified in the same way as in C.
The primary way to extract information is to store it in a map, i.e. in a hash table. Like awk(1), ply dynamically creates any referenced maps and their key and value types are inferred from the context in which they are used. All maps are in the global scope and can thus be used both for extracting data to the end-user, and for carrying data between probes. Map names follow the rules of identifiers from C.
mapname[exprs]
Data can be stored in a map by assigning a value to a given key:
mapname[exprs] = expr
If a map key is assigned the special value nil, the key is deleted and will return its zero value if referenced again.
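The assignment and nil rules above can be modeled outside of ply. The following Python sketch is illustrative only: the class name PlyMap and the use of None to stand in for nil are assumptions made for the example, and it says nothing about how ply actually stores maps in the kernel.

```python
class PlyMap:
    """Toy model of a ply map: missing keys read as the zero value."""

    def __init__(self, zero=0):
        self._zero = zero
        self._data = {}

    def __getitem__(self, key):
        # Reading a key that was never set (or was deleted) yields zero.
        return self._data.get(key, self._zero)

    def __setitem__(self, key, value):
        if value is None:
            # Assigning nil (modeled here as None) deletes the key.
            self._data.pop(key, None)
        else:
            self._data[key] = value

    def __len__(self):
        return len(self._data)


rx = PlyMap()
rx[("eth0", 1)] = 42      # mapname[exprs] = expr
rx[("eth0", 1)] = None    # mapname[exprs] = nil
print(rx[("eth0", 1)], len(rx))
```

After the nil assignment the key is gone, so a later read falls back to the zero value instead of failing.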
More often than not, looking at each individual datum from a trace is not nearly as helpful as an aggregation of the data. Therefore ply supports aggregating data at the source, thereby reducing tracing overhead. Aggregations are syntactically similar to maps; indeed, they are a kind of map, but they are distinguished by a leading '@'. Also, they can only be assigned the result of one of the following aggregation functions:
@agg[exprs] = count()
Bump a counter.
@agg[exprs] = quantize(scalar-expr)
Evaluates the argument and aggregates on the most significant bit of the result. In other words, it stores the distribution of the expression.
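To make "aggregating on the most significant bit" concrete, here is a minimal Python sketch of power-of-two bucketing. The function names, the sample values and the treatment of values less than or equal to zero are assumptions made for illustration. ply's real bucket layout and output format may differ.

```python
from collections import Counter


def bucket(value):
    """Index of the most significant set bit of value.

    Values <= 0 share bucket -1; that choice is an assumption of this
    sketch, not documented ply behavior.
    """
    return value.bit_length() - 1 if value > 0 else -1


def quantize(values):
    """Count how many values fall into each power-of-two bucket."""
    hist = Counter()
    for v in values:
        hist[bucket(v)] += 1
    return hist


def render(hist):
    """Format one line per bucket with the value range it covers."""
    lines = []
    for b in sorted(hist):
        if b < 0:
            label = "<= 0"
        else:
            label = f"[{1 << b}, {(1 << (b + 1)) - 1}]"
        lines.append(f"{label:>14} {hist[b]}")
    return lines


samples = [1, 2, 3, 4, 7, 8, 100, 1000]
print("\n".join(render(quantize(samples))))
```

Each bucket b covers the range from 2^b to 2^(b+1) - 1, so the histogram stays small even when the values span many orders of magnitude.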
A provider makes data available to the user by exporting functions and variables to the probe. Function calls use the same syntax as most languages that inherit from C. In addition to the provider-specific functions, all providers inherit a set of common functions and variables:
char[16] comm, char[16] execname
Name of the running process's executable.
u32 cpu
CPU ID of the processor on which the probe fired.
u32 gid
Group ID of the running process.
u32 kpid
Kernel PID of the running process. Also known as pid by the kernel. For a single-threaded process, kpid is equal to pid. For multi-threaded processes, kpid will be unique while pid will be the same across all threads.
char[N] mem(void *address [, int size])
Copy size bytes from address. If size is omitted, 64 bytes
will be copied.
s64 time, s64 walltime
Nanoseconds elapsed since system boot. time is intended for time deltas and walltime should be used for timestamps. They refer to the same data, but with different default output formats.
u32 pid
Process ID of the running process. Also known as thread group ID (tgid) by the kernel.
void print(...)
Print each expression with its default output format, separated by commas and terminated with a newline, to ply's standard out.
void printf(format, ...)
Prints formatted output to ply's standard out. In addition to the formats recognized by the printf in your <stdio.h>, ply also recognizes '%v', which dumps the value according to the inferred type's default format (i.e. how print would print it).
int strcmp(char *a, char *b)
Returns -1, 0 or 1 if the first argument is less than, equal to or greater than the second argument, respectively. Strings are compared by their lexicographical order.
u32 uid
User ID of the running process.
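Because strcmp returns 0 on equality, a predicate that tests for a match has to negate the result, as in / !strcmp(execname, "dd") /. The Python sketch below mimics the documented -1, 0, 1 contract. Comparing the UTF-8 bytes is an assumption made for the example, and is_dd is a hypothetical helper, not a ply function.

```python
def strcmp(a, b):
    """Return -1, 0 or 1 by comparing the UTF-8 bytes of a and b."""
    x, y = a.encode(), b.encode()
    return (x > y) - (x < y)


def is_dd(execname):
    # Mirrors a predicate of the form / !strcmp(execname, "dd") /
    return not strcmp(execname, "dd")


for name in ["cat", "dd", "ls"]:
    print(name, strcmp(name, "dd"), is_dd(name))
```

Only the exact match yields 0, so only that name passes the negated test.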
The kprobe and kretprobe providers use the corresponding kernel features to instrument arbitrary instructions in the kernel. The probe-definition may be either an address or a symbol name. When using a symbol name, glob expansion is performed, allowing a single probe to be inserted at multiple locations. An offset relative to a symbol may also be specified for kprobes.
Examples:
schedule
returns.
Shared variables:
struct pt_regs *regs
Hardware register contents from when the probe was triggered. This matches the definition in <sys/ptrace.h> on your system.
u32 stack
Stack trace ID of the current probe. This just returns an index into a separate map containing the actual instruction pointers. As a user though, you can think of stack as a string containing the stack trace at the current location. Indeed, print(stack) will produce exactly that.
CAUTION: On some architectures (looking at you, ARM), capturing stack traces at the entry of a function, before the prologue has run, does not work. Setting your probe after the prologue will work around the issue (typically two instructions, or +8, on ARM).
kprobe specific functions:
arg0, arg1 ... argN
Returns the value of the specified argument of the function to which the probe was attached, zero-indexed. I.e. arg0 is the 1st argument, arg1 is the 2nd, and so on.
CAUTION: ply simply maps registers to arguments according to the syscall ABI. If your compiler decides to optimize out arguments or do other sneaky things, ply will be utterly oblivious to that.
void *caller
The program counter, as recorded in regs, at the time the probe was triggered. The default output format will resolve it to a symbolic name if one is available.
kretprobe specific function:
retval
Print all files opened with openat on the system, and who opened them:
kprobe:SyS_openat
{
print(comm, pid, str(arg1));
}
Record the distribution of the return value of read(2):
kretprobe:SyS_read
{
@["dist"] = quantize(retval);
}
Count all syscalls made on the system, grouped by function:
kprobe:SyS_*
{
@[caller] = count();
}
Count all syscalls made by every dd(1) process, grouped by function:
kprobe:SyS_* / !strcmp(execname, "dd") /
{
@[caller] = count();
}
Record the distribution of the time it takes an skb to go from netif_receive to ip_rcv:
kprobe:__netif_receive_skb_core
{
rx[arg0] = time;
}
kprobe:ip_rcv / rx[arg0] /
{
@["diff"] = quantize(time - rx[arg0]);
}
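The last example stores a timestamp in one probe and consumes it in another, using the predicate to skip packets that were never seen at the first probe. The same flow can be simulated in Python. The handler names, skb addresses and timestamps below are hypothetical, and the deltas list only stands in for the values that would be passed to quantize.

```python
rx = {}        # plays the role of the rx map, keyed by skb address
deltas = []    # stands in for the values fed to quantize()


def on_netif_receive(skb, now):
    # kprobe:__netif_receive_skb_core { rx[arg0] = time; }
    rx[skb] = now


def on_ip_rcv(skb, now):
    # Predicate / rx[arg0] /: a missing key reads as zero, so skip it.
    start = rx.get(skb, 0)
    if not start:
        return
    deltas.append(now - start)


# Hypothetical events: (handler, skb address, timestamp in ns).
events = [
    (on_netif_receive, 0xA0, 1_000),
    (on_ip_rcv, 0xB0, 1_200),   # never seen at receive, filtered out
    (on_ip_rcv, 0xA0, 1_750),
]
for handler, skb, now in events:
    handler(skb, now)

print(deltas)
```

Without the predicate, the unmatched packet would contribute a delta measured from zero and distort the distribution.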
0
Program was successfully compiled and loaded into the kernel.
Non-Zero
An error occurred during compilation or during kernel setup.
Tobias Waldekranz <tobias@waldekranz.com>
Copyright 2018 Tobias Waldekranz
License: GPLv2