基于 Fprobe 的事件追踪

概述

Fprobe 事件类似于 kprobe 事件,但仅限于探测函数入口和出口。这对于许多仅跟踪某些特定函数的用例来说已经足够了。

本文档还涵盖了跟踪点探测事件 (tprobe),因为它也仅在跟踪点入口处工作。用户可以跟踪跟踪点参数的一部分,或不带跟踪事件的跟踪点,这些跟踪点不会在 tracefs 上公开。

与其他动态事件一样,fprobe 事件和跟踪点探测事件是通过 tracefs 上的 dynamic_events 接口文件定义的。

fprobe-events 的概要

 f[:[GRP1/][EVENT1]] SYM [FETCHARGS]                       : Probe on function entry
 f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS]     : Probe on function exit
 t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS]                : Probe on tracepoint

GRP1           : Group name for fprobe. If omitted, use "fprobes" for it.
GRP2           : Group name for tprobe. If omitted, use "tracepoints" for it.
EVENT1         : Event name for fprobe. If omitted, the event name is
                 "SYM__entry" or "SYM__exit".
EVENT2         : Event name for tprobe. If omitted, the event name is
                 the same as "TRACEPOINT", but if the "TRACEPOINT" starts
                 with a digit character, "_TRACEPOINT" is used.
MAXACTIVE      : Maximum number of instances of the specified function that
                 can be probed simultaneously, or 0 for the default value
                 as defined in Documentation/trace/fprobe.rst

FETCHARGS      : Arguments. Each probe can have up to 128 args.
 ARG           : Fetch "ARG" function argument using BTF (only for function
                 entry or tracepoint.) (\*1)
 @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
 @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
 $stackN       : Fetch Nth entry of stack (N >= 0)
 $stack        : Fetch stack address.
 $argN         : Fetch the Nth function argument. (N >= 1) (\*2)
 $retval       : Fetch return value.(\*3)
 $comm         : Fetch current task comm.
 +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
 \IMM          : Store an immediate value to the argument.
 NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
 FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
                 (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                 (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
                 and bitfield are supported.

 (\*1) This is available only when BTF is enabled.
 (\*2) only for the probe on function entry (offs == 0). Note, this argument access
       is best effort, because depending on the argument type, it may be passed on
       the stack. But this only support the arguments via registers.
 (\*3) only for return probe. Note that this is also best effort. Depending on the
       return value type, it might be passed via a pair of registers. But this only
       accesses one register.
 (\*4) this is useful for fetching a field of data structures.
 (\*5) "u" means user-space dereference.

有关 TYPE 的详细信息,请参阅 kprobetrace 文档

退出时的函数参数

可以使用 $arg<N> fetcharg 在退出探测时访问函数参数。这对于一次记录函数参数和返回值,并跟踪结构字段的差异(用于调试函数是否正确更新了给定的数据结构)很有用。有关其工作原理,请参阅下面的 示例

BTF 参数

BTF (BPF 类型格式) 参数允许用户通过其名称而不是 $argN 来跟踪函数和跟踪点参数。如果内核配置了 CONFIG_BPF_SYSCALL 和 CONFIG_DEBUG_INFO_BTF,则此功能可用。如果用户仅指定 BTF 参数,则事件的参数名称也会自动设置为给定的名称。

# echo 'f:myprobe vfs_read count pos' >> dynamic_events
# cat dynamic_events
f:fprobes/myprobe vfs_read count=count pos=pos

它还从 BTF 信息中选择获取类型。例如,在上面的示例中,count 是无符号长整型,而 pos 是一个指针。因此,两者都转换为 64 位无符号长整型,但只有 pos 具有如下所示的 “%Lx” 打印格式

# cat events/fprobes/myprobe/format
name: myprobe
ID: 1313
format:
       field:unsigned short common_type;       offset:0;       size:2; signed:0;
       field:unsigned char common_flags;       offset:2;       size:1; signed:0;
       field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
       field:int common_pid;   offset:4;       size:4; signed:1;

       field:unsigned long __probe_ip; offset:8;       size:8; signed:0;
       field:u64 count;        offset:16;      size:8; signed:0;
       field:u64 pos;  offset:24;      size:8; signed:0;

print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos

如果用户不确定参数的名称,$arg* 将会很有帮助。$arg* 会扩展到函数或跟踪点的所有函数参数。

# echo 'f:myprobe vfs_read $arg*' >> dynamic_events
# cat dynamic_events
f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos

BTF 也会影响 $retval。如果用户未设置任何类型,则会自动从 BTF 中选取 retval 类型。如果函数返回 void,则会拒绝 $retval

您可以使用允许运算符 ->(用于指针类型)和点运算符 .(用于数据结构类型)访问数据结构的字段。

# echo 't sched_switch preempt prev_pid=prev->pid next_pid=next->pid' >> dynamic_events

字段访问运算符 ->. 可以组合使用,以访问更深层的成员和成员指向的其他结构成员。例如,foo->bar.baz->qux 如果存在非名称联合成员,您可以像 C 代码一样直接访问它。例如

struct {
       union {
       int a;
       int b;
       };
} *foo;

要访问 ab,请在这种情况下使用 foo->afoo->b

此数据字段访问可以通过 $retval 用于返回值,例如 $retval->name

对于这些 BTF 参数和字段,:string:ustring 会更改行为。如果这些用于 BTF 参数或字段,它会检查参数或数据字段的 BTF 类型是否为 char *char []。如果不是,则会拒绝应用字符串类型。此外,通过 BTF 支持,您无需使用内存解引用运算符 (+0(PTR)) 来访问 PTR 指向的字符串。它会根据 BTF 类型自动添加内存解引用运算符。例如。

# echo 't sched_switch prev->comm:string' >> dynamic_events
# echo 'f getname_flags%return $retval->name:string' >> dynamic_events

prev->comm 是数据结构中嵌入的字符数组,而 $retval->name 是数据结构中的字符指针。但在两种情况下,您都可以使用 :string 类型来获取字符串。

使用示例

这是一个在 vfs_read() 函数入口和出口添加 fprobe 事件的示例,带有 BTF 参数。

 # echo 'f vfs_read $arg*' >> dynamic_events
 # echo 'f vfs_read%return $retval' >> dynamic_events
 # cat dynamic_events
f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos
f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
 # echo 1 > events/fprobes/enable
 # head -n 20 trace | tail
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
              sh-70      [000] ...1.   335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
              sh-70      [000] .....   335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
              sh-70      [000] ...1.   335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
              sh-70      [000] .....   335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
              sh-70      [000] ...1.   335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08
              sh-70      [000] .....   335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
              sh-70      [000] ...1.   336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
              sh-70      [000] .....   336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1

您可以看到所有函数参数和返回值都记录为有符号整数。

此外,这是一个在 sched_switch 跟踪点上添加跟踪点事件的示例。为了比较结果,这也会启用 sched_switch 跟踪事件。

 # echo 't sched_switch $arg*' >> dynamic_events
 # echo 1 > events/sched/sched_switch/enable
 # echo 1 > events/tracepoints/sched_switch/enable
 # echo > trace
 # head -n 20 trace | tail
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
              sh-70      [000] d..2.  3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
              sh-70      [000] d..3.  3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1
          <idle>-0       [000] d..2.  3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
          <idle>-0       [000] d..3.  3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
     rcu_preempt-16      [000] d..2.  3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
     rcu_preempt-16      [000] d..3.  3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026
          <idle>-0       [000] d..2.  3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
          <idle>-0       [000] d..3.  3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0

如您所见,sched_switch 跟踪事件显示处理过的参数,另一方面,sched_switch 跟踪点探测事件显示原始参数。这意味着您可以访问由 prevnext 参数指向的任务结构中的任何字段值。

例如,通常不会跟踪 task_struct::start_time,但是通过此跟踪探测事件,您可以如下所示地跟踪该字段。

 # echo 't sched_switch comm=next->comm:string next->start_time' > dynamic_events
 # head -n 20 trace | tail
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
              sh-70      [000] d..3.  5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
     rcu_preempt-16      [000] d..3.  5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526
              sh-70      [000] d..3.  5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
          <idle>-0       [000] d..3.  5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
     rcu_preempt-16      [000] d..3.  5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
          <idle>-0       [000] d..3.  5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
     kworker/0:1-14      [000] d..3.  5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
          <idle>-0       [000] d..3.  5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000

返回探测器允许我们访问某些函数的结果,这些函数返回错误代码,并且其结果通过函数参数传递,例如结构初始化函数。

例如,vfs_open() 会将文件结构链接到 inode 并更新模式。您可以使用返回探测器跟踪这些更改。

# echo 'f vfs_open mode=file->f_mode:x32 inode=file->f_inode:x64' >> dynamic_events
# echo 'f vfs_open%%return mode=file->f_mode:x32 inode=file->f_inode:x64' >> dynamic_events
# echo 1 > events/fprobes/enable
# cat trace
             sh-131     [006] ...1.  1945.714346: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x2 inode=0x0
             sh-131     [006] ...1.  1945.714358: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0x4d801e inode=0xffff888008470168
            cat-143     [007] ...1.  1945.717949: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x1 inode=0x0
            cat-143     [007] ...1.  1945.717956: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0x4a801d inode=0xffff888005f78d28
            cat-143     [007] ...1.  1945.720616: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x1 inode=0x0
            cat-143     [007] ...1.  1945.728263: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0xa800d inode=0xffff888004ada8d8

您可以看到 file::f_modefile::f_inodevfs_open() 中已更新。