bpftrace in C++

06 Feb 2026

I learned a lot about BPF tools after reading the book BPF Performance Tools by Brendan Gregg. The book focused mostly on performance analysis at a lower level, either through kernel functions or libraries such as libc.

I wanted to leverage BPF at the application layer, in particular in C++. The book covers C++ very briefly in Chapter 12, and Chapter 13 provides an example of analyzing a MySQL database, but most of the examples assume the implementation is in C.

In this post I want to investigate how to inspect C++ applications.

Dynamic Probes

Let’s start with dynamic probes. As explained in [1], these are probes we can attach to without having to recompile the code. Since we’re only interested in user space, we’ll use uprobes.

Functions

Let’s start with a simple example: free functions. In our example, we run an infinite loop simulating a webserver.

// main.cpp
#include <iostream>

long long fibonacci(int n) {
  return n <= 1 ? n : fibonacci(n - 1) + fibonacci(n - 2);
}

int main() {
  while (true) {
    std::cout << fibonacci(40) << std::endl;
  }

  return 0;
}

We can compile this binary with, say,

g++ main.cpp -o fibo -O3

and then use this bpftrace script that counts how many times fibonacci() was called during a 2-second period:

// fib.bt

uprobe:./fibo:_Z9fibonaccii { @calls = count(); }

interval:s:2 { exit(); }

We can run it via:

$ sudo bpftrace /tmp/fib.bt

Attached 3 probes
@calls: 9723017

Name mangling. The first hurdle is that C++ mangles function names, even for free functions, so we need to give bpftrace the mangled symbol, _Z9fibonaccii, instead of fibonacci. The easiest way to find the mangled symbols in a binary is with this bpftrace command:

$ sudo bpftrace -l 'uprobe:./fibo:*fib*'
uprobe:./fibo:_Z9fibonaccii

We can also get a distribution of values passed to fibonacci by changing fib.bt:

uprobe:./fibo:_Z9fibonaccii {
  @h = hist(arg0);
}

interval:s:2 { exit(); }

And then:

$ sudo bpftrace /tmp/fib.bt

@h:
[0]              1720997 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[1]              2784630 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)           2784630 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)           1469908 |@@@@@@@@@@@@@@@@@@@@@@@                         |
[8, 16)           245746 |@                                               |
[16, 32)            5339 |                                                |
[32, 64)               3 |                                                |
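
The bucket labels in this output are powers of two. As a rough illustration of how a value maps to one of these labels, here is a small sketch; it only mirrors the labels seen above, it is not bpftrace’s actual implementation, and bucket_label is a name made up for this example.

```cpp
#include <cstdint>
#include <string>

// Returns the label of the power-of-two bucket that `v` falls into,
// mirroring the labels printed in the histogram output.
std::string bucket_label(std::uint64_t v) {
  if (v < 2) {
    return "[" + std::to_string(v) + "]";  // 0 and 1 get their own buckets
  }
  std::uint64_t lo = 1;
  while (lo <= v / 2) {  // find the largest power of two <= v
    lo *= 2;
  }
  return "[" + std::to_string(lo) + ", " + std::to_string(lo * 2) + ")";
}
```

For instance, the top-level call fibonacci(40) lands in the [32, 64) bucket, which is why that row has so few entries.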

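Note that because fibonacci is recursive, the compiler cannot inline it away completely, so the symbol remains available to probe. The optimizer may still inline or restructure some of the recursive calls, though, so the counts can be lower than the number of calls written in the source. For a non-recursive function we want to probe, we need to modify the code to add __attribute__((noinline)) to turn off inlining for it.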
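
As a minimal sketch of what that looks like for a non-recursive function (square and sum_of_squares are hypothetical names for this example, and the attribute syntax assumes GCC or Clang):

```cpp
// A small leaf function that the optimizer would normally inline at -O3.
// The attribute asks the compiler to keep an out-of-line copy, so the
// symbol stays available for a uprobe.
__attribute__((noinline)) int square(int x) {
  // The GCC documentation suggests an empty asm statement so that calls to
  // a side-effect-free noinline function are not optimized away.
  asm volatile("");
  return x * x;
}

// Hypothetical caller: without the attribute above, the call below would
// likely be folded into this loop and never hit the probe.
int sum_of_squares(int n) {
  int total = 0;
  for (int i = 1; i <= n; ++i) {
    total += square(i);
  }
  return total;
}
```

With this in place the function should show up in the symbol listing under its mangled name, _Z6squarei.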

Methods

Class methods look like regular functions after compilation, and the concept of method visibility does not exist at runtime, so we can also inspect private methods. We can modify our example to implement such a case:

// main.cpp
#include <iostream>

struct Fibo {
  // Expensive recursive Fibonacci computation
  long long run() {
    return run(40);
  }

 private:
  long long run(int n) {
    return n <= 1 ? n : run(n - 1) + run(n - 2);
  }
};

int main() {
  Fibo f;
  while (true) {
    std::cout << f.run() << std::endl;
  }

  return 0;
}

If we compile it to fibo_class, we can inspect the symbols by looking for the class name:

$ sudo bpftrace -l 'uprobe:./fibo_class:*Fibo*'
uprobe:./fibo_class:_ZN4Fibo3runEi
uprobe:./fibo_class:_ZN4Fibo3runEv

Since we have overloaded signatures, it shows both. We can use c++filt to identify the correct overload, the one taking int as argument:

$ c++filt _ZN4Fibo3runEi
Fibo::run(int)
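
The same demangling can be done programmatically, which can be handy when post-processing a long list of symbols. This is a sketch assuming a GCC or Clang toolchain that ships <cxxabi.h>; demangle is a helper name invented for this example.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Demangles an Itanium-ABI symbol name (the scheme used by GCC and Clang
// on Linux). Returns the input unchanged if demangling fails.
std::string demangle(const std::string& mangled) {
  int status = 0;
  // __cxa_demangle allocates the result with malloc, so we must free it.
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) {
    return mangled;
  }
  std::string result(out);
  std::free(out);
  return result;
}
```

Calling demangle("_ZN4Fibo3runEi") should give the same Fibo::run(int) that c++filt printed.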

So we can write a similar script to count the distribution of arguments. In C++, when methods are compiled, the this object that is implicit in code is made explicit as the first argument, so we have to account for that and probe the second argument arg1 instead.

uprobe:./fibo_class:_ZN4Fibo3runEi { @h = hist(arg1); }

interval:s:2 { exit(); }
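
To build some intuition for why the argument shifts by one, here is a language-level analogy (not a statement about the ABI itself): std::mem_fn turns a member function into a callable whose first parameter is the object. Counter and call_with_explicit_this are hypothetical names for this sketch.

```cpp
#include <functional>

struct Counter {
  int base = 100;
  int add(int n) const { return base + n; }
};

// std::mem_fn wraps Counter::add as a callable whose first parameter is
// the object itself, followed by the declared parameter n.
int call_with_explicit_this(const Counter& self, int n) {
  auto fn = std::mem_fn(&Counter::add);
  return fn(self, n);  // equivalent to self.add(n)
}
```

In the compiled code on x86_64 Linux the object pointer similarly travels as the first argument, which is what bpftrace exposes as arg0, leaving n in arg1.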

STL Strings

bpftrace can handle C-style char* but not std::string. To read one, we need to make assumptions about the STL implementation, the operating system and the compilation mode, which makes the approach unportable. Suppose we have:

#include <iostream>
#include <string>
#include <unistd.h>  // sleep

void my_print(const std::string& s) {
  std::cout << s << std::endl;
}
int main() {
  std::string s1 = "hello";
  std::string s2 = "world";

  int i = 0;
  while (++i) {
    i % 3 == 0 ? my_print(s1) : my_print(s2);
    sleep(1);
  }

  return 0;
}

And we want to probe s in my_print. Suppose we compile this to strhist. I’m running this on Linux on an x86_64 architecture and can verify that my program links to libstdc++:

ldd strhist
...
libstdc++.so.6 => /lib64/libstdc++.so.6 (...)
...

In this case it’s relatively safe to assume that if a std::string is at the address addr, then addr + 0 holds a pointer to the data and addr + 8 holds the length of the string. This works even with SSO (small string optimization), in which the raw data is not stored on the heap but within the std::string structure itself, in an inline buffer. In that case the pointer at addr + 0 does not point to the heap but to addr + 16, which makes no difference for our purposes.
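
Before relying on these offsets, it can be worth checking them from inside a small C++ program built with the same toolchain. The sketch below assumes libstdc++ with the C++11 ABI on a 64-bit platform; RawStringView, read_raw, layout_matches and uses_inline_buffer are names made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// What a tracer would read from a std::string at address `addr`,
// ASSUMING the libstdc++ (C++11 ABI) layout on a 64-bit platform:
// data pointer at addr + 0 and length at addr + 8.
struct RawStringView {
  std::uint64_t data_ptr;
  std::uint64_t length;
};

RawStringView read_raw(const std::string& s) {
  const unsigned char* addr = reinterpret_cast<const unsigned char*>(&s);
  RawStringView v{};
  std::memcpy(&v.data_ptr, addr + 0, sizeof(v.data_ptr));
  std::memcpy(&v.length, addr + 8, sizeof(v.length));
  return v;
}

// True if the raw reads agree with the public API, i.e. the layout
// assumption holds for this string on this toolchain.
bool layout_matches(const std::string& s) {
  RawStringView v = read_raw(s);
  return v.data_ptr == reinterpret_cast<std::uintptr_t>(s.data()) &&
         v.length == s.size();
}

// True if the data pointer targets the in-object buffer at addr + 16 (SSO).
bool uses_inline_buffer(const std::string& s) {
  auto base = reinterpret_cast<std::uintptr_t>(&s);
  return read_raw(s).data_ptr == base + 16;
}
```

If layout_matches returns false for strings in your build, the offsets used by the bpftrace script would need to be adjusted for that standard library.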

We can write a bpftrace script as such:

// strhist.bt
uprobe:./strhist:_Z8my_printRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
{
  $s  = arg0;
  $p  = *(uint64*)($s + 0);    // char* - raw data
  $n  = *(uint64*)($s + 8);    // size_t - strlen

  // print as raw bytes (handles embedded NULs)
  printf("len=%lu str=%r\n", $n, buf($p, $n));
}

We’ll see it shows hello and world correctly. We can get a frequency count of each:

// strhist.bt
uprobe:./strhist:_Z8my_printRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
{
  $s  = arg0;
  $p  = *(uint64*)($s + 0);    // char* - raw data
  $n  = *(uint64*)($s + 8);    // size_t - strlen

  @cnt[str($p)] = count();
}

The major drawback of str($p) is that the result is truncated, typically to 64 bytes, so strings that share a long common prefix will not be distinguished from one another. Also, if a string contains \0, str() treats it as the terminator instead of honoring $n. It’s possible to write a custom hashing function that covers more than 64 bytes, but the length must be a constant.

User-Defined Classes

If the argument is a more complex class, it can be a lot more work to probe it. Suppose we have this contrived class C which we want to probe:

struct B {
  std::string s;
};

class C {
 public:
  C(int id) : id_(id) {}

  void add(std::string key, B b) {
    m_[key] = b;
  }

 private:
  std::map<std::string, B> m_;
  int id_;
};

It’s fairly complex, with STL data structures and other classes as member variables. Suppose we want to probe the member id_ when C is passed by reference to a function:

__attribute__((noinline)) void random_func(C& c) {
  c.add("key", B{"hello 2"});
}

If we have debug symbols for the binary, we can inspect C’s structure to find the offset of id_:

$ pahole -F dwarf -C 'C' struct.cpp.dwo
...
class C {
        class {
        } m_;                                            /*     0     0 */

        /* XXX 48 bytes hole, try to pack */

        int                        id_;                  /*    48     4 */
public:
...
}

It tells us the offset of id_ is 48 bytes, so in bpftrace we can do:

uprobe:./struct:_Z11random_funcR1C
{
  $id = *(int32*)(arg0 + 48);  // id_ lives 48 bytes into C
  printf("id=%d\n", $id);
}
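
If pahole or debug symbols are not at hand, one way to cross-check such an offset is to compute it in a scratch C++ program built with the same compiler and standard library. This is only a sketch: CMirror is a hypothetical public copy of C with the same members in the same order, and it is an assumption that it gets the same layout as the real class.

```cpp
#include <cstddef>
#include <map>
#include <string>

struct B {
  std::string s;
};

// Hypothetical public mirror of class C: same members, same order.
struct CMirror {
  std::map<std::string, B> m_;
  int id_;
};

// Byte offset of id_ from the start of the object, computed from the
// addresses of a live instance.
std::size_t id_offset() {
  CMirror c{};
  const char* base = reinterpret_cast<const char*>(&c);
  const char* member = reinterpret_cast<const char*>(&c.id_);
  return static_cast<std::size_t>(member - base);
}
```

Printing id_offset() on the target toolchain should agree with the offset pahole reports; if it does not, the mirror is not a faithful copy and the script should not rely on it.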

We can also add a pseudo-struct that mimics the offsets from C by declaring a struct inline in the script:

struct C_stub {
    char _pad[48];
    int id_;
}

uprobe:./struct:_Z11random_funcR1C
{
  // Keep $c as a pointer so that -> can be used to access the member.
  $c = (struct C_stub*)arg0;
  printf("id=%d\n", $c->id_);
}

which is slightly more readable and could be useful if multiple bpftrace scripts need this C definition.

The conclusion from my exploration of uprobes is that while they work out of the box, handling non-primitive types can easily get complex and brittle. We now explore the scenario where we can modify the source code to make it easier to inspect.

Static Probes

We can add tracepoints to specific parts of the code and because it requires modifying the code, these are known as static probes. One of the easiest ways to do it is using the Folly library, via the folly/tracing/StaticTracepoint.h header, which defines a few macros. We can add those to one of our previous examples:

#include <folly/tracing/StaticTracepoint.h>

void my_print(const std::string& s) {
  FOLLY_SDT(my_project, start_proc, s.c_str());

  std::cout << s << std::endl;

  FOLLY_SDT(my_project, end_proc);
}
...

Suppose we compile this to strhist. To list the USDTs we can do:

$ sudo bpftrace -l 'usdt:./strhist:*'
usdt:./strhist:my_project:end_proc
usdt:./strhist:my_project:start_proc

The first thing to notice is that the prefix is now usdt instead of uprobe. The second is that because we provided the name specifically, the name is not mangled.

Also notice we passed s.c_str() which is of type char*, which is well supported by bpftrace. We don’t have to assume the data layout in std::string anymore, so our script becomes:

// strhist.bt
usdt:./strhist:my_project:start_proc
{
  printf("str=%s\n", str(arg0));
}

In this case s.c_str() is cheap enough to run every time, even when no tracing is taking place. If preparing the arguments is expensive and we want that code to run only during tracing, we can gate it with FOLLY_SDT_IS_ENABLED + FOLLY_SDT_WITH_SEMAPHORE:

#include <folly/tracing/StaticTracepoint.h>

FOLLY_SDT_DEFINE_SEMAPHORE(my_project, start_proc);

void my_print(const std::string& s) {
  if (FOLLY_SDT_IS_ENABLED(my_project, start_proc)) {
    // This block is only executed when tracing is going on.
    FOLLY_SDT_WITH_SEMAPHORE(my_project, start_proc, s.c_str());
  }
  std::cout << s << std::endl;
}
...
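
To make the gating idea concrete without depending on Folly, here is a self-contained simulation. Everything in it is a stand-in invented for this sketch: probe_semaphore plays the role of the counter a tracer would bump while attached, and expensive_summary represents argument preparation we would rather skip when nobody is tracing.

```cpp
#include <string>

// Hypothetical stand-in for a USDT semaphore. In a real binary the tracer
// changes it from outside the process; here we set it by hand.
static volatile unsigned short probe_semaphore = 0;

// Counts how often the expensive argument was actually computed.
static int expensive_calls = 0;

std::string expensive_summary(const std::string& s) {
  ++expensive_calls;
  return "len=" + std::to_string(s.size()) + " value=" + s;
}

void my_print_gated(const std::string& s) {
  if (probe_semaphore != 0) {
    // Only reached while "tracing" is on; a real probe would fire here
    // with summary.c_str() as its argument.
    std::string summary = expensive_summary(s);
    (void)summary;
  }
}
```

With probe_semaphore left at 0, calls to my_print_gated never pay for expensive_summary, which is the property the Folly macros aim to provide.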

So USDTs are easier to work with, but they require more foresight.

Applications

Now that we know how to define tracepoints, we can do other types of analysis besides counting how many times a tracepoint is hit.

Latency

We can measure the amount of time that elapses between two probes, say start_proc and end_proc. As long as an end_proc tracepoint is always executed after a start_proc for a given thread, we can do:

usdt:./my_binary:my_project:start_proc {
  @ts[tid] = nsecs;
}
// Skip end_proc events whose start_proc happened before tracing began.
usdt:./my_binary:my_project:end_proc /@ts[tid]/ {
  @lat_us = hist((nsecs - @ts[tid]) / 1000);
  delete(@ts[tid]);
}

Upon termination, bpftrace prints @lat_us, a histogram of the time between start_proc and end_proc in microseconds.

Memory Allocation

We can measure how much memory was allocated between two probes, say start_proc and end_proc. This works as long as an end_proc tracepoint is always executed after a start_proc for a given thread, and we have to assume a specific memory allocator. Suppose the allocator is the standard malloc provided by libc. We can do:

uprobe:/usr/lib/libc.so.6:malloc
/ @in_req[tid] / { @req_size[tid] = sum(arg0); }

usdt:./my_binary:my_project:start_proc { @in_req[tid] = 1; }
usdt:./my_binary:my_project:end_proc {
  printf("alloc_bytes=%d\n", @req_size[tid]);
  delete(@req_size[tid]);
  delete(@in_req[tid]);
}

When we hit the first tracepoint, start_proc, we set a flag. While this flag is set, the malloc action runs, which in this case adds up the first argument, the requested size in bytes.

This mechanism of setting a flag on start and clearing it at the end can be used with many of the other analyses described in [1].

Conclusion

I haven’t had the opportunity to use bpftrace in production at work, but now I have a better idea of what it can do for a C++ application. I may update this post with other use cases should I run into them.

References

[1] BPF Performance Tools - Brendan Gregg
