Go archsimd preview

Fri 23 January 2026

In the previous post we implemented a SIMD sum in Go assembly, adding vectors like [1, 2, 3, 4] + [5, 6, 7, 8]. This is going to be much easier in Go 1.26 thanks to the simd/archsimd package, see proposal #73787. So far the package provides access to amd64-specific SIMD operations.

Let's give it a go and implement the same func SumVec(input []int64) int64, keeping it close to the already familiar assembly code in sumv3/sum.s.

//go:noinline
func SumVec(input []int64) (sum int64) {
    i := 0
    inputLen := len(input)

    // If we can't use two YMM vectors, fall back to a scalar sum.
    // Otherwise, keep adding YMM vectors in the vector loop.
    if inputLen >= 8 {
        y0 := archsimd.LoadInt64x4Slice(input)
        loopEnd := inputLen - inputLen%4

        for i += 4; i < loopEnd; i += 4 {
            y1 := archsimd.LoadInt64x4Slice(input[i : i+4])
            y0 = y0.Add(y1)
        }

        // Horizontal reduction.
        x0, x1 := y0.GetLo(), y0.GetHi()
        x0 = x0.Add(x1)
        sum = x0.GetElem(0) + x0.GetElem(1)
    }

    // The scalar loop sums up what we couldn't cover with SIMD.
    for ; i < inputLen; i++ {
        sum += input[i]
    }

    return sum
}

The most interesting part is of course the SIMD operations, i.e., everything in the if inputLen >= 8 { ... } branch:

  • y0 := archsimd.LoadInt64x4Slice(input) loads the first four int64s from the input []int64 slice, e.g., 1, 2, 3, 4. Our 256-bit SIMD vector y0 is represented by Int64x4 type.
  • loopEnd := inputLen - inputLen%4 calculates the slice index beyond which we mustn't iterate, e.g., if the slice length is 9, loopEnd will be 8 = 9 - (9 % 4), so the y1 vector register is always fully filled with 4 integers on each iteration.
  • y1 := archsimd.LoadInt64x4Slice(input[i : i+4]) loads the next batch of 4 integers into the 256-bit SIMD register y1, e.g., y1 = [5, 6, 7, 8].
  • y0 = y0.Add(y1) adds corresponding elements of two vectors, e.g., y0 = y0 + y1 = [1, 2, 3, 4] + [5, 6, 7, 8] = [6, 8, 10, 12].
  • x0 := y0.GetLo() returns the lower half of register y0 = [6, 8, 10, 12], e.g., [6, 8]. It may seem confusing that the lower half isn't [10, 12]: the four numbers are stored in the register with element 0 in the lowest bits, i.e., [12, 10, 8, 6] when written high-to-low, yet fmt.Println(y0) would still display [6, 8, 10, 12].
  • x1 := y0.GetHi() returns the upper half of y0 register, e.g., [10, 12]. It's represented by 128-bit SIMD vector x1, see Int64x2.
  • x0 = x0.Add(x1) adds corresponding elements of two XMM registers, e.g., x0 = x0 + x1 = [6, 8] + [10, 12] = [16, 20]
  • sum = x0.GetElem(0) + x0.GetElem(1) wraps up the horizontal reduction summation by adding two scalars 16 + 20. They were retrieved using VPEXTRQ instructions.

I am curious to see what assembly instructions were used by the compiler. Go 1.26 RC2 can give us a sneak peek.

﹩ go install golang.org/dl/go1.26rc2@latest
﹩ go1.26rc2 download
﹩ GOEXPERIMENT=simd go1.26rc2 build -gcflags -S ./sumv4/
🔻 TEXT intro/sumv4.SumVec
00000   TEXT intro/sumv4.SumVec(SB), NOSPLIT|ABIInternal, $8-24
00000   PUSHQ   BP
00001   MOVQ    SP, BP
00004   MOVQ    AX, intro/sumv4.input+16(FP)
00009   FUNCDATA    $0, gclocals·wvjpxkknJ4nY1JtrArJJaw==(SB)
00009   FUNCDATA    $1, gclocals·J26BEvPExEQhJvjp9E8Whg==(SB)
00009   FUNCDATA    $5, intro/sumv4.SumVec.arginfo1(SB)
00009   FUNCDATA    $6, intro/sumv4.SumVec.argliveinfo(SB)
00009   PCDATA  $3, $1
00009   CMPQ    BX, $8
00013   JLT     39
00015   MOVQ    BX, DX
00018   ANDL    $3, BX
00021   XCHGL   AX, AX
00022   MOVQ    DX, SI
00025   SUBQ    BX, DX
00028   VMOVDQU (AX), Y0
00032   MOVL    $4, BX
00037   JMP     87
00039   XORL    CX, CX
00041   XORL    DX, DX
00043   JMP     52
00045   ADDQ    (AX)(CX*8), DX
00049   INCQ    CX
00052   CMPQ    BX, CX
00055   JLE     61
00057   JHI     45
00059   JMP     66
00061   MOVQ    DX, AX
00064   POPQ    BP
00065   RET
00066   PCDATA  $1, $1
00066   PCDATA  $4, $3591
00066   CALL    runtime.panicBounds(SB)
00071   LEAQ    (AX)(BX*8), R8
00075   VMOVDQU (R8), Y1
00080   NOP
00080   VPADDQ  Y1, Y0, Y0
00084   MOVQ    DI, BX
00087   CMPQ    BX, DX
00090   JGE     108
00092   LEAQ    4(BX), DI
00096   CMPQ    CX, DI
00099   JCS     153
00101   CMPQ    BX, DI
00104   JLS     71
00106   JMP     148
00108   VEXTRACTI128    $1, Y0, X1
00114   VEXTRACTI128    $0, Y0, X0
00120   VPADDQ  X0, X1, X0
00124   VPEXTRQ $0, X0, DI
00130   VPEXTRQ $1, X0, R8
00136   LEAQ    (R8)(DI*1), DX
00140   MOVQ    BX, CX
00143   MOVQ    SI, BX
00146   JMP     52
00148   PCDATA  $4, $8346
00148   CALL    runtime.panicBounds(SB)
00153   PCDATA  $4, $1721
00153   CALL    runtime.panicBounds(SB)
00158   XCHGL   AX, AX

Looking at the beginning of the function, we don't see the instructions that load the memory address of the input's underlying array input_base+0(FP) and the slice length input_len+8(FP):

MOVQ input_base+0(FP), AX
MOVQ input_len+8(FP), CX

That's because Go's internal calling convention (ABIInternal) ensures all three parts of the slice header (array pointer, length, and capacity) are in the registers AX, BX, and CX respectively before sumv4.SumVec is called. Since we implemented sumv3.SumVec in assembly, the compiler defaulted to passing the arguments on the stack.

That explains why CMPQ BX, $8 compares the slice length in BX to 8, see if inputLen >= 8. If the length is less than 8, we jump straight to the scalar loop via the JLT 39 instruction, see the 00039 address.

00009   CMPQ    BX, $8      ; if inputLen >= 8
00013   JLT     39          ; Jump to the scalar loop, i.e., "00039 XORL" line

Otherwise, prepare the vector loop as follows:

  • calculate the loopEnd
  • load the first 4 elements from the input slice
  • set the slice index i = 4 and jump to the vector loop, i.e., to the instruction responsible for the i < loopEnd comparison
00015   MOVQ    BX, DX      ; DX = inputLen
00018   ANDL    $3, BX      ; BX = inputLen % 4
00021   XCHGL   AX, AX
00022   MOVQ    DX, SI      ; SI = DX = inputLen
00025   SUBQ    BX, DX      ; loopEnd = DX - BX = inputLen - inputLen%4
00028   VMOVDQU (AX), Y0    ; y0 = archsimd.LoadInt64x4Slice(input) = [1, 2, 3, 4]
00032   MOVL    $4, BX      ; i = 4
00037   JMP     87          ; Jump to the vector loop, i.e., "00087 CMPQ" line

Here is the vector loop. In the hand-written assembly we used a single VMOVDQU (AX)(BX*8), Y1 instruction, whereas here we see two:

  • LEAQ (AX)(BX*8), R8 loads the effective address of the next chunk of four int64s into register R8. It performs the memory address calculation AX + BX*8, where AX holds the memory address of the underlying array, BX is the array index, and 8 is the size of int64 in bytes.
  • VMOVDQU (R8), Y1 copies 256 bits from a memory address stored in R8 to Y1 YMM register.

The vector addition uses the same VPADDQ Y1, Y0, Y0, but there are far more instructions than we had in sumv3. From what I understand, most of them are related to the slice bounds checks that the compiler inserted for our own safety:

  • LEAQ 4(BX), DI calculates a look-ahead index i+4 and stores it in register DI. As we can see, LEAQ can be used in lieu of ADDQ (we used ADDQ $0x00000004, BX in sumv3), not only for memory address calculations.
  • that look-ahead index i+4 is checked against the slice boundaries. If it's within bounds, the index is updated with MOVQ DI, BX; otherwise we get an "index out of range" panic, see runtime.panicBounds(SB).
00071   LEAQ    (AX)(BX*8), R8  ; R8 = AX + BX*8 = inputData + i*8
00075   VMOVDQU (R8), Y1        ; y1 := archsimd.LoadInt64x4Slice(input[i : i+4])
00080   NOP
00080   VPADDQ  Y1, Y0, Y0      ; y0 = y0.Add(y1)
00084   MOVQ    DI, BX          ; i += 4
00087   CMPQ    BX, DX          ; i < loopEnd
00090   JGE     108             ; Jump to the horizontal reduction, i.e., "00108 VEXTRACTI128" line
00092   LEAQ    4(BX), DI       ; DI = 4 + BX = 4 + i
00096   CMPQ    CX, DI          ; inputCap < DI
00099   JCS     153             ; Go to line 153 (index out of range)
00101   CMPQ    BX, DI          ; i < DI
00104   JLS     71              ; Jump to the next iteration.
00106   JMP     148             ; Otherwise, go to line 148 ("index out of range" error)
 ...
00148   CALL    runtime.panicBounds(SB)
 ...
00153   CALL    runtime.panicBounds(SB)

The horizontal reduction differs as well:

  • in the sumv3 assembly we used register X0 to refer to Y0's lower part directly, but here we see VEXTRACTI128 $0, Y0, X0 because I couldn't find a better way in the archsimd docs.
  • in sumv3, VPSRLDQ $0x08, X0, X1 shifted X0's bytes right by 8 to line up 16 with 20 in the XMM registers, so we could add them with VPADDQ X0, X1, X0 and get the final result 36. I couldn't figure out how to shift the bytes using the archsimd package, so I ended up with VPEXTRQ (see GetElem) to extract both scalars from the X0 register and then add them, see LEAQ (R8)(DI*1), DX below.
    sumv4                sumv3
X0 = [8,   6]        X0 = [8,   6]
           +                    +
X1 = [12, 10]        X1 = [12, 10]
           =                    =
X0 = [20, 16]        X0 = [20, 16]
                                +
DI = 16              X1 = [0,  20]
      +                         =
R8 = 20              X0 = [20, 36]
      =
DX = 36              DX = 36
00108   VEXTRACTI128    $1, Y0, X1  ; x1 := y0.GetHi()
00114   VEXTRACTI128    $0, Y0, X0  ; x0 := y0.GetLo()
00120   VPADDQ  X0, X1, X0          ; x0 = x0.Add(x1)
00124   VPEXTRQ $0, X0, DI          ; DI = x0.GetElem(0)
00130   VPEXTRQ $1, X0, R8          ; R8 = x0.GetElem(1)
00136   LEAQ    (R8)(DI*1), DX      ; sum = R8 + DI*1 = 20 + 16*1 = 36
00140   MOVQ    BX, CX              ; i = loopEnd
00143   MOVQ    SI, BX              ; Restore inputLen from SI
00146   JMP     52                  ; Jump to the scalar loop, i.e., "00052 CMPQ" line

The final section of the SumVec function is the scalar loop. As expected, it also contains the bounds checks.

00039   XORL    CX, CX          ; i = 0
00041   XORL    DX, DX          ; sum = 0
00043   JMP     52              ; Jump to line 52 to check the loop condition
00045   ADDQ    (AX)(CX*8), DX  ; sum += input[i]
00049   INCQ    CX              ; i++
00052   CMPQ    BX, CX          ; i < inputLen
00055   JLE     61              ; Exit the loop
00057   JHI     45              ; Go to the next iteration
00059   JMP     66              ; Go to line 66 (index out of range)
00061   MOVQ    DX, AX          ; Moves the sum into the return register AX
 ...
00066   CALL    runtime.panicBounds(SB)

Bounds-check elimination

It would be great to convince the Go compiler that our code accesses the input slice safely and that index i can't be out of range. This should improve performance since the CPU won't have to predict the extra branches that the bounds checks introduce into the loop.

The bounds check was eliminated from the scalar loop as follows. Effectively it was moved outside the loop, to the tail := input[i:] line.

tail := input[i:]
for _, v := range tail {
    sum += v
}

I tried to apply a similar approach to the vector loop, but the checks stayed in place (you can see which checks remain with go build -gcflags='-d=ssa/check_bce'), so I resorted to the unsafe package to get rid of them:

  • unsafe.Pointer allows us to read arbitrary memory. In this case unsafe.Pointer(&input[0]) points to the first element of the underlying array.
  • unsafe.Add(inputData, i*8) calculates the memory address of the i-th array element (the beginning of the next chunk of numbers), i.e., inputData + i*8 where 8 is the size of int64 in bytes. Basically the Add function returns a new unsafe.Pointer that is i*8 bytes higher in memory than the inputData pointer.
  • (*[4]int64) converts our unsafe.Pointer to the *[4]int64 type, i.e., a pointer to the next chunk of numbers [4]int64.
  • the archsimd package provides the LoadInt64x4(y *[4]int64) Int64x4 function that loads our chunk from the array.
y0 := archsimd.LoadInt64x4Slice(input)
loopEnd := inputLen - inputLen%4
inputData := unsafe.Pointer(&input[0])

for i += 4; i < loopEnd; i += 4 {
    chunk := (*[4]int64)(unsafe.Add(inputData, i*8))
    y1 := archsimd.LoadInt64x4(chunk)
    y0 = y0.Add(y1)
}

If you're wondering why we don't take the address &input[i] directly in the loop: the bounds check comes back due to the input[i] access.

y0 := archsimd.LoadInt64x4Slice(input)
loopEnd := inputLen - inputLen%4

for i += 4; i < loopEnd; i += 4 {
    // Bounds check is back 😭!
    chunk := (*[4]int64)(unsafe.Pointer(&input[i]))
    y1 := archsimd.LoadInt64x4(chunk)
    y0 = y0.Add(y1)
}

Here is the updated sumv5.SumVec.

🔻 sumv5.SumVec
//go:noinline
func SumVec(input []int64) (sum int64) {
    i := 0
    inputLen := len(input)

    // If we can't use two YMM vectors, fall back to a scalar sum.
    // Otherwise, keep adding YMM vectors in the vector loop.
    if inputLen >= 8 {
        y0 := archsimd.LoadInt64x4Slice(input)
        loopEnd := inputLen - inputLen%4
        inputData := unsafe.Pointer(&input[0])

        for i += 4; i < loopEnd; i += 4 {
            chunk := (*[4]int64)(unsafe.Add(inputData, i*8))
            y1 := archsimd.LoadInt64x4(chunk)
            y0 = y0.Add(y1)
        }

        // Horizontal reduction.
        x0, x1 := y0.GetLo(), y0.GetHi()
        x0 = x0.Add(x1)
        sum = x0.GetElem(0) + x0.GetElem(1)
    }

    // The scalar loop sums up what we couldn't cover with SIMD.
    tail := input[i:]
    for _, v := range tail {
        sum += v
    }

    return sum
}

Looking at the sumv5 assembly code, there is only one runtime.panicBounds(SB) call, reachable from a single jump, JCS 97. It guards the tail := input[i:] slice expression during the scalar-loop setup rather than sitting inside either loop. This means the per-iteration bounds checks were eliminated! 🎉

﹩ GOEXPERIMENT=simd go1.26rc2 build -gcflags -S ./sumv5/
🔻 TEXT intro/sumv5.SumVec
00000   TEXT    intro/sumv5.SumVec(SB), NOSPLIT|ABIInternal, $8-24
00000   PUSHQ   BP
00001   MOVQ    SP, BP
00004   MOVQ    AX, intro/sumv5.input+16(FP)
00009   FUNCDATA    $0, gclocals·wvjpxkknJ4nY1JtrArJJaw==(SB)
00009   FUNCDATA    $1, gclocals·J26BEvPExEQhJvjp9E8Whg==(SB)
00009   FUNCDATA    $5, intro/sumv5.SumVec.arginfo1(SB)
00009   FUNCDATA    $6, intro/sumv5.SumVec.argliveinfo(SB)
00009   PCDATA  $3, $1
00009   CMPQ    BX, $8
00013   JLT     39
00015   MOVQ    BX, DX
00018   ANDL    $3, BX
00021   XCHGL   AX, AX
00022   MOVQ    DX, SI
00025   SUBQ    BX, DX
00028   VMOVDQU (AX), Y0
00032   MOVL    $4, BX
00037   JMP     118
00039   XORL    DX, DX
00041   XORL    SI, SI
00043   CMPQ    BX, DX
00046   JCS     97
00048   MOVQ    DX, DI
00051   SUBQ    CX, DX
00054   MOVQ    DI, CX
00057   SHLQ    $3, DI
00061   SARQ    $63, DX
00065   ANDQ    DI, DX
00068   SUBQ    CX, BX
00071   LEAQ    (AX)(DX*1), CX
00075   XORL    AX, AX
00077   JMP     86
00079   ADDQ    (CX)(AX*8), SI
00083   INCQ    AX
00086   CMPQ    AX, BX
00089   JLT     79
00091   MOVQ    SI, AX
00094   POPQ    BP
00095   NOP
00096   RET
00097   PCDATA  $1, $1
00097   PCDATA  $4, $3666
00097   CALL    runtime.panicBounds(SB)
00102   LEAQ    (AX)(BX*8), DI
00106   VMOVDQU (DI), Y1
00110   VPADDQ  Y1, Y0, Y0
00114   ADDQ    $4, BX
00118   CMPQ    BX, DX
00121   JLT     102
00123   VEXTRACTI128    $0, Y0, X1
00129   VEXTRACTI128    $1, Y0, X0
00135   VPADDQ  X0, X1, X0
00139   VPEXTRQ $0, X0, DI
00145   VPEXTRQ $1, X0, R8
00151   ADDQ    R8, DI
00154   MOVQ    BX, DX
00157   MOVQ    SI, BX
00160   MOVQ    DI, SI
00163   JMP     43

The SIMD sum with bounds checks is ~47.57% faster than the scalar sum.

﹩ GOEXPERIMENT=simd go1.26rc2 test -bench=^ -count=10 ./sumv4 | tee bench.txt
﹩ grep Benchmark bench.txt | sed 's/Benchmark[A-z]*/BenchmarkSum/g' | split -l 10 -a 1 - bench_
﹩ benchstat bench_a bench_b
name    old time/op  new time/op  delta
Sum-12  38.4µs ± 1%  20.1µs ± 3%  -47.57%  (p=0.000 n=10+9)
🔻 sumv4 benchmarks
goos: darwin
goarch: amd64
pkg: intro/sumv4
cpu: Intel(R) Core(TM) i5-10600 CPU @ 3.30GHz
BenchmarkSum-12            30892         38535 ns/op
BenchmarkSum-12            30442         38727 ns/op
BenchmarkSum-12            31269         38349 ns/op
BenchmarkSum-12            31398         38198 ns/op
BenchmarkSum-12            31405         38337 ns/op
BenchmarkSum-12            30933         38360 ns/op
BenchmarkSum-12            31378         38753 ns/op
BenchmarkSum-12            31352         38087 ns/op
BenchmarkSum-12            31072         38390 ns/op
BenchmarkSum-12            31502         38200 ns/op
BenchmarkSumVec-12         61132         19711 ns/op
BenchmarkSumVec-12         61512         19587 ns/op
BenchmarkSumVec-12         64653         20022 ns/op
BenchmarkSumVec-12         58118         20151 ns/op
BenchmarkSumVec-12         57780         21672 ns/op
BenchmarkSumVec-12         53917         20405 ns/op
BenchmarkSumVec-12         59610         20685 ns/op
BenchmarkSumVec-12         55443         20430 ns/op
BenchmarkSumVec-12         54528         20238 ns/op
BenchmarkSumVec-12         55696         19937 ns/op
PASS
ok      intro/sumv4 24.203s

Removing those branches makes it ~54.69% faster than the scalar sum, which is similar to sumv3's 54.23% (hand-written assembly). It looks like the bounds-check elimination accounted for an extra ~7 percentage points of speed-up (20.1µs down to 17.3µs).

﹩ GOEXPERIMENT=simd go1.26rc2 test -bench=^ -count=10 ./sumv5 | tee bench.txt
﹩ grep Benchmark bench.txt | sed 's/Benchmark[A-z]*/BenchmarkSum/g' | split -l 10 -a 1 - bench_
﹩ benchstat bench_a bench_b
name    old time/op  new time/op  delta
Sum-12  38.2µs ± 1%  17.3µs ± 4%  -54.69%  (p=0.000 n=9+10)
🔻 sumv5 benchmarks
goos: darwin
goarch: amd64
pkg: intro/sumv5
cpu: Intel(R) Core(TM) i5-10600 CPU @ 3.30GHz
BenchmarkSum-12            31548         38370 ns/op
BenchmarkSum-12            31383         38033 ns/op
BenchmarkSum-12            31339         38117 ns/op
BenchmarkSum-12            31322         37882 ns/op
BenchmarkSum-12            30944         38196 ns/op
BenchmarkSum-12            31176         38420 ns/op
BenchmarkSum-12            31063         38860 ns/op
BenchmarkSum-12            31437         38087 ns/op
BenchmarkSum-12            31302         38235 ns/op
BenchmarkSum-12            31333         38323 ns/op
BenchmarkSumVec-12         66858         17200 ns/op
BenchmarkSumVec-12         76434         17156 ns/op
BenchmarkSumVec-12         71926         17628 ns/op
BenchmarkSumVec-12         71235         17908 ns/op
BenchmarkSumVec-12         70063         17424 ns/op
BenchmarkSumVec-12         71061         16646 ns/op
BenchmarkSumVec-12         74750         16936 ns/op
BenchmarkSumVec-12         68294         17638 ns/op
BenchmarkSumVec-12         70453         17550 ns/op
BenchmarkSumVec-12         68662         16926 ns/op
PASS
ok      intro/sumv5 24.647s

I hope you liked this post, I certainly learned a lot writing it 🙂. Note that the provided examples are surely not the pinnacle of performance; one could achieve better results with, e.g., loop unrolling.
