Below are a few very practical ways to ship millions‑per‑second metrics from every process to one daemon on today’s Linux, ranked by how much code you need to write and how low the steady‑state overhead can get.
1. tracefs “user‑events” (↔ perf/ftrace ring‑buffer)
Available in mainline since v6.5 (`CONFIG_USER_EVENTS`).
```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/uio.h>
#include <linux/user_events.h>

/* one-time setup (constructor) */
static __u32 enabled;                           /* kernel flips a bit here when the event is on */
int fd = open("/sys/kernel/tracing/user_events_data", O_WRONLY);
struct user_reg reg = {
    .size        = sizeof(reg),
    .enable_size = sizeof(enabled),
    .enable_addr = (__u64)&enabled,
    .name_args   = (__u64)"my_counter u64 value",
};
ioctl(fd, DIAG_IOCSREG, &reg);                  /* register event, fills reg.write_index */

/* the hot path – ~15 instructions + one writev() that lands in a per-CPU ring page */
__u64 value = 1;
struct iovec iov[2] = {
    { &reg.write_index, sizeof(reg.write_index) },
    { &value,           sizeof(value) },
};
writev(fd, iov, 2);
```
Why it's attractive:
- no shared-memory API at all – you just `writev()`; the kernel copies into the pre-mapped ftrace ring buffer;
- measured overhead is tens of nanoseconds per event on 2024 CPUs – in the same ballpark as an `rdtsc`; the write is essentially a `memcpy()` into a per-CPU page plus a store-release (www.kernel.org);
- the daemon can reuse every perf/ftrace tool – or libtracefs's `tracefs_iterate_raw_events()` – to stream events; Prometheus/OpenTelemetry exporters already understand it (rough collector sketch at the end of this section).

When to skip it – you need your own binary layout (not TLV) or the target kernels are pre-6.5.
2. io_uring + IORING_OP_MSG_RING
Set up one io_uring in the daemon (`ringD`) and let every process create its own ring (`ringP`). During the initial handshake, hand `ringD`'s ring fd to each process (e.g. over a Unix socket with `SCM_RIGHTS`), as sketched below.
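A minimal sketch of that handshake on the daemon side, assuming a connected `AF_UNIX` socket `sock` (the helper name is illustrative):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send the daemon's ring fd to one client via SCM_RIGHTS */
static int send_ring_fd(int sock, int ring_fd)
{
    char byte = 'r';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;
    c->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &ring_fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```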
Hot path:

```c
/* grab an SQE on the per-process ring and target the daemon's ring fd;
 * 'value' arrives in cqe->res, MY_EVENT_ID in cqe->user_data on the daemon side */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ringP);
io_uring_prep_msg_ring(sqe, ringD_fd, value, MY_EVENT_ID, 0);
io_uring_submit(&ringP);
```
Latency: a single cache-hot `io_uring_enter()` fast path (~120 ns round-trip on Zen 4). There is zero kernel-thread wake-up when the daemon is already polling its CQ ring (man7.org).
Downside: you still need a tiny helper to keep one ring per process, and kernels before 5.19 lack `MSG_RING`.
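For completeness, a sketch of the daemon's receive loop – `MSG_RING` completions show up as ordinary CQEs on `ringD`, with the message's `data` in `cqe->user_data` and its 32-bit `len` in `cqe->res` (`handle_metric()` is a hypothetical callback):

```c
#include <liburing.h>

extern void handle_metric(unsigned id, unsigned value);   /* hypothetical */

void drain(struct io_uring *ringD)
{
    struct io_uring_cqe *cqe;
    while (io_uring_wait_cqe(ringD, &cqe) == 0) {
        handle_metric((unsigned)cqe->user_data, (unsigned)cqe->res);
        io_uring_cqe_seen(ringD, cqe);
    }
}
```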
3. eBPF ring buffer + uprobes (header‑only “log nothing” API)
Idea: the application just calls an inline helper that increments a per-CPU counter array:

```c
#include <sched.h>          /* sched_getcpu() – vDSO/rseq-backed on common arches, no syscall */

/* __metric_shm points at the mmap'd per-CPU counter pages (see the attach sketch below) */
static inline void metric_inc(unsigned idx)
{
    __u64 *cnt = &__metric_shm[sched_getcpu()][idx];
    __atomic_fetch_add(cnt, 1, __ATOMIC_RELAXED);
}
```
Setup path (executed once per process):
- the daemon creates a sealed `memfd` and maps one page per CPU;
- it hands the fd to the client via an env var plus `/proc/<pid>/fd`, or over a Unix socket with `SCM_RIGHTS`;
- the constructor in the client does `mmap(fd, …, MAP_SHARED, …)` (see the attach sketch below).

Collector – a BPF program attached to `tracepoint:sched:sched_process_exit` (or just a timer) reads and zeroes the counters via `bpf_copy_from_user()` into its own ring buffer.
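A possible shape for the client-side attach step (names and sizes are illustrative; `fd` is the memfd received from the daemon):

```c
#include <linux/types.h>
#include <sys/mman.h>

#define METRICS_PER_CPU 512                     /* one 4 KiB page of __u64 counters per CPU */

static __u64 (*__metric_shm)[METRICS_PER_CPU];  /* consumed by metric_inc() above */

/* map the daemon's sealed memfd: page i holds CPU i's counters */
static void metrics_attach(int fd, int ncpus)
{
    size_t len = (size_t)ncpus * METRICS_PER_CPU * sizeof(__u64);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED)
        __metric_shm = p;
}
```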
No system call on the hot path; only a relaxed atomic increment. The lowest possible overhead (< 10 ns per call), at the cost of writing the setup code yourself.
See the 2024–2025 eBPF observability write-up for recent ring-buffer changes (eunomia.dev).
4. Plain shared‑memory + futex (baseline)
If you keep one shared page per CPU and call `futex_wake()` only when a counter page overflows, modern kernels give sub-µs round-trips; the complexity is mostly in getting huge-page alignment and NUMA locality right. Still the simplest thing that always works (sketch below).
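A minimal sketch of that overflow doorbell – `slot` lives in the shared per-CPU page, `doorbell` is a futex word the daemon sleeps on with `FUTEX_WAIT`, and the 64 Ki threshold is illustrative:

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static inline void shm_metric_add(_Atomic uint64_t *slot,
                                  _Atomic uint32_t *doorbell, uint64_t v)
{
    uint64_t n = atomic_fetch_add_explicit(slot, v, memory_order_relaxed) + v;
    if ((n & 0xFFFF) < v) {                            /* crossed a 64 Ki boundary */
        atomic_fetch_add_explicit(doorbell, 1, memory_order_release);
        syscall(SYS_futex, doorbell, FUTEX_WAKE, 1, NULL, NULL, 0);  /* wake the daemon, rarely */
    }
}
```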
Which one should you prototype?
| Criterion | user-events | io_uring MSG_RING | eBPF ringbuf | raw shm |
|---|---|---|---|---|
| Lines in client | ≈30 | ≈40 | ≈80 | 80–100 |
| Syscalls on hot path | 1 × writev | 1 × io_uring_enter | 0 | 0–1 (futex, rarely) |
| Typical overhead | 40–80 ns | 90–130 ns | < 10 ns | 10–50 ns |
| Kernel version | ≥ 6.5 | ≥ 5.19 (better 6.11) | ≥ 5.8 | any |
| Tooling support | perf / ftrace | liburing only | bpftool, perf | custom |
For system-wide metrics on mainstream distros in 2025, tracefs user-events give the best ratio of effort to performance: drop-in support across languages, automatic timestamping, and no worries about NUMA collisions or cgroup accounting. If you need the absolute last few dozen nanoseconds and control the whole stack, move to the eBPF shared-page counter model; otherwise start with user-events and benchmark – you're unlikely to saturate it before hitting memory-bandwidth limits.