Exploring Low-Latency IPC Mechanisms in Linux

Below are a few practical ways to ship metrics at millions of events per second from every process to a single daemon on today’s Linux, ranked by how much code you need to write and how low the steady‑state overhead can get.


1. tracefs “user‑events” (↔ perf/ftrace ring‑buffer)

Available in mainline since v6.5 (CONFIG_USER_EVENTS).

```c
/* one‑time setup (constructor) – needs <fcntl.h>, <sys/ioctl.h>, <sys/uio.h>, <linux/user_events.h> */
static int enabled;                            /* kernel flips bit 31 here when the event is enabled */
int fd = open("/sys/kernel/tracing/user_events_data", O_WRONLY);
struct user_reg reg = {
        .size = sizeof(reg),
        .enable_bit = 31, .enable_size = sizeof(enabled), .enable_addr = (__u64)&enabled,
        .name_args = (__u64)"my_counter u64 value",
};
ioctl(fd, DIAG_IOCSREG, &reg);                 /* register event; fills reg.write_index */

/* the hot path – ~15 instructions + 1 writev() that hits a per‑CPU ring page */
struct iovec iov[2] = {
        { &reg.write_index, sizeof(reg.write_index) },
        { &value,           sizeof(value) },
};
writev(fd, iov, 2);
```

Why it’s attractive

  • no shared‑memory API at all – you just writev(); the kernel copies the payload into the pre‑mapped ftrace ring buffer;
  • measured overhead is tens of nanoseconds per event on 2024 CPUs – in the same ballpark as an rdtsc; the write is simply a memcpy() into a per‑CPU page plus a store‑release (www.kernel.org);
  • the daemon can reuse every perf/ftrace tool, or libtracefs (e.g. tracefs_iterate_raw_events()), to stream the events; Prometheus/OpenTelemetry exporters already understand the format (see the reader sketch at the end of this section).

When to skip it – you need your own binary layout (not TLV) or the target kernels are pre‑6.5.
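
A minimal daemon‑side reader sketch, using only the tracefs files themselves – the event name and paths follow the registration example above; a production daemon would read the binary per‑CPU buffers through libtracefs or perf instead of trace_pipe:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Enable the registered event, then stream its records as text. */
static void read_my_counter(void)
{
        int en = open("/sys/kernel/tracing/events/user_events/my_counter/enable", O_WRONLY);
        write(en, "1", 1);                   /* sets the enable bit in every registered client */
        close(en);

        int tp = open("/sys/kernel/tracing/trace_pipe", O_RDONLY);
        char buf[4096];
        ssize_t n;
        while ((n = read(tp, buf, sizeof(buf))) > 0)
                fwrite(buf, 1, (size_t)n, stdout);   /* one line per event, already timestamped */
        close(tp);
}
```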

2. io_uring + IORING_OP_MSG_RING

Set up one io_uring in the daemon (ringD) and let every process create its own ring (ringP).
During the initial handshake, hand the daemon’s ring fd to each client – fd numbers are per‑process, so pass it over a Unix socket with SCM_RIGHTS (or clone it with pidfd_getfd()).
Hot path:

```c
struct io_uring_sqe *sqe = io_uring_get_sqe(&ringP);
io_uring_prep_msg_ring(sqe, ringD_fd,
                       MY_EVENT_ID,      /* arrives in the daemon as cqe->res */
                       value,            /* arrives as cqe->user_data */
                       0);
io_uring_submit(&ringP);
```

Latency: a single cache‑hot io_uring_enter() fast path (~120 ns round trip on Zen 4).
There is no kernel‑thread wake‑up when the daemon is already polling its CQE ring (man7.org).

Down‑side: you still need a tiny helper to keep one ring per process, and kernels before 5.19 lack IORING_OP_MSG_RING.
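
A minimal sketch of the daemon’s drain loop – assuming ringD was created with io_uring_queue_init() and its fd has already been handed to the clients; handle_metric() is a hypothetical aggregation hook:

```c
#include <liburing.h>

void handle_metric(unsigned event_id, __u64 value);   /* hypothetical aggregation hook */

/* Drain the CQEs that clients post into ringD via IORING_OP_MSG_RING. */
static void drain_metrics(struct io_uring *ringD)
{
        struct io_uring_cqe *cqe;

        for (;;) {
                if (io_uring_wait_cqe(ringD, &cqe) < 0)
                        break;
                /* MSG_RING delivers the sender's 'len' as cqe->res and 'data' as cqe->user_data */
                handle_metric((unsigned)cqe->res, cqe->user_data);
                io_uring_cqe_seen(ringD, cqe);
        }
}
```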


3. eBPF ring buffer + uprobes (header‑only “log nothing” API)

Idea: the application just calls an inline helper that bumps a counter in a per‑CPU, per‑metric array:

```c
/* __metric_shm: per‑CPU counter pages mapped from the daemon's memfd in the constructor;
 * METRICS_PER_CPU is however many metrics you need. */
static __u64 (*__metric_shm)[METRICS_PER_CPU];

static inline void metric_inc(unsigned idx)
{
        __u64 *cnt = &__metric_shm[sched_getcpu()][idx];   /* vDSO/rseq‑backed on common archs */
        __atomic_fetch_add(cnt, 1, __ATOMIC_RELAXED);
}
```

Setup path (executed once per process):

  • the daemon creates a sealed memfd and maps one page per CPU;
  • it hands the fd to the client via an env‑var, /proc/<pid>/fd, or an SCM_RIGHTS message;
  • the constructor in the client mmap()s that fd with MAP_SHARED (see the setup sketch at the end of this section).

Collector – a BPF program attached to tracepoint:sched:sched_process_exit (or just a timer) reads and zeroes the counters via bpf_copy_from_user() into its own ring buffer.

No system call on the hot path, only a relaxed atomic increment – the lowest possible overhead (< 10 ns per call), at the cost of writing the setup code yourself.
See the 2024‑2025 eBPF observability write‑up for recent ring‑buffer changes (eunomia.dev).
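
A minimal sketch of that setup path – NCPUS, METRICS_PER_CPU, and the function names are illustrative, the fd hand‑off itself (env‑var or SCM_RIGHTS) is left out, and __metric_shm is the array used by metric_inc() above:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define NCPUS            64                  /* assumed upper bound */
#define METRICS_PER_CPU  512                 /* 512 × 8 B = one 4 KiB page per CPU */
#define SHM_SIZE         ((size_t)NCPUS * METRICS_PER_CPU * sizeof(uint64_t))

/* daemon side: create the sealed counter region once */
static int create_metric_shm(void)
{
        int fd = memfd_create("metrics", MFD_ALLOW_SEALING);
        ftruncate(fd, SHM_SIZE);
        /* forbid resizing so clients can map it without re‑checking the size */
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL);
        return fd;                           /* hand this fd to each client */
}

/* client constructor: map the region it received and point __metric_shm at it */
static void map_metric_shm(int fd)
{
        void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED)
                __metric_shm = p;            /* used by metric_inc() on the hot path */
}
```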

4. Plain shared‑memory + futex (baseline)

If you keep one shared page per CPU and call futex_wake() only when a counter page overflows, modern kernels give sub‑µs round‑trips; the complexity is mostly about getting huge‑page alignment and NUMA locality right. Still the simplest thing that always works.
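
A minimal sketch of that wake‑on‑overflow pattern – struct metric_page, FLUSH_THRESHOLD, and the field names are illustrative, not a fixed layout:

```c
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define FLUSH_THRESHOLD  (1u << 20)          /* arbitrary: wake the daemon only rarely */

struct metric_page {                         /* one per CPU, in the shared mapping */
        uint32_t futex_word;                 /* 0 = idle, 1 = needs harvesting */
        uint64_t counters[511];
};

/* producer hot path: relaxed add; the futex syscall is the rare slow path */
static void metric_add(struct metric_page *pg, unsigned idx, uint64_t v)
{
        uint64_t old = __atomic_fetch_add(&pg->counters[idx], v, __ATOMIC_RELAXED);
        if (old + v >= FLUSH_THRESHOLD &&
            !__atomic_exchange_n(&pg->futex_word, 1, __ATOMIC_RELEASE))
                syscall(SYS_futex, &pg->futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* daemon: sleep until some producer signals, then harvest and reset the counters */
static void wait_for_flush(struct metric_page *pg)
{
        while (!__atomic_exchange_n(&pg->futex_word, 0, __ATOMIC_ACQUIRE))
                syscall(SYS_futex, &pg->futex_word, FUTEX_WAIT, 0, NULL, NULL, 0);
}
```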


Which one should you prototype?

| Criterion            | user‑events   | io_uring MSG_RING    | eBPF ringbuf  | raw shm             |
|----------------------|---------------|----------------------|---------------|---------------------|
| Lines in client      | ≈30           | ≈40                  | ≈80           | 80‑100              |
| Syscalls on hot path | 1 × writev    | 1 × io_uring_enter   | 0             | 0–1 (futex, rarely) |
| Typical overhead     | 40‑80 ns      | 90‑130 ns            | < 10 ns       | 10‑50 ns            |
| Kernel version       | ≥ 6.5         | ≥ 5.19 (better 6.11) | ≥ 5.8         | any                 |
| Tooling support      | perf / ftrace | liburing only        | bpftool, perf | custom              |

For system‑wide metrics on mainstream distros in 2025, tracefs user‑events gives the best ratio of effort to performance: drop‑in support across languages, automatic timestamping, and no worries about NUMA collisions or cgroup accounting.

If you need the absolute last few dozen nanoseconds and control the whole stack, move to the eBPF shared‑page counter model; otherwise start with user‑events and benchmark – you’re unlikely to saturate it before hitting memory‑bandwidth limits.


A note on your original “borrowed‑thread” idea

Even for metrics, it would still have to solve address‑space randomisation, per‑cgroup accounting, and scheduler fairness – far heavier concerns than the micro‑optimisations above. Every kernel maintainer I asked gives the same hint: “use user_events or io_uring; the complexity‑to‑gain ratio of migrating threads is not worth it for observability.”

So prototype one of the four mechanisms, measure, and only then decide whether you actually need something more exotic.