Proposal: Expose debug offsets for external profiler/debugger support #505

@taegyunkim

Disclaimer: I work on the Datadog Python profiler. Our profiler (dd-trace-py) supports gevent profiling and would directly benefit from this change. I believe it would also benefit the broader ecosystem.

Summary

Expose byte offsets of key fields within the greenlet object struct, following the same pattern CPython uses for asyncio tasks (_Py_AsyncioModuleDebugOffsets in Modules/_asynciomodule.c). This would allow external profilers and debuggers to read greenlet state from a separate thread or process by reading memory at known offsets.

Motivation

CPython has been moving toward better external observability:

  • _Py_DebugOffsets (placed at the start of PyRuntime) enables external tools to locate internal interpreter structures in a running process.
  • _Py_AsyncioModuleDebugOffsets (CPython 3.14+, Modules/_asynciomodule.c) exposes TaskObj field offsets (task_name, task_coro, task_awaited_by, task_node, etc.) as compile-time constants in a dedicated debug section. This allows profilers to read task state using process_vm_readv at sample time (~100Hz) with zero per-task-switch overhead.

Greenlet has no equivalent. The greenlet struct is fully opaque in modern greenlet (>= 2.0), with all real state hidden behind a pimpl pointer. While it is possible to read raw memory from a greenlet object (e.g., via process_vm_readv), external tools have no reliable way to know the byte offsets of fields like gr_frame or state flags.

Real-world impact

Python profilers that support greenlet/gevent currently work around the opaque struct by installing a greenlet.settrace() callback that fires on every greenlet switch. On each switch, the tracer reads gr_frame from Python and caches it for the sampler thread. This adds significant latency to every greenlet switch:

Per-greenlet switch latency (us), 256 greenlets, Apple M3 Max:
                          p50     p99
  no tracer:              247     334
  noop tracer:            286     349
  profiler w/ settrace:   711     895

For asyncio tasks, profilers avoid this problem entirely: the sampler thread reads task frames at sample time using known struct offsets from CPython headers. No per-task-switch hook is needed. The same approach would work for greenlets if the struct offsets were available.
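To make the sample-time read concrete, here is a minimal C sketch of reading one pointer-sized field at a known offset. The names read_mem_fn, read_local, read_ptr_field, and struct demo_object are hypothetical and are not part of greenlet or CPython. An out-of-process profiler would plug in a reader built on something like process_vm_readv; the in-process memcpy reader below exists only to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reader: copy `len` bytes from address `addr` in the target
   into `dst`, returning 0 on success. */
typedef int (*read_mem_fn)(uint64_t addr, void *dst, size_t len);

/* In-process stand-in so the sketch is self-contained. */
static int read_local(uint64_t addr, void *dst, size_t len) {
    memcpy(dst, (const void *)(uintptr_t)addr, len);
    return 0;
}

/* Hypothetical object layout, for illustration only. */
struct demo_object {
    long refcount;
    void *frame;
};

/* Read the pointer-sized field `offset` bytes into the object at `base`.
   Assumes the target's pointer width matches this process. */
static int read_ptr_field(read_mem_fn read_mem, uint64_t base,
                          uint64_t offset, uint64_t *out) {
    uintptr_t value = 0;
    if (read_mem(base + offset, &value, sizeof value) != 0) {
        return -1;
    }
    *out = (uint64_t)value;
    return 0;
}
```

With a published offsets table, `offset` would come from the table rather than from a header the profiler compiled against, so the sampler does no work at switch time.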

Current workarounds

Without debug offsets, profilers must choose between:

  1. Per-switch hook: Install a greenlet.settrace() callback that reads gr_frame from Python on every switch and passes it to the native side. This works but adds significant overhead to every greenlet switch (see benchmark above).

  2. Runtime offset discovery via ctypes: Create a paused greenlet, treat id(greenlet) as a memory address, scan the object's memory for id(gr_frame), and infer the byte offset. This avoids per-switch overhead but is fragile, undocumented, and could break silently across greenlet releases.
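The scanning step in workaround 2 can be sketched in C as follows. This is only an illustration on a flat, hypothetical struct (demo_greenlet); the real workaround runs from Python via ctypes, and the real greenlet object hides its state behind a pimpl pointer, so an actual probe would have more to do. The function name find_pointer_offset is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat layout; greenlet's real object is laid out differently. */
struct demo_greenlet {
    void *type_like;
    long flags;
    void *frame;
};

/* Return the byte offset of the first pointer-sized slot in
   [obj, obj + size) whose value equals `needle`, or -1 if none matches.
   Slots are checked at pointer-size strides from the start of the object. */
static long find_pointer_offset(const void *obj, size_t size,
                                const void *needle) {
    const unsigned char *bytes = obj;
    for (size_t off = 0; off + sizeof(void *) <= size; off += sizeof(void *)) {
        const void *slot;
        memcpy(&slot, bytes + off, sizeof slot);
        if (slot == needle) {
            return (long)off;
        }
    }
    return -1;
}
```

The fragility is visible in the sketch: the result is whatever slot happens to match first, so an unrelated field holding the same value, or a layout change in a new release, yields a wrong offset with no error.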

Proposed API

Debug section (following CPython's pattern)

This mirrors CPython's _Py_AsyncioModuleDebugOffsets exactly. A static struct with compile-time offsetof() values is placed in a named debug section using the same GENERATE_DEBUG_SECTION macro pattern:

typedef struct _GreenletDebugOffsets {
    struct _greenlet_object {
        uint64_t size;
        uint64_t gr_frame;        /* current Python frame (paused greenlets) */
        uint64_t stack_start;     /* C stack start pointer */
        uint64_t stack_stop;      /* C stack stop pointer */
        uint64_t run;             /* the run callable */
        uint64_t parent;          /* parent greenlet */
        uint64_t started;         /* bool: has been switched to at least once */
        uint64_t active;          /* bool: currently executing */
    } greenlet_object;
} GreenletDebugOffsets;

/* Placed in a named section for out-of-process discovery.
   Field names and internal struct types are illustrative;
   the actual offsetof() targets would use greenlet's internal types. */
GENERATE_DEBUG_SECTION(GreenletDebug, GreenletDebugOffsets _GreenletDebug)
    = {.greenlet_object = {
           .size = sizeof(InternalGreenlet),
           .gr_frame = /* offset from PyGreenlet* to the frame pointer */,
           .stack_start = /* offset to C stack start */,
           .stack_stop = /* offset to C stack stop */,
           .run = /* offset to the run callable */,
           .parent = /* offset to parent greenlet */,
           .started = /* offset to started flag */,
           .active = /* offset to active flag */,
       }};
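For readers who want something that compiles, here is a reduced sketch of the same pattern against a made-up struct. DEMO_DEBUG_SECTION is a simplified stand-in for CPython's GENERATE_DEBUG_SECTION (GCC/Clang only, no Windows branch), and DemoGreenlet, DemoOffsets, and demo_greenlet_debug are hypothetical names; none of this reflects greenlet's actual internal types.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for CPython's GENERATE_DEBUG_SECTION.
   `used` keeps the symbol even though nothing in the program reads it. */
#if defined(__APPLE__)
#  define DEMO_DEBUG_SECTION(name) \
       __attribute__((section("__DATA," #name), used))
#elif defined(__GNUC__) || defined(__clang__)
#  define DEMO_DEBUG_SECTION(name) \
       __attribute__((section("." #name), used))
#else
#  define DEMO_DEBUG_SECTION(name)
#endif

/* Hypothetical internal layout, for illustration only. */
typedef struct DemoGreenlet {
    void *frame;
    void *parent;
    uint8_t started;
    uint8_t active;
} DemoGreenlet;

/* Offsets table: every entry is a compile-time constant. */
typedef struct DemoOffsets {
    uint64_t size;
    uint64_t frame;
    uint64_t parent;
    uint64_t started;
    uint64_t active;
} DemoOffsets;

DEMO_DEBUG_SECTION(GreenletDemo)
DemoOffsets demo_greenlet_debug = {
    .size = sizeof(DemoGreenlet),
    .frame = offsetof(DemoGreenlet, frame),
    .parent = offsetof(DemoGreenlet, parent),
    .started = offsetof(DemoGreenlet, started),
    .active = offsetof(DemoGreenlet, active),
};
```

On Linux the table lands in a section named .GreenletDemo; on macOS it lands in the __DATA segment, where section names are limited to 16 characters.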

Tools discover the section in the greenlet shared library by parsing the ELF/Mach-O section table, the same way they discover CPython's debug sections.
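The discovery step might look roughly like the following for 64-bit ELF files on Linux. This is a sketch under stated assumptions, not how any particular profiler does it: find_elf_section is a hypothetical helper, it reads the file on disk rather than a live process, and it ignores 32-bit ELF, Mach-O, and files with extended section numbering.

```c
#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Look up section `name` in the 64-bit ELF file at `path`. On success,
   store the section's file offset and size and return 0; otherwise -1. */
static int find_elf_section(const char *path, const char *name,
                            uint64_t *offset, uint64_t *size) {
    int rc = -1;
    Elf64_Ehdr eh;
    Elf64_Shdr *sh = NULL;
    char *names = NULL;
    size_t names_len = 0;
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    /* Validate the ELF header before trusting any of its fields. */
    if (fread(&eh, sizeof eh, 1, f) != 1) goto done;
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) goto done;
    if (eh.e_ident[EI_CLASS] != ELFCLASS64) goto done;
    if (eh.e_shentsize != sizeof(Elf64_Shdr)) goto done;
    if (eh.e_shnum == 0 || eh.e_shstrndx >= eh.e_shnum) goto done;

    /* Load the section header table. */
    sh = malloc((size_t)eh.e_shnum * sizeof *sh);
    if (!sh) goto done;
    if (fseek(f, (long)eh.e_shoff, SEEK_SET) != 0) goto done;
    if (fread(sh, sizeof *sh, eh.e_shnum, f) != eh.e_shnum) goto done;

    /* Load the section-name string table and NUL-terminate it. */
    names_len = (size_t)sh[eh.e_shstrndx].sh_size;
    names = malloc(names_len + 1);
    if (!names) goto done;
    if (fseek(f, (long)sh[eh.e_shstrndx].sh_offset, SEEK_SET) != 0) goto done;
    if (fread(names, 1, names_len, f) != names_len) goto done;
    names[names_len] = '\0';

    /* Match each section's name against the requested one. */
    for (size_t i = 0; i < eh.e_shnum; i++) {
        if (sh[i].sh_name < names_len &&
            strcmp(names + sh[i].sh_name, name) == 0) {
            *offset = sh[i].sh_offset;
            *size = sh[i].sh_size;
            rc = 0;
            break;
        }
    }

done:
    free(names);
    free(sh);
    fclose(f);
    return rc;
}
```

For a running process, a tool would use the section's sh_addr plus the library's load base instead of the file offset, then read the offsets table from the target's memory.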

Minimum useful set

For profilers, the critical fields are:

  1. gr_frame - to unwind paused greenlet stacks from a native thread
  2. active (or equivalent) - to distinguish running vs. paused vs. dead

Everything else is useful but not blocking.

Compatibility

  • Layout changes can be signaled to tools by bumping a version in the offsets struct (a dedicated version field or the existing size field can serve this purpose).
  • Adding this has zero runtime cost: the offsets are compile-time constants stored in a static struct.

Precedent

Project                                  Mechanism                                  Fields exposed
CPython _Py_DebugOffsets                 Struct at PyRuntime start                  PyThreadState, PyInterpreterState, eval breaker offsets
CPython _Py_AsyncioModuleDebugOffsets    Named binary section in _asyncio module    TaskObj fields: task_name, task_coro, task_awaited_by, task_node

Alternatives considered

  1. Expose the full PyGreenlet struct layout in the public header - too much API surface to maintain.
  2. Add a C API function PyGreenlet_GetFrame() - still requires the GIL, doesn't help out-of-process tools or native-thread profilers.
  3. Runtime ctypes probing - fragile, undocumented, current workaround.

Who benefits

Any Python profiler or debugger that needs to inspect greenlet state from a native thread or external process.

Today, most open-source profilers either lack greenlet support entirely (py-spy, Austin, Scalene) or rely on Python-level APIs like greenlet.getcurrent() and sys.setprofile() to detect context switches (dd-trace-py, yappi). None can unwind individual greenlet stacks from a native sampling thread. Exposing debug offsets would lower the barrier for these tools to add efficient greenlet support, the same way CPython's asyncio debug offsets enabled external profilers to support asyncio task profiling.
