Disclaimer: I work on the Datadog Python profiler. Our Python profiler (dd-trace-py) supports gevent profiling and would directly benefit from this change. I believe this would also benefit the broader ecosystem.
Summary
Expose byte offsets of key fields within the greenlet object struct, following the same pattern CPython uses for asyncio tasks (_Py_AsyncioModuleDebugOffsets in Modules/_asynciomodule.c). This would allow external profilers and debuggers to read greenlet state from a separate thread or process by reading memory at known offsets.
Motivation
CPython has been moving toward better external observability:
- _Py_DebugOffsets (placed at the start of PyRuntime) enables external tools to locate internal interpreter structures in a running process.
- _Py_AsyncioModuleDebugOffsets (CPython 3.14+, Modules/_asynciomodule.c) exposes TaskObj field offsets (task_name, task_coro, task_awaited_by, task_node, etc.) as compile-time constants in a dedicated debug section. This allows profilers to read task state using process_vm_readv at sample time (~100Hz) with zero per-task-switch overhead.
Greenlet has no equivalent. The greenlet struct is fully opaque in modern greenlet (>= 2.0), with all real state hidden behind a pimpl pointer. While it is possible to read raw memory from a greenlet object (e.g., via process_vm_readv), external tools have no reliable way to know the byte offsets of fields like gr_frame or state flags.
Real-world impact
Python profilers that support greenlet/gevent currently work around the opaque struct by installing a greenlet.settrace() callback that fires on every greenlet switch. On each switch, the tracer reads gr_frame from Python and caches it for the sampler thread. This adds significant latency to every greenlet switch, roughly tripling the median (p50) in the benchmark below:
Per-greenlet switch latency (us), 256 greenlets, Apple M3 Max:

                          p50    p99
    no tracer:            247    334
    noop tracer:          286    349
    profiler w/ settrace: 711    895
For asyncio tasks, profilers avoid this problem entirely: the sampler thread reads task frames at sample time using known struct offsets from CPython headers. No per-task-switch hook is needed. The same approach would work for greenlets if the struct offsets were available.
Current workarounds
Without debug offsets, profilers must choose between:
- Per-switch hook: Install a greenlet.settrace() callback that reads gr_frame from Python on every switch and passes it to the native side. This works but adds significant overhead to every greenlet switch (see benchmark above).
- Runtime offset discovery via ctypes: Create a paused greenlet, treat id(greenlet) as a memory address, scan the object's memory for id(gr_frame), and infer the byte offset. This avoids per-switch overhead but is fragile, undocumented, and could break silently across greenlet releases.
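To make the second workaround concrete, here is a minimal C sketch of the scanning step only. It assumes the caller has already copied the object's bytes into a buffer and knows the pointer value it is looking for (the equivalent of id(gr_frame)). The function name find_pointer_offset is hypothetical and is not part of greenlet or any profiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan `len` bytes of a copied object image for a pointer-sized word equal
   to `needle`. Returns the byte offset of the first match, or -1 if absent.
   Only pointer-aligned positions are checked. Hypothetical helper. */
static ptrdiff_t find_pointer_offset(const unsigned char *image, size_t len,
                                     uintptr_t needle)
{
    assert(image != NULL);
    for (size_t off = 0; off + sizeof(uintptr_t) <= len;
         off += sizeof(uintptr_t)) {
        uintptr_t word;
        /* memcpy avoids alignment and aliasing problems on the raw buffer. */
        memcpy(&word, image + off, sizeof word);
        if (word == needle) {
            return (ptrdiff_t)off;
        }
    }
    return -1;
}
```

The sketch also shows why the approach is fragile: the first matching word wins, so an unrelated field that happens to hold the same value yields a wrong offset, and nothing signals that the layout changed between releases.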
Proposed API
Debug section (following CPython's pattern)
This mirrors CPython's _Py_AsyncioModuleDebugOffsets exactly. A static struct with compile-time offsetof() values is placed in a named debug section using the same GENERATE_DEBUG_SECTION macro pattern:
    typedef struct _GreenletDebugOffsets {
        struct _greenlet_object {
            uint64_t size;
            uint64_t gr_frame;     /* current Python frame (paused greenlets) */
            uint64_t stack_start;  /* C stack start pointer */
            uint64_t stack_stop;   /* C stack stop pointer */
            uint64_t run;          /* the run callable */
            uint64_t parent;       /* parent greenlet */
            uint64_t started;      /* bool: has been switched to at least once */
            uint64_t active;       /* bool: currently executing */
        } greenlet_object;
    } GreenletDebugOffsets;

    /* Placed in a named section for out-of-process discovery.
       Field names and internal struct types are illustrative;
       the actual offsetof() targets would use greenlet's internal types. */
    GENERATE_DEBUG_SECTION(GreenletDebug, GreenletDebugOffsets _GreenletDebug)
        = {.greenlet_object = {
            .size = sizeof(InternalGreenlet),
            .gr_frame = /* offset from PyGreenlet* to the frame pointer */,
            .stack_start = /* offset to C stack start */,
            .stack_stop = /* offset to C stack stop */,
            .run = /* offset to the run callable */,
            .parent = /* offset to parent greenlet */,
            .started = /* offset to started flag */,
            .active = /* offset to active flag */,
        }};
Tools discover the section in the greenlet shared library by parsing the ELF/Mach-O section table, the same way they discover CPython's debug sections.
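As a rough illustration of both sides of this contract, the sketch below fills an offsets table with offsetof() and then decodes two fields from a byte image of the object, as a sampler might after copying memory out of the target. Everything here is hypothetical: MockGreenlet is a stand-in and not greenlet's real layout, and MockDebugOffsets, mock_offsets and read_greenlet_state are illustrative names. It also assumes the reader and the target share a pointer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL stand-in for greenlet's internal state; NOT the real layout. */
typedef struct MockGreenlet {
    void   *gr_frame;    /* frame pointer, meaningful while paused */
    char   *stack_start;
    char   *stack_stop;
    uint8_t started;
    uint8_t active;
} MockGreenlet;

/* Subset of the proposed offsets table (hypothetical names). */
typedef struct MockDebugOffsets {
    uint64_t size;
    uint64_t gr_frame;
    uint64_t active;
} MockDebugOffsets;

/* Producer side: every value is a compile-time constant. */
static const MockDebugOffsets mock_offsets = {
    .size     = sizeof(MockGreenlet),
    .gr_frame = offsetof(MockGreenlet, gr_frame),
    .active   = offsetof(MockGreenlet, active),
};

/* Consumer side: decode two fields from a byte image of the object.
   Returns 0 on success, -1 if the image or the offsets are out of range. */
static int read_greenlet_state(const unsigned char *image, size_t image_len,
                               const MockDebugOffsets *offs,
                               uintptr_t *frame_out, uint8_t *active_out)
{
    assert(image != NULL && offs != NULL);
    assert(frame_out != NULL && active_out != NULL);
    if (image_len < offs->size
        || offs->size < sizeof(uintptr_t)
        || offs->gr_frame > offs->size - sizeof(uintptr_t)
        || offs->active >= offs->size) {
        return -1;
    }
    memcpy(frame_out, image + offs->gr_frame, sizeof *frame_out);
    memcpy(active_out, image + offs->active, sizeof *active_out);
    return 0;
}
```

In a real tool the image would come from something like process_vm_readv and the table from the debug section. The bounds checks are there because both inputs originate in another process and should not be trusted blindly.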
Minimum useful set
For profilers, the critical fields are:
- gr_frame - to unwind paused greenlet stacks from a native thread
- active (or equivalent) - to distinguish running vs. paused vs. dead
Everything else is useful but not blocking.
Compatibility
- The offsets struct version can be bumped on layout changes (a version field or the size field serves this purpose).
- Adding this has zero runtime cost: the offsets are compile-time constants stored in a static struct.
Precedent
| Project | Mechanism | Fields exposed |
| --- | --- | --- |
| CPython _Py_DebugOffsets | Struct at PyRuntime start | PyThreadState, PyInterpreterState, eval breaker offsets |
| CPython _Py_AsyncioModuleDebugOffsets | Named binary section in _asyncio module | TaskObj fields: task_name, task_coro, task_awaited_by, task_node |
Alternatives considered
- Expose the full PyGreenlet struct layout in the public header - too much API surface to maintain.
- Add a C API function PyGreenlet_GetFrame() - still requires the GIL, doesn't help out-of-process tools or native-thread profilers.
- Runtime ctypes probing - fragile, undocumented, current workaround.
Who benefits
Any Python profiler or debugger that needs to inspect greenlet state from a native thread or external process.
Today, most open-source profilers either lack greenlet support entirely (py-spy, Austin, Scalene) or rely on Python-level APIs like greenlet.getcurrent() and sys.setprofile() to detect context switches (dd-trace-py, yappi). None can unwind individual greenlet stacks from a native sampling thread. Exposing debug offsets would lower the barrier for these tools to add efficient greenlet support, the same way CPython's asyncio debug offsets enabled external profilers to support asyncio task profiling.