PEP-734: Subinterpreters in Python 3.14
A deep dive into Python 3.14's new subinterpreters feature — how they achieve true parallelism with per-interpreter GILs, the massive C-level refactoring required for isolation, and benchmark comparisons against threading, multiprocessing, and free-threading.
Hi! My name is Nikita Sobolev, I'm a CPython core developer, and also the author of a video series about its internals.
I'm continuing my series of articles about CPython implementation details. Today we'll talk about subinterpreters — their architecture, past, and hopefully bright future.
Under the cut you'll find: new Python APIs for speeding up and parallelizing your programs, memory management, data duplication. And lots of C code!
To understand the topic and tell you about it, I took several important steps: I read almost all the code for this feature, started committing to subinterpreters, and interviewed the author of this project. The interview is available with Russian and English subtitles. I also added tons of context directly in the video. Pause and read the code.
If you find this interesting or completely unfamiliar — welcome!
What Was Added in 3.14?
Two important new parts were added to Python. The first part is concurrent.interpreters:
>>> from concurrent import interpreters
>>> interp = interpreters.create()
>>> interpreters.list_all()
[Interpreter(0), Interpreter(1)]
>>> def worker(arg: int) -> None:
...     print('do some work', arg)
...
>>> interp.call(worker, 1) # run code in a subinterpreter
do some work 1

This is a Python API for conveniently working with subinterpreters. There isn't much of it yet, but you can already do useful things. Most likely you won't need it in this raw form. This API is more for library developers. An analogy: how often do you use the threading module directly?
The second part is concurrent.futures.InterpreterPoolExecutor, which is analogous to concurrent.futures.ThreadPoolExecutor. You can run work in fully parallel fashion:
>>> from concurrent.futures import InterpreterPoolExecutor
>>> CPUS = 4
>>> with InterpreterPoolExecutor(CPUS) as executor:
...     list(executor.map(worker, [1, 2, 3, 4, 5, 6]))
...
do some work 1
do some work 2
do some work 4
do some work 3
do some work 5
do some work 6
[None, None, None, None, None, None]

Now this is more interesting. This API can and should be used if you want to parallelize something across many subinterpreters.
Now let's talk about how it works.
Comparing Subinterpreters with threading / multiprocessing / free-threading / asyncio
Multitasking (or rather its absence) has long been a weak spot of CPython. Many different attempts have been made to work around the GIL limitations and solve the problem for various cases.
The first approach is threading (in default GIL mode), which works wonderfully if you're calling some binary code that releases the GIL through the C-API.
For example, the mmap module does this:
Py_BEGIN_ALLOW_THREADS
m_obj->data = mmap(NULL, map_size, prot, flags, fd, offset);
Py_END_ALLOW_THREADS

Since we have a special C-API for managing PyThreadState, anyone can call the necessary functions and make their program faster. But not everyone does this. And you can't do this in Python code. Therefore threading in CPython had limited usefulness. Moreover: threads require synchronization primitives for data access control.
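To see this effect without writing any C, here's a small illustration of my own (not from the original article): time.sleep, like the mmap call above, releases the GIL internally while it waits, so plain threads overlap even on a regular GIL build.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_io() -> None:
    # time.sleep releases the GIL internally (just like the
    # Py_BEGIN_ALLOW_THREADS block in mmap), so these calls overlap.
    time.sleep(0.2)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    for _ in range(4):
        executor.submit(blocking_io)
elapsed = time.perf_counter() - start

# Four 0.2s sleeps overlap: wall time is close to 0.2s, not 0.8s.
print(f'{elapsed:.2f}s')
```

The same overlap does not happen for pure-Python CPU loops, which is exactly the limitation discussed above.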
Thus appeared the second approach — multiprocessing. It creates full-fledged new Python processes with their own memory. Yes, you can share some things. But the costs of creating a new process and N-times memory consumption remain a significant problem. For example, if your dataset already weighs 2 GB, it's very hard to just duplicate it to a neighboring process for another +2 GB.
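To be fair, the stdlib does offer a middle ground here: multiprocessing.shared_memory (Python 3.8+) lets processes attach to one buffer by name instead of receiving a pickled copy. A minimal sketch of the mechanics (single-process for brevity; a real worker would attach from another process):

```python
from multiprocessing import shared_memory

# Create a small shared block; worker processes can attach to it by
# name instead of receiving a duplicated copy of the data.
shm = shared_memory.SharedMemory(create=True, size=16)
try:
    shm.buf[:4] = b'\x01\x02\x03\x04'

    # A second handle (normally opened from another process) maps the
    # same physical memory, so there is no N-times duplication:
    other = shared_memory.SharedMemory(name=shm.name)
    value = bytes(other.buf[:4])
    other.close()
finally:
    shm.close()
    shm.unlink()  # release the OS-level segment

print(value)
```

It works, but the manual create/attach/unlink lifecycle is one reason it never became the default way to share data between processes.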
The next approach is asyncio. It required us to rewrite all of Python, introduced the function coloring problem, but didn't solve the fundamental issues. It still operated within a single GIL, a single thread, a single process. It didn't help at all with CPU-bound tasks. The final nail in asyncio's coffin was that no one thought through how it should actually work. It lacks clear primitives, explicit cancellation scopes, and introspection. Plus suboptimal performance, many bad APIs, reachable scaling limits, a complex mental model, poor integration with threads, and difficulties with synchronization primitives.
The emergence of free-threading wasn't a surprise. Because threads are cheap to create and don't require data copying (which, on the other hand, can lead to mutable access, races, deadlocks, and other multithreading fun). Now you can disable the GIL and get overhead on every object when using free-threading. Plus you need to thoroughly wrap your code with threading.Lock and other mutexes. But there will still be places, even right in builtins, that physically can't be hidden behind a single critical section (mutex). Which means races will exist even in builtins. Example: regular iterators in --disable-gil mode:
import concurrent.futures
N = 10000
for _ in range(100):
    it = iter(range(N))
    with concurrent.futures.ThreadPoolExecutor() as executor:
        data = set(executor.map(lambda _: next(it), range(N)))
    assert len(data) == N, f"Expected {N} distinct elements, got {len(data)}"
# Traceback (most recent call last):
# File "<python-input-0>", line 8, in <module>
# assert len(data) == N, f"Expected {N} distinct elements, got {len(data)}"
# ^^^^^^^^^^^^^^
# AssertionError: Expected 10000 distinct elements, got 9999

Shooting ourselves in the foot with N hands!
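As noted above, on a free-threaded build the fix is to wrap access to the shared iterator in a lock. A sketch (safe_next is my name, not from the article):

```python
import threading
import concurrent.futures

N = 10_000
it = iter(range(N))
lock = threading.Lock()

def safe_next(_: int) -> int:
    # Serialize access to the shared iterator: correct on both the
    # GIL and the free-threaded builds, at the cost of contention.
    with lock:
        return next(it)

with concurrent.futures.ThreadPoolExecutor() as executor:
    data = set(executor.map(safe_next, range(N)))

assert len(data) == N
```

The correctness comes back, but so does serialization: every worker queues up on one mutex, which is precisely the "wrap everything in threading.Lock" tax described above.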
So subinterpreters look like a very good solution. They're fairly cheap to launch (but can be made even faster and better in the future, as discussed in the video), each works with its own GIL, there are no locks or data mutation in user code, they're suitable for both CPU and IO-bound tasks, they support data sharing without copying (both simple reuse of immutable and immortal types like int, and special magic around using memoryview), and there will be even more ways in upcoming releases.
Example of memoryview magic with buffers:
>>> from concurrent import interpreters
>>> interp = interpreters.create()
>>> queue = interpreters.create_queue()
>>> b = bytearray(b'123')
>>> m = memoryview(b)
>>> queue.put_nowait(m)
>>> interp.exec('(m := queue.get_nowait()); print(m); m[:] = b"456"') # changing memory directly
<memory at 0x103274940>
>>> b # was changed in another interpreter!
bytearray(b'456')

It's the same object! No copying happened. This will allow us to share, for example, np.array or any other buffers. Looks very promising. And of course, you can build an actor model on top (which we also discuss in the video), and CSP like in Go.
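The machinery underneath is the buffer protocol, and you can see its zero-copy nature even inside a single interpreter: slicing a memoryview produces another view over the same memory, not a copy.

```python
buf = bytearray(b'0123456789')
view = memoryview(buf)

# Slicing a memoryview creates a new view, not a copy:
chunk = view[2:5]
chunk[:] = b'XYZ'

# The write went straight through to the original buffer.
print(buf)  # bytearray(b'01XYZ56789')
assert buf == bytearray(b'01XYZ56789')
```

Sending such a view through an interpreters queue is the same trick, just with the view handed to another interpreter instead of another variable.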
What Are Subinterpreters?
To understand what subinterpreters are, you first need to understand what an interpreter is in CPython :)
When we start a new python process, we begin executing code from pylifecycle.c, which launches the interpreter and handles all other lifecycle questions. There's a function like this:
static PyStatus
pycore_create_interpreter(_PyRuntimeState *runtime,
                          const PyConfig *src_config,
                          PyThreadState **tstate_p)
{
    PyStatus status;
    PyInterpreterState *interp;

    status = _PyInterpreterState_New(NULL, &interp);
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    assert(interp != NULL);
    assert(_Py_IsMainInterpreter(interp));

    _PyInterpreterState_SetWhence(interp, _PyInterpreterState_WHENCE_RUNTIME);
    interp->_ready = 1;

    status = _PyConfig_Copy(&interp->config, src_config);
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }

    /* Auto-thread-state API */
    status = _PyGILState_Init(interp);
    if (_PyStatus_EXCEPTION(status)) {
        return status;
    }
    // ...
}

It creates the main interpreter, its state, GIL, and everything else needed for startup. Now let's look more closely at the "states" it creates and uses. There are two main ones: PyThreadState (thread state) and PyInterpreterState (interpreter state), to which each thread state is bound. Here are the definitions:
typedef struct _ts PyThreadState;
typedef struct _is PyInterpreterState;

struct _ts {
    /* See Python/ceval.c for comments explaining most fields */

    PyThreadState *prev;
    PyThreadState *next;
    PyInterpreterState *interp;

    uintptr_t eval_breaker;

    /* Currently holds the GIL. Must be its own field to avoid data races */
    int holds_gil;
    // ...
};

Inside you can see lots of useful system details. Including a pointer to PyInterpreterState, which contains more application-level things. For example: its own builtins and sys, its own imports, its own GIL in _gil (or a shared one in ceval in PyInterpreterConfig_SHARED_GIL mode, see _PyEval_InitGIL), and everything else needed for operation:
struct _is {
    struct _ceval_state ceval;
    struct _gc_runtime_state gc;

    // Dictionary of the sys module
    PyObject *sysdict;
    // Dictionary of the builtins module
    PyObject *builtins;

    struct _import_state imports;

    /* The per-interpreter GIL, which might not be used. */
    struct _gil_runtime_state _gil;

    /* cross-interpreter data and utils */
    _PyXI_state_t xi;
    // ...
};

Now we understand: this is what an "interpreter" really is. If we have several such states, we can create multiple independent interpreters. The config-based C-API for this has been available since Python 3.12:
PyInterpreterConfig config = {
    .use_main_obmalloc = 0,
    .allow_fork = 0,
    .allow_exec = 0,
    .allow_threads = 1,
    .allow_daemon_threads = 0,
    .check_multi_interp_extensions = 1,
    .gil = PyInterpreterConfig_OWN_GIL,
};

PyThreadState *tstate = NULL;
PyStatus status = Py_NewInterpreterFromConfig(&tstate, &config);
if (PyStatus_Exception(status)) {
    Py_ExitStatusException(status);
}

In the example above we also create new PyThreadState and PyInterpreterState. In plain language: we create a new thread state and a new interpreter bound to it. And here's the key difference from "just a Python thread": because we can choose the GIL type (in the example we use the PEP-684 Per-Interpreter GIL), threads will be managed by the OS scheduler, not the usual Python GIL logic. And they work fully in parallel and in isolation!
Now we need to understand: why is it isolated? How do interpreters not interfere with each other? How do they achieve isolation?
Subinterpreter Isolation
The simplest yet most complex aspect. For isolation to work, we had to rewrite ALL built-in C modules, ALL built-in C classes in Python. A gigantic amount of work that touched literally the entire standard library. And if authors of other C extensions want to support subinterpreters (or free-threading, by the way), they also need to rewrite everything.
This isolation was achieved through several factors:
- Two-phase module initialization
- Module isolation using ModuleState
- Refactoring static types into Heap Types
Let's examine each with examples. Starting with PEP-489 — two-phase module initialization. We'll look at mmap as an example. Here's what it looked like before:
static struct PyModuleDef mmapmodule = { // !!!
    PyModuleDef_HEAD_INIT,
    "mmap",
    NULL,
    -1,
    NULL,
    NULL,
    NULL,
    NULL,
    NULL
};

PyMODINIT_FUNC
PyInit_mmap(void)
{
    PyObject *dict, *module;

    if (PyType_Ready(&mmap_object_type) < 0)
        return NULL;

    module = PyModule_Create(&mmapmodule);
    if (module == NULL)
        return NULL;
    dict = PyModule_GetDict(module);
    if (!dict)
        return NULL;

    PyDict_SetItemString(dict, "error", PyExc_OSError);
    PyDict_SetItemString(dict, "mmap", (PyObject*) &mmap_object_type);
    // ...
    return module;
}

Creating one shared module for everyone in PyInit_mmap? That won't work for us!
Now its definition looks like this:
static int
mmap_exec(PyObject *module)
{
    if (PyModule_AddObjectRef(module, "error", PyExc_OSError) < 0) {
        return -1;
    }

    PyObject *mmap_object_type = PyType_FromModuleAndSpec(module,
                                                          &mmap_object_spec, NULL);
    if (mmap_object_type == NULL) {
        return -1;
    }
    int rc = PyModule_AddType(module, (PyTypeObject *)mmap_object_type);
    Py_DECREF(mmap_object_type);
    if (rc < 0) {
        return -1;
    }
    // ...
    return 0;
}

static PyModuleDef_Slot mmap_slots[] = {
    {Py_mod_exec, mmap_exec},
    {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
    {0, NULL}
};

static struct PyModuleDef mmapmodule = {
    .m_base = PyModuleDef_HEAD_INIT,
    .m_name = "mmap",
    .m_size = 0,
    .m_slots = mmap_slots,
};

PyMODINIT_FUNC
PyInit_mmap(void)
{
    return PyModuleDef_Init(&mmapmodule);
}

What changed?
- The definition of static struct PyModuleDef mmapmodule changed slightly — now we specify C slots for the future module. The most important for us is Py_mod_exec.
- Now an already-created module comes into the mmap_exec function specified as the special slot {Py_mod_exec, mmap_exec}, and we simply initialize it there. Previously we created the module's PyObject * right in PyInit_mmap from its static global object: module = PyModule_Create(&mmapmodule). That wasn't repeatable for new copies. It's precisely this new exec-based API that allows us to create a new independent copy of the module on demand.
- PyType_Ready is no longer called in the module body, and the creation of mmap_object_spec also changed, which will be useful in the Heap Types section.
The PyInit_mmap(void) function remains for backward compatibility and some compilers. Now it calls PyModuleDef_Init.
Note that we explicitly indicate in the slots that subinterpreters are supported: {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED}. If we tried to import an unsupported module, we'd get an error:
>>> from concurrent.interpreters import create
>>> interp = create()
>>> interp.exec('import _suggestions')
Traceback (most recent call last):
File "<python-input-3>", line 1, in <module>
interp.exec('import _suggestions')
concurrent.interpreters.ExecutionFailed: ImportError: module _suggestions does not support loading in subinterpreters

The second part: module state isolation. We move all global variables into a special place. Let's look at the _csv module as an example. Here's how it was before:
static PyObject *error_obj; /* CSV exception */
static PyObject *dialects; /* Dialect registry */
static long field_limit = 128 * 1024; /* max parsed field size */
static PyTypeObject Dialect_Type;
// ...

Looks very simple. No complex APIs. But everything uses global state. And we can't have that!
Here's how it looks now, part one:
typedef struct {
    PyObject *error_obj;    /* CSV exception */
    PyObject *dialects;     /* Dialect registry */
    PyTypeObject *dialect_type;
    PyTypeObject *reader_type;
    PyTypeObject *writer_type;
    Py_ssize_t field_limit; /* max parsed field size */
    PyObject *str_write;
} _csvstate;

static struct PyModuleDef _csvmodule;

static inline _csvstate*
get_csv_state(PyObject *module)
{
    void *state = PyModule_GetState(module);
    assert(state != NULL);
    return (_csvstate *)state;
}

static int
_csv_clear(PyObject *module)
{
    _csvstate *module_state = PyModule_GetState(module);
    Py_CLEAR(module_state->error_obj);
    Py_CLEAR(module_state->dialects);
    Py_CLEAR(module_state->dialect_type);
    Py_CLEAR(module_state->reader_type);
    Py_CLEAR(module_state->writer_type);
    Py_CLEAR(module_state->str_write);
    return 0;
}

static int
_csv_traverse(PyObject *module, visitproc visit, void *arg)
{
    _csvstate *module_state = PyModule_GetState(module);
    Py_VISIT(module_state->error_obj);
    Py_VISIT(module_state->dialects);
    Py_VISIT(module_state->dialect_type);
    Py_VISIT(module_state->reader_type);
    Py_VISIT(module_state->writer_type);
    return 0;
}

static void
_csv_free(void *module)
{
    (void)_csv_clear((PyObject *)module);
}

Here we simply define the state _csvstate, describe how to get it via get_csv_state, how to clear it via _csv_clear, and how to traverse it in GC via _csv_traverse. The second part relates to module slots, state sizes, and its creation:
static struct PyModuleDef _csvmodule = {
    PyModuleDef_HEAD_INIT,
    "_csv",
    csv_module_doc,
    sizeof(_csvstate),  // !!! state size
    csv_methods,
    csv_slots,
    _csv_traverse,      // Py_tp_traverse
    _csv_clear,         // Py_tp_clear
    _csv_free           // Py_tp_free
};

static int
csv_exec(PyObject *module) {
    PyObject *temp;
    _csvstate *module_state = get_csv_state(module);

    temp = PyType_FromModuleAndSpec(module, &Dialect_Type_spec, NULL);
    // State initialization here:
    module_state->dialect_type = (PyTypeObject *)temp;
    if (PyModule_AddObjectRef(module, "Dialect", temp) < 0) {
        return -1;
    }
    // ...
}

The most important line here is sizeof(_csvstate), which indicates how much memory is needed to store the module state inside the module object itself.
As a result — each module has its own isolated state. When creating a new module copy for a subinterpreter, its new internal state will be created.
And finally — Heap Types. Our goal is again to make types not shared but copyable. We remove static types and make them isolated. Let's continue looking at the _csv module. Before:
static PyTypeObject Dialect_Type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    "_csv.Dialect",                 /* tp_name */
    sizeof(DialectObj),             /* tp_basicsize */
    0,                              /* tp_itemsize */
    (destructor)Dialect_dealloc,    /* tp_dealloc */
    0,                              /* tp_vectorcall_offset */
    // ... many more fields ...
    dialect_new,                    /* tp_new */
    0,                              /* tp_free */
};

One big static global object! Terrible!
After:
static PyType_Slot Dialect_Type_slots[] = {
    {Py_tp_doc, (char*)Dialect_Type_doc},
    {Py_tp_members, Dialect_memberlist},
    {Py_tp_getset, Dialect_getsetlist},
    {Py_tp_new, dialect_new},
    {Py_tp_methods, dialect_methods},
    {Py_tp_dealloc, Dialect_dealloc},
    {Py_tp_clear, Dialect_clear},
    {Py_tp_traverse, Dialect_traverse},
    {0, NULL}
};

PyType_Spec Dialect_Type_spec = {
    .name = "_csv.Dialect",
    .basicsize = sizeof(DialectObj),
    .flags = (Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_HAVE_GC |
              Py_TPFLAGS_IMMUTABLETYPE),
    .slots = Dialect_Type_slots,
};

static int
csv_exec(PyObject *module) {
    PyObject *temp;
    _csvstate *module_state = get_csv_state(module);

    // Creating a real type from spec:
    temp = PyType_FromModuleAndSpec(module, &Dialect_Type_spec, NULL);
    module_state->dialect_type = (PyTypeObject *)temp;
    if (PyModule_AddObjectRef(module, "Dialect", temp) < 0) {
        return -1;
    }
    // ...
}

Now, instead of global state, we create the type from a specification inside Py_mod_exec using PyType_FromModuleAndSpec, save the type to the module state in module_state->dialect_type, and use it wherever needed. This gives us full isolation — each subinterpreter will have its own modules and its own types. Convenient!
To the Benchmarks!
What article would be complete without synthetic benchmarks with errors and inaccuracies? That's what I thought too. Sit back, look at the benchmark code, comment away. Here's a CPU-bound task test using different approaches:
def worker_cpu(arg: tuple[int, int]):
    start, end = arg
    fact = 1
    for i in range(start, end + 1):
        fact *= i

Results:
Regular: Mean +- std dev: 163 ms +- 1 ms
Threading with GIL: Mean +- std dev: 168 ms +- 2 ms
Threading NoGIL: Mean +- std dev: 48.7 ms +- 0.6 ms
Multiprocessing: Mean +- std dev: 73.4 ms +- 1.5 ms
Subinterpreters: Mean +- std dev: 44.8 ms +- 0.5 ms

Subinterpreters show the best time! Here's how we called them:
import os
from concurrent.futures import InterpreterPoolExecutor

WORKLOADS = [(1, 5), (6, 10), (11, 15), (16, 20)]
CPUS = os.cpu_count() or len(WORKLOADS)

def bench_subinterpreters():
    with InterpreterPoolExecutor(CPUS) as executor:
        list(executor.map(worker_cpu, WORKLOADS))

And for IO-bound tasks:
def worker_io(arg: tuple[int, int]):
    start, end = arg
    with httpx.Client() as client:
        for i in range(start, end + 1):
            client.get(f'http://jsonplaceholder.typicode.com/posts/{i}')

Results:
Regular: Mean +- std dev: 1.45 sec +- 0.03 sec
Threading with GIL: Mean +- std dev: 384 ms +- 17 ms (~1/4 of 1.45s)
Threading NoGIL: Mean +- std dev: 373 ms +- 20 ms
Multiprocessing: Mean +- std dev: 687 ms +- 32 ms
Subinterpreters: Mean +- std dev: 547 ms +- 13 ms

Here free-threading is significantly faster, but subinterpreters still deliver nearly 3x speedup from the baseline. Let's separately compare with an asyncio version:
async def bench_async():
    start, end = 1, 20
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*[
            client.get(f'http://jsonplaceholder.typicode.com/posts/{i}')
            for i in range(start, end + 1)
        ])

Which, of course, wins with 166 ms +- 13 ms.
It would also be interesting to see benchmarks for working with buffers. Here's the code (terrible, unoptimized, but demonstrating the potential possibility of sharing such data):
import os
import random
from typing import Any

import numpy as np

data = np.array([  # PyBuffer
    random.randint(1, 1024)
    for _ in range(10_000_000)
], dtype=np.int32)
mv = memoryview(data)  # TODO: multiprocessing can't pickle it

def worker_numpy(arg: tuple[Any, int, int]):
    # VERY inefficient way of summing numpy array, just to illustrate
    # the potential possibility:
    data, start, end = arg
    sum(data[start:end])

worker = worker_numpy
chunks_num = os.cpu_count() * 2 + 1
chunk_size = int(len(data) / chunks_num)
WORKLOADS = [
    (mv, chunk_size * i, chunk_size * (i + 1))
    for i in range(chunks_num)
]

Yes, we chose the worst possible way to sum an np.array, but what matters is not the summing algorithm but the very ability to perform any computational operations: data sharing, process parallelization. We create several chunks for summing, ensure the data supports the PyBuffer protocol via memoryview, and send the chunks to be computed in different ways. Results:
Regular: Mean +- std dev: 109 ms +- 1 ms
Threading: Mean +- std dev: 112 ms +- 1 ms
Multiprocessing: DISQUALIFIED, my macbook exploded
Subinterpreters: Mean +- std dev: 58.4 ms +- 3.1 ms

What does this benchmark show? That we can fairly quickly share a buffer and perform some logic on it. And even without any optimizations and with all the overhead, it works twice as fast as baseline. Comparing with multiprocessing is especially amusing — it nearly blew up my laptop. I couldn't wait for it to finish.
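If you want to play with the same chunking idea without numpy, here's a stdlib-only sketch of mine (not from the article) using array and memoryview slices — views into the shared buffer, not copies — summed across a thread pool:

```python
import os
from array import array
from concurrent.futures import ThreadPoolExecutor

# A shared buffer of one million int32 values (stdlib stand-in for np.array).
data = array('i', range(1_000_000))
mv = memoryview(data)

def sum_chunk(bounds: tuple[int, int]) -> int:
    start, end = bounds
    # mv[start:end] is a view into the shared buffer, not a copy.
    return sum(mv[start:end])

chunks_num = (os.cpu_count() or 4) * 2
chunk_size = len(data) // chunks_num
workloads = [
    (i * chunk_size, len(data) if i == chunks_num - 1 else (i + 1) * chunk_size)
    for i in range(chunks_num)
]

with ThreadPoolExecutor() as executor:
    total = sum(executor.map(sum_chunk, workloads))

assert total == sum(range(1_000_000))
```

Under the regular GIL the threads here won't sum in parallel, of course — the point of the sketch is only the zero-copy chunking; swap the pool for InterpreterPoolExecutor on 3.14 to get the parallel version.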
Can subinterpreter performance be improved? Of course! We'll see soon!
Conclusion
That's the feature we got. There's much we didn't have time to discuss in this article. For example: the nuances of PEP acceptance, Channels, Queues, concurrent.futures.InterpreterPoolExecutor and its current issues. However, you can't squeeze everything into one article, even if you really want to. So stay tuned for the next articles!
Materials I used to write this article:
- My post about measuring subinterpreter performance
- My post about concurrent.interpreters.Queue
- Multiple Interpreters in the Stdlib: PEP-734
- Older PEP Multiple Interpreters in the Stdlib: PEP-554
- A Per-Interpreter GIL: PEP-684
- Multi-phase extension module initialization: PEP-489
- A ModuleSpec Type for the Import System: PEP-451
- Isolating modules in the standard library: PEP-687
- Module State Access from C Extension Methods: PEP-573
- Immortal Objects, Using a Fixed Refcount: PEP-683
- Documentation for concurrent.interpreters
- Documentation for InterpreterPoolExecutor
You can also check out an article by one of our chat members on this topic, with a deeper dive into implementation details.
A big thank you to everyone for your interest in Python's internals — reading an article like this is no easy feat! You're awesome!
If you want more deep dives:
- You can subscribe to my Telegram channel, where there's lots of this kind of content
- Watch deep Python videos on my channel
- Support my work on CPython core development, videos, and articles, if you want more quality technical content
See you next time in the guts of Python!