Linux: the long road to lazy preemption


Editor's Context

This article is an English adaptation with additional editorial framing for an international audience.

  • Terminology and structure were localized for clarity.
  • Examples were rewritten for practical readability.
  • Technical claims were preserved with source attribution.

Source: original publication

Currently, the CPU scheduler in the Linux kernel provides several preemption modes. These modes offer a range of tradeoffs between response time and system throughput. Back in September 2023, a discussion about scheduler work took an interesting turn, out of which the concept of "lazy preemption" emerged; it promises to simplify the kernel's scheduling decisions while improving results. The work proceeded quietly for a while, until lazy preemption was reimplemented by Peter Zijlstra in the form of this patch series. While the concept itself appears to work well, there is still a lot to be done.

❯ A short overview

Modern kernels have four different modes that control how and when one task can be preempted in favor of another. In the simplest mode, PREEMPT_NONE, preemption is allowed only when the current task has exhausted its allotted time slice. PREEMPT_VOLUNTARY adds many points throughout the kernel where preemption can be performed if needed. In PREEMPT_FULL, preemption is possible at almost any point except where the kernel explicitly disables it, for example while a spinlock is held. Finally, in PREEMPT_RT, preemption is prioritized over almost everything else; even most code that runs under spinlocks remains preemptible.
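The four modes above can be summarized as a single decision rule. The following user-space sketch models that rule; it is an illustration of the description, not actual kernel code, and all names (`may_preempt`, the enum and parameters) are invented for this example:

```c
/* Simplified model of when each preemption mode allows the running
 * task to be preempted.  Not kernel code; names are invented. */
#include <stdbool.h>

enum preempt_mode { MODE_NONE, MODE_VOLUNTARY, MODE_FULL, MODE_RT };

/* slice_expired:       the task has used up its time slice
 * at_voluntary_point:  execution reached an explicit voluntary point
 * spinlock_held:       a plain spinlock is currently held */
bool may_preempt(enum preempt_mode mode, bool slice_expired,
                 bool at_voluntary_point, bool spinlock_held)
{
    switch (mode) {
    case MODE_NONE:
        return slice_expired;                  /* only at slice exhaustion */
    case MODE_VOLUNTARY:
        return slice_expired || at_voluntary_point;
    case MODE_FULL:
        return !spinlock_held;                 /* almost anywhere */
    case MODE_RT:
        return true;     /* even most spinlocked code is preemptible */
    }
    return false;
}
```

The model deliberately ignores many real-world details (preempt counters, IRQ state), but it captures the progression: each mode strictly widens the set of points where preemption may occur.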

The more aggressive the preemption level, the faster the system can respond to events, be it the movement of a mouse or a signal from a nuclear reactor warning that a core meltdown is imminent. A quick response is generally desirable, but a relatively aggressive preemption setting can hurt the overall throughput of the system; workloads with long-running, resource-intensive tasks are better served by being interfered with as little as possible. Frequent preemption can also increase lock contention. That is why the different modes exist: the optimal preemption behavior varies from one workload to the next.

Most distributions ship kernels built in the pseudo-mode PREEMPT_DYNAMIC, which allows any of the first three modes to be selected at boot time, with PREEMPT_VOLUNTARY as the default. On systems with debugfs mounted, the current mode can be read from /sys/kernel/debug/sched/preempt.
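On typical kernels that debugfs file lists all available modes on one line with the current one in parentheses, e.g. `none voluntary (full)`. Assuming that format, a small helper can extract the active mode; the function name is invented for this sketch:

```c
/* Extract the current preemption mode from a line like
 * "none voluntary (full)" as read from
 * /sys/kernel/debug/sched/preempt (format assumed from common kernels). */
#include <stddef.h>
#include <string.h>

/* Copies the word in parentheses into out.
 * Returns 0 on success, -1 if no parenthesized word fits. */
int current_preempt_mode(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '(');
    const char *close = open ? strchr(open, ')') : NULL;

    if (!open || !close || (size_t)(close - open - 1) >= outlen)
        return -1;
    memcpy(out, open + 1, (size_t)(close - open - 1));
    out[close - open - 1] = '\0';
    return 0;
}
```

On a live system one would read the line with `fopen`/`fgets` first; reading requires root, since debugfs is normally mounted with restricted permissions.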

The PREEMPT_NONE and PREEMPT_VOLUNTARY modes do not allow arbitrary preemption of code running in the kernel, and this approach sometimes leads to excessive latencies, even on systems where minimal latency is not a priority. The problem shows up in parts of the kernel where a lot of work can be done at once; left unchecked, that work could disrupt scheduling across the whole system. To keep that from happening, long-running loops are sprinkled with calls to cond_resched(), each of which serves as an extra voluntary preemption point that remains active even under PREEMPT_NONE. There are hundreds of such calls in the kernel.
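The pattern looks roughly like the following user-space sketch. Here `cond_resched()` is a stub that merely counts how often the loop would have yielded; in the real kernel it checks TIF_NEED_RESCHED and calls into the scheduler. The loop body and helper names are invented for illustration:

```c
/* User-space sketch of a long-running kernel loop with voluntary
 * preemption points.  cond_resched() here is a counting stub, not
 * the real kernel function. */
#include <stdbool.h>

bool need_resched_flag;   /* stands in for TIF_NEED_RESCHED */
int yields;               /* how many times we "gave up" the CPU */

void cond_resched(void)
{
    if (need_resched_flag) {       /* someone else wants the CPU... */
        yields++;                  /* ...so schedule (here: just count) */
        need_resched_flag = false;
    }
}

/* A "long-running" work loop, seasoned with a voluntary point
 * on every iteration, as many kernel loops are. */
int process_items(int nitems)
{
    int done = 0;

    for (int i = 0; i < nitems; i++) {
        done++;                    /* the actual unit of work */
        if (i == nitems / 2)       /* simulate a wakeup mid-way */
            need_resched_flag = true;
        cond_resched();            /* voluntary preemption point */
    }
    return done;
}
```

The point of the sketch is the placement burden: every such loop needs a developer to remember the call, which is exactly the maintenance problem discussed next.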

This approach has problems of its own. A cond_resched() call is a kind of heuristic, and it works only at the points where a developer deliberately placed one. Some of the calls could surely be dropped, while there are bound to be other places in the kernel where a cond_resched() call would be useful but is missing. Using cond_resched(), we are essentially taking a decision that belongs solely to the scheduler and spreading it across the entire kernel. It is a small crutch that usually works, but the problem can be solved better.

❯ How to do better

Tracking whether a given task can be preempted at any particular moment turns out to be surprisingly hard; several variables have to be considered at once (these problems are discussed in more detail in this article and this one). One of those variables is the TIF_NEED_RESCHED flag, which signals that a higher-priority task is waiting for the CPU. The flag can be set on whatever task is currently running by any number of events, for example when such a higher-priority task needs to be woken up. In the absence of this flag, the kernel has no reason to consider preempting the current task.

There are various points at which the kernel can notice this flag and preempt the currently running task. One is the scheduler's timer tick. Another is the return to user space after a system call. A third is the exit from an interrupt handler, but this last check is performed only in configurations where kernel preemption is enabled, which is to say in kernels running in PREEMPT_FULL mode. A call to cond_resched() also checks this flag and, if it is set, calls into the scheduler to yield the CPU to another task.

At their core, the lazy-preemption patches are simple. They add another flag, TIF_NEED_RESCHED_LAZY, which signals that rescheduling is needed, but not necessarily right away. In the lazy preemption mode (PREEMPT_LAZY), most events will set the new flag rather than TIF_NEED_RESCHED. On paths like the return to user space from the kernel, either flag will cause the scheduler to be invoked. But the in-kernel preemption points and the interrupt-return path check only TIF_NEED_RESCHED.

This change means that, in lazy preemption mode, most kernel events will not cause the current task to be preempted. Sooner or later, though, that task has to go. To make that happen, the kernel's timer-tick handler checks whether TIF_NEED_RESCHED_LAZY is set; if so, it sets TIF_NEED_RESCHED as well, which can lead to the current task being preempted. Most tasks will run for all, or nearly all, of their allotted time slice and then yield the CPU without being forced out, which makes good throughput possible.
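That promotion step can be sketched as a tiny pure function over the flag word. The helper name is invented; in the real kernel this logic lives in the scheduler's tick handling:

```c
/* Sketch of the timer-tick promotion described above: a task that is
 * still running when the tick arrives has its lazy flag upgraded to
 * the "real" one.  Invented helper; not the kernel's actual code. */
#define TIF_NEED_RESCHED       (1u << 0)
#define TIF_NEED_RESCHED_LAZY  (1u << 1)

unsigned int tick_promote(unsigned int tif_flags)
{
    if (tif_flags & TIF_NEED_RESCHED_LAZY)
        tif_flags |= TIF_NEED_RESCHED;  /* now preemption will happen */
    return tif_flags;
}
```

The time slice thus acts as an upper bound on how long a lazy request can be deferred: at the latest, the next tick after the slice expires converts it into a hard preemption.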

With these changes in place, lazy preemption mode can, like PREEMPT_FULL, (almost) dispense with disabling preemption: preemption can occur whenever the scheduler decides that the current task has to go. Long-running kernel code can thus be preempted at any point where nothing else prevents it. And preemption can still happen very quickly in the cases where it is really needed. If, for example, a realtime task becomes runnable, perhaps as the result of handling an interrupt, TIF_NEED_RESCHED will be set and preemption will occur almost immediately; there is no need to wait for the timer tick.

But preemption will not happen if only TIF_NEED_RESCHED_LAZY is set, and that is exactly what will happen most of the time. A kernel in PREEMPT_LAZY mode is therefore far less likely to preempt a running task than one in PREEMPT_FULL.

❯ Finally, we get rid of cond_resched() 

The goal of this work is a scheduler that, aside from the realtime mode, offers only two others: PREEMPT_LAZY and PREEMPT_FULL. Lazy mode falls between PREEMPT_NONE and PREEMPT_VOLUNTARY, so it replaces them both; but it will not need the voluntary preemption points that were added for the two modes it replaces. Since preemption can now occur almost anywhere, there is no need to enable it at specific spots.

The cond_resched() calls remain for now, though. They are needed at least as long as the PREEMPT_NONE and PREEMPT_VOLUNTARY modes exist, and they also serve as a safety net against any new problems that might surface while the lazy preemption mode stabilizes.

With this patch set applied, cond_resched() checks only TIF_NEED_RESCHED, so preemption will often be deferred in exactly those situations where, under PREEMPT_VOLUNTARY or PREEMPT_NONE, it would have happened immediately via cond_resched(). Steve Rostedt questioned whether this change is really necessary and whether cond_resched() should retain its old semantics, at least in the PREEMPT_VOLUNTARY case. Even if PREEMPT_VOLUNTARY is already slated for removal, he argued, keeping the old behavior would ease the transition.

Thomas Gleixner replied that setting only TIF_NEED_RESCHED is the right thing to do here, since it will help in finally getting rid of the cond_resched() calls:

In this case we have to look at them closely and decide whether they need to be extended to include the lazy bit. Those that do not can simply be removed once LAZY is in effect, because preemption will then happen at the next preemption point as soon as the code reaches a "non-lazy" execution region.

He also estimated that "less than 5%" of the cond_resched() calls will have to set TIF_NEED_RESCHED_LAZY, and that the function will therefore have to remain even after the transition to PREEMPT_LAZY.

Until then, hundreds of cond_resched() calls remain to be reviewed, with most of them to be removed. There are plenty of other details to resolve as well; some of them are addressed in this patch set from Ankur Arora. Comprehensive performance testing is also needed. Mike Galbraith was among the first to take that on, showing that throughput with lazy preemption falls only slightly short of that with PREEMPT_VOLUNTARY.

That adds up to a lot of work, but the end result should be a simpler kernel with more predictable latency and without scheduler-related calls scattered throughout the code. It is an appealing goal, but there is still a long way to go.


