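Handling regressions¶

We don't cause regressions: this document describes what this "first rule of Linux kernel development" means in practice for developers. It complements "Reporting regressions", which covers the topic from a user's point of view; if you have never read that text, at least skim through it before continuing here.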

The important bits (aka "TL;DR")¶

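Ensure subscribers of the regressions mailing list (regressions@lists.linux.dev) quickly become aware of any new regression report:
- When receiving a mailed report that did not CC the list, immediately bring it into the loop by sending at least a brief "Reply-all" with the list CCed.
- Forward or bounce any reports submitted in bug trackers to the list.
Make the Linux kernel regression tracking bot "regzbot" track the issue (this is optional, but recommended):
- For mailed reports, check if the reporter included a line like #regzbot introduced: v5.13..v5.14-rc1. If not, send a reply (with the regressions list in CC) containing a paragraph like the following, which tells regzbot when the issue started to happen:
  #regzbot ^introduced: 1f2e3d4c5b6a
- When forwarding reports from a bug tracker to the regressions list (see above), include a paragraph like the following:
  #regzbot introduced: v5.13..v5.14-rc1
  #regzbot from: Some N. Ice Human <some.human@example.com>
  #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
When submitting fixes for regressions, add "Closes:" tags to the patch description pointing to all places where the issue was reported, as mandated by "Submitting patches: the essential guide to getting your code into the kernel" and Documentation/process/5.Posting.rst. If you are only fixing part of the issue that caused the regression, you may use "Link:" tags instead. regzbot currently makes no distinction between the two.
Try to fix regressions quickly once the culprit has been identified; fixes for most regressions should be merged within two weeks, but some need to be resolved within two or three days.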

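All the details on Linux kernel regressions relevant for developers¶

The important basics in more detail¶

What to do when receiving regression reports¶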

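Ensure the Linux kernel's regression tracker and the other subscribers of the regressions mailing list (regressions@lists.linux.dev) become aware of any newly reported regression:

When receiving a mailed report that did not CC the list, immediately bring it into the loop by sending at least a brief "Reply-all" with the list CCed; if a later reply to a reply drops the list again, try to make sure it gets CCed once more.

If a report submitted in a bug tracker hits your inbox, forward or bounce it to the list. Consider checking the list archives beforehand, in case the reporter already forwarded the report as instructed by "Reporting issues".

When doing either, consider making the Linux kernel regression tracking bot "regzbot" start tracking the issue immediately: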

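For mailed reports, check if the reporter included a "regzbot command" like #regzbot introduced: 1f2e3d4c5b6a. If not, send a reply (with the regressions list in CC) containing a paragraph like the following:
#regzbot ^introduced: v5.13..v5.14-rc1
This tells regzbot the version range in which the issue started to happen; you can also specify the range using commit-ids, or state a single commit-id if the reporter bisected the culprit.

Note the caret (^) before "introduced": it tells regzbot to treat the parent mail (the one you reply to) as the initial report for the regression you want to see tracked. That is important, as regzbot will later look out for patches with "Closes:" tags pointing to the report in the archives on lore.kernel.org.
When forwarding a regression that was reported to a bug tracker, include a paragraph with these regzbot commands:
#regzbot introduced: 1f2e3d4c5b6a
#regzbot from: Some N. Ice Human <some.human@example.com>
#regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
Regzbot will then automatically associate patches with the report if they contain "Closes:" tags pointing to your mail or the mentioned ticket.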
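
To check a reply or a forwarded report before sending it, the command paragraphs above can be picked apart with a few lines of Python. This is only a sketch: the helper name and the regular expression are assumptions made for illustration and are not regzbot's own parser, which may accept or reject other forms.

```python
import re

# Hypothetical helper, NOT regzbot's real parser: finds "#regzbot" commands
# that start a line, e.g. "#regzbot ^introduced: v5.13..v5.14-rc1".
REGZBOT_LINE = re.compile(
    r"^#regzbot[ \t]+(\^?)(\w[\w-]*):[ \t]*(.+?)[ \t]*$", re.MULTILINE
)

def find_regzbot_commands(body):
    """Return (command, refers_to_parent_mail, argument) for each command."""
    return [
        (m.group(2), m.group(1) == "^", m.group(3))
        for m in REGZBOT_LINE.finditer(body)
    ]

# A reply to a mailed report: the caret marks the parent mail as the report.
reply = """Thanks for the report, adding the regressions list.

#regzbot ^introduced: v5.13..v5.14-rc1
"""

# A forwarded bug tracker report: no caret, but "from" and "monitor" are set.
forward = """Forwarding a report from the bug tracker.

#regzbot introduced: 1f2e3d4c5b6a
#regzbot from: Some N. Ice Human <some.human@example.com>
#regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
"""

for command in find_regzbot_commands(reply) + find_regzbot_commands(forward):
    print(command)
```

The second tuple element records whether the caret was present, which is the detail that is easiest to forget when replying to a mailed report.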

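What's important when fixing regressions¶

You don't need to do anything special when submitting fixes for regressions; just remember to do what "Submitting patches: the essential guide to getting your code into the kernel", Documentation/process/5.Posting.rst, and "Everything you ever wanted to know about Linux -stable releases" already explain in more detail:

Point to all places where the issue was reported using "Closes:" tags:
Closes: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
If you are only fixing part of the issue, you may use "Link:" instead, as described in the first document mentioned above. regzbot currently treats both as equivalent and considers the linked reports resolved.
Add a "Fixes:" tag to specify the commit causing the regression.

If the culprit was merged in an earlier development cycle, explicitly mark the fix for backporting using the Cc: stable@vger.kernel.org tag.

All this is expected from you and important when it comes to regressions, as these tags are of great value for everyone (you included) who might look into the issue weeks, months, or even years later. These tags are also crucial for tools and scripts used by other kernel developers or Linux distributions; one of these tools is regzbot, which relies heavily on the "Closes:" tags to associate regression reports with the changes resolving them.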
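
As a rough illustration of the tags listed above, the following Python sketch scans a patch description and warns about missing ones. The function name, the warning texts, and the sample description (including its subject line and the quoted commit summary) are made up for this example; the kernel's own tooling performs different and more thorough checks.

```python
import re

# Hypothetical pre-submission check; only the tag names come from the text
# above, everything else is an assumption for illustration.
def check_regression_fix_tags(description, needs_backport=False):
    """Return a list of warnings about tags missing from a patch description."""
    def has(pattern):
        flags = re.MULTILINE | re.IGNORECASE
        return re.search(pattern, description, flags) is not None

    warnings = []
    if not has(r"^(Closes|Link):[ \t]*https?://\S+"):
        warnings.append("no Closes: or Link: tag pointing to a report")
    if not has(r"^Fixes:[ \t]*[0-9a-f]{12,40}\b"):
        warnings.append("no Fixes: tag naming the culprit commit")
    if needs_backport and not has(r"^Cc:[ \t]*<?stable@vger\.kernel\.org>?"):
        warnings.append("no Cc: stable@vger.kernel.org tag")
    return warnings

# Made-up patch description that carries all three kinds of tags.
sample = """subsys: fix frobnication regression

Closes: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
Fixes: 1f2e3d4c5b6a ("subsys: rework frobnication")
Cc: stable@vger.kernel.org
"""

print(check_regression_fix_tags(sample, needs_backport=True))
```

Pass needs_backport=True when the culprit was merged in an earlier development cycle, since that is the case in which the text asks for the stable tag.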

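Expectations and best practices for fixing regressions¶

As a Linux kernel developer, you are expected to do your best to prevent situations where a regression caused by a recent change of yours leaves users only these options:

Run a kernel with a regression that impacts usage.

Switch to an older or newer kernel series.

Continue running an outdated and thus potentially insecure kernel for more than three weeks after the regression's culprit was identified. Ideally it should be less than two weeks. And it ought to be just a few days if the issue is severe or affects many users, either in general or in prevalent environments.

How to achieve that in practice depends on various factors. Use the following rules of thumb as a guide.

In general:

Prioritize work on regressions over all other Linux kernel work, unless the latter concerns a severe issue (e.g. a serious security vulnerability, data loss, or hardware damage).

Expedite fixing regressions that recently made it into a proper mainline, stable, or longterm release (either directly or via backport).

Do not treat regressions from the current cycle as something that can wait until the end of the cycle, as the issue might discourage or prevent users and CI systems from testing mainline now or in general.

Work with the required care to avoid additional or bigger damage, even if resolving the issue then takes longer than outlined below.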

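On timing once the culprit of a regression is known:

Aim to mainline a fix within two or three days if the issue is severe or bothers many users, either in general or in prevalent conditions such as a particular hardware environment, distribution, or stable/longterm series.

Aim to mainline a fix by the Sunday after next if the culprit made it into a recent mainline, stable, or longterm release (either directly or via backport); if the culprit became known early in a week and is simple to resolve, try to mainline the fix within the same week.

For other regressions, aim to mainline fixes before the last Sunday within the next three weeks. One or two Sundays later are acceptable if the regression is something people can easily live with for a while, such as a mild performance regression.

Delaying the mainlining of regression fixes until the next merge window is strongly discouraged, except when the fix is extraordinarily risky or when the culprit was mainlined more than a year ago.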
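
The Sunday-based targets above can be turned into concrete dates with a small calculation. The sketch below rests on assumptions that the text does not spell out: "next Sunday" is taken as the first Sunday strictly after the day the culprit became known, and "within the next three weeks" as the 21 days following that day. The function names are illustrative.

```python
from datetime import date, timedelta

SUNDAY = 6  # date.weekday() counts Monday as 0 and Sunday as 6

def next_sunday(day):
    """First Sunday strictly after `day` (assumed reading of 'next Sunday')."""
    return day + timedelta(days=(SUNDAY - day.weekday()) % 7 or 7)

def sunday_after_next(day):
    """Target for culprits that reached a recent release."""
    return next_sunday(day) + timedelta(days=7)

def last_sunday_within_three_weeks(day):
    """Target for other regressions: last Sunday on or before day + 21 days."""
    end = day + timedelta(days=21)
    return end - timedelta(days=(end.weekday() - SUNDAY) % 7)

culprit_known = date(2024, 1, 3)  # a Wednesday, chosen only as an example
print("recent release:", sunday_after_next(culprit_known))
print("other regressions:", last_sunday_within_three_weeks(culprit_known))
```

These dates are targets for the merge into mainline, so the time needed for testing and review has to be subtracted when planning the fix itself.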

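On procedure:

Always consider reverting the culprit, as it's often the quickest and least dangerous way to fix a regression. Don't worry about mainlining a fixed variant later: that should be straightforward, as most of the code went through review once already.

Try to resolve any regressions introduced in mainline during the past twelve months before the current development cycle ends: Linus wants such regressions to be handled like those from the current cycle, unless fixing them bears unusual risks.

Consider CCing Linus on discussions or patch review if a regression looks tricky. Do the same in urgent or precarious cases, especially if the subsystem maintainer might be unavailable. Also CC the stable team when you know such a regression made it into a mainline, stable, or longterm release.

For urgent regressions, consider asking Linus to pick up the fix straight from the mailing list: he is totally fine with that for uncontroversial fixes. Ideally, though, such requests should be made in agreement with the subsystem maintainers or come directly from them.

If you are unsure whether a fix is worth the risk of applying it just days before a new mainline release, send Linus a mail with the usual lists and people in CC; in it, summarize the situation and ask him to consider picking up the fix straight from the list. He can then make the call himself and, when needed, even postpone the release. Such requests, too, should ideally be made in agreement with the subsystem maintainers or come directly from them.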

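Regarding stable and longterm kernels:

You are free to leave regressions to the stable team if they never occurred in mainline or were already fixed there.

If a regression made it into a proper mainline release during the past twelve months, make sure to tag the fix with "Cc: stable@vger.kernel.org", as a "Fixes:" tag alone does not guarantee a backport. Add the same tag if you know the culprit was backported to stable or longterm kernels.

When receiving reports about regressions in recent stable or longterm kernel series, evaluate at least briefly whether the issue might happen in current mainline as well, and if that seems likely, take hold of the report. If in doubt, ask the reporter to check mainline.

Whenever you want to swiftly resolve a regression that recently also made it into a proper mainline, stable, or longterm release, fix it quickly in mainline; when appropriate, involve Linus to fast-track the fix (see above). That's because the stable team normally neither reverts nor fixes any changes that cause the same problems in mainline.

For urgent regression fixes, you might want to ensure prompt backporting by dropping the stable team a note once the fix was mainlined; this is especially advisable during merge windows and shortly thereafter, as the fix might otherwise land at the end of a huge patch queue.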

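On patch flow:

Developers, when trying to meet the time periods mentioned above, remember to account for the time it takes to get fixes tested, reviewed, and merged by Linus, ideally with them being in linux-next at least briefly. Hence, if a fix is urgent, make that obvious so others handle it appropriately.

Reviewers, you are kindly asked to review regression fixes promptly, to help developers meet the time periods mentioned above.

Subsystem maintainers, you are likewise encouraged to expedite the handling of regression fixes. Thus evaluate whether skipping linux-next is an option for the particular fix. Also consider sending git pull requests more often than usual when needed. And try to avoid holding on to regression fixes over weekends, especially when the fix is marked for backporting.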

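More aspects regarding regressions developers should be aware of¶

How to deal with changes where a risk of regression is known¶

Evaluate how big the risk of regressions is, for example by performing a code search in Linux distributions and Git repositories. Also consider asking other developers or projects likely to be affected to evaluate or even test the proposed change; if problems surface, maybe a solution acceptable for all can be found.

If the risk of regressions seems relatively small in the end, go ahead with the change, but let all involved parties know about the risk. Hence, make sure your patch description states it clearly. Once the change is merged, tell the Linux kernel's regression tracker and the regressions mailing list about the risk, so everyone has the change on the radar in case reports trickle in. Depending on the risk, you might also want to ask the subsystem maintainer to mention the issue in his mainline pull request.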

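What else is there to know about regressions?¶

Check out "Reporting regressions"; it covers many other aspects you might want to be aware of:

the purpose of the "no regressions" rule

what issues actually qualify as a regression

who is in charge of finding the root cause of a regression

how to handle tricky situations, e.g. when a regression is caused by a security fix or when fixing a regression might cause another one

Whom to ask for advice when it comes to regressions¶

Send a mail to the regressions mailing list (regressions@lists.linux.dev) while CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the issue is better dealt with in private, feel free to omit the list.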

Quotes from Linus about regressions¶

Find below a few real-life examples of how Linus Torvalds expects regressions to be handled:

From 2017-10-26 (1/2)

If you break existing user space setups THAT IS A REGRESSION.

It's not ok to say "but we'll fix the user space setup".

Really. NOT OK.

[...]

The first rule is:

 - we don't cause regressions

and the corollary is that when regressions *do* occur, we admit to
them and fix them, instead of blaming user space.

The fact that you have apparently been denying the regression now for
three weeks means that I will revert, and I will stop pulling apparmor
requests until the people involved understand how kernel development
is done.

From 2017-10-26 (2/2)

People should basically always feel like they can update their kernel
and simply not have to worry about it.

I refuse to introduce "you can only update the kernel if you also
update that other program" kind of limitations. If the kernel used to
work for you, the rule is that it continues to work for you.

There have been exceptions, but they are few and far between, and they
generally have some major and fundamental reasons for having happened,
that were basically entirely unavoidable, and people _tried_hard_ to
avoid them. Maybe we can't practically support the hardware any more
after it is decades old and nobody uses it with modern kernels any
more. Maybe there's a serious security issue with how we did things,
and people actually depended on that fundamentally broken model. Maybe
there was some fundamental other breakage that just _had_ to have a
flag day for very core and fundamental reasons.

And notice that this is very much about *breaking* peoples environments.

Behavioral changes happen, and maybe we don't even support some
feature any more. There's a number of fields in /proc/<pid>/stat that
are printed out as zeroes, simply because they don't even *exist* in
the kernel any more, or because showing them was a mistake (typically
an information leak). But the numbers got replaced by zeroes, so that
the code that used to parse the fields still works. The user might not
see everything they used to see, and so behavior is clearly different,
but things still _work_, even if they might no longer show sensitive
(or no longer relevant) information.

But if something actually breaks, then the change must get fixed or
reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
your user space then". It was a kernel change that exposed the
problem, it needs to be the kernel that corrects for it, because we
have a "upgrade in place" model. We don't have a "upgrade with new
user space".

And I seriously will refuse to take code from people who do not
understand and honor this very simple rule.

This rule is also not going to change.

And yes, I realize that the kernel is "special" in this respect. I'm
proud of it.

I have seen, and can point to, lots of projects that go "We need to
break that use case in order to make progress" or "you relied on
undocumented behavior, it sucks to be you" or "there's a better way to
do what you want to do, and you have to change to that new better
way", and I simply don't think that's acceptable outside of very early
alpha releases that have experimental users that know what they signed
up for. The kernel hasn't been in that situation for the last two
decades.

We do API breakage _inside_ the kernel all the time. We will fix
internal problems by saying "you now need to do XYZ", but then it's
about internal kernel API's, and the people who do that then also
obviously have to fix up all the in-kernel users of that API. Nobody
can say "I now broke the API you used, and now _you_ need to fix it
up". Whoever broke something gets to fix it too.

And we simply do not break user space.

From 2020-05-21

The rules about regressions have never been about any kind of
documented behavior, or where the code lives.

The rules about regressions are always about "breaks user workflow".

Users are literally the _only_ thing that matters.

No amount of "you shouldn't have used this" or "that behavior was
undefined, it's your own fault your app broke" or "that used to work
simply because of a kernel bug" is at all relevant.

Now, reality is never entirely black-and-white. So we've had things
like "serious security issue" etc that just forces us to make changes
that may break user space. But even then the rule is that we don't
really have other options that would allow things to continue.

And obviously, if users take years to even notice that something
broke, or if we have sane ways to work around the breakage that
doesn't make for too much trouble for users (ie "ok, there are a
handful of users, and they can use a kernel command line to work
around it" kind of things) we've also been a bit less strict.

But no, "that was documented to be broken" (whether it's because the
code was in staging or because the man-page said something else) is
irrelevant. If staging code is so useful that people end up using it,
that means that it's basically regular kernel code with a flag saying
"please clean this up".

The other side of the coin is that people who talk about "API
stability" are entirely wrong. API's don't matter either. You can make
any changes to an API you like - as long as nobody notices.

Again, the regression rule is not about documentation, not about
API's, and not about the phase of the moon.

It's entirely about "we caused problems for user space that used to work".

From 2017-11-05

And our regression rule has never been "behavior doesn't change".
That would mean that we could never make any changes at all.

For example, we do things like add new error handling etc all the
time, which we then sometimes even add tests for in our kselftest
directory.

So clearly behavior changes all the time and we don't consider that a
regression per se.

The rule for a regression for the kernel is that some real user
workflow breaks. Not some test. Not a "look, I used to be able to do
X, now I can't".

From 2018-08-03

YOU ARE MISSING THE #1 KERNEL RULE.

We do not regress, and we do not regress exactly because your are 100% wrong.

And the reason you state for your opinion is in fact exactly *WHY* you
are wrong.

Your "good reasons" are pure and utter garbage.

The whole point of "we do not regress" is so that people can upgrade
the kernel and never have to worry about it.

> Kernel had a bug which has been fixed

That is *ENTIRELY* immaterial.

Guys, whether something was buggy or not DOES NOT MATTER.

Why?

Bugs happen. That's a fact of life. Arguing that "we had to break
something because we were fixing a bug" is completely insane. We fix
tens of bugs every single day, thinking that "fixing a bug" means that
we can break something is simply NOT TRUE.

So bugs simply aren't even relevant to the discussion. They happen,
they get found, they get fixed, and it has nothing to do with "we
break users".

Because the only thing that matters IS THE USER.

How hard is that to understand?

Anybody who uses "but it was buggy" as an argument is entirely missing
the point. As far as the USER was concerned, it wasn't buggy - it
worked for him/her.

Maybe it worked *because* the user had taken the bug into account,
maybe it worked because the user didn't notice - again, it doesn't
matter. It worked for the user.

Breaking a user workflow for a "bug" is absolutely the WORST reason
for breakage you can imagine.

It's basically saying "I took something that worked, and I broke it,
but now it's better". Do you not see how f*cking insane that statement
is?

And without users, your program is not a program, it's a pointless
piece of code that you might as well throw away.

Seriously. This is *why* the #1 rule for kernel development is "we
don't break users". Because "I fixed a bug" is absolutely NOT AN
ARGUMENT if that bug fix broke a user setup. You actually introduced a
MUCH BIGGER bug by "fixing" something that the user clearly didn't
even care about.

And dammit, we upgrade the kernel ALL THE TIME without upgrading any
other programs at all. It is absolutely required, because flag-days
and dependencies are horribly bad.

And it is also required simply because I as a kernel developer do not
upgrade random other tools that I don't even care about as I develop
the kernel, and I want any of my users to feel safe doing the same
time.

So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
without upgrading some other random binary, then we have a problem.

From 2021-06-05

THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.

Honestly, security people need to understand that "not working" is not
a success case of security. It's a failure case.

Yes, "not working" may be secure. But security in that case is *pointless*.

From 2011-05-06 (1/3)

Binary compatibility is more important.

And if binaries don't use the interface to parse the format (or just
parse it wrongly - see the fairly recent example of adding uuid's to
/proc/self/mountinfo), then it's a regression.

And regressions get reverted, unless there are security issues or
similar that makes us go "Oh Gods, we really have to break things".

I don't understand why this simple logic is so hard for some kernel
developers to understand. Reality matters. Your personal wishes matter
NOT AT ALL.

If you made an interface that can be used without parsing the
interface description, then we're stuck with the interface. Theory
simply doesn't matter.

You could help fix the tools, and try to avoid the compatibility
issues that way. There aren't that many of them.

From 2011-05-06 (2/3)

it's clearly NOT an internal tracepoint. By definition. It's being
used by powertop.

From 2011-05-06 (3/3)

We have programs that use that ABI and thus it's a regression if they break.

From 2012-07-06

> Now this got me wondering if Debian _unstable_ actually qualifies as a
> standard distro userspace.

Oh, if the kernel breaks some standard user space, that counts. Tons
of people run Debian unstable

From 2019-09-15

One _particularly_ last-minute revert is the top-most commit (ignoring
the version change itself) done just before the release, and while
it's very annoying, it's perhaps also instructive.

What's instructive about it is that I reverted a commit that wasn't
actually buggy. In fact, it was doing exactly what it set out to do,
and did it very well. In fact it did it _so_ well that the much
improved IO patterns it caused then ended up revealing a user-visible
regression due to a real bug in a completely unrelated area.

The actual details of that regression are not the reason I point that
revert out as instructive, though. It's more that it's an instructive
example of what counts as a regression, and what the whole "no
regressions" kernel rule means. The reverted commit didn't change any
API's, and it didn't introduce any new bugs. But it ended up exposing
another problem, and as such caused a kernel upgrade to fail for a
user. So it got reverted.

The point here being that we revert based on user-reported _behavior_,
not based on some "it changes the ABI" or "it caused a bug" concept.
The problem was really pre-existing, and it just didn't happen to
trigger before. The better IO patterns introduced by the change just
happened to expose an old bug, and people had grown to depend on the
previously benign behavior of that old issue.

And never fear, we'll re-introduce the fix that improved on the IO
patterns once we've decided just how to handle the fact that we had a
bad interaction with an interface that people had then just happened
to rely on incidental behavior for before. It's just that we'll have
to hash through how to do that (there are no less than three different
patches by three different developers being discussed, and there might
be more coming...). In the meantime, I reverted the thing that exposed
the problem to users for this release, even if I hope it will be
re-introduced (perhaps even backported as a stable patch) once we have
consensus about the issue it exposed.

Take-away from the whole thing: it's not about whether you change the
kernel-userspace ABI, or fix a bug, or about whether the old code
"should never have worked in the first place". It's about whether
something breaks existing users' workflow.

Anyway, that was my little aside on the whole regression thing.  Since
it's that "first rule of kernel programming", I felt it is perhaps
worth just bringing it up every once in a while
